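I scraped a table from a PDF, and all of its contents ended up in a single element of a data frame. I managed to split everything into separate columns, but R gets confused about the column names. The first column is "State" and should contain all the state names, but after separating it is blank. The second column is "State Drug Formulary", and after separating it wrongly contains the state names. A lot of other information is missing too. Is there any way to fix this?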

For simplicity, I renamed the column to "x".

library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)
url4 = "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session:42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"

out <- pdf_text(url4)
df=as.data.frame(out[[93]],header=F)
df = df %>%
  rename(x = `out[[93]]`) %>% 
    mutate(x=strsplit(x, "\n")) %>%
    unnest(x)
df=df[-c(1:2),]
df2=df %>% separate(x, c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

To see what the table should look like, open the source file and go to page 82 of the original document.

I also tried this, which keeps the column names but drops the data:

df3 = df %>% separate(x, sep = " ", into = c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

Answer:

Page 82 also contains other content, such as the heading "21. Drug Product Selection Laws".

It is better to strip that out first, like this:

dummy <- strsplit(df$`out[[93]]`, '\\n\n')

This splits the page into four parts, and the table you are looking for is the second of those parts.
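To make the blank-line split concrete, here is a minimal sketch on a made-up page string. The `page` text is invented for illustration and is not the real `pdf_text()` output, so the number of blocks differs from the four parts of the actual page.

```r
# A made-up stand-in for one page of pdf_text() output:
# blocks of text separated by blank lines (assumption, not the real page).
page <- "Survey of Pharmacy Law\n\n  State    Formulary\n  Alabama  None\n  Alaska   None\n\n* footnote text"

# Split wherever two newlines occur in a row, i.e. on blank lines.
blocks <- strsplit(page, "\n\n", fixed = TRUE)[[1]]

# In this toy page the table is the second block.
table_block <- blocks[2]

# Break the table block into one string per row.
rows <- strsplit(table_block, "\n", fixed = TRUE)[[1]]
rows
```

On a real page it is worth printing each block first to check which one holds the table, since the position depends on how the PDF text was extracted.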

df2 <- df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest(x) %>%
  .[-c(1:3), ]

Now df2 holds the rows of the table. Next, split each row on runs of two or more spaces:

df2 %>% separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

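gives the result. The column 'a' is a dummy: every row starts with whitespace, so separate produces an empty first value. Here is part of the result: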

  State   `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
   <chr>   <chr>            <chr>            <chr>            <chr>           
 1 Alabama None             Yes              P, BBB           A               
 2 Alaska  None             No               P                B               
 3 Arizona None             No               P                I               
 4 Arkans… None             No               P                B               
 5 Califo… None             No               P                EE              
 6 Colora… None             No               P                J               
 7 Connec… None             No               P                E, F            
 8 Delawa… None             No               P                E               
 9 Distri… Positive         No               P                B               
10 Florida Negative L       No               M                B
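The reason the original attempt loses data is that `separate()` by default splits on any run of non-alphanumeric characters, so multi-word values such as "New Hampshire" or "P, BBB" are broken apart. The sketch below contrasts the two behaviours on made-up rows; the `toy` data and the shortened column names are assumptions for illustration, not the real table.

```r
library(tidyr)

# Made-up rows in the same shape as the PDF text: leading spaces,
# columns separated by runs of two or more spaces.
toy <- data.frame(
  x = c("  New Hampshire   None        P, BBB",
        "  Florida         Negative L  M"),
  stringsAsFactors = FALSE
)

# Default sep: splits on every run of non-alphanumeric characters,
# so "New Hampshire" and "P, BBB" are torn apart and values shift columns.
bad <- separate(toy, x, c("a", "State", "Formulary", "Type"),
                extra = "drop", fill = "right")

# Splitting only on 2+ whitespace characters keeps multi-word values whole.
good <- separate(toy, x, c("a", "State", "Formulary", "Type"),
                 sep = "\\s{2,}")
good$a <- NULL   # drop the empty column created by the leading spaces

good
```

This only works when columns are always separated by at least two spaces and no cell contains a double space itself, so it is worth checking a few rows by eye.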

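The whole thing as a single pipeline, starting from `df` (the one-column data frame created with as.data.frame, before any renaming):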

df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest(x) %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

You can also try this, which starts directly from the PDF text:

# Name the column x directly. as.data.frame() would name the column after
# the expression passed to it, so rename(x = `out[[93]]`) would not find it here.
tibble(x = pdf_text(url4)[[93]]) %>%
  mutate(x = stringr::str_split(x, '\\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest(x) %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)
