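I scraped a table from a PDF, and all of its contents ended up in a single element of a data frame. I managed to split everything into separate columns, but R gets the column names wrong. The first column is "State" and should contain all the state names, but it is blank after separating. The second column is "State Drug Formulary", and after separating it incorrectly contains the state names. A lot of other information is missing as well. Are there any possible fixes?
For simplicity, I renamed the column to "x".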
library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)

url4 <- "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session:42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"
out <- pdf_text(url4)
df <- as.data.frame(out[[93]])
df <- df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = strsplit(x, "\n")) %>%
  unnest(x)
df <- df[-c(1:2), ]
df2 <- df %>%
  separate(x, c("State", "State Drug Formulary", "Two-line Rx Format",
                "Permissive or Mandatory*", "How to Prevent Substitution",
                "Cost Savings Pass-on", "Patient Consent**"))
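This is what the table should look like: if you open the source file, see page 82 of the original document.
I also tried this. It keeps the column names but drops the data: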
df3 <- df %>%
  separate(x, sep = " ",
           into = c("State", "State Drug Formulary", "Two-line Rx Format",
                    "Permissive or Mandatory*", "How to Prevent Substitution",
                    "Cost Savings Pass-on", "Patient Consent**"))
Answer:
Page 82 contains other content besides the table, such as the heading "21. Drug Product Selection Laws".
It is best to remove that first, for example like this:
# df here is as.data.frame(out[[93]]), i.e. before the column is renamed
dummy <- strsplit(df$`out[[93]]`, "\n\n")
This splits the page into four parts, and the table you are looking for is the second element of that list.
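To make the blank-line split concrete, here is a minimal base R sketch. The `page` string is a made-up stand-in for one page of `pdf_text()` output, not the real page, and `blocks` and `table_lines` are names chosen for illustration. A blank line is just two consecutive newlines, which is what the split pattern matches.

```r
# Hypothetical stand-in for one page of pdf_text() output (not the real page).
page <- paste(
  "                 21. Drug Product Selection Laws",
  "",
  "   State       State Drug Formulary    Two-line Rx Format",
  "   Alabama     None                    Yes",
  "   Alaska      None                    No",
  "",
  "   * footnote text",
  sep = "\n"
)

# Split the page wherever a blank line (two consecutive newlines) occurs.
blocks <- strsplit(page, "\n\n", fixed = TRUE)[[1]]
print(length(blocks))

# In this toy page the table is the second block; split it into lines.
table_lines <- strsplit(blocks[2], "\n", fixed = TRUE)[[1]]
print(table_lines)
```

On the toy page the heading and the footnote land in their own blocks, so picking one block isolates the table lines. Which block holds the table on a real page has to be checked by inspecting the split result.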
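df2 <- df %>%
  rename(x = `out[[93]]`) %>%
  # keep only the second blank-line-separated block, which holds the table
  mutate(x = stringr::str_split(x, "\n\n", simplify = TRUE)[2]) %>%
  # turn the block into one row per line of text
  mutate(x = strsplit(x, "\n")) %>%
  unnest(x) %>%
  # drop the first three lines, which are not data rows
  .[-c(1:3), ]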
Now df2 holds the table contents. Split each line on runs of two or more whitespace characters:
df2 %>%
  separate(x, c("a", "State", "State Drug Formulary", "Two-line Rx Format",
                "Permissive or Mandatory*", "How to Prevent Substitution",
                "Cost Savings Pass-on", "Patient Consent**"),
           sep = "\\s{2,}") %>%
  select(-a)
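This gives the result. 'a' is a dummy column: every line starts with whitespace, so separate produces an empty first value. Here is part of the result.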
   State   `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
   <chr>   <chr>            <chr>            <chr>            <chr>
 1 Alabama None             Yes              P, BBB           A
 2 Alaska  None             No               P                B
 3 Arizona None             No               P                I
 4 Arkans… None             No               P                B
 5 Califo… None             No               P                EE
 6 Colora… None             No               P                J
 7 Connec… None             No               P                E, F
 8 Delawa… None             No               P                E
 9 Distri… Positive         No               P                B
10 Florida Negative L       No               M                B
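The effect of the separator can be seen on a single line. The sketch below uses base R `strsplit()` as an approximation of what `separate()` does with each `sep` pattern. The `line` value is a made-up row shaped like the extracted text, and the variable names are illustrative. The first pattern is the documented default `sep` of `separate()`, the second is the `sep = " "` attempt from the question, and the third is the pattern used in this answer.

```r
# A made-up line shaped like one row of the extracted table (illustrative only).
line <- "   New Hampshire      None      No      P      B"

# Default sep of separate(): any run of non-alphanumeric characters.
# The leading spaces yield an empty first piece, and a two-word state
# name is cut into two pieces.
default_split <- strsplit(line, "[^[:alnum:]]+")[[1]]
print(default_split)

# sep = " ": every single space is a split point, so each run of spaces
# produces several empty pieces.
single_space <- strsplit(line, " ")[[1]]
print(single_space)

# sep = "\\s{2,}": only runs of two or more whitespace characters split,
# so "New Hampshire" stays together. The leading empty piece is still
# there, which is why the dummy column "a" is needed.
two_plus <- strsplit(line, "\\s{2,}")[[1]]
print(two_plus)
```

This matches the symptoms in the question: with the default separator the empty leading piece fills "State" and the state name slides into "State Drug Formulary", while with a single space most columns receive empty strings.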
To do everything in a single pipeline, starting from df:
df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, "\n\n", simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, "\n")) %>%
  unnest(x) %>%
  .[-c(1:3), ] %>%
  separate(x, c("a", "State", "State Drug Formulary", "Two-line Rx Format",
                "Permissive or Mandatory*", "How to Prevent Substitution",
                "Cost Savings Pass-on", "Patient Consent**"),
           sep = "\\s{2,}") %>%
  select(-a)
Or you can try this, starting directly from the PDF:
# Name the column x directly: as.data.frame() would name it after the
# expression pdf_text(url4)[[93]], so renaming from `out[[93]]` would fail.
tibble(x = pdf_text(url4)[[93]]) %>%
  mutate(x = stringr::str_split(x, "\n\n", simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, "\n")) %>%
  unnest(x) %>%
  .[-c(1:3), ] %>%
  separate(x, c("a", "State", "State Drug Formulary", "Two-line Rx Format",
                "Permissive or Mandatory*", "How to Prevent Substitution",
                "Cost Savings Pass-on", "Patient Consent**"),
           sep = "\\s{2,}") %>%
  select(-a)
