I need to read a data frame from a PDF file. Here is a sample table.

So far I have been able to read the data as raw lines using the following chunk:
library(pdftools)
library(tidyverse)
pdf_file <- pdf_text("exm.pdf")
raw_df <- pdf_file %>%
  read_lines() %>%
  data.frame() %>%
  rename(rawline = 1)
raw_df <- raw_df %>%
  mutate(
    rawline = str_replace(string = rawline,
                          pattern = "^ \\s*",
                          replacement = "")
  )
This is the structure of the raw df:
> raw_df
rawline
1 Id Name Address Mobile
2 1 Kiran Bengaluru, 99999 99999
3 Mysore Road
4 6th Lane
5 2 John Mandya 77777 77777
6 Taluka Junction
7 3 Ravi Mysore 88888 88888
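How can I convert this into a proper data frame? I tried using a regular expression to filter out the lines that start with a number, but then I got stuck. I need to collect the address lines (those that do not start with a number) and append them to the previous address text, then split each line into columns. I tried splitting on the whitespace between Id, Name, Address and Mobile, but the spacing is not the same in every row. How can I solve this? Thanks in advance.
Edit
As suggested, I tried pdf_data and got a table like this (head(15)), which contains the x and y position of each piece of text: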
# A tibble: 15 x 6
   width height     x     y space text
   <int>  <int> <int> <int> <lgl> <chr>
 1     8     11    77    74 FALSE Id
 2     5     11    77    88 FALSE 1
 3    26     11   181    74 FALSE Name
 4    23     11   181    88 FALSE Kiran
 5     5     11    77   129 FALSE 2
 6    20     11   181   129 FALSE John
 7     5     11    77   156 FALSE 3
 8    18     11   181   156 FALSE Ravi
 9    35     11   294    74 FALSE Address
10    48     11   294    88 FALSE Bengaluru,
11    33     11   294   102 TRUE  Mysore
12    22     11   330   102 FALSE Road
13     5     11   294   115 FALSE 6
14     5      6   299   114 TRUE  th
15    21     11   308   115 FALSE Lane
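Based on this table, I can filter on the x value and pull each column out as a vector. But this filtering does not work when a value contains spaces (as the addresses do). Is there a way to collect the Address column based on the x and y values?
Basically, I need to collect rows starting from a given value (e.g. x == 294) until the same value appears again, and then I can use str_c to merge those cells into one string.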
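One way to group the pdf_data tokens by position, as described above, might look like the sketch below. It is only an illustration under several assumptions: the header row is the topmost line, every cell starts at (or to the right of) the x position of its header, a new record starts wherever the Id column has a token, and a tolerance of 3 units is enough to treat slightly shifted tokens (such as the superscript "th") as part of the same line. The names tokens, tol and join_tokens are made up for this example. The data are the 15 rows shown above, so the Mobile column and the addresses of records 2 and 3 are absent.

```r
library(dplyr)
library(tidyr)

# The 15 tokens printed above, as returned by pdf_data() for this page
tokens <- tibble::tribble(
  ~width, ~height, ~x, ~y, ~space, ~text,
   8, 11,  77,  74, FALSE, "Id",
   5, 11,  77,  88, FALSE, "1",
  26, 11, 181,  74, FALSE, "Name",
  23, 11, 181,  88, FALSE, "Kiran",
   5, 11,  77, 129, FALSE, "2",
  20, 11, 181, 129, FALSE, "John",
   5, 11,  77, 156, FALSE, "3",
  18, 11, 181, 156, FALSE, "Ravi",
  35, 11, 294,  74, FALSE, "Address",
  48, 11, 294,  88, FALSE, "Bengaluru,",
  33, 11, 294, 102, TRUE,  "Mysore",
  22, 11, 330, 102, FALSE, "Road",
   5, 11, 294, 115, FALSE, "6",
   5,  6, 299, 114, TRUE,  "th",
  21, 11, 308, 115, FALSE, "Lane"
)

tol <- 3  # assumed tolerance (in page units) for "same line" / "same column"

# Join tokens: no separator inside a line when `space` is FALSE, a blank otherwise
join_tokens <- function(text, space, line) {
  n <- length(text)
  if (n == 1) return(text)
  same_line <- line[-n] == line[-1]
  sep <- ifelse(same_line & !space[-n], "", " ")
  paste0(c(rbind(text[-n], sep), text[n]), collapse = "")
}

# Header tokens (topmost line) give the left edge of each column
header <- tokens %>% filter(y == min(y)) %>% arrange(x)

body <- tokens %>%
  filter(y > min(y) + tol) %>%
  mutate(column = header$text[findInterval(x + tol, header$x)])

# Each token in the Id column marks the first line of a record
id_y <- body %>% filter(column == "Id") %>% pull(y) %>% sort()

result <- body %>%
  mutate(record = findInterval(y + tol, id_y)) %>%
  group_by(record, column) %>%
  arrange(y, .by_group = TRUE) %>%
  mutate(line = cumsum(c(TRUE, diff(y) > tol))) %>%   # cluster nearby y values into lines
  arrange(line, x, .by_group = TRUE) %>%
  summarise(value = join_tokens(text, space, line), .groups = "drop") %>%
  pivot_wider(names_from = column, values_from = value) %>%
  select(Id, Name, Address)

result
```

With these 15 tokens, the first record's Address becomes "Bengaluru, Mysore Road 6th Lane"; records 2 and 3 have NA there only because their address tokens fall outside head(15). On the full pdf_data() output the Mobile column would be picked up from the header in the same way.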
Answer:
Based on your first approach, try this function after obtaining raw_df:
library(dplyr)
parse_pdfs_lines_ById <- function(raw_df) {
  # ---- delete the row number prefix and the whitespace after it, so that
  # the Id becomes the first word of each record line
  # (skip this step if your rawline values do not carry a row number prefix)
  raw_df <- raw_df %>%
    mutate(rawline = sub("^[0-9]+\\s+", "", rawline))
  # ---- now ignore the row of column names
  raw_df <- data.frame(rawline = raw_df[-1, ], stringsAsFactors = FALSE)
  # ---- assign the correct id to each continuation line
  id <- ""
  for (i in seq_len(nrow(raw_df))) {
    # a record line starts with an Id followed by a column gap (2+ spaces),
    # so an address line such as "6th Lane" is not mistaken for a record
    if (grepl("^[0-9]+\\s{2,}", raw_df$rawline[i])) {
      # get the id: the first word of the line, not the first character (e.g. id == 22)
      id <- stringr::word(raw_df$rawline[i], 1)
    } else {
      raw_df$rawline[i] <- paste(id, raw_df$rawline[i], sep = "  ")
    }
  }
  # > raw_df
  # rawline
  # 1       Kiran    Bengaluru,      99999 99999
  # 1  Mysore Road
  # 1  6th Lane
  # 2       John     Mandya          77777 77777
  # 2  Taluka Junction
  # 3       Ravi     Mysore          88888 88888
  # ---- build the data frame
  col_df <- c("Id", "Name", "Address", "Mobile")
  raw_df2 <- setNames(data.frame(matrix(ncol = 4, nrow = 0), stringsAsFactors = FALSE), col_df)
  for (j in seq_len(nrow(raw_df))) {
    # split the line on runs of two or more spaces
    line <- unlist(strsplit(raw_df$rawline[j], "\\s{2,}"))
    df_line <- data.frame(t(line), stringsAsFactors = FALSE)
    # if all 4 columns exist, use the column names; otherwise the line holds
    # only the Id and a further part of the address ==> column Adress2
    names(df_line) <- if (length(line) == 4) col_df else c("Id", "Adress2")
    # rbind even when the number of columns is not the same
    raw_df2 <- plyr::rbind.fill(raw_df2, df_line)
  }
  # ---- clean the final data frame
  final_df <- raw_df2 %>%
    # replace NA with an empty value
    mutate_all(~ ifelse(is.na(.), "", .)) %>%
    group_by(Id) %>%
    # join the address parts and squeeze the leftover blanks
    # (put collapse = "\r\n" to keep the original line breaks)
    mutate(Address = trimws(gsub("\\s+", " ", paste(Address, Adress2, collapse = " ")))) %>%
    # keep just the first line by Id
    slice(1) %>%
    # remove the Adress2 column
    select(-Adress2) %>%
    ungroup()
  return(final_df)
}
Applying the function to your first example gives:
# The strings carry the printed row number as a prefix. Columns are assumed
# to be separated by two or more spaces (the exact widths are illustrative).
raw_df <- data.frame(rawline =
  c("1 Id      Name     Address         Mobile",
    "2 1       Kiran    Bengaluru,      99999 99999",
    "3                  Mysore Road",
    "4                  6th Lane",
    "5 2       John     Mandya          77777 77777",
    "6                  Taluka Junction",
    "7 3       Ravi     Mysore          88888 88888"),
  stringsAsFactors = FALSE
)
final_df <- parse_pdfs_lines_ById(raw_df)
final_df
# A tibble: 3 x 4
#   Id    Name  Address                         Mobile
#   <chr> <chr> <chr>                           <chr>
# 1 1     Kiran Bengaluru, Mysore Road 6th Lane 99999 99999
# 2 2     John  Mandya Taluka Junction          77777 77777
# 3 3     Ravi  Mysore                          88888 88888
Hope this helps! Let me know if something does not work or is not clear enough. (Answer formatting updated.)
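For comparison, the same idea (attach each continuation line to the record above it, then split on column gaps) can be written without explicit loops. This is a minimal sketch, not a drop-in replacement: it assumes rawline has no row number prefix and has already had its leading whitespace stripped, as in the question's own pipeline, and that columns within a record line are separated by two or more spaces. The names records, main and extra are invented for the example, and the spacing in the sample strings is illustrative.

```r
library(dplyr)
library(tidyr)

raw_df <- data.frame(
  rawline = c(
    "Id      Name     Address         Mobile",
    "1       Kiran    Bengaluru,      99999 99999",
    "Mysore Road",
    "6th Lane",
    "2       John     Mandya          77777 77777",
    "Taluka Junction",
    "3       Ravi     Mysore          88888 88888"
  ),
  stringsAsFactors = FALSE
)

records <- raw_df %>%
  slice(-1) %>%                                     # drop the header line
  mutate(
    is_record = grepl("^[0-9]+\\s{2,}", rawline),   # Id followed by a column gap
    record    = cumsum(is_record)                   # continuation lines inherit the number
  )

# Record lines: split into the four columns on runs of 2+ spaces
main <- records %>%
  filter(is_record) %>%
  separate(rawline, into = c("Id", "Name", "Address", "Mobile"), sep = "\\s{2,}") %>%
  select(-is_record)

# Continuation lines: one combined string per record
extra <- records %>%
  filter(!is_record) %>%
  group_by(record) %>%
  summarise(extra = paste(trimws(rawline), collapse = " "), .groups = "drop")

result <- main %>%
  left_join(extra, by = "record") %>%
  mutate(Address = trimws(paste(Address, coalesce(extra, "")))) %>%
  select(Id, Name, Address, Mobile)

result
```

Matching on "digits followed by two or more spaces" rather than "starts with a digit" keeps an address line such as "6th Lane" from being read as a new record. It would still misfire on a continuation line that happens to contain a number followed by a wide gap.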
