How to convert raw lines into a df

I need to read a df (data frame) from a table in a PDF file.

So far, I have been able to read the data as raw lines with the following chunk:

library(pdftools)
library(tidyverse)

pdf_file <- pdf_text("exm.pdf")

raw_df <- pdf_file %>%
  read_lines() %>%
  data.frame() %>% 
  rename(rawline = 1)

raw_df <- raw_df %>% 
  mutate(
    rawline = str_replace(string = rawline,
                          pattern = "^ \\s*",
                          replacement = "")
    )

This is the structure of the raw df:

> raw_df
                                     rawline
1      Id   Name    Address           Mobile
2 1    Kiran   Bengaluru,        99999 99999
3                                Mysore Road
4                                   6th Lane
5 2    John    Mandya            77777 77777
6                            Taluka Junction
7 3    Ravi    Mysore            88888 88888

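How can I convert this into a proper df? I tried using a regular expression to filter the lines that start with a number, but then I got stuck. I need to collect the address lines (those that do not start with a number), append them to the preceding address text, and then split each line into columns. I tried splitting on the whitespace between Id, Name, Address, and Mobile, but the spacing is not the same on every line. How can I solve this? Thanks in advance.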

Edit

As suggested, I tried pdf_data and got a table like this (head(15)), which includes the x and y position of each piece of text:

# A tibble: 15 x 6
   width height     x     y space text      
   <int>  <int> <int> <int> <lgl> <chr>     
 1     8     11    77    74 FALSE Id        
 2     5     11    77    88 FALSE 1         
 3    26     11   181    74 FALSE Name      
 4    23     11   181    88 FALSE Kiran     
 5     5     11    77   129 FALSE 2         
 6    20     11   181   129 FALSE John      
 7     5     11    77   156 FALSE 3         
 8    18     11   181   156 FALSE Ravi      
 9    35     11   294    74 FALSE Address   
10    48     11   294    88 FALSE Bengaluru,
11    33     11   294   102 TRUE  Mysore    
12    22     11   330   102 FALSE Road      
13     5     11   294   115 FALSE 6         
14     5      6   299   114 TRUE  th        
15    21     11   308   115 FALSE Lane

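Based on this table, I can filter on the x value and get each column as a vector. But this filtering does not work when a value contains spaces (as the addresses do). Is there a way to collect the Address column based on the x and y values?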

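Basically, I need to collect rows starting from one value (e.g. x == 294) until the same value occurs again, and then I can use str_c to merge those cells into a single string.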
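
The x/y idea described above might be sketched as follows. This is only a rough sketch under several assumptions: the `tokens` table is retyped by hand from the head(15) output (so the Mobile column and the later addresses are missing), the column boundaries in `col_breaks` and the header position `header_y` are read off that output rather than detected, and the tokens are assumed to arrive in reading order. The names `tokens`, `col_breaks`, `header_y` and `id_y` are illustrative only.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hand-typed copy of the head(15) output (assumption: only these rows exist)
tokens <- tribble(
  ~x,  ~y,  ~space, ~text,
  77,  74,  FALSE,  "Id",
  77,  88,  FALSE,  "1",
  181, 74,  FALSE,  "Name",
  181, 88,  FALSE,  "Kiran",
  77,  129, FALSE,  "2",
  181, 129, FALSE,  "John",
  77,  156, FALSE,  "3",
  181, 156, FALSE,  "Ravi",
  294, 74,  FALSE,  "Address",
  294, 88,  FALSE,  "Bengaluru,",
  294, 102, TRUE,   "Mysore",
  330, 102, FALSE,  "Road",
  294, 115, FALSE,  "6",
  299, 114, TRUE,   "th",
  308, 115, FALSE,  "Lane"
)

header_y   <- 74                        # y position of the header line
col_breaks <- c(-Inf, 181, 294, Inf)    # left edges of Name and Address
col_names  <- c("Id", "Name", "Address")

# Assign every token to a column by its x position
tokens <- tokens %>%
  mutate(column = as.character(
    cut(x, breaks = col_breaks, labels = col_names, right = FALSE)
  ))

# Each token in the Id column below the header starts a new record
id_y <- tokens %>%
  filter(column == "Id", y > header_y) %>%
  pull(y) %>%
  sort()

result <- tokens %>%
  filter(y > header_y) %>%
  # a token belongs to the last record that starts at or above its y position
  mutate(record = findInterval(y, id_y)) %>%
  group_by(record, column) %>%
  mutate(
    line_gap = abs(lead(y) - y) > 5,    # next token sits on another line
    glue     = if_else(space | coalesce(line_gap, FALSE), " ", "")
  ) %>%
  summarise(value = str_trim(str_c(text, glue, collapse = "")),
            .groups = "drop") %>%
  pivot_wider(names_from = column, values_from = value) %>%
  select(Id, Name, Address)

result
```

The `space` flag is used so that "6" and "th" are joined without a gap, while a jump of more than 5 units in y is treated as a line break inside the same cell. Both the tolerance and the reliance on token order would need checking against the real PDF.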

Answer from a uj5u.com user:

Following your first approach, try this function once you have raw_df:

library(dplyr)
parse_pdfs_lines_ById<- function(raw_df){
 
# ---- delete the row name: drop the first character of each line
#      (this assumes the row number is part of the text, as in the example below)
raw_df=raw_df%>%
 mutate(rawline=sub('.', '', rawline))%>%
 # ---- remove one leading space so that Id is the first word
 mutate(rawline=gsub('^ ', '', rawline))  

# ---- now drop the row that holds the column names
raw_df=data.frame(rawline=raw_df[-1,])


# ---- prefix each continuation line with the id of the record it belongs to
# initialize the line index
i=1
while (i<=nrow(raw_df))
{
 if(grepl("^[0-9]",raw_df$rawline[i]))
 {
   # get the id: the first word of the line, not the first character (e.g. id == 22)
   id=stringr::word(raw_df$rawline[i],1)
 }else{ 
   raw_df$rawline[i]=paste0(id,raw_df$rawline[i])
 }   
 i=i+1
}
# > raw_df
#                                    rawline
# 1    Kiran   Bengaluru,        99999 99999
# 1                               Mysore Road
# 1                                  6th Lane
# 2    John    Mandya            77777 77777
# 2                           Taluka Junction
# 3    Ravi    Mysore            88888 88888



# ------build the dataframe

col_df= list("Id","Name", "Address", "Mobile")
raw_df2 =setNames(data.frame(matrix(ncol = 4, nrow = 0),stringsAsFactors = F),col_df)

for (j in 1:nrow(raw_df))
{
 # split the line on runs of two or more whitespace characters
 line= unlist(strsplit(raw_df$rawline[j],"\\s{2,}"))
 df_line= data.frame(t(line),stringsAsFactors = F)
 # if all 4 columns exist, assign the column names; otherwise the line holds only the Id and a continuation of the address ==> column Adress2
 names(df_line) = unlist(ifelse(length(line)==4,
                                list(col_df),
                                list(c("Id","Adress2")))
 )
 # rbind even when the number of columns differs
 raw_df2=plyr::rbind.fill(raw_df2,df_line )
}

# ----- clean final dataframe

final_df = raw_df2%>%
 # replace NA with an empty string
 mutate_all(~ifelse(is.na(.), "", .))%>%
 group_by(Id)%>%
 mutate(Address= paste(Address,Adress2,collapse = " "))%>% # use collapse = "\r\n" to keep the original line breaks
 # keep just the first line per Id
 slice(1)%>%
 # remove the Adress2 column
 select(-Adress2)%>%
 ungroup()
return(final_df)
}

Applying the function to your first example gives:

raw_df  = data.frame(rawline=
                       c("1      Id   Name    Address           Mobile",
                         "2 1    Kiran   Bengaluru,        99999 99999",
                         "3                                Mysore Road",
                         "4                                   6th Lane",
                         "5 2    John    Mandya            77777 77777",
                         "6                            Taluka Junction",
                         "7 3    Ravi    Mysore            88888 88888")
)
final_df=parse_pdfs_lines_ById(raw_df)
final_df
# final_df
# A tibble: 3 x 4
# Id    Name  Address                              Mobile     
# <chr> <chr> <chr>                                <chr>      
# 1     Kiran "Bengaluru,   Mysore Road  6th Lane" 99999 99999
# 2     John  "Mandya   Taluka Junction"           77777 77777
# 3     Ravi  "Mysore "                            88888 88888
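
The same idea can probably be written without explicit loops. The following is a minimal sketch, not a tested drop-in replacement: it assumes the lines look the way they do after the str_replace step in the question (leading whitespace removed, no row numbers inside the text), that the four columns are separated by at least two spaces, and that only record lines start with a number followed by whitespace. The names `lines`, `records` and `extras` are illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Assumed input: raw lines with leading whitespace already stripped
raw_df <- data.frame(rawline = c(
  "Id   Name    Address           Mobile",
  "1    Kiran   Bengaluru,        99999 99999",
  "Mysore Road",
  "6th Lane",
  "2    John    Mandya            77777 77777",
  "Taluka Junction",
  "3    Ravi    Mysore            88888 88888"
))

lines <- raw_df %>%
  slice(-1) %>%                                    # drop the header line
  mutate(
    # a record line starts with digits followed by whitespace,
    # so "6th Lane" is not mistaken for a new record
    is_record = str_detect(rawline, "^\\d+\\s"),
    record    = cumsum(is_record)                  # running record number
  )

# Record lines: split into the four columns on runs of 2+ whitespace
records <- lines %>%
  filter(is_record) %>%
  separate(rawline, into = c("Id", "Name", "Address", "Mobile"),
           sep = "\\s{2,}")

# Continuation lines: collapse into one string per record
extras <- lines %>%
  filter(!is_record) %>%
  group_by(record) %>%
  summarise(extra = str_c(str_squish(rawline), collapse = " "))

result <- records %>%
  left_join(extras, by = "record") %>%
  mutate(Address = str_squish(
    str_c(Address, coalesce(extra, ""), sep = " ")
  )) %>%
  select(Id, Name, Address, Mobile)

result
```

A continuation line such as "12 Main Street" would still be misread as a new record, so the `is_record` pattern is the part most likely to need adjusting for real data.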

Hope this helps! If something doesn't work or isn't clear enough, let me know. (Updated the response format.)
