How to check whether a PDF is a scanned image or contains text in R

I am running structural equation models on a number of PDFs (>1000) in R.

However, some of the PDFs are readable while others are scanned, i.e. I need to run them through an OCR function.

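So I need a way to automatically identify which PDFs contain text and which do not. Specifically, I am looking for a way to return whether a given PDF should be run through OCR.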

Does anyone know of a function or package in R that could help with this? I could find some Python solutions, but have not been able to find any for R.

Answer from a uj5u.com user:

You can use an approach like this (as @danlooo already suggested, but I wanted to spell it out):

files <- list.files("/home/johannes/pdfs/",
                    pattern = "\\.pdf$",  # escape the dot so it matches a literal "."
                    full.names = TRUE)

pdfs_l <- lapply(files, function(f) {
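  # pdf_text() returns one string per page
  out <- pdftools::pdf_text(f)
  # I set the test to an arbitrary number of characters. It works for me, but you
  # may want to fine-tune it. Summing over pages keeps the test a single TRUE/FALSE.
  contains_text <- sum(nchar(out)) > 15
  if (!contains_text) {
    # pdf_ocr_text() requires the tesseract package
    out <- pdftools::pdf_ocr_text(f)
  }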
  data.frame(text = out, ocr = !contains_text)
})

pdfs_l |>
  dplyr::bind_rows() |>
  dplyr::mutate(text = trimws(text)) |>
  tibble::as_tibble()
#> # A tibble: 22 × 2
#>    text                                                                    ocr  
#>    <chr>                                                                   <lgl>
#>  1 "TEAM MEMBERS:\n                                                      … FALSE
#>  2 "WS 21/22                                                             … FALSE
#>  3 "WS 21/22                                                             … FALSE
#>  4 "TEAM MEMBERS:\n                                                      … FALSE
#>  5 "TEAM MEMBERS:\n                                                      … FALSE
#>  6 "Key Concepts in Political Communication\n    @Agenda Setting, Priming… FALSE
#>  7 "Key Concepts in Political Communication\n    @Agenda Setting, Priming… FALSE
#>  8 "ELECTIONS AND CAMPAIGNS\n                                            … FALSE
#>  9 ""                                                                      TRUE 
#> 10 ""                                                                      TRUE 
#> # … with 12 more rows

Created on 2022-02-10 by the reprex package (v2.0.1)
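
The question asks specifically for a way to return whether a given PDF should go through OCR, without necessarily running OCR right away. A minimal sketch of that check is below. The function names `needs_ocr()` and `pdf_needs_ocr()` are hypothetical, and the threshold of 15 characters is the same arbitrary value used above. Unlike the code above, this sketch counts only non-whitespace characters, which is an assumption you may want to change.

```r
# Decide from already-extracted page text whether OCR looks necessary.
# `pages` is a character vector with one element per page, which is what
# pdftools::pdf_text() returns. `min_chars` is an assumed, tunable threshold.
needs_ocr <- function(pages, min_chars = 15) {
  # Count non-whitespace characters across all pages
  n_chars <- sum(nchar(gsub("\\s", "", pages)))
  n_chars <= min_chars
}

# Hypothetical wrapper: TRUE if the PDF at `path` should be run through OCR.
pdf_needs_ocr <- function(path, min_chars = 15) {
  needs_ocr(pdftools::pdf_text(path), min_chars = min_chars)
}

# Possible usage (commented out because it depends on local files):
# files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# flags <- vapply(files, pdf_needs_ocr, logical(1))
# files[flags]  # the PDFs that appear to have no text layer
```

Keeping the decision in a function that only looks at the extracted text makes the threshold easy to try out on a few known files before processing all of them. A PDF that mixes scanned and text pages will count as containing text here, so a per-page check may suit such files better.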


Tags: r pdf text
