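I am developing a script that loops over all PDF files in a directory, extracts their text, and inserts it into individual cells of a CSV file. I can write the output to the cells successfully. However, I need the CSV file to contain a "text" header so that it can be merged with another CSV. So far I have had trouble inserting that header with csv_writer.
For example, the code below successfully extracts and inserts the text from the PDFs, but writes a new header for every extracted file: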
import pdfplumber
import csv
import glob

pdfs = glob.glob(r"dir\*.pdf")

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
         open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['text'])  # code for inserting header
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        csv_output.writerow([' '.join(text)])
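Another approach I tried was equally unsuccessful. I tried writing the header to the CSV first and then appending the output of the loop to it. However, for some reason the formatting of the PDF output was completely scrambled, with the text spread across multiple cells instead of a single one.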
pdfs = glob.glob(r"dir\*.pdf")

# code for writing header
file = open("pdf_output.csv", "w", newline="")
writer = csv.writer(file)
headers = ['text']
writer.writerow(headers)

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
         open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        csv_output.writerow([' '.join(text)])
Any suggestions on workarounds or better approaches for this challenge would be immensely welcome.
A uj5u.com user replied:
You can open the CSV first, write the header, and then loop over your PDFs:
import pdfplumber
import csv
import glob

pdfs = glob.glob(r"dir\*.pdf")

# Write the header once; this handle is closed before the loop starts.
with open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['text'])

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
         open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        # One row per PDF, with all pages joined into a single cell.
        csv_output.writerow([' '.join(text)])
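A possible variant of the same idea is to keep a single CSV handle open for the whole run. One plausible explanation for the scrambled cells in the second attempt from the question is that the header handle (`file`) is never closed: its buffered header may be flushed at position 0 only after the rows have been appended, overwriting the start of the first row, including the opening quote that keeps a multi-line field in one cell. The sketch below is only an illustration under that assumption. The helpers `write_texts` and `read_rows` are hypothetical names, and plain strings stand in for the text that `page.extract_text()` would return, so it runs without any PDFs.

```python
import csv
import os
import tempfile


def write_texts(csv_path, texts):
    """Write a single 'text' header, then one row per extracted text.

    Hypothetical helper: `texts` stands in for the joined page text of each PDF.
    """
    # "w" truncates the file, so re-running does not stack up headers.
    with open(csv_path, "w", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(["text"])
        for text in texts:
            csv_output.writerow([text])


def read_rows(csv_path):
    """Read the CSV back so the cell layout can be checked."""
    with open(csv_path, newline="", encoding="utf-8") as f_input:
        return list(csv.reader(f_input))


# Stand-in strings with a newline, a comma and quotes, the characters
# that would split a cell if the field were not quoted correctly.
sample_texts = ["first page\nsecond page, with a comma", 'a "quoted" phrase']
sample_path = os.path.join(tempfile.mkdtemp(), "pdf_output.csv")
write_texts(sample_path, sample_texts)
rows = read_rows(sample_path)
print(rows)
```

Because only one handle ever writes to the file, the header and the rows cannot overwrite each other, and each text should come back as a single cell when read with `csv.reader`.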
Or simply check whether it is the first iteration:
import pdfplumber
import csv
import glob

pdfs = glob.glob(r"dir\*.pdf")

for i, pf in enumerate(pdfs):
    with pdfplumber.open(pf) as pdf, \
         open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        # Write the header only before the first PDF's row.
        if i == 0:
            csv_output.writerow(['text'])
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        csv_output.writerow([' '.join(text)])
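Since the file is opened in append mode, running the script a second time would add another header in the middle of the file. A possible refinement, sketched here as an assumption rather than as part of the original answer, is to decide on the header by checking whether the CSV is missing or empty instead of checking the loop index. `append_text` is a hypothetical helper, and the sample strings stand in for extracted PDF text.

```python
import csv
import os
import tempfile


def append_text(csv_path, text):
    """Append one row; write the 'text' header only if the file is new or empty.

    Hypothetical helper, not part of the original answer.
    """
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        if needs_header:
            csv_output.writerow(["text"])
        csv_output.writerow([text])


# Stand-in strings; real input would be the joined text of each PDF.
demo_path = os.path.join(tempfile.mkdtemp(), "pdf_output.csv")
for sample in ["text of first pdf", "text of second pdf"]:
    append_text(demo_path, sample)

with open(demo_path, newline="", encoding="utf-8") as f_input:
    demo_rows = list(csv.reader(f_input))
print(demo_rows)
```

With this check the header depends on the state of the file rather than on the position in the loop, so later runs keep appending rows below the existing header.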
Tags: python csv pdf pdfplumber
