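I am using PyPDF2 to read text from a PDF file. However, PyPDF2 does not read the text line by line; it reads it in a seemingly arbitrary order, and it inserts line breaks in places where none exist in the PDF.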
import PyPDF2
pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page_nos = pdfReader.numPages
for i in range(page_nos):
    # Creating a page object
    pageObj = pdfReader.getPage(i)
    # Printing Page Number
    print("Page No: ", i)
    # Extracting text from page
    # And splitting it into chunks of lines
    text = pageObj.extractText().split(" ")
    # Finally the lines are stored into list
    # For iterating over list a loop is used
    for i in range(len(text)):
        # Printing the line
        # Lines are separated using "\n"
        print(text[i], end="\n\n")
    print()
This gives me the following output:
Our Ref :
21
1
8
88
1
11
5
Name:
S
ky Blue
Ref 1 :
1
2
-
34
-
56789
-
2021/2
Ref 2:
F2021004
444
Amount:
$
1
00
.
11
...
whereas the expected output is:
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Here is a link to the PDF file: https://drive.internxt.com/s/file/a6ce09dd3b967bfc131a/a1f64430147399ab527527436e686b0ee67011e7248ec3cc834e233596e162cf
Answer from a uj5u.com user:
I tried a different package called pdfplumber. It was able to read the PDF line by line, the way I wanted.
1. Install the pdfplumber package
pip install pdfplumber
2. Extract the text and store it in a variable
import pdfplumber
pdf_text = None
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    pdf_text = first_page.extract_text()
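The snippet above reads only the first page and leaves the text as one string. As a rough sketch of how this might be extended to every page and split into lines, something like the following could work. The helper names iter_lines and read_pdf_lines are hypothetical (they are not part of pdfplumber), and the sketch assumes that extract_text() separates lines with "\n" and may return None or an empty string for a page without text. It splits with splitlines() rather than split(" "), since splitting on spaces breaks the text into words instead of lines.

```python
def iter_lines(pages):
    """Yield (page_number, line) pairs from objects exposing extract_text().

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    for page_number, page in enumerate(pages, start=1):
        # Assumption: a page without text may give None or "".
        text = page.extract_text() or ""
        # Split on line breaks, not on spaces.
        for line in text.splitlines():
            if line.strip():  # skip blank lines
                yield page_number, line


def read_pdf_lines(pdf_path):
    """Open pdf_path with pdfplumber and return all non-empty lines."""
    # Imported here so iter_lines can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return list(iter_lines(pdf.pages))
```

With pdf_path defined as in the question, calling read_pdf_lines(pdf_path) and printing each (page_number, line) pair should give one printed line per line of extracted text. Whether those lines match the visual layout still depends on how the PDF was produced.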
