Hi, I used the following code to extract the text (as a string) from the insurance contract below:
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=False):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager,
                                      fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            yield text
            # close open handles
            converter.close()
            fake_file_handle.close()


def extract_text(pdf_path):
    text = ""
    for page in extract_text_by_page(pdf_path):
        # print(page)
        # note: each page is prepended, so pages end up in reverse order
        text = page + " " + text
    return text


# Driver code
if __name__ == '__main__':
    text = extract_text('document.pdf')
    print(text)
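I want to extract the labels highlighted in red below, together with their associated values (highlighted in blue).
Some example outputs:
print(oneri_di_registrazione_atti_giudiziari)
"Valore in lite minimo 300 Limite di indennizzo 500" (as a string)
print(tutela_dati_personali)
"Massimale per evento 25000" (as a string)
If I change the file, the red parts will not change, but the blue parts may. I would like to link each value to its red counterpart. Does anyone know how to do this? In case it helps, here is the raw string I extracted: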
Valoreinliteminimoeuro300,00OneridiregistrazionediattigiudiziariLimitediindennizzoeuro500,00PerlagaranziaATTIVITA'AZIENDALECOMPLETAMassimalepereventoeuro50.000,00SpeseperunsecondolegaledomiciliatarioLimitediindennizzoeuro2.000,00ControversiecontrattualiconifornitoriValoreinlitemassimoEuro50.000,00ControversiecontrattualiconifornitoriScoperto20%PerlagaranziaSALUTEESICUREZZASULLAVOROMassimalepereventoeuro25.000,00PerlagaranziaTUTELADEIDATIPERSONALIMassimalepereventoeuro25.000,00
If you need more information, please leave a comment. Thanks in advance to anyone who can solve this.
Answer:
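Disclaimer: I am the author of borb, the library used in this answer.
I described a scenario similar to yours in the library's examples repository. For completeness, I will repeat the answer here.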
#!chapter_005/src/snippet_008.py
import typing
from decimal import Decimal

from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import LocationFilter
from borb.toolkit import RegularExpressionTextExtraction
from borb.toolkit import PDFMatch
from borb.toolkit import SimpleTextExtraction


def main():

    # set up RegularExpressionTextExtraction
    # fmt: off
    l0: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[nN]isi .* aliquip")
    # fmt: on

    # process Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l0])
    assert doc is not None

    # find match
    m: typing.Optional[PDFMatch] = next(iter(l0.get_matches_for_page(0)), None)
    assert m is not None

    # get page width
    w: Decimal = doc.get_page(0).get_page_info().get_width()

    # change rectangle to get more text:
    # start at the right edge of the match and extend towards the right of the page
    r0: Rectangle = m.get_bounding_boxes()[0]
    r1: Rectangle = Rectangle(
        r0.get_x() + r0.get_width(), r0.get_y(), w - r0.get_x(), r0.get_height()
    )

    # process document (again) filtering by rectangle
    l1: LocationFilter = LocationFilter(r1)
    l2: SimpleTextExtraction = SimpleTextExtraction()
    l1.add_listener(l2)
    doc = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l1])
    assert doc is not None

    # get text
    print(l2.get_text_for_page(0))


if __name__ == "__main__":
    main()
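The idea is that you use RegularExpressionTextExtraction to find text in the PDF. This class can then return a list of PDFMatch objects, each of which contains the bounding boxes of the matched text.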
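As a rough illustration of the "find a label with a regular expression, then read what follows it" step, here is a minimal sketch that uses only Python's standard `re` module on the raw string from the question. It does not use borb and knows nothing about page layout. The names `RAW`, `AMOUNT` and `amount_after` are hypothetical helpers introduced for this example, and the sketch assumes that every amount is written as "euro" followed by a figure with `.` as thousands separator and `,` before two decimals. It only returns the first amount after a label, so a value printed before its label (such as "Valore in lite minimo") would need separate handling.

```python
import re

# Raw text as extracted in the question (whitespace was lost during extraction).
RAW = (
    "Valoreinliteminimoeuro300,00Oneridiregistrazionediattigiudiziari"
    "Limitediindennizzoeuro500,00PerlagaranziaATTIVITA'AZIENDALECOMPLETA"
    "Massimalepereventoeuro50.000,00Speseperunsecondolegaledomiciliatario"
    "Limitediindennizzoeuro2.000,00Controversiecontrattualiconifornitori"
    "ValoreinlitemassimoEuro50.000,00Controversiecontrattualiconifornitori"
    "Scoperto20%PerlagaranziaSALUTEESICUREZZASULLAVORO"
    "Massimalepereventoeuro25.000,00PerlagaranziaTUTELADEIDATIPERSONALI"
    "Massimalepereventoeuro25.000,00"
)

# "euro"/"Euro", then an amount such as 300,00 or 25.000,00.
# Group 1 captures the integer part, including thousands separators.
AMOUNT = re.compile(r"[eE]uro(\d{1,3}(?:\.\d{3})*),\d{2}")


def amount_after(raw, label):
    """Return the first euro amount that follows `label` as an int, or None."""
    start = raw.find(label)
    if start == -1:
        return None  # label not present in the text
    match = AMOUNT.search(raw, start + len(label))
    if match is None:
        return None  # no amount found after the label
    return int(match.group(1).replace(".", ""))


if __name__ == "__main__":
    print("Massimale per evento", amount_after(RAW, "TUTELADEIDATIPERSONALI"))
```

Because the fixed labels are looked up by their text and only the amounts are parsed, the same call should keep working when the amounts change from one contract to the next, as long as the labels stay the same.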
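You can then manipulate the Rectangle objects from those matches (in your case, extend them to the far right of the Page) and extract the text of the PDF that falls inside the resulting Rectangle.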
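The rectangle arithmetic can be sketched on its own, without any PDF library. The `Box` class and `region_right_of` function below are hypothetical stand-ins for illustration, not borb's `Rectangle` API. Note one deliberate difference from the snippet above: this version makes the new region exactly as wide as the space remaining on the page, whereas the snippet uses the page width minus the match's x position, which reaches slightly past the right edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: lower-left corner (x, y) plus width and height."""
    x: float
    y: float
    width: float
    height: float


def region_right_of(match_box, page_width):
    """Return the box that starts at the right edge of `match_box`
    and runs to the right edge of the page, at the same height."""
    left = match_box.x + match_box.width
    return Box(left, match_box.y, page_width - left, match_box.height)


if __name__ == "__main__":
    # A label found at x=100 with width 50, on a page 595 units wide.
    label_box = Box(100, 700, 50, 12)
    print(region_right_of(label_box, 595))
```

With the numbers above, the resulting region starts at x=150 and is 445 units wide, so it covers everything printed on the same line to the right of the label.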
