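I am trying to convert a PDF into two lists: titles and content. However, I found that this function does not work for the last few pages of the PDF.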
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# PDF --> title list and content list
def extract_title_content(path):
    title = []
    content = []
    a = ""
    b = ""
    # check_size (not shown) returns the two font-size thresholds
    mode, minn = check_size(path)
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a = ""
        b = ""
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            if character.size > mode:
                                a += character.get_text()
                            elif character.size > minn:
                                b += character.get_text()
                            else:
                                pass
    return title, content
A helpful uj5u.com user replied:
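In your outer loop, you first append the most recently extracted larger text a to title and the medium text b to content, then clear a and b, and only then extract new text into a and b: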
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a = ""
        b = ""
        [... extract into a and b ...]
As a result, whatever you extract from the last page is never added to title and content.
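The off-by-one can be reproduced without a PDF. The sketch below replaces pdfminer's layout objects with plain lists of (character, font size) pairs; `fake_pages`, `collect_append_first`, and the threshold values 14 and 10 are made-up stand-ins for illustration, not part of the original code.

```python
# Hypothetical stand-in for extract_pages(): each page is a list of
# (character, font_size) pairs instead of pdfminer layout objects.
fake_pages = [
    [("A", 20), ("x", 12)],  # page 1
    [("B", 20), ("y", 12)],  # page 2
    [("C", 20), ("z", 12)],  # page 3
]

def collect_append_first(pages, mode=14, minn=10):
    """Same loop shape as the question: append, reset, then extract."""
    title, content = [], []
    a = b = ""
    for page in pages:
        title.append(a)    # stores the PREVIOUS page's text
        content.append(b)
        a = b = ""
        for char, size in page:
            if size > mode:
                a += char
            elif size > minn:
                b += char
    # The text of the final page is still sitting in a and b here.
    return title, content

title, content = collect_append_first(fake_pages)
print(title)    # ['', 'A', 'B']
print(content)  # ['', 'x', 'y']
```

Each list starts with an empty string from before the first page, and page 3 ("C" and "z") is missing entirely.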
To fix this, either move the appending of a and b to title and content so that it happens after a and b have been filled:
    for page_layout in extract_pages(path):
        [... extract into a and b ...]
        title.append(a)
        content.append(b)
        a = ""
        b = ""
Or, if for some reason you need to append before filling, explicitly append once more after the loop:
    for page_layout in extract_pages(path):
        title.append(a)
        content.append(b)
        a = ""
        b = ""
        [... extract into a and b ...]
    title.append(a)
    content.append(b)
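The two fixes do not produce identical lists, which a small simulation makes visible. This is only a sketch: `fake_pages`, `split_by_size`, `append_after_fill`, `append_first_then_flush`, and the thresholds 14 and 10 are hypothetical names and values standing in for the pdfminer loop, not code from the question.

```python
# Hypothetical stand-in for extract_pages(): each page is a list of
# (character, font_size) pairs instead of pdfminer layout objects.
fake_pages = [
    [("A", 20), ("x", 12)],
    [("B", 20), ("y", 12)],
    [("C", 20), ("z", 12)],
]

def split_by_size(page, mode, minn):
    """Collect large characters into a and medium ones into b."""
    a = b = ""
    for char, size in page:
        if size > mode:
            a += char
        elif size > minn:
            b += char
    return a, b

def append_after_fill(pages, mode=14, minn=10):
    """First fix: extract a page, then append its text."""
    title, content = [], []
    for page in pages:
        a, b = split_by_size(page, mode, minn)
        title.append(a)
        content.append(b)
    return title, content

def append_first_then_flush(pages, mode=14, minn=10):
    """Second fix: keep the original order, append once more after the loop."""
    title, content = [], []
    a = b = ""
    for page in pages:
        title.append(a)
        content.append(b)
        a, b = split_by_size(page, mode, minn)
    title.append(a)    # flush the last page
    content.append(b)
    return title, content

print(append_after_fill(fake_pages)[0])        # ['A', 'B', 'C']
print(append_first_then_flush(fake_pages)[0])  # ['', 'A', 'B', 'C']
```

Both variants now include the last page, but the second one still carries the empty entry appended before the first page, so its lists are one element longer than the page count.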
