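Goal: if a line of the PDF contains a substring, copy the entire sentence (which may span multiple lines).
I am able to print() the line in which the phrase appears.
Now, once I have found this line, I want to iterate backwards until I find a sentence terminator (. ! ?) belonging to the previous sentence, and then iterate forwards again until the next sentence terminator.
That way I can print() the entire sentence that the phrase belongs to.
However, my scrape_sentence() function hits a recursion error and runs indefinitely.
Jupyter Notebook: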
# pip install PyPDF2
# pip install pdfplumber
# ---
# import re
import glob
import PyPDF2
import pdfplumber
# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.compile('^[A-Z][^?!.]*[?.!]$')
def scrape_sentence(sentence, lines, index, phrase):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1, phrase)  # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1, phrase)  # following line
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    sentence = sentence[0]  # first occurrence
    print(sentence)
    return sentence
# ---
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i, phrase)  # !
                print(sentence)  # !
            i += 1
Output:
connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there
Phrase:
Responsible Care Company
Sentence (spanning multiple lines):
"GPIC is a Responsible Care Company certified for RC 14001
since July 2010."
PDF (page 2).
Please let me know if there is anything else I can add to the post.
Answer from a uj5u.com user:
I got this working by removing the recursion from scrape_sentence().
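The code for that fix is not included in this post, so the following is only a minimal sketch of one non-recursive way to do it, not the answerer's actual code. The helper name find_sentence and the sample lines are hypothetical. It assumes the phrase sits on a single line and that every '.', '!' or '?' ends a sentence, which will be wrong for abbreviations and decimals.

```python
import re

# Any of the three sentence terminators named in the question.
TERMINATOR = re.compile(r"[.!?]")
# Split point: whitespace that directly follows a terminator.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def find_sentence(lines, index, phrase):
    """Hypothetical helper: return the sentence around lines[index] that
    contains `phrase`, or None if no such sentence is found."""
    # Walk backwards to the nearest earlier line containing a terminator
    # (the previous sentence ends there), or to the top of the page.
    start = index
    while start > 0:
        start -= 1
        if TERMINATOR.search(lines[start]):
            break

    # Walk forwards to the nearest later line containing a terminator,
    # or to the bottom of the page.
    end = index
    while end < len(lines) - 1:
        end += 1
        if TERMINATOR.search(lines[end]):
            break

    # Join the window into one string, split it into sentences and
    # return the first sentence that contains the phrase.
    window = " ".join(line.strip() for line in lines[start:end + 1])
    for sentence in SENTENCE_END.split(window):
        if phrase in sentence:
            return sentence.strip()
    return None


# Invented sample lines, loosely modelled on the text quoted in the question.
sample_lines = [
    "linkage to the relevant goals.",
    "GPIC is a Responsible Care Company certified for RC 14001",
    "since July 2010. We have long realized and recognized that there",
]
print(find_sentence(sample_lines, 1, "Responsible Care Company"))
# prints: GPIC is a Responsible Care Company certified for RC 14001 since July 2010.
```

Both loops move strictly towards a page boundary, so they always terminate. In the page loop from the question, the call would look like find_sentence(lines, i, phrase) in place of scrape_sentence('', lines, i).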
