pdfplumber | Extract text from dynamic column layouts

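My attempted solution is at the bottom of this post.

I have nearly working code that extracts the sentence containing a phrase, even when the sentence spans multiple lines.

However, some pages have columns, so the output for those pages is wrong: separate pieces of text are incorrectly merged into one bad sentence.

This problem has been addressed in the following posts:

Solution 1
Solution 2

Question:

How do I write an "if-condition" that checks whether a page has columns?

A page may have no columns.
A page may have 2 or more columns.
A page may also have headers and footers (these can be left out).

Example .pdf with a dynamic text layout: PDF (pg. 2).

Jupyter Notebook: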

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber

# ---

def scrape_sentence(phrase, lines, index):
    # -- Gather sentence 'phrase' occurs in --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))
    
    # Previous lines: prepend lines until one contains a sentence terminator
    pre_i, flag = index, 0
    while flag == 0:
        pre_i -= 1
        if pre_i < 0:  # stop once we run past the first line
            break

        sentence = lines[pre_i] + sentence

        if any(mark in lines[pre_i] for mark in '.!?'):
            flag = 1  # assignment, not comparison, so the loop actually stops
    
    print("\n", sentence)
    
    # Following lines: append lines until one contains a sentence terminator
    post_i, flag = index, 0
    while flag == 0:
        post_i += 1
        if post_i >= len(lines):
            break

        sentence = sentence + lines[post_i]

        if any(mark in lines[post_i] for mark in '.!?'):
            flag = 1  # assignment, not comparison, so the loop actually stops
    
    print("\n", sentence)
    
    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurrence
    print(sentence)
    
    return sentence

# ---

phrase = 'Gulf Petrochemical Industries Company'

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text is None:  # pages without extractable text
            continue
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if phrase in lines[i]:
                sentence = scrape_sentence(phrase, lines, i)
            i += 1

Example of incorrect output:

-- sentence -- being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of 
len(lines) 47

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of 

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report,  Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf  Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is  GPIC is a Responsible Care Company certified for RC 14001 since July 2010. 
We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain  and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum  always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[' being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption']
being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption

...

Attempted Minimal Solution: This always splits the text into 2 columns, regardless of whether the page actually has 2.

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber

# ---

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        # float() works whether page.width/height are Decimal (older pdfplumber) or float
        width, height = float(page.width), float(page.height)
        left = page.crop((0, 0, 0.5 * width, 0.9 * height))
        right = page.crop((0.5 * width, 0, width, height))
        
        l_text = left.extract_text()
        r_text = right.extract_text()
        print("\n -- l_text --", l_text)
        print("\n -- r_text --", r_text)
        text = str(l_text) + " " + str(r_text)
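
The crop above always cuts at the page midpoint. One possible way to make that conditional is to look for a vertical strip that no word overlaps. The sketch below is only an illustration, not something from the linked posts: `find_column_gutters`, `min_gap` and `bin_width` are hypothetical names and values, and it assumes each word is a dict with `x0` and `x1` keys, which is the shape returned by pdfplumber's `page.extract_words()`.

```python
def find_column_gutters(words, page_width, min_gap=15.0, bin_width=2.0):
    """Return (x_start, x_end) strips between the text edges that no word overlaps.

    An empty list suggests a single-column page; N strips suggest N + 1 columns.
    `min_gap` and `bin_width` are in PDF points and are guesses to tune per document.
    """
    n_bins = int(page_width // bin_width) + 1
    covered = [False] * n_bins

    # Mark every horizontal bin that is touched by at least one word.
    for word in words:
        first = max(int(float(word["x0"]) // bin_width), 0)
        last = min(int(float(word["x1"]) // bin_width), n_bins - 1)
        for b in range(first, last + 1):
            covered[b] = True

    if not any(covered):
        return []  # no text on the page

    # Only look between the leftmost and rightmost text, so margins are ignored.
    left = covered.index(True)
    right = n_bins - 1 - covered[::-1].index(True)

    gutters = []
    run_start = None
    for b in range(left, right + 1):
        if not covered[b]:
            if run_start is None:
                run_start = b
        elif run_start is not None:
            x_start, x_end = run_start * bin_width, b * bin_width
            if x_end - x_start >= min_gap:
                gutters.append((x_start, x_end))
            run_start = None
    return gutters


# Two blocks of words separated by an empty strip: one gutter is reported.
two_columns = [{"x0": 50, "x1": 250}, {"x0": 60, "x1": 240}, {"x0": 320, "x1": 520}]
print(find_column_gutters(two_columns, page_width=600))
```

With pdfplumber, the input would be `page.extract_words()`, and each returned strip could supply the x-coordinates for `page.crop(...)` instead of the fixed 50% split. A full-width header or footer would cover the gutter, so words near the top and bottom of the page (filtered by their `top` value) probably need to be dropped first.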

Please let me know if there is anything else I should clarify.

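Answer from a uj5u.com user:

This answer lets you scrape the text in the intended order.

From the Towards Data Science article "PDF Text Extraction in Python":

Compared with PyPDF2, PDFMiner's scope is much more limited: it really focuses only on extracting text from the source information of a pdf file.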

from io import StringIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_string(file_path):
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    return output_string.getvalue()

file_path = ''  # !
text = convert_pdf_to_string(file_path)
print(text)

The text can be cleaned up afterwards.
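
What that cleanup looks like depends on the document. As a rough sketch only, assuming the extracted string uses blank lines between paragraphs and a form feed at the end of each page (which is what pdfminer's `TextConverter` normally writes), something like the following could be a starting point. `clean_extracted_text` is a hypothetical helper, not part of pdfminer.

```python
import re


def clean_extracted_text(text):
    """Normalise whitespace in extracted PDF text (illustrative, not exhaustive)."""
    # Treat page-break form feeds as ordinary line breaks.
    text = text.replace("\x0c", "\n")
    # Re-join words split by a hyphen at a line end, e.g. "Petro-\nchemical".
    # Caveat: this also merges genuine hyphenated compounds that wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines inside one become spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


sample = "Gulf Petro-\nchemical  Industries\nCompany.\n\nNext para\x0c"
print(clean_extracted_text(sample))
```

After this step, each paragraph is a single line, so a sentence search no longer has to stitch lines together.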


Tags: python if-statement text-extraction information-extraction pdfplumber
