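I wrote some code to convert multiple PDF files to .txt files. The code works well, but my main problem is that instead of a single extension I get a double one: "companyA.pdf" becomes "companyA.pdf.txt". I'm not sure where I made the mistake. Here is the code: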
'''
import os
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

pdf_folder_path = os.getcwd()  # Get the path of the current folder
# Path notation is for macOS; on Windows, change '/' to '\\'.
text_folder_path = os.getcwd() + '/' + 'text_folder'
os.makedirs(text_folder_path, exist_ok=True)
pdf_file_name = os.listdir(pdf_folder_path)

# Returns True if name is a PDF file (ends with .pdf), otherwise False.
def pdf_checker(name):
    pdf_regex = re.compile(r'.+\.pdf')
    if pdf_regex.search(str(name)):
        return True
    else:
        return False

# Convert a PDF to a text file
def convert_pdf_to_txt(path, txtname, buf=True):
    rsrcmgr = PDFResourceManager()
    if buf:
        outfp = StringIO()
    else:
        outfp = open(txtname, 'w')
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    device.close()
    if buf:
        text = outfp.getvalue()
        make_new_text_file = open(text_folder_path + '/' + path + '.txt', 'w')
        make_new_text_file.write(text)
        make_new_text_file.close()
    outfp.close()

# Get the PDF file names in the folder and convert each one
for name in pdf_file_name:
    if pdf_checker(name):
        # pdf_checker returned True (name ends with .pdf), so proceed to conversion
        convert_pdf_to_txt(name, name + '.txt')
    else:
        pass  # Skip if not a PDF file
'''
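Answer from a uj5u.com user:
I suggest using a regex substitution to replace the trailing .pdf with .txt when the name string ends with it, as follows: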
    if pdf_checker(name):
        newName = re.sub(r'\.pdf$', '.txt', name)
        convert_pdf_to_txt(name, newName)
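To see what that substitution does on its own, here is a small sketch. The helper name `to_txt_name` is hypothetical and not part of the original code; it only wraps the `re.sub` call shown above. Because of the `$` anchor, only a trailing `.pdf` is replaced, and a name that does not end in `.pdf` comes back unchanged.

```python
import re

def to_txt_name(name):
    # Hypothetical helper: swap a trailing ".pdf" for ".txt".
    # The "$" anchor limits the match to the end of the string.
    return re.sub(r'\.pdf$', '.txt', name)

print(to_txt_name('companyA.pdf'))       # companyA.txt
print(to_txt_name('my.pdf.report.pdf'))  # my.pdf.report.txt
print(to_txt_name('notes.docx'))         # notes.docx (no match, unchanged)
```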
Then replace this line:
    make_new_text_file = open(text_folder_path + '/' + path + '.txt', 'w')
with the following:
    make_new_text_file = open(text_folder_path + '/' + txtname, 'w')
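As an alternative to the regex, the standard library can split off the extension for you. The following is only a minimal sketch, assuming the output folder is called `text_folder` as in the question; `build_txt_path` is a hypothetical helper, not part of the original code. It relies on `os.path.splitext` to drop the old extension and `os.path.join` to build the path, which also avoids hard-coding `/`.

```python
import os

def build_txt_path(pdf_name, out_dir):
    # Hypothetical helper: map "companyA.pdf" to "<out_dir>/companyA.txt".
    # splitext returns (stem, extension), so the ".pdf" part is discarded
    # before ".txt" is appended, and no double extension can appear.
    stem, _ext = os.path.splitext(os.path.basename(pdf_name))
    return os.path.join(out_dir, stem + '.txt')

print(build_txt_path('companyA.pdf', 'text_folder'))
```

With a helper like this, the `open(...)` call inside `convert_pdf_to_txt` could take the returned path directly instead of concatenating strings.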
