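I am trying to extract the following information from every PDF file in a folder. The PDF files are CVs, and for a job project I need each one's email address, first name, and last name.
I have managed to extract the email addresses with the following code: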
from io import StringIO
import os
import re

from pdfminer3.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage

working_directory = './folder'
file_list = []   # paths of all PDF files found under working_directory
email_list = {}  # maps each PDF path to the first email address found in it

# Collect every PDF, keeping the subdirectory so nested files can be opened.
for subdir, dirs, files in os.walk(working_directory):
    for file in files:
        if file.endswith('.pdf'):
            file_list.append(os.path.join(subdir, file))

for input_file in file_list:
    pagenums = set()  # an empty set means every page is processed
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # Extract the text of the whole document into `output`.
    with open(input_file, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()

    # Keep the first substring that looks like an email address, if any.
    match = re.search(r'[\w\.-]+@[a-z0-9\.-]+', text)
    if match is not None:
        email_list[input_file] = match.group(0)
        print(email_list[input_file])

print(email_list)
However, I cannot extract the first and last names. Any help would be much appreciated!
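Reply from a uj5u.com user:
You can find the email address because there is a pattern behind it that a regular expression can describe:
match = re.search(r'[\w\.-]+@[a-z0-9\.-]+', text)
For the first and last name you have to work out a comparable rule that holds for your PDF files.
For example, the name might always follow a specific marker or field, such as "Dear".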
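To make the idea of "a rule for the name" concrete, here is a minimal sketch, not a definitive solution. It assumes the name appears as two capitalised words after a label such as "Name:" or "Dear", and otherwise falls back to an email address of the form first.last@domain. The function name guess_name, the marker list, and both assumptions are hypothetical; real CVs vary a lot, so expect misses and false positives (for instance, "Dear Hiring Manager" would be read as a name).

```python
import re

# Hypothetical labels that may precede a name; adjust them to your documents.
NAME_MARKERS = r"(?:Full name|Name|Dear)"

# A label, an optional colon, then two capitalised words (ASCII letters only).
LABELLED_NAME = re.compile(
    NAME_MARKERS + r"\s*:?\s+([A-Z][a-z]+)\s+([A-Z][a-z]+)"
)

# An email whose local part looks like "first.last", "first_last" or "first-last".
EMAIL_NAME = re.compile(r"([a-z]+)[._-]([a-z]+)@")


def guess_name(text, email=None):
    """Return (first_name, last_name), or None if no heuristic applies."""
    # Heuristic 1: a labelled field such as "Name: Jane Doe".
    match = LABELLED_NAME.search(text)
    if match:
        return match.group(1), match.group(2)

    # Heuristic 2: derive the name from the email address, if one was found.
    if email:
        match = EMAIL_NAME.match(email.lower())
        if match:
            return match.group(1).capitalize(), match.group(2).capitalize()

    return None
```

In the loop from the question, this could be called as guess_name(text, email_list.get(input_file)) right after the email has been extracted. For CVs with no fixed layout, a named-entity recognition library is likely to be more reliable than hand-written rules.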