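I have a PDF file of about 1,000 pages, and I want to delete every page on which a specific word cannot be found. For example, the code would search for a word such as "STACKOVER"; if the word is not found on a page, it deletes that page and continues to the next one, and finally saves the file.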
Answer from a uj5u.com user:
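Here is one way to do it. First, define the search term you are looking for (in my case, I tested it on a medical journal and searched for searchwords=['unclear risk for poorly']). Second, find all pages that contain that word or string and store their page numbers in a list (pages_to_delete). To be safe, I also write them to a csv file that records the pages on which each search term was found. Third, open the original pdf, delete the pages contained in the list, and save the result as a new pdf.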
import re
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

# Uses the legacy PyPDF2 names (PdfFileReader, getPage, ...), available before PyPDF2 3.0.
pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]        # extracted text, one entry per page
words_start_pos={}   # page index -> start offsets of each match
words={}             # page index -> matched substrings
searchwords=['unclear risk for poorly']
pages_to_delete = []

# Step 2: record every page that contains a search term (also logged to Pages.csv).
with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            if len(pages_text) <= page:  # extract each page only once
                pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value + len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page + 1, words[page][i]))  # 1-based page number
                    pages_to_delete.append(page)

# Step 3: copy every page that is not in pages_to_delete into a new pdf.
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()
for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)
with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)
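The code above removes the pages on which the search term is found, whereas the question asks for the opposite: drop the pages where the word is missing. A minimal sketch of that variant follows. It assumes the same legacy PyPDF2 API (before 3.0) and an exact, case-sensitive match; the function names pages_without_term and drop_pages_without_term are hypothetical helpers introduced only for this illustration.

```python
def pages_without_term(page_texts, term):
    """Return the 0-based indices of pages whose text does not contain term.

    The match is an exact, case-sensitive substring test.
    """
    return [i for i, text in enumerate(page_texts) if term not in (text or "")]


def drop_pages_without_term(src_path, dst_path, term):
    """Write a copy of src_path that keeps only the pages containing term."""
    # Imported here so pages_without_term stays usable without PyPDF2 installed.
    # Assumption: legacy PyPDF2 names, as in the answer above (PyPDF2 < 3.0).
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(src_path)
    texts = [reader.getPage(i).extractText() for i in range(reader.getNumPages())]
    unwanted = set(pages_without_term(texts, term))

    writer = PdfFileWriter()
    for i in range(reader.getNumPages()):
        if i not in unwanted:
            writer.addPage(reader.getPage(i))
    with open(dst_path, 'wb') as f:
        writer.write(f)
```

A call such as drop_pages_without_term('dddtest.pdf', 'Newdddtest.pdf', 'STACKOVER') would then keep only the pages that mention STACKOVER. Separating the page selection from the PDF handling makes the selection rule easy to check on plain strings before running it on a 1,000-page file.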
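Update
The search line lowercases each page's text before matching, so it only finds search terms that are themselves written in lowercase. If you want to match the text exactly as it appears on the page (for example an uppercase term such as 'STACKOVER'), replace
words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
with
words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page])]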
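The difference between the two lines comes down to how case is handled during the search. As a hedged sketch of an alternative, the standard re module can do the case handling itself through re.IGNORECASE, and re.escape keeps characters such as parentheses in the search term from being read as regex syntax. The helper find_term is a hypothetical name used only for this example and is not part of the answer's code.

```python
import re


def find_term(text, term, ignore_case=False):
    """Return (start_offset, matched_text) for every occurrence of term in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the term match literally, even if it contains ( ) . * etc.
    pattern = re.compile(re.escape(term), flags)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


sample = "Unclear Risk here, unclear risk there"
print(find_term(sample, "unclear risk", ignore_case=True))
print(find_term(sample, "unclear risk"))
```

With ignore_case=True the matched text is reported in its original capitalization, which lowercasing the page text beforehand would not preserve.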
Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/418398.html
