Most efficient way to search for a word in multiple files

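For my master's thesis I downloaded a large number of finance-related documents. My goal is to find a specific phrase ("chapter 11") in order to flag all the companies that have gone through a debt restructuring procedure. The problem is that I have more than 1.2 million small files, which makes the search very inefficient. With the very basic code I have written so far, I reach a speed of about 1,000 files every 40-50 seconds. I would like to know whether there are specific libraries or methods (or even programming languages) that can search faster. This is the function I am currently using: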

def get_items(m):
    word = "chapter 11"
    f = open(m, encoding='utf8')
    document = f.read()
    f.close()
    return (word in document.lower())
# apply the function to the list of names:
l_v1 = list(map(get_items,filenames))

File sizes vary between 5 and 4000 KB.

Answer:

Try the Unix tool grep.

If there are only a few files, you can do the following:

grep -i "chapter 11" file1 file2 ...

Or:

grep -i "chapter 11" file*.txt

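If there are many files, you can combine grep with find (the -print0 and -0 options keep filenames that contain spaces intact; adding -l to grep would print only the names of the matching files):

find . -type f -print0 | xargs -0 grep -i "chapter 11"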

Another powerful tool is ack (written in Perl); see https://beyondgrep.com/.
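
For readers who would rather stay in Python, the find-plus-grep pipeline above can be approximated roughly as follows. This is only a sketch: the function name files_containing is hypothetical, and reading the files as raw bytes (so that files that are not valid UTF-8 do not raise errors) is an assumption of this example, not something the original answer proposes.

```python
from pathlib import Path

def files_containing(root, phrase="chapter 11"):
    """Return paths of files under root whose bytes contain phrase, ignoring ASCII case."""
    needle = phrase.lower().encode("utf-8")
    matches = []
    for path in Path(root).rglob("*"):   # recursive walk, roughly like `find root`
        if not path.is_file():           # skip directories, roughly like `-type f`
            continue
        try:
            data = path.read_bytes()
        except OSError:
            continue                     # unreadable file: skip it
        if needle in data.lower():       # bytes.lower() only changes ASCII letters
            matches.append(path)
    return matches
```

Calling files_containing(".") would return a list of Path objects for the matching files under the current directory. It is unlikely to beat grep on raw speed; it is mainly useful when the result has to feed directly into further Python code.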

Answer:

Well, you could use threads: split the list of filenames into two or more smaller lists and search them concurrently.

Threading explained

threading library documentation

Here is an example:

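import threading

def get_items(m):
    word = "chapter 11"
    with open(m, encoding='utf8') as f:
        document = f.read()
    return word in document.lower()

def search_chunk(names, out):
    # search one slice of the filename list; store one boolean per file
    out.extend(map(get_items, names))

# filenames is the list of file paths from the question
half = len(filenames) // 2
first, second = [], []
x = threading.Thread(target=search_chunk, args=(filenames[:half], first))
y = threading.Thread(target=search_chunk, args=(filenames[half:], second))

x.start()
y.start()
# wait for both threads to finish before combining the results
x.join()
y.join()

l_v1 = first + second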

Answer:

Here is a slightly different approach, in which we use multithreading to build a list of the filenames that contain the string 'chapter 11':

from concurrent.futures import ThreadPoolExecutor

filenames = [] # list of filenames
results = [] # list of filenames containing 'chapter 11'
word = 'chapter 11' # lowercase

def process(filename):
    try:
        with open(filename, encoding='utf-8') as infile:
            if word in infile.read().lower():
                results.append(filename)
    except Exception:
        pass

with ThreadPoolExecutor() as executor:
    executor.map(process, filenames)

print(results)
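
A possible variation on the code above, offered as a sketch rather than a correction: instead of appending to a shared list from the worker threads, each call can return a boolean, and the main thread can collect the results from executor.map, which yields them in input order. The names WORD, contains_word and find_matching are hypothetical and do not come from the original answer.

```python
from concurrent.futures import ThreadPoolExecutor

WORD = "chapter 11"  # lowercase, as in the answer above

def contains_word(filename):
    """Return True if the file reads as UTF-8 and contains WORD (case-insensitive)."""
    try:
        with open(filename, encoding="utf-8") as infile:
            return WORD in infile.read().lower()
    except (OSError, UnicodeDecodeError):
        return False  # unreadable or non-UTF-8 file: treat it as "no match"

def find_matching(filenames):
    """Return the filenames whose contents contain WORD, in input order."""
    filenames = list(filenames)
    with ThreadPoolExecutor() as executor:
        hits = executor.map(contains_word, filenames)
        return [name for name, hit in zip(filenames, hits) if hit]
```

With this shape the result list is built in one place, and catching only OSError and UnicodeDecodeError means that other, unexpected errors are not silently hidden.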

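Edit:

The OP has said that all the files to be processed are in a single directory/folder. In that case, rather than building a list of filenames, you could do this: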

from concurrent.futures import ThreadPoolExecutor
from os.path import join
from os import listdir
import re

results = [] # list of filenames containing 'chapter 11'
cp = re.compile('chapter 11', re.IGNORECASE)
DIR = '' # directory containing files to be processed

def process(filename):
    try:
        with open(join(DIR, filename), encoding='utf-8') as infile:
            if cp.search(infile.read()):
                results.append(filename)
    except Exception:
        pass

with ThreadPoolExecutor() as executor:
    executor.map(process, listdir(DIR))

print(results)

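This change also incorporates the idea of using a regular expression for the search pattern, which may or may not be more efficient than using in.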
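
Whether the regular expression is actually faster is best settled by measurement. The following sketch times both approaches on a synthetic string using timeit. The sample text, the helper names with_in and with_regex, and the repetition count are made up for illustration; results will vary with the machine and the file contents, so no timings are quoted here.

```python
import re
import timeit

cp = re.compile("chapter 11", re.IGNORECASE)

def with_in(text):
    # lowercase the whole text, then do a plain substring test
    return "chapter 11" in text.lower()

def with_regex(text):
    # case-insensitive regex search on the original text
    return cp.search(text) is not None

# Synthetic document of roughly 1 MB with the phrase at the very end
# (an assumption, chosen so that both approaches must scan the whole text)
sample = "lorem ipsum dolor sit amet " * 40_000 + "Chapter 11"

if __name__ == "__main__":
    for func in (with_in, with_regex):
        seconds = timeit.timeit(lambda: func(sample), number=20)
        print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```

Running the same comparison on a sample of the real files would give a more trustworthy answer than the synthetic string used here.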
