Faster text search in Python with a complex if statement

I have a large number of long text documents that contain punctuation. Here are three short examples:

doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]

I have several sets of words, such as:

wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])

I want to find all the text documents that satisfy the following condition:

(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)

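The or not clause in each parenthesized group is needed because any of the three lists may be empty. Note that before applying the condition I also need to strip the punctuation from the text, convert it to lowercase, and split it into a set of words, which takes extra time.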

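This procedure matches the first text in doc but not the second or the third. The second does not match because it contains the word "dogs", and the third does not match because it does not contain the word "seaside".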
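For reference, the preprocessing and the condition described above can be put together roughly as follows. This is only a sketch: the helper names preprocess and original_match are assumptions (the question does not show its cleaning code), and punctuation is removed here with string.punctuation, which may differ from the asker's actual cleaning step.

```python
import string

docs = [
    "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?",
    "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?",
    "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?",
]

wAND = {"house", "near"}
wOR = {"seaside"}
wNOT = {"dogs"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    # Hypothetical cleaning step: drop punctuation, lowercase, split into a set of words.
    return set(text.translate(_PUNCT_TABLE).lower().split())


def original_match(text):
    words = preprocess(text)
    return (
        (any(w in words for w in wOR) or not wOR)
        and (all(w in words for w in wAND) or not wAND)
        and (not any(w in words for w in wNOT) or not wNOT)
    )


results = [original_match(d) for d in docs]
print(results)  # [True, False, False]
```

Only the first document passes all three checks, which agrees with the expected outcome described in the question.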

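I would like to know whether this general problem (the words in the wOR, wAND and wNOT lists change) can be solved in a faster way that avoids preprocessing the text to clean it. Perhaps with a fast regular-expression solution, possibly using a Trie(). Is that possible? Or are there any other suggestions?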

Answer from a uj5u.com user:

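Your solution appears to be linear in the length of the document. Without sorting you will not be able to do better than that, because the words you are looking for may be anywhere in the document. You could try a single loop over the whole document: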

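def single_pass_match(words, wAND, wOR, wNOT):
    # words: the cleaned, lowercased words of one document
    missing = set(wAND)  # work on a copy so the caller's set is not mutated
    or_satisfied = not wOR  # an empty OR set counts as satisfied
    for word in words:
        if word in wNOT:
            return False
        missing.discard(word)
        if word in wOR:
            or_satisfied = True
    return or_satisfied and not missing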
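The single-loop idea can also be combined with on-the-fly tokenization, so that no cleaned copy of the document has to be built first. The following is a sketch under assumptions: the function name stream_match and the use of re.finditer with \w+ as the tokenizer are illustrative choices, not part of the answer above.

```python
import re


def stream_match(text, w_and, w_or, w_not):
    # Walk the words of `text` once, lowercasing each word as it is seen.
    missing = set(w_and)        # AND words not seen yet (copy of the input set)
    or_satisfied = not w_or     # an empty OR set is trivially satisfied
    for m in re.finditer(r"\w+", text):
        word = m.group(0).lower()
        if word in w_not:
            return False        # a forbidden word ends the scan immediately
        missing.discard(word)
        if word in w_or:
            or_satisfied = True
    return or_satisfied and not missing


docs = [
    "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?",
    "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?",
    "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?",
]
results = [stream_match(d, {"house", "near"}, {"seaside"}, {"dogs"}) for d in docs]
print(results)  # [True, False, False]
```

The loop can stop early only on a NOT word. A positive result still requires reading to the end of the document, because a forbidden word may appear after all the required words have been seen.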

Answer from a uj5u.com user:

You can build regular expressions for the bags of words you have, and use them:

import re


def make_re(word_set):
    # One alternation of the escaped words, matched as whole words, ignoring case.
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )


wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)

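def re_match(doc):
    # Empty word sets impose no constraint, so their checks are skipped.
    if wOR and not wOR_re.search(doc):
        return False
    if wNOT and wNOT_re.search(doc):
        return False
    if not wAND:
        return True
    # Collect the distinct AND words seen so far (lowercased, since re.I ignores case).
    found = set()
    expected = len(wAND)
    for match in wAND_re.finditer(doc):
        found.add(match.group(0).lower())
        if len(found) == expected:
            break
    return len(found) == expected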
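One caveat with patterns built this way, given that the question allows any of the three sets to be empty: an empty set compiles to a pattern that matches the empty string at the first word boundary, so a search against it succeeds on almost any text. The sketch below illustrates this; set_found is a hypothetical helper name used only for this example.

```python
import re


def make_re(word_set):
    # Same construction as in the answer above.
    return re.compile(
        r"\b(?:{})\b".format("|".join(re.escape(w) for w in word_set)),
        flags=re.I,
    )


empty_re = make_re(set())
hit = empty_re.search("I love holidays")
print(empty_re.pattern)   # \b(?:)\b
print(hit is not None)    # True: an empty match at the first word boundary


def set_found(word_set, text):
    # Hypothetical guard: an empty set can never be "found" in a text.
    if not word_set:
        return False
    return make_re(word_set).search(text) is not None


print(set_found(set(), "I love holidays"))       # False
print(set_found({"dogs"}, "do you love DOGS?"))  # True
```

For that reason each search needs an explicit emptiness check on its word set; otherwise an empty wNOT would reject every document that contains at least one word.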

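A quick timing test suggests this is about 89% faster than the original (and it passes the original "test suite"), most likely because

the document does not need to be cleaned (the \b anchors restrict matches to whole words, and re.I handles case normalization)
regular expressions run in native code, which tends to be faster than Python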

name='original'      iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match'      iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match'     iters=10000 time=0.203 iters_per_sec=49363.58

where bag_match is my original comment suggestion of using set intersections:

def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
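The clean_doc helper that bag_match calls is not shown in the answer. A minimal stand-in, assuming it simply lowercases the text and keeps runs of word characters, might look like the sketch below; the body of clean_doc and the bool() wrapper around the result are assumptions made for illustration.

```python
import re

wAND = {"house", "near"}
wOR = {"seaside"}
wNOT = {"dogs"}


def clean_doc(text):
    # Assumed behaviour: lowercase the text and return its words, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def bag_match(text):
    bag = set(clean_doc(text))
    # bool() turns the set-valued expression into a plain True/False.
    return bool(
        (bag.intersection(wOR) or not wOR)
        and (bag.issuperset(wAND) or not wAND)
        and (not bag.intersection(wNOT) or not wNOT)
    )


docs = [
    "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?",
    "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?",
    "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?",
]
results = [bag_match(d) for d in docs]
print(results)  # [True, False, False]
```

The set operations replace the per-word generator expressions of the original condition, but the cost of building the bag from the raw text remains.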

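If you have already cleaned your documents into an iterable of words (here I simply slapped @lru_cache on clean_doc, which you probably would not do in real life, since your documents are likely all unique and caching would not help), then bag_match is a lot faster: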

name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60
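The harness behind these timings is not included in the answer. One way figures of this shape could be produced is with timeit.Timer.autorange() from the standard library, which chooses the iteration count itself. The sketch below is an assumption about the approach, not the author's script; bench and toy_match are hypothetical names.

```python
import timeit


def bench(name, func, arg):
    # autorange() raises the loop count (1, 2, 5, 10, ...) until the run takes at least 0.2 s,
    # then returns the loop count and the total elapsed time.
    iters, elapsed = timeit.Timer(lambda: func(arg)).autorange()
    print(f"name={name!r} iters={iters} time={elapsed:.3f} iters_per_sec={iters / elapsed:.2f}")
    return iters, elapsed


def toy_match(text):
    # Stand-in for whichever matcher is being timed.
    return "seaside" in text.lower()


iters, elapsed = bench("toy_match", toy_match, "My house is NEAR the #seaside.")
```

Absolute numbers depend on the machine and on document length, so only the relative ordering of the matchers on the same input is meaningful.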

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/477275.html
