Trie uses too much memory

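I'm trying to get all the words that can be made from the letters "crbtfopkgevyqdzsh" out of a file called web2.txt. The cell posted below follows an earlier block of code that incorrectly returned every prefix on the way to a complete word; for the word "shocked", for example, it would return s, sh, sho, shoc, shock, shocke, shocked.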

So I tried a trie (yes, pun intended).

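web2.txt is 2.5 MB in size and contains 2,493,838 words of varying length. The trie in the cell below crashes my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+, to try to accommodate the code block, but it is still too much. Are there any more efficient ideas, other than a trie, for getting the same result?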

# Find the words3 word list here:  svnweb.freebsd.org/base/head/share/dict/web2?view=co

trie = {}

with open('/content/web2.txt') as words3:


    for word in words3:
        cur = trie
        for l in word:
            cur  = cur.setdefault(l, {})
            cur['word'] = True # defined if this node indicates a complete word
        
def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur + letter, words3)
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur + letter, word_list)

    return [word for word in words3 if word in words3]

words3 = findWords("crbtfopkgevyqdzsh")

I'm using Python 3.

A uj5u.com user replied:

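A trie is overkill. There are only about 200,000 words, so you can simply iterate over all of them and check whether each word can be formed from the letters in the base string.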

This is a good use case for collections.Counter, which gives us a clean way to get the letter frequencies (i.e. "counters") of an arbitrary string:

from collections import Counter

base_counter = Counter("crbtfopkgevyqdzsh")
with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # On Python 3.10 or later this can be written as: line_counter <= base_counter
        if line_counter & base_counter == line_counter:
            print(line)
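
As a rough sketch of the same idea, the check can be wrapped in a small function so it can be tried without the word file. The names `can_spell` and `sample_words` are illustrative and not part of the answer above, and the sketch assumes each letter of the base string may be used at most once. Unlike the loop above, it also rejects empty lines.

```python
from collections import Counter

BASE = "crbtfopkgevyqdzsh"
base_counter = Counter(BASE)


def can_spell(word, available=base_counter):
    """Return True if `word` can be built from the letters in `available`,
    using each letter no more often than it occurs there."""
    word_counter = Counter(word.lower())
    # Counter & Counter keeps the minimum count per letter, so the
    # intersection equals word_counter only when no letter is missing
    # or over-used. Empty input is rejected explicitly (an assumption).
    return bool(word_counter) and (word_counter & available) == word_counter


# Hypothetical sample standing in for the lines of the word file.
sample_words = ["shock", "shocked", "depth", "fork", "apple", "kook"]
matches = [w for w in sample_words if can_spell(w)]
print(matches)
```

Each word costs one pass over its letters, so the whole file is processed in a single scan with only one small Counter held in memory at a time, rather than a nested dictionary per character as in the trie.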
