Map each word in a string to its start and end indices in a dictionary

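I am trying to find the index range of each word in a string (start index and end index, with spaces omitted and indices starting at 1 for human readability). I thought the best approach was a list of lists, where each nested list contains the word and a list of its start and end indices. From the example string I get the following list: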

text = "i have a list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

yields:

boundaries_list=[['i', [1, 1]], ['have', [3, 6]], ['a', [4, 4]], ['list', [10, 13]], ['of', [15, 16]], ['lists', [18, 22]], ['that', [24, 27]], ['contain', [29, 35]], ['a', [4, 4]], ['word', [39, 42]], ['and', [44, 46]], ['there', [48, 52]], ['indices', [54, 60]], ['my', [62, 63]], ['method', [65, 70]], ['works', [72, 76]], ['except', [78, 83]], ['with', [85, 88]], ['repeated', [90, 97]], ['words', [99, 103]], ['like', [105, 108]], ['of', [15, 16]], ['or', [40, 41]], ['a', [4, 4]], ['or', [40, 41]], ['the', [48, 50]], ['or', [40, 41]], ['it', [86, 87]]]

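This works, but it is not very readable. Compiling it into a dictionary would be much nicer. A dictionary works until the same key occurs more than once: a repeated word appears in the dictionary only once, so the index ranges of its other occurrences are lost.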
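
A minimal sketch of the duplicate-key problem described above, using hypothetical sample pairs rather than the full list: when the same key is assigned more than once, a plain dict keeps a single entry, and the later value replaces the earlier one.

```python
# Hypothetical sample: the word "a" occurs twice with different ranges.
pairs = [["a", [8, 8]], ["list", [10, 13]], ["a", [37, 37]]]

# dict() keeps one entry per key; a later pair overwrites an earlier one.
as_dict = dict(pairs)
print(as_dict)  # {'a': [37, 37], 'list': [10, 13]}
```

Either way, only one range per word survives, which is why a mapping from word to a list of ranges is needed.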

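To get around this, I tried using defaultdict on a list of dictionaries, but that only gives me the index range of the first occurrence, repeated once for each occurrence of the word.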

For example:

from collections import defaultdict

# Wrap each [word, [start, end]] pair in its own single-key dict.
new_list = []
for one_d in boundaries_list:
    new_list.append({one_d[0]: one_d[1]})

# Group the ranges by word.
res = defaultdict(list)
for d in new_list:
    for k, v in d.items():
        res[k].append(v)

print(res)
>>> defaultdict(<class 'list'>, {'i': [[1, 1]], 'have': [[3, 6]], 'a': [[4, 4], [4, 4], [4, 4]], 'list': [[10, 13]], 'of': [[15, 16], [15, 16]], 'lists': [[18, 22]], 'that': [[24, 27]], 'contain': [[29, 35]], 'word': [[39, 42]], 'and': [[44, 46]], 'there': [[48, 52]], 'indices': [[54, 60]], 'my': [[62, 63]], 'method': [[65, 70]], 'works': [[72, 76]], 'except': [[78, 83]], 'with': [[85, 88]], 'repeated': [[90, 97]], 'words': [[99, 103]], 'like': [[105, 108]], 'or': [[40, 41], [40, 41], [40, 41]], 'the': [[48, 50]], 'it': [[86, 87]]})
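
The repeated ranges in this output are already present in boundaries_list, so grouping with defaultdict cannot repair them. The question does not show how boundaries_list was built. The sketch below assumes, purely as a guess, that it used str.find, which returns the first match of a substring even when that match sits inside a longer word. Under that assumption the wrong values are reproduced exactly; first_match_range is a hypothetical helper name.

```python
text = ("i have a list of lists that contain a word and there indices "
        "my method works except with repeated words like of or a or the or it")

# ASSUMPTION: the original list was built with str.find (not shown in the question).
def first_match_range(text, word):
    """Return the 1-based inclusive range of the first substring match."""
    start = text.find(word)  # 0-based index of the first match
    return [start + 1, start + len(word)]

print(first_match_range(text, "a"))   # [4, 4]   -> the "a" inside "have"
print(first_match_range(text, "or"))  # [40, 41] -> the "or" inside "word"
print(first_match_range(text, "it"))  # [86, 87] -> the "it" inside "with"
```

If that is what happened, the fix belongs in the step that computes the ranges, not in the step that groups them.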

Any help is greatly appreciated.

uj5u.com user reply:

You can use re, with the start and end methods of the match object:

import re
from collections import defaultdict

text = "i have a list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

output = defaultdict(list)
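for m in re.finditer(r"\S+", text):
    # start() is 0-based, so add 1; end() is exclusive, so it already equals the 1-based inclusive end.
    output[m.group(0)].append((m.start(0) + 1, m.end(0)))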

print(output)
# defaultdict(<class 'list'>, {'i': [(1, 1)], 'have': [(3, 6)], 'a': [(8, 8), (37, 37), (116, 116)], 'list': [(10, 13)], 'of': [(15, 16), (110, 111)], 'lists': [(18, 22)], 'that': [(24, 27)], 'contain': [(29, 35)], 'word': [(39, 42)], 'and': [(44, 46)], 'there': [(48, 52)], 'indices': [(54, 60)], 'my': [(62, 63)], 'method': [(65, 70)], 'works': [(72, 76)], 'except': [(78, 83)], 'with': [(85, 88)], 'repeated': [(90, 97)], 'words': [(99, 103)], 'like': [(105, 108)], 'or': [(113, 114), (118, 119), (125, 126)], 'the': [(121, 123)], 'it': [(128, 129)]})
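
One way to package the answer above as a reusable helper; the name word_ranges and the short sample string are hypothetical. The sketch keeps the answer's convention of 1-based, inclusive ranges, so text[start - 1:end] recovers the word.

```python
import re
from collections import defaultdict

def word_ranges(text):
    """Map each word to a list of 1-based, inclusive (start, end) ranges."""
    ranges = defaultdict(list)
    for m in re.finditer(r"\S+", text):
        # m.start() is 0-based; m.end() is exclusive, so it already
        # equals the 1-based inclusive end.
        ranges[m.group()].append((m.start() + 1, m.end()))
    return dict(ranges)

sample = "a list of a few words"
ranges = word_ranges(sample)
print(ranges["a"])  # [(1, 1), (11, 11)]
```

Because \S+ matches runs of non-whitespace, consecutive spaces, tabs, and newlines are skipped without producing empty entries.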

uj5u.com user reply:

I added a double space just for testing.

text = "i have a  list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

from collections import defaultdict
new_dict = defaultdict(list)
offset = 0
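for word in text.split(" "):
    # Record the 0-based slice [offset, offset + len(word)) for this word.
    new_dict[word].append([offset, offset + len(word)])
    offset += len(word) + 1  # skip past the word and the following space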

new_dict

Output:

defaultdict(list,
            {'i': [[0, 1]],
             'have': [[2, 6]],
             'a': [[7, 8], [37, 38], [116, 117]],
             '': [[9, 9]],
             'list': [[10, 14]],
             'of': [[15, 17], [110, 112]],
             'lists': [[18, 23]],
             'that': [[24, 28]],
             'contain': [[29, 36]],
             'word': [[39, 43]],
             'and': [[44, 47]],
             'there': [[48, 53]],
             'indices': [[54, 61]],
             'my': [[62, 64]],
             'method': [[65, 71]],
             'works': [[72, 77]],
             'except': [[78, 84]],
             'with': [[85, 89]],
             'repeated': [[90, 98]],
             'words': [[99, 104]],
             'like': [[105, 109]],
             'or': [[113, 115], [118, 120], [125, 127]],
             'the': [[121, 124]],
             'it': [[128, 130]]})

The dict indices give exactly the start and end of the string slice. For example, text[128:130] equals "it".
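
A sketch of a variant of the answer above that skips the empty strings produced by consecutive spaces, so no '' key ends up in the result. The function name slice_ranges and the short sample string are hypothetical. Like the answer, it only treats the space character as a separator, not tabs or newlines.

```python
from collections import defaultdict

def slice_ranges(text):
    """Map each word to 0-based [start, end) pairs usable as text[start:end]."""
    ranges = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:  # consecutive spaces yield '' tokens; skip them
            ranges[word].append([offset, offset + len(word)])
        offset += len(word) + 1  # always advance past the token and one space
    return dict(ranges)

sample = "a  list of a"
ranges = slice_ranges(sample)
print(ranges)  # {'a': [[0, 1], [11, 12]], 'list': [[3, 7]], 'of': [[8, 10]]}
```

Each stored pair can be checked by slicing: sample[3:7] gives "list".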
