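So, what I want to do is convert some of the words in a string to their respective entries in a dictionary and leave everything else as it is. For example, given the input:
standarisationn("well-2-34 2 @$#beach bend com")
I want the output to be:
"well-2-34 2 @$#bch bnd com"
The code I am using is: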
import re

def standarisationn(addr):
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw", "northwest":"nw"}
    # Tokenize into alphanumeric runs and single non-space characters
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    # Joining every token with a space is what introduces the extra spaces
    res = ' '.join(res)
    return str(res)
But it gives the wrong output:
'well - 2 - 34 2 @ $ % 23beach bnd com'
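That is, there are too many spaces, and "beach" is not even converted to "bch". So that is the problem. My idea is to first split the string on spaces, then split the resulting elements on special characters and digits, apply the dictionary, then rejoin the pieces that were split on special characters without spaces, and finally join the whole list with spaces. Can anyone suggest how to fix this, or a better approach?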
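Answer from a uj5u.com user:
You can build a regular expression from the keys of your dictionary, making sure that a key is not contained in another word (i.e. it is not directly preceded or followed by a letter):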
import re

def standarisationn(addr):
    # Replace commas and collapse runs of whitespace into a single space
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw", "northwest":"nw"}
    # Replace each key only where it stands alone, not inside a longer word
    for wrd in lookp_dict:
        addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2 @$#beach bend com"))
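The expression consists of three parts:
- (?:^|(?<=[^a-zA-Z])) matches either the start of the string (^) or, through a lookbehind (which consumes no characters), a position where the preceding character is not a letter
- {wrd} is the key from your dictionary
- (?=[^a-zA-Z]|$) is a lookahead (which also consumes no characters) that checks that the key is followed either by a non-letter character or by the end of the string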
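To see what the lookarounds do in isolation, here is a minimal sketch. The helper name replace_word and the sample strings are illustrative assumptions and are not part of the original answer; re.escape is added as a precaution in case a key ever contains regex metacharacters.

```python
import re

def replace_word(text, word, repl):
    # Replace `word` only where it is not directly preceded or followed by a letter.
    pattern = rf'(?:^|(?<=[^a-zA-Z])){re.escape(word)}(?=[^a-zA-Z]|$)'
    return re.sub(pattern, repl, text)

# '#' is not a letter, so "beach" counts as a standalone word here.
print(replace_word("@$#beach bend", "beach", "bch"))

# "beach" is followed by the letter 'f', so the lookahead rejects it.
print(replace_word("beachfront", "beach", "bch"))
```

Note that digits count as non-letters in this pattern, so a key directly preceded by a digit (as in "9bend") would still be replaced.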
Output:
well-2-34 2 @$#bch bnd com
Edit: if you replace the loop with the following, you can compile the whole expression and call re.sub only once:
repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=([^a-zA-Z]|$))")
addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)
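This should be much faster if your dictionary grows, because a single expression is built from all of your dictionary keys:
- ({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...)
- the lambda function passed to re.sub replaces each matched element with the corresponding value in lookp_dict (see the documentation of re.sub for more details)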
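Put together, the single-pass variant might look like the sketch below. The names LOOKUP, PATTERN and standardise are hypothetical, the dictionary is a shortened copy for illustration only, and the use of re.escape and longest-first key ordering are added precautions rather than part of the original answer.

```python
import re

# Shortened copy of the lookup table, for illustration only.
LOOKUP = {"beach": "bch", "bend": "bnd", "av": "ave", "avenue": "ave", "southwest": "sw"}

# Longest keys first, and re.escape in case a key contains regex metacharacters.
# With the lookahead in place the ordering is not strictly required; it is a precaution.
_keys = sorted(LOOKUP, key=len, reverse=True)
PATTERN = re.compile(
    rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(map(re.escape, _keys))})(?=[^a-zA-Z]|$)"
)

def standardise(addr):
    # Replace commas and collapse runs of whitespace into a single space.
    addr = re.sub(r"(,|\s+)", " ", addr)
    # One pass over the string: every matched key is looked up in the dictionary.
    return PATTERN.sub(lambda m: LOOKUP[m.group(1)], addr)

print(standardise("well-2-34 2 @$#beach bend com"))  # well-2-34 2 @$#bch bnd com
```

Compiling the pattern once at module level means it is not rebuilt on every call, which matters mostly when the function is applied to many addresses.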
