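Problem
I have a list of roughly 1,000 honorifics; see the sample below.
Given an input name string such as "her majesty queen elizabeth windsor", the function should return "elizabeth windsor". If the name does not start with an honorific (a simplification of the problem), the function should just return the name itself (for example, elizabeth windsor -> elizabeth windsor).
I have very strict latency constraints, so this code needs to be optimized as much as possible.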
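Working solution
Here is my working solution. It has a few extra constraints to reduce false positives (for example, lance is both an honorific and a first name); see the unit tests: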
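from typing import List

def strip_honorific(source: str, honorifics: List[str]) -> str:
    source_tokens = source.split()
    # Only consider names with more than two tokens, to limit false positives.
    if len(source_tokens) > 2:
        for honorific in honorifics:
            if source.startswith(f"{honorific} "):
                # Skip the honorific and the space that follows it.
                stripped_source = source[len(honorific) + 1 :]
                # Accept the result only if at least two words remain.
                if len(stripped_source.split()) > 1:
                    return stripped_source
    return source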
Unit tests
def test_honorifics():
    assert strip_honorific(source="her majesty queen elizabeth windsor", honorifics=honorifics) == "elizabeth windsor"
    assert strip_honorific(source="elizabeth windsor", honorifics=honorifics) == "elizabeth windsor"
    assert strip_honorific(source="mrs elizabeth windsor", honorifics=honorifics) == "elizabeth windsor"
    assert strip_honorific(source="mrselizabeth windsor", honorifics=honorifics) == "mrselizabeth windsor"
    assert strip_honorific(source="her majesty queen", honorifics=honorifics) == "her majesty queen"
    assert strip_honorific(source="her majesty queen elizabeth", honorifics=honorifics) == "her majesty queen elizabeth"
    assert strip_honorific(source="kapitan fred", honorifics=honorifics) == "kapitan fred"

test_honorifics()
Benchmark
For a basic benchmark, I used the honorifics list given below (minus the ellipsis).
import time

source_lst = [
    "her majesty queen elizabeth windsor",
    "mr fred wilson",
    "the rt hon nolan borak",
    "his most eminent highness simon smithson",
    "kapteinis jurijs jakovļevs",
    "miss nancy garland",
    "missnancy garland",
]

times = []
for _ in range(1000):
    for source in source_lst:
        # Time each call individually and collect the durations.
        t0 = time.time()
        strip_honorific(source=source, honorifics=honorifics)
        times.append(time.time() - t0)

print(f"Mean time: {sum(times) / len(times)}s")  # Mean time: 5.11584963117327e-06s
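As a side note, `time.time()` can have coarse resolution on some platforms, so timing single calls in the microsecond range may be noisy. The sketch below shows one way to take the same kind of measurement with the standard-library `timeit` module. The three-entry `sample_honorifics` list and the `mean_call_time` helper are illustrative assumptions, not part of the benchmark above.

```python
import timeit
from typing import List

# Small illustrative list (assumption); the real list has about 1000 entries.
sample_honorifics = ["mr", "mrs", "her majesty queen"]

def strip_honorific(source: str, honorifics: List[str]) -> str:
    # Same logic as the working solution above.
    source_tokens = source.split()
    if len(source_tokens) > 2:
        for honorific in honorifics:
            if source.startswith(f"{honorific} "):
                stripped_source = source[len(honorific) + 1:]
                if len(stripped_source.split()) > 1:
                    return stripped_source
    return source

def mean_call_time(source: str, number: int = 10_000) -> float:
    # timeit.timeit runs the callable `number` times and returns the
    # total elapsed time in seconds, so divide to get a per-call mean.
    total = timeit.timeit(
        lambda: strip_honorific(source=source, honorifics=sample_honorifics),
        number=number,
    )
    return total / number

print(f"Mean time: {mean_call_time('her majesty queen elizabeth windsor')}s")
```

Timing many calls in one batch amortizes the timer overhead, which should make comparisons between candidate implementations more stable.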
Honorifics list
honorifics = [
    "mr",
    "mrs",
    "the hon",
    "the hon dr",
    "the hon lady",
    "the hon lord",
    "the hon mrs",
    "the hon sir",
    "the honourable",
    "the rt hon",
    "her majesty queen",
    "his majesty king",
    "vina",
    "flottiljamiral",
    "superintendent",
    "rabbi",
    "diraja",
    "domnul",
    "kindralleitnant",
    "countess",
    "pan",
    "khatib",
    "zur",
    "vice",
    "don",
    "flotiles",
    "dipl",
    "his most eminent highness",
    # ...
    "the reverend",
    "archbishop",
    "sheik",
    "shaikh",
    "the rt hon lord",
    "la tres honorable",
    "ekselence",
    "kapteinis",
    "kapitan",
    "excellenza",
    "mr",
    "mrs",
    "miss",
]
Answer:
First, I have a question about how the following input should be handled:
"the hon lady dana"
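When the honorific "the hon lady" is tried, it is rejected, because the remainder "dana" is only one word. When the shorter honorific "the hon" is tried, it matches, and "lady dana" would be the stripped result. But it would be odd to keep "lady", since it is clearly part of the longer honorific.
I would think that once the longer honorific has matched, no other attempts should be made, and nothing should be removed from the input string. I have taken that approach in my attempts below.
I will offer two alternatives:
- Using a regular expression
- Using a word-based trie
With your benchmark there is not much difference between them, but with real data the ratio of matches to non-matches may have some effect on the total running time.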
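To make the ambiguity concrete, here is a small sketch that runs the question's list-based function on that input. The two-entry list is an assumption chosen for illustration; with it, the shorter honorific wins and "lady" is left in the result.

```python
from typing import List

def strip_honorific(source: str, honorifics: List[str]) -> str:
    # The question's original approach: try each honorific in list order.
    source_tokens = source.split()
    if len(source_tokens) > 2:
        for honorific in honorifics:
            if source.startswith(f"{honorific} "):
                stripped_source = source[len(honorific) + 1:]
                if len(stripped_source.split()) > 1:
                    return stripped_source
    return source

# Illustrative two-entry list (assumption), longer honorific first.
sample_honorifics = ["the hon lady", "the hon"]

# "the hon lady" leaves only "dana" (one word), so it is rejected and the
# loop falls through to "the hon", which leaves "lady dana".
print(strip_honorific("the hon lady dana", sample_honorifics))  # lady dana
```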
Regex solution
Preprocessing:
import re
honorifics_re = re.compile(fr"^(?:{'|'.join(sorted(honorifics, key=len, reverse=True))}) (\S+( \S)?.*)")
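The actual function:
def strip_honorific(source: str, honorifics) -> str:
    # `honorifics` is the compiled pattern. Group 1 is everything after the
    # honorific; group 2 is set only when at least two words follow it.
    m = honorifics.match(source)
    return m[1] if m and m[2] else source
Call it with honorifics=honorifics_re.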
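Putting the two pieces together, a self-contained sketch might look as follows. The short list is an assumption for illustration, and `re.escape` is an addition that is not in the code above: it only matters if an honorific ever contains a regex metacharacter such as `.`.

```python
import re

# Illustrative list (assumption); the real list has about 1000 entries.
sample_honorifics = ["mr", "mrs", "the hon", "the hon lady", "her majesty queen"]

# Longest honorifics first, so the alternation prefers the longest match.
alternation = "|".join(
    re.escape(h) for h in sorted(sample_honorifics, key=len, reverse=True)
)
honorifics_re = re.compile(fr"^(?:{alternation}) (\S+( \S)?.*)")

def strip_honorific(source, honorifics):
    m = honorifics.match(source)
    # m[2] is None when only one word follows the honorific.
    return m[1] if m and m[2] else source

print(strip_honorific("her majesty queen elizabeth windsor", honorifics_re))
# elizabeth windsor
print(strip_honorific("the hon lady dana", honorifics_re))
# the hon lady dana (unchanged: only one word follows "the hon lady")
```

Because the match succeeds on the longest alternative, the regex engine never falls back to "the hon", which gives the "no further attempts" behavior described earlier.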
Trie solution
Preprocessing:
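def make_trie(honorifics):
    root = {}
    for honorific in honorifics:
        node = root
        # Descend one level per word, creating nodes as needed.
        for word in honorific.split():
            if word not in node:
                node[word] = {}
            node = node[word]
        # "$$" marks the end of an honorific and stores the offset at which
        # the rest of the name starts (honorific plus one space).
        node["$$"] = len(honorific) + 1
    return root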
The actual function:
def strip_honorific(source: str, honorifics) -> str:
    node = honorifics
    for word in source.split():
        if word not in node:
            # No longer honorific continues here; check whether one ended.
            if "$$" in node:
                rest = source[node["$$"]:]
                # Require at least two words after the honorific.
                if " " in rest:
                    return rest
            return source
        node = node[word]
    return source
Call it with honorifics=make_trie(honorifics).
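To show what the trie looks like and how the lookup walks it, here is a small self-contained sketch. The three-entry list is an assumption for illustration.

```python
def make_trie(honorifics):
    root = {}
    for honorific in honorifics:
        node = root
        for word in honorific.split():
            if word not in node:
                node[word] = {}
            node = node[word]
        # Offset at which the rest of the name starts.
        node["$$"] = len(honorific) + 1
    return root

def strip_honorific(source, honorifics):
    node = honorifics
    for word in source.split():
        if word not in node:
            if "$$" in node:
                rest = source[node["$$"]:]
                if " " in rest:
                    return rest
            return source
        node = node[word]
    return source

# Illustrative list (assumption).
trie = make_trie(["mr", "the hon", "the hon lady"])
print(trie)
# {'mr': {'$$': 3}, 'the': {'hon': {'$$': 8, 'lady': {'$$': 13}}}}

print(strip_honorific("the hon dana smith", trie))  # dana smith
print(strip_honorific("the hon lady dana", trie))   # the hon lady dana
```

The lookup cost depends on the number of words in the input rather than on the number of honorifics, which is the appeal of this structure for a list of about 1000 entries.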
Answer:
By reformatting the honorifics list into a different structure, I was able to improve performance by more than 2x.
def reformat_honorifics(honorifics):
    # Group honorifics by first letter, then by the length of their first word.
    honorifics_by_letter_dct = {}
    for honorific in honorifics:
        honorifics_by_letter_dct[honorific[0]] = honorifics_by_letter_dct.get(honorific[0], {})
        honorifics_by_letter_dct[honorific[0]][len(honorific.split()[0])] = honorifics_by_letter_dct[honorific[0]].get(len(honorific.split()[0]), []) + [honorific]
    return honorifics_by_letter_dct

reformatted_honorifics = reformat_honorifics(honorifics)
def strip_honorific_1(source: str, honorifics) -> str:
    source_tokens = source.split()
    if len(source_tokens) > 2:
        # Narrow the candidates by first letter, then by first-word length.
        if honorifics.get(source_tokens[0][0]):
            if honorifics[source_tokens[0][0]].get(len(source_tokens[0])):
                for honorific in honorifics[source_tokens[0][0]].get(len(source_tokens[0])):
                    if source.startswith(f"{honorific} "):
                        stripped_source = source[len(honorific) + 1 :]
                        if len(stripped_source.split()) > 1:
                            return stripped_source
    return source
Mean time: 2.1538e-06s
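To illustrate the shape of the reformatted structure, here is a small sketch. `group_honorifics` is a hypothetical, more compact helper that uses `dict.setdefault` to build the same grouping as `reformat_honorifics` above, and the five-entry list is an assumption for illustration.

```python
def group_honorifics(honorifics):
    # Hypothetical compact equivalent of reformat_honorifics:
    # first letter -> length of first word -> list of honorifics.
    grouped = {}
    for honorific in honorifics:
        first_word = honorific.split()[0]
        by_length = grouped.setdefault(first_word[0], {})
        by_length.setdefault(len(first_word), []).append(honorific)
    return grouped

# Illustrative list (assumption).
grouped = group_honorifics(["mr", "mrs", "miss", "the hon", "the rt hon"])
print(grouped)
# {'m': {2: ['mr'], 3: ['mrs'], 4: ['miss']}, 't': {3: ['the hon', 'the rt hon']}}

# For "miss nancy garland" the first token is "miss", so only the
# candidates under 'm' with first-word length 4 need to be checked.
print(grouped["m"][4])  # ['miss']
```

The speed-up comes from checking only the handful of honorifics that share the input's first letter and first-word length, instead of scanning the whole list.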
