I'm currently struggling to find an elegant way to code the problem I'm facing.
I have a large DataFrame with a column containing department names:
Input
import pandas as pd

demo = pd.DataFrame(
    {'Department':
        ['AA', 'AA1', 'BB team 1', 'AA but also a bit of nonsense',
         'BB', 'AA', 'department BB', 'Complete nonsense']}
)
Department
AA
AA1
BB team 1
AA but also a bit of nonsense
BB
AA
department BB
Complete nonsense
I also have a list of known departments:
known_departments = ['AA','BB']
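As you can see, the data contains three kinds of departments:
- Departments that match a known department exactly. These should stay unchanged.
- Departments that are a variant of a known department, i.e. the value contains the department name plus some other text. These should be mapped to the original known department.
- Complete nonsense departments that match none of the known departments. These should also stay as they are.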
Desired output
Department                     Department_simplified
AA                             AA
AA1                            AA
BB team 1                      BB
AA but also a bit of nonsense  AA
BB                             BB
AA                             AA
department BB                  BB
Complete nonsense              Complete nonsense
Update
Thanks to Chris and sophocles for their answers. Although str.extract and str.findall look more elegant, in terms of performance the apply function outperforms both on my actual df:
Solution    %%timeit -n20
Chris       1.65 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
sophocles   1.14 s ± 294 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
Paul        680 ms ± 174 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
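For anyone who wants to repeat this kind of comparison outside a notebook, here is a minimal sketch using the standard timeit module. The repeated demo rows are a synthetic stand-in for the real df (an assumption), and the helper names with_extract and with_apply are made up for illustration, so absolute timings will differ from the table above.

```python
import timeit

import pandas as pd

known_departments = ['AA', 'BB']
# Synthetic data (an assumption): the demo rows repeated to get a larger frame.
base = ['AA', 'AA1', 'BB team 1', 'AA but also a bit of nonsense',
        'BB', 'AA', 'department BB', 'Complete nonsense']
big = pd.DataFrame({'Department': base * 1_000})
pattern = f"({'|'.join(known_departments)})"


def with_extract(df):
    """Regex approach: extract the first match, fall back to the original value."""
    out = df['Department'].str.extract(pattern, expand=False)
    return out.fillna(df['Department'])


def with_apply(df):
    """Plain-Python approach: substring test per row via apply."""
    def first_known(text):
        for item in known_departments:
            if item in text:
                return item
        return text
    return df['Department'].apply(first_known)


# Both approaches should agree on this data before their timings are compared.
assert with_extract(big).tolist() == with_apply(big).tolist()

for func in (with_extract, with_apply):
    seconds = timeit.timeit(lambda: func(big), number=5) / 5
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per call")
```

Which approach wins may depend on the number of rows, the number of known departments, and the pandas version, so it is worth measuring on the actual data.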
Reply from a uj5u.com user:
You can use str.extract here, building a pattern from the list of departments separated by | (or).
import pandas as pd

known_departments = ['AA', 'BB']
demo = pd.DataFrame(
    {'Department':
        ['AA', 'AA1', 'BB team 1', 'AA but also a bit of nonsense',
         'BB', 'AA', 'department BB', 'Complete nonsense']}
)
# Capture the first known department found in each name; expand=False returns a Series
demo['Department_simplified'] = demo.Department.str.extract(
    f"({'|'.join(known_departments)})", expand=False)
# If you need to fill nulls with the original dept name
# (assign the result back; inplace=True on a selected column may not update demo)
demo['Department_simplified'] = demo['Department_simplified'].fillna(demo['Department'])
print(demo)
Output
   Department                     Department_simplified
0  AA                             AA
1  AA1                            AA
2  BB team 1                      BB
3  AA but also a bit of nonsense  AA
4  BB                             BB
5  AA                             AA
6  department BB                  BB
7  Complete nonsense              Complete nonsense
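One thing to watch with a joined pattern is that department names are treated as regular expressions. A minimal sketch of guarding against that with re.escape follows; the name 'R&D (EU)' and the shortened demo frame are hypothetical and not part of the original data.

```python
import re

import pandas as pd

# 'R&D (EU)' is a hypothetical department name containing regex metacharacters.
known_departments = ['AA', 'BB', 'R&D (EU)']
demo = pd.DataFrame(
    {'Department': ['AA1', 'department BB', 'R&D (EU) north', 'Complete nonsense']}
)

# Escape each name so it is matched literally, then join with | inside one capture group.
pattern = '(' + '|'.join(re.escape(d) for d in known_departments) + ')'

# expand=False returns a Series rather than a one-column DataFrame.
simplified = demo['Department'].str.extract(pattern, expand=False)

# Rows with no match come back as NaN; fall back to the original value.
demo['Department_simplified'] = simplified.fillna(demo['Department'])
print(demo)
```

If one known name is contained in another (say 'AA' and 'AAB'), listing the longer name first in the pattern is probably the safer choice, since alternation tries the alternatives in order.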
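Reply from a uj5u.com user:
You can first use str.findall with the list elements (known_departments) to return the matching substrings of the Department column. For rows where nothing is returned, just use the value from Department, since nothing matched.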
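import numpy as np
# Join every known department found in each name into one string
demo['Department_simplified'] = demo['Department']\
    .str.findall('|'.join(known_departments)).str.join('')
# Rows with no match are '', so fall back to the original name
demo['Department_simplified'] = np.where(
    demo['Department_simplified'].eq(''), demo['Department'], demo['Department_simplified'])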
Prints:
   Department                     Department_simplified
0  AA                             AA
1  AA1                            AA
2  BB team 1                      BB
3  AA but also a bit of nonsense  AA
4  BB                             BB
5  AA                             AA
6  department BB                  BB
7  Complete nonsense              Complete nonsense
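A caveat with joining all findall matches: a value that mentions more than one known department ends up with the names concatenated. The sketch below uses a made-up row ('AA merged with BB') to show the difference, and takes only the first match as one possible alternative; whether the first match is the right choice depends on the data.

```python
import pandas as pd

known_departments = ['AA', 'BB']
# The second row is a made-up value that mentions both known departments.
demo = pd.DataFrame(
    {'Department': ['AA1', 'AA merged with BB', 'Complete nonsense']}
)

# Each element is a list of every known department found in that row.
matches = demo['Department'].str.findall('|'.join(known_departments))

# Joining every match concatenates them, so the second row becomes 'AABB'.
demo['joined'] = matches.str.join('')

# Taking only the first match keeps a single department; empty lists give NaN,
# which is then replaced by the original name.
demo['first_match'] = matches.str[0].fillna(demo['Department'])
print(demo)
```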
Reply from a uj5u.com user:
I'm currently using apply combined with a function to get my result.
Code:
def item_in_string(string, list_of_items):
    # Return the first known item contained in the string, else the string itself
    for item in list_of_items:
        if item in string:
            return item
    return string

demo['Department_simplified'] = demo.Department.apply(
    lambda x: item_in_string(x, known_departments) if isinstance(x, str) else x)
However, this doesn't feel very efficient, nor very Pythonic.
I'm wondering if anyone has a better approach to this problem.
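If the column holds many repeated values, one possible variation on the apply approach is to run the lookup once per distinct value and map the result back. This is only a sketch, not one of the posted answers: the helper name simplify and the extra None row are assumptions added for illustration.

```python
import pandas as pd

known_departments = ['AA', 'BB']
demo = pd.DataFrame(
    {'Department': ['AA', 'AA1', 'BB team 1', 'AA but also a bit of nonsense',
                    'BB', 'AA', 'department BB', 'Complete nonsense', None]}
)


def simplify(name):
    """Return the first known department contained in name, else name itself."""
    if not isinstance(name, str):
        return name
    return next((dept for dept in known_departments if dept in name), name)


# Build the lookup once per distinct non-null value, then map it onto every row.
lookup = {name: simplify(name) for name in demo['Department'].dropna().unique()}
demo['Department_simplified'] = demo['Department'].map(lookup)
print(demo)
```

Missing values are not in the lookup, so they stay missing after the map. The gain over a plain apply should grow with the share of duplicate values, which would need to be checked on the real data.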
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/399409.html
