Check whether a string contains an element of a list and return that element from the list

I am struggling to find an elegant way to code a problem I am facing.

I have a large DataFrame with a column that contains department names:

Input

demo = pd.DataFrame(
    {'Department':
        ['AA','AA1','BB team 1','AA but also a bit of nonsense',
        'BB','AA','department BB','Complete nonsense']}
    )

Department
AA
AA1
BB team 1
AA but also a bit of nonsense
BB
AA
department BB 
Complete nonsense

I also have a list of known departments:

known_departments = ['AA','BB']

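As you can see, the column contains three kinds of departments:

Departments that match a known department exactly. These should stay unchanged.
Departments that are variants of a known department, i.e. the value contains the department name plus some other text. These should be mapped to the original known department.
Complete nonsense departments that match no known department at all. These should also be left as they are.

Expected output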
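
To make the three cases concrete, here is a minimal sketch that labels each row of the demo data. The helper name `classify` and the labels 'exact', 'variant' and 'unknown' are made up for illustration and are not part of the question.

```python
import pandas as pd

known_departments = ['AA', 'BB']
demo = pd.DataFrame(
    {'Department':
        ['AA', 'AA1', 'BB team 1', 'AA but also a bit of nonsense',
         'BB', 'AA', 'department BB', 'Complete nonsense']}
)

def classify(name, known):
    """Hypothetical helper: label one department name by the three cases."""
    if name in known:
        return 'exact'        # identical to a known department
    if any(dept in name for dept in known):
        return 'variant'      # contains a known department plus other text
    return 'unknown'          # no known department appears in the name

kinds = demo['Department'].map(lambda name: classify(name, known_departments))
print(kinds.tolist())
# ['exact', 'variant', 'variant', 'variant', 'exact', 'exact', 'variant', 'unknown']
```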

Department                      Department_simplified
AA                              AA
AA1                             AA
BB team 1                       BB
AA but also a bit of nonsense   AA
BB                              BB
AA                              AA
department BB                   BB
Complete nonsense               Complete nonsense

Update

Thanks to Chris and sophocles for their answers. Although str.extract and str.findall look more elegant, in terms of performance the apply function outperformed both on my actual df:

Solution    %%timeit -n20
Chris       1.65s ± 311 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
sophocles   1.14s ± 294 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
Paul        680 ms ± 174 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
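
A comparison like this can be roughly reproduced outside a notebook with the standard timeit module. The sketch below is only an assumption about how such a benchmark might be set up: the frame `big` is a synthetic stand-in (the demo rows repeated), not the real df, and the wrapper names `with_extract`, `with_findall` and `with_apply` are made up here. Absolute timings depend on the data, the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

known_departments = ['AA', 'BB']
base = ['AA', 'AA1', 'BB team 1', 'AA but also a bit of nonsense',
        'BB', 'AA', 'department BB', 'Complete nonsense']
# Synthetic stand-in for the real DataFrame: the demo rows repeated.
big = pd.DataFrame({'Department': base * 1_000})
pattern = '|'.join(known_departments)

def with_extract(df):
    # str.extract approach: first match, falling back to the original value
    out = df['Department'].str.extract(f'({pattern})', expand=False)
    return out.fillna(df['Department'])

def with_findall(df):
    # str.findall approach: join all matches, fall back where nothing matched
    out = df['Department'].str.findall(pattern).str.join('')
    return pd.Series(np.where(out.eq(''), df['Department'], out), index=df.index)

def item_in_string(string, list_of_items):
    for item in list_of_items:
        if item in string:
            return item
    return string

def with_apply(df):
    # apply approach: plain Python substring test per row
    return df['Department'].apply(
        lambda x: item_in_string(x, known_departments) if isinstance(x, str) else x)

for func in (with_extract, with_findall, with_apply):
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f'{func.__name__}: {seconds:.4f} s per call')
```

On this synthetic data all three functions return the same values, so the comparison is like for like.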

A uj5u.com user replied:

You can use str.extract here, building the pattern from the list of departments separated by | (or).

import pandas as pd

known_departments = ['AA','BB']

demo = pd.DataFrame(
    {'Department':
        ['AA','AA1','BB team 1','AA but also a bit of nonsense',
        'BB','AA','department BB','Complete nonsense']}
    )

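# expand=False returns a Series (the single capture group) rather than a DataFrame
demo['Department_simplified'] = demo.Department.str.extract(
    f"({'|'.join(known_departments)})", expand=False)

# If you need to fill nulls with the original dept name
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
demo['Department_simplified'] = demo['Department_simplified'].fillna(demo['Department'])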

print(demo)

Output

                      Department Department_simplified
0                             AA                    AA
1                            AA1                    AA
2                      BB team 1                    BB
3  AA but also a bit of nonsense                    AA
4                             BB                    BB
5                             AA                    AA
6                  department BB                    BB
7              Complete nonsense     Complete nonsense
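
One caveat with building the pattern by a plain join: the department names are interpreted as regular expressions, and with alternation the first alternative that matches at a position wins. The sketch below shows one way to guard against both, using re.escape and trying longer names first. The department names in it ('C++', 'R&D (EU)', 'AAB') are hypothetical and chosen only to trigger these cases.

```python
import re

import pandas as pd

# Hypothetical department names: some contain regex metacharacters,
# and 'AA' is a prefix of 'AAB'.
known_departments = ['C++', 'R&D (EU)', 'AA', 'AAB']

# Longest names first, so 'AAB' is tried before its prefix 'AA'.
ordered = sorted(known_departments, key=len, reverse=True)
# re.escape makes every name match literally.
pattern = '(' + '|'.join(re.escape(name) for name in ordered) + ')'

demo = pd.DataFrame(
    {'Department': ['C++ team', 'R&D (EU) office', 'AAB 2', 'AA1', 'Complete nonsense']}
)
simplified = demo['Department'].str.extract(pattern, expand=False)
demo['Department_simplified'] = simplified.fillna(demo['Department'])
print(demo['Department_simplified'].tolist())
# ['C++', 'R&D (EU)', 'AAB', 'AA', 'Complete nonsense']
```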

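A uj5u.com user replied:

You can first use str.findall with the list elements (known_departments) to return the matching substrings of the Department column. For the rows where nothing is returned, you simply use the value from Department, because there was no match.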

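import numpy as np

# Collect every known-department match in each row and join them into one string
demo['Department_simplified'] = demo['Department']\
    .str.findall('|'.join(known_departments)).str.join('')

# Rows with no match end up as '', so fall back to the original value
demo['Department_simplified'] = np.where(
    demo['Department_simplified'].eq(''),demo['Department'],demo['Department_simplified'])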

Prints:

                      Department Department_simplified
0                             AA                    AA
1                            AA1                    AA
2                      BB team 1                    BB
3  AA but also a bit of nonsense                    AA
4                             BB                    BB
5                             AA                    AA
6                  department BB                    BB
7              Complete nonsense     Complete nonsense
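
Note that joining the findall results concatenates every match in a row, which goes unnoticed on the demo data because no row mentions a known department twice. The sketch below uses hypothetical rows to show the effect, together with one possible alternative that keeps only the first match.

```python
import pandas as pd

known_departments = ['AA', 'BB']
pattern = '|'.join(known_departments)

# Hypothetical rows that mention more than one known department.
demo = pd.DataFrame({'Department': ['AA and BB', 'AA or AA1', 'Complete nonsense']})

matches = demo['Department'].str.findall(pattern)
joined = matches.str.join('')   # concatenates every match in the row
first = matches.str[0]          # first match only, NaN when the list is empty

demo['Department_simplified'] = first.fillna(demo['Department'])
print(joined.tolist())
# ['AABB', 'AAAA', '']
print(demo['Department_simplified'].tolist())
# ['AA', 'AA', 'Complete nonsense']
```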

A uj5u.com user replied:

I currently use apply together with a function to get my result.

Code:

def item_in_string(string, list_of_items):
    for item in list_of_items:
        if item in string:
            return item
    return string

demo['Department_simplified'] = demo.Department.apply(
    lambda x: item_in_string(x, known_departments) if isinstance(x, str) else x)

However, this does not feel very efficient, nor very Pythonic.

I would like to know whether anyone has a better way to solve this problem.

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/399409.html
