Speeding up millions of regex replacements in a DataFrame

I have:

A list of about 40k locations, each a bigram or trigram.
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows.

string_column	string_column_location_removed
Burger King Oakland CA	Burger King
Walmart Walnut Creek CA	Walmart

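I am currently looping over the list of locations, and if a location is present in string_column, I create a new column, string_column_location_removed, with that location removed.

Here is my attempt. It works, but it is slow. Any ideas on how to speed it up?

I have tried taking ideas from this and this, but I am not sure how to extrapolate them to a Pandas DataFrame.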

import re
from random import choice
from string import ascii_lowercase, digits

import pandas as pd

# making random list here
chars = ascii_lowercase + digits
locations_lookup_list  = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
             "Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this", "Thank you so Much!"] * 250000

df = pd.DataFrame(strings_for_df, columns=["string_column"])

def location_remove(txnString):
    # Try every location in turn and strip the first one that is found.
    for locationString in locations_lookup_list:
        pattern = f'\\b{locationString}\\b'
        if re.search(pattern, txnString):
            return re.sub(pattern, '', txnString)
    # No location matched: return the string unchanged instead of None.
    return txnString

df['string_column_location_removed'] = df['string_column'].apply(location_remove)

Answer from a uj5u.com user:

Use trrex, which builds a pattern equivalent to the one in this resource (in fact, it was inspired by that answer):

from random import choice
from string import ascii_lowercase, digits

import pandas as pd
import trrex as tx

# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
                  "Thank you so Much!"] * 250000

df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")

df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)

Output

                              string_column      string_column_location_removed
0                    Burger King Oakland CA                        Burger King 
1                   Walmart Walnut Creek CA                            Walmart 
2                   Random Other Thing Here             Random Other Thing Here
3           Another random other thing here     Another random other thing here
4        Really Appreciate the help on this  Really Appreciate the help on this
...                                     ...                                 ...
1499995             Walmart Walnut Creek CA                            Walmart 
1499996             Random Other Thing Here             Random Other Thing Here
1499997     Another random other thing here     Another random other thing here
1499998  Really Appreciate the help on this  Really Appreciate the help on this
1499999                  Thank you so Much!                  Thank you so Much!

[1500000 rows x 2 columns]
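
Note that the replaced values above keep the space that preceded the location ("Burger King "), whereas the table in the question shows "Burger King". A minimal sketch of one way to tidy this up, on plain strings and with a hypothetical two-entry pattern standing in for the full 40k-location pattern (with pandas, chaining `.str.strip()` after `.str.replace(...)` is assumed to serve the same purpose):

```python
import re

# Hypothetical two-entry pattern standing in for the full location pattern.
pattern = re.compile(r"\b(?:Oakland CA|Walnut Creek CA)\b")


def remove_location(text):
    """Remove any matched location, then collapse leftover whitespace."""
    without_location = pattern.sub("", text)
    # split() with no arguments drops leading, trailing and repeated whitespace.
    return " ".join(without_location.split())


samples = ["Burger King Oakland CA", "Walmart Walnut Creek CA", "Thank you so Much!"]
cleaned = [remove_location(s) for s in samples]
print(cleaned)
```

Collapsing whitespace also covers a location that appears in the middle of a string, where removing it would otherwise leave a double space.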

Timing (of the str.replace call only)

%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The timing does not include the time needed to build the pattern.
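
To see how much each phase costs, the build and the replacement can be timed separately. The sketch below is only an illustration: it uses the standard library, a tiny hypothetical location list, and a plain `|` alternation rather than a trrex pattern, so its numbers are not comparable to the timing above.

```python
import re
import time

# Hypothetical small inputs; the real case has ~40k locations and millions of rows.
locations = ["Oakland CA", "Walnut Creek CA", "San Diego CA"]
rows = ["Burger King Oakland CA", "Walmart Walnut Creek CA", "Thank you so Much!"] * 1000

# Phase 1: build and compile a single pattern once.
start = time.perf_counter()
# Longest first, so a longer location wins over a shorter one sharing its prefix.
alternation = "|".join(
    re.escape(loc) for loc in sorted(locations, key=len, reverse=True)
)
pattern = re.compile(rf"\b(?:{alternation})\b")
build_seconds = time.perf_counter() - start

# Phase 2: apply the compiled pattern to every row.
start = time.perf_counter()
replaced = [pattern.sub("", row) for row in rows]
replace_seconds = time.perf_counter() - start

print(f"build: {build_seconds:.6f}s, replace: {replace_seconds:.6f}s")
```

Either way, the pattern is built once and reused for every row, unlike the loop in the question, which runs up to 40k searches per row.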

Disclaimer: I am the author of trrex.
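
For readers curious about the idea behind a trie-based pattern, here is a rough standard-library sketch. It is not trrex's implementation; the helper names `build_trie` and `trie_to_regex` and the sample locations are made up for illustration. Words sharing a prefix share a branch, so the regex engine does not have to retry each alternative from the start.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment."""
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    body = "(?:" + "|".join(branches) + ")"
    # If a word can end here, whatever follows is optional.
    return body + "?" if "" in node else body


# Hypothetical sample; "Oakland CA" and "Oakley CA" share the prefix "Oakl".
locations = ["Oakland CA", "Oakley CA", "Walnut Creek CA"]
trie_pattern = re.compile(r"\b" + trie_to_regex(build_trie(locations)) + r"\b")

print(trie_pattern.sub("", "Burger King Oakland CA"))
```

The recursion depth equals the length of the longest word, which is fine for short location names but would need rethinking for very long strings.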
