I have:
A list of roughly 40k location strings, each a bigram or trigram:
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows:
| string_column | string_column_location_removed |
|---|---|
| Burger King Oakland CA | Burger King |
| Walmart Walnut Creek CA | Walmart |
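I am currently iterating over the list of locations. If a location is present in string_column, I create a new column, string_column_location_removed, with that location removed.
Here is my attempt. It works, but it is very slow. Any ideas on how to speed it up?
I have tried to take ideas from this and this, but I am not sure how to extrapolate them to a Pandas DataFrame.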
import re
from random import choice
from string import ascii_lowercase, digits
import pandas as pd

# Build a random lookup list of 40k fake locations, plus two real ones
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])

def location_remove(txnString):
    # Try each location in turn and strip the first one that is found
    for locationString in locations_lookup_list:
        if re.search(f'\\b{locationString}\\b', txnString):
            return re.sub(f'\\b{locationString}\\b', '', txnString)
    # No location matched: keep the original string
    return txnString

df['string_column_location_removed'] = df['string_column'].apply(location_remove)
Answer from a uj5u.com user:
Use trrex, which builds a pattern equivalent to the one in this resource (in fact, it was inspired by that answer):
from random import choice
from string import ascii_lowercase, digits
import pandas as pd
import trrex as tx
# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
"Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
"Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])
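# Build a single pattern that matches any of the locations
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
# One vectorized replace over the column instead of one regex search per location per row
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)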
print(df)
Output
                              string_column      string_column_location_removed
0                    Burger King Oakland CA                         Burger King
1                   Walmart Walnut Creek CA                             Walmart
2                   Random Other Thing Here             Random Other Thing Here
3           Another random other thing here     Another random other thing here
4        Really Appreciate the help on this  Really Appreciate the help on this
...                                     ...                                 ...
1499995             Walmart Walnut Creek CA                             Walmart
1499996             Random Other Thing Here             Random Other Thing Here
1499997     Another random other thing here     Another random other thing here
1499998  Really Appreciate the help on this  Really Appreciate the help on this
1499999                  Thank you so Much!                  Thank you so Much!

[1500000 rows x 2 columns]
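For intuition about why one combined pattern can beat 40k separate searches, here is a minimal sketch of the trie idea in plain Python. It illustrates the general technique only and is not trrex's actual code; `build_trie` and `trie_to_regex` are hypothetical helper names, and the four-entry `locations` list is a stand-in for the real lookup list.

```python
import re

def build_trie(words):
    """Nest dicts character by character; the '' key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def trie_to_regex(node):
    """Turn a trie node into a regex fragment in which shared prefixes appear once."""
    is_end = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not is_end:
        # A single continuation needs no group
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word may end here, everything that follows is optional
    return group + "?" if is_end else group

# Hypothetical stand-in for the 40k-entry lookup list
locations = ["Oakland CA", "Walnut Creek CA", "San Diego CA", "San Francisco CA"]
pattern = re.compile(trie_to_regex(build_trie(locations)))

print(repr(pattern.sub("", "Burger King Oakland CA")))   # 'Burger King '
print(repr(pattern.sub("", "Walmart Walnut Creek CA")))  # 'Walmart '
```

Because "San Diego CA" and "San Francisco CA" share the prefix "San ", the regex engine checks that prefix once rather than once per alternative, which is where the saving on a large list is expected to come from.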
Timing (of the str.replace call)
%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The timing does not include the time needed to build the pattern.
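To see the two costs separately, the build step and the replace step can be timed on their own. The sketch below uses only the standard library, with a plain `re` alternation standing in for the trrex pattern; the small `locations` and `rows` lists are hypothetical sample data, so the absolute numbers will not resemble the figure above.

```python
import re
import time

# Hypothetical small stand-ins for the lookup list and the DataFrame column
locations = ["Oakland CA", "Walnut Creek CA", "San Diego CA"]
rows = ["Burger King Oakland CA", "Walmart Walnut Creek CA", "Thank you so Much!"]

# Step 1: build and compile the pattern once (a one-time cost)
start = time.perf_counter()
# Longest first, so a longer location wins over a shorter one that shares its prefix
alternation = "|".join(
    re.escape(loc) for loc in sorted(locations, key=len, reverse=True)
)
pattern = re.compile(alternation)
build_seconds = time.perf_counter() - start

# Step 2: reuse the compiled pattern for every row
start = time.perf_counter()
cleaned = [pattern.sub("", row) for row in rows]
replace_seconds = time.perf_counter() - start

print(f"build: {build_seconds:.6f}s, replace: {replace_seconds:.6f}s")
```

The build cost is paid once per lookup list, while the replace cost grows with the number of rows.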
Disclaimer: I am the author of trrex.
