I have a list of strings, for example:
fruit_list = ["apple", "banana", "coconut"]
I also have a Pandas dataframe, for example:
import pandas as pd
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])
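I want to populate a new column based on a text search of the existing column 'fruit_source'. The value to fill in is whichever element of fruit_list appears in that column of df. One way to write it is: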
df["fruit"] = "fruit not found"
for index, row in df.iterrows():
    for fruit in fruit_list:
        # Case-insensitive substring match; stop at the first fruit found
        if fruit in row['fruit_source'].lower():
            df.loc[index, 'fruit'] = fruit
            break
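This fills the dataframe with a new column holding the fruit found in each fruit source.
However, when this is scaled to larger dataframes, the iteration can become a performance problem: every additional row triggers another pass over the whole list in Python, so the work grows with the number of rows times the number of fruits.
Is there a more efficient way to do this?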
A uj5u.com user replied:
You can let Pandas do the work like this:
# Prime series with the "fruit not found" value
df['fruit'] = "fruit not found"
for fruit in fruit_list:
    # Generate boolean series of rows matching the fruit
    mask = df['fruit_source'].str.contains(fruit, case=False)
    # Replace those rows with the name of the fruit (assign back rather than
    # using inplace=True, which may not update df on recent pandas versions)
    df['fruit'] = df['fruit'].mask(mask, fruit)
print(df) then shows:
    fruit_source  value            fruit
0     Apple farm     10            apple
1   Banana field     15           banana
2  Coconut beach     14          coconut
3     corn field     10  fruit not found
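The loop in this answer runs once per fruit rather than once per row, and each str.contains call is vectorized over the whole column. A minimal sketch of the same idea wrapped in a function follows. The name label_fruit and its parameters are hypothetical, not from the answer, and regex=False is an added assumption so that fruit names are matched as literal text. If several fruits match one row, the one that comes last in the list wins.

```python
import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
df = pd.DataFrame(
    [["Apple farm", 10], ["Banana field", 15], ["Coconut beach", 14], ["corn field", 10]],
    columns=["fruit_source", "value"],
)

def label_fruit(sources, fruits, default="fruit not found"):
    # Hypothetical helper: start with the default label for every row
    labels = pd.Series(default, index=sources.index, dtype=object)
    for fruit in fruits:
        # regex=False treats the fruit name as literal text, not a pattern
        hit = sources.str.contains(fruit, case=False, regex=False)
        # Overwrite matching rows; a later fruit replaces an earlier match
        labels = labels.mask(hit, fruit)
    return labels

df["fruit"] = label_fruit(df["fruit_source"], fruit_list)
```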
A uj5u.com user replied:
Use str.extract with a regular expression pattern to avoid the loop:
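import re
# One capture group that matches any of the fruit names
pattern = fr"({'|'.join(fruit_list)})"
# expand=False returns a Series (matched text or NaN) rather than a DataFrame
df['fruit'] = (df['fruit_source']
               .str.extract(pattern, flags=re.IGNORECASE, expand=False)
               .fillna('fruit not found'))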
Output:
>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found
>>> pattern
'(apple|banana|coconut)'
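Note that str.extract returns the text as it appears in fruit_source ("Apple"), not the entry from fruit_list ("apple"). If the new column should hold the list entries, one possible variation is sketched below. It assumes every entry in fruit_list is lowercase, so lowercasing the match recovers it; the re.escape call is also an addition, guarding against names that contain regex metacharacters.

```python
import re
import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
df = pd.DataFrame(
    [["Apple farm", 10], ["Banana field", 15], ["Coconut beach", 14], ["corn field", 10]],
    columns=["fruit_source", "value"],
)

# Escape each name so it is matched literally inside the alternation
pattern = "(" + "|".join(re.escape(f) for f in fruit_list) + ")"

df["fruit"] = (
    df["fruit_source"]
    # expand=False yields a Series holding the matched text, or NaN if no match
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    # Assumption: fruit_list entries are lowercase, so this maps "Apple" to "apple"
    .str.lower()
    .fillna("fruit not found")
)
```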
