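I'm working with a column of strings in a DataFrame and trying to extract every word that matches any entry in a given word list. My code extracts all the matching words, but it also picks up substrings of longer words. How can I get whole words only? Many thanks!
My code: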
import pandas as pd
cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])
df['findWord'] = df['Text'].str.extractall(f"({'|'.join(cl)})").groupby(level=0).agg(', '.join)
print(df)
Current output (how can I extract only the word 'dust', and not the 'dust' substring inside 'industr'?):
   ID                  Text             findWord
0   1  dust industr yes inr  dust, dust, yes inr
1   2           state inner                inner
2   3                dustry                 dust
Expected output:
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry            NaN
Answer from a uj5u.com user:
Maybe something like this:
import pandas as pd
import numpy as np
cl =['dust', 'inner']
data = [[1, 'dust industry inner'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])
df['findWord'] = [', '.join(set(d.split(' ')).intersection(set(cl))) for d in df['Text'].to_numpy()]
df = df.replace('', np.nan)  # rows with no match become NaN
   ID                 Text     findWord
0   1  dust industry inner  dust, inner
1   2          state inner        inner
2   3               dustry          NaN
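The set-intersection idea above compares whole whitespace-separated tokens, so it cannot match substrings, but it also cannot match a multi-word entry such as 'yes inr' from the original word list, and a set does not guarantee the order of the joined result. A minimal plain-Python sketch of the same idea, using a hypothetical helper `find_tokens` that keeps the order in which tokens appear in the text:

```python
# Sketch of the token-based lookup; find_tokens is a hypothetical helper name.
cl = ['dust', 'yes inr', 'inner']
wanted = set(cl)

def find_tokens(text):
    # Split on whitespace and keep only tokens that are in the word list,
    # preserving the order in which they appear in the text.
    return [tok for tok in text.split() if tok in wanted]

print(find_tokens('dust industr yes inr'))  # ['dust'] ('yes inr' is split into two tokens)
print(find_tokens('state inner'))           # ['inner']
print(find_tokens('dustry'))                # []
```

Because 'yes inr' spans two tokens, it is never found this way, which is why a regex with word boundaries fits the original word list better.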
Update 1: try this with a regular expression pattern:
import pandas as pd
cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])
regex = '({})'.format('|'.join('\\b{}\\b'.format(c) for c in cl))
df['findWord'] = df['Text'].str.extractall(regex).groupby(level=0).agg(', '.join)
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry            NaN
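To see what the word boundaries change, here is a small sketch that runs both patterns over the sample strings with the standard `re` module instead of pandas. The variable names `plain` and `bounded` are only for illustration:

```python
import re

cl = ['dust', 'yes inr', 'inner']
texts = ['dust industr yes inr', 'state inner', 'dustry']

# Pattern from the question: matches the entries anywhere, even inside a longer word.
plain = '({})'.format('|'.join(cl))
# Pattern from the update: each entry must start and end at a word boundary.
bounded = '({})'.format('|'.join(r'\b{}\b'.format(c) for c in cl))

for text in texts:
    print(repr(text), re.findall(plain, text), re.findall(bounded, text))
# 'dust industr yes inr' ['dust', 'dust', 'yes inr'] ['dust', 'yes inr']
# 'state inner' ['inner'] ['inner']
# 'dustry' ['dust'] []
```

The 'dust' inside 'industr' and 'dustry' is rejected because a letter sits directly before or after it, so `\b` does not match at that position.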
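Answer from a uj5u.com user:
Fix your regex pattern by adding word boundaries (\b) so that it matches whole words only, then use str.findall to find all occurrences of the pattern. Note that a row with no match ends up as an empty string here rather than NaN: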
df['findWord'] = df['Text'].str.findall(r'\b(%s)\b' % '|'.join(cl)).str.join(', ')
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry
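If the word list might contain regex metacharacters, or if rows without a match should be marked as missing instead of holding an empty string, a helper along these lines may be useful. This is only a sketch: `find_words` is a hypothetical name, and the `re.escape` call and longest-first ordering are additions that are not part of the answers above.

```python
import re

def find_words(text, words):
    """Return the matched whole words joined by ', ', or None if nothing matched."""
    # re.escape guards against regex metacharacters in the word list;
    # longest-first ordering lets a longer phrase win over a shorter one
    # that starts at the same position.
    ordered = sorted(words, key=len, reverse=True)
    pattern = r'\b(?:%s)\b' % '|'.join(re.escape(w) for w in ordered)
    matches = re.findall(pattern, text)
    return ', '.join(matches) if matches else None

cl = ['dust', 'yes inr', 'inner']
print(find_words('dust industr yes inr', cl))  # dust, yes inr
print(find_words('state inner', cl))           # inner
print(find_words('dustry', cl))                # None
```

With pandas, something like `df['Text'].map(lambda t: find_words(t, cl))` should give the same column, with None in the rows that have no match, which pandas treats as a missing value.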
