I have a DataFrame:
import pandas as pd
data =[[28, ['first'], 'apple edible', 23, 'apple is an edible fruit'],
[28, ['first'], 'apple edible', 34, 'fruit produced by an apple tree'],
[28, ['first'], 'apple edible', 39, 'the apple is a pome edible fruit'],
[21, ['second'], 'green plants', 11, 'plants are green'],
[21, ['second'], 'green plants', 7, 'plant these perennial green flowers']]
df = pd.DataFrame(data, columns=['day', 'group', 'bigram', 'count', 'sentence'])
--- -------- ------------ ----- -----------------------------------
|day|group |bigram |count|sentence |
--- -------- ------------ ----- -----------------------------------
|28 |[first] |apple edible|23 |apple is an edible fruit |
|28 |[first] |apple edible|34 |fruit produced by an apple tree |
|28 |[first] |apple edible|39 |the apple is a pome edible fruit |
|21 |[second]|green plants|11 |plants are green |
|21 |[second]|green plants|7 |plant these perennial green flowers|
--- -------- ------------ ----- -----------------------------------
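I need to find the rows where the bigram intersects the sentence, that is, where every word of the bigram occurs in the sentence. Of those rows, only the first match should be marked True; every match after the first should be marked False. Word order does not matter.
So I want this result: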
--- -------- ------------ ----- -------------------------------- --------
|day|group |bigram |count|sentence | |
--- -------- ------------ ----- -------------------------------- --------
|28 |[first] |apple edible|23 |apple is an edible fruit |True |
|28 |[first] |apple edible|34 |fruit produced by an apple tree |False |
|28 |[first] |apple edible|39 |the apple is a pome edible fruit|False |
|21 |[second]|green plants|11 |plants are green |True |
|21 |[second]|green plants|7 |plant these perennial green flowers|False |
--- -------- ------------ ----- -------------------------------- --------
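Answer from a uj5u.com user:
First test every row for a match by converting the split bigram to a set and checking it against the sentence's words with issubset, then keep only the first True per bigram: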
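# True where every word of the bigram occurs in the sentence
df['new'] = [set(b.split()).issubset(a.split()) for a, b in zip(df['sentence'], df['bigram'])]
# keep a True only on its first occurrence within each bigram
df['new'] = ~df.duplicated(['bigram', 'new']) & df['new']
print(df)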
day group bigram count sentence \
0 28 [first] apple edible 23 apple is an edible fruit
1 28 [first] apple edible 34 fruit produced by an apple tree
2 28 [first] apple edible 39 the apple is a pome edible fruit
3 21 [second] green plants 11 plants are green
4 21 [second] green plants 7 plant these perennial green flowers
new
0 True
1 False
2 False
3 True
4 False
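To see why the `duplicated` step leaves only the first match, the two steps can be pulled apart into intermediate Series. This is only a sketch on the question's `bigram` and `sentence` columns; the names `match`, `repeat` and `first_only` are made up here for illustration and are not part of the answer.

```python
import pandas as pd

# Only the two columns the test depends on, copied from the question.
df = pd.DataFrame({
    'bigram': ['apple edible'] * 3 + ['green plants'] * 2,
    'sentence': ['apple is an edible fruit',
                 'fruit produced by an apple tree',
                 'the apple is a pome edible fruit',
                 'plants are green',
                 'plant these perennial green flowers'],
})

# Step 1: True where every word of the bigram occurs in the sentence.
match = pd.Series(
    [set(b.split()).issubset(s.split())
     for b, s in zip(df['bigram'], df['sentence'])],
    index=df.index,
)

# Step 2: a row is a "repeat" if an earlier row already has the same
# (bigram, match) pair.
repeat = df.assign(match=match).duplicated(['bigram', 'match'])

# A row is flagged only if it matches and is not a repeat.
first_only = match & ~repeat

print(pd.DataFrame({'match': match, 'repeat': repeat, 'first_only': first_only}))
```

Row 2 does match, but it repeats the (`apple edible`, True) pair already seen in row 0, so it is dropped. Rows 1 and 4 never match, so the final `&` keeps them False regardless of whether they are repeats.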
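If bigrams whose words appear in swapped order should be treated as the same bigram, and you need the first match among them, use: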
df['new'] = ~df.assign(bigram=df['bigram'].apply(lambda x: frozenset(x.split()))).duplicated(['bigram','new']) & df['new']
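The difference only shows up when the same two words occur in both orders, which the question's data does not contain. The sketch below therefore uses invented rows (`edible apple` is an assumption added for illustration), and the names `key`, `by_string` and `by_wordset` are hypothetical.

```python
import pandas as pd

# Invented data: rows 0 and 1 hold the same two words in swapped order.
df = pd.DataFrame({
    'bigram': ['apple edible', 'edible apple', 'green plants'],
    'sentence': ['apple is an edible fruit',
                 'the apple is a pome edible fruit',
                 'plants are green'],
})

# Every row matches its own sentence here.
df['new'] = [set(b.split()).issubset(s.split())
             for s, b in zip(df['sentence'], df['bigram'])]

# Comparing bigrams as strings: 'apple edible' != 'edible apple',
# so each one gets its own first match.
by_string = ~df.duplicated(['bigram', 'new']) & df['new']

# Comparing bigrams as word sets: both orders share one key,
# so only the first of the two rows stays True.
key = df['bigram'].apply(lambda x: frozenset(x.split()))
by_wordset = ~df.assign(bigram=key).duplicated(['bigram', 'new']) & df['new']

print(pd.DataFrame({'by_string': by_string, 'by_wordset': by_wordset}))
```

`frozenset` is used rather than `set` because `duplicated` needs hashable values.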
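Answer from a uj5u.com user:
You can do this in two steps: first identify the rows where the bigram is a subset of the sentence (using issubset), then keep only the first True in each group: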
# use python sets to identify the matching bigrams
df['intersection'] = [set(a.split()).issubset(b.split())
for a,b in zip(df['bigram'], df['sentence'])]
# select the non-first matches and replace with False
df.loc[~df.index.isin(df.groupby(df['group'].str[0])['intersection'].idxmax()),
'intersection'] = False
Output:
day group bigram count sentence intersection
0 28 [first] apple edible 23 apple is an edible fruit True
1 28 [first] apple edible 34 fruit produced by an apple tree False
2 28 [first] apple edible 39 the apple is a pome edible fruit False
3 21 [second] green plants 11 plants are green True
4 21 [second] green plants 7 plant these perennial green flowers False
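One case worth checking with the `idxmax` approach is a group that has no match at all. The sketch below uses invented rows (`red car` is an assumption for illustration), groups by `bigram` instead of `df['group'].str[0]` to stay short, and casts the flags to int before calling `idxmax`; none of these choices come from the answer itself.

```python
import pandas as pd

# Invented data: 'apple edible' matches twice, 'red car' never matches.
df = pd.DataFrame({
    'bigram': ['apple edible', 'apple edible', 'red car', 'red car'],
    'sentence': ['an edible apple', 'apple is edible', 'blue car', 'red bike'],
})

df['intersection'] = [set(b.split()).issubset(s.split())
                      for b, s in zip(df['bigram'], df['sentence'])]

# idxmax returns the label of the first maximum in each group:
# the first 1 if there is one, otherwise simply the first row of the group.
first_idx = df['intersection'].astype(int).groupby(df['bigram']).idxmax()

# Every row that is not the chosen one is forced to False.
df.loc[~df.index.isin(first_idx), 'intersection'] = False

print(first_idx.to_dict())
print(df)
```

For the group without a match, `idxmax` still picks a row (the first one), but that row is already False and is left untouched, so no spurious True appears.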
