I have a DataFrame that looks like this:
id sentences ind tar
0 In samples of depression injected intraneously... depression albumin
0 Monomethylmethacrylate in whole blood was asso... depression albumin
1 In samples of depression injected intraneously... depression hip
1 Monomethylmethacrylate in whole blood was asso... depression hip
2 The GVH kinetics and cellular characteristics ... GVH,GVH,GVH,GVH... PFC
2 Effects on PFCgeneword responses to thymus-dep... GVH,GVH,GVH,GVH... PFC
2 The unresponsive state which developed in GVHg... GVH,GVH,GVH,GVH... PFC
2 Furthermore, GVHgeneword spleen cells suppress... GVH,GVH,GVH,GVH... PFC
2 This active suppressor effect was found to be ... GVH,GVH,GVH,GVH... PFC
2 The delayed transfer of GVHgeneword cells to i... GVH,GVH,GVH,GVH... PFC
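I only want to keep the rows whose sentence contains the corresponding ind or tar value.
The problem is that when ind or tar holds multiple elements, the row does not match even if one of those elements is present in the sentence, because the whole string is used as the search term. For example, in row 5, even though the word GVH appears in the sentence, the entire ind value GVH,GVH,GVH,GVH is used as the term instead of each GVH term individually. Can someone help me solve this? Here is my code so far: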
df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]
print(df.sentences.iloc[4], '\n')
print(df.ind.iloc[4], '\n')
print(df.tar.iloc[4], '\n')
print(df.check_ind.iloc[4], '\n')
print(df.check_tar.iloc[4], '\n')
>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. .
>>>> GVH,GVH,GVH,GVH,GVH,GVH
>>>> PFC
>>>> False (This should return TRUE since GVH is in the sentence)
>>>> False
Data:
{'id': [0, 0, 1, 1, 2, 2, 2, 2, 2, 2],
'sentences': ['In samples of depression injected intraneously...',
'Monomethylmethacrylate in whole blood was asso...',
'In samples of depression injected intraneously...',
'Monomethylmethacrylate in whole blood was asso...',
'The GVH kinetics and cellular characteristics ...',
'Effects on PFCgeneword responses to thymus-dep...',
'The unresponsive state which developed in GVHg...',
'Furthermore, GVHgeneword spleen cells suppress...',
'This active suppressor effect was found to be ...',
'The delayed transfer of GVHgeneword cells to i...'],
'ind': ['depression', 'depression', 'depression',
'depression', 'GVH,GVH,GVH,GVH...',
'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
'GVH,GVH,GVH,GVH...'],
'tar': ['albumin', 'albumin', 'hip', 'hip', 'PFC', 'PFC',
'PFC', 'PFC', 'PFC', 'PFC']}
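Reply from a uj5u.com user:
Your code currently treats x.ind as a single value.
Conceptually, x.ind is not a single value but a comma-separated list of values.
In Python, you can turn a comma-separated string into an actual Python list with x.ind.split(','). In addition, str.strip() is useful for removing stray whitespace (for example, if you have "GVH ,GVH ", the spaces should probably be ignored).
Finally, the built-in functions any and all are convenient for applying a condition across a list.
df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.ind.split(',')), axis=1)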
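To make the split/strip/any idea concrete, here is a minimal sketch applied to both columns. The toy data and the helper name contains_any are assumptions made for illustration; they are not part of the original question.

```python
import pandas as pd

# Toy data (hypothetical): "ind" holds comma-separated terms, some padded with spaces.
df = pd.DataFrame({
    "sentences": ["The GVH kinetics were measured", "No relevant term here"],
    "ind": ["GVH ,GVH", "GVH,GVH"],
    "tar": ["PFC", "PFC"],
})

def contains_any(terms, sentence):
    """Return True if any non-empty comma-separated term occurs in the sentence."""
    return any(t.strip() in sentence for t in terms.split(",") if t.strip())

df["check_ind"] = df.apply(lambda x: contains_any(x.ind, x.sentences), axis=1)
df["check_tar"] = df.apply(lambda x: contains_any(x.tar, x.sentences), axis=1)

# Keep rows where either column matched.
filtered = df[df["check_ind"] | df["check_tar"]]
print(filtered)
```

Skipping empty terms (the `if t.strip()` part) matters because an empty string is a substring of every sentence, so a trailing comma would otherwise make every row match.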
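Reply from a uj5u.com user:
You can first concat the "ind" and "tar" columns so that the evaluation only has to be done once.
Then use str.split + explode, and apply a lambda that checks whether any "ind" or "tar" term is present. Finally, groupby + any collapses the result back to the original shape: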
new_df = pd.concat((df[['id','sentences','ind']], df[['id','sentences','tar']].rename(columns={'tar':'ind'})))
new_df['ind'] = new_df['ind'].str.split(',')
msk = new_df.explode('ind').apply(lambda x: x['ind'] in x['sentences'], axis=1).groupby(level=0).any()
out = df[msk]
Output:
id sentences ind tar
0 0 In samples of depression injected intraneously... depression albumin
2 1 In samples of depression injected intraneously... depression hip
4 2 The GVH kinetics and cellular characteristics ... GVH,GVH,GVH,GVH... PFC
5 2 Effects on PFCgeneword responses to thymus-dep... GVH,GVH,GVH,GVH... PFC
6 2 The unresponsive state which developed in GVHg... GVH,GVH,GVH,GVH... PFC
7 2 Furthermore, GVHgeneword spleen cells suppress... GVH,GVH,GVH,GVH... PFC
9 2 The delayed transfer of GVHgeneword cells to i... GVH,GVH,GVH,GVH... PFC
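The same concat/explode/groupby pipeline can be followed step by step on a tiny frame. This is only a sketch: the two-row data and the intermediate names (long_df, exploded, hits) are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical), small enough to trace by hand.
df = pd.DataFrame({
    "id": [0, 1],
    "sentences": ["GVH kinetics", "nothing relevant"],
    "ind": ["GVH,GVH", "GVH,GVH"],
    "tar": ["PFC", "PFC"],
})

# Stack "ind" and "tar" into one "term" column; each original row label now appears twice.
long_df = pd.concat([
    df[["sentences", "ind"]].rename(columns={"ind": "term"}),
    df[["sentences", "tar"]].rename(columns={"tar": "term"}),
])

# One row per individual term; explode repeats the original row label.
long_df["term"] = long_df["term"].str.split(",")
exploded = long_df.explode("term")

# Substring test for each (term, sentence) pair.
hits = pd.Series(
    [term in sentence for term, sentence in zip(exploded["term"], exploded["sentences"])],
    index=exploded.index,
)

# A row is kept if any of its terms matched.
msk = hits.groupby(level=0).any()
out = df[msk]
print(out)
```

The key point is that the row labels survive concat and explode, which is what lets groupby(level=0) fold the per-term results back onto the original rows.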
Reply from a uj5u.com user:
Are the comma-separated terms in ind always duplicates of each other?
If so, you can try the following:
df['check_ind'] = df.apply(lambda x: x.ind.split(',')[0] in x.sentences, axis=1)
This searches only for the first term, the one before the first comma.
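This shortcut relies on every term in the list being identical. As a small illustration, using a made-up row in which the terms differ, the first-term check can miss a match that a check over all terms finds:

```python
import pandas as pd

# Hypothetical row whose "ind" list contains two different terms.
df = pd.DataFrame({
    "sentences": ["PFC responses were reduced"],
    "ind": ["GVH,PFC"],
})

# Only the first term ("GVH") is tested.
first_only = df.apply(lambda x: x.ind.split(",")[0] in x.sentences, axis=1)

# Every term is tested.
any_term = df.apply(lambda x: any(t in x.sentences for t in x.ind.split(",")), axis=1)

print(first_only.tolist(), any_term.tolist())
```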
Reply from a uj5u.com user:
An efficient approach is to build one regular expression per group (since you have many repeated ind/tar combinations):
import re
regex = df[['ind', 'tar']].apply(
    lambda s: '|'.join(map(re.escape, set(x for e in s.values for x in e.split(',')))),
    axis=1)
df['match'] = df.groupby(regex).apply(lambda s: s['sentences'].str.contains(s.name)).droplevel(0)
Output:
id sentences ind tar match
0 0 In samples of depression injected intraneously... depression albumin True
1 0 Monomethylmethacrylate in whole blood was asso... depression albumin False
2 1 In samples of depression injected intraneously... depression hip True
3 1 Monomethylmethacrylate in whole blood was asso... depression hip False
4 2 The GVH kinetics and cellular characteristics ... GVH,GVH,GVH,GVH PFC True
5 2 Effects on PFCgeneword responses to thymus-dep... GVH,GVH,GVH,GVH PFC True
6 2 The unresponsive state which developed in GVHg... GVH,GVH,GVH,GVH PFC True
7 2 Furthermore, GVHgeneword spleen cells suppress... GVH,GVH,GVH,GVH PFC True
8 2 This active suppressor effect was found to be ... GVH,GVH,GVH,GVH PFC False
9 2 The delayed transfer of GVHgeneword cells to i... GVH,GVH,GVH,GVH PFC True
Regex / groups:
0 albumin|depression
1 albumin|depression
2 hip|depression
3 hip|depression
4 GVH|PFC
5 GVH|PFC
6 GVH|PFC
7 GVH|PFC
8 GVH|PFC
9 GVH|PFC
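To show what one of these patterns does in isolation, here is a minimal sketch that builds the alternation for a single term list and applies it with Series.str.contains. The term "IL-2(a)" and the sample sentences are invented for illustration; they show why re.escape is applied before joining.

```python
import re
import pandas as pd

# Hypothetical term list: a repeated term plus one containing regex metacharacters.
terms = "GVH,GVH,IL-2(a)"

# De-duplicate, escape each term so it is matched literally, then join with "|".
pattern = "|".join(map(re.escape, sorted(set(terms.split(",")))))

sentences = pd.Series([
    "GVHgeneword cells",
    "IL-2(a) was elevated",
    "nothing here",
])

# str.contains treats the pattern as a regex and matches substrings.
matches = sentences.str.contains(pattern)
print(matches.tolist())
```

Without re.escape, the parentheses in "IL-2(a)" would be read as a capture group rather than literal characters, so the second sentence would not match as intended.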
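Reply from a uj5u.com user:
You can define a function that checks both columns at once and then pass it to apply(). The function can also split the values in each row, assuming that "," is never used inside a term itself and that all lists use this exact notation without spaces.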
import pandas
def sent_contains_ind_or_tar(row):
    return any(ind in row["sentences"] for ind in row["ind"].split(",")) or any(ind in row["sentences"] for ind in row["tar"].split(","))
df = df[df.apply(sent_contains_ind_or_tar, axis=1)]
For example:
df = pandas.DataFrame([[1, "abc", "u", "v"],
[2, "xyz", "x", "z"],
[3, "xya", "x", "z"]],
columns=["id", "sentences", "ind", "tar"])
print(df)
> id sentences ind tar
0 1 abc u v
1 2 xyz x z
2 3 xya x z
def sent_contains_ind_or_tar(row):
    return any(ind in row["sentences"] for ind in row["ind"].split(",")) or any(ind in row["sentences"] for ind in row["tar"].split(","))
df = df[df.apply(sent_contains_ind_or_tar, axis=1)]
print(df)
> id sentences ind tar
1 2 xyz x z
Edit: added handling of the list case to the function.
