Finding string matches between columns

I have a DataFrame that looks like this:

id      sentences                                           ind                 tar
0       In samples of depression injected intraneously...   depression        albumin
0       Monomethylmethacrylate in whole blood was asso...   depression        albumin
1       In samples of depression injected intraneously...   depression          hip
1       Monomethylmethacrylate in whole blood was asso...   depression          hip
2       The GVH kinetics and cellular characteristics ...   GVH,GVH,GVH,GVH...  PFC
2       Effects on PFCgeneword responses to thymus-dep...   GVH,GVH,GVH,GVH...  PFC
2       The unresponsive state which developed in GVHg...   GVH,GVH,GVH,GVH...  PFC
2       Furthermore, GVHgeneword spleen cells suppress...   GVH,GVH,GVH,GVH...  PFC
2       This active suppressor effect was found to be ...   GVH,GVH,GVH,GVH...  PFC
2       The delayed transfer of GVHgeneword cells to i...   GVH,GVH,GVH,GVH...  PFC

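I want to keep only the rows whose corresponding ind or tar appears in sentences.

The problem is that when ind or tar holds multiple elements, the row does not match even if one of those elements is present in the sentence, because the entire string is used as the search term. For example, in row 5, even though the word GVH is in the sentence, the whole ind value GVH,GVH,GVH,GVH is used as the term instead of each GVH on its own. Can someone help solve this? Here is my code so far: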

df['check_ind'] = df.apply(lambda x: x.ind in x.sentences, axis=1)
df['check_tar'] = df.apply(lambda x: x.tar in x.sentences, axis=1)
df = df.loc[(df['check_ind'] == True) | (df['check_tar'] == True)]

print(df.sentences.iloc[4], '\n')

print(df.ind.iloc[4], '\n')

print(df.tar.iloc[4], '\n')

print(df.check_ind.iloc[4], '\n')

print(df.check_tar.iloc[4], '\n')


>>>> The GVH kinetics and cellular characteristics indicated that suppressor T cells exert an anti-mitotic influence on antigen-stimulated B-cell proliferation. . 

>>>> GVH,GVH,GVH,GVH,GVH,GVH 

>>>> PFC 

>>>> False (This should return TRUE since GVH is in the sentence)

>>>> False

Data:

{'id': [0, 0, 1, 1, 2, 2, 2, 2, 2, 2],
 'sentences': ['In samples of depression injected intraneously...',
  'Monomethylmethacrylate in whole blood was asso...',
  'In samples of depression injected intraneously...',
  'Monomethylmethacrylate in whole blood was asso...',
  'The GVH kinetics and cellular characteristics ...',
  'Effects on PFCgeneword responses to thymus-dep...',
  'The unresponsive state which developed in GVHg...',
  'Furthermore, GVHgeneword spleen cells suppress...',
  'This active suppressor effect was found to be ...',
  'The delayed transfer of GVHgeneword cells to i...'],
 'ind': ['depression', 'depression', 'depression',
         'depression', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...', 'GVH,GVH,GVH,GVH...',
         'GVH,GVH,GVH,GVH...'],
 'tar': ['albumin', 'albumin', 'hip', 'hip', 'PFC', 'PFC',
         'PFC', 'PFC', 'PFC', 'PFC']}

A uj5u.com user replied:

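Your code currently treats x.ind as a single plain value.

Conceptually, x.ind is not a single value but a comma-separated list of values.

In Python, you can turn a comma-separated string into an actual Python list with str.split(','). In addition, str.strip() is useful for removing possible whitespace (for example, if you have "GVH ,GVH ", the spaces should probably be ignored).

Finally, the built-in functions any and all are convenient for applying a condition across a list.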

df['check_ind'] = df.apply(lambda x: any(v.strip() in x.sentences for v in x.ind.split(',')), axis=1)
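
As a minimal sketch of the idea above, the check can be wrapped in a small helper and applied to both columns. The helper name contains_any_term and the three-row sample frame are illustrative assumptions, not part of the original question. Empty pieces are skipped because an empty string counts as a substring of every sentence.

```python
import pandas as pd


def contains_any_term(terms: str, sentence: str) -> bool:
    """Return True if any non-empty comma-separated term occurs in the sentence."""
    # Split on commas, strip surrounding whitespace, and skip empty pieces
    # (an empty string is a substring of every string, so it must be ignored).
    return any(t.strip() in sentence for t in terms.split(",") if t.strip())


# Hypothetical sample data, shortened from the question.
df = pd.DataFrame({
    "sentences": ["The GVH kinetics and cellular characteristics",
                  "Effects on PFCgeneword responses",
                  "This active suppressor effect was found"],
    "ind": ["GVH,GVH,GVH", "GVH ,GVH", "GVH,GVH"],
    "tar": ["PFC", "PFC", "PFC"],
})

df["check_ind"] = df.apply(lambda x: contains_any_term(x.ind, x.sentences), axis=1)
df["check_tar"] = df.apply(lambda x: contains_any_term(x.tar, x.sentences), axis=1)

# Keep the rows where at least one of the two checks succeeded.
filtered = df[df["check_ind"] | df["check_tar"]]
```

Note that this is still a plain substring test, so "GVH" also matches inside "GVHgeneword", which is what the question's expected output relies on.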

A uj5u.com user replied:

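You can first concat the "ind" and "tar" columns so that the evaluation only has to be done once.

Then use str.split and explode, and apply a check for whether any "ind" or "tar" term is present in the sentence. Finally, groupby + any brings the result back to the original shape: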

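import pandas as pd

# Stack "ind" and "tar" into one "ind" column; index labels are kept, so each original row appears twice
new_df = pd.concat((df[['id','sentences','ind']], df[['id','sentences','tar']].rename(columns={'tar':'ind'})))
new_df['ind'] = new_df['ind'].str.split(',')
# One row per term, test it against the sentence, then reduce to one boolean per original index label
msk = new_df.explode('ind').apply(lambda x: x['ind'] in x['sentences'], axis=1).groupby(level=0).any()
out = df[msk]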

Output:

   id                                          sentences                 ind      tar  
0   0  In samples of depression injected intraneously...          depression  albumin  
2   1  In samples of depression injected intraneously...          depression      hip  
4   2  The GVH kinetics and cellular characteristics ...  GVH,GVH,GVH,GVH...      PFC  
5   2  Effects on PFCgeneword responses to thymus-dep...  GVH,GVH,GVH,GVH...      PFC  
6   2  The unresponsive state which developed in GVHg...  GVH,GVH,GVH,GVH...      PFC  
7   2  Furthermore, GVHgeneword spleen cells suppress...  GVH,GVH,GVH,GVH...      PFC  
9   2  The delayed transfer of GVHgeneword cells to i...  GVH,GVH,GVH,GVH...      PFC
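
To see why groupby(level=0).any() restores the original shape, here is a small sketch on a hypothetical two-row frame (the frame and the names small, exploded, hits and mask are made up for illustration). explode repeats the original index label for every term, so grouping on that label collapses the per-term results back to one boolean per row.

```python
import pandas as pd

# Hypothetical two-row frame; the term lists are already split.
small = pd.DataFrame({
    "sentences": ["GVH kinetics", "no match here"],
    "ind": [["GVH", "PFC"], ["GVH", "PFC"]],
})

# explode() yields one row per term and repeats the original index label.
exploded = small.explode("ind")

# Test each term against its sentence, then reduce per original row.
hits = exploded.apply(lambda x: x["ind"] in x["sentences"], axis=1)
mask = hits.groupby(level=0).any()
```

Because mask is indexed by the original row labels, it can be used directly as a boolean filter on the original frame.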

A uj5u.com user replied:

Are the comma-separated terms in ind always repeats of the same term?

If so, you can try the following:

df['check_ind'] = df.apply(lambda x: x.ind.split(',')[0] in x.sentences, axis=1)

This searches only for the first term, the one before the first comma.

A uj5u.com user replied:

An efficient approach is to use one regular expression per group (since you have many repeated ind/tar combinations):

import re
regex = df[['ind', 'tar']].apply(lambda s: '|'.join(map(re.escape, set(x for e in s.values
                                                                       for x in e.split(',')))),
                                                    axis=1)

df['match'] = df.groupby(regex).apply(lambda s: s['sentences'].str.contains(s.name)).droplevel(0)

Output:

   id                                          sentences              ind      tar  match
0   0  In samples of depression injected intraneously...       depression  albumin   True
1   0  Monomethylmethacrylate in whole blood was asso...       depression  albumin  False
2   1  In samples of depression injected intraneously...       depression      hip   True
3   1  Monomethylmethacrylate in whole blood was asso...       depression      hip  False
4   2  The GVH kinetics and cellular characteristics ...  GVH,GVH,GVH,GVH      PFC   True
5   2  Effects on PFCgeneword responses to thymus-dep...  GVH,GVH,GVH,GVH      PFC   True
6   2  The unresponsive state which developed in GVHg...  GVH,GVH,GVH,GVH      PFC   True
7   2  Furthermore, GVHgeneword spleen cells suppress...  GVH,GVH,GVH,GVH      PFC   True
8   2  This active suppressor effect was found to be ...  GVH,GVH,GVH,GVH      PFC  False
9   2  The delayed transfer of GVHgeneword cells to i...  GVH,GVH,GVH,GVH      PFC   True

Regexes/groups:

0    albumin|depression
1    albumin|depression
2        hip|depression
3        hip|depression
4               GVH|PFC
5               GVH|PFC
6               GVH|PFC
7               GVH|PFC
8               GVH|PFC
9               GVH|PFC
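
As a rough sketch of how one of these patterns is built for a single row: the values below are hypothetical, and "IL-2 (human)" is invented only to show why re.escape matters when a term contains regex metacharacters. sorted() is an addition here to make the pattern order deterministic; the answer above uses a plain set.

```python
import re

# Hypothetical ind/tar values for a single row; "IL-2 (human)" is made up to
# show that re.escape() neutralises regex metacharacters such as parentheses.
ind, tar = "GVH,GVH,IL-2 (human)", "PFC"

# Collect the unique terms from both columns. sorted() is used here only to
# make the resulting pattern deterministic.
terms = sorted({t for value in (ind, tar) for t in value.split(",")})

# Join the escaped terms into a single alternation: term1|term2|...
pattern = "|".join(map(re.escape, terms))

# The alternation matches if any one of the terms occurs in the text.
match_found = re.search(pattern, "Effects on PFCgeneword responses") is not None
# Without the literal parentheses, the escaped term does not match.
no_match = re.search(pattern, "IL-2 human") is None
```

Deduplicating first keeps the pattern short, which is why the six repeated GVH entries collapse to a single alternative in the groups listed above.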

A uj5u.com user replied:

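You can define a method that checks both columns at once and then use it in apply(). The method can also split the values in each row, assuming that , is never used inside a term itself and that all lists use exactly this notation, without spaces.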

import pandas

def sent_contains_ind_or_tar(row):
    return any(ind in row["sentences"] for ind in row["ind"].split(",")) or any(ind in row["sentences"] for ind in row["tar"].split(","))

df = df[df.apply(sent_contains_ind_or_tar, axis=1)]

For example:

df = pandas.DataFrame([[1, "abc", "u", "v"],
                       [2, "xyz", "x", "z"],
                       [3, "xya", "x", "z"]],
                      columns=["id", "sentences", "ind", "tar"])
print(df)
>    id sentences ind tar
  0   1       abc   u   v
  1   2       xyz   x   z
  2   3       xya   x   z


def sent_contains_ind_or_tar(row):
    return any(ind in row["sentences"] for ind in row["ind"].split(",")) or any(ind in row["sentences"] for ind in row["tar"].split(","))

df = df[df.apply(sent_contains_ind_or_tar, axis=1)]
print(df)
>    id sentences ind tar
  1   2       xyz   x   z
  2   3       xya   x   z

Edit: added handling of the list case to the method.


Tags: Python, pandas DataFrame, nlp
