Using two DataFrames, how do I match lookup values as substrings of a column in another DataFrame and create new columns when matches exist?

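I am trying to use two DataFrames, one of them as a lookup table, to find substring matches for its values in a column of my dataset DataFrame. When a value is found, I want to create a new column holding that value, then go through the whole column, remove the matched substring from the original column, and repeat until there are no more matches.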

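The first problem I ran into is that I cannot match or return the matched value unless it is an exact string. The tricky part is that Ingredient_Name sometimes contains several words for a single ingredient.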

Here is a smaller example of my code. The commented-out lines show what I tried, along with the errors or problems I hit:

import pandas as pd

singleingredientdata = {
    'Ingredient_Name':['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
    'WordCount':[1,1,2,1,1,1,1,2,1],
    'Num_Of_Ingredients':[1,1,1,1,1,1,1,1,1]
}

multiingredientdata = {
    'Ingredient_Name':['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
    'WordCount':[4,3,5,1],
    'Num_Of_Ingredients':[3,2,4,1]
}

df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()



df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  ##Doesn't flag True unless the whole string is exactly one single ingredient
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  TypeError: first argument must be string or compiled pattern
iterator = 1
for j in range(0, max_value):
    col_name = 'Ingredient_Name' + str(iterator)
#    contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
#    df2[col_name]= df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    iterator += 1

print(df2)

Ideally, my result would look like this:

Ingredient_Name  Ingredient_Name1  Ingredient_Name2      Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN     ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL        ACETYLSALICYLIC ACID
                 COLISTIN          HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE

The original Ingredient_Name column would keep any values that were not found in the lookup; in this example there are none.
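One possible way to compute that "leftover" column is to delete every lookup value from the string and tidy the remaining whitespace. The sketch below is only an illustration under assumptions: the `lookup` and `names` values (including `MYSTERY`) are made up for the example, and `re.escape` is added in case a lookup value ever contains regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the question
lookup = ['ACETAMINOPHEN', 'CAFFEINE']
names = pd.Series(['ACETAMINOPHEN CAFFEINE', 'CAFFEINE MYSTERY'])

# One alternation pattern built from the escaped lookup values
pattern = '|'.join(re.escape(value) for value in lookup)

# Remove every matched ingredient, then collapse and trim leftover spaces
leftover = (
    names.str.replace(pattern, '', regex=True)
         .str.replace(r'\s+', ' ', regex=True)
         .str.strip()
)
```

Rows whose ingredients were all found end up as empty strings, while anything unknown stays behind.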

Here is what I have tried so far to match on the ingredients. I have included the error message or the problem with each line:

df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  ##Doesn't flag True unless the whole string is exactly one single ingredient
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  TypeError: first argument must be string or compiled pattern
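The errors above share a cause: `isin` compares whole cell values, and `str.contains` expects a single string or compiled pattern rather than a Series. A minimal sketch of one way around this, assuming the lookup values can be joined into one regular expression, is shown here. The `lookup` and `names` Series and the `UNKNOWN DRUG` value are stand-ins invented for the example.

```python
import re
import pandas as pd

# Stand-in data for illustration only
lookup = pd.Series(['ACETAMINOPHEN', 'ACETYLSALICYLIC ACID', 'CAFFEINE', 'BROMIDE'])
names = pd.Series(['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE', 'BROMIDE', 'UNKNOWN DRUG'])

# str.contains wants one pattern, so join the lookup values with "|"
pattern = '|'.join(re.escape(value) for value in lookup)

# True where any lookup value occurs as a substring
exists = names.str.contains(pattern)

# The first matched value itself (NaN when nothing matches)
first_match = names.str.extract(f'({pattern})')[0]
```

This returns the matched value rather than only True/False, but just the first one per row; getting every match needs something like `findall` or `extractall`.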

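The part I did get working, matching on exact strings, returns the result below. However, I want to return the matched values rather than True/False, and to match on substrings rather than exact strings: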

                                     Ingredient_Name  WordCount  Num_Of_Ingredients  Exists
0        ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   False
1                    ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   False
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR...          5                   4   False
3                                            BROMIDE          1                   1    True

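Maybe I am approaching this the wrong way, or maybe I am close and just missing something. Any help pointing me in the right direction would be much appreciated!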

uj5u.com user replied:

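To keep a column of the unmatched ingredients, this is the best I could come up with. If the unmatched ingredients are not that important, you are better off using the built-in string and pattern matching functions mentioned in the other answers. This is probably not the most efficient way to do it.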

def match_ingredients(row, df):
    # Text that remains after matched ingredients are removed
    base_str = row['Ingredient_Name']
    result_count = 1
    result = {}
    for idx, ingredient in df.iterrows():
        name = ingredient['Ingredient_Name']
        if name in base_str:  # plain Python substring check
            result[f'Ingredient_{result_count}'] = name
            result_count += 1
            # "Subtract" the matched ingredient from the remaining text
            base_str = base_str.replace(name, "")
    # Whatever was not matched stays in the original column
    result['Ingredient_Name'] = base_str

    return result

result = df2.apply(match_ingredients,axis=1, result_type='expand', args=(df1,))

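df2.apply(match_ingredients, ...) runs the function on every row of df2 and combines the per-row return values into another DataFrame. The function takes df1 as an argument so that we can iterate over each ingredient (it could also be changed to take a list of ingredients), and `in` works as a substring check in plain Python. If an ingredient is found inside the full ingredient string, we use `replace` to "subtract" it from that string.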
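The list-based variant mentioned above might look like the following sketch. The function name `split_ingredients` and the `XYZ` token are hypothetical. Trying longer names first is an added assumption, not part of the original answer: it keeps `BROMIDE` from being removed out of `THONZONIUM BROMIDE` regardless of how the lookup list happens to be ordered.

```python
def split_ingredients(text, known):
    """Return (matches, leftover) for one ingredient string."""
    matches = []
    # Longer names first, so a multi-word name is consumed before its parts
    for name in sorted(known, key=len, reverse=True):
        if name in text:                    # plain substring check
            matches.append(name)
            text = text.replace(name, '')   # "subtract" the match
    # Collapse the whitespace left behind by the removals
    return matches, ' '.join(text.split())


known = ['BROMIDE', 'THONZONIUM BROMIDE', 'NEOMYCIN']
matches, leftover = split_ingredients('NEOMYCIN THONZONIUM BROMIDE XYZ', known)
```

Note that the matches come back in longest-first order rather than in the order they appear in the string.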

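The other thing to note is that the keys of the returned dictionary are treated as column names, so we can assign the remaining base string (after all matched strings have been replaced) to the fixed column name Ingredient_Name.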

result_type='expand' means that, where possible, the function's return value is turned into multiple columns.
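To see that expansion in isolation, here is a small sketch on made-up data (the `text` column and the `split_words` helper are hypothetical). Each row returns a dictionary, and rows that return fewer keys get NaN in the missing columns.

```python
import pandas as pd

df = pd.DataFrame({'text': ['a b', 'c']})

def split_words(row):
    # One dictionary key per word: word_1, word_2, ...
    words = row['text'].split()
    return {f'word_{i}': word for i, word in enumerate(words, start=1)}

# Dictionary keys become column names; missing keys become NaN
expanded = df.apply(split_words, axis=1, result_type='expand')
```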

See the documentation for apply.

uj5u.com user replied:

I don't fully understand what you really want, but maybe this helps?

pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)

# Output:
                                       Ingredient_Name  WordCount  Num_Of_Ingredients  \
0          ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   
1                      ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE          5                   4   
3                                              BROMIDE          1                   1   

  Ingredient_Name_1     Ingredient_Name_2 Ingredient_Name_3   Ingredient_Name_4  
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN  
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN  
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE  
3           BROMIDE                   NaN               NaN                 NaN
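A caveat with a joined pattern like this: regex alternation takes the first alternative that matches at a given position, so if one lookup value is a prefix of another, the shorter one can win. The sketch below illustrates this with a hypothetical lookup list (the bare `ACETYLSALICYLIC` entry is not in the question's data) and shows one possible mitigation, sorting by length and escaping the values.

```python
import re
import pandas as pd

# Hypothetical lookup where one name is a prefix of another
lookup = ['ACETYLSALICYLIC', 'ACETYLSALICYLIC ACID', 'CAFFEINE']
names = pd.Series(['ACETYLSALICYLIC ACID CAFFEINE'])

# Naive join: the shorter prefix is tried first and wins
naive = names.str.findall('|'.join(lookup))

# Longest-first, escaped: the multi-word name is tried first
ordered = sorted(lookup, key=len, reverse=True)
pattern = '|'.join(re.escape(name) for name in ordered)
safer = names.str.findall(pattern)
```

With the question's own data the naive join happens to work, because no ingredient name is a prefix of another one.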

uj5u.com user replied:

Use str.extractall to get all the matches
unstack to convert them into separate columns

output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()

#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")

>>> output
  Ingredient_Name_0     Ingredient_Name_1 Ingredient_Name_2   Ingredient_Name_3
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE
3           BROMIDE                   NaN               NaN                 NaN
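Something to keep in mind with this approach: `extractall` returns one row per match under a (row, match) MultiIndex, so a row with no match at all is missing from the unstacked result. A sketch of joining the matches back onto the original frame, using a small made-up `df2` and a hard-coded pattern purely for illustration:

```python
import pandas as pd

# Made-up data: the second row contains nothing from the lookup
df2 = pd.DataFrame({'Ingredient_Name': ['COLISTIN NEOMYCIN', 'UNKNOWN']})
pattern = '(COLISTIN|NEOMYCIN)'

# One row per match, indexed by (original row, match number)
matches = df2['Ingredient_Name'].str.extractall(pattern)

# Pivot the match number into columns; only row 0 appears here
wide = matches[0].unstack().add_prefix('Ingredient_Name_')

# Joining on the index restores the unmatched row with NaN values
combined = df2.join(wide)
```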

