Pandas: Merging values from one DataFrame into another based on a condition

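Using fuzzy logic and the fuzzywuzzy module, I was able to match the names (from one DataFrame) against the short names (from another DataFrame). Both DataFrames also contain an ISIN column.

This is the DataFrame I get after applying the logic.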

     ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e+05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e+04    0.0002         NaN                   NaN
238   NaN             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e+04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e+07    0.0499         NaN                   NaN
240   NaN                                       AUM      NaN  4.338766e+08    0.9999         NaN                   NaN  AUM IND BARC US

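A new column, "matches", has been created. It holds the Short Name from the second DataFrame that matched the Name from the first DataFrame.

The ISIN in dataframe1 is empty, while the ISIN in dataframe2 is present. Wherever a match is found (Name from the first DataFrame against Short Name from the second), I want to add the corresponding ISIN from the second DataFrame to the first.

How can I bring the ISIN from the second DataFrame into the first, so that my final output looks like this?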

     ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e+05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e+04    0.0002         NaN                   NaN
238  78s9             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e+04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e+07    0.0499         NaN                   NaN
240  123e                                       AUM      NaN  4.338766e+08    0.9999         NaN                   NaN  AUM IND BARC US

Edit: the DataFrames in their original form

df1

   ISIN                                 Name Currency       Value  % Weight  Asset Type                              Comments/ Assumptions
0   NaN     Transcanada Trust 5.875 08/15/76      USD  7616765.00    0.0176         NaN  https://assets.cohenandsteers.com/assets/conte...
1   NaN      Bp Capital Markets Plc Flt Perp      USD  7348570.50    0.0169         NaN  Holding value for each constituent is derived ...
2   NaN       Transcanada Trust Flt 09/15/79      USD  7341250.00    0.0169         NaN                                                NaN
3   NaN      Bp Capital Markets Plc Flt Perp      USD  6734022.32    0.0155         NaN                                                NaN
4   NaN  Prudential Financial 5.375% 5/15/45      USD  6508290.68    0.0150         NaN                                                NaN
(241, 7)

df2

         Short Name          ISIN
0  ABU DHABI COMMER  AEA000201011
1  ABU DHABI NATION  AEA002401015
2  ABU DHABI NATION  AEA006101017
3  ADNOC DRILLING C  AEA007301012
4  ALPHA DHABI HOLD  AEA007601015
(66987, 2)

Edit 2: the fuzzy logic used to get the matches from the DataFrames

import pandas as pd
from fuzzywuzzy import fuzz, process

df1 = pd.read_excel('file.xlsx', sheet_name=1, usecols=[1, 2, 3, 4, 5, 6, 8], header=1)
df2 = pd.read_excel("Excel files/file2.xlsx", sheet_name=0, usecols=[1, 2], header=1)

# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []

# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()

# taking the threshold as 93
threshold = 93

# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1

# iterating through the closest matches
# to keep only those whose score meets the threshold
for j in df1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []

# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
#print(df1.to_csv('todays-result1.csv'))
print(df1.head(20))

Answer:

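Assuming the ISIN column in your first DataFrame is entirely empty, a simple merge will do what you need. If you need to keep the non-null ISINs already in the first DataFrame, you will need to use a boolean mask: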

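import pandas as pd

df1 = pd.DataFrame(
  [[None, "Apple", "appl"],
  [None, "Google", "ggl"],
  [None, "Amazon", 'amzn']],
  columns=["ISIN", "Name", "matches"]
)

df2 = pd.DataFrame(
  [["ISIN1", "appl"],
  ["ISIN2", "ggl"]],
  columns=["ISIN", "Short Name"]
)

# Boolean mask: True for the rows of df1 whose ISIN is missing
missing_isin = df1['ISIN'].isnull()

# Look up the ISIN for those rows only. merge() returns a fresh 0..n-1 index,
# so assign the values by position rather than relying on index alignment
# (this assumes 'Short Name' is unique in df2).
df1.loc[missing_isin, 'ISIN'] = df1.loc[missing_isin, ['matches']].merge(
    df2[['ISIN', 'Short Name']],
    how='left',
    left_on='matches',
    right_on='Short Name'
)['ISIN'].to_numpy()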
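
For the simpler case mentioned above, where every ISIN in the first DataFrame is empty, a plain merge might look like the sketch below. It reuses the toy data from the example. Dropping the empty ISIN column before the merge and restoring the column order afterwards are choices made for this illustration, not something the original answer spells out.

```python
import pandas as pd

# Toy frames mirroring the example above (values are illustrative).
df1 = pd.DataFrame(
    [[None, "Apple", "appl"],
     [None, "Google", "ggl"],
     [None, "Amazon", "amzn"]],
    columns=["ISIN", "Name", "matches"],
)
df2 = pd.DataFrame(
    [["ISIN1", "appl"],
     ["ISIN2", "ggl"]],
    columns=["ISIN", "Short Name"],
)

# Drop the all-empty ISIN column, then bring ISIN in from df2 by
# left-joining 'matches' against 'Short Name'.
merged = (
    df1.drop(columns="ISIN")
       .merge(df2, how="left", left_on="matches", right_on="Short Name")
       .drop(columns="Short Name")
)

# Restore the original column order (ISIN first).
merged = merged[["ISIN", "Name", "matches"]]
print(merged)
```

Rows whose "matches" value has no counterpart in df2 (here "amzn") end up with a missing ISIN.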

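left_on / right_on: the column names to match on in the left and right DataFrames, respectively.

how='left': (simply put) keeps every row of the left DataFrame, in its original order, and fills in the columns from the right DataFrame where a match exists; rows without a match get NaN. Note that the result gets a new index rather than the left DataFrame's index. See the documentation for more information.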
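
Two details of the real data may matter here: the rows of df1 shown in the question carry index labels such as 236 to 240, and the df2 sample repeats a Short Name (ABU DHABI NATION appears twice). A merge returns a new 0-based index and emits one row per matching right-hand row, so its output may not line up with df1. One possible alternative, sketched below under those assumptions, is to build a lookup Series and use Series.map, which keeps df1's index. The toy data and the choice to keep the first ISIN for a repeated Short Name are illustrative only.

```python
import pandas as pd

# Toy data (illustrative): a non-default index, one ISIN already filled in,
# an empty-string "no match", and a duplicated 'Short Name' in df2.
df1 = pd.DataFrame(
    {
        "ISIN": ["KEEP1", None, None, None],
        "Name": ["Apple", "Google", "Amazon", "Cash"],
        "matches": ["appl", "ggl", "amzn", ""],
    },
    index=[236, 237, 238, 239],
)
df2 = pd.DataFrame(
    {
        "Short Name": ["appl", "ggl", "ggl"],
        "ISIN": ["ISIN1", "ISIN2", "ISIN3"],
    }
)

# Build a Short Name -> ISIN lookup. Keeping the first ISIN for a repeated
# name is an arbitrary choice made for this sketch.
lookup = df2.drop_duplicates("Short Name").set_index("Short Name")["ISIN"]

# map() preserves df1's index, so the result aligns row by row;
# names that are not in the lookup become NaN.
missing_isin = df1["ISIN"].isna()
df1.loc[missing_isin, "ISIN"] = df1.loc[missing_isin, "matches"].map(lookup)
print(df1)
```

The ISIN that was already present stays untouched, and rows with no match, including those with an empty "matches" value, remain missing.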

