使用模糊邏輯和fuzzywuzzy模塊,我能夠將名稱(來自一個資料幀)與短名稱(來自另一個資料幀)相匹配。這兩個資料框還包含一個表 ISIN。
這是應用邏輯后我得到的資料幀。
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions matches
236 NaN Partnerre Ltd 4.875% Perp Sr:J USD 1.684069e 05 0.0004 NaN NaN
237 NaN Berkley (Wr) Corporation 5.700% 03/30/58 USD 6.955837e 04 0.0002 NaN NaN
238 NaN Tc Energy Corp Flt Perp Sr:11 USD 6.380262e 04 0.0001 NaN NaN TC ENERGY CORP
239 NaN Cash and Equivalents USD 2.166579e 07 0.0499 NaN NaN
240 NaN AUM NaN 4.338766e 08 0.9999 NaN NaN AUM IND BARC US
創建了一個新列“匹配”,這基本上意味著來自第二個資料幀的短名稱匹配來自第一個資料幀的名稱。
來自 dataframe1 的 ISIN 為空,來自 dataframe2 的 ISIN 存在。在隨后的匹配(來自第一個資料幀的名稱和來自第二個資料幀的短名稱)時,我想將相關的 ISIN 從第二個資料幀添加到第一個資料幀。
如何將 ISIN 從第二個資料幀獲取到第一個資料幀,以便我的最終輸出看起來像這樣?
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions matches
236 NaN Partnerre Ltd 4.875% Perp Sr:J USD 1.684069e 05 0.0004 NaN NaN
237 NaN Berkley (Wr) Corporation 5.700% 03/30/58 USD 6.955837e 04 0.0002 NaN NaN
238 78s9 Tc Energy Corp Flt Perp Sr:11 USD 6.380262e 04 0.0001 NaN NaN TC ENERGY CORP
239 NaN Cash and Equivalents USD 2.166579e 07 0.0499 NaN NaN
240 123e AUM NaN 4.338766e 08 0.9999 NaN NaN AUM IND BARC US
編輯:資料幀及其原始形式 df1
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NaN Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NaN Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NaN Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NaN Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NaN Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
(241, 7)
df2
Short Name ISIN
0 ABU DHABI COMMER AEA000201011
1 ABU DHABI NATION AEA002401015
2 ABU DHABI NATION AEA006101017
3 ADNOC DRILLING C AEA007301012
4 ALPHA DHABI HOLD AEA007601015
(66987, 2)
編輯 2:從資料幀中獲取匹配項的模糊邏輯
df1 = pd.read_excel('file.xlsx', sheet_name=1, usecols=[1, 2, 3, 4, 5, 6, 8], header=1)
df2 = pd.read_excel("Excel files/file2.xlsx", sheet_name=0, usecols=[1, 2], header=1)
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()
# taking the threshold as 80
threshold = 93
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1
# iterating through the closest matches
# to filter out the maximum closest match
for j in df1['matches']:
if j[1] >= threshold:
p.append(j[0])
mat2.append(",".join(p))
p = []
# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
#print(df1.to_csv('todays-result1.csv'))
print(df1.head(20))
uj5u.com熱心網友回復:
假設您的第一個資料幀的 ISIN 填充為空,那么簡單的合并就可以滿足您的需求。如果您需要保留第一個資料幀中的非空 ISIN,則需要使用布爾掩碼:-
df1 = pd.DataFrame(
[[None, "Apple", "appl"],
[None, "Google", "ggl"],
[None, "Amazon", 'amzn']],
columns=["ISIN", "Name", "matches"]
)
df2 = pd.DataFrame(
[["ISIN1", "appl"],
["ISIN2", "ggl"]],
columns= ["ISIN", "Short Name"]
)
missing_isin = df1['ISIN'].isnull()
df1.loc[missing_isin, 'ISIN'] = df1.loc[missing_isin][['matches']].merge(
df2[['ISIN', 'Short Name']],
how='left',
left_on='matches',
right_on='Short Name'
)['ISIN']
left_on / right_on :- 與資料框匹配的列名
how='left':- (簡單來說) 保留最左邊資料框的順序/索引,查看檔案了解更多資訊
轉載請註明出處,本文鏈接:https://www.uj5u.com/ruanti/383784.html
上一篇:將某些分類變數更改為統一條目
