Optimal conditional join for pandas DataFrames

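I have a situation where I want to join df_a to df_b.

In reality, these dataframes have shapes (389944, 121) and (1098118, 60).

I need to join the two dataframes conditionally, whenever any of the conditions below is true. If more than one is true, the row only needs to be joined once: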

df_a.player == df_b.handle 
df_a.website == df_b.url 
df_a.website == df_b.web_addr
df_a.website == df_b.notes

As an example...

df_a:

player	website	merch
michael jordan	www.michaeljordan.com	Y
Lebron James	www.kingjames.com	Y
Kobe Bryant	www.mamba.com	Y
Larry Bird	www.larrybird.com	Y
luka Doncic	www.77.com	N

df_b:

platform	url	web_addr	notes	handle	followers	following
Twitter	https://twitter.com/luka7doncic	www.77.com		luka7doncic	1500000	347
Twitter	www.larrybird.com	https://en.wikipedia.org/wiki/Larry_Bird	www.larrybird.com
Twitter		https://www.michaeljordansworld.com/	www.michaeljordan.com
Twitter	https://twitter.com/kobebryant	https://granitystudios.com/	https://granitystudios.com/	Kobe Bryant	14900000	514
Twitter	fooman.com	thefoo.com	foobar	foobarman	1	1
Twitter	www.stackoverflow.com

Ideally, df_a would be left joined to df_b to bring in the handle, followers, and following fields:

player	website	merch	handle	followers	following
michael jordan	www.michaeljordan.com	Y	nh	0	0
Lebron James	www.kingjames.com	Y	null	null	null
Kobe Bryant	www.mamba.com	Y	Kobe Bryant	14900000	514
Larry Bird	www.larrybird.com	Y	nh	0	0
luka Doncic	www.77.com	N	luka7doncic	1500000	347

Below is a minimal, reproducible example:

import pandas as pd, numpy as np

df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan',  1: 'Lebron James',  2: 'Kobe Bryant',  3: 'Larry Bird',  4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com',  1: 'www.kingjames.com',  2: 'www.mamba.com',  3: 'www.larrybird.com',  4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter',  1: 'Twitter',  2: 'Twitter',  3: 'Twitter',  4: 'Twitter',  5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic',  1: 'www.larrybird.com',  2: np.nan,  3: 'https://twitter.com/kobebryant',  4: 'fooman.com',  5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com',  1: 'https://en.wikipedia.org/wiki/Larry_Bird',  2: 'https://www.michaeljordansworld.com/',  3: 'https://granitystudios.com/',  4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan,  1: 'www.larrybird.com',  2: 'www.michaeljordan.com',  3: 'https://granitystudios.com/',  4: 'foobar',  5: np.nan}, 'handle': {0: 'luka7doncic',  1: 'nh',  2: 'nh',  3: 'Kobe Bryant',  4: 'foobarman',  5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})

cols_to_join = ['url', 'web_addr', 'notes']

on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')

res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
    
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)

final

However, this produces an incorrect result with duplicated rows and columns.
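A small sketch of where the extra rows likely come from, using two made-up frames (`left` and `right` are illustrative, not the data above). Each `how='left'` merge returns every row of the left frame, so concatenating several of them stacks matched and unmatched copies of the same row. `drop_duplicates` cannot collapse those copies because they differ in the joined columns.

```python
import pandas as pd

# Illustrative frames: 'a' matches only on handle, 'b' matches only on url.
left = pd.DataFrame({'player': ['a', 'b'], 'website': ['a.com', 'b.com']})
right = pd.DataFrame({'handle': ['a', 'x'],
                      'url': ['z.com', 'b.com'],
                      'followers': [10, 20]})

# Each left merge keeps both rows of `left`, matched or not.
by_handle = left.merge(right, left_on='player', right_on='handle', how='left')
by_url = left.merge(right, left_on='website', right_on='url', how='left')

# Stacking gives one matched and one unmatched copy per player,
# and the copies are not exact duplicates of each other.
stacked = pd.concat([by_handle, by_url], ignore_index=True).drop_duplicates()
print(len(left), len(stacked))
print(stacked)
```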

How can I do this more efficiently and get the correct result?

Answer:

Use:

# drop the result columns so df_a matches the input shown in the question
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)

# melt df_b so the values of cols_to_join end up in a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# the melted frame repeats websites; keep the row with the most followers for each
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')

print (df2)
    followers  following       handle platform  variable  \
9    14900000        514  Kobe Bryant  Twitter  web_addr   
3    14900000        514  Kobe Bryant  Twitter       url   
6     1500000        347  luka7doncic  Twitter  web_addr   
12    1500000        347  luka7doncic  Twitter     notes   
0     1500000        347  luka7doncic  Twitter       url   
10          1          1    foobarman  Twitter  web_addr   
4           1          1    foobarman  Twitter       url   
16          1          1    foobarman  Twitter     notes   
5           0          0           nh  Twitter       url   
7           0          0           nh  Twitter  web_addr   
8           0          0           nh  Twitter  web_addr   
1           0          0           nh  Twitter       url   
14          0          0           nh  Twitter     notes   

                                     website  
9                https://granitystudios.com/  
3             https://twitter.com/kobebryant  
6                                 www.77.com  
12                                       NaN  
0            https://twitter.com/luka7doncic  
10                                thefoo.com  
4                                 fooman.com  
16                                    foobar  
5                      www.stackoverflow.com  
7   https://en.wikipedia.org/wiki/Larry_Bird  
8       https://www.michaeljordansworld.com/  
1                          www.larrybird.com  
14                     www.michaeljordan.com

# merge twice; both results share df_a's index here, so fillna can fill the gaps of one from the other
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')

dffin = dffin2.fillna(dffin1)
print (dffin)
           player                website merch   followers  following  \
0  michael jordan  www.michaeljordan.com     Y         0.0        0.0   
1    Lebron James      www.kingjames.com     Y         NaN        NaN   
2     Kobe Bryant          www.mamba.com     Y  14900000.0      514.0   
3      Larry Bird      www.larrybird.com     Y         0.0        0.0   
4     luka Doncic             www.77.com     N   1500000.0      347.0   

        handle  
0           nh  
1          NaN  
2  Kobe Bryant  
3           nh  
4  luka7doncic
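As an alternative sketch of the same "match on any key" idea, the lookups can be done one key pair at a time, with each later pair filling only the rows that are still unmatched. The helper `join_on_any` and its parameters `pairs` and `value_cols` are hypothetical names introduced for illustration. Unlike the melt approach above, ties are broken by the order of `pairs` and by the first matching row in `df_b`, not by follower count, so the results may differ on real data.

```python
import numpy as np
import pandas as pd

def join_on_any(left, right, pairs, value_cols):
    """Hypothetical helper: attach value_cols from `right` to `left`,
    using the first (left_col, right_col) pair that finds a match."""
    result = None
    for left_col, right_col in pairs:
        # One row per key, so the lookup index is unique and NaN keys never match.
        lookup = (
            right.dropna(subset=[right_col])
                 .drop_duplicates(subset=right_col)
                 .set_index(right_col, drop=False)[value_cols]
        )
        matched = lookup.reindex(left[left_col])
        matched.index = left.index  # realign to the rows of `left`
        # Keep values found by earlier pairs; fill the remaining gaps.
        result = matched if result is None else result.where(result.notna(), matched)
    return left.join(result)

# Same data as the question, without the pre-filled result columns.
df_a = pd.DataFrame({
    'player': ['michael jordan', 'Lebron James', 'Kobe Bryant', 'Larry Bird', 'luka Doncic'],
    'website': ['www.michaeljordan.com', 'www.kingjames.com', 'www.mamba.com',
                'www.larrybird.com', 'www.77.com'],
    'merch': ['Y', 'Y', 'Y', 'Y', 'N'],
})
df_b = pd.DataFrame({
    'platform': ['Twitter'] * 6,
    'url': ['https://twitter.com/luka7doncic', 'www.larrybird.com', np.nan,
            'https://twitter.com/kobebryant', 'fooman.com', 'www.stackoverflow.com'],
    'web_addr': ['www.77.com', 'https://en.wikipedia.org/wiki/Larry_Bird',
                 'https://www.michaeljordansworld.com/', 'https://granitystudios.com/',
                 'thefoo.com', np.nan],
    'notes': [np.nan, 'www.larrybird.com', 'www.michaeljordan.com',
              'https://granitystudios.com/', 'foobar', np.nan],
    'handle': ['luka7doncic', 'nh', 'nh', 'Kobe Bryant', 'foobarman', 'nh'],
    'followers': [1500000, 0, 0, 14900000, 1, 0],
    'following': [347, 0, 0, 514, 1, 0],
})

pairs = [('player', 'handle'), ('website', 'url'),
         ('website', 'web_addr'), ('website', 'notes')]
out = join_on_any(df_a, df_b, pairs, ['handle', 'followers', 'following'])
print(out)
```

Because every lookup is deduplicated on its key, the output keeps exactly one row per row of `df_a`.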

Answer:

You can pass lists to left_on and right_on, although a multi-key merge only matches rows where every key pair is equal -

final = df_a.merge(
    right=df_b, 
    left_on=['player', 'website', 'website', 'website'], 
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
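A small sketch of how a multi-key merge treats its keys, using made-up frames (`left` and `right` are illustrative only). A row is joined only when all listed key pairs match, which is an AND across the conditions, not the "any condition" behaviour the question asks for.

```python
import pandas as pd

left = pd.DataFrame({'player': ['p1', 'p2'], 'website': ['a.com', 'b.com']})
right = pd.DataFrame({
    'handle': ['p1', 'zz'],
    'url': ['a.com', 'b.com'],
    'followers': [10, 20],
})

# p1 matches on both handle and url, so it is joined.
# p2 matches on url only, so it stays unmatched.
both = left.merge(right, left_on=['player', 'website'],
                  right_on=['handle', 'url'], how='left')
print(both)
```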
