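I have a situation where I want to join df_a to df_b.
In reality, these dataframes have shapes (389944, 121) and (1098118, 60).
I need to conditionally join these two dataframes if any of the conditions below is true. If several are true, the row only needs to be joined once: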
df_a.player == df_b.handle
df_a.website == df_b.url
df_a.website == df_b.web_addr
df_a.website == df_b.notes
For example:
df_a:
| player | website | merch |
|---|---|---|
| michael jordan | www.michaeljordan.com | Y |
| Lebron James | www.kingjames.com | Y |
| Kobe Bryant | www.mamba.com | Y |
| Larry Bird | www.larrybird.com | Y |
| luka Doncic | www.77.com | N |
df_b:
| platform | url | web_addr | notes | handle | followers | following |
|---|---|---|---|---|---|---|
| Twitter | https://twitter.com/luka7doncic | www.77.com | | luka7doncic | 1500000 | 347 |
| Twitter | www.larrybird.com | https://en.wikipedia.org/wiki/Larry_Bird | www.larrybird.com | nh | 0 | 0 |
| Twitter | | https://www.michaeljordansworld.com/ | www.michaeljordan.com | nh | 0 | 0 |
| Twitter | https://twitter.com/kobebryant | https://granitystudios.com/ | https://granitystudios.com/ | Kobe Bryant | 14900000 | 514 |
| Twitter | fooman.com | thefoo.com | foobar | foobarman | 1 | 1 |
| Twitter | www.stackoverflow.com | | | nh | 0 | 0 |
Ideally, df_a would be left joined to df_b to bring in the handle, followers, and following fields:
| player | website | merch | handle | followers | following |
|---|---|---|---|---|---|
| michael jordan | www.michaeljordan.com | Y | nh | 0 | 0 |
| Lebron James | www.kingjames.com | Y | null | null | null |
| Kobe Bryant | www.mamba.com | Y | Kobe Bryant | 14900000 | 514 |
| Larry Bird | www.larrybird.com | Y | nh | 0 | 0 |
| luka Doncic | www.77.com | N | luka7doncic | 1500000 | 347 |
Below is a minimal, reproducible example:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces an incorrect result with duplicated columns.
How can I do this more efficiently and get the correct result?
A uj5u.com user replied:
Use:
# drop the expected-output columns so df_a matches the input shown in the question
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
# melt df_b so that url, web_addr and notes collapse into a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# the same website can appear more than once, so keep one row per website (most followers first)
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print(df2)
    followers  following       handle  platform  variable                                   website
 9   14900000        514  Kobe Bryant   Twitter  web_addr               https://granitystudios.com/
 3   14900000        514  Kobe Bryant   Twitter       url            https://twitter.com/kobebryant
 6    1500000        347  luka7doncic   Twitter  web_addr                                www.77.com
12    1500000        347  luka7doncic   Twitter     notes                                       NaN
 0    1500000        347  luka7doncic   Twitter       url           https://twitter.com/luka7doncic
10          1          1    foobarman   Twitter  web_addr                                thefoo.com
 4          1          1    foobarman   Twitter       url                                fooman.com
16          1          1    foobarman   Twitter     notes                                    foobar
 5          0          0           nh   Twitter       url                     www.stackoverflow.com
 7          0          0           nh   Twitter  web_addr  https://en.wikipedia.org/wiki/Larry_Bird
 8          0          0           nh   Twitter  web_addr      https://www.michaeljordansworld.com/
 1          0          0           nh   Twitter       url                         www.larrybird.com
14          0          0           nh   Twitter     notes                     www.michaeljordan.com
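To see the melt-and-deduplicate step in isolation, here is a minimal sketch on a made-up two-row frame. The `toy` frame and its values are assumptions for illustration only, not data from the question. Unlike the answer above, it also drops empty addresses before deduplicating, which is an optional extra.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_b (not the question's data)
toy = pd.DataFrame({
    "handle": ["a", "b"],
    "followers": [10, 20],
    "url": ["x.com", "y.com"],
    "notes": ["y.com", None],
})

# Wide -> long: every url/notes value becomes its own row under 'website'
long_df = toy.melt(
    id_vars=["handle", "followers"],
    value_vars=["url", "notes"],
    value_name="website",
)

# Drop empty addresses, then keep one row per website,
# preferring the row with the most followers
lookup = (
    long_df.dropna(subset=["website"])
    .sort_values("followers", ascending=False)
    .drop_duplicates("website")
)
print(lookup)
```

Here `y.com` appears under both handles, and the row for handle `b` wins because it has more followers.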
# merge twice; both results keep df_a's row order and index, so fillna can fill the gaps
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
           player                website  merch   followers  following       handle
0  michael jordan  www.michaeljordan.com      Y         0.0        0.0           nh
1    Lebron James      www.kingjames.com      Y         NaN        NaN          NaN
2     Kobe Bryant          www.mamba.com      Y  14900000.0      514.0  Kobe Bryant
3      Larry Bird      www.larrybird.com      Y         0.0        0.0           nh
4     luka Doncic             www.77.com      N   1500000.0      347.0  luka7doncic
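The `fillna` step lines the two merge results up by index, which only works while each merge returns exactly one row per row of `df_a`. That holds for the sample data, but on the full frames a repeated `handle` in `df_b` that matches a player would add rows. A minimal sketch of guarding against that with the `validate` argument of `merge`, using made-up frames `left` and `right` (both are assumptions, not the question's data):

```python
import pandas as pd

# Hypothetical frames: 'p2' appears twice as a handle on the right side
left = pd.DataFrame({"player": ["p1", "p2"], "website": ["x.com", "z.com"]})
right = pd.DataFrame({"handle": ["p2", "p2"], "followers": [5, 7]})

# validate="many_to_one" raises if the right-hand keys are not unique
try:
    left.merge(right, left_on="player", right_on="handle",
               how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    print("duplicate keys on the right side:", exc)

# De-duplicate first so the merged result keeps one row per left row
right_unique = (
    right.sort_values("followers", ascending=False)
    .drop_duplicates("handle")
)
merged = left.merge(right_unique, left_on="player", right_on="handle",
                    how="left", validate="many_to_one")
print(merged)
```

Which duplicate to keep is a judgment call. Keeping the row with the most followers simply mirrors the rule the answer already uses for websites.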
A uj5u.com user replied:
You can pass lists to left_on and right_on:
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
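One thing to check before relying on this: passing lists of keys to `merge` matches a row only when every key pair is equal at the same time, which is an AND rather than the "any condition" the question asks for. A small sketch with hypothetical frames `a` and `b` (made up for illustration), where each row satisfies exactly one of the two conditions:

```python
import pandas as pd

# Hypothetical frames: p1 matches on handle only, p2 matches on url only
a = pd.DataFrame({"player": ["p1", "p2"], "website": ["x.com", "y.com"]})
b = pd.DataFrame({
    "handle": ["p1", "zz"],
    "url": ["other.com", "y.com"],
    "followers": [1, 2],
})

# Multi-key merge: a row matches only if player == handle AND website == url
both = a.merge(b, left_on=["player", "website"],
               right_on=["handle", "url"], how="left")
print(both)
```

Neither row finds a partner here, so `followers` comes back empty for both, even though each row satisfies one condition on its own.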
