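I have two DataFrames, x and y. x contains two grouping variables, S and A, and a value variable V. I want to delete from x the rows that are specified in y. y contains the variables S and A_D, which together define which (S, A) pairs should be deleted from x.
However, each element of y['A_D'] can contain several values of A as a comma-separated string. Each of those values should be deleted from x (for the corresponding value of S). In addition, if y['A_D'] contains 'all' for a given value of S, the entire S group should be deleted from x.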
I found a solution that does what I want, but my question is: is there a simpler or more Pythonic way to do this?
import pandas as pd
import numpy as np
# Define x
x = pd.DataFrame({'S': np.repeat(['s1','s2','s3'], 5),
                  'A': [j for i in range(3) for j in ['a','b','c','d','e']],
                  'V': np.random.uniform(size=15)})
# Define y. Which (S,A) pairs should be deleted from x. For 's1' all rows should be deleted.
# For 's2' 'a' and 'd' rows should be deleted and for 's3' the 'c' row should be deleted.
y = pd.DataFrame({'S': ['s1','s2','s3'],
                  'A_D': ['all','a, d', 'c']})
# My solution:
# expand y to a new DF z. Comma separated elements in 'A_D' become separate elements. Also strip whitespace.
z = []
for i, r in y.iterrows():
    # Access the row by label; positional r[0] / r[1] on a labelled Series is deprecated.
    z.append(pd.DataFrame({'S': r['S'],
                           'A_D': [u.strip() for u in str(r['A_D']).split(',')]}))
z = pd.concat(z)
# first delete S-groups defined by `all`
x_d = x.merge(z[z['A_D']=='all'],how='left')
x_d = x_d[x_d['A_D']!='all'].drop(columns= 'A_D')
# then drop (S,A) pairs.
x_d = x_d.merge(z[z['A_D']!='all'],how='left', left_on = ['S','A'], right_on = ['S', 'A_D'])
x_d = x_d[pd.isna(x_d['A_D'])].drop(columns= 'A_D').reset_index(drop=True)
# The required result:
print(x_d)
For clarity, the objects look like this:
x
Out[1]:
     S  A         V
0   s1  a  0.758516
1   s1  b  0.522200
2   s1  c  0.190511
3   s1  d  0.544617
4   s1  e  0.480378
5   s2  a  0.191016
6   s2  b  0.714625
7   s2  c  0.852788
8   s2  d  0.142410
9   s2  e  0.909382
10  s3  a  0.895031
11  s3  b  0.153444
12  s3  c  0.751675
13  s3  d  0.227501
14  s3  e  0.586467
y
Out[2]:
    S   A_D
0  s1   all
1  s2  a, d
2  s3     c
z
Out[3]:
    S  A_D
0  s1  all
0  s2    a
1  s2    d
0  s3    c
x_d
Out[4]:
    S  A         V
0  s2  b  0.714625
1  s2  c  0.852788
2  s2  e  0.909382
3  s3  a  0.895031
4  s3  b  0.153444
5  s3  d  0.227501
6  s3  e  0.586467
Answer:
x
###
     S  A         V
0   s1  a  0.490194
1   s1  b  0.875381
2   s1  c  0.384808
3   s1  d  0.063960
4   s1  e  0.003159
5   s2  a  0.188624
6   s2  b  0.400527
7   s2  c  0.137458
8   s2  d  0.162291
9   s2  e  0.337899
10  s3  a  0.101296
11  s3  b  0.464031
12  s3  c  0.407629
13  s3  d  0.222498
14  s3  e  0.802472
This works regardless of whether the values are separated by ', ', ' ,' or ','.
y
###
    S      A_D
0  s1      all
1  s2  a, d ,c
2  s3      c,d
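# Expand 'all' into the full list of A values so it is handled like any other rule.
y['A_D'] = y['A_D'].replace('all', ', '.join(x['A'].unique()))
# Split on commas into one row per (S, A_D) pair, then strip surrounding whitespace.
y = y.assign(A_D=y['A_D'].str.split(',')).explode('A_D')
y['A_D'] = y['A_D'].str.strip()
# Keep only the rows of x whose (S, A) pair is not among the pairs to delete.
output = x[~x.set_index(['S','A']).index.isin(y.set_index(['S','A_D']).index)].reset_index(drop=True)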
output
###
    S  A         V
0  s2  b  0.400527
1  s2  e  0.337899
2  s3  a  0.101296
3  s3  b  0.464031
4  s3  e  0.802472
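The steps in this answer can also be wrapped in a function that leaves the caller's y unmodified. The following is only a sketch under some assumptions: the name drop_pairs and its all_token parameter are hypothetical, and the V column holds placeholder integers instead of the random values shown above, so that the result is reproducible.

```python
import pandas as pd


def drop_pairs(x, y, all_token="all"):
    """Return x without the (S, A) pairs listed in y (hypothetical helper)."""
    rules = y.copy()  # work on a copy so the caller's y is not modified
    # Expand the wildcard into every value of A that occurs in x.
    every_a = ",".join(x["A"].unique())
    rules["A_D"] = rules["A_D"].str.strip().replace(all_token, every_a)
    # One row per (S, A_D) pair; strip() absorbs ', ', ' ,' and ','.
    rules = rules.assign(A_D=rules["A_D"].str.split(",")).explode("A_D")
    rules["A_D"] = rules["A_D"].str.strip()
    # Compare the pairs as MultiIndex entries and keep the non-matching rows.
    to_drop = pd.MultiIndex.from_frame(rules[["S", "A_D"]])
    keep = ~pd.MultiIndex.from_frame(x[["S", "A"]]).isin(to_drop)
    return x[keep].reset_index(drop=True)


x = pd.DataFrame({
    "S": ["s1"] * 5 + ["s2"] * 5 + ["s3"] * 5,
    "A": list("abcde") * 3,
    "V": range(15),  # placeholder values, not the random data above
})
y = pd.DataFrame({"S": ["s1", "s2", "s3"],
                  "A_D": ["all", "a, d ,c", "c,d"]})

result = drop_pairs(x, y)
print(result)
```

With this y, the remaining pairs should be (s2, b), (s2, e), (s3, a), (s3, b) and (s3, e), matching the output shown above.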
Answer:
Here is my solution; it is at least shorter :)
def filter_group(group, filter_rule):
    return (None if filter_rule == 'all'
            else group[~group["A"].isin(filter_rule.replace(' ', '').split(','))])
x_d = pd.concat(filter_group(x.groupby('S').get_group(grp), filter_rule)
                for grp, filter_rule in dict(zip(y["S"], y["A_D"])).items())
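For comparison, the merge-based approach from the question can be condensed into an anti-join: a left merge with indicator=True, keeping only the rows flagged left_only. This is a sketch and not part of either answer. The names rules, whole_groups and kept are made up for illustration, and V again holds placeholder integers instead of random values.

```python
import pandas as pd

x = pd.DataFrame({
    "S": ["s1"] * 5 + ["s2"] * 5 + ["s3"] * 5,
    "A": list("abcde") * 3,
    "V": range(15),  # placeholder values for a reproducible result
})
y = pd.DataFrame({"S": ["s1", "s2", "s3"],
                  "A_D": ["all", "a, d", "c"]})

# One row per (S, A) rule, with whitespace around each value removed.
rules = y.assign(A=y["A_D"].str.split(",")).explode("A", ignore_index=True)
rules = rules.assign(A=rules["A"].str.strip())[["S", "A"]].drop_duplicates()

# Groups flagged with 'all' are removed wholesale.
whole_groups = rules.loc[rules["A"] == "all", "S"]
kept = x[~x["S"].isin(whole_groups)]

# Anti-join: rows of kept without a matching rule are marked 'left_only'.
merged = kept.merge(rules, on=["S", "A"], how="left", indicator=True)
x_d = (merged[merged["_merge"] == "left_only"]
       .drop(columns="_merge")
       .reset_index(drop=True))
print(x_d)
```

Unlike the groupby version above, this keeps any S group of x that does not appear in y at all, which may or may not be the desired behavior.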