我有以下資料框:
df = pd.DataFrame({'DEC': ['Food','Kyl','Food','Fashion'],
'JAN': ['Food','Kyl','Kyl','Food'],
'FEB': ['Food','Kyl','Kyl','Food'],
'MAR': ['Food','Kyl','Kyl','Fashion'],
'COUNTS': [4988, 1976, 797, 613]})
DEC JAN FEB MAR COUNTS
Food Food Food Food 4988
Kyl Kyl Kyl Kyl 1976
Food Kyl Kyl Kyl 797
Fashion Food Food Fashion 613
我想要做的是對 DEC-MAR 的四列進行每行比較,并創建一個新列,如果它們都相同,則輸出值,或者在存在差異的情況下輸出 2-3-4 值。
因此,所需的輸出 DataFrame 是
DEC JAN FEB MAR COUNTS COMPARISON
Food Food Food Food 4988 Food
Kyl Kyl Kyl Kyl 1976 Kyl
Food Kyl Kyl Kyl 797 Food-Kyl
Fashion Food Food Fashion 61 Fashion-Food
我想我可以通過定義一個使用np.where并進行比較的函式來做到這一點,但是要進行的比較太多以至于效率不高(即 Col1==Col2 或 Col1==Col3 或 Col1==Col4 或Col2==Cole3 等....)
這是我開始但由于所有手動作業而放棄的,希望有更好的解決方案:
def group_combs(row):
if (row['DEC'] == row['JAN']) &
(row['DEC'] == row['JAN']) &
(row['DEC'] == row['JAN']):
val = row['DEC']
elif row['FEB'] == row['MAR']:
val = row['FEB']
else:
val = -1
return val
除了手動比較所有列之外,還有更好的方法嗎?
提前致謝!!
uj5u.com熱心網友回復:
對去重連接值使用串列推導以提高性能:
df['COMPARISON'] = ['-'.join(dict.fromkeys(x))for x in df.drop('COUNTS', axis=1).to_numpy()]
print (df)
DEC JAN FEB MAR COUNTS COMPARISON
0 Food Food Food Food 4988 Food
1 Kyl Kyl Kyl Kyl 1976 Kyl
2 Food Kyl Kyl Kyl 797 Food-Kyl
3 Fashion Food Food Fashion 613 Fashion-Food
時間:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [135]: %timeit df['COMPARISON'] = (df.drop(columns='COUNTS').agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1))
176 ms ± 6.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [136]: %timeit df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
15.9 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [138]: %timeit df['COMPARISON'] = (df.drop(columns='COUNTS').agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1))
1.68 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [139]: %timeit df['COMPARISON'] = ['-'.join(dict.fromkeys(x))for x in df.drop('COUNTS', axis=1).to_numpy()]
87.7 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
uj5u.com熱心網友回復:
洗掉不需要的列并使用 和drop聚合唯一值(按順序),然??后將它們回傳:groupbyaggdict.fromkeysjoin
df['COMPARISON'] = (df
.drop(columns='COUNTS')
.agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1)
)
輸出:
DEC JAN FEB MAR COUNTS COMPARISON
0 Food Food Food Food 4988 Food
1 Kyl Kyl Kyl Kyl 1976 Kyl
2 Food Kyl Kyl Kyl 797 Food-Kyl
3 Fashion Food Food Fashion 613 Fashion-Food
轉載請註明出處,本文鏈接:https://www.uj5u.com/ruanti/456662.html
