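As part of feature engineering, I want to use the per-group counts of column values after a groupby as features for a model. This is what I have tried: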
>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
  col1  col2  col3
0    a  val1  val3
1    b  val2  val4
2    a  val2  val3
3    c  val1  val4
4    a  val2  val3
5    b  val2  val4
>>> test = df.groupby('col1').agg(list)
>>> test
                    col2                col3
col1
a     [val1, val2, val2]  [val3, val3, val3]
b           [val2, val2]        [val4, val4]
c                 [val1]              [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
                        col2         col3
col1
a     {'val1': 1, 'val2': 2}  {'val3': 3}
b                {'val2': 2}  {'val4': 2}
c                {'val1': 1}  {'val4': 1}
After that I can expand the dictionaries into separate columns, so the final output is:
>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
>>> final
   val1  val2  val3  val4
a   1.0   2.0   3.0   NaN
b   NaN   2.0   NaN   2.0
c   1.0   NaN   NaN   1.0
I feel there must be a simpler solution. Any help is appreciated.
A uj5u.com user replied:
Yes, use melt followed by crosstab:
df2 = df.melt(id_vars='col1', value_name='count')
pd.crosstab(df2['col1'], df2['count'])
Output:
count  val1  val2  val3  val4
col1
a         1     2     3     0
b         0     2     0     2
c         1     0     0     1
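To see why this works, it can help to look at the intermediate long-format frame. The following is a minimal sketch that reuses the question's data and only the calls from the answer above; the `variable` column name is simply the pandas default for `melt`, not something chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'c', 'a', 'b'],
    'col2': ['val1', 'val2', 'val2', 'val1', 'val2', 'val2'],
    'col3': ['val3', 'val4', 'val3', 'val4', 'val3', 'val4'],
})

# melt stacks col2 and col3 into one long column and repeats col1 as the key,
# so every (col1, value) pair becomes its own row.
df2 = df.melt(id_vars='col1', value_name='count')
print(df2.shape)           # 6 original rows x 2 melted columns = 12 rows
print(list(df2.columns))   # key column, source column name, melted value

# crosstab then counts how often each value occurs for each key.
# Note: df2['count'] must be used, since df2.count is a DataFrame method.
counts = pd.crosstab(df2['col1'], df2['count'])
print(counts)
```

Because `col2` and `col3` end up in the same column, identical strings that appear in both would be counted together; in this data the two columns share no values, so that does not matter.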
If you want NaN instead of 0:
df3 = pd.crosstab(df2['col1'], df2['count'])
df3.mask(df3.eq(0))
Output:
count  val1  val2  val3  val4
col1
a       1.0   2.0   3.0   NaN
b       NaN   2.0   NaN   2.0
c       1.0   NaN   NaN   1.0
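Since the goal in the question is feature engineering, the two variants above could be wrapped in one small function and the counts joined back onto the rows. This is only a sketch under assumptions: the helper name `group_value_counts`, its `keep_zeros` flag, and the `value` column name are hypothetical and not part of pandas or of the answer.

```python
import pandas as pd

def group_value_counts(df, key, keep_zeros=True):
    """Count each value of the non-key columns per group (hypothetical helper)."""
    # Long format: one row per (key, value) pair.
    long = df.melt(id_vars=key, value_name='value')
    counts = pd.crosstab(long[key], long['value'])
    if not keep_zeros:
        # Same trick as in the answer: turn zero counts into NaN.
        counts = counts.mask(counts.eq(0))
    return counts

df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'c', 'a', 'b'],
    'col2': ['val1', 'val2', 'val2', 'val1', 'val2', 'val2'],
    'col3': ['val3', 'val4', 'val3', 'val4', 'val3', 'val4'],
})

features = group_value_counts(df, 'col1')
sparse_features = group_value_counts(df, 'col1', keep_zeros=False)

# Attach the per-group counts to every original row via the key column.
with_features = df.join(features, on='col1')
print(with_features)
```

Keeping zeros is usually the more convenient default for a model input, because the columns stay integer-typed; masking to NaN converts them to floats.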
A uj5u.com user replied:
df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
print(df)
Output:
col2  val1  val2  val3  val4
col1
a      1.0   2.0   3.0   NaN
b      NaN   2.0   NaN   2.0
c      1.0   NaN   NaN   1.0
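The same stack-then-count idea can also be written with `groupby(...).size().unstack()` in place of `pivot_table`. This is a swapped-in alternative rather than the method from the answer above, sketched on the question's data; the `value` column name and the `stacked` variable are chosen here for illustration, and the original `df` is left unmodified.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'c', 'a', 'b'],
    'col2': ['val1', 'val2', 'val2', 'val1', 'val2', 'val2'],
    'col3': ['val3', 'val4', 'val3', 'val4', 'val3', 'val4'],
})

# Stack col2 and col3 under one shared column name so they can be counted together.
stacked = pd.concat([
    df[['col1', 'col2']].rename(columns={'col2': 'value'}),
    df[['col1', 'col3']].rename(columns={'col3': 'value'}),
])

# size() counts rows per (col1, value) pair; unstack() spreads the values
# into columns and leaves NaN where a pair never occurs.
counts = stacked.groupby(['col1', 'value']).size().unstack()
print(counts)
```

If zeros are preferred over NaN, `unstack(fill_value=0)` keeps the result integer-typed.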
Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/360259.html
Tags: python-3.x pandas dataframe pandas-groupby
