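I have a DataFrame, and I want to compute the mean of the third quintile within each group. My approach is to write a custom function and apply it to each group, but it does not work as expected. Code: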
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': pd.Series(np.array(range(20))), 'B': ['a','a','a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b']})
def func_mean_quintile(df):
    # Make sure data is in DataFrame
    df = pd.DataFrame(df)
    df['pct'] = pd.to_numeric(pd.cut(df.iloc[:,0], 5, labels=np.r_[1:6]))
    avg = df[df['pct'] == 3].iloc[:,0].mean()
    return np.full((len(df)), avg)
df['C'] = df.groupby('B')['A'].apply(func_mean_quintile)
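The result is NaN in every row of column C.
What am I doing wrong?
Also, if you know how to make the custom function perform better, please help.
Thanks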
A uj5u.com user replied:
Suggested solution without a function
You don't need a function; this should do the calculation:
q_lo = 0.4  # start of third quintile
q_hi = 0.6  # end of third quintile
(df.groupby('B')
   .apply(lambda g: g.assign(C=g.loc[(g['A'] >= g['A'].quantile(q_lo)) & (g['A'] < g['A'].quantile(q_hi)), 'A'].mean()))
   .reset_index(drop=True)
)
Output:
     A  B     C
0    0  a   4.5
1    1  a   4.5
2    2  a   4.5
3    3  a   4.5
4    4  a   4.5
5    5  a   4.5
6    6  a   4.5
7    7  a   4.5
8    8  a   4.5
9    9  a   4.5
10  10  b  14.5
11  11  b  14.5
12  12  b  14.5
13  13  b  14.5
14  14  b  14.5
15  15  b  14.5
16  16  b  14.5
17  17  b  14.5
18  18  b  14.5
19  19  b  14.5
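The same quantile-window idea can also be written with transform, which keeps the original row order without a reset_index. This is only a sketch: the helper name third_quintile_mean and its q_lo / q_hi defaults are made up for illustration, and it relies on transform broadcasting a per-group scalar back to that group's rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(20), "B": ["a"] * 10 + ["b"] * 10})

def third_quintile_mean(s, q_lo=0.4, q_hi=0.6):
    """Hypothetical helper: mean of the values in [q_lo, q_hi) quantile range of one group."""
    lo, hi = s.quantile(q_lo), s.quantile(q_hi)
    return s[(s >= lo) & (s < hi)].mean()

# transform calls the helper once per group and repeats the returned
# scalar for every row of that group, aligned to df's own index
df["C"] = df.groupby("B")["A"].transform(third_quintile_mean)
print(df)
```

Because the result is already aligned to df.index, it can be assigned directly as a new column.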
Your original solution
Your original solution also works if you replace the line df['C'] = ... with
df['C'] = df.groupby('B')['A'].transform(func_mean_quintile)
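A likely explanation for the NaN values, sketched below under the assumption of a reasonably recent pandas: apply returns one entry per group, indexed by the group keys, so assigning it to df['C'] aligns on labels that do not exist in df's row index. transform returns a result indexed like the original rows. The function here is a simplified stand-in for the asker's func_mean_quintile, not the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(20), "B": ["a"] * 10 + ["b"] * 10})

def func_mean_quintile(s):
    # Five equal-width bins numbered 1..5 over the group's range;
    # average the values that fall in bin 3
    bins = pd.cut(s, 5, labels=False) + 1
    return np.full(len(s), s[bins == 3].mean())

applied = df.groupby("B")["A"].apply(func_mean_quintile)
transformed = df.groupby("B")["A"].transform(func_mean_quintile)

# apply: indexed by group key, so it cannot line up with rows 0..19
print(applied.index.tolist())
print(df.assign(C=applied)["C"].isna().all())

# transform: indexed like df, so plain assignment works
print(transformed.index.equals(df.index))
```

The first two prints show the mismatch between the group-key index and the row index; the last one confirms that the transform result lines up with df.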
A uj5u.com user replied:
Do it like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': pd.Series(np.array(range(20))), 'B': ['a','a','a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b']})
def func_mean_quintile(df):
    # Make sure data is in DataFrame
    df = pd.DataFrame(df)
    df['pct'] = pd.to_numeric(pd.cut(df.iloc[:,0], 5, labels=np.r_[1:6]))
    avg = df[df['pct'] == 3].iloc[:,0].mean()
    return np.full((len(df)), avg)
# One array per group, indexed by the group key ('a', 'b')
means = df.groupby('B')['A'].apply(func_mean_quintile)
# Create column C first, then fill it group by group with .loc
# (df['C'][mask] = ... fails because C does not exist yet, and is chained assignment)
df['C'] = np.nan
df.loc[df['B'] == 'a', 'C'] = means['a']
df.loc[df['B'] == 'b', 'C'] = means['b']
This will give you the desired output.
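A uj5u.com user replied:
I think it is easier if you split it into two separate steps. First, label each data point with the quantile it falls in. Second, simply aggregate per quantile.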
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "a": pd.Series(np.array(range(20))),
        "b": ["a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"],
    }
)
df["a_quantile"] = pd.cut(df.a, bins=4, labels=["q1", "q2", "q3", "q4"])
df_agg = df.groupby("a_quantile").agg({"a": ["mean"]})
df_agg.head()
The aggregated result looks like this:
Out[9]:
               a
            mean
a_quantile
q1             2
q2             7
q3            12
q4            17
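Note that the snippet above splits the whole column into four equal-width bins and ignores column b, whereas the question asks for the third quintile within each group. A rough adaptation of the same two-step idea is sketched here; it swaps in pd.qcut (quantile-based bins) applied per group, and the column name a_quintile is chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(20), "b": ["a"] * 10 + ["b"] * 10})

# Step 1: label each row with its quintile (0..4) within its own group
df["a_quintile"] = df.groupby("b")["a"].transform(
    lambda s: pd.qcut(s, 5, labels=False)
)

# Step 2: keep only the third quintile (label 2) and average it per group
third = df[df["a_quintile"] == 2].groupby("b")["a"].mean()
print(third)
```

With labels=False, qcut returns integer bin codes starting at 0, so the third quintile is the rows labelled 2.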
