I created different bins for each column and grouped the DataFrame by them.
import pandas as pd
import numpy as np
np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'value'])
# for simplicity, I use the same bin here
bins = np.arange(-3, 4, 0.05)
df['a_bins'] = pd.cut(df['a'], bins=bins)
df['b_bins'] = pd.cut(df['b'], bins=bins)
df['c_bins'] = pd.cut(df['c'], bins=bins)
The output of df.groupby(['a_bins','b_bins','c_bins']).size() shows that there are 2685619 groups.
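That group count follows from the bin edges alone: in the run reported here, groupby produced one group for every combination of the three columns' categories, whether or not any row falls into it. A minimal sketch of the arithmetic, assuming the same bins as above (n_bins and n_groups are illustrative names, not from the original post):

```python
import numpy as np

bins = np.arange(-3, 4, 0.05)   # same bin edges as in the question
n_bins = len(bins) - 1          # k edges delimit k - 1 intervals
n_groups = n_bins ** 3          # one group per (a_bin, b_bin, c_bin) combination
print(n_bins, n_groups)
```

With only 100 rows in the DataFrame, almost all of these groups are empty, which is where the time goes.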
Computing statistics for each group
The statistics for each group are then computed like this:
%%timeit
df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']})
>>> 16.9 s ± 637 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Expected output
- Is there any way to speed this up?
- The faster approach should also support lookup by input a, b, and c values, like this:
df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']}).loc[(-1.72, 0.32, 1.18)]
>>> -0.252436
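The lookup above relies on pandas resolving a plain number to the interval that contains it. A simplified sketch of that behaviour on a single-level IntervalIndex (toy and intervals are made-up names; the real result has three categorical interval levels, which this sketch does not reproduce):

```python
import pandas as pd

# Two right-closed intervals: (0, 1] and (1, 2]
intervals = pd.IntervalIndex.from_breaks([0, 1, 2])
toy = pd.Series([10.0, 20.0], index=intervals)

# A scalar label is matched to the interval that contains it
hit = toy.loc[0.5]
print(hit)
```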
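A uj5u.com user replied:
This is a good use case for scipy.stats.binned_statistic_dd. The snippet below computes only the mean, but many other statistics are supported (see the function's documentation):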
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
# for simplicity, I use the same bin here
bins = np.arange(-3, 4, 0.05)
df["a_bins"] = pd.cut(df["a"], bins=bins)
df["b_bins"] = pd.cut(df["b"], bins=bins)
df["c_bins"] = pd.cut(df["c"], bins=bins)
# this takes about 35 seconds
result_pandas = df.groupby(["a_bins", "b_bins", "c_bins"]).agg({"value": ["mean"]})
from scipy.stats import binned_statistic_dd
# this takes about 20 ms
result_scipy = binned_statistic_dd(
df[["a", "b", "c"]].to_numpy(), df["value"], bins=(bins, bins, bins)
)
# this is a verbose way to get a dataframe representation
# for many purposes this probably will not be needed
# takes about 5 seconds
temp_list = []
for na, a in enumerate(result_scipy[1][0][:-1]):
    for nb, b in enumerate(result_scipy[1][1][:-1]):
        for nc, c in enumerate(result_scipy[1][2][:-1]):
            value = result_scipy[0][na, nb, nc]
            temp_list.append([a, b, c, value])
result_scipy_as_df = pd.DataFrame(temp_list, columns=list("abcx"))
# check that the result is the same
result_scipy_as_df["x"].describe() == result_pandas["value"]["mean"].describe()
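The nested loops that build the dataframe representation can probably be avoided. The following is a sketch, not part of the original answer, that assumes the same data and bins and builds the same a/b/c/x layout in one step with pd.MultiIndex.from_product (left_edges, flat and flat_df are illustrative names):

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic_dd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
bins = np.arange(-3, 4, 0.05)

# The result unpacks as (statistic, bin_edges, binnumber); the default statistic is the mean
stat, edges, _ = binned_statistic_dd(
    df[["a", "b", "c"]].to_numpy(), df["value"].to_numpy(), bins=(bins, bins, bins)
)

# Left edge of every bin, per dimension (the same values the loops iterate over)
left_edges = [e[:-1] for e in edges]

# from_product varies the last level fastest, which matches the C-order ravel of stat
index = pd.MultiIndex.from_product(left_edges, names=["a", "b", "c"])
flat = pd.Series(stat.ravel(), index=index, name="x")

# Same columns as result_scipy_as_df: a, b, c, x
flat_df = flat.reset_index()
```

Empty bins hold NaN, so flat.dropna() keeps only the bins that actually contain data.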
If you are interested in speeding this up even further, this answer may be useful.
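The question also asks for lookup by raw a, b, and c values. One possible way to do that on top of the binned_statistic_dd result is to locate each value in the bin edges with np.searchsorted. This is a sketch under the same data and bins; lookup_mean is a hypothetical helper, not a scipy function:

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic_dd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
bins = np.arange(-3, 4, 0.05)

stat, edges, _ = binned_statistic_dd(
    df[["a", "b", "c"]].to_numpy(), df["value"].to_numpy(), bins=(bins, bins, bins)
)


def lookup_mean(a, b, c):
    """Return the mean of the bin containing (a, b, c), or NaN if there is none."""
    idx = []
    for e, v in zip(edges, (a, b, c)):
        # Index i such that e[i] <= v < e[i + 1]
        i = int(np.searchsorted(e, v, side="right")) - 1
        # Anything outside the edges (including the outermost right edge
        # itself) is treated as out of range in this sketch
        if i < 0 or i >= len(e) - 1:
            return np.nan
        idx.append(i)
    return stat[tuple(idx)]


print(lookup_mean(-1.72, 0.32, 1.18))
```

Note that scipy bins are closed on the left, while pd.cut bins are closed on the right by default, so a value lying exactly on a bin edge may land in a different bin than in the pandas version.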
A uj5u.com user replied:
Because the bins are the same for all 3 of your columns, use the codes from the cat accessor:
%timeit df.groupby([df['a_bins'].cat.codes, df['b_bins'].cat.codes, df['c_bins'].cat.codes])['value'].mean()
1.82 ms ± 27.6 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
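Grouping by codes replaces the interval labels with integers, so lookup by raw values needs one extra step: convert the query values to codes with the same bins. A sketch of one way to do this, assuming the same data as above (codes, in_range, means_by_code and lookup_by_value are illustrative names). It also drops rows whose code is -1, which is how pandas marks values outside the bins:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
bins = np.arange(-3, 4, 0.05)
cols = ["a", "b", "c"]

# Integer bin code per column; values outside the bins get code -1
codes = pd.DataFrame({c: pd.cut(df[c], bins=bins).cat.codes for c in cols})
in_range = (codes >= 0).all(axis=1)

# Mean of "value" per (a_code, b_code, c_code), using only rows inside the bins
means_by_code = (
    df.loc[in_range, "value"]
    .groupby([codes.loc[in_range, c] for c in cols])
    .mean()
)


def lookup_by_value(a, b, c):
    """Translate raw values to bin codes, then look up the group mean."""
    key = tuple(int(pd.cut([v], bins=bins).codes[0]) for v in (a, b, c))
    try:
        return means_by_code.loc[key]
    except KeyError:
        # No data in that bin, or at least one value outside the bins
        return np.nan


print(lookup_by_value(-1.72, 0.32, 1.18))
```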
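A uj5u.com user replied:
For this data, I suggest pivoting the data and then taking the mean. This is usually faster, because you are operating on the whole dataframe at once rather than iterating through each group:
(df
 # keyword arguments; the index is left at its default (the existing row index)
 .pivot(columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index() # ignore this if you are not fussy about order
)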
a_bins         b_bins          c_bins
(-2.15, -2.1]  (0.25, 0.3]     (-1.3, -1.25]     0.929100
               (0.75, 0.8]     (-0.3, -0.25]     0.480411
(-2.05, -2.0]  (-0.1, -0.05]   (0.3, 0.35]      -1.684900
               (0.75, 0.8]     (-0.25, -0.2]    -1.184411
(-2.0, -1.95]  (-0.6, -0.55]   (-1.2, -1.15]    -0.021176
                                                    ...
(1.7, 1.75]    (-0.75, -0.7]   (1.05, 1.1]      -0.229518
(1.85, 1.9]    (-0.4, -0.35]   (1.8, 1.85]       0.003017
(1.9, 1.95]    (-1.45, -1.4]   (0.1, 0.15]       0.949361
(2.05, 2.1]    (-0.35, -0.3]   (-0.65, -0.6]     0.763184
(2.25, 2.3]    (-0.95, -0.9]   (0.1, 0.15]       2.539432
This matches the output of groupby:
(df
 .groupby(['a_bins','b_bins','c_bins'])
 .agg({'value':['mean']})
 .dropna()
 .squeeze()
)
a_bins         b_bins          c_bins
(-2.15, -2.1]  (0.25, 0.3]     (-1.3, -1.25]     0.929100
               (0.75, 0.8]     (-0.3, -0.25]     0.480411
(-2.05, -2.0]  (-0.1, -0.05]   (0.3, 0.35]      -1.684900
               (0.75, 0.8]     (-0.25, -0.2]    -1.184411
(-2.0, -1.95]  (-0.6, -0.55]   (-1.2, -1.15]    -0.021176
                                                    ...
(1.7, 1.75]    (-0.75, -0.7]   (1.05, 1.1]      -0.229518
(1.85, 1.9]    (-0.4, -0.35]   (1.8, 1.85]       0.003017
(1.9, 1.95]    (-1.45, -1.4]   (0.1, 0.15]       0.949361
(2.05, 2.1]    (-0.35, -0.3]   (-0.65, -0.6]     0.763184
(2.25, 2.3]    (-0.95, -0.9]   (0.1, 0.15]       2.539432
Name: (value, mean), Length: 100, dtype: float64
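The pivot option took 3.72 ms on my PC, whereas I had to kill the groupby option because it was taking too long (my PC is old :))
Again, the reason this works and is faster is that the mean is applied to the whole dataframe at once, rather than going through the groups one by one in the groupby.
As for your other question, you can index it easily:
bin_mean = (df
    # keyword arguments; the index is left at its default (the existing row index)
    .pivot(columns=['a_bins', 'b_bins', 'c_bins'], values='value')
    .mean()
    .sort_index() # ignore this if you are not fussy about order
)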
bin_mean.loc[(-1.72, 0.32, 1.18)]
-0.25243603652138985
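The main issue is that, for categoricals, Pandas returns a row for every combination of categories (which is wasteful and inefficient); pass observed=True and you should notice a significant improvement:
(df.groupby(['a_bins','b_bins','c_bins'], observed=True)
   .agg({'value':['mean']})
)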
                                                    value
                                                     mean
a_bins         b_bins          c_bins
(-2.15, -2.1]  (0.25, 0.3]     (-1.3, -1.25]     0.929100
               (0.75, 0.8]     (-0.3, -0.25]     0.480411
(-2.05, -2.0]  (-0.1, -0.05]   (0.3, 0.35]      -1.684900
               (0.75, 0.8]     (-0.25, -0.2]    -1.184411
(-2.0, -1.95]  (-0.6, -0.55]   (-1.2, -1.15]    -0.021176
...                                                   ...
(1.7, 1.75]    (-0.75, -0.7]   (1.05, 1.1]      -0.229518
(1.85, 1.9]    (-0.4, -0.35]   (1.8, 1.85]       0.003017
(1.9, 1.95]    (-1.45, -1.4]   (0.1, 0.15]       0.949361
(2.05, 2.1]    (-0.35, -0.3]   (-0.65, -0.6]     0.763184
(2.25, 2.3]    (-0.95, -0.9]   (0.1, 0.15]       2.539432
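On my PC this takes about 7.39 ms, roughly 2x slower than the pivot option but far faster than before, because only the categories that actually occur in the dataframe are used and returned.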
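The effect of observed is easier to see on a tiny example. The sketch below uses made-up data (small, edges and keys are illustrative names) with 3 bins per column, and sets observed explicitly in both calls because the default differs between pandas versions:

```python
import pandas as pd

small = pd.DataFrame({
    "a": [0.1, 0.2, 2.5],
    "b": [0.5, 0.6, 1.5],
    "c": [2.1, 2.2, 0.3],
    "value": [1.0, 3.0, 5.0],
})
edges = [0, 1, 2, 3]  # 3 bins per column: (0, 1], (1, 2], (2, 3]
keys = [pd.cut(small[c], bins=edges) for c in "abc"]

# observed=False: one row per combination of categories, 3 * 3 * 3 in total
all_combos = small.groupby(keys, observed=False)["value"].mean()

# observed=True: only the combinations that occur in the data
seen_only = small.groupby(keys, observed=True)["value"].mean()

print(len(all_combos), len(seen_only))
```

The first two rows share a bin combination, so seen_only has two entries, while all_combos carries the same two values plus NaN for every empty combination.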
