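I need to generate a column a_b from column a and column b of df: if both a and b are greater than 0, a_b is assigned 1; if both a and b are less than 0, a_b is assigned -1; otherwise it is 0. I am using a double np.where.
My code is below, where generate_data generates demo data and get_result is used in production, where it needs to run 4 million times: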
import numpy as np
import pandas as pd

rand = np.random.default_rng(seed=0)
pd.set_option('display.max_columns', None)


def generate_data() -> pd.DataFrame:
    _df = pd.DataFrame(rand.uniform(-1, 1, size=(10, 7)),
                       columns=['a', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6'])
    return _df


def get_result(_df: pd.DataFrame) -> pd.DataFrame:
    a = _df.a.to_numpy()
    for col in ['b1', 'b2', 'b3', 'b4', 'b5', 'b6']:
        b = _df[col].to_numpy()
        # 1 if both positive, -1 if both negative, 0 otherwise
        _df[f'a_{col}'] = np.where(
            (a > 0) & (b > 0), 1., np.where(
                (a < 0) & (b < 0), -1., 0.)
        )
    return _df


def main():
    df = generate_data()
    print(df)
    df = get_result(df)
    print(df)


if __name__ == '__main__':
    main()
Data generated by generate_data:
a b1 b2 b3 b4 b5 b6
0 0.273923 -0.460427 -0.918053 -0.966945 0.626540 0.825511 0.213272
1 0.458993 0.087250 0.870145 0.631707 -0.994523 0.714809 -0.932829
2 0.459311 -0.648689 0.726358 0.082922 -0.400576 -0.154626 -0.943361
3 -0.751433 0.341249 0.294379 0.230770 -0.232645 0.994420 0.961671
4 0.371084 0.300919 0.376893 -0.222157 -0.729807 0.442977 0.050709
5 -0.379516 -0.028329 0.778976 0.868087 -0.284410 0.143060 -0.356261
6 0.188600 -0.324178 -0.216762 0.780549 -0.545685 0.246374 -0.831969
7 0.665288 0.574197 -0.521261 0.752968 -0.882864 -0.327766 -0.699441
8 -0.099321 0.592649 -0.538716 -0.895957 -0.190896 -0.602974 -0.818494
9 0.160665 -0.402608 0.343990 -0.600969 0.884226 -0.269780 -0.789009
The result I want:
a b1 b2 b3 b4 b5 b6 a_b1 \
0 0.273923 -0.460427 -0.918053 -0.966945 0.626540 0.825511 0.213272 0.0
1 0.458993 0.087250 0.870145 0.631707 -0.994523 0.714809 -0.932829 1.0
2 0.459311 -0.648689 0.726358 0.082922 -0.400576 -0.154626 -0.943361 0.0
3 -0.751433 0.341249 0.294379 0.230770 -0.232645 0.994420 0.961671 0.0
4 0.371084 0.300919 0.376893 -0.222157 -0.729807 0.442977 0.050709 1.0
5 -0.379516 -0.028329 0.778976 0.868087 -0.284410 0.143060 -0.356261 -1.0
6 0.188600 -0.324178 -0.216762 0.780549 -0.545685 0.246374 -0.831969 0.0
7 0.665288 0.574197 -0.521261 0.752968 -0.882864 -0.327766 -0.699441 1.0
8 -0.099321 0.592649 -0.538716 -0.895957 -0.190896 -0.602974 -0.818494 0.0
9 0.160665 -0.402608 0.343990 -0.600969 0.884226 -0.269780 -0.789009 0.0
a_b2 a_b3 a_b4 a_b5 a_b6
0 0.0 0.0 1.0 1.0 1.0
1 1.0 1.0 0.0 1.0 0.0
2 1.0 1.0 0.0 0.0 0.0
3 0.0 0.0 -1.0 0.0 0.0
4 1.0 0.0 0.0 1.0 1.0
5 0.0 0.0 -1.0 0.0 -1.0
6 0.0 1.0 0.0 1.0 0.0
7 0.0 1.0 0.0 0.0 0.0
8 -1.0 -1.0 -1.0 -1.0 -1.0
9 1.0 0.0 1.0 0.0 0.0
Performance:
%timeit get_result(df)
1.56 ms ± 54.7 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
How can I make this faster?
Reply from a uj5u.com user:
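Since you tagged this with performance, I recommend using numba with parallel computation, as below. (If we pass the values directly to the parallel function, we can get down to 3.35 μs.)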
import numpy as np
import numba as nb
import pandas as pd


@nb.njit(parallel=True)
def parallel_fun(vals):
    # Column 0 is a; the remaining columns are b1..b6
    n_rows, n_cols = vals.shape
    a = vals[:, 0]
    new_vals = np.empty((n_rows, n_cols - 1))
    for i in nb.prange(n_cols - 1):
        b = vals[:, i + 1]
        for j in nb.prange(n_rows):
            val = 0
            if (a[j] > 0) and (b[j] > 0):
                val = 1
            elif (a[j] < 0) and (b[j] < 0):
                val = -1
            new_vals[j, i] = val
    return new_vals


def get_result_3(_df: pd.DataFrame) -> pd.DataFrame:
    vals = _df[['a', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']].to_numpy()
    new_vals = parallel_fun(vals)
    # Returns only the new a_b* columns
    return pd.DataFrame(new_vals, columns=[f'a_{b}' for b in ['b1', 'b2', 'b3', 'b4', 'b5', 'b6']])


_df = pd.DataFrame(np.random.uniform(-1, 1, size=(10, 7)),
                   columns=['a', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6'])
vals = _df[['a', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']].to_numpy()
Benchmarks on Colab:
%timeit get_result_3(_df)
# 658 μs per loop
%timeit parallel_fun(vals)
# 3.35 μs per loop
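The gap between the two timings above suggests that most of the time in get_result_3 goes to pandas work around the kernel (selecting columns, converting to an array, building the output frame) rather than to the computation itself. A rough way to check this on your own machine is to time each stage separately. The sketch below is only an illustration: kernel is a plain NumPy stand-in for parallel_fun so that it runs without numba, and time_us, COLS and OUT_COLS are hypothetical helper names, not part of the answer.

```python
import timeit

import numpy as np
import pandas as pd

COLS = ['a', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
OUT_COLS = [f'a_{c}' for c in COLS[1:]]


def kernel(vals: np.ndarray) -> np.ndarray:
    """Plain NumPy stand-in for parallel_fun (not the numba version)."""
    sa = np.sign(vals[:, :1])                 # shape (n, 1), sign of column a
    return sa * (sa == np.sign(vals[:, 1:]))  # shape (n, 6)


def time_us(fn, number: int = 2000) -> float:
    """Average wall time of fn() in microseconds."""
    return timeit.timeit(fn, number=number) / number * 1e6


_df = pd.DataFrame(np.random.uniform(-1, 1, size=(10, 7)), columns=COLS)
vals = _df[COLS].to_numpy()
new_vals = kernel(vals)

# Time the three stages of get_result_3 one by one
print(f"extract : {time_us(lambda: _df[COLS].to_numpy()):7.1f} us")
print(f"kernel  : {time_us(lambda: kernel(vals)):7.1f} us")
print(f"wrap    : {time_us(lambda: pd.DataFrame(new_vals, columns=OUT_COLS)):7.1f} us")
```

If the extract and wrap stages dominate, keeping the data in NumPy arrays across the 4 million calls is likely to matter more than which kernel is used.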
Reply from a uj5u.com user:
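For a small DataFrame of shape (10, 7), vectorization brings little benefit, so I am not sure how much can be gained there. However, you can rewrite the code to make it more readable (though this may be subjective):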
def get_result2(_df: pd.DataFrame) -> pd.DataFrame:
    bcols = [c for c in _df.columns if c.startswith('b')]
    bcols_names = [f'a_{c}' for c in bcols]
    # Use the argument _df here, not a global df
    a_sign = np.sign(_df['a']).values.reshape(-1, 1)
    b_signs = np.sign(_df[bcols])
    # Signs match -> keep the sign of a; otherwise 0
    _df[bcols_names] = (b_signs == a_sign) * a_sign
    return _df
You can check that this gives the same result with:
# Both functions modify their argument in place, so compare on copies
x = get_result(df.copy())
y = get_result2(df.copy())
print(x.equals(y))
# True
However, in my tests this function did not give a consistent runtime improvement. My guess is that it may do better on larger datasets.
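To see where the crossover (if any) lies, one option is to time both formulations on plain NumPy arrays at a few sizes. This is only a sketch: double_where and sign_product are hypothetical names for the two approaches in this thread, and the row counts are arbitrary choices for illustration. No timings are claimed here; they will depend on the machine.

```python
import timeit

import numpy as np


def double_where(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Nested np.where, as in the question; a is (n, 1), b is (n, k)."""
    return np.where((a > 0) & (b > 0), 1.0,
                    np.where((a < 0) & (b < 0), -1.0, 0.0))


def sign_product(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Sign comparison: keep the sign of a where it matches the sign of b."""
    sa = np.sign(a)
    return sa * (sa == np.sign(b))


rng = np.random.default_rng(seed=0)
for n_rows in (10, 1_000, 100_000):  # arbitrary sizes, for illustration only
    a = rng.uniform(-1, 1, size=(n_rows, 1))
    b = rng.uniform(-1, 1, size=(n_rows, 6))
    # Both variants must agree before timing them is meaningful
    assert np.array_equal(double_where(a, b), sign_product(a, b))
    t_where = timeit.timeit(lambda: double_where(a, b), number=50) / 50
    t_sign = timeit.timeit(lambda: sign_product(a, b), number=50) / 50
    print(f"{n_rows:>7} rows: where {t_where * 1e6:9.1f} us, "
          f"sign {t_sign * 1e6:9.1f} us")
```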
Reply from a uj5u.com user:
Someone answered me with pure numpy:
import numpy as np

rand = np.random.default_rng(seed=0)
a = rand.uniform(low=-1, high=1, size=(10, 1))
b = rand.uniform(low=-1, high=1, size=(10, 6))


def signs():
    sa = np.sign(a)
    # a has shape (10, 1), so it broadcasts against all six b columns
    return sa * (sa == np.sign(b))


def main():
    signs()
    return


if __name__ == '__main__':
    main()
%timeit signs()
10.2 μs ± 678 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
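If the inputs still arrive as a DataFrame, the same idea can be wrapped roughly as follows. This is a sketch under assumptions: get_result_signs and B_COLS are hypothetical names, and unlike get_result it returns a new frame instead of modifying its argument. The `+ 0.0` is there because multiplying False by a negative sign yields -0.0, which pandas would print as -0.0.

```python
import numpy as np
import pandas as pd

B_COLS = ['b1', 'b2', 'b3', 'b4', 'b5', 'b6']


def get_result_signs(_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant of get_result built on the sign comparison."""
    a = _df[['a']].to_numpy()     # shape (n, 1), broadcasts against b
    b = _df[B_COLS].to_numpy()    # shape (n, 6)
    sa = np.sign(a)
    # Adding 0.0 turns any -0.0 (from False * -1.0) into 0.0
    out = sa * (sa == np.sign(b)) + 0.0
    new_cols = pd.DataFrame(out, index=_df.index,
                            columns=[f'a_{c}' for c in B_COLS])
    # All six columns are attached in one step rather than one per loop pass
    return pd.concat([_df, new_cols], axis=1)


rand = np.random.default_rng(seed=0)
df = pd.DataFrame(rand.uniform(-1, 1, size=(10, 7)), columns=['a'] + B_COLS)
print(get_result_signs(df))
```

Whether this beats the original loop on a 10-row frame is something to measure with %timeit; the pandas overhead of building the output may well outweigh the gain in the kernel.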
