How to iterate over column values for each group and keep track of the sum

I have 4 dataframes, as shown below

import pandas as pd

df_raw = pd.DataFrame(
    {'stud_id' : [101, 101,101],
     'prod_id':[12,13,16],
     'total_qty':[100,1000,80],
     'ques_date' : ['13/11/2020', '10/1/2018','11/11/2017']})

df_accu = pd.DataFrame(
    {'stud_id' : [101,101,101],
     'prod_id':[12,13,16],
     'accu_qty':[10,500,10],
     'accu_date' : ['13/08/2021','02/11/2019','17/12/2018']})

df_inv = pd.DataFrame(
    {'stud_id' : [101,101,101],
     'prod_id':[12,13,18],
     'inv_qty':[5,100,15],
     'inv_date' : ['16/02/2022', '22/11/2020','19/10/2019']})

df_bkl = pd.DataFrame(
    {'stud_id' : [101,101,101,101],
     'prod_id' :[12,12,12,17],
     'bkl_qty' :[15,40,2,10],
     'bkl_date':['16/01/2022', '22/10/2021','09/10/2020','25/06/2020']})

My goal is to find the following:

a) Get the date on which the threshold exceeds 50%

The threshold is given by the formula below

threshold = (((df_inv['inv_qty'] + df_bkl['bkl_qty'] + df_accu['accu_qty'])/df_raw['total_qty'])*100)

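The quantities have to be added in this exact order: first inv_qty, then bkl_qty, and finally accu_qty. We do this so that we can identify the correct date on which the running sum exceeds 50% of the total quantity. This also has to be computed for each stud_id and prod_id.

The problem is that df_bkl has multiple records for the same stud_id and prod_id. This is by design, and the real data looks like this too. In contrast, df_accu and df_inv have only one row per stud_id and prod_id.

In the formula above, every value of df_bkl['bkl_qty'] has to be used, one at a time, when computing the sum.

For example, let's take stud_id = 101 and prod_id = 12.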

His total_qty = 100, inv_qty = 5 and accu_qty = 10, but he has three bkl_qty values: 15, 40 and 2. So the threshold has to be computed as below

5 (value of inv_qty) + 15 (1st value of bkl_qty) + 40 (2nd value of bkl_qty) + 2 (3rd value of bkl_qty) + 10 (value of accu_qty)

With the above, we can see that his threshold exceeded 50% when his bkl_qty value was 40. That is, 5 + 15 + 40 = 60, which is greater than 50% of total_qty (100).
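The running sum in this example can be sketched in plain Python. This is only an illustration for the single pair stud_id = 101, prod_id = 12. The quantities and dates are copied from the sample dataframes, and the names steps, running and crossing are made up for the sketch.

```python
from itertools import accumulate

# Values for stud_id = 101, prod_id = 12, copied from the sample data.
total_qty = 100

# (quantity, date) pairs in the required order: inv_qty, each bkl_qty, accu_qty.
steps = [
    (5, '16/02/2022'),   # inv_qty
    (15, '16/01/2022'),  # 1st bkl_qty
    (40, '22/10/2021'),  # 2nd bkl_qty
    (2, '09/10/2020'),   # 3rd bkl_qty
    (10, '13/08/2021'),  # accu_qty
]

# Running total after each step.
running = list(accumulate(qty for qty, _ in steps))

# Date of the first step whose running total exceeds 50% of total_qty.
crossing = next(
    (date for (_, date), total in zip(steps, running) if total > total_qty / 2),
    None,
)

print(running)   # [5, 20, 60, 62, 72]
print(crossing)  # 22/10/2021
```

The last running total, 72, is the threshold value reported in the expected output, while the date comes from the step at which the total first passed 50.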

I was trying something like below

df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['bkl_qty'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100

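But this is incorrect, because each row only uses its own bkl_qty from df_bkl instead of accumulating the values one by one.

In this post I have shown only sample data with one stud_id = 101, but in reality I have thousands of stud_id and prod_id values.

So any elegant and efficient approach would be helpful. We have to apply this logic to a dataset with millions of records.

I expect my output to look like the following. Whenever the running sum exceeds 50% of total_qty, we need to get the corresponding date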
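To make the difference concrete, here is a small sketch that compares the row-wise formula with a version that accumulates bkl_qty down the rows. The frame merged is a hand-built stand-in for the three merged rows of stud_id = 101, prod_id = 12, not the real df_stage_3, and row_wise and running are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for the merged rows of stud_id = 101, prod_id = 12.
merged = pd.DataFrame({
    'inv_qty': [5, 5, 5],
    'bkl_qty': [15, 40, 2],
    'accu_qty': [10, 10, 10],
    'total_qty': [100, 100, 100],
})

# Row-wise formula: every row only sees its own bkl_qty.
row_wise = (merged['inv_qty'] + merged['bkl_qty'] + merged['accu_qty']) * 100 / merged['total_qty']

# Accumulated version: bkl_qty is summed down the rows before it is used.
running = (merged['inv_qty'] + merged['bkl_qty'].cumsum() + merged['accu_qty']) * 100 / merged['total_qty']

print(row_wise.tolist())  # [30.0, 55.0, 17.0]
print(running.tolist())   # [30.0, 70.0, 72.0]
```

Only the accumulated version ends at 72, the value in the expected output.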

stud_id,prod_id,total_qty,threshold,threshold_date
  101     12       100       72      22/10/2021

Answer from a uj5u.com user:

This can be done using groupby and cumsum, which computes a cumulative sum.
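As a quick illustration of what groupby followed by cumsum returns, the sketch below rebuilds only the key and quantity columns of df_bkl from the question. The running total restarts for each (stud_id, prod_id) pair and follows the existing row order.

```python
import pandas as pd

# Key and quantity columns of df_bkl, copied from the question.
df_bkl = pd.DataFrame(
    {'stud_id': [101, 101, 101, 101],
     'prod_id': [12, 12, 12, 17],
     'bkl_qty': [15, 40, 2, 10]})

# cumsum is computed separately inside every (stud_id, prod_id) group.
df_bkl['csum'] = df_bkl.groupby(['stud_id', 'prod_id'])['bkl_qty'].cumsum()

print(df_bkl['csum'].tolist())  # [15, 55, 57, 10]
```

Because the result depends on row order, the rows of df_bkl need to be in the intended accumulation order before this step.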

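# add cumulative sum column to df_bkl
df_bkl['csum'] = df_bkl.groupby(['stud_id','prod_id'])['bkl_qty'].cumsum()

# rebuild the merged frame from the question so that it contains the new csum column
df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])

# use df_bkl['csum'] to compute threshold instead of bkl_qty
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['csum'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
# if inv_qty alone already exceeds half of total_qty, use inv_date as the date
df_stage_3.loc[df_stage_3.inv_qty > df_stage_3.total_qty/2, 'bkl_date'] = df_stage_3['inv_date']

# next doing some filter and merge to arrive at the desired df
# threshold is a percentage, so it is compared with 50 rather than with total_qty/2
gt_thres = df_stage_3[df_stage_3['threshold'] > 50]
# smallest threshold above 50% marks the row where the limit was first crossed
df_f1 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].min().to_frame(name='threshold').reset_index()
# largest threshold is the final percentage to report
df_f2 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].max().to_frame(name='threshold_max').reset_index()

df = pd.merge(df_f1, df_stage_3, on=['stud_id','prod_id','total_qty','threshold'], how='inner')
df2 = pd.merge(df,df_f2, on=['stud_id','prod_id','total_qty'], how='inner')
# select threshold_max (not threshold) so that the reported value is the final percentage
df2 = df2[['stud_id','prod_id','total_qty','threshold_max','bkl_date']].rename(columns={'threshold_max':'threshold', 'bkl_date':'threshold_date'})
print(df2)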

This gives the output:

   stud_id  prod_id  total_qty  threshold threshold_date
0      101       12        100       72.0     22/10/2021

Does this work for you?
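For larger data, one possible alternative is to stack the three tables into a single table of events and take one grouped cumulative sum, which follows the inv_qty, bkl_qty, accu_qty order described in the question. This is a sketch under assumptions, not the code above: the helper as_events and the columns qty, date, stage, seq, running and pct are names invented for the illustration, and only the columns needed for the calculation are rebuilt from the sample data.

```python
import pandas as pd

df_raw = pd.DataFrame({'stud_id': [101, 101, 101], 'prod_id': [12, 13, 16],
                       'total_qty': [100, 1000, 80]})
df_accu = pd.DataFrame({'stud_id': [101, 101, 101], 'prod_id': [12, 13, 16],
                        'accu_qty': [10, 500, 10],
                        'accu_date': ['13/08/2021', '02/11/2019', '17/12/2018']})
df_inv = pd.DataFrame({'stud_id': [101, 101, 101], 'prod_id': [12, 13, 18],
                       'inv_qty': [5, 100, 15],
                       'inv_date': ['16/02/2022', '22/11/2020', '19/10/2019']})
df_bkl = pd.DataFrame({'stud_id': [101, 101, 101, 101], 'prod_id': [12, 12, 12, 17],
                       'bkl_qty': [15, 40, 2, 10],
                       'bkl_date': ['16/01/2022', '22/10/2021', '09/10/2020', '25/06/2020']})

keys = ['stud_id', 'prod_id']

def as_events(df, qty_col, date_col, stage):
    # Hypothetical helper: bring one table into a common (qty, date, stage) layout.
    out = df[keys + [qty_col, date_col]].copy()
    out.columns = keys + ['qty', 'date']
    out['stage'] = stage  # 0 = inv, 1 = bkl, 2 = accu
    return out

events = pd.concat([
    as_events(df_inv, 'inv_qty', 'inv_date', 0),
    as_events(df_bkl, 'bkl_qty', 'bkl_date', 1),
    as_events(df_accu, 'accu_qty', 'accu_date', 2),
], ignore_index=True)
events['seq'] = range(len(events))  # remembers the original row order inside each table

# Keep only pairs present in df_raw, then order every group as inv, bkl rows, accu.
events = events.merge(df_raw, on=keys).sort_values(keys + ['stage', 'seq'])

events['running'] = events.groupby(keys)['qty'].cumsum()
events['pct'] = events['running'] * 100 / events['total_qty']

# First event per group at which the running percentage exceeds 50.
first_cross = events[events['pct'] > 50].groupby(keys, as_index=False).first()
# Final percentage per group, reported as threshold.
final_pct = events.groupby(keys, as_index=False)['pct'].last().rename(columns={'pct': 'threshold'})

result = (first_cross[keys + ['total_qty', 'date']]
          .rename(columns={'date': 'threshold_date'})
          .merge(final_pct, on=keys))
result = result[['stud_id', 'prod_id', 'total_qty', 'threshold', 'threshold_date']]
print(result)
```

On the sample data this returns the row for prod_id 12 with threshold 72.0 and date 22/10/2021. It also returns a row for prod_id 13, which has no df_bkl rows but crosses 50% at the accu_qty step (threshold 60.0, date 02/11/2019). The merge-based code above drops such pairs, so which behaviour is wanted depends on the real requirements.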
