Upsampling and dividing data in pandas

I am trying to upsample a pandas datetime-indexed DataFrame so that each original value is split evenly across the new entries.

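For example, suppose I have a DataFrame that stores a cost for each month, and I want a DataFrame that gives the equivalent cost for each day of each month: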

import pandas as pd

df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31], 
                    [pd.to_datetime('2023-02-01'), 14]], 
                    columns=['time', 'cost']
                    )
      .set_index("time")
     )

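The daily cost is 1 dollar (or whatever currency you prefer) for January and 0.5 dollars for February. My goal, as a picture:

[Image: upsampling and dividing data in pandas]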

After some effort, I arrived at the following snippet, which seems to do what I want:

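from datetime import datetime
from dateutil.relativedelta import relativedelta

# add a sentinel row one month past the end so the last month is resampled in full
df.loc[df.index.max() + relativedelta(months=1)] = 0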

# forward-fill over the right scale
# then divide each entry per the number of rows in the month
df = (df
      .resample('1d')
      .ffill()
      .iloc[:-1]
      .groupby(lambda x: datetime(x.year, x.month, 1))
      .transform(lambda x: (x / x.count()))
      )

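However, this is not entirely satisfactory:

using transform forces me to have a DataFrame with a single column;
I have to hard-code my original frequency several times in different forms (once when appending the extra value at the end of the DataFrame, and again in the groupby), which makes it hard to wrap in a function;
it only works with an evenly spaced datetime index (although that is fine in my case);
it is still complicated.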

Does anyone have a suggestion for improving this snippet?

A uj5u.com user replied:

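We can take the month index of df and expand each month into a range of days, dividing each of df's values by the number of days in that month and assigning the result to every day, all inside a list comprehension (edit: this is for values spread evenly over the days):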

import pandas as pd

# initial DataFrame
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31], 
                    [pd.to_datetime('2023-02-01'), 14]], 
                   columns=['time', 'cost']
       ).set_index("time"))

# reformat to months
df.index = df.index.strftime('%m-%Y')

df1 = pd.concat(  # concatenate the resulted DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df 
        [v / pd.Period(i).days_in_month  # each month's value divided by n of days in a month
         for d in range(pd.Period(i).days_in_month)],  # repeated for as many times as there are days
        index=pd.date_range(start=i, periods=pd.Period(i).days_in_month, freq='D'))  # days range
     for i, v in df.iterrows()])  # for each df's index and value
df1

Output:

           cost
2023-01-01  1.0
2023-01-02  1.0
2023-01-03  1.0
2023-01-04  1.0
2023-01-05  1.0
2023-01-06  1.0
2023-01-07  1.0
2023-01-08  1.0
2023-01-09  1.0
2023-01-10  1.0
2023-01-11  1.0
    ...     ...
2023-02-13  0.5
2023-02-14  0.5
2023-02-15  0.5
2023-02-16  0.5
2023-02-17  0.5
2023-02-18  0.5
2023-02-19  0.5
2023-02-20  0.5
2023-02-21  0.5
2023-02-22  0.5
2023-02-23  0.5
2023-02-24  0.5
2023-02-25  0.5
2023-02-26  0.5
2023-02-27  0.5
2023-02-28  0.5
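
For comparison, the same even split can be sketched without a list comprehension. This is only an illustration under some assumptions: the index is a sorted DatetimeIndex of month-start dates with no missing months, and the helper name `spread_evenly` is made up for this example, not something from pandas or from the answer above.

```python
import pandas as pd


def spread_evenly(monthly: pd.DataFrame) -> pd.DataFrame:
    """Split each month-start row evenly over the days of its month.

    Assumes a sorted DatetimeIndex of month-start dates with no gaps.
    """
    # value per day = monthly value / number of days in that month
    per_day = monthly.div(monthly.index.days_in_month.to_numpy(), axis=0)
    # daily index from the first day of the first month
    # to the last day of the last month
    first_day = monthly.index.min()
    last_day = monthly.index.max() + pd.offsets.MonthEnd(0)
    days = pd.date_range(first_day, last_day, freq="D")
    # each day takes the per-day value of the month it falls in
    return per_day.reindex(days, method="ffill")


df = pd.DataFrame({"cost": [31, 14]},
                  index=pd.to_datetime(["2023-01-01", "2023-02-01"]))
daily = spread_evenly(df)
print(daily.head())
```

Because the division is done per row before the reindex, this works unchanged with several columns. If a month were missing from the index, the forward fill would wrongly carry the previous month's daily value into it, so the no-gaps assumption matters.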

What can be done if the daily costs should not be spread uniformly, and if there are several columns? Here is an extended df:

# additional columns and a row
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31, 62, 23], 
                    [pd.to_datetime('2023-02-01'), 14, 28, 51],
                    [pd.to_datetime('2023-03-01'), 16, 33, 21]], 
                   columns=['time', 'cost1', 'cost2', 'cost3']
       ).set_index("time"))

# reformat to months
df.index = df.index.strftime('%m-%Y')
df

Output:

          cost1  cost2  cost3
time            
01-2023   31     62     23
02-2023   14     28     51
03-2023   16     33     21

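Inspired by this question, here is my proposal for the case where the monthly costs should be upsampled into random daily costs. This solution scales to any number of columns and rows: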

import numpy as np

df1 = pd.concat(  # concatenate the resulting DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df 
        # here we make a Series with random Dirichlet distributed numbers
        # with length of a month and a column's value as the sum 
        [pd.Series((np.random.dirichlet(np.ones(pd.Period(i).days_in_month), size=1)*v
                   ).flatten())  # the product is an ndarray that needs flattening 
         for v in row],  # for every column value in a row 
        # index renamed as columns because of the created DataFrame's shape 
        index=df.columns  
                    # transpose and set the proper index
                    ).T.set_index(
                           pd.date_range(start=i,
                                         periods=pd.Period(i).days_in_month,
                                         freq='D'))  
      for i, row in df.iterrows()])  # iterate over every row

Output:

            cost1       cost2       cost3
2023-01-01  1.703177    1.444117    0.160151
2023-01-02  0.920706    3.664460    0.823405
2023-01-03  1.210426    1.194963    0.294093
2023-01-04  0.214737    1.286273    0.923881
2023-01-05  1.264553    0.380062    0.062829
...         ...         ...         ...
2023-03-27  0.124092    0.615885    0.251369
2023-03-28  0.520578    1.505830    1.632373
2023-03-29  0.245154    3.094078    0.308173
2023-03-30  0.530927    0.406665    1.149860
2023-03-31  0.276992    1.115308    0.432090
90 rows × 3 columns
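
The random split above can also be written a little more compactly and made reproducible. The following is a rough sketch rather than the answer's own method: it assumes the index is still a DatetimeIndex of month-start dates (that is, before the strftime step), it uses a seeded NumPy Generator, and the name `spread_randomly` is hypothetical.

```python
import numpy as np
import pandas as pd


def spread_randomly(monthly: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    """Split each monthly value into random non-negative daily parts.

    Assumes a DatetimeIndex of month-start dates.
    """
    pieces = []
    for start, row in monthly.iterrows():
        n_days = start.days_in_month
        # one Dirichlet draw per column; each draw has n_days weights summing to 1
        weights = rng.dirichlet(np.ones(n_days), size=len(row))
        # scale each column's weights by that column's monthly value
        values = weights.T * row.to_numpy()
        index = pd.date_range(start, periods=n_days, freq="D")
        pieces.append(pd.DataFrame(values, index=index, columns=monthly.columns))
    return pd.concat(pieces)


monthly = pd.DataFrame({"cost1": [31, 14], "cost2": [62, 28]},
                       index=pd.to_datetime(["2023-01-01", "2023-02-01"]))
daily_random = spread_randomly(monthly, np.random.default_rng(0))
print(daily_random.head())
```

Passing the generator in as an argument keeps the randomness under the caller's control, so the same seed gives the same daily split.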

Check the monthly sums:

df1.groupby(pd.Grouper(freq='M')).agg('sum')

Output:

            cost1   cost2   cost3
2023-01-31  31.0    62.0    23.0
2023-02-28  14.0    28.0    51.0
2023-03-31  16.0    33.0    21.0
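
Instead of comparing the sums by eye, the check can be turned into an actual assertion. This is a minimal sketch with a floating-point tolerance; the helper name `monthly_totals_match` is invented for the example, and it assumes both frames carry datetime-like indexes (for the monthly frame, before the strftime step).

```python
import numpy as np
import pandas as pd


def monthly_totals_match(daily: pd.DataFrame, monthly: pd.DataFrame) -> bool:
    """Return True if the daily values add back up to the monthly figures."""
    # sum the daily values per calendar month
    sums = daily.groupby(daily.index.to_period("M")).sum()
    # label the monthly frame by the same monthly periods
    expected = monthly.copy()
    expected.index = pd.to_datetime(expected.index).to_period("M")
    # align on month and column; anything missing on either side becomes NaN,
    # which makes the comparison fail
    sums, expected = sums.align(expected, join="outer")
    return bool(np.allclose(sums.to_numpy(dtype=float),
                            expected.to_numpy(dtype=float)))


monthly = pd.DataFrame({"cost": [31, 14]},
                       index=pd.to_datetime(["2023-01-01", "2023-02-01"]))
days = pd.date_range("2023-01-01", "2023-02-28", freq="D")
daily = pd.DataFrame({"cost": np.where(days.month == 1, 1.0, 0.5)}, index=days)

assert monthly_totals_match(daily, monthly)
```

A tolerance is used because the random daily parts are floats and may not sum to the monthly value exactly.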
