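I am trying to upsample a pandas datetime-indexed dataframe so that the resulting data is divided equally over the new entries.
For instance, say I have a dataframe that stores a cost for each month, and I want a dataframe that gives the equivalent cost per day for each month: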
import pandas as pd

df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
                    [pd.to_datetime('2023-02-01'), 14]],
                   columns=['time', 'cost']
                   )
      .set_index("time")
      )
The daily cost would be $1 (or whatever currency you like) in January and $0.5 in February.
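The arithmetic behind those two figures can be sketched as follows. This is only an illustration of the target values; the names `monthly`, `days` and `daily_rate` are assumptions introduced here, not part of the original post.

```python
import pandas as pd

# Monthly totals from the example, indexed by the first day of each month
monthly = pd.Series(
    [31, 14],
    index=pd.to_datetime(['2023-01-01', '2023-02-01']),
    name='cost',
)

# Number of calendar days in each month (31 for January 2023, 28 for February 2023)
days = monthly.index.days_in_month.to_numpy()

# Even split: monthly total divided by the number of days in that month
daily_rate = monthly / days
print(daily_rate)
```

January gives 31 / 31 = 1.0 and February gives 14 / 28 = 0.5, which are the values every day of the respective month should carry after upsampling.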

After a lot of struggle, I managed to obtain the following snippet, which seems to do what I want:
from datetime import datetime
from dateutil.relativedelta import relativedelta

# add a value to perform a correct resampling
df.loc[df.index.max() + relativedelta(months=1)] = 0
# forward-fill over the right scale
# then divide each entry by the number of rows in the month
df = (df
      .resample('1d')
      .ffill()
      .iloc[:-1]
      .groupby(lambda x: datetime(x.year, x.month, 1))
      .transform(lambda x: (x / x.count()))
      )
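However, this is not entirely satisfactory:
- using transform forces me to have dataframes with a single column;
- I need to hardcode my original frequency several times in different formats (when adding an extra value at the end of the dataframe, and again in the groupby), which makes it hard to turn this into a function;
- it only works with an evenly spaced datetime index (even though that is fine in my case);
- it remains complex.
Does anyone have a suggestion to improve this snippet?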
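Reply from a uj5u.com user:
What if we take df's month labels and expand them into ranges of days, while dividing df's values by the number of days in each month and assigning the result to each day, all in a list comprehension (edit: this is for values spread evenly over the days):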
import pandas as pd
# initial DataFrame
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
                    [pd.to_datetime('2023-02-01'), 14]],
                   columns=['time', 'cost']
                   ).set_index("time"))
# reformat to months
df.index = df.index.strftime('%m-%Y')
df1 = pd.concat(  # concatenate the resulting DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df
        [v / pd.Period(i).days_in_month  # each month's value divided by the number of days in the month
         for d in range(pd.Period(i).days_in_month)],  # repeated once per day of the month
        index=pd.date_range(start=i, periods=pd.Period(i).days_in_month, freq='D'))  # range of days
     for i, v in df.iterrows()])  # for each index label and row of df
df1
Output:
cost
2023-01-01 1.0
2023-01-02 1.0
2023-01-03 1.0
2023-01-04 1.0
2023-01-05 1.0
2023-01-06 1.0
2023-01-07 1.0
2023-01-08 1.0
2023-01-09 1.0
2023-01-10 1.0
2023-01-11 1.0
... ...
2023-02-13 0.5
2023-02-14 0.5
2023-02-15 0.5
2023-02-16 0.5
2023-02-17 0.5
2023-02-18 0.5
2023-02-19 0.5
2023-02-20 0.5
2023-02-21 0.5
2023-02-22 0.5
2023-02-23 0.5
2023-02-24 0.5
2023-02-25 0.5
2023-02-26 0.5
2023-02-27 0.5
2023-02-28 0.5
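To make the building blocks of that comprehension easier to follow, here is a minimal sketch of what happens for a single month. It uses the unambiguous `'2023-02'` spelling with an explicit `freq='M'` rather than the `'%m-%Y'` labels of the answer, and the names `month`, `n_days`, `days` and `feb` are assumptions introduced for illustration.

```python
import pandas as pd

# One month as a Period; days_in_month is 28 for February 2023
month = pd.Period('2023-02', freq='M')
n_days = month.days_in_month

# One timestamp per day of that month, starting on the first day
days = pd.date_range(start=month.start_time, periods=n_days, freq='D')

# Spread the monthly total (14 in the example) evenly over those days
feb = pd.DataFrame({'cost': 14 / n_days}, index=days)
print(feb.head())
```

The answer's comprehension does the same thing for every row of df and then stacks the per-month frames with `pd.concat`.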
What can be done to avoid a uniform distribution of the daily costs, and to handle multiple columns? Here is an extended df:
# additional columns and a row
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31, 62, 23],
                    [pd.to_datetime('2023-02-01'), 14, 28, 51],
                    [pd.to_datetime('2023-03-01'), 16, 33, 21]],
                   columns=['time', 'cost1', 'cost2', 'cost3']
                   ).set_index("time"))
# reformat to months
df.index = df.index.strftime('%m-%Y')
df
Output:
cost1 cost2 cost3
time
01-2023 31 62 23
02-2023 14 28 51
03-2023 16 33 21
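Inspired by this question, here is my suggestion for the case where the monthly costs are upsampled into random daily costs. This solution scales to any number of columns and rows: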
import numpy as np

df1 = pd.concat(  # concatenate the resulting DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df
        # here we make a Series of random Dirichlet-distributed numbers
        # with the length of a month and the column's value as the sum
        [pd.Series((np.random.dirichlet(np.ones(pd.Period(i).days_in_month), size=1) * v
                    ).flatten())  # the product is a 2-D ndarray that needs flattening
         for v in row],  # for every column value in a row
        # the columns go in as the index because of the created DataFrame's shape
        index=df.columns
        # transpose and set the proper index
     ).T.set_index(
         pd.date_range(start=i,
                       periods=pd.Period(i).days_in_month,
                       freq='D'))
     for i, row in df.iterrows()])  # iterate over every row
Output:
cost1 cost2 cost3
2023-01-01 1.703177 1.444117 0.160151
2023-01-02 0.920706 3.664460 0.823405
2023-01-03 1.210426 1.194963 0.294093
2023-01-04 0.214737 1.286273 0.923881
2023-01-05 1.264553 0.380062 0.062829
... ... ... ...
2023-03-27 0.124092 0.615885 0.251369
2023-03-28 0.520578 1.505830 1.632373
2023-03-29 0.245154 3.094078 0.308173
2023-03-30 0.530927 0.406665 1.149860
2023-03-31 0.276992 1.115308 0.432090
90 rows × 3 columns
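The reason each monthly total is preserved is that a Dirichlet sample is a vector of non-negative weights that sums to 1, so multiplying it by the monthly value yields daily values that sum to that value. A minimal sketch for one month and one column follows; the seed, the use of `np.random.default_rng` instead of the legacy `np.random.dirichlet`, and the names `n_days`, `total`, `weights` and `daily` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_days = 28   # days in February 2023
total = 14    # monthly cost to distribute

# Dirichlet with all concentration parameters equal to 1:
# n_days non-negative weights that sum to 1
weights = rng.dirichlet(np.ones(n_days))

# Scale the weights so the daily values sum to the monthly total
daily = weights * total
print(daily.sum())
```

Up to floating-point rounding, the printed sum equals the monthly total, whatever the individual daily values turn out to be.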
Checking the monthly sums:
df1.groupby(pd.Grouper(freq='M')).agg('sum')
Output:
cost1 cost2 cost3
2023-01-31 31.0 62.0 23.0
2023-02-28 14.0 28.0 51.0
2023-03-31 16.0 33.0 21.0
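For the even-split case with several columns, a loop-free variant is also possible. The sketch below is not the answer's method: `upsample_evenly` is a hypothetical helper, and it assumes the index is a sorted DatetimeIndex of consecutive month starts (a missing month would wrongly inherit the previous month's values through the forward fill).

```python
import pandas as pd


def upsample_evenly(monthly: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: spread each month-start row evenly over its days."""
    # Daily index from the first month start to the last day of the last month
    start = monthly.index.min()
    end = monthly.index.max() + pd.offsets.MonthEnd(0)
    days = pd.date_range(start, end, freq='D')
    # Repeat each monthly row on every day of its month
    daily = monthly.reindex(days, method='ffill')
    # Divide every column by the number of days in the month of each row
    return daily.div(days.days_in_month.to_numpy(), axis=0)


df = pd.DataFrame(
    {'cost1': [31, 14, 16], 'cost2': [62, 28, 33], 'cost3': [23, 51, 21]},
    index=pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01']),
)
daily = upsample_evenly(df)

# Monthly sums of the daily values should match the original monthly totals
check = daily.groupby(daily.index.to_period('M')).sum()
print(check)
```

Here the original DatetimeIndex is kept instead of being converted to `'%m-%Y'` strings, which avoids parsing the labels back into dates.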
