I have the following dataframe (example):
import pandas as pd
data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])
df['diff_days'] = (df['date']-df['date'].groupby(df['group']).transform('first')).dt.days
group date value diff_days
0 A 2022-09-01 2 0
1 A 2022-09-02 1 1
2 A 2022-09-04 3 3
3 A 2022-09-06 2 5
4 A 2022-09-07 1 6
5 A 2022-09-07 2 6
6 A 2022-09-08 4 7
7 A 2022-09-09 2 8
8 B 2022-09-01 2 0
9 B 2022-09-03 4 2
10 B 2022-09-04 2 3
11 B 2022-09-05 2 4
12 B 2022-09-07 1 6
13 B 2022-09-08 3 7
14 B 2022-09-10 2 9
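I want to fill in the missing dates, per group, with the values from the previous date. I could use the code from this answer, but the problem is that I may have duplicate entries (dates) within a group. It returns the following error: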
import numpy as np
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(
['date', 'group']
).unstack(
fill_value=-999
).asfreq(
'D', fill_value=-999
).stack().sort_index(level=1).reset_index()
df.replace(-999, np.nan).ffill()
ValueError: Index contains duplicate entries, cannot reshape
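To make the cause of the error concrete, here is a minimal sketch on a three-row excerpt of the data (the variable names `dupes` and `message` are only illustrative). `unstack` needs each (date, group) pair to appear once, and group A has two rows for 2022-09-07:

```python
import pandas as pd

df = pd.DataFrame(
    [['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4]],
    columns=['group', 'date', 'value'],
)
df['date'] = pd.to_datetime(df['date'])

# keep=False marks every row that shares its (group, date) key with another row.
dupes = df[df.duplicated(['group', 'date'], keep=False)]
print(dupes)

# Reshaping on a non-unique (date, group) index raises the ValueError.
message = None
try:
    df.set_index(['date', 'group']).unstack()
except ValueError as err:
    message = str(err)
print(message)
```

Any approach therefore has to either set the duplicate rows aside or give them a distinguishing key before reshaping.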
My desired output should look like this:
data = [['A', '2022-09-01', 2, 0], ['A', '2022-09-02', 1, 1], ['A', '2022-09-03', 1, 1], ['A', '2022-09-04', 3, 3],
['A', '2022-09-05', 3, 3], ['A', '2022-09-06', 2, 5], ['A', '2022-09-07', 1, 6], ['A', '2022-09-07', 2, 6],
['A', '2022-09-08', 4, 7], ['A', '2022-09-09', 2, 8],
['B', '2022-09-01', 2, 0], ['B', '2022-09-02', 2, 0], ['B', '2022-09-03', 4, 2], ['B', '2022-09-04', 2, 3],
['B', '2022-09-05', 2, 4], ['B', '2022-09-06', 2, 4], ['B', '2022-09-07', 1, 6], ['B', '2022-09-08', 3, 7],
['B', '2022-09-09', 3, 7], ['B', '2022-09-10', 2, 9]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'value', 'diff_days'])
group date value diff_days
0 A 2022-09-01 2 0
1 A 2022-09-02 1 1
2 A 2022-09-03 1 1
3 A 2022-09-04 3 3
4 A 2022-09-05 3 3
5 A 2022-09-06 2 5
6 A 2022-09-07 1 6
7 A 2022-09-07 2 6
8 A 2022-09-08 4 7
9 A 2022-09-09 2 8
10 B 2022-09-01 2 0
11 B 2022-09-02 2 0
12 B 2022-09-03 4 2
13 B 2022-09-04 2 3
14 B 2022-09-05 2 4
15 B 2022-09-06 2 4
16 B 2022-09-07 1 6
17 B 2022-09-08 3 7
18 B 2022-09-09 3 7
19 B 2022-09-10 2 9
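Some explanation of the dates:
- For group A, the dates "2022-09-03" and "2022-09-05" are missing. As you can see, these take the values from the previous date.
- For group B, the dates "2022-09-02", "2022-09-06" and "2022-09-09" are missing. These likewise take their values from the previous row.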
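The missing dates listed above can be found programmatically. The following is a small sketch, not part of the original question; the `dates` dictionary and the `missing` name are introduced here only for illustration. It compares each group's dates against a gap-free daily range between that group's first and last date:

```python
import pandas as pd

dates = {
    'A': ['2022-09-01', '2022-09-02', '2022-09-04', '2022-09-06',
          '2022-09-07', '2022-09-07', '2022-09-08', '2022-09-09'],
    'B': ['2022-09-01', '2022-09-03', '2022-09-04', '2022-09-05',
          '2022-09-07', '2022-09-08', '2022-09-10'],
}

missing = {}
for group, raw in dates.items():
    # Dates actually present for this group (duplicates collapsed).
    present = pd.DatetimeIndex(pd.to_datetime(raw)).unique()
    # Every calendar day between the group's first and last date.
    full = pd.date_range(present.min(), present.max(), freq='D')
    missing[group] = full.difference(present).strftime('%Y-%m-%d').tolist()

print(missing)
```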
So I was wondering if anyone knows how to fill in these missing dates with Pandas, even when a group may contain duplicate entries?
Reply from a uj5u.com user:
Solution
c = ['group', 'date']
m = df[c].duplicated(keep='last')
s = df[~m].set_index('date').groupby('group').resample('D').ffill()
out = pd.concat([df[m], s.droplevel(0).reset_index()]).sort_values(c)
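How does this work?
- Identify the rows that are duplicated on group and date
- Drop the duplicates and resample the dataframe by day using forward fill
- Concat the duplicated rows with the resampled rows to get the result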
Result
group date value diff_days
0 A 2022-09-01 2 0
1 A 2022-09-02 1 1
2 A 2022-09-03 1 1
3 A 2022-09-04 3 3
4 A 2022-09-05 3 3
5 A 2022-09-06 2 5
4 A 2022-09-07 1 6
6 A 2022-09-07 2 6
7 A 2022-09-08 4 7
8 A 2022-09-09 2 8
9 B 2022-09-01 2 0
10 B 2022-09-02 2 0
11 B 2022-09-03 4 2
12 B 2022-09-04 2 3
13 B 2022-09-05 2 4
14 B 2022-09-06 2 4
15 B 2022-09-07 1 6
16 B 2022-09-08 3 7
17 B 2022-09-09 3 7
18 B 2022-09-10 2 9
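The three steps of this answer can also be spelled out one at a time. The sketch below is a variation rather than the answerer's exact code: in place of `groupby(...).resample('D').ffill()` it uses an explicit per-group `reindex` over a daily `date_range`, purely so that each step is visible. The names `is_dupe`, `filled_parts` and `keys` are illustrative.

```python
import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data, columns=['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])
df['diff_days'] = (df['date'] - df.groupby('group')['date'].transform('first')).dt.days

keys = ['group', 'date']

# Step 1: flag every duplicated (group, date) row except the last one.
is_dupe = df.duplicated(keys, keep='last')

# Step 2: on the de-duplicated rows, build a gap-free daily calendar per group
# and forward-fill the newly created rows from the previous date.
filled_parts = []
for _, g in df[~is_dupe].groupby('group'):
    days = pd.date_range(g['date'].min(), g['date'].max(), freq='D')
    filled_parts.append(
        g.set_index('date').reindex(days).ffill().rename_axis('date').reset_index()
    )

# Step 3: put the set-aside duplicates back, sort by group and date,
# and cast the forward-filled float columns back to integers.
out = (
    pd.concat([df[is_dupe], *filled_parts])
    .sort_values(keys, ignore_index=True)
    .astype({'value': int, 'diff_days': int})[df.columns]
)
print(out)
```

Because each group is reindexed over its own first-to-last date range, no group is extended to dates that only occur in another group.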
Reply from a uj5u.com user:
You can use a helper column to de-duplicate the dates:
(df.assign(n=df.groupby(['group', 'date']).cumcount())
.pivot(index=['date', 'n'], columns='group')
.ffill()
.stack().reset_index()
.sort_values(by=['group', 'date'], ignore_index=True)
[df.columns]
)
Output:
group date value diff_days
0 A 2022-09-01 2.0 0.0
1 A 2022-09-02 1.0 1.0
2 A 2022-09-03 1.0 1.0
3 A 2022-09-04 3.0 3.0
4 A 2022-09-05 3.0 3.0
5 A 2022-09-06 2.0 5.0
6 A 2022-09-07 1.0 6.0
7 A 2022-09-07 2.0 6.0
8 A 2022-09-08 4.0 7.0
9 A 2022-09-09 2.0 8.0
10 A 2022-09-10 2.0 8.0
11 B 2022-09-01 2.0 0.0
12 B 2022-09-02 2.0 0.0
13 B 2022-09-03 4.0 2.0
14 B 2022-09-04 2.0 3.0
15 B 2022-09-05 2.0 4.0
16 B 2022-09-06 2.0 4.0
17 B 2022-09-07 1.0 6.0
18 B 2022-09-07 1.0 6.0
19 B 2022-09-08 3.0 7.0
20 B 2022-09-09 3.0 7.0
21 B 2022-09-10 2.0 9.0
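To show what the helper column `n` does, here is a small sketch on made-up rows (the data below is illustrative and not taken from the question). `cumcount` numbers the rows inside each (group, date) pair, which makes (date, n) unique within a group so that `pivot` can reshape without raising:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2022-09-07', '2022-09-07', '2022-09-08',
                            '2022-09-07', '2022-09-09']),
    'value': [1, 2, 4, 1, 3],
})

# Running counter within each (group, date) pair: 0 for the first row, 1 for the next, ...
df['n'] = df.groupby(['group', 'date']).cumcount()
print(df['n'].tolist())  # [0, 1, 0, 0, 0]

# One row per (date, n) pair found in ANY group, one column per group.
wide = df.pivot(index=['date', 'n'], columns='group', values='value')
print(wide)

# Forward fill then fills every gap in that shared index, for both groups.
print(wide.ffill())
```

Note that after the pivot all groups share one index, the union of every (date, n) pair. This appears to be why the output above contains a row for A on 2022-09-10 and a second row for B on 2022-09-07, neither of which is in the desired output. It also means a date is only filled in if at least one group has it.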
