Efficiently count records whose date falls between two columns

Suppose I have this DataFrame:

	user	sub_date	unsub_date	group
0	alice	2021-01-01 00:00:00	2021-02-09 00:00:00	A
1	bob	2021-02-03 00:00:00	2021-04-05 00:00:00	B
2	charlie	2021-02-03 00:00:00	NaT	A
3	dave	2021-01-29 00:00:00	2021-09-01 00:00:00	B

What is the most efficient way to count the subscribed users for each date and each group? In other words, how do I get this DataFrame:

date	group	subbed
2021-01-01	A	1
2021-01-01	B	0
2021-01-02	A	1
2021-01-02	B	0
...	...	...
2021-02-03	A	2
2021-02-03	B	2
...	...	...
2021-02-10	A	1
2021-02-10	B	2
...	...	...

Here is a snippet that initializes the example df:

import pandas as pd
import datetime as dt

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)

users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    pd.to_datetime
)
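For reference, the desired output can be pinned down with a brute-force baseline. This is only a sketch to check faster solutions against, not an efficient answer: the function name `count_subscribed_naive` is made up for illustration, and it assumes a user counts as subscribed from `sub_date` up to, but not including, `unsub_date`, with a missing `unsub_date` meaning "still subscribed".

```python
import pandas as pd

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    pd.to_datetime
)


def count_subscribed_naive(users, dates):
    """Hypothetical baseline: one boolean mask per date, O(n_dates * n_users)."""
    groups = sorted(users["group"].unique())
    rows = []
    for date in dates:
        # Assumption: subscribed on the half-open interval [sub_date, unsub_date);
        # NaT in unsub_date means the user never unsubscribed.
        active = (users["sub_date"] <= date) & (
            users["unsub_date"].isna() | (date < users["unsub_date"])
        )
        counts = users.loc[active, "group"].value_counts()
        for group in groups:
            rows.append((date, group, int(counts.get(group, 0))))
    return pd.DataFrame(rows, columns=["date", "group", "subbed"])


baseline = count_subscribed_naive(users, pd.date_range("2021-01-01", "2021-02-10"))
print(baseline)
```

The cost grows with the number of dates times the number of users, which is what makes a more efficient approach worth asking for.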

uj5u.com user reply:

For convenience, I'm using a smaller date range.

Note: my users df is different from the OP's. I changed a few dates to keep the output small.

In [26]: import pandas as pd
    ...: import datetime as dt
    ...:
    ...: users = pd.DataFrame(
    ...:     [
    ...:         ["alice", "2021-01-01", "2021-01-05", "A"],
    ...:         ["bob", "2021-01-03", "2021-01-07", "B"],
    ...:         ["charlie", "2021-01-03", None, "A"],
    ...:         ["dave", "2021-01-09", "2021-01-11", "B"],
    ...:     ],
    ...:     columns=["user", "sub_date", "unsub_date", "group"],
    ...: )
    ...:
    ...: users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    ...:     pd.to_datetime
    ...: )

In [81]: users
Out[81]:
      user   sub_date unsub_date group
0    alice 2021-01-01 2021-01-05     A
1      bob 2021-01-03 2021-01-07     B
2  charlie 2021-01-03        NaT     A
3     dave 2021-01-09 2021-01-11     B

In [82]: users.melt(id_vars=['user', 'group'])
Out[82]:
      user group    variable      value
0    alice     A    sub_date 2021-01-01
1      bob     B    sub_date 2021-01-03
2  charlie     A    sub_date 2021-01-03
3     dave     B    sub_date 2021-01-09
4    alice     A  unsub_date 2021-01-05
5      bob     B  unsub_date 2021-01-07
6  charlie     A  unsub_date        NaT
7     dave     B  unsub_date 2021-01-11

# dropna to remove rows with no unsub_date
# sort_values to sort by date
# sub_date exists -> map to 1, else -1 then take cumsum to get # of subbed people at that date

In [85]: import numpy as np
    ...: melted = users.melt(id_vars=['user', 'group']).dropna().sort_values('value')
    ...: melted['sub_value'] = np.where(melted['variable'] == 'sub_date', 1, -1) # or melted['variable'].map({'sub_date': 1, 'unsub_date': -1})
    ...: melted['sub_cumsum_group'] = melted.groupby('group')['sub_value'].cumsum()
    ...: melted
Out[85]:
      user group    variable      value  sub_value  sub_cumsum_group
0    alice     A    sub_date 2021-01-01          1                 1
1      bob     B    sub_date 2021-01-03          1                 1
2  charlie     A    sub_date 2021-01-03          1                 2
4    alice     A  unsub_date 2021-01-05         -1                 1
5      bob     B  unsub_date 2021-01-07         -1                 0
3     dave     B    sub_date 2021-01-09          1                 1
7     dave     B  unsub_date 2021-01-11         -1                 0
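The idea behind the `+1` / `-1` column and the grouped `cumsum` can be sketched without pandas: every subscription becomes a `+1` event at its start and a `-1` event at its end, and a running total over the date-sorted events gives the number of subscribers after each event. The tuple layout and the names `events`, `running` and `timeline` below are illustrative only.

```python
from collections import defaultdict
from datetime import date

# (user, sub_date, unsub_date, group); None means still subscribed.
records = [
    ("alice", date(2021, 1, 1), date(2021, 1, 5), "A"),
    ("bob", date(2021, 1, 3), date(2021, 1, 7), "B"),
    ("charlie", date(2021, 1, 3), None, "A"),
    ("dave", date(2021, 1, 9), date(2021, 1, 11), "B"),
]

# Turn every interval into a +1 event at its start and a -1 event at its end.
events = []
for _user, start, end, group in records:
    events.append((start, group, +1))
    if end is not None:
        events.append((end, group, -1))

# Sweep through the events in date order, keeping one running total per group.
running = defaultdict(int)
timeline = defaultdict(list)  # group -> [(date, count after that event), ...]
for day, group, delta in sorted(events):
    running[group] += delta
    timeline[group].append((day, running[group]))

for group in sorted(timeline):
    print(group, timeline[group])
```

The per-group totals are the same numbers as the `sub_cumsum_group` column above; what remains is filling in the days on which nothing happened.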

In [93]: idx = pd.date_range(melted['value'].min(), melted['value'].max(), freq='1D')
    ...: idx
Out[93]:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10', '2021-01-11'],
              dtype='datetime64[ns]', freq='D')

In [94]: melted.set_index('value').groupby('group')['sub_cumsum_group'].apply(lambda x: x.reindex(idx).ffill().fillna(0))
Out[94]:
group
A      2021-01-01    1.0
       2021-01-02    1.0
       2021-01-03    2.0
       2021-01-04    2.0
       2021-01-05    1.0
       2021-01-06    1.0
       2021-01-07    1.0
       2021-01-08    1.0
       2021-01-09    1.0
       2021-01-10    1.0
       2021-01-11    1.0
B      2021-01-01    0.0
       2021-01-02    0.0
       2021-01-03    1.0
       2021-01-04    1.0
       2021-01-05    1.0
       2021-01-06    1.0
       2021-01-07    0.0
       2021-01-08    0.0
       2021-01-09    1.0
       2021-01-10    1.0
       2021-01-11    0.0
Name: sub_cumsum_group, dtype: float64
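One caveat with the `reindex` step above: it needs a unique date index within each group, so it would fail if two events in the same group fell on the same day. A possible variant, shown here on the OP's original data, sums the net change per day and group before taking the cumulative sum, and returns the long format the question asks for. The helper name `daily_subscribed` and the output column name `subbed` are assumptions for illustration.

```python
import pandas as pd

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    pd.to_datetime
)


def daily_subscribed(users):
    """Hypothetical helper: subscribed users per calendar day and group."""
    events = users.melt(
        id_vars=["user", "group"], var_name="kind", value_name="date"
    ).dropna(subset=["date"])
    events["delta"] = events["kind"].map({"sub_date": 1, "unsub_date": -1})

    # Net change per (date, group); summing collapses same-day events,
    # so the date index is unique before reindexing.
    net = (
        events.groupby(["date", "group"])["delta"]
        .sum()
        .unstack("group", fill_value=0)
    )

    # Running total at each event date, then forward-fill onto every day.
    idx = pd.date_range(net.index.min(), net.index.max(), freq="D")
    counts = net.cumsum().reindex(idx, method="ffill").fillna(0).astype(int)
    counts.index.name = "date"

    # Back to long format: one row per (date, group).
    return (
        counts.reset_index()
        .melt(id_vars="date", var_name="group", value_name="subbed")
        .sort_values(["date", "group"], ignore_index=True)
    )


result = daily_subscribed(users)
print(result)
```

As in the answer above, a user is no longer counted on the `unsub_date` itself.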

uj5u.com user reply:

The data is described by step functions.

The next step is to sample the step functions at whatever dates you want, for example every day in January.

sc.sample(stepfunctions, pd.date_range("2021-01-01", "2021-02-01")).melt(ignore_index=False).reset_index()

The result looks like this:

   group   variable  value
0      A 2021-01-01      1
1      B 2021-01-01      0
2      A 2021-01-02      1
3      B 2021-01-02      0
4      A 2021-01-03      1
..   ...        ...    ...
59     B 2021-01-30      1
60     A 2021-01-31      1
61     B 2021-01-31      1
62     A 2021-02-01      1
63     B 2021-02-01      1
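The construction of `stepfunctions` is not shown above, so here is a rough NumPy-only sketch of the same sampling idea rather than the `sc` call itself. The value of the step function at a date `t` is the number of subscriptions that started on or before `t` minus the number that ended on or before `t`, and both counts can be read off sorted arrays with `np.searchsorted`. The helper name `sample_subscribed` is made up for illustration, and the data is the OP's original frame.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    pd.to_datetime
)


def sample_subscribed(users, dates):
    """Hypothetical helper: evaluate each group's step function at `dates`."""
    dates = pd.DatetimeIndex(dates)
    points = dates.to_numpy()
    out = {}
    for group, frame in users.groupby("group"):
        starts = np.sort(frame["sub_date"].dropna().to_numpy())
        ends = np.sort(frame["unsub_date"].dropna().to_numpy())
        # side="right" counts events with timestamp <= t, so a user is
        # counted from sub_date onwards and dropped on unsub_date itself.
        started = np.searchsorted(starts, points, side="right")
        ended = np.searchsorted(ends, points, side="right")
        out[group] = started - ended
    return pd.DataFrame(out, index=dates)


sampled = sample_subscribed(users, pd.date_range("2021-01-01", "2021-02-01"))
print(sampled)
```

Because each lookup is a binary search, the dates to sample can be arbitrary and need not form a contiguous daily range.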

uj5u.com user reply:

Try this?

>>> users.groupby(['sub_date','group'])[['user']].count()
