對于我的資料框中的每個 id,我想在給定 id 的最小和最大日期內的隨機日期范圍期間(可以是 1、2、4 周)之間對行進行采樣。例如,對于給定的 id,我有介于兩者之間的資料2022-09-01,2022-09-28我想隨機抽樣 7 天,這將為我提供隨機開始日期和結束日期之間的所有記錄,例如2022-09-06and 2022-09-12。
我的一個想法是隨機抽樣一個 start_date,然后在它上面添加我想要的間隔以找到 end_date,檢查它是否是一個可行的 end_date,如果不重復第一個 start_date 抽樣,如果可行,然后根據該范圍過濾資料,但是通過這種方法,我錯過了很多組合,并且覺得必須有更好的方法。
示例 df:
data = [{'account_id': 1, 'date': Timestamp('2022-09-01 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-02 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-03 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-04 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-05 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-06 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-07 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-08 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-09 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-10 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-11 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-12 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-13 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-14 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-15 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-16 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-17 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-18 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-19 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-20 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-21 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-22 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-23 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-24 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-25 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-26 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-27 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-28 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-01 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-02 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-03 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-04 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-05 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-06 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-07 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-08 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-09 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-10 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-11 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-12 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-13 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-14 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-15 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-16 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-17 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-18 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-19 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-20 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-21 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-22 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-23 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-24 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-25 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-26 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-27 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-28 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-29 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-30 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-01 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-02 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-03 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-04 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-05 00:00:00')}]
df = pd.DataFrame(data)
輸出示例:
output_data = [{'account_id': 1, 'date': Timestamp('2022-09-06 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-07 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-08 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-09 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-10 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-11 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-12 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-15 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-16 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-17 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-18 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-19 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-20 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-21 00:00:00')}]
output = pd.DataFrame(output_data)
uj5u.com熱心網友回復:
IIUC,這是您可以做到的一種方法:
基本上你 groupby account_id,然后創建一個pd.date_range組,從它的最小值開始,在它的最大值減去你選擇的天數(或其他任何東西。有關更多資訊,請閱讀檔案pandas.tseries.offsets.DateOffset)你選擇,因為你不能得到隨機 7如果您在組最大值前 2 天開始,則增加天數。
pd.date_range回傳一個串列,其中包含您可以開始的所有可能日期。從這個串列中,您選擇一個使用日期random.choice并將其傳遞給另一個pd.date_range作為開始日期,以及您選擇的間隔數。
import pandas as pd
import random
num_of_days = 7 # or whatever interval you choose
out = (
df
.groupby('account_id')
.apply(lambda grp:
pd.date_range(
start = random.choice(
[*pd.date_range(
grp['date'].min(),
grp['date'].max() - pd.DateOffset(days=num_of_days)
)
]
),
periods = num_of_days)
)
.explode()
.reset_index(name='random_consecutive_days')
)
print(out)
account_id random_consecutive_days
0 1 2022-09-13
1 1 2022-09-14
2 1 2022-09-15
3 1 2022-09-16
4 1 2022-09-17
5 1 2022-09-18
6 1 2022-09-19
7 2 2022-09-20
8 2 2022-09-21
9 2 2022-09-22
10 2 2022-09-23
11 2 2022-09-24
12 2 2022-09-25
13 2 2022-09-26
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/524766.html
上一篇:透視資料框
下一篇:動態Where子句
