How do I create a dataframe of date ranges for ids where indicator = 1?
import pandas as pd

# Proxy for the main high-frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 1],
             ['site a', '2021-03-05 01:30:00', 1],
             ['site a', '2021-03-05 02:00:00', 0],
             ['site a', '2021-03-05 02:30:00', 1],
             ['site a', '2021-03-05 02:30:00', 1],
             ['site b', '2021-04-08 20:00:00', 0],
             ['site b', '2021-04-09 20:00:00', 1],
             ['site b', '2021-04-10 20:00:00', 1],
             ['site b', '2021-04-10 20:30:00', 1]]
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns=['id', 'timestamp', 'indicator'])
# Parse the timestamp strings into datetime64 values
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'])
print(main_df)
       id           timestamp  indicator
0  site a 2021-03-05 01:00:00          1
1  site a 2021-03-05 01:30:00          1
2  site a 2021-03-05 02:00:00          0
3  site a 2021-03-05 02:30:00          1
4  site a 2021-03-05 02:30:00          1
5  site b 2021-04-08 20:00:00          0
6  site b 2021-04-09 20:00:00          1
7  site b 2021-04-10 20:00:00          1
8  site b 2021-04-10 20:30:00          1
Desired output dataframe:
print(desired_df)
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
2  site b 2021-04-09 20:00:00 2021-04-10 20:30:00
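Reply from a uj5u.com user:
You can use groupby with named aggregation like this. First create the indicator group key, ind_grp, by comparing indicator to zero with eq and taking the cumsum:
ind_grp = main_df['indicator'].eq(0).cumsum()
main_df.groupby(['id', ind_grp], as_index=False)\
       .agg(start=('timestamp', 'min'),
            end=('timestamp', 'max'))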
Output:
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:00:00 2021-03-05 02:30:00
2  site b 2021-04-08 20:00:00 2021-04-10 20:30:00
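Note that the output above differs from the desired dataframe: rows 1 and 2 start at timestamps where indicator is 0 (02:00:00 for site a, 2021-04-08 for site b), because each zero row is counted as the first member of the group it opens. The following is a minimal sketch of one possible adjustment, assuming the zero rows should act only as separators. The names ranges and the temporary column ind_grp are illustrative, not part of the original answer.

```python
import pandas as pd

main_df = pd.DataFrame(
    [['site a', '2021-03-05 01:00:00', 1],
     ['site a', '2021-03-05 01:30:00', 1],
     ['site a', '2021-03-05 02:00:00', 0],
     ['site a', '2021-03-05 02:30:00', 1],
     ['site a', '2021-03-05 02:30:00', 1],
     ['site b', '2021-04-08 20:00:00', 0],
     ['site b', '2021-04-09 20:00:00', 1],
     ['site b', '2021-04-10 20:00:00', 1],
     ['site b', '2021-04-10 20:30:00', 1]],
    columns=['id', 'timestamp', 'indicator'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'])

ranges = (
    # Same key as above: the counter increases at every indicator == 0 row
    main_df.assign(ind_grp=main_df['indicator'].eq(0).cumsum())
    # Assumption: zero rows only separate ranges, so drop them before aggregating
    .loc[lambda d: d['indicator'].eq(1)]
    .groupby(['id', 'ind_grp'])
    .agg(start=('timestamp', 'min'), end=('timestamp', 'max'))
    .reset_index()[['id', 'start', 'end']]
)
print(ranges)
```

With the zero rows filtered out first, the three ranges should line up with the desired dataframe in the question.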
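Reply from a uj5u.com user:
IIUC (if I understand correctly):
- groupby the runs of consecutive equal values in the indicator column and record the min and max timestamp of each run as "start" and "end".
- Drop the unneeded columns and the duplicates.
# Label each run of consecutive equal indicator values
runs = main_df["indicator"].ne(main_df["indicator"].shift()).cumsum()
main_df["start"] = main_df.groupby(runs)["timestamp"].transform("min")
main_df["end"] = main_df.groupby(runs)["timestamp"].transform("max")
# Keep the indicator == 1 rows, then drop the unneeded columns and the duplicates
output = main_df[main_df["indicator"].eq(1)].drop(columns=["timestamp", "indicator"]).drop_duplicates()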
>>> output
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
3  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
6  site b 2021-04-09 20:00:00 2021-04-10 20:30:00
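To make the grouping key in this answer easier to follow, here is a small sketch of what ne, shift and cumsum produce on the indicator values from the question. The variable names changed and run_id are illustrative only.

```python
import pandas as pd

# Indicator values from the question, in row order
indicator = pd.Series([1, 1, 0, 1, 1, 0, 1, 1, 1], name="indicator")

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it always counts as a change.
changed = indicator.ne(indicator.shift())

# A running count of the changes gives one label per run of equal values
run_id = changed.cumsum()

print(pd.DataFrame({"indicator": indicator, "changed": changed, "run_id": run_id}))
```

Rows that share a run_id belong to the same unbroken stretch of 0s or 1s, which is why grouping on it and taking the min and max timestamp gives the start and end of each stretch.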
Reply from a uj5u.com user:
Here is one solution:
group = main_df[main_df['indicator'] == 1].groupby(main_df['indicator'].ne(main_df['indicator'].shift(1)).cumsum()[main_df['indicator'] == 1])
desired_df = pd.DataFrame({'id': group['id'].first().tolist(), 'start': group['timestamp'].first().tolist(), 'end': group['timestamp'].last().tolist()})
Output:
>>> desired_df
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
2  site b 2021-04-09 20:00:00 2021-04-10 20:30:00
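One caveat, as far as I can tell: the run-based answers label runs from the indicator column alone, so if one site's rows end with indicator = 1 and the next site's rows begin with indicator = 1, the two would be merged into a single range. That does not happen in the sample data, since site b starts with a 0. A hedged sketch of a variant that also breaks runs when the id changes is below. The helper indicator_ranges and the example frame are hypothetical, and the sketch assumes the rows are already sorted by id and timestamp.

```python
import pandas as pd

def indicator_ranges(df):
    """Hypothetical helper: one row per run of indicator == 1 within each id."""
    # A new run starts when either the indicator or the id changes
    new_run = df["indicator"].ne(df["indicator"].shift()) | df["id"].ne(df["id"].shift())
    run_id = new_run.cumsum()
    # Keep only the indicator == 1 rows; run_id is aligned on the row index
    ones = df[df["indicator"].eq(1)].assign(run_id=run_id)
    return (
        ones.groupby(["id", "run_id"], sort=False)
        .agg(start=("timestamp", "min"), end=("timestamp", "max"))
        .reset_index()[["id", "start", "end"]]
    )

# Made-up example: site a ends with 1 and site b begins with 1
example = pd.DataFrame(
    [["site a", "2021-03-05 01:00:00", 1],
     ["site a", "2021-03-05 01:30:00", 1],
     ["site b", "2021-04-08 20:00:00", 1],
     ["site b", "2021-04-08 20:30:00", 0]],
    columns=["id", "timestamp", "indicator"])
example["timestamp"] = pd.to_datetime(example["timestamp"])

result = indicator_ranges(example)
print(result)
```

Here the two sites come out as separate ranges even though no indicator = 0 row sits between them.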
