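I need to create a dataframe in which overlapping start and end datetimes are removed, for multiple ids. I will use the start and end datetimes to aggregate values in a high-frequency pandas dataframe, so I need to remove the overlaps in mst_df.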
import pandas as pd
#Proxy reference dataframe
master = [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], #slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], #overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], #overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], #overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], #overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00']] #slightly overlapping
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'])
mst_df['end'] = pd.to_datetime(mst_df['end'])
Desired dataframe:
id      start                end
site a  2021-07-08 00:00:00  2021-07-08 12:00:00
site a  2021-07-08 17:36:00  2021-07-09 11:40:00
site b  2021-07-08 00:00:00  2021-07-08 10:24:00
site b  2021-07-08 17:32:00  2021-07-09 13:00:00
Answer from a uj5u.com user:
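I don't know whether pandas has a special function for this. It has Interval.overlaps() to check whether two ranges overlap (it even works with datetime), but I don't see a function that merges the two ranges, so the merge still needs its own code. Fortunately, that is easy.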
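As a small illustration of that check, here is a minimal sketch using three of the 'site a' ranges from the question. The variable names and the choice of closed='both' are assumptions made for the example, and the merge at the end is written by hand with min and max rather than taken from any pandas function.

```python
import pandas as pd

# Three 'site a' ranges from the question, as closed intervals
# (closed='both' is an assumption: touching ranges count as overlapping).
a = pd.Interval(pd.Timestamp('2021-07-08 00:00:00'),
                pd.Timestamp('2021-07-08 10:56:00'), closed='both')
b = pd.Interval(pd.Timestamp('2021-07-08 06:00:00'),
                pd.Timestamp('2021-07-08 12:00:00'), closed='both')
c = pd.Interval(pd.Timestamp('2021-07-08 17:36:00'),
                pd.Timestamp('2021-07-09 11:40:00'), closed='both')

print(a.overlaps(b))  # a and b share 06:00 to 10:56
print(a.overlaps(c))  # c starts after a ends

# Hand-written merge: the union of two overlapping ranges.
if a.overlaps(b):
    merged = pd.Interval(min(a.left, b.left), max(a.right, b.right),
                         closed='both')
    print(merged)
```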
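The rows are sorted by start, so two consecutive rows do not overlap when previous_end < next_start; I use that check in a for-loop.
But first I group by id, so that each site is processed separately.
Next, I take the first row (as previous) and compare it with each of the other rows (as next), checking previous_end < next_start.
If it is True, I can put previous in the result list and continue with next as the new previous for the remaining rows.
If it is False, I create a new range from both rows and use it to process the remaining rows.
Finally, I add previous to the list.
After processing all groups, I convert the whole list to a DataFrame.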
import pandas as pd

# Proxy reference dataframe
master = [
    ['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
    ['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'],  # slightly overlapping
    ['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
    ['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'],  # overlapping
    ['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'],  # overlapping
    ['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
    ['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'],  # overlapping
    ['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
    ['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'],  # overlapping
    ['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00'],  # slightly overlapping
]
mst_df = pd.DataFrame(master, columns=['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'])
mst_df['end'] = pd.to_datetime(mst_df['end'])

result = []

for val, group in mst_df.groupby('id'):
    # the check below relies on rows being sorted by start
    group = group.sort_values('start')
    # get first row (a copy, because it is modified below)
    prev = group.iloc[0].copy()

    for idx, item in group.iloc[1:].iterrows():
        if prev['end'] < item['start']:
            # not overlapping - put previous in results and use next as previous
            result.append(prev)
            prev = item
        else:
            # overlapping - merge both rows into one range (start, end)
            prev['start'] = min(prev['start'], item['start'])
            prev['end'] = max(prev['end'], item['end'])

    # add the last range, when there is no next item
    result.append(prev)

print(pd.DataFrame(result))
Result:
   id      start                end
0  site a  2021-07-08 00:00:00  2021-07-08 12:00:00
2  site a  2021-07-08 17:36:00  2021-07-09 11:40:00
5  site b  2021-07-08 00:00:00  2021-07-08 10:24:00
7  site b  2021-07-08 17:32:00  2021-07-09 13:00:00
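For larger frames, the same merge can probably be done without a Python-level loop. The following is a sketch of a different technique from the loop above (a running maximum of end per id, then a cumulative sum to label blocks), not the answerer's method. The helper name merge_overlaps and the temporary column block are hypothetical, and the sample data is a shortened copy of the question's frame.

```python
import pandas as pd

def merge_overlaps(df):
    """Merge overlapping [start, end] ranges per id (hypothetical helper)."""
    df = df.sort_values(['id', 'start']).reset_index(drop=True)
    # Latest end seen among the earlier rows of the same id (NaT for the first row).
    prev_end = df.groupby('id')['end'].transform(lambda s: s.cummax().shift())
    # A row opens a new block when it starts after every earlier range has ended.
    # Comparing with NaT gives False, so the first row of each id stays in block 0.
    new_block = df['start'] > prev_end
    df['block'] = new_block.cumsum()
    # One output row per (id, block): earliest start, latest end.
    out = (df.groupby(['id', 'block'], as_index=False)
             .agg(start=('start', 'min'), end=('end', 'max')))
    return out.drop(columns='block')

# Shortened sample data taken from the question.
rows = [
    ['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
    ['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'],
    ['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
    ['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'],
    ['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
    ['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
    ['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00'],
]
sample = pd.DataFrame(rows, columns=['id', 'start', 'end'])
sample['start'] = pd.to_datetime(sample['start'])
sample['end'] = pd.to_datetime(sample['end'])

merged_df = merge_overlaps(sample)
print(merged_df)
```

The running maximum matters when one range is nested inside an earlier, longer one: comparing only with the immediately preceding row's end would split such a block too early. Like the loop, this sketch treats ranges that merely touch (end equal to the next start) as one block.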
