Given the following dummy DataFrame:
----------------- ----------------- -------
| Start_Date | End_Date | Price |
----------------- ----------------- -------
| 01/01/2021 0:00 | 01/01/2021 0:59 | 10 |
| 01/01/2021 0:01 | 01/01/2021 0:01 | 20 |
| 01/01/2021 0:02 | 01/01/2021 0:02 | 24 |
| 01/01/2021 0:03 | 01/01/2021 0:03 | 23 |
| 01/01/2021 0:07 | 01/01/2021 0:07 | 34 |
| 01/01/2021 0:08 | 01/01/2021 0:08 | 37 |
| 01/01/2021 0:10 | 01/01/2021 0:10 | 21 |
| 01/01/2021 0:12 | 01/01/2021 0:12 | 22 |
| 01/01/2021 0:14 | 01/01/2021 0:14 | 56 |
----------------- ----------------- -------
The DataFrame above can be generated with the following code:
import pandas as pd

data = {'Start_Date': ['2021-01-01 00:00:00', '2021-01-01 00:01:00', '2021-01-01 00:02:00', '2021-01-01 00:03:00', '2021-01-01 00:07:00',
                       '2021-01-01 00:08:00', '2021-01-01 00:10:00', '2021-01-01 00:12:00', '2021-01-01 00:14:00'],
        'End_Date': ['2021-01-01 00:59:00', '2021-01-01 00:01:59', '2021-01-01 00:02:59', '2021-01-01 00:03:59', '2021-01-01 00:07:59',
                     '2021-01-01 00:08:59', '2021-01-01 00:10:59', '2021-01-01 00:12:59', '2021-01-01 00:14:59'],
        'Avg_Price': [10, 20, 24, 23, 34, 37, 21, 22, 56]}
df1 = pd.DataFrame(data)
# Parse the string columns into datetime64 values
df1['Start_Date'] = pd.to_datetime(df1['Start_Date'])
df1['End_Date'] = pd.to_datetime(df1['End_Date'])
As can be seen, there are date ranges with missing data. The missing ranges are visible in the DataFrame below:
--------------------- --------------------- -------
| Start_Date | End_Date | Price |
--------------------- --------------------- -------
| 2021-01-01 00:00:00 | 2021-01-01 00:59:00 | 10 |
| 2021-01-01 00:01:00 | 2021-01-01 00:01:59 | 20 |
| 2021-01-01 00:02:00 | 2021-01-01 00:02:59 | 24 |
| 2021-01-01 00:03:00 | 2021-01-01 00:03:59 | 23 |
| 2021-01-01 00:04:00 | NaT | NaN |
| 2021-01-01 00:05:00 | NaT | NaN |
| 2021-01-01 00:06:00 | NaT | NaN |
| 2021-01-01 00:07:00 | 2021-01-01 00:07:59 | 34 |
| 2021-01-01 00:08:00 | 2021-01-01 00:08:59 | 37 |
| 2021-01-01 00:09:00 | NaT | NaN |
| 2021-01-01 00:10:00 | 2021-01-01 00:10:59 | 21 |
| 2021-01-01 00:11:00 | NaT | NaN |
| 2021-01-01 00:12:00 | 2021-01-01 00:12:59 | 22 |
| 2021-01-01 00:13:00 | NaT | NaN |
| 2021-01-01 00:14:00 | 2021-01-01 00:14:59 | 56 |
--------------------- --------------------- -------
The DataFrame above can be generated with the following code:
df2 = pd.DataFrame(index=pd.date_range('2021-01-01 00:00:00', '2021-01-01 00:14:00', freq='min'))
df2 = df2.join(df1.set_index('Start_Date'))
I want a list of lists containing the date ranges where data is missing.
Expected output:
result = [['2021-01-01 00:04:00','2021-01-01 00:06:00'], ['2021-01-01 00:09:00','2021-01-01 00:09:00'],
['2021-01-01 00:11:00', '2021-01-01 00:11:00'], ['2021-01-01 00:13:00','2021-01-01 00:13:00']]
What is an elegant way to produce the required output?
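A uj5u.com user replied:
The idea is to build groups of consecutive NaNs using Series.notna with Series.cumsum, keep only the NaN rows by inverting the mask with ~, aggregate with GroupBy.agg using min and max, and finally convert the result to strings and then to nested lists: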
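# True where End_Date is present, False where it is missing
m = df2['End_Date'].notna()
# m.cumsum() stays constant across a run of missing rows, so it labels each gap
L = (df2.index.to_series()
        .groupby(m.cumsum()[~m])
        .agg(['min','max'])
        .astype(str)
        .to_numpy()
        .tolist())
print(L)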
[['2021-01-01 00:04:00', '2021-01-01 00:06:00'],
['2021-01-01 00:09:00', '2021-01-01 00:09:00'],
['2021-01-01 00:11:00', '2021-01-01 00:11:00'],
['2021-01-01 00:13:00', '2021-01-01 00:13:00']]
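To see why the cumulative sum groups consecutive gaps, it may help to look at the intermediate grouping key. The following is only a minimal sketch under stated assumptions: it rebuilds df2 from the question's sample data, checks the Avg_Price column instead of End_Date (either works here, since both are missing on the same rows), and wraps the steps in a helper named missing_ranges, which is a hypothetical name and not part of pandas.

```python
import pandas as pd

# Rebuild the question's sample data (only the columns needed here).
starts = pd.to_datetime([
    '2021-01-01 00:00:00', '2021-01-01 00:01:00', '2021-01-01 00:02:00',
    '2021-01-01 00:03:00', '2021-01-01 00:07:00', '2021-01-01 00:08:00',
    '2021-01-01 00:10:00', '2021-01-01 00:12:00', '2021-01-01 00:14:00'])
df1 = pd.DataFrame({'Start_Date': starts,
                    'Avg_Price': [10, 20, 24, 23, 34, 37, 21, 22, 56]})

# Full one-minute grid, left-joined so absent minutes become NaN rows.
full = pd.date_range('2021-01-01 00:00:00', '2021-01-01 00:14:00', freq='min')
df2 = pd.DataFrame(index=full).join(df1.set_index('Start_Date'))


def missing_ranges(frame, column):
    """Hypothetical helper: [start, end] strings for each run of NaN in `column`."""
    present = frame[column].notna()
    # The counter only increases on present rows, so every row in one
    # uninterrupted run of missing rows shares the same counter value.
    key = present.cumsum()[~present]
    spans = frame.index.to_series().groupby(key).agg(['min', 'max'])
    return spans.astype(str).to_numpy().tolist()


present = df2['Avg_Price'].notna()
key = present.cumsum()[~present]
print(key.tolist())  # one label per missing row: [4, 4, 4, 6, 7, 8]

ranges = missing_ranges(df2, 'Avg_Price')
print(ranges)
```

The three rows labelled 4 are the 00:04 to 00:06 gap; each remaining label appears once, which is why the other ranges start and end on the same minute.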
A uj5u.com user replied:
A null check combined with np.split and a list comprehension works for the shared sample data:
import numpy as np

# running count of missing End_Date values
boolean = df2.End_Date.isna().cumsum()
# get position count for each null end
val = boolean[boolean.gt(0) & boolean.duplicated()].array.unique()
# timestamps whose End_Date is missing
nulls = df2.index[df2.End_Date.isna()]
[[ent[0], ent[-1]]
 for ent in np.split(nulls, val)
 if ent.size > 0]
[[Timestamp('2021-01-01 00:04:00'), Timestamp('2021-01-01 00:06:00')],
[Timestamp('2021-01-01 00:09:00'), Timestamp('2021-01-01 00:09:00')],
[Timestamp('2021-01-01 00:11:00'), Timestamp('2021-01-01 00:11:00')],
[Timestamp('2021-01-01 00:13:00'), Timestamp('2021-01-01 00:13:00')]]
# if you prefer string form:
[[str(ent[0]), str(ent[-1])]
 for ent in np.split(nulls, val)
 if ent.size > 0]
[['2021-01-01 00:04:00', '2021-01-01 00:06:00'],
['2021-01-01 00:09:00', '2021-01-01 00:09:00'],
['2021-01-01 00:11:00', '2021-01-01 00:11:00'],
['2021-01-01 00:13:00', '2021-01-01 00:13:00']]
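The split points can also be derived from the jumps between consecutive missing timestamps rather than from duplicated cumulative counts. The sketch below is a swapped-in variant, not the answer's own code. It assumes a regular one-minute grid and at least one missing timestamp, and it hardcodes nulls to the missing minutes of the sample data so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Missing minutes from the sample data, hardcoded to keep the sketch self-contained.
nulls = pd.to_datetime([
    '2021-01-01 00:04:00', '2021-01-01 00:05:00', '2021-01-01 00:06:00',
    '2021-01-01 00:09:00', '2021-01-01 00:11:00', '2021-01-01 00:13:00'])

step = np.timedelta64(1, 'm')  # assumed spacing of the full time grid
values = nulls.to_numpy()      # plain datetime64 array

# A new run starts wherever the jump from the previous missing timestamp
# is larger than one step; +1 turns diff positions into split positions.
breaks = np.flatnonzero(np.diff(values) > step) + 1

# Assumes `values` is non-empty; an empty input would need a guard here.
runs = np.split(values, breaks)
result = [[str(pd.Timestamp(run[0])), str(pd.Timestamp(run[-1]))]
          for run in runs]
print(result)
```

Because every split point marks the start of a new run, no piece is empty and the size filter is not needed in this variant.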
Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/313014.html
