I have a data frame with two columns: one holds a status, the other the datetime at which that status started:
>>> df
status date_start
0 NaN 2021-12-06 09:00:00
1 busy 2021-12-06 09:17:02
2 free 2021-12-06 09:18:32
3 busy 2021-12-06 09:32:45
4 busy 2021-12-06 09:41:07
5 busy 2021-12-06 10:08:01
6 free 2021-12-06 10:17:00
7 NaN 2021-12-06 10:18:01
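The dataset is already sorted by date_start, from oldest to newest.
I need to add another column, date_start_busy, that gives for each row the datetime at which its "busy" period started. The rules are:
- If the status is "free" or NaN, date_start_busy is NaN.
- If the status is "busy" and the previous status is "free", date_start_busy = date_start.
- If the status is "busy" and the previous status is also "busy", date_start_busy is the previous row's date_start_busy.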
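For reference, the rules above can be sketched as a plain loop. This is only a baseline to check a vectorized version against, not the approach being asked for. The function name `busy_period_starts` is hypothetical, `None` stands in for NaN, and a "busy" row that follows a NaN row is assumed to start a new busy period (the rules do not say so, but row 1 of the expected output implies it).

```python
def busy_period_starts(statuses, starts):
    """Apply the rules row by row; None marks rows outside a busy period."""
    result = []
    current = None  # start of the busy run we are currently in, if any
    for status, start in zip(statuses, starts):
        if status == "busy":
            # First busy row of a run: remember where the run began.
            if current is None:
                current = start
        else:
            # "free" or missing status ends any running busy period.
            current = None
        result.append(current)
    return result


statuses = [None, "busy", "free", "busy", "busy", "busy", "free", None]
starts = [
    "2021-12-06 09:00:00", "2021-12-06 09:17:02", "2021-12-06 09:18:32",
    "2021-12-06 09:32:45", "2021-12-06 09:41:07", "2021-12-06 10:08:01",
    "2021-12-06 10:17:00", "2021-12-06 10:18:01",
]
loop_result = busy_period_starts(statuses, starts)
```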
The final data frame should look like this:
>>> df
status date_start date_start_busy
0 NaN 2021-12-06 09:00:00 NaN
1 busy 2021-12-06 09:17:02 2021-12-06 09:17:02
2 free 2021-12-06 09:18:32 NaN
3 busy 2021-12-06 09:32:45 2021-12-06 09:32:45
4 busy 2021-12-06 09:41:07 2021-12-06 09:32:45
5 busy 2021-12-06 10:08:01 2021-12-06 09:32:45
6 free 2021-12-06 10:17:00 NaN
7 NaN 2021-12-06 10:18:01 NaN
I know how to do this with a for loop, but my dataset is very large, so I would like a vectorized approach for better performance.
Thanks in advance!
Reply from a uj5u.com user:
One option is np.select:
import numpy as np

# Assumes date_start holds strings (object dtype); a datetime64 column
# may not mix with np.nan inside np.where / np.select.
cond1 = df.status.isna() | df.status.eq('free')
cond2 = df.status.shift().eq('free') & df.status.eq('busy')
cond3 = df.status.shift().eq('busy') & df.status.eq('busy')
# Extra steps for the third condition, which needs the first
# date_start of each run of consecutive 'busy' rows.
temp1 = df.status.ne('busy').cumsum()  # counter that increases on every non-busy row
temp2 = df.status.eq('busy')           # True on busy rows
temp3 = df.date_start.groupby([temp1, temp2], sort=False).transform('first')
temp3 = np.where(temp2, temp3, np.nan)
condlist = [cond1, cond2, cond3]
choicelist = [np.nan, df.date_start, temp3]
# Rows matching no condition (busy right after NaN) fall back to date_start.
df.assign(date_start_busy=np.select(condlist,
                                    choicelist,
                                    default=df.date_start))
status date_start date_start_busy
0 NaN 2021-12-06 09:00:00 NaN
1 busy 2021-12-06 09:17:02 2021-12-06 09:17:02
2 free 2021-12-06 09:18:32 NaN
3 busy 2021-12-06 09:32:45 2021-12-06 09:32:45
4 busy 2021-12-06 09:41:07 2021-12-06 09:32:45
5 busy 2021-12-06 10:08:01 2021-12-06 09:32:45
6 free 2021-12-06 10:17:00 NaN
7 NaN 2021-12-06 10:18:01 NaN
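As a possible alternative, the same column can be built with pandas alone by marking the first row of each busy run and forward-filling its date. This is a minimal sketch, not the answer's method. It assumes date_start has been parsed with pd.to_datetime, so missing values come out as NaT rather than NaN, and the names `busy`, `run_start` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "status": [None, "busy", "free", "busy", "busy", "busy", "free", None],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00", "2021-12-06 09:17:02", "2021-12-06 09:18:32",
        "2021-12-06 09:32:45", "2021-12-06 09:41:07", "2021-12-06 10:08:01",
        "2021-12-06 10:17:00", "2021-12-06 10:18:01",
    ]),
})

busy = df["status"].eq("busy")  # missing statuses compare as False
# A busy run starts on a busy row whose previous row is not busy.
run_start = busy & ~busy.shift(fill_value=False)
# Keep date_start only at run starts, carry it forward through the run,
# then blank out every non-busy row again.
date_start_busy = df["date_start"].where(run_start).ffill().where(busy)
result = df.assign(date_start_busy=date_start_busy)
```

Because nothing here mixes floats with datetimes, the new column keeps a datetime dtype, which may be more convenient for later time arithmetic.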
