Python Pandas: fill a column based on the previous row of the same column and a condition on another column

I have a DataFrame with two columns: one holds a status, and the other holds the datetime at which that status started:

>>> df
  status           date_start
0    NaN  2021-12-06 09:00:00
1   busy  2021-12-06 09:17:02
2   free  2021-12-06 09:18:32
3   busy  2021-12-06 09:32:45
4   busy  2021-12-06 09:41:07
5   busy  2021-12-06 10:08:01
6   free  2021-12-06 10:17:00
7    NaN  2021-12-06 10:18:01

The dataset is already sorted by date_start, from oldest to newest.
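To reproduce the frame shown above, something like the following could be used. This is only a sketch: the question does not say whether date_start holds strings or datetimes, so parsing it with pd.to_datetime is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; storing date_start as datetime64
# is an assumption, the original column may just as well hold strings.
df = pd.DataFrame({
    "status": [np.nan, "busy", "free", "busy", "busy", "busy", "free", np.nan],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00",
        "2021-12-06 09:17:02",
        "2021-12-06 09:18:32",
        "2021-12-06 09:32:45",
        "2021-12-06 09:41:07",
        "2021-12-06 10:08:01",
        "2021-12-06 10:17:00",
        "2021-12-06 10:18:01",
    ]),
})

# Check that the rows really are ordered from oldest to newest.
is_sorted = df["date_start"].is_monotonic_increasing
print(is_sorted)
```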

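I need to add another column that tells me, for each row, the datetime at which the current "busy" period started (date_start_busy). The rules are:

If the status is "free" or NaN, date_start_busy is NaN
If the status is "busy" and the previous status is "free", then date_start_busy = date_start
If the status is "busy" and the previous status is also "busy", then date_start_busy should be the previous row's date_start_busy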
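For reference, the three rules can be written as a plain row-by-row loop. This is a sketch of the slow baseline, not the vectorized solution being asked for. The helper name busy_start_loop is hypothetical, and treating a preceding NaN like "free" is an assumption taken from the expected output in the question, where row 1 follows a NaN row and still gets its own date_start.

```python
import numpy as np
import pandas as pd

def busy_start_loop(df):
    """Hypothetical helper: apply the three rules one row at a time."""
    result = []
    current = pd.NaT      # start of the busy run we are currently in
    prev_busy = False     # was the previous row "busy"?
    for status, start in zip(df["status"], df["date_start"]):
        if status == "busy":
            if not prev_busy:
                # A new busy run begins here (previous row was free or NaN).
                current = start
            result.append(current)
            prev_busy = True
        else:
            # "free" or NaN: no busy period for this row.
            result.append(pd.NaT)
            prev_busy = False
    return pd.Series(result, index=df.index, name="date_start_busy")

# Sample data from the question (datetime dtype is an assumption).
df = pd.DataFrame({
    "status": [np.nan, "busy", "free", "busy", "busy", "busy", "free", np.nan],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00", "2021-12-06 09:17:02",
        "2021-12-06 09:18:32", "2021-12-06 09:32:45",
        "2021-12-06 09:41:07", "2021-12-06 10:08:01",
        "2021-12-06 10:17:00", "2021-12-06 10:18:01",
    ]),
})
out = busy_start_loop(df)
print(out)
```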

The final DataFrame should look like this:

>>> df
  status           date_start      date_start_busy
0    NaN  2021-12-06 09:00:00                  NaN
1   busy  2021-12-06 09:17:02  2021-12-06 09:17:02
2   free  2021-12-06 09:18:32                  NaN
3   busy  2021-12-06 09:32:45  2021-12-06 09:32:45
4   busy  2021-12-06 09:41:07  2021-12-06 09:32:45
5   busy  2021-12-06 10:08:01  2021-12-06 09:32:45
6   free  2021-12-06 10:17:00                  NaN
7    NaN  2021-12-06 10:18:01                  NaN

I know how to do this with a for loop, but my dataset is very large, so I would like a vectorized approach for better performance.

Thanks in advance!

Reply from a uj5u.com user:

One option is np.select:

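import numpy as np

# Rule 1: NaN or "free" -> no busy period
cond1 = df.status.isna() | df.status.eq('free')
# Rule 2: "busy" right after "free" -> a busy period starts here
cond2 = df.status.shift().eq('free') & df.status.eq('busy')
# Rule 3: "busy" right after "busy" -> reuse the start of the current run
cond3 = df.status.shift().eq('busy') & df.status.eq('busy')

# some extra steps to take care of the third condition,
# which requires picking the very first value of each busy run
temp1 = df.status.ne('busy').cumsum()  # run id: increases at every non-busy row
temp2 = df.status.eq('busy')
temp3 = df.date_start.groupby([temp1, temp2], sort = False).transform('first')
temp3 = np.where(temp2, temp3, np.nan)

condlist = [cond1, cond2, cond3]
choicelist = [np.nan, df.date_start, temp3]
# the default covers a "busy" row whose previous status is NaN (row 1)
df.assign(date_start_busy = np.select(condlist, 
                                      choicelist, 
                                      default = df.date_start)
          )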

  status           date_start      date_start_busy
0    NaN  2021-12-06 09:00:00                  NaN
1   busy  2021-12-06 09:17:02  2021-12-06 09:17:02
2   free  2021-12-06 09:18:32                  NaN
3   busy  2021-12-06 09:32:45  2021-12-06 09:32:45
4   busy  2021-12-06 09:41:07  2021-12-06 09:32:45
5   busy  2021-12-06 10:08:01  2021-12-06 09:32:45
6   free  2021-12-06 10:17:00                  NaN
7    NaN  2021-12-06 10:18:01                  NaN
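A shorter variant, which is not part of the answer above, marks the first row of each busy run, forward-fills that timestamp, and then blanks out the non-busy rows. It is only a sketch under two assumptions: date_start is stored as datetime64 (so missing values appear as NaT rather than NaN), and a "busy" row that follows NaN starts a new run, as in the expected output. The names is_busy and run_start are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question (datetime dtype is an assumption).
df = pd.DataFrame({
    "status": [np.nan, "busy", "free", "busy", "busy", "busy", "free", np.nan],
    "date_start": pd.to_datetime([
        "2021-12-06 09:00:00", "2021-12-06 09:17:02",
        "2021-12-06 09:18:32", "2021-12-06 09:32:45",
        "2021-12-06 09:41:07", "2021-12-06 10:08:01",
        "2021-12-06 10:17:00", "2021-12-06 10:18:01",
    ]),
})

# True only for rows whose status is "busy" (NaN compares as False).
is_busy = df["status"].eq("busy")

# A run starts where the row is busy and the previous row is not.
run_start = is_busy & ~is_busy.shift(fill_value=False)

# Keep date_start at run starts, carry it forward through the run,
# then drop the carried value again on rows that are not busy.
df["date_start_busy"] = (
    df["date_start"].where(run_start).ffill().where(is_busy)
)
print(df)
```

Because the whole computation stays inside pandas, the new column keeps a datetime dtype instead of falling back to a mixed object column.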

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/376307.html
