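I have a data set with roughly 70,000 data points, and I am testing some code on a sample data set to make sure it will work on the large one. The sample data set has the following format: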
import numpy as np
import pandas as pd
df = pd.DataFrame({
'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
'time': ['2009-07-09 15:00:00',
'2009-07-09 18:33:00',
'2009-07-09 20:55:00',
'2009-07-10 00:01:00',
'2009-07-10 09:00:00',
'2009-07-10 15:00:00',
'2009-07-10 18:00:00',
'2009-07-11 00:01:00',
'2009-07-12 03:10:00',
'2009-07-09 06:00:00',
'2009-07-10 15:00:00',
'2009-07-11 18:00:00',
'2009-07-11 21:00:00',
'2009-07-12 00:30:00',
'2009-07-12 12:05:00',
'2009-07-12 15:00:00',
'2009-07-13 21:00:00',
'2009-07-14 00:01:00'],
'Score': [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0],
})
print(df)
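I am essentially trying to create two indicator columns. The first follows this rule: for each condition (A and B), once a score of -1 occurs, that row and every remaining row of that condition should be marked "1". The second should indicate, for each row, whether at least 24 hours have passed since the most recent score of -1. The final result should look like this: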
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
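This is similar to a question I asked yesterday about Indicator 1, but I realized that my large data set has many conditions (700), so I need help applying the Indicator 1 solution without writing out every cond value individually.
For Indicator 2, I have been looking at rolling window functions, but every rolling window example I have found computes a rolling sum or a rolling mean, which is not what I am trying to calculate. I am not sure whether what I want can be done with a rolling window at all.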
A uj5u.com user replied:
Try:
# convert the time strings to datetimes so they can be compared and subtracted
df["time"] = pd.to_datetime(df["time"])
# get the first time the score is -1 for each cond
first = df["cond"].map(df[df["Score"].eq(-1)].groupby("cond")["time"].min())
# get the most recent time the score was -1, forward-filled within each cond
recent = df["time"].where(df["Score"].eq(-1)).groupby(df["cond"]).ffill()
# Indicator 1: the row is at or after the first -1 of its cond
df["Indicator 1"] = df["time"].ge(first).astype(int)
# Indicator 2: at least 1 full day has passed since the most recent -1
df["Indicator 2"] = df["time"].sub(recent).dt.days.ge(1).astype(int)
>>> df
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
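One thing to watch when forward-filling the time of the most recent -1: a fill over the whole frame carries a timestamp from one cond into the next. The sample data happens not to expose this, because B starts before A's first -1. The sketch below uses a small hypothetical frame (the name demo and its values are made up for illustration) to show the difference between a plain fill and a fill done within each cond.

```python
import pandas as pd

# Hypothetical two-condition frame: A has a -1, B never does.
demo = pd.DataFrame({
    "cond": ["A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2009-07-09 15:00:00",
        "2009-07-09 18:00:00",
        "2009-07-12 06:00:00",
        "2009-07-12 09:00:00",
    ]),
    "Score": [-1.0, 0.0, 0.0, 0.0],
})

# Keep the timestamp only on rows where Score is -1 (NaT elsewhere).
neg_time = demo["time"].where(demo["Score"].eq(-1))

# Plain forward fill: A's -1 timestamp is carried into B's rows.
leaky = neg_time.ffill()
# Forward fill within each cond: B's rows stay NaT.
per_cond = neg_time.groupby(demo["cond"]).ffill()

# A NaT difference gives NaN days, and NaN >= 1 is False, so those rows get 0.
ind2_leaky = demo["time"].sub(leaky).dt.days.ge(1).astype(int)
ind2_per_cond = demo["time"].sub(per_cond).dt.days.ge(1).astype(int)

print(ind2_leaky.tolist())     # [0, 0, 1, 1]
print(ind2_per_cond.tolist())  # [0, 0, 0, 0]
```

Since B never scores -1, the second result is the one that matches the rule in the question.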
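A uj5u.com user replied:
A simple approach, IMO: use cummax for the first indicator; for the second, combine the time difference from the first row of each group with a mask: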
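# indicator 1: becomes 1 at the first -1 within each cond and stays 1
df['Indicator 1'] = df['Score'].eq(-1).astype(int).groupby(df['cond']).cummax()
# indicator 2
# convert to datetime
df['time'] = pd.to_datetime(df['time'])
# running count of -1 scores within each cond; every -1 starts a new sub-group
m1 = df['Score'].eq(-1).groupby(df['cond']).cumsum()
# time of the first row of each sub-group (the most recent -1 when m1 > 0)
start = df.groupby(['cond', m1])['time'].transform('first')
# has at least 24h passed since the sub-group started?
m2 = df['time'].sub(start).ge(pd.Timedelta('24h'))
# rows before the first -1 (m1 == 0) must stay 0
df['Indicator 2'] = (m1.gt(0) & m2).astype(int)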
Output:
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
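For the large data set, none of the groupby-based steps above need the cond values written out, so the same code should cover any number of conditions. The sketch below wraps both indicators in one function and checks the result against the expected table from the question. It is only one possible arrangement: the names add_indicators and window are made up here, and it assumes the rows can be sorted by cond and time before the indicators are computed.

```python
import pandas as pd

def add_indicators(df, window="24h"):
    """Return a copy of df, sorted by cond and time, with both indicators.

    add_indicators and window are illustrative names, not from the thread.
    """
    out = df.copy()
    out["time"] = pd.to_datetime(out["time"])
    # The cumulative steps below assume chronological order within each cond.
    out = out.sort_values(["cond", "time"])
    is_neg = out["Score"].eq(-1)
    # Indicator 1: 1 from the first -1 onward within each cond.
    out["Indicator 1"] = is_neg.astype(int).groupby(out["cond"]).cummax()
    # Timestamp of the most recent -1 within each cond (NaT before the first).
    last_neg = out["time"].where(is_neg).groupby(out["cond"]).ffill()
    # Comparisons against NaT are False, so rows before the first -1 get 0.
    elapsed = out["time"] - last_neg
    out["Indicator 2"] = elapsed.ge(pd.Timedelta(window)).astype(int)
    return out

# Sample data from the question.
df = pd.DataFrame({
    "cond": ["A"] * 9 + ["B"] * 9,
    "time": [
        "2009-07-09 15:00:00", "2009-07-09 18:33:00", "2009-07-09 20:55:00",
        "2009-07-10 00:01:00", "2009-07-10 09:00:00", "2009-07-10 15:00:00",
        "2009-07-10 18:00:00", "2009-07-11 00:01:00", "2009-07-12 03:10:00",
        "2009-07-09 06:00:00", "2009-07-10 15:00:00", "2009-07-11 18:00:00",
        "2009-07-11 21:00:00", "2009-07-12 00:30:00", "2009-07-12 12:05:00",
        "2009-07-12 15:00:00", "2009-07-13 21:00:00", "2009-07-14 00:01:00",
    ],
    "Score": [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0,
              0.0, -1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0],
})

result = add_indicators(df)
print(result)
```

The sample rows are already ordered by cond and time, so the sort leaves them in place and the printed indicator columns can be compared line by line with the expected table in the question.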
