Hello, I have a DataFrame that looks like this:
   HostName                Date
0         B 2021-01-01 12:42:00
1         B 2021-02-01 12:30:00
2         B 2021-02-01 12:40:00
3         B 2021-02-25 12:40:00
4         B 2021-03-01 12:41:00
5         B 2021-03-01 12:42:00
6         B 2021-03-02 12:43:00
7         B 2021-03-03 12:44:00
8         B 2021-04-04 12:44:00
9         B 2021-06-05 12:44:00
10        B 2021-08-06 12:44:00
11        B 2021-09-07 12:44:00
12        A 2021-03-12 12:45:00
13        A 2021-03-13 12:46:00
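For anyone who wants to reproduce this, the sample frame can probably be rebuilt roughly as follows. This is only a sketch: the values are copied from the table above, and parsing `Date` with `pd.to_datetime` is an assumption about how the column is stored.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "HostName": ["B"] * 12 + ["A"] * 2,
    "Date": [
        "2021-01-01 12:42:00", "2021-02-01 12:30:00", "2021-02-01 12:40:00",
        "2021-02-25 12:40:00", "2021-03-01 12:41:00", "2021-03-01 12:42:00",
        "2021-03-02 12:43:00", "2021-03-03 12:44:00", "2021-04-04 12:44:00",
        "2021-06-05 12:44:00", "2021-08-06 12:44:00", "2021-09-07 12:44:00",
        "2021-03-12 12:45:00", "2021-03-13 12:46:00",
    ],
})

# Parse the strings so that date arithmetic and comparisons work.
df["Date"] = pd.to_datetime(df["Date"])
print(df)
```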
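Below is what I did for the aggregation. It solves the problem, but it is not efficient at all: with 1 million rows it takes a very long time. Is there a better way to aggregate efficiently between dates?
Final result (for each row, ds counts the rows with the same HostName whose Date falls in the one-month window ending at that row's Date, lower bound excluded):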
   HostName                Date  ds
0         B 2021-01-01 12:42:00   1
1         B 2021-02-01 12:30:00   2
2         B 2021-02-01 12:40:00   3
3         B 2021-02-25 12:40:00   3
4         B 2021-03-01 12:41:00   2
5         B 2021-03-01 12:42:00   3
6         B 2021-03-02 12:43:00   4
7         B 2021-03-03 12:44:00   5
8         B 2021-04-04 12:44:00   1
9         B 2021-06-05 12:44:00   1
10        B 2021-08-06 12:44:00   1
11        B 2021-09-07 12:44:00   1
12        A 2021-03-12 12:45:00   1
13        A 2021-03-13 12:46:00   2
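import pandas as pd
TheList = []
for index, row in df.iterrows():
    # Rows whose Date lies in (row Date - 1 month, row Date]
    in_window = (df['Date'] > row['Date'] - pd.DateOffset(months=1)) & (df['Date'] <= row['Date'])
    # Count them per HostName and keep the count for this row's host
    TheList.append(df[in_window].groupby(['HostName']).size()[row['HostName']])
df['ds'] = TheList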
Is there a better way to do this that gives the same result?
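Answer from a uj5u.com user:
This uses broadcasting within each group: a custom function passed to GroupBy.transform compares every date with every other date in the group and counts the True values with sum.
Note: performance also depends on the group sizes. If there are a few very large groups, memory is likely to become a problem.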
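import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    a = x.to_numpy()                               # dates of the group
    b = x.sub(pd.DateOffset(months=1)).to_numpy()  # each date minus one month
    # Row i of the mask marks the dates j with b[i] < a[j] <= a[i]
    return np.sum((a > b[:, None]) & (a <= a[:, None]), axis=1)

df['ds'] = df.groupby('HostName')['Date'].transform(f)
print(df)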
   HostName                Date  ds
0         B 2021-01-01 12:42:00   1
1         B 2021-02-01 12:30:00   2
2         B 2021-02-01 12:40:00   3
3         B 2021-02-25 12:40:00   3
4         B 2021-03-01 12:41:00   2
5         B 2021-03-01 12:42:00   3
6         B 2021-03-02 12:43:00   4
7         B 2021-03-03 12:44:00   5
8         B 2021-04-04 12:44:00   1
9         B 2021-06-05 12:44:00   1
10        B 2021-08-06 12:44:00   1
11        B 2021-09-07 12:44:00   1
12        A 2021-03-12 12:45:00   1
13        A 2021-03-13 12:46:00   2
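To see where the memory cost of the broadcasting approach comes from, here is a minimal sketch of the same comparison on a toy array. The integers and the window width of 3 are made-up stand-ins for the timestamps and the one-month offset; they are not part of the original data.

```python
import numpy as np

# Hypothetical stand-ins: integers instead of timestamps, 3 instead of one month.
a = np.array([1, 2, 4, 7])
b = a - 3  # lower bound of each window

# b[:, None] and a[:, None] turn the bounds into columns, so comparing them
# with the row vector `a` produces a full n x n boolean matrix.
mask = (a > b[:, None]) & (a <= a[:, None])
counts = mask.sum(axis=1)

print(mask.shape)  # (4, 4)
print(counts)      # [1 2 2 1]
```

The mask has one entry per pair of rows in a group, so its size grows with the square of the group length. That should be fine for many small groups, but a single very large group can exhaust memory.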
Unfortunately, if memory is a problem, a loop is needed:
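df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].sub(pd.DateOffset(months=1))  # lower bound of each window

def f(x):
    one = x['Date'].to_numpy()
    both = x[['Date','Date1']].to_numpy()
    # One pass per row: count the group's dates in (Date1, Date]
    x['ds'] = [np.sum((one > b) & (one <= a)) for a, b in both]
    return x

# group_keys=False keeps the original flat index (newer pandas would otherwise add HostName as an index level)
df = df.groupby('HostName', group_keys=False).apply(f)
print(df)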
   HostName                Date               Date1  ds
0         B 2021-01-01 12:42:00 2020-12-01 12:42:00   1
1         B 2021-02-01 12:30:00 2021-01-01 12:30:00   2
2         B 2021-02-01 12:40:00 2021-01-01 12:40:00   3
3         B 2021-02-25 12:40:00 2021-01-25 12:40:00   3
4         B 2021-03-01 12:41:00 2021-02-01 12:41:00   2
5         B 2021-03-01 12:42:00 2021-02-01 12:42:00   3
6         B 2021-03-02 12:43:00 2021-02-02 12:43:00   4
7         B 2021-03-03 12:44:00 2021-02-03 12:44:00   5
8         B 2021-04-04 12:44:00 2021-03-04 12:44:00   1
9         B 2021-06-05 12:44:00 2021-05-05 12:44:00   1
10        B 2021-08-06 12:44:00 2021-07-06 12:44:00   1
11        B 2021-09-07 12:44:00 2021-08-07 12:44:00   1
12        A 2021-03-12 12:45:00 2021-02-12 12:45:00   1
13        A 2021-03-13 12:46:00 2021-02-13 12:46:00   2
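As a possible alternative that avoids both the n x n mask and the per-row Python loop, the count can also be obtained from two binary searches on the sorted dates of each group. This is not from the answer above; it is a sketch using np.searchsorted, and the helper name count_in_last_month is made up for illustration. It assumes the same window as before: (Date - 1 month, Date].

```python
import numpy as np
import pandas as pd

def count_in_last_month(dates: pd.Series) -> np.ndarray:
    """For each date, count the group's dates in (date - 1 month, date]."""
    values = dates.to_numpy()
    lower = (dates - pd.DateOffset(months=1)).to_numpy()
    sorted_values = np.sort(values)
    # side="right" gives the number of sorted dates <= the searched value.
    upper_count = np.searchsorted(sorted_values, values, side="right")
    lower_count = np.searchsorted(sorted_values, lower, side="right")
    # (dates <= date) minus (dates <= date - 1 month) = dates inside the window
    return upper_count - lower_count

# Sample data copied from the question.
df = pd.DataFrame({
    "HostName": ["B"] * 12 + ["A"] * 2,
    "Date": pd.to_datetime([
        "2021-01-01 12:42:00", "2021-02-01 12:30:00", "2021-02-01 12:40:00",
        "2021-02-25 12:40:00", "2021-03-01 12:41:00", "2021-03-01 12:42:00",
        "2021-03-02 12:43:00", "2021-03-03 12:44:00", "2021-04-04 12:44:00",
        "2021-06-05 12:44:00", "2021-08-06 12:44:00", "2021-09-07 12:44:00",
        "2021-03-12 12:45:00", "2021-03-13 12:46:00",
    ]),
})

df["ds"] = df.groupby("HostName")["Date"].transform(count_in_last_month)
print(df)
```

On the sample data this gives the same ds column as the two versions above. Memory stays linear in the group size, and the work per group is one sort plus two binary searches per row, so it should scale better to large groups, though it has not been benchmarked here.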
