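I need to evaluate millions of lines of performance log records from a manufacturing execution system. I need to group the data by date, class, and user name, and find the total "wait time" across a large number of concurrently running transactions. The data looks similar to what is in this dataframe: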
import pandas as pd
d = {'START_DATE': ['2021-08-07 19:11:40', '2021-08-07 19:11:40', '2021-08-07 19:11:40',
'2021-08-07 19:20:40', '2021-08-07 19:20:40', '2021-08-07 19:20:40',
'2021-08-07 19:21:40', '2021-08-07 19:21:40', '2021-08-07 19:21:40',
'2021-08-10 19:20:40', '2021-08-10 19:20:40', '2021-08-10 19:20:40',
'2021-08-10 19:21:40', '2021-08-10 19:21:40', '2021-08-10 19:21:40'
],
'ELAPSED_TIME': ['00:00:00.465', '00:00:01.000', '00:00:00.165',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000',
'00:00:00.100', '00:00:00.200', '00:03:00.000',
'00:05:00.000', '00:00:00.200', '00:00:03.000'
],
'TRANSACTION': ['a', 'b', 'c',
'a', 'd', 'c',
'e', 'a', 'b',
'a', 'd', 'c',
'e', 'a', 'b'
],
'USER': ['Bob', 'Bob', 'Bob',
'Biff', 'Biff', 'Biff',
'Biff', 'Biff', 'Biff',
'Bob', 'Bob', 'Bob',
'Bob', 'Bob', 'Bob'
],
'CLASS': ['AA', 'AA', 'AA',
'BB', 'BB', 'BB',
'BB', 'BB', 'BB',
'AA', 'AA', 'AA',
'AA', 'AA', 'AA'
]}
df = pd.DataFrame(data=d)
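See how the transactions start at the same time and run concurrently with each other, but "finish" at different times. For example, Bob's first set of transactions (rows 0-2) all take different amounts of time, but when I group by DATE, CLASS, and USER, I want to show a total wait time of 1000 ms (based on the second row's elapsed time).
On 08/07/2021, Biff has two sets of transactions that start at different times, but they still overlap into a single wait time of 360000 ms (19:20:40 to 19:26:40).
The expected output would be similar to: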
DATE        CLASS  USER  Wait
2021-08-07  AA     Bob   1000
2021-08-07  BB     Biff  360000
2021-08-10  AA     Bob   360000
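The 360000 ms figure for Biff can be checked by hand by merging overlapping intervals. The following is only a minimal standard-library sketch; `union_length` is a hypothetical helper name, not something from the original post, and the six intervals are Biff's rows from the dataframe above.

```python
from datetime import datetime, timedelta

def union_length(intervals):
    """Total time covered by (start, end) pairs, counting overlaps only once."""
    total = timedelta(0)
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous run, open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current run: extend it if this one finishes later.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

t1 = datetime(2021, 8, 7, 19, 20, 40)
t2 = datetime(2021, 8, 7, 19, 21, 40)
biff = [
    (t1, t1 + timedelta(milliseconds=100)),
    (t1, t1 + timedelta(milliseconds=200)),
    (t1, t1 + timedelta(minutes=3)),
    (t2, t2 + timedelta(minutes=5)),
    (t2, t2 + timedelta(milliseconds=200)),
    (t2, t2 + timedelta(seconds=3)),
]
wait_ms = union_length(biff) / timedelta(milliseconds=1)
print(wait_ms)  # 360000.0
```

The merged run spans 19:20:40 to 19:26:40, i.e. six minutes, even though the individual elapsed times add up to more than that.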
Like I mentioned, the actual data has several million lines of transactions. I am looking for help in finding something better (and hopefully faster) than what I have/found:
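def getSecs1(grp):
    # Build a 25 ms grid for every transaction in the group, union the grids,
    # and count the distinct ticks (each tick stands for 25 ms of wait time).
    return pd.DatetimeIndex([]).union_many([
        pd.date_range(row.START_DATE, row.END_DATE, freq='25ms', closed='left')
        for _, row in grp.iterrows()
    ]).size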
I add an END_DATE column by adding the elapsed milliseconds to START_DATE. I have to do it in chunks of 25 ms, otherwise it would take way too long.
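For reference, one way the END_DATE column could be derived is sketched below. This assumes the two columns are strings as in the sample dataframe; how the column was actually built is not shown in the post, so treat this as a guess at the step rather than the original code.

```python
import pandas as pd

# Two sample rows copied from the dataframe in the question.
df = pd.DataFrame({
    "START_DATE": ["2021-08-07 19:11:40", "2021-08-07 19:11:40"],
    "ELAPSED_TIME": ["00:00:00.465", "00:00:01.000"],
})

# Parse both columns, then add the duration to the start timestamp.
df["END_DATE"] = pd.to_datetime(df["START_DATE"]) + pd.to_timedelta(df["ELAPSED_TIME"])
print(df)
```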
Any help/advice would be greatly appreciated.
Edit: changed the overlap to minutes.
Answer from a uj5u.com user:
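This solution uses a package called staircase, which is built on pandas and numpy for working with (mathematical) step functions. You can think of an interval as a step function which goes from value 0 to 1 at the start of the interval, and from 1 back to 0 at the end of the interval.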
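To make the step-function idea concrete without staircase, here is a small numpy sketch. The intervals are hypothetical toy values in seconds (not taken from the question), and this illustrates the concept only, not how staircase is implemented internally.

```python
import numpy as np

# Hypothetical toy intervals, in seconds: [0, 3), [1, 5) and [8, 9).
starts = np.array([0.0, 1.0, 8.0])
ends = np.array([3.0, 5.0, 9.0])

# The step function goes up by 1 at every start and down by 1 at every end.
times = np.concatenate([starts, ends])
steps = np.concatenate([np.ones_like(starts), -np.ones_like(ends)])
order = np.argsort(times, kind="stable")
times, steps = times[order], steps[order]

level = np.cumsum(steps)   # value of the summed step function after each event
widths = np.diff(times)    # length of each piece between consecutive events

# Keep only the pieces where at least one interval is open (level > 0).
covered = widths[level[:-1] > 0].sum()
raw_total = (ends - starts).sum()
print(covered, raw_total)  # 6.0 8.0
```

The raw durations add up to 8 seconds, but the union of the intervals only covers 6 seconds, which is the quantity wanted for the wait time.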
Setup
Convert START_DATE and ELAPSED_TIME to the appropriate pandas time objects
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df["ELAPSED_TIME"] = pd.to_timedelta(df["ELAPSED_TIME"])
Define the daily bins
dates = pd.period_range("2021-08-07", "2021-08-10")
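The bins above are hard-coded for the sample data. With real logs, the range could presumably be derived from the data instead; a minimal sketch, assuming START_DATE has already been converted to datetimes (the variable `start` below is a stand-in for that column):

```python
import pandas as pd

# Stand-in for df["START_DATE"] after pd.to_datetime: first and last sample timestamps.
start = pd.to_datetime(pd.Series(["2021-08-07 19:11:40", "2021-08-10 19:21:40"]))

# One daily period for every calendar day between the earliest and latest start.
dates = pd.period_range(start.min(), start.max(), freq="D")
print(dates)
```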
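Solution
Define a function which takes a dataframe, makes a step function from the start and end times (the end being calculated as start + duration), sets non-zero values to 1, slices the step function with the bins, and integrates.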
import staircase as sc

def calc_dates_for_user(df_):
    return (
        sc.Stairs(  # create the step function
            start=df_["START_DATE"],
            end=df_["START_DATE"] + df_["ELAPSED_TIME"],
        )
        # Where two intervals overlap the step function has value 2. make_boolean
        # sets all non-zero values to 1, effectively creating a union of intervals.
        .make_boolean()
        .slice(dates)  # analogous to groupby, one slice per daily bin
        # Integrate each slice (which equals the covered length) and express
        # it in milliseconds, the unit used in the expected output.
        .integral() / pd.Timedelta("1ms")
    )
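When we group by USER and CLASS and apply this function, i.e. df.groupby(["USER", "CLASS"]).apply(calc_dates_for_user), we get a dataframe indexed by these variables, with a column index corresponding to the intervals in the period range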
USER  CLASS  [2021-08-07, 2021-08-08)  [2021-08-08, 2021-08-09)  [2021-08-09, 2021-08-10)  [2021-08-10, 2021-08-11)
Biff  BB                     360000.0                       0.0                       0.0                       0.0
Bob   AA                       1000.0                       0.0                       0.0                  360000.0
We can clean it up like so
result = (
    df.groupby(["USER", "CLASS"])
    .apply(calc_dates_for_user)
    # melt the column index into a single column of daily intervals
    .melt(ignore_index=False, var_name="DATE", value_name="WAIT")
    .query("WAIT != 0")  # filter out days where no time was recorded
    .reset_index()  # move USER and CLASS from the index to columns
)
result then looks like this
   USER  CLASS  DATE                      WAIT
0  Biff  BB     [2021-08-07, 2021-08-08)  360000.0
1  Bob   AA     [2021-08-07, 2021-08-08)  1000.0
2  Bob   AA     [2021-08-10, 2021-08-11)  360000.0
To get the expected result, you can replace the DATE column with the timestamp corresponding to the start of each day
result["DATE"] = pd.IntervalIndex(result["DATE"]).left
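As a cross-check that does not need staircase, the same numbers can be reproduced with plain pandas by merging overlapping intervals per group. This is a different technique from the staircase answer above, sketched under two assumptions: no transaction runs past midnight, and a transaction's date is the date of its START_DATE. `union_wait_ms` and the helper columns START, END and RUN are hypothetical names introduced here.

```python
import pandas as pd

def union_wait_ms(df):
    """Per (DATE, CLASS, USER): time covered by at least one transaction, in ms."""
    keys = ["DATE", "CLASS", "USER"]
    out = df.copy()
    out["START"] = pd.to_datetime(out["START_DATE"])
    out["END"] = out["START"] + pd.to_timedelta(out["ELAPSED_TIME"])
    out["DATE"] = out["START"].dt.normalize()  # assumption: group by start day
    out = out.sort_values(keys + ["START"])

    # Latest end time seen so far within the group, excluding the current row.
    prev_end = out.groupby(keys)["END"].transform(lambda s: s.cummax().shift())
    # A row opens a new run if it is first in its group or starts after
    # every earlier transaction in the group has already finished.
    new_run = prev_end.isna() | (out["START"] > prev_end)
    out["RUN"] = new_run.cumsum()

    # Each run is one merged interval; its length is max(end) - min(start).
    runs = out.groupby(keys + ["RUN"]).agg(start=("START", "min"), end=("END", "max"))
    runs["Wait"] = (runs["end"] - runs["start"]) / pd.Timedelta("1ms")
    return runs.groupby(level=keys)["Wait"].sum().reset_index()

# The question's sample data, written compactly (TRANSACTION is not needed here).
d = {
    "START_DATE": ["2021-08-07 19:11:40"] * 3 + ["2021-08-07 19:20:40"] * 3
                  + ["2021-08-07 19:21:40"] * 3 + ["2021-08-10 19:20:40"] * 3
                  + ["2021-08-10 19:21:40"] * 3,
    "ELAPSED_TIME": ["00:00:00.465", "00:00:01.000", "00:00:00.165"]
                    + ["00:00:00.100", "00:00:00.200", "00:03:00.000",
                       "00:05:00.000", "00:00:00.200", "00:00:03.000"] * 2,
    "USER": ["Bob"] * 3 + ["Biff"] * 6 + ["Bob"] * 6,
    "CLASS": ["AA"] * 3 + ["BB"] * 6 + ["AA"] * 6,
}
check = union_wait_ms(pd.DataFrame(d))
print(check)
```

On the sample data this gives 1000 ms for Bob and 360000 ms for Biff on 2021-08-07, and 360000 ms for Bob on 2021-08-10. Whether it is faster than the staircase approach on millions of rows has not been measured here.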
