Python pandas: the best way to handle nested grouping

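I am currently trying to process some log files with Python and the pandas library. The logs contain simple information about requests sent to a server, and I want to extract information about sessions from them. A session here is defined as the set of requests made by the same user within a specific time period (for example 30 minutes, measured from the first request to the last; requests after that period should be treated as part of a new session).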

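To do this, I currently perform a nested grouping: first I use groupby to get the requests of each user, then I group each user's requests into 30-minute intervals, and finally I iterate over those intervals and pick the ones that actually contain data: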

    # example log entry:
    # id,host,time,method,url,response,bytes
    # 303372,XXX.XXX.XXX.XXX,1995-07-11 12:17:09,GET,/htbin/wais.com?IMAX,200,6923

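    import pandas as pd

    # `logs` is the DataFrame parsed from the log file; `time` must be datetime64
    by_host = logs.groupby('host', sort=False)
    for host, frame in by_host:
        # 30-minute windows anchored at this host's first request
        by_window = frame.groupby(pd.Grouper(key='time', freq='30min', origin='start'))
        for start, window in by_window:
            # skip empty windows and windows with a single request
            if len(window) > 1:
                session_calculations()  # placeholder for the per-session work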

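This is of course quite inefficient and makes the computation take a considerable amount of time. Is there a way to optimize this process? I could not come up with anything that worked.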

Edit:

                  host                time method                                           url  response  bytes
0          ***.novo.dk 1995-07-11 12:17:09    GET                                     /ksc.html       200   7067
1          ***.novo.dk 1995-07-11 12:17:48    GET               /shuttle/missions/missions.html       200   8678
2          ***.novo.dk 1995-07-11 12:23:10    GET     /shuttle/resources/orbiters/columbia.html       200   6922
3          ***.novo.dk 1995-08-09 12:48:48    GET  /shuttle/missions/sts-69/mission-sts-69.html       200  11264
4          ***.novo.dk 1995-08-09 12:49:48    GET               /shuttle/countdown/liftoff.html       200   4665

The expected result is a list of sessions extracted from the requests:

          host session_time
0  ***.novo.dk     00:06:01
1  ***.novo.dk     00:01:00

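Note that session_time here is the time difference between the first and the last request in the input, after the requests have been grouped into 30-minute time windows.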
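
As a quick check of that definition, the two expected values can be reproduced by hand from the sample rows above. This is only a sketch: it rebuilds the `host` and `time` columns of the sample and subtracts the first timestamp of each window from the last one, with the row positions picked by eye rather than computed.

```python
import pandas as pd

# Sample rows copied from the question; only the columns needed here.
logs = pd.DataFrame({
    "host": ["***.novo.dk"] * 5,
    "time": pd.to_datetime([
        "1995-07-11 12:17:09",
        "1995-07-11 12:17:48",
        "1995-07-11 12:23:10",
        "1995-08-09 12:48:48",
        "1995-08-09 12:49:48",
    ]),
})

# Rows 0-2 fall into the first 30-minute window, rows 3-4 into a later one.
first_session = logs["time"].iloc[2] - logs["time"].iloc[0]
second_session = logs["time"].iloc[4] - logs["time"].iloc[3]

print(first_session)   # 0 days 00:06:01
print(second_session)  # 0 days 00:01:00
```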

Reply from a uj5u.com user:

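To define a local time window for each user, that is, to take the time of each user's first request as the origin, you can first group by "host". Then apply a function to each user's DataFrame with GroupBy.apply; that function handles the time grouping and computes the duration of the user's sessions.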

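import pandas as pd

def session_duration_by_host(host_logs):
    # 30-minute windows anchored at this host's first request
    time_grouper = pd.Grouper(key='time', freq='30min', origin='start')
    # session length = last request time minus first request time in the window
    duration = lambda time: time.max() - time.min()
    return (
        host_logs.groupby(time_grouper)
                 .agg(session_time=('time', duration))
    )

# one row per (host, window); the window start lands in a `time` column, dropped here
res = (
    logs.groupby("host")
        .apply(session_duration_by_host)
        .reset_index()
        .drop(columns="time")
)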
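
The same windows can also be computed without calling a Python function per host. The following is a minimal sketch of that idea, not the method from the answer above: it assumes `logs` has a `host` column and a datetime `time` column, and the names `session_durations`, `window_id`, `start`, `end` and `requests` are made up for illustration. Each request gets an integer window number counted from its host's first request, and a single groupby over host and window number does the rest.

```python
import pandas as pd

def session_durations(logs: pd.DataFrame, window: str = "30min") -> pd.DataFrame:
    # Timestamp of the first request of each host, aligned to every row.
    first_seen = logs.groupby("host", sort=False)["time"].transform("min")
    # Integer number of the fixed-width window each request falls into.
    window_id = (logs["time"] - first_seen) // pd.Timedelta(window)
    sessions = (
        logs.assign(window_id=window_id)
            .groupby(["host", "window_id"], sort=False)["time"]
            .agg(start="min", end="max", requests="count")
    )
    # Keep only windows with more than one request, as in the question.
    sessions = sessions[sessions["requests"] > 1]
    return (
        sessions.assign(session_time=sessions["end"] - sessions["start"])
                .reset_index()[["host", "session_time"]]
    )

# Usage with the sample rows from the question.
sample = pd.DataFrame({
    "host": ["***.novo.dk"] * 5,
    "time": pd.to_datetime([
        "1995-07-11 12:17:09",
        "1995-07-11 12:17:48",
        "1995-07-11 12:23:10",
        "1995-08-09 12:48:48",
        "1995-08-09 12:49:48",
    ]),
})
res = session_durations(sample)
print(res)
```

Because only windows that contain requests ever become groups, this variant never materializes the empty 30-minute bins between two distant visits.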

Reply from a uj5u.com user:

You should write idiomatic pandas code: rather than processing something, saving it into a variable, and then using that variable only once, chain the steps of your process. pandas `apply` is also often faster than a plain `for` loop, although it still calls the function once per group.

logs.groupby('host', sort=False).apply(
    lambda frame: frame.groupby(
        pd.Grouper(key='time', freq='30min', origin='start')
    ).apply(lambda window: session_calculations() if len(window) > 1 else None)
)
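
The snippets in this thread leave `session_calculations` as a placeholder. Purely as an illustration, here is one hypothetical shape it could take if it received the window's DataFrame as an argument; the signature and the returned keys are assumptions, not something given in the question.

```python
import pandas as pd

def session_calculations(window: pd.DataFrame):
    """Hypothetical per-window summary; returns None for windows with < 2 requests."""
    if len(window) < 2:
        return None
    return {
        "requests": len(window),
        "session_time": window["time"].max() - window["time"].min(),
        "bytes": int(window["bytes"].sum()),
    }

# First 30-minute window of the sample data from the question.
window = pd.DataFrame({
    "time": pd.to_datetime([
        "1995-07-11 12:17:09",
        "1995-07-11 12:17:48",
        "1995-07-11 12:23:10",
    ]),
    "bytes": [7067, 8678, 6922],
})
summary = session_calculations(window)
print(summary)
```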

