我有一個用戶登錄和注銷表。
該表看起來像這樣,但有幾十萬行:
data = [['aa', '2020-05-31 00:00:01', '2020-05-31 00:00:31'],
['bb','2020-05-31 00:01:01', '2020-05-31 00:02:01'],
['aa','2020-05-31 00:02:01', '2020-05-31 00:06:03'],
['cc','2020-05-31 00:03:01', '2020-05-31 00:04:01'],
['dd','2020-05-31 00:04:01', '2020-05-31 00:34:01'],
['aa', '2020-05-31 00:05:01', '2020-05-31 00:07:31'],
['bb','2020-05-31 00:05:01', '2020-05-31 00:06:01'],
['aa','2020-05-31 00:05:01', '2020-05-31 00:08:03'],
['cc','2020-05-31 00:10:01', '2020-05-31 00:40:01'],
['dd','2020-05-31 00:20:01', '2020-05-31 00:35:01']]
df_test = pd.DataFrame(data, columns=['user_id','login', 'logout'], dtype='datetime64[ns]')
我能夠使用 for 回圈以一種 hacky 的方式解決這個問題。它在較小的資料集上運行良好,但在 300k 行上需要數小時。
基本上,這段代碼計算每個會話有多少用戶同時登錄(會話是每一行)
這是我的解決方案。它給出了我需要的結果。我也可以通過使用 apply 撰寫 lambda 來做到這一點,但這需要更長的時間。
# create a new column for simultaneous
df_test['simultaneous'] = 0
start_time = time.time()
# loop through dataframe and check condition
for i in df_test.index:
login, logout = df_test.loc[i,'login'], df_test.loc[i,'logout']
this_index = df_test.index.isin([i])
df_test.loc[i, 'simultaneous'] = int(sum(
(df_test[~this_index]['login'] <= logout) & (df_test[~this_index]['logout'] >= login)
))
print("--- %s seconds ---" % (time.time() - start_time))
請您看一下,讓我知道是否有更好的方法來獲得相同的結果。也許我錯過了一些明顯的東西。
提前致謝!
uj5u.com熱心網友回復:
該演算法采用流式方法,基于此資料按登錄時間排序的事實。對于每個會話,它會記錄尚未超過注銷時間的所有會話的計數(只需將注銷時間存盤在串列中,并在每次檢查新會話時從該串列中洗掉陳舊條目)。我決定將 sess1.logout==sess2.login 算作同時發生,但如果您不同意,可以將其更改>=為。>
演算法在calculate函式中。
#!/usr/bin/python
import datetime
import random
import time
from statistics import mean, stdev
def calculate(data):
active_sessions = []
simultaneous_sessions = []
for user_id, login, logout in data:
active_sessions = [ts for ts in active_sessions if ts >= login]
simultaneous_sessions.append(len(active_sessions))
active_sessions.append(logout)
return simultaneous_sessions
def generate_data(numsessions):
start_time = datetime.datetime(2020, 5, 13, 0, 0, 1)
data = []
while len(data) < numsessions:
for cnt in range(random.choice([0, 0, 0, 1, 1, 2, 3])):
user_id = chr(ord("a") cnt) * 2
duration = random.choice([30, 30, 60, 90, 90, 900, 1800])
logout_time = start_time datetime.timedelta(seconds=duration)
data.append(
(
user_id,
start_time.strftime("%Y-%m-%d %H:%M:%S"),
logout_time.strftime("%Y-%m-%d %H:%M:%S"),
)
)
start_time = datetime.timedelta(minutes=1)
return data
start_time = time.time()
num_sessions = 3 * 1e5 # 300,000
print(f"generating data for {num_sessions:.0f} sessions")
data = generate_data(num_sessions)
print(f"sample data=[{data[0]}]")
print("--- %.2f seconds ---" % (time.time() - start_time))
start_time = time.time()
print("calculating simultaneous sessions")
simultaneous_sessions = calculate(data)
print(
"for {} sessions have max={} min={}, mean={:.2f} stdev={:.2f}".format(
len(simultaneous_sessions),
max(simultaneous_sessions),
min(simultaneous_sessions),
mean(simultaneous_sessions),
stdev(simultaneous_sessions),
)
)
print("--- %.2f seconds ---" % (time.time() - start_time))
從性能的角度來看,我遍歷串列一次,雖然我不斷地重新創建 active_sessions 串列,但只要 active_sessions 是一個小數字,這將很快。您可以通過更有效的 active_sessions 串列進行其他優化,但這應該比為每個會話搜索所有資料要快得多。即使資料沒有按登錄時間排序,我認為按登錄時間排序然后使用此演算法比掃描每個會話的所有會話更有效。
更新:我添加了一個合成資料生成器,它基于一些隨機變數創建了一堆會話。這表明該演算法對于 300k 行將花費不到一秒的時間。
對于 300k 會話,大約需要 0.4 秒:
generating data for 300000 sessions
sample data=[('aa', '2020-05-13 00:02:01', '2020-05-13 00:03:31')]
--- 1.99 seconds ---
calculating simultaneous sessions
for 300001 sessions have max=26 min=0, mean=7.42 stdev=2.76
--- 0.35 seconds ---
300 萬次會話大約需要 4 秒:
generating data for 3000000 sessions
sample data=[('aa', '2020-05-13 00:00:01', '2020-05-13 00:01:31')]
--- 19.35 seconds ---
calculating simultaneous sessions
for 3000001 sessions have max=26 min=0, mean=7.43 stdev=2.77
--- 3.93 seconds ---
上)
uj5u.com熱心網友回復:
試試這個解決方案,data * 30_000計算結果大約需要 1900 秒(AMD 3700X/Python 3.9.7) - 但我不確定它將如何處理真實資料:
mn = df_test["login"].min()
mx = df_test["logout"].max()
tmp = pd.Series(0, index=pd.date_range(mn, mx, freq="S"), dtype=object)
def fn1(x):
tmp[x["login"] : x["logout"]] = [
v | (1 << x.name) for v in tmp[x["login"] : x["logout"]]
]
def fn2(x):
out = 0
for v in tmp[x["login"] : x["logout"]]:
out |= v
# If you use Python 3.10 you can use this answer
# https://stackoverflow.com/a/64848298/10035985
# which should be ~6x faster instead of this:
return bin(out).count("1") - 1
df_test.apply(fn1, axis=1)
df_test["sim"] = df_test.apply(fn2, axis=1)
uj5u.com熱心網友回復:
如果你重組你的資料,你可以做一次通過它。這是一個很好的應用程式pandas.melt:
# use a session id, as opposed to a user id, as a single user can log in multiple times:
df_test['sid'] = df_test.user_id "-" df_test.index.astype(str)
#df_test
# user_id login logout sid
#0 aa 2020-05-31 00:00:01 2020-05-31 00:00:31 aa-0
#1 bb 2020-05-31 00:01:01 2020-05-31 00:02:01 bb-1
#2 aa 2020-05-31 00:02:01 2020-05-31 00:06:03 aa-2
#3 cc 2020-05-31 00:03:01 2020-05-31 00:04:01 cc-3
#4 dd 2020-05-31 00:04:01 2020-05-31 00:34:01 dd-4
#5 aa 2020-05-31 00:05:01 2020-05-31 00:07:31 aa-5
#6 bb 2020-05-31 00:05:01 2020-05-31 00:06:01 bb-6
#7 aa 2020-05-31 00:05:01 2020-05-31 00:08:03 aa-7
#8 cc 2020-05-31 00:10:01 2020-05-31 00:40:01 cc-8
#9 dd 2020-05-31 00:20:01 2020-05-31 00:35:01 dd-9
# restructure the data, and sort it
df_chrono = pd.melt(df_test.set_index('sid'), value_vars=['login', 'logout'], ignore_index=False)
df_chrono = df_chrono.sort_values(by='value').reset_index()
#df_chrono:
# sid variable value
#0 aa-0 login 2020-05-31 00:00:01
#1 aa-0 logout 2020-05-31 00:00:31
#2 bb-1 login 2020-05-31 00:01:01
#3 aa-2 login 2020-05-31 00:02:01
#4 bb-1 logout 2020-05-31 00:02:01
#5 cc-3 login 2020-05-31 00:03:01
#6 dd-4 login 2020-05-31 00:04:01
#7 cc-3 logout 2020-05-31 00:04:01
#8 aa-5 login 2020-05-31 00:05:01
#9 bb-6 login 2020-05-31 00:05:01
#10 aa-7 login 2020-05-31 00:05:01
#11 bb-6 logout 2020-05-31 00:06:01
#12 aa-2 logout 2020-05-31 00:06:03
#13 aa-5 logout 2020-05-31 00:07:31
#14 aa-7 logout 2020-05-31 00:08:03
#15 cc-8 login 2020-05-31 00:10:01
#16 dd-9 login 2020-05-31 00:20:01
#17 dd-4 logout 2020-05-31 00:34:01
#18 dd-9 logout 2020-05-31 00:35:01
#19 cc-8 logout 2020-05-31 00:40:01
使用按時間順序排列的資料,我們可以通過并輕松跟蹤每次迭代的登錄者(注意:請參閱下面的更新以獲取以下回圈的更優化版本)
# keep track of the current logins in simul_tracker, allowing for a single pass through the data
simul_track = {}
results = {"sid": [], "simul":[]}
for i,row in df_chrono.iterrows():
if row.variable=='login':
for sid in simul_track:
simul_track[sid] = 1
if row.sid not in simul_track:
simul_track[row.sid] = len(simul_track) # number of current logins
else:
results['simul'].append(simul_track.pop(row.sid))
results['sid'].append (row.sid)
#results
#{'sid': ['aa-0',
# 'bb-1',
# 'cc-3',
# 'bb-6',
# 'aa-2',
# 'aa-5',
# 'aa-7',
# 'dd-4',
# 'dd-9',
# 'cc-8'],
# 'simul': [0, 1, 2, 4, 6, 4, 4, 7, 2, 2]}
您可以使用結果字典更新原始資料框(注意結果鍵'sid'對于對齊至關重要)
pd.merge(df_test, pd.DataFrame(results), on='sid')
# user_id login logout sid simul
#0 aa 2020-05-31 00:00:01 2020-05-31 00:00:31 aa-0 0
#1 bb 2020-05-31 00:01:01 2020-05-31 00:02:01 bb-1 1
#2 aa 2020-05-31 00:02:01 2020-05-31 00:06:03 aa-2 6
#3 cc 2020-05-31 00:03:01 2020-05-31 00:04:01 cc-3 2
#4 dd 2020-05-31 00:04:01 2020-05-31 00:34:01 dd-4 7
#5 aa 2020-05-31 00:05:01 2020-05-31 00:07:31 aa-5 4
#6 bb 2020-05-31 00:05:01 2020-05-31 00:06:01 bb-6 4
#7 aa 2020-05-31 00:05:01 2020-05-31 00:08:03 aa-7 4
#8 cc 2020-05-31 00:10:01 2020-05-31 00:40:01 cc-8 2
#9 dd 2020-05-31 00:20:01 2020-05-31 00:35:01 dd-9 2
更新
如果有大量用戶同時登錄,上述字典更新(for sid in simul_track: simul_track[sid] = 1)可能會成為瓶頸。為了解決這個問題,可以使用以下方案:
import numpy as np
import time
t = time.time()
results = {"sid": [], "simul":[]}
n_records = len(df_chrono)
n_active = 0 # we will track the number of active logins here
# create an array for quick incremental updates
# Each session id gets a unique element in tracker
n_session = len(df_test)
tracker = np.zeros(n_session, dtype=np.uint)
# we create a 1-to-1 mapping from session id to the tracker array
idx_from_sid = {sid:i for i,sid in zip(df_test.index, df_test.sid)}
for i,row in df_chrono.iterrows():
idx = idx_from_sid[row.sid] # position in data array corresonding to this particular session id
# print progress
if i % 100==0:
perc_done = i / n_records * 100.
print("prog=%.2f%% (rt=%.3fsec)."% (perc_done, time.time()-t), flush=True, end='\r' )
if row.variable=='login':
# We track two quantities
# The first is how many additional users log in after current sid starts
tracker = 1 # never mind that we increment all values here; on the next line we override this particular sessions value
# the second is how many active users there are when this session id starts log in
tracker[idx] = n_active
n_active = 1
else:
n_active = n_active - 1
count = tracker[idx]
results['simul'].append(count)
results['sid'].append(row.sid)
print("")
與其他答案之一類似,我在 data*30000 上對此進行了計時,以模擬 300,000 行的縮放,并且能夠simultaneous在 ~110 秒內計算出這些值。
現在,根據我的回答,您可能仍然對您的原始解決方案感興趣,并且您也可以使用該解決方案進行一些優化。特別是df_test.loc[~this_index]: 每次迭代只需執行一次。此外,df.loc[this_index]是資料框中的單行,并且(df_test[this_index]['login'] <= logout) & (df_test[this_index]['logout'] >= login)始終為 True ,因此無需進行切片:
df_test.reset_index(drop=True) # just in case
for i, row in df_test.iterrows():
df_test.loc[i, 'simultaneous'] = int(np.sum(
(df_test.login <= row.logout) & (df_test.logout >= row.login)
)) -1 # note the subtraction by one saves us from having to do df.loc[~this_index]
# alternatively, you can try to use numexpr to speed up the element wise comparisons
#in_lt_out = pd.eval('df_test.login <= row.logout', engine='numexpr')
#out_gt_in = pd.eval('df_test.logout >= row.login', engine='numexpr')
#simul = np.sum(pd.eval('in_lt_out & out_gt_in', engine='numexpr'))
#df_test.loc[i, 'simultaneous'] = int(simul-1)
請注意,我很好奇您試圖對.isin呼叫做什么,這讓我認為您對同時的定義可能是針對唯一用戶的,但是,在您的解決方案中,情況并非如此。這可能是你想要更清楚的事情。我相信我發布的解決方案,如果您想同時反映唯一的登錄名,您可以簡單地將“sid”替換為“user_id”,但我還沒有測驗過。祝你好運,有趣的問題。
轉載請註明出處,本文鏈接:https://www.uj5u.com/qiye/510106.html
