I have a DataFrame of sent and received messages, and I want to compute how long it takes someone to reply to a message.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'sent':[78,18,94,55,68,57,78,8],
'received':[18,78,35,14,57,68,57,17],
'time':['2017-01-01T12','2017-01-01T13',
'2017-01-02T12','2017-02-01T13',
'2017-01-01T14','2017-01-01T15',
'2017-01-01T16','2017-01-01T17']})
df['time'] = pd.to_datetime(pd.Series(df['time']))
My idea is to identify pairs: if there is a row with sent=A and received=B, there should be another row with sent=B and received=A.
df["pairs"] = df.apply(lambda x: not df[(df["sent"] == x["received"]) & (df["received"] == x["sent"]) & (df.index != x.name)].empty, axis=1)
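The row-wise `apply` above filters the whole frame once per row. As a rough alternative, the same check could be sketched with a set lookup. The name `seen` is introduced here for illustration, and the sketch assumes nobody sends a message to themselves, so the `df.index != x.name` guard is not needed.

```python
import pandas as pd

df = pd.DataFrame({
    'sent':     [78, 18, 94, 55, 68, 57, 78, 8],
    'received': [18, 78, 35, 14, 57, 68, 57, 17],
})

# Every (sender, receiver) combination that occurs in the data.
seen = set(zip(df['sent'], df['received']))

# A row is part of a pair when the reversed combination also occurs.
# Assumption: sent != received on every row, so a row cannot match itself.
df['pairs'] = [(r, s) in seen for s, r in zip(df['sent'], df['received'])]
```

On the sample data this flags the rows for 78/18 and 68/57 and leaves the others unflagged.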
Then, once I have identified the pairs, I can compute how long the response took:
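fmt = '%Y-%m-%d %H:%M:%S'  # format matching the timestamp strings below
sent_time = datetime.strptime('2017-01-01 12:00:00', fmt)
received_time = datetime.strptime('2017-01-01 13:00:00', fmt)
# Subtract the earlier timestamp from the later one so the difference is positive
if sent_time > received_time:
    td = sent_time - received_time
else:
    td = received_time - sent_time
# Elapsed time in whole minutes
time = int(round(td.total_seconds() / 60))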
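I feel I can do each of these steps on its own, but I can't seem to put them together.
Edit
As for the output, I think I need a separate DataFrame listing the sender and how long it took someone to reply to the email.
So in this example,
a message was sent by 78 and it took 60 minutes to get a reply. Then 68 sent a message, which also took 60 minutes to get a reply.
| sender | time_to_respond |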
|---|---|
| 78 | 60 |
| 68 | 60 |
Answer from a uj5u.com user:
#Sort row values to create unique group
df[['s','t']] = np.sort(df[['sent','received']], axis=1)
#Subset duplicated groups
s = df[df.duplicated(subset=['s','t'], keep=False)]
#Compute time difference between duplicated groups, drop duplicated rows and unwanted columns
s = (
    s.assign(
        time_to_respond=s.groupby(['s', 't'])['time']
        .transform(lambda x: x.diff().bfill().dt.total_seconds() / 60)
    )
    .drop_duplicates(subset=['s', 't'])[['sent', 'time_to_respond']]
)
# Output:
   sent  time_to_respond
0    78             60.0
4    68             60.0
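The chained expression above packs several steps into one line. The sketch below spells the same idea out step by step. It is a variation rather than the answerer's exact code: it uses named aggregation instead of `transform`, the column names `first_time` and `reply_time` are made up for illustration, and it assumes each pair exchanges exactly two messages.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sent':     [78, 18, 94, 55, 68, 57, 78, 8],
    'received': [18, 78, 35, 14, 57, 68, 57, 17],
    'time': ['2017-01-01T12', '2017-01-01T13', '2017-01-02T12', '2017-02-01T13',
             '2017-01-01T14', '2017-01-01T15', '2017-01-01T16', '2017-01-01T17'],
})
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%dT%H')

# Step 1: build an order-independent key, so A->B and B->A share one group.
df[['s', 't']] = np.sort(df[['sent', 'received']], axis=1)

# Step 2: keep only rows whose key occurs more than once, oldest first.
paired = df[df.duplicated(subset=['s', 't'], keep=False)].sort_values('time')

# Step 3: per group, the earliest row is the original message and the
# latest row is the reply (assumption: exactly two messages per pair).
out = paired.groupby(['s', 't']).agg(
    sender=('sent', 'first'),
    first_time=('time', 'first'),
    reply_time=('time', 'last'),
)
out['time_to_respond'] = (out['reply_time'] - out['first_time']).dt.total_seconds() / 60
out = out[['sender', 'time_to_respond']].reset_index(drop=True)
```

If a pair can exchange more than two messages, taking the first and last timestamps would measure the whole conversation rather than a single reply, so that case would need a different grouping.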
Answer from a uj5u.com user:
A suggestion using pandas.merge:
df = (
    df.merge(df, left_on='sent', right_on='received', how='left')
    .assign(time_to_respond=lambda x: (x['time_y'] - x['time_x']).dt.total_seconds() / 60)
)
out = (
    df.loc[df['time_to_respond'].gt(0), ['sent_x', 'time_to_respond']]
    .rename(columns={'sent_x': 'sender'})
    .reset_index(drop=True)
)
# Output:
print(out)
   sender  time_to_respond
0      78             60.0
1      68             60.0
2      57             60.0
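The output above includes a row for 57 that is not in the table the asker expected. That row comes from matching only `sent` against `received`: 57's message to 68 is matched with the later message from 78 to 57, which is not a reply. A possible variation, sketched below, merges on both columns so that only true A->B / B->A pairs match. The suffixes `_msg` and `_reply` are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'sent':     [78, 18, 94, 55, 68, 57, 78, 8],
    'received': [18, 78, 35, 14, 57, 68, 57, 17],
    'time': ['2017-01-01T12', '2017-01-01T13', '2017-01-02T12', '2017-02-01T13',
             '2017-01-01T14', '2017-01-01T15', '2017-01-01T16', '2017-01-01T17'],
})
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%dT%H')

# Match each message A->B with messages B->A by swapping the key columns.
pairs = df.merge(
    df,
    left_on=['sent', 'received'],
    right_on=['received', 'sent'],
    suffixes=('_msg', '_reply'),
)

# Minutes between the message and the matched message in the other direction.
pairs['time_to_respond'] = (
    (pairs['time_reply'] - pairs['time_msg']).dt.total_seconds() / 60
)

# A positive difference means the matched row came later, so it is the reply.
out = (
    pairs.loc[pairs['time_to_respond'] > 0, ['sent_msg', 'time_to_respond']]
    .rename(columns={'sent_msg': 'sender'})
    .reset_index(drop=True)
)
```

On the sample data this keeps only senders 78 and 68, each with a 60-minute response time.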
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/512103.html
