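I'm working on a project that analyzes student click behavior in an online course. We treat the click path as sequential data, which looks like this: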
user_id timestamp duration_sec Page
545301 8/25/2020 14:49 5 home
545301 8/25/2020 15:00 10 instructor
545301 9/2/2020 13:33 5 home
545301 9/8/2020 12:46 3 home
545301 9/9/2020 11:10 3 home
545301 9/9/2020 13:24 8 general
545301 9/9/2020 14:33 12 zoom
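What I want to do is add rows as separators between sub-series, to indicate that the student took a break between two series of actions. The expected data should look like this: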
user_id timestamp duration_sec Page
545301 8/25/2020 14:49 5 home
545301 8/25/2020 15:00 10 instructor
545301 8/25/2020 15:10 99999 break
545301 9/2/2020 13:33 5 home
545301 9/2/2020 13:38 99999 break
545301 9/8/2020 12:46 3 home
545301 9/8/2020 12:49 99999 break
545301 9/9/2020 11:10 3 home
545301 9/9/2020 13:24 8 general
545301 9/9/2020 14:33 12 zoom
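I would be very grateful if anyone could give me some hints.
Reply from a uj5u.com user:
Here is my answer (to the first version of the question, since some of the data has changed since then):
First, I built something similar to your dataframe: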
import pandas as pd
from io import StringIO
import re

# Turn the whitespace-separated text into CSV: protect the newlines as '..',
# replace every run of whitespace with a comma, then restore the newlines
f = re.sub(r'\s+', ',', re.sub('\n', '..', """user_id timestamp duration_min Page
545301 8/25/2020_14:49 8.600000 home
545301 8/25/2020_15:00 10.100000 instructor
545301 9/2/2020_13:33 49.700000 home
545301 9/8/2020_12:46 223.783333 home
545301 9/9/2020_11:10 7.633333 home
545301 9/9/2020_13:24 69.300000 general
545301 9/9/2020_14:33 2651.133333 zoom
""")).replace('..', '\n')
f = StringIO(f)
df = pd.read_csv(f)
df['timestamp'] = df['timestamp'].str.replace('_', ' ', regex=False)
# Drop the empty column that trailing whitespace would create, if it exists
df.drop(columns=['Unnamed: 4'], errors='ignore', inplace=True)
df.timestamp = pd.to_datetime(df.timestamp)
print(df)
Then we need to find the indexes after which a row has to be inserted:
List_of_home_indexes = []
for i in range(len(df.index)):
    # Record the position of every row whose Page is 'home'
    if df.Page.iloc[i] == 'home':
        List_of_home_indexes.append(i)
print(List_of_home_indexes)
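As a side note, the same positions can probably be found without an explicit loop. This is only a minimal sketch on a toy frame; the names `pages` and `home_positions` are made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same Page values as the example data (hypothetical name)
pages = pd.DataFrame(
    {"Page": ["home", "instructor", "home", "home", "home", "general", "zoom"]}
)

# Boolean mask of the 'home' rows, converted to their positional indexes
home_positions = np.flatnonzero(pages["Page"].eq("home").to_numpy()).tolist()
print(home_positions)  # [0, 2, 3, 4]
```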
Then we define the row and insert it, looping over the indexes we found:
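from datetime import timedelta

# Walk the indexes from last to first, so that an insertion never shifts
# the positions of the rows that still have to be processed
for i in reversed(List_of_home_indexes):
    line = pd.DataFrame({"user_id": 545301, "timestamp": df.timestamp.iloc[i] + timedelta(seconds=1), 'duration_min': 99999, 'Page': 'break'}, index=[i + 1])
    df = pd.concat([df.iloc[:i + 1], line, df.iloc[i + 1:]]).reset_index(drop=True)
print(df)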
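Then you will get the result you want.
Reply from a uj5u.com user:
For this solution I assume that user_id is not the index. If it is, just reset the index before you start.
First we define the "idle_time" between events as the difference between consecutive timestamps, taking duration_sec into account (which first has to be converted from a number to a timedelta):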
df['idle_time'] = df.timestamp.diff().shift(-1) - pd.to_timedelta(df.duration_sec, unit='s')
user_id timestamp duration_sec Page idle_time
0 545301 2020-08-25 14:49:00 5 home 0 days 00:10:55
1 545301 2020-08-25 15:00:00 10 instructor 7 days 22:32:50
2 545301 2020-09-02 13:33:00 5 home 5 days 23:12:55
3 545301 2020-09-08 12:46:00 3 home 0 days 22:23:57
4 545301 2020-09-09 11:10:00 3 home 0 days 02:13:57
5 545301 2020-09-09 13:24:00 8 general 0 days 01:08:52
6 545301 2020-09-09 14:33:00 12 zoom NaT
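The data shown here has a single user. If the real frame holds several users (an assumption on my part), the difference should probably be taken within each user, otherwise the last event of one student is compared with the first event of the next. A minimal sketch, with the hypothetical names `clicks` and `next_ts` and made-up toy values:

```python
import pandas as pd

# Toy data for two users (values invented for illustration)
clicks = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2020-08-25 14:49", "2020-08-25 15:00",
        "2020-08-25 14:50", "2020-08-26 09:00",
    ]),
    "duration_sec": [5, 10, 3, 8],
}).sort_values(["user_id", "timestamp"])

# Timestamp of the same user's next event; NaT for each user's last event
next_ts = clicks.groupby("user_id")["timestamp"].shift(-1)

# Gap until the next event, minus the time spent on the current page
clicks["idle_time"] = (
    next_ts - clicks["timestamp"] - pd.to_timedelta(clicks["duration_sec"], unit="s")
)
print(clicks)
```

With this, the last row of every user gets NaT instead of a gap that crosses into another student's data.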
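Then we grab the rows that come right before the student takes a break, which I define here as an idle_time of more than 6 hours (but you can change that to whatever you want):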
pre_breaks = df[df.idle_time > pd.to_timedelta(6, unit='h')]
user_id timestamp duration_sec Page idle_time
1 545301 2020-08-25 15:00:00 10 instructor 7 days 22:32:50
2 545301 2020-09-02 13:33:00 5 home 5 days 23:12:55
3 545301 2020-09-08 12:46:00 3 home 0 days 22:23:57
Then we turn these rows into break rows, like this:
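pre_breaks = pre_breaks.copy()  # work on a copy to avoid a SettingWithCopyWarning
pre_breaks['timestamp'] = pre_breaks.timestamp + pd.to_timedelta(pre_breaks.duration_sec, unit='s')
pre_breaks['Page'] = 'break'
# Note: .seconds leaves out whole days; x.total_seconds() would give the full gap
pre_breaks['duration_sec'] = pre_breaks.idle_time.apply(lambda x: x.seconds)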
user_id timestamp duration_sec Page idle_time
1 545301 2020-08-25 15:00:10 81170 break NaN
2 545301 2020-09-02 13:33:05 83575 break NaN
3 545301 2020-09-08 12:46:03 80637 break NaN
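One thing worth checking in the duration_sec values above: the `.seconds` attribute of a timedelta only covers the hours, minutes and seconds part and leaves out whole days. A small sketch of the difference, using the first idle_time from the table (the variable name `idle` is just for illustration):

```python
import pandas as pd

idle = pd.Timedelta("7 days 22:32:50")

# Only the sub-day part: 22 h 32 min 50 s
print(idle.seconds)          # 81170

# The whole gap, including the 7 days
print(idle.total_seconds())  # 685970.0
```

If the break row is meant to carry the full length of the gap, `total_seconds()` is likely the better fit. If it is only a marker, a fixed placeholder such as the 99999 from the question works too.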
Then we insert them at an index that places each one right after the event that precedes the break:
for i in pre_breaks.index:
    # A fractional label (e.g. 1.5) sorts between rows 1 and 2 later on
    df.loc[i + 0.5] = pre_breaks.loc[i]
user_id timestamp duration_sec Page idle_time
0.0 545301 2020-08-25 14:49:00 5 home 0 days 00:10:55
1.0 545301 2020-08-25 15:00:00 10 instructor 7 days 22:32:50
2.0 545301 2020-09-02 13:33:00 5 home 5 days 23:12:55
3.0 545301 2020-09-08 12:46:00 3 home 0 days 22:23:57
4.0 545301 2020-09-09 11:10:00 3 home 0 days 02:13:57
5.0 545301 2020-09-09 13:24:00 8 general 0 days 01:08:52
6.0 545301 2020-09-09 14:33:00 12 zoom NaT
1.5 545301 2020-08-25 15:00:10 81170 break NaT
2.5 545301 2020-09-02 13:33:05 83575 break NaT
3.5 545301 2020-09-08 12:46:03 80637 break NaT
Finally, we sort the index and reset it. We also drop the idle_time column (optional):
df = df.sort_index().reset_index(drop=True).drop(columns='idle_time')
Final result:
user_id timestamp duration_sec Page
0 545301 2020-08-25 14:49:00 5 home
1 545301 2020-08-25 15:00:00 10 instructor
2 545301 2020-08-25 15:00:10 81170 break
3 545301 2020-09-02 13:33:00 5 home
4 545301 2020-09-02 13:33:05 83575 break
5 545301 2020-09-08 12:46:00 3 home
6 545301 2020-09-08 12:46:03 80637 break
7 545301 2020-09-09 11:10:00 3 home
8 545301 2020-09-09 13:24:00 8 general
9 545301 2020-09-09 14:33:00 12 zoom
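For reuse, the steps above could be wrapped in one function. The sketch below is an assumption-laden variant, not the answer's exact code: the name `insert_breaks` and its parameters are hypothetical, it uses `pd.concat` plus a stable sort instead of fractional index labels, and it writes the question's 99999 placeholder as the break duration instead of the idle seconds.

```python
import pandas as pd

def insert_breaks(df, threshold="6h", break_duration=99999):
    """Return a copy of df with a 'break' row after each event that is
    followed by an idle gap longer than `threshold` (hypothetical helper)."""
    df = df.reset_index(drop=True)
    duration = pd.to_timedelta(df["duration_sec"], unit="s")
    idle_time = df["timestamp"].shift(-1) - df["timestamp"] - duration

    # Rows followed by a long gap serve as templates for the break rows
    breaks = df[idle_time > pd.Timedelta(threshold)].copy()
    breaks["timestamp"] = breaks["timestamp"] + duration[breaks.index]
    breaks["duration_sec"] = break_duration
    breaks["Page"] = "break"

    # A break row keeps the index label of the row it follows, so a stable
    # sort on the index places it directly behind that row
    combined = pd.concat([df, breaks]).sort_index(kind="mergesort")
    return combined.reset_index(drop=True)

clicks = pd.DataFrame({
    "user_id": [545301] * 7,
    "timestamp": pd.to_datetime([
        "2020-08-25 14:49", "2020-08-25 15:00", "2020-09-02 13:33",
        "2020-09-08 12:46", "2020-09-09 11:10", "2020-09-09 13:24",
        "2020-09-09 14:33",
    ]),
    "duration_sec": [5, 10, 5, 3, 3, 8, 12],
    "Page": ["home", "instructor", "home", "home", "home", "general", "zoom"],
})
result = insert_breaks(clicks)
print(result)
```

On the sample data this puts the break rows in the same positions as the result above. Only the duration_sec of the break rows differs, because of the placeholder.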
Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/340443.html
