I have the DataFrame shown below:
df1 = pd.DataFrame({
    "timestamp": [
        pd.Timestamp(2016, 7, 29), pd.Timestamp(2017, 8, 22), pd.Timestamp(2017, 10, 9),
        pd.Timestamp(2018, 1, 9), pd.Timestamp(2018, 3, 31), pd.Timestamp(2018, 7, 5),
        pd.Timestamp(2018, 8, 5), pd.Timestamp(2018, 9, 5), pd.Timestamp(2018, 11, 6),
        pd.Timestamp(2018, 12, 6), pd.Timestamp(2018, 12, 8),
    ],
    "userId": [1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4],
    "movieId": [111065, 35455, 132531, 132531, 2863, 132531, 4493, 133813, 8888, 133813, 133813],
    "rating": [3, 4, 5, 2, 4, 3, 2, 2, 3, 1, 3],
})

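I want to first group by the "userId" column and then, within each group, drop the rows where "movieId" repeats consecutively. To illustrate, this is what the final DataFrame should look like: (the rows marked in red should be filtered out)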

I tried groupby and filter with a custom function in a lambda, but that did not keep all of the columns. Please help!
Reply from a uj5u.com user:
Try the code below:
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": [
        pd.Timestamp(2016, 7, 29), pd.Timestamp(2017, 8, 22), pd.Timestamp(2017, 10, 9),
        pd.Timestamp(2018, 1, 9), pd.Timestamp(2018, 3, 31), pd.Timestamp(2018, 7, 5),
        pd.Timestamp(2018, 8, 5), pd.Timestamp(2018, 9, 5), pd.Timestamp(2018, 11, 6),
        pd.Timestamp(2018, 12, 6), pd.Timestamp(2018, 12, 8),
    ],
    "userId": [1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4],
    "movieId": [111065, 35455, 132531, 132531, 2863, 132531, 4493, 133813, 8888, 133813, 133813],
    "rating": [3, 4, 5, 2, 4, 3, 2, 2, 3, 1, 3],
})

# Compare each movieId with the previous row of the same user, so that a
# repeat is never detected across the boundary between two users.
prev_movie = df1.groupby("userId")["movieId"].shift()
# Keep only the rows whose movieId differs from the previous one.
df1 = df1[df1["movieId"].ne(prev_movie)]
print(df1)
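Let me know if this helps.
Reply from a uj5u.com user:
Try .drop_duplicates. Note that it drops every repeated userId/movieId pair, including repeats that are not consecutive: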
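Whether the shift is done per user matters as soon as one user's last movieId equals the next user's first. Here is a minimal sketch of that case on made-up data; the frame `demo` and the helper `drop_consecutive_repeats` are hypothetical names used for illustration, not part of the original answer.

```python
import pandas as pd

def drop_consecutive_repeats(df, group_col="userId", value_col="movieId"):
    """Keep a row unless its value equals the previous row's value in the same group."""
    prev = df.groupby(group_col)[value_col].shift()
    return df[df[value_col].ne(prev)]

# Made-up data: user 1 ends on movie 10 and user 2 starts on movie 10.
demo = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [10, 10, 10, 20],
    "rating": [3, 4, 5, 2],
})

# A plain shift compares row 2 (user 2) with row 1 (user 1) and drops it.
ungrouped = demo[demo["movieId"].ne(demo["movieId"].shift())]
# A per-user shift keeps row 2, because it is user 2's first row.
grouped = drop_consecutive_repeats(demo)

print(ungrouped.index.tolist())  # [0, 3]
print(grouped.index.tolist())    # [0, 2, 3]
```

Boolean indexing keeps every column of the frame, which is what the groupby-plus-lambda attempt in the question was missing.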
df1 = df1.drop_duplicates(subset=["userId", "movieId"], keep="first")
print(df1)
Output:
   timestamp  userId  movieId  rating
0 2016-07-29       1   111065       3
1 2017-08-22       2    35455       4
2 2017-10-09       2   132531       5
4 2018-03-31       2     2863       4
6 2018-08-05       3     4493       2
7 2018-09-05       4   133813       2
8 2018-11-06       4     8888       3
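For comparison, here is a small sketch on the question's userId and movieId columns. It shows which row labels survive when every repeated pair is dropped, and which survive when only back-to-back repeats within a user are dropped. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "userId":  [1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4],
    "movieId": [111065, 35455, 132531, 132531, 2863, 132531,
                4493, 133813, 8888, 133813, 133813],
})

# Drops every later occurrence of a (userId, movieId) pair.
all_dupes_removed = df.drop_duplicates(subset=["userId", "movieId"], keep="first")

# Drops a row only when the same user's previous row has the same movieId.
prev_movie = df.groupby("userId")["movieId"].shift()
consecutive_removed = df[df["movieId"].ne(prev_movie)]

print(all_dupes_removed.index.tolist())    # [0, 1, 2, 4, 6, 7, 8]
print(consecutive_removed.index.tolist())  # [0, 1, 2, 4, 5, 6, 7, 8, 9]
```

Rows 5 and 9 repeat a movie the user rated earlier, but not in the immediately preceding row, so only the second approach keeps them.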
