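In this sample from my dataset, I need to delete duplicate rows that have the same values in every column except "parent_station". The duplicate to delete must be the one with NaN in the "parent_station" column, and the row whose "parent_station" value is not NaN must be kept. In this example, the row to delete is the fourth row, with index 7789. How can I do this? I haven't figured it out yet.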
stop_id stop_name parent_station trip_id arrival_time departure_time stop_sequence route_id trip_headsign
7022 87413385 Gare de Yvetot StopArea:OCE87413385 OCESN003100F140147152 05:49:00 05:50:00 2.0 OCE1506035 3100.0
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017 OCESN003100F140147152 06:12:00 06:15:00 3.0 OCE1506035 3100.0
8040 87413013 Gare de Le Havre StopArea:OCE87413013 OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7789 87413013 Gare de Le Havre NaN OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7197 87413344 Gare de Bréauté-Beuzeville NaN OCESN003100F140147152 05:35:00 05:36:00 1.0 OCE1506035 3100.0
Answer from a uj5u.com user:
You can use a boolean mask:
out = df[~df.drop('parent_station', axis=1).duplicated(keep=False) | pd.notna(df['parent_station'])]
Output:
stop_id stop_name parent_station \
index
7022 87413385 Gare de Yvetot StopArea:OCE87413385
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017
8040 87413013 Gare de Le Havre StopArea:OCE87413013
7197 87413344 Gare de Bréauté-Beuzeville NaN
trip_id arrival_time departure_time stop_sequence \
index
7022 OCESN003100F140147152 05:49:00 05:50:00 2.0
3518 OCESN003100F140147152 06:12:00 06:15:00 3.0
8040 OCESN003100F140147152 05:20:00 05:20:00 0.0
7197 OCESN003100F140147152 05:35:00 05:36:00 1.0
route_id trip_headsign
index
7022 OCE1506035 3100.0
3518 OCE1506035 3100.0
8040 OCE1506035 3100.0
7197 OCE1506035 3100.0
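For anyone who wants to run the mask step by step, here is a minimal sketch. The DataFrame is a hand-built reconstruction of the question's sample with only a few of its columns, and the names `others`, `is_dup` and `has_parent` are mine, not from the answer.

```python
import numpy as np
import pandas as pd

# Hand-built subset of the question's sample data (assumption: fewer columns).
df = pd.DataFrame(
    {
        "stop_id": ["87413385", "87411017", "87413013", "87413013", "87413344"],
        "stop_name": [
            "Gare de Yvetot",
            "Gare de Rouen-Rive-Droite",
            "Gare de Le Havre",
            "Gare de Le Havre",
            "Gare de Bréauté-Beuzeville",
        ],
        "parent_station": [
            "StopArea:OCE87413385",
            "StopArea:OCE87411017",
            "StopArea:OCE87413013",
            np.nan,
            np.nan,
        ],
        "stop_sequence": [2.0, 3.0, 0.0, 0.0, 1.0],
    },
    index=[7022, 3518, 8040, 7789, 7197],
)

# Compare rows on every column except parent_station.
others = df.drop(columns="parent_station")

# keep=False marks every member of a duplicate group, not only the later ones.
is_dup = others.duplicated(keep=False)

# True where parent_station holds a real value.
has_parent = df["parent_station"].notna()

# Keep a row if it is unique, or if it carries a parent_station.
out = df[~is_dup | has_parent]
print(out)
```

Row 7197 survives even though its parent_station is NaN, because nothing else duplicates it. One thing to watch: if every row in a duplicate group has NaN in parent_station, this mask drops the whole group.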
Answer from a uj5u.com user:
Use duplicated:
cols = df.columns[df.columns != 'parent_station']
out = df[~(df.duplicated(cols, keep=False) & df['parent_station'].isna())]
print(out)
# Output
stop_id stop_name parent_station trip_id arrival_time departure_time stop_sequence route_id trip_headsign
7022 87413385 Gare de Yvetot StopArea:OCE87413385 OCESN003100F140147152 05:49:00 05:50:00 2.0 OCE1506035 3100.0
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017 OCESN003100F140147152 06:12:00 06:15:00 3.0 OCE1506035 3100.0
8040 87413013 Gare de Le Havre StopArea:OCE87413013 OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7197 87413344 Gare de Bréauté-Beuzeville NaN OCESN003100F140147152 05:35:00 05:36:00 1.0 OCE1506035 3100.0
Answer from a uj5u.com user:
A very simple solution! I hope this is what you are looking for:
import pandas as pd
import numpy as np
df = pd.DataFrame({'stop_id': ['87413385', '87411017', '87413013', '87413013', '87413344'],
'stop_name': ['Gare de Yvetot', 'Gare de Rouen-Rive-Droite', 'Gare de Le Havre', 'Gare de Le Havre', 'Gare de Bréauté-Beuzeville'],
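'parent_station': ['StopArea:OCE87413385', 'StopArea:OCE87411017', 'StopArea:OCE87413013', np.nan, np.nan]})
# keep=False flags every member of a duplicate group, so the result does not depend on row order
is_duplicated = df.duplicated(subset=['stop_id', 'stop_name'], keep=False)
is_nan = df['parent_station'].isna()
# drop only the rows that are both duplicated and missing a parent_station
print(df[~(is_duplicated & is_nan)])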
# stop_id stop_name parent_station
# 0 87413385 Gare de Yvetot StopArea:OCE87413385
# 1 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017
# 2 87413013 Gare de Le Havre StopArea:OCE87413013
# 4 87413344 Gare de Bréauté-Beuzeville NaN
Answer from a uj5u.com user:
Another option, although it is slightly longer than the answers above.
ss = df.columns.drop('parent_station')
keep_rows = df[(df.duplicated(subset=ss, keep=False)) & (~df.parent_station.isna())]
non_duplicates = df[~(df.duplicated(subset=ss, keep=False))]
df = pd.concat([non_duplicates, keep_rows])
Here I explicitly separate the rows that have no duplicates at all (non_duplicates) from the duplicated rows that should be kept (keep_rows).
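Note that pd.concat appends keep_rows after non_duplicates, so the rows no longer come out in their original order. If the order matters, one possible fix is to select from the original frame by index. The sketch below uses a hand-built three-column version of the sample and assumes the index labels are unique; `combined` and `restored` are names I made up.

```python
import numpy as np
import pandas as pd

# Hand-built subset of the question's sample data (assumption: three columns).
df = pd.DataFrame(
    {
        "stop_id": ["87413385", "87411017", "87413013", "87413013", "87413344"],
        "stop_name": [
            "Gare de Yvetot",
            "Gare de Rouen-Rive-Droite",
            "Gare de Le Havre",
            "Gare de Le Havre",
            "Gare de Bréauté-Beuzeville",
        ],
        "parent_station": [
            "StopArea:OCE87413385",
            "StopArea:OCE87411017",
            "StopArea:OCE87413013",
            np.nan,
            np.nan,
        ],
    },
    index=[7022, 3518, 8040, 7789, 7197],
)

# Columns used to decide whether two rows are duplicates.
ss = df.columns.drop("parent_station").tolist()
dup = df.duplicated(subset=ss, keep=False)

keep_rows = df[dup & df["parent_station"].notna()]  # duplicated, but has a parent
non_duplicates = df[~dup]                           # no duplicate anywhere

# concat puts the kept duplicates at the end of the frame.
combined = pd.concat([non_duplicates, keep_rows])

# Select from the original frame to get the original row order back
# (assumes index labels are unique).
restored = df.loc[df.index.isin(combined.index)]
print(restored)
```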
Answer from a uj5u.com user:
First sort_values, then drop_duplicates. When sorting, you can choose whether NaN values are placed first or last with na_position; the default is 'last'. drop_duplicates has a similar parameter, keep.
References: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
output_df = df.sort_values(['parent_station'], na_position='last').drop_duplicates(['stop_id', 'stop_name'], keep='first')
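The line above only compares stop_id and stop_name, and it returns the rows in sorted order. A variant that compares every column except parent_station and keeps the original row order might look like the sketch below. The toy data and the names `key_cols`, `deduped` and `out` are made up for illustration, and the last step assumes unique index labels.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): group A has one real parent and one NaN,
# group B has only NaN parents, group C is a single row.
df = pd.DataFrame(
    {
        "stop_id": ["A", "A", "B", "B", "C"],
        "stop_name": ["Alpha", "Alpha", "Beta", "Beta", "Gamma"],
        "parent_station": [np.nan, "StopArea:A", np.nan, np.nan, np.nan],
    },
    index=[10, 11, 12, 13, 14],
)

# Every column except parent_station defines a duplicate group.
key_cols = df.columns.drop("parent_station").tolist()

# Sort so that real parent_station values come before NaN, then keep the
# first row of each group: a non-NaN row wins whenever one exists.
deduped = (
    df.sort_values("parent_station", na_position="last", kind="stable")
    .drop_duplicates(subset=key_cols, keep="first")
)

# Select from the original frame to restore the original row order.
out = df.loc[df.index.isin(deduped.index)]
print(out)
```

Unlike the mask-based answers, this keeps one row from a group where every parent_station is NaN (group B here) instead of dropping the whole group. Whether that is what you want depends on your data.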
