For each row, I want to compute a list of the IDs of the columns that have missing values.
Current dataframe:
import numpy as np
import pandas as pd
data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL','DENVER', np.nan],
'STORE_ID': [np.nan,np.nan,'0001','0001',np.nan],
'FORECAST_DATE':['2021-08-15','2021-08-15','2021-08-15','2021-08-15','2021-08-15'],
'START_TIME':[np.nan,np.nan,9.0,np.nan,np.nan],
'FULFILLMENT_TYPE':['DUG','DELIVERY','DUG','DUG','DUG'],
'MULTIPLIER':[1.04,1.10,1.04,1.10,0.90],
'OVERWRITE':[np.nan,0,np.nan,1,2]
}
df = pd.DataFrame(data)
Expected output:
data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL','DENVER', np.nan],
'STORE_ID': [np.nan,np.nan,'0001','0001',np.nan],
'FORECAST_DATE':['2021-08-15','2021-08-15','2021-08-15','2021-08-15','2021-08-15'],
'START_TIME':[np.nan,np.nan,9.0,np.nan,np.nan],
'FULFILLMENT_TYPE':['DUG','DELIVERY','DUG','DUG','DUG'],
'MULTIPLIER':[1.04,1.10,1.04,1.10,0.90],
'OVERWRITE':[np.nan,0,np.nan,1,2],
'MISSING_ID':[[1,3],[1,3],np.nan,[3],[0,1,3]]
}
expected = pd.DataFrame(data)
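Here "MISSING_ID" is based on missing values in the columns DIVISION, STORE_ID, FORECAST_DATE, START_TIME and FULFILLMENT_TYPE, given as 0-based positions within that list of columns.
I can get close to this by writing several np.where conditions and then collecting the results into a list: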
multiplier_df['DIV_MISSING'] = np.where(multiplier_df.DIVISION.isna(),1,0)
multiplier_df['STORE_MISSING'] = np.where(multiplier_df.STORE_ID.isna(),2,0)
multiplier_df['DATE_MISSING'] = np.where(multiplier_df.FORECAST_DATE.isna(),3,0)
multiplier_df['HOUR_MISSING'] = np.where(multiplier_df.START_TIME.isna(),4,0)
multiplier_df['FULFILLMENT_MISSING'] = np.where(multiplier_df.FULFILLMENT_TYPE.isna(),5,0)
multiplier_df['MISSING_ID'] = multiplier_df[['DIV_MISSING', 'STORE_MISSING','DATE_MISSING','HOUR_MISSING','FULFILLMENT_MISSING']].apply(list, axis=1)
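However, this leaves unwanted 0s in each list, so I am looking for a simpler way to do this. I found a similar solution for R here: "Return a list of column names with missing (NA) data for each row of a data frame in R", but I need to do it in pandas.
Answer: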
# 1) Select only the columns that should be checked for NaNs
df_part = df[['DIVISION', 'STORE_ID', 'FORECAST_DATE', 'START_TIME', 'FULFILLMENT_TYPE']]
# 2) pd.isna(df_part).values is a boolean np.array marking where the NaNs are;
# np.where(row)[0] then gives, for each row, the positions of the columns with NaNs
# (each entry is a NumPy array of 0-based positions within df_part)
df['MISSING_ID'] = [np.where(row)[0] for row in pd.isna(df_part).values]
print(df)
  DIVISION STORE_ID FORECAST_DATE  ...  MULTIPLIER  OVERWRITE MISSING_ID
0   DENVER      NaN    2021-08-15  ...        1.04        NaN     [1, 3]
1    JEWEL      NaN    2021-08-15  ...        1.10        0.0     [1, 3]
2    JEWEL     0001    2021-08-15  ...        1.04        NaN         []
3   DENVER     0001    2021-08-15  ...        1.10        1.0        [3]
4      NaN      NaN    2021-08-15  ...        0.90        2.0  [0, 1, 3]
This gives an empty [] instead of np.nan for rows with nothing missing, but that is easy to fix:
df['MISSING_ID'] = df['MISSING_ID'].apply(lambda x: x if len(x) else np.nan)
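For reference, the two steps above can be combined into one self-contained sketch. The helper name missing_positions and the variable key_cols are hypothetical, introduced only for illustration. It uses np.flatnonzero in place of np.where(row)[0] (the two are equivalent for a 1-D boolean row) and converts each result to a plain Python list. The last lines also sketch the column-name variant that the R question asks about.

```python
import numpy as np
import pandas as pd

# Hypothetical helper (not part of the original answer): for each row, return
# the 0-based positions, within `cols`, of the columns holding missing values.
def missing_positions(frame, cols):
    mask = frame[cols].isna().to_numpy()  # boolean array, one row per record
    # np.flatnonzero(row) gives the indices of the True entries in a 1-D row
    lists = [np.flatnonzero(row).tolist() for row in mask]
    # Use NaN rather than an empty list when a row has nothing missing
    return pd.Series([ids if ids else np.nan for ids in lists], index=frame.index)

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15'] * 5,
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2]}
df = pd.DataFrame(data)

# Assumed name for the list of columns to check
key_cols = ['DIVISION', 'STORE_ID', 'FORECAST_DATE', 'START_TIME', 'FULFILLMENT_TYPE']
df['MISSING_ID'] = missing_positions(df, key_cols)
print(df['MISSING_ID'].tolist())  # [[1, 3], [1, 3], nan, [3], [0, 1, 3]]

# Variant: column names instead of positions (similar to the R solution)
missing_names = [[c for c, is_na in zip(key_cols, row) if is_na]
                 for row in df[key_cols].isna().to_numpy()]
```

Storing lists in a column makes it object dtype, so this is convenient for inspection but not for further vectorized operations.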
