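I have a dataframe like this (please refer to the dataframe shown in the image).
It has four columns ('status', 'preferred_time', 'history', 'id'), and I need to check that every column has a value. The history column sometimes holds a nested list, so I specifically need to check that every entry in the nested list has a value for all of the mandatory keys 'branch', 'rank', 'discharge_status', 'service_start', 'job_code', 'post_intention'. I want to add a column named "output" to the dataframe: if all columns have values, it should be set to "completed"; if any column is blank or NaN, or the history column is [{}] or is missing any key-value pair, it should be set to "pending".
From the image, only the first row should be completed; the rest should be pending.
If there are other cases I should handle here, please help me structure this better. Thanks in advance.
Dictionary for the dataframe in the image above: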
{'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'},
'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']",
1: [],
2: "['Morning'] ",
3: nan,
4: "['Morning', 'Midday'] "},
'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]",
1: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]",
2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]",
3: nan,
4: '[{}]'},
'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}
I tried the lines of code below, but I don't know how to check all four columns in a single if statement:
for i in df.index:
    status = df['status'][i]
    preferred_time = df['preferred_time'][i]
    id = df['id'][i]
    history = df['history'][i]
    if status and preferred_time and id and status!='' and preferred_time!= '' and id!='':
        enroll_status = "completed"
    else:
        enroll_status = "pending"
    if history!= '' or str(history)!= '[{}]':
        for item in history:
            if 'branch' in item.keys() and 'rank' in item.keys() and 'discharge_status' in item.keys() and 'service_start' in item.keys() and 'job_code' in item.keys() and 'post_intention' in item.keys():
                enroll_status = "completed"
            else:
                enroll_status = "pending"
Answer from a uj5u.com user:
Consider the following:
import numpy as np
import pandas as pd
from numpy import nan
def check_list(L):
    # True only if L is a list and every dict in it contains every required key
    if not isinstance(L,list):
        return False
    return all(k in d for k in keys_req for d in L)
labels = np.array(["pending","completed"])
keys_req = ['branch','rank','discharge_status','service_start','job_code','post_intention']
d = {'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'}, 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']", 1: nan, 2: "['Morning'] ", 3: nan, 4: "['Morning', 'Midday'] "}, 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]", 1: nan, 2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]", 3: nan, 4: '[{}]'}, 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}
df = pd.DataFrame(d)
df['history_list'] = df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df['mandatory_keys'] = df['history_list'].apply(check_list)
df['no_nans'] = ~pd.isna(df).any(axis = 1)
df['output_tf'] = df['mandatory_keys'] & df['no_nans']
df['output'] = labels[df['output_tf'].to_numpy(dtype=int)]
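Note that in my copied version of the dictionary d I corrected some typos from your dataframe (for example, 'rank:'E7' was replaced with 'rank':'E7'). The intermediate columns added along the way (history_list, mandatory_keys, no_nans, output_tf) are there to make the procedure easier to follow; if you want to use as little space as possible, there is no need to add them to the dataframe. The script above produces the following dataframe df: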
  status                      preferred_time  \
0     No  ['Morning', 'Midday', 'Afternoon']
1     No                                 NaN
2    NaN                         ['Morning']
3     No                                 NaN
4     No               ['Morning', 'Midday']

                                             history  id  \
0  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...   1
1                                                NaN   5
2  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...   2
3                                                NaN   3
4                                               [{}]   4

                                        history_list  mandatory_keys  no_nans  \
0  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...            True     True
1                                                NaN           False    False
2  [{'branch': 'A', 'rank': 'E7', 'discharge_stat...           False    False
3                                                NaN           False    False
4                                               [{}]           False     True

   output_tf     output
0       True  completed
1      False    pending
2      False    pending
3      False    pending
4      False    pending
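The script above parses the history strings with eval, which will run whatever Python expression a cell happens to contain. As a minimal sketch of the same parsing step, assuming every string cell is meant to be a plain Python literal, ast.literal_eval from the standard library can be used instead. The helper name parse_cell and the small sample Series are hypothetical and only for illustration.

```python
import ast

import numpy as np
import pandas as pd


def parse_cell(x):
    """Parse a string cell as a Python literal (hypothetical helper).

    Non-string cells (for example NaN) are passed through unchanged.
    Strings that are not valid literals are mapped to NaN, so they are
    treated like missing data by later checks.
    """
    if not isinstance(x, str):
        return x
    try:
        return ast.literal_eval(x.strip())
    except (ValueError, SyntaxError):
        return np.nan


# Illustrative data only, shaped like the 'history' column.
history = pd.Series(
    ["[{'branch': 'A', 'rank': 'E7'}]", np.nan, "[{}]", "not a list"]
)
history_list = history.apply(parse_cell)
```

In the script above this would stand in for the lambda passed to df['history'].apply; the rest of the pipeline is unchanged.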
Here is a more concise version (it does not add unnecessary columns or store an extra "labels" variable).
import numpy as np
import pandas as pd
from numpy import nan
def check_list(L):
    # True only if L is a list and every dict in it contains every required key
    if not isinstance(L,list):
        return False
    return all(k in d for k in keys_req for d in L)
keys_req = ['branch','rank','discharge_status','service_start','job_code','post_intention']
d = {'status': {0: 'No', 1: 'No', 2: nan, 3: 'No', 4: 'No'}, 'preferred_time': {0: "['Morning', 'Midday', 'Afternoon']", 1: nan, 2: "['Morning'] ", 3: nan, 4: "['Morning', 'Midday'] "}, 'history': {0: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]", 1: nan, 2: "[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', 'service_start': '1995-02-13', 'job_code': '09', 'post_intention': ['No']},{'branch': 'A', 'rank': 'E6', 'discharge_status': 'Honorable Discharge', 'service_start': '2015-02-13', 'job_code': '09'}]", 3: nan, 4: '[{}]'}, 'id': {0: 1, 1: 5, 2: 2, 3: 3, 4: 4}}
df = pd.DataFrame(d)
df['output'] = np.array(["pending","completed"])[
    (df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
                  .apply(check_list)
     & ~pd.isna(df).any(axis = 1)
    ).to_numpy(dtype=int)]
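The labelling step works by converting the boolean mask to 0/1 and using it to index a two-element array, so False picks "pending" and True picks "completed". A small sketch of that idea next to the equivalent np.where call, which some readers may find easier to follow; the mask here is a made-up stand-in for the combined condition computed from the dataframe.

```python
import numpy as np
import pandas as pd

# Stand-in for the combined boolean condition (illustrative values only).
mask = pd.Series([True, False, False])

# Approach used in the answer: 0 -> "pending", 1 -> "completed".
by_index = np.array(["pending", "completed"])[mask.to_numpy(dtype=int)]

# Equivalent formulation: pick a label per element based on the mask.
by_where = np.where(mask, "completed", "pending")
```

Both produce an array of labels with the same length as the mask, so either can be assigned directly to df['output'].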
A version that addresses your latest comment:
df = pd.DataFrame(d)
display(df)
df['output'] = np.array(["pending","completed"])[
    (df['history'].apply(lambda x: eval(x) if isinstance(x,str) else x)
                  .apply(check_list)
     & ~pd.isna(df).any(axis = 1)
     & (df['preferred_time']!="[]")
    ).to_numpy(dtype=int)]
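The comparison with "[]" only catches an empty list stored as exactly that string, and check_list tests that keys are present, not that they hold values. The question also asks that the mandatory keys have values, so here is a rough row-wise sketch of a stricter check, under the assumption that "has a value" means not None, not NaN, not a blank string and not an empty container. The helpers parse_cell, is_blank, history_ok and row_status are hypothetical names, and the small dataframe is illustrative rather than the data from the question.

```python
import ast

import numpy as np
import pandas as pd

KEYS_REQ = ['branch', 'rank', 'discharge_status',
            'service_start', 'job_code', 'post_intention']


def parse_cell(x):
    """Parse string cells as Python literals; leave anything else untouched."""
    if not isinstance(x, str):
        return x
    try:
        return ast.literal_eval(x.strip())
    except (ValueError, SyntaxError):
        return x  # not a literal: keep the raw string


def is_blank(v):
    """Assumed definition of 'no value': None, NaN, blank string, empty container."""
    if v is None:
        return True
    if isinstance(v, float) and np.isnan(v):
        return True
    if isinstance(v, str):
        return v.strip() == ""
    if isinstance(v, (list, tuple, dict, set)):
        return len(v) == 0
    return False


def history_ok(records):
    """Every record must be a dict with a non-blank value for each required key."""
    if not isinstance(records, list) or not records:
        return False
    return all(
        isinstance(r, dict) and all(not is_blank(r.get(k)) for k in KEYS_REQ)
        for r in records
    )


def row_status(row):
    """Label one row as 'completed' or 'pending'."""
    preferred_time = parse_cell(row["preferred_time"])
    history = parse_cell(row["history"])
    cells = [row["status"], preferred_time, history, row["id"]]
    if any(is_blank(c) for c in cells):
        return "pending"
    return "completed" if history_ok(history) else "pending"


full = ("[{'branch': 'A', 'rank': 'E7', 'discharge_status': 'Honorable Discharge', "
        "'service_start': '1999-02-13', 'job_code': '09', 'post_intention': ['No']}]")
sample = pd.DataFrame({
    "status": ["No", "No", np.nan, "No"],
    "preferred_time": ["['Morning']", "[]", "['Morning']", "['Morning']"],
    "history": [full, full, full, "[{}]"],
    "id": [1, 5, 2, 4],
})
sample["output"] = sample.apply(row_status, axis=1)
```

This trades the vectorised expression for a per-row function, which is slower on large frames but keeps each rule in one readable place. It also treats an empty history list as pending, whereas check_list returns True for an empty list.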
