考慮以下資料集:
如何生成期望值,ExpectedGroup使得當 True 時存在相同的值,但當我們在case_id.
df = pd.DataFrame([
['A', 'P', 'O', 2, np.nan],
['A', 'O', 'O', 5, 1],
['A', 'O', 'O', 10, 1],
['A', 'O', 'P', 4, np.nan],
['A', 'P', 'P', 300, np.nan],
['A', 'P', 'O', 2, np.nan],
['A', 'O', 'O', 5, 2],
['A', 'O', 'O', 10, 2],
['A', 'O', 'P', 4, np.nan],
['A', 'P', 'P', 300, np.nan],
['B', 'P', 'O', 2, np.nan],
['B', 'O', 'O', 5, 3],
['B', 'O', 'O', 10, 3],
['B', 'O', 'P', 4, np.nan],
['B', 'P', 'P', 300, np.nan],
],
columns = ['ID', 'FromState', 'ToState', 'Hours', 'ExpectedGroup'])
# create boolean mask
df['case_id'] = ( (df.FromState == 'O') & (df.ToState == 'O') )
0 False
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 False
9 False
10 False
11 True
12 True
13 False
14 False
Name: case_id, dtype: boo
# but how to get incrementing groups?
np.where(df['case_id'] != False, df['case_id'].cumsum(), np.nan)
uj5u.com熱心網友回復:
您可以使用diff僅選擇每段的第一項True:
df['ExpectedGroup'] = (df['case_id'].diff()
&df['case_id']
).cumsum().where(df['case_id'])
如果您不想要中間列:
s = (df.FromState == 'O') & (df.ToState == 'O')
# or
# s = df[['FromState', 'ToState']].eq('O').all(axis=1)
df['ExpectedGroup'] = (s.diff()&s).cumsum().where(s)
# or
# df.loc[s, 'ExpectedGroup'] = (s.diff()&s).cumsum()
輸出:
ID FromState ToState Hours ExpectedGroup case_id
0 A P O 2 NaN False
1 A O O 5 1.0 True
2 A O O 10 1.0 True
3 A O P 4 NaN False
4 A P P 300 NaN False
5 B P O 2 NaN False
6 B O O 5 2.0 True
7 B O O 10 2.0 True
8 B O P 4 NaN False
9 B P P 300 NaN False
uj5u.com熱心網友回復:
讓我們使用cumsum創建計數器然后使用重新編碼計數器factorize
m = df['case_id']
df.loc[m, 'ExpectedGroup'] = (~m).cumsum()[m].factorize()[0] 1
ID FromState ToState Hours ExpectedGroup case_id
0 A P O 2 NaN False
1 A O O 5 1.0 True
2 A O O 10 1.0 True
3 A O P 4 NaN False
4 A P P 300 NaN False
5 A P O 2 NaN False
6 A O O 5 2.0 True
7 A O O 10 2.0 True
8 A O P 4 NaN False
9 A P P 300 NaN False
10 B P O 2 NaN False
11 B O O 5 3.0 True
12 B O O 10 3.0 True
13 B O P 4 NaN False
14 B P P 300 NaN False
uj5u.com熱心網友回復:
類似于mozway的絕妙方法:
df['ExpectedGroup'] = (df['case_id'].shift(-1) & df['case_id']).cumsum().mask(~s)
df
ID FromState ToState Hours ExpectedGroup case_id
0 A P O 2 NaN False
1 A O O 5 1.0 True
2 A O O 10 1.0 True
3 A O P 4 NaN False
4 A P P 300 NaN False
5 A P O 2 NaN False
6 A O O 5 2.0 True
7 A O O 10 2.0 True
8 A O P 4 NaN False
9 A P P 300 NaN False
10 B P O 2 NaN False
11 B O O 5 3.0 True
12 B O O 10 3.0 True
13 B O P 4 NaN False
14 B P P 300 NaN False
轉載請註明出處,本文鏈接:https://www.uj5u.com/ruanti/519471.html
標籤:Python熊猫数据框
