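I have two DataFrames, and I need to update the first one with values from the second wherever a match exists. In the example below, a student_Id is replaced with the corresponding new_id from updatedId whenever it appears in the old_id column.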
import pandas as pd
import numpy as np

student = {
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
    'old_id': ['ok83', '234v'],
    'new_id': ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)

# Method with np.where: one full pass over the column per updated id
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'], df_student['student_Id'])
# print(df_student)

# Method with Series.mask: the result is assigned back, because inplace=True
# on a column selection is not reliable under pandas copy-on-write
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'], row['new_id'])
print(df_student)
Both methods above work and produce the correct result:
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       ok83
5  ganesh    male          40       234v
  old_id new_id
0   ok83   83ko
1   234v   v432
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432
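However, the real student data has about 500,000 rows, and updated_id has 6,000 rows.
Because the loop is very slow, I am running into performance problems.
I added a simple timer to see how long it takes for a given number of df_updated_id records:
100 rows - numpy time=3.9020769596099854; mask time=3.9169061183929443
500 rows - numpy time=20.42293930053711; mask time=19.768696784973145
1000 rows - numpy time=40.06309795379639; mask time=37.26559829711914
My question is whether I can optimize this with a merge (joining the tables), or otherwise get rid of iterrows. I tried something along the lines of the posts below but could not get it to work: "Replace dataframe column values based on matching id in another dataframe" and "How to iterate over rows in a DataFrame in Pandas".
Please advise.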
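For reference, the merge-based approach the question asks about could be sketched roughly as follows. This is only an illustration on a trimmed-down copy of the example data, and it assumes every old_id appears at most once in df_updated_id (duplicate old_id values would duplicate student rows in the join). The names merged and result are introduced here for the sketch.

```python
import pandas as pd

df_student = pd.DataFrame({
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
})
df_updated_id = pd.DataFrame({
    'old_id': ['ok83', '234v'],
    'new_id': ['83ko', 'v432'],
})

# Left join keeps every student row; new_id is NaN where no old_id matches.
# Assumption: old_id values are unique, so no student row is duplicated.
merged = df_student.merge(
    df_updated_id, how='left', left_on='student_Id', right_on='old_id'
)

# Take new_id where one was found, otherwise keep the original student_Id.
merged['student_Id'] = merged['new_id'].fillna(merged['student_Id'])

# Drop the helper columns brought in by the join.
result = merged.drop(columns=['old_id', 'new_id'])
print(result)
```

The join replaces the per-row Python loop with a single lookup, so the student column is no longer scanned once per updated id.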
Reply from a uj5u.com user:
We can simply use replace:
df_student.replace({'student_Id': df_updated_id.set_index('old_id')['new_id']}, inplace=True)
df_student
Out[337]:
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432
Reply from a uj5u.com user:
You can also try map:
df_student['student_Id'] = (
    df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
                            .fillna(df_student['student_Id'])
)
print(df_student)
# Output
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432
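To check how the map approach behaves at the sizes mentioned in the question (about 500,000 students and 6,000 updated ids), a rough timing sketch on synthetic data might look like this. The ids ('id0', 'id1', ...) and replacement values ('new0', ...) are made up for the test, and no timing figure is claimed here, since it depends on the machine and the pandas version.

```python
import time
import pandas as pd

n_students, n_updates = 500_000, 6_000

# Synthetic, unique string ids (hypothetical data, not the real student file).
ids = pd.Series([f'id{i}' for i in range(n_students)])

# Mapping Series: index holds the old ids, values hold the new ids.
mapping = pd.Series(
    [f'new{i}' for i in range(n_updates)],
    index=[f'id{i}' for i in range(n_updates)],
)

start = time.perf_counter()
# map yields NaN for ids without a replacement; fillna restores the originals.
updated = ids.map(mapping).fillna(ids)
elapsed = time.perf_counter() - start

changed = int((updated != ids).sum())
print(f'{changed} ids replaced in {elapsed:.3f} s')
```

The fillna step matters: without it, every student whose id is not in the mapping would end up with NaN instead of the original id.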
Reply from a uj5u.com user:
Alternatively, try replace with a dictionary comprehension:
df_student.replace({'student_Id': {o: n for o, n in zip(updatedId['old_id'],
                                                        updatedId['new_id'])}})
Output:
     Name  gender  math score student_Id
0    John    male          50       1234
1     Jay    male         100       6788
2  sachin    male          70        xyz
3  Geetha  female          80       abcd
4  Amutha  female          75       83ko
5  ganesh    male          40       v432
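Note that replace returns a new DataFrame unless inplace=True is passed, so outside an interactive session the result has to be assigned back for the change to stick. A minimal sketch on a trimmed-down copy of the example data (id_map is a name introduced here for illustration):

```python
import pandas as pd

df_student = pd.DataFrame({'student_Id': ['1234', 'ok83', '234v']})
updatedId = {'old_id': ['ok83', '234v'], 'new_id': ['83ko', 'v432']}

# Build the old -> new lookup once; dict(zip(...)) is equivalent
# to the dictionary comprehension used above.
id_map = dict(zip(updatedId['old_id'], updatedId['new_id']))

# replace returns a new DataFrame by default, so assign the result back.
df_student = df_student.replace({'student_Id': id_map})
print(df_student['student_Id'].tolist())  # ['1234', '83ko', 'v432']
```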
Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/451431.html
