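I want to use MeanEncoder from feature-engine inside my k-fold loop to encode categorical data. After the transform step, the encoder appears to introduce NaN values into some columns of my dataset. The code is as follows: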
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from feature_engine.imputation import RandomSampleImputer
from feature_engine.encoding import MeanEncoder

# housing, categorical_var and descrete_var are assumed to be defined earlier (not shown)
kf = KFold(n_splits=2)
linear_reg = linear_model.LinearRegression()
kfold_rmse = []
X = housing.drop(columns=['Price'])
y = housing['Price']
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Drop unused columns; reassigning avoids in-place edits on slices of X
    X_train = X_train.drop(columns=['BuildingArea', 'YearBuilt', 'Rooms'])
    X_test = X_test.drop(columns=['BuildingArea', 'YearBuilt', 'Rooms'])
    # Impute missing values, fitting the imputer on the training fold only
    random_imputer = RandomSampleImputer(variables=['Car', 'CouncilArea'])
    random_imputer.fit(X_train)
    X_train = random_imputer.transform(X_train)
    X_test = random_imputer.transform(X_test)
    # Cast discrete variables to object so the encoder treats them as categorical
    X_train[descrete_var] = X_train[descrete_var].astype('O')
    X_test[descrete_var] = X_test[descrete_var].astype('O')
    # Assumes categorical_var and descrete_var are both lists of column names
    mean_encoder = MeanEncoder(variables=categorical_var + descrete_var)
    mean_encoder.fit(X_train, y_train)
    print(X_test.isnull().mean())  # <--------- No NaN columns
    X_train = mean_encoder.transform(X_train)
    X_test = mean_encoder.transform(X_test)
    print(X_test.isnull().mean())  # <--------- NaN columns introduced
    # Fit the model
    # linear_reg_model = linear_reg.fit(X_train, y_train)
    # y_pred_linear_reg = linear_reg_model.predict(X_test)
    # # Calculate the RMSE for each fold and append it
    # rmse = mean_squared_error(y_test, y_pred_linear_reg, squared=False)
    # kfold_rmse.append(rmse)
For further context, this is the output I get:
...
Suburb 0.0
Type 0.0
Method 0.0
SellerG 0.0
Distance 0.0
Postcode 0.0
Bedroom2 0.0
Bathroom 0.0
Car 0.0
Landsize 0.0
CouncilArea 0.0
Regionname 0.0
Propertycount 0.0
Month_name 0.0
day 0.0
Year 0.0
dtype: float64
Suburb 0.000000
Type 0.000000
Method 0.000000
SellerG 0.014138
Distance 0.000000
Postcode 0.000000
Bedroom2 0.000000
Bathroom 0.000295
...
Month_name 0.000000
day 0.191605
Year 0.000000
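This obviously causes problems for model prediction, because LinearRegression cannot accept NaN values. I suspect the problem lies in how I am using MeanEncoder inside the k-fold loop. Is there something I am doing wrong, or not understanding, about the k-fold procedure or MeanEncoder?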
Answer:
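Your test fold contains categories that were not seen during training, and by default the encoder encodes those categories as NaN. From the documentation:
errors: string, default='ignore'
Indicates what to do when categories not present in the train set are encountered during transform. If 'raise', then rare categories will raise an error. If 'ignore', then rare categories will be set as NaN and a warning will be raised instead.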
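To make the mechanism concrete, here is a minimal plain-Python sketch of mean encoding. It is not feature-engine's implementation: the function names, the toy data, and the fallback to the overall training mean are all assumptions made for illustration. It shows why a category that appears only in the test fold has no learned value, and one possible way to handle it.

```python
import math
from collections import defaultdict


def fit_mean_encoding(categories, targets):
    """Learn the mean target value per category from the training fold."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def transform(categories, mapping, fallback=math.nan):
    """Replace each category with its learned mean; unseen ones get `fallback`."""
    return [mapping.get(cat, fallback) for cat in categories]


# Hypothetical toy data: seller "C" appears only in the test fold.
train_sellers = ["A", "A", "B", "B"]
train_prices = [100.0, 200.0, 300.0, 500.0]
test_sellers = ["A", "C"]

mapping = fit_mean_encoding(train_sellers, train_prices)

# With no fallback, the unseen category "C" is encoded as NaN.
encoded = transform(test_sellers, mapping)

# One possible workaround (an assumption, not the library's behaviour):
# fall back to the overall mean of the training targets.
global_mean = sum(train_prices) / len(train_prices)
encoded_filled = transform(test_sellers, mapping, fallback=global_mean)

print(encoded)
print(encoded_filled)
```

In the loop from the question, the equivalent idea would be to fill the NaN values that appear in the encoded X_test with a statistic learned from the training fold only, so that no information leaks from the test fold. Columns with many distinct values, such as SellerG or day, are the most likely to contain categories missing from one of the folds.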