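I am working with a Pandas DataFrame df in Python. I am doing a classification task and have two imbalanced classes, df['White'] and df['Non-white']. For this reason, I built a pipeline that includes SMOTE and RandomUnderSampler.
This is what my pipeline looks like: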
model = Pipeline([
    ('preprocessor', preprocessor),
    ('smote', over),
    ('random_under_sampler', under),
    ('classification', knn)
])
These are the exact steps:
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('knnimputer', KNNImputer(),
['policePrecinct']),
('onehotencoder-1',
OneHotEncoder(), ['gender']),
('standardscaler',
StandardScaler(),
['long', 'lat']),
('onehotencoder-2',
OneHotEncoder(),
['neighborhood',
'problem'])])),
('smote', SMOTE()),
('random_under_sampler', RandomUnderSampler()),
('classification', KNeighborsClassifier())])
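I want to evaluate different values of sampling_strategy in SMOTE and RandomUnderSampler. Can I do this directly in GridSearch when tuning the parameters? For now, I have written the following for loop. The loop does not work (ValueError: too many values to unpack (expected 2)).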
strategy_sm = [0.1, 0.3, 0.5]
strategy_un = [0.15, 0.30, 0.50]
best_strat = []
for k, n in strategy_sm, strategy_un:
    over = SMOTE(sampling_strategy=k)
    under = RandomUnderSampler(sampling_strategy=n)
    model = Pipeline([
        ('preprocessor', preprocessor),
        ('smote', over),
        ('random_under_sampler', under),
        ('classification', knn)
    ])
    mode.fit(X_train, y_train)
    best_strat.append[(model.score(X_train, y_train))]
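I am not very proficient in Python and I suspect there is a better way to do this. I would also like the for loop (if that is indeed the way to go) to let me visualize the results for the different sampling_strategy values. Any ideas?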
Answer from a uj5u.com user:
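Below is an example of how to compare the classifier's accuracy across different parameter combinations using stratified 3-fold cross-validation (n_splits=3 in the code) and how to visualize the results.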
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# generate some data
X, y = make_classification(n_classes=2, weights=[0.1, 0.9], n_features=20, random_state=42)
# define the pipeline
estimator = Pipeline([
    ('smote', SMOTE()),
    ('random_under_sampler', RandomUnderSampler()),
    ('classification', KNeighborsClassifier())
])
# define the parameter grid
param_grid = {
    'smote__sampling_strategy': [0.3, 0.4, 0.5],
    'random_under_sampler__sampling_strategy': [0.5, 0.6, 0.7]
}
# run a grid search to calculate the cross-validation
# accuracy associated to each parameter combination
clf = GridSearchCV(
    estimator=estimator,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3)
)
clf.fit(X, y)
# organize the grid search results in a data frame
res = pd.DataFrame(clf.cv_results_)
res = res.rename(columns={
    'param_smote__sampling_strategy': 'smote_strategy',
    'param_random_under_sampler__sampling_strategy': 'random_under_sampler_strategy',
    'mean_test_score': 'accuracy'
})
res = res[['smote_strategy', 'random_under_sampler_strategy', 'accuracy']]
print(res)
# smote_strategy random_under_sampler_strategy accuracy
# 0 0.3 0.5 0.829471
# 1 0.4 0.5 0.869578
# 2 0.5 0.5 0.899881
# 3 0.3 0.6 0.809269
# 4 0.4 0.6 0.819370
# 5 0.5 0.6 0.778669
# 6 0.3 0.7 0.708259
# 7 0.4 0.7 0.778966
# 8 0.5 0.7 0.768568
# plot the grid search results
res_ = res.pivot(index='smote_strategy', columns='random_under_sampler_strategy', values='accuracy')
sns.heatmap(res_, annot=True, cbar_kws={'label': 'accuracy'})
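As for the loop in the question: "for k, n in strategy_sm, strategy_un" iterates over a tuple holding the two lists, so each item is a three-element list that cannot be unpacked into k and n. That is where the ValueError comes from. The sketch below shows two ways to iterate instead: zip pairs the values by position, and itertools.product tries every combination. The n >= k filter is my own assumption. As far as I know, imblearn raises an error when the requested under-sampling ratio is lower than the ratio SMOTE has already produced. Once the loop runs, mode.fit would also need to become model.fit, and append[...] would need to become append(...).

```python
from itertools import product

strategy_sm = [0.1, 0.3, 0.5]
strategy_un = [0.15, 0.30, 0.50]

# Reproduce the original error: the loop walks over a tuple of two lists,
# and each 3-element list cannot be unpacked into the two names k and n.
unpack_error = None
try:
    for k, n in (strategy_sm, strategy_un):
        pass
except ValueError as err:
    unpack_error = str(err)

# Option 1: pair the values position by position.
paired = list(zip(strategy_sm, strategy_un))
# [(0.1, 0.15), (0.3, 0.3), (0.5, 0.5)]

# Option 2: every combination, keeping only those where the
# under-sampling ratio is not below the SMOTE ratio (assumed constraint).
combos = [(k, n) for k, n in product(strategy_sm, strategy_un) if n >= k]

print(unpack_error)
print(paired)
print(combos)
```

Either list can then drive the loop body from the question (for k, n in combos: ...). Note that model.score(X_train, y_train) measures accuracy on the training data, so a cross-validated score such as the one from GridSearchCV above is usually the more informative number to compare.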
Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/385699.html
