Is it possible to optimize hyperparameters for optional sklearn pipeline steps?

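I am trying to build a pipeline that contains some optional steps. I also want to optimize the hyperparameters of those steps, because I want the search to pick the best option between not using a step at all and using it with different configurations (in my case the step is SelectFromModel, named sfm).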

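from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))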

p_grid_lr = {"clf__max_depth": [10, 50, 100, None],
             "clf__n_estimators": [10, 50, 100, 200, 500, 800],
             "clf__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
             "sfm": ['passthrough', sfm],
             "sfm__max_depth": [10, 50, 100, None],
             "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
             "sfm__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
            }

pipeline=Pipeline([
                 ('scl',stdscl),
                 ('sfm',sfm),
                 ('clf',clf)
                  ])

gs_clf = GridSearchCV(estimator=pipeline, param_grid=p_grid_lr,
                      cv=KFold(shuffle=True, n_splits=5, random_state=1),
                      scoring='r2', n_jobs=-1)
gs_clf.fit(X_train, y_train)

clf = gs_clf.best_estimator_

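The error I get is 'str' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together? In my case that means only 'passthrough' on its own, and sfm with different hyperparameters.

Thanks!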

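uj5u.com user replied:

As @Robin pointed out, you can define p_grid_lr as a list of dictionaries. Indeed, this is what the GridSearchCV documentation says about it:

param_grid: dict or list of dictionaries

Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.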
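To see what "the grids spanned by each dictionary" means in practice, here is a minimal sketch using ParameterGrid from sklearn.model_selection, which expands a grid specification into concrete candidates. The reduced value lists are illustrative assumptions, chosen only to keep the count small.

```python
from sklearn.model_selection import ParameterGrid

# Two independent sub-grids: one that disables the step, one that tunes it.
# The value lists are shortened for illustration.
grid = [
    {"sfm": ["passthrough"], "clf__max_depth": [10, None]},
    {"sfm__estimator__max_depth": [10, 50], "clf__max_depth": [10, None]},
]

# Each dictionary is expanded separately; settings are never mixed across dictionaries.
candidates = list(ParameterGrid(grid))
n_candidates = len(candidates)  # 1*2 from the first dict + 2*2 from the second

for params in candidates:
    print(params)
```

No candidate contains both "sfm" and an "sfm__estimator__..." key, which is what avoids calling set_params on the 'passthrough' string.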

p_grid_lr = [
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        "sfm__estimator__max_depth": [10, 50, 100, None],
        "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
        "sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
    },
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        "sfm": ['passthrough'],
    }
]
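As a rough end-to-end check, the list-of-dictionaries grid can be run on synthetic data. This is only a sketch: the make_regression data, the shortened value lists, and the names small_grid and search are assumptions made to keep the run fast, not part of the original question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption).
X, y = make_regression(n_samples=80, n_features=8, n_informative=3, random_state=1)

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# Same structure as p_grid_lr, with far fewer values per parameter.
small_grid = [
    {"clf__n_estimators": [10, 20], "sfm__estimator__n_estimators": [10, 20]},
    {"clf__n_estimators": [10, 20], "sfm": ["passthrough"]},
]

search = GridSearchCV(
    estimator=pipeline,
    param_grid=small_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    scoring="r2",
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # 4 with sfm + 2 with passthrough
print(n_candidates, search.best_params_)
```

If 'passthrough' wins, best_params_ contains the key "sfm"; otherwise it contains the "sfm__estimator__..." settings of the winning selector.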

A less scalable alternative (for your case) might be the following:

p_grid_lr_ = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
    "sfm": ['passthrough', 
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
            ...]
}

which spells out every possible combination of your parameters explicitly.
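If you do go this route, the candidate list does not have to be typed by hand. One possible sketch, assuming the same value lists as in the question, builds it with ParameterGrid; the names inner_grid and sfm_options are hypothetical.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import ParameterGrid

# Every combination of the inner forest's hyperparameters (values from the question).
inner_grid = ParameterGrid({
    "max_depth": [10, 50, 100, None],
    "n_estimators": [10, 50, 100, 200, 500, 800],
    "max_features": [0.1, 0.5, 1.0, "sqrt", "log2"],
})

# 'passthrough' plus one preconfigured SelectFromModel per combination.
sfm_options = ["passthrough"] + [
    SelectFromModel(RandomForestRegressor(random_state=1, **params))
    for params in inner_grid
]

p_grid_generated = {
    "clf__max_depth": [10, 50, 100, None],
    "sfm": sfm_options,
}
print(len(sfm_options))  # 1 + 4 * 6 * 5 combinations
```

This produces the same search space as the list of dictionaries, but the results are harder to read because each candidate is reported as a whole estimator object rather than as separate parameter values.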

Also note that to reach the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator inside your SelectFromModel, you have to write the parameter names as

"sfm__estimator__max_depth": [10, 50, 100, None],
"sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2']

rather than as

"sfm__max_depth": [10, 50, 100, None],
"sfm__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__max_features": [0.1, 0.5, 1.0,'sqrt','log2']

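because these parameters belong to the wrapped estimator itself (max_features could in principle also be a parameter of SelectFromModel, but in that case, according to the docs, it might only accept integer values).

In general, you can list every parameter that can be optimized with pipeline.get_params().keys() (or estimator.get_params().keys() for any estimator).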
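A short sketch of that check, using the same step names as in the question (the variable names param_names and sfm_names are just for illustration):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scl", StandardScaler()),
    ("sfm", SelectFromModel(RandomForestRegressor(random_state=1))),
    ("clf", RandomForestRegressor(random_state=1)),
])

# All names accepted by set_params, and therefore usable as param_grid keys.
param_names = sorted(pipeline.get_params().keys())

# Only the names that belong to the sfm step and its wrapped forest.
sfm_names = [name for name in param_names if name.startswith("sfm__")]
for name in sfm_names:
    print(name)
```

The forest's depth shows up as sfm__estimator__max_depth, while sfm__max_depth does not appear at all, which is why the original grid keys were rejected.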

Finally, the Pipelines section of the scikit-learn user guide is a good read.

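uj5u.com user replied:

Referring to this example, you can make a list of dictionaries: one that contains sfm and its related parameters, and another that skips the step by setting it to "passthrough".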
