Feature Importance Analysis and Feature Selection With XGBoost in Python

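A benefit of using ensembles of decision tree methods such as gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

In this post you will discover how to estimate the importance of features for a predictive modeling problem using the XGBoost library in Python. After reading this post you will know:

How feature importance is calculated using the gradient boosting algorithm.
How to plot feature importance in Python as calculated by an XGBoost model.
How to use feature importance calculated by XGBoost to perform feature selection.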

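Feature Importance in Gradient Boosting

A benefit of using gradient boosting is that, once the boosted trees are constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other. For a single decision tree, importance is calculated from the amount by which each attribute's split points improve the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points, or another more specific error function. The feature importances are then averaged across all of the decision trees within the model.

For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1 "Relative Importance of Predictor Variables" of the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction (page 367). Also see Matthew Drury's answer to the StackOverflow question "Relative variable importance for Boosting", where he provides a very detailed and practical answer.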
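The calculation described above can be sketched in a few lines of plain Python. This is only an illustration of the idea (sample-weighted split improvements summed per feature, then averaged across trees), not XGBoost's internal code. The toy trees, the helper names `tree_importance` and `ensemble_importance`, and the final normalisation step are all assumptions made for the example.

```python
# Each split is (feature_index, improvement, n_samples_at_node).
# The two toy trees below are made-up values for illustration only.

def tree_importance(splits, n_features):
    """Sum the sample-weighted split improvements per feature for one tree."""
    scores = [0.0] * n_features
    for feature, improvement, n_samples in splits:
        scores[feature] += improvement * n_samples
    return scores

def ensemble_importance(trees, n_features):
    """Average per-tree importances, then normalise so the scores sum to 1."""
    totals = [0.0] * n_features
    for splits in trees:
        for i, score in enumerate(tree_importance(splits, n_features)):
            totals[i] += score
    averaged = [t / len(trees) for t in totals]
    total = sum(averaged)
    return [a / total for a in averaged]

toy_trees = [
    [(0, 0.30, 100), (1, 0.10, 60), (0, 0.05, 40)],  # tree 1
    [(1, 0.20, 100), (2, 0.10, 50)],                 # tree 2
]
importances = ensemble_importance(toy_trees, n_features=3)
print(importances)
```

In this toy case feature 0 ranks first because it was used for the split with the largest weighted improvement.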

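Manually Plot Feature Importance

A trained XGBoost model automatically calculates feature importance for your predictive modeling problem. These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows: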

print(model.feature_importances_)

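We can plot these scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example: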

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

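We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances.

Download the dataset and place it in your current working directory.

Dataset file:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

Dataset details:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names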

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

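Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example first outputs the importance scores.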

[ 0.089701    0.17109634  0.08139535  0.04651163  0.10465116  0.2026578 0.1627907   0.14119601]

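We also get a bar chart of the relative importances.

A downside of this plot is that the features are ordered by their input index rather than by their importance. We could sort the features before plotting.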
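As a minimal sketch of the sorting step, the snippet below ranks the scores printed above from most to least important. The values are copied from that output, so your own numbers may differ. Plotting is left as a comment so the sketch runs without matplotlib.

```python
# importance scores copied from the printed output above
scores = [0.089701, 0.17109634, 0.08139535, 0.04651163,
          0.10465116, 0.2026578, 0.1627907, 0.14119601]

# feature indices ordered from most to least important
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
sorted_scores = [scores[i] for i in order]

for i in order:
    print("feature %d: %.4f" % (i, scores[i]))

# To plot in this order, something like the following should work:
#   pyplot.bar(range(len(order)), sorted_scores)
#   pyplot.xticks(range(len(order)), order)
#   pyplot.show()
```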

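Thankfully, there is a built-in plot function to help us.

Using the Built-in XGBoost Feature Importance Plot

The XGBoost library provides a built-in function to plot features ordered by their importance. The function is called plot_importance() and can be used as follows: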

# plot feature importance
plot_importance(model)
pyplot.show()

For example, below is a complete code listing that plots the feature importance for the Pima Indians dataset using the built-in plot_importance() function.

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

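Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example gives us a more useful bar chart.

You can see that the features are automatically named F0 to F7 according to their index in the input array (X). Manually mapping these indices to the names in the problem description, we can see that the plot shows F5 (body mass index) has the highest importance and F3 (skin fold thickness) has the lowest importance.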
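The manual mapping from index labels to names can be written down once and reused. The sketch below is an illustration only: the names are paraphrased from the dataset details file linked earlier, and the lowercase `f0` to `f7` key format is an assumption about how the labels appear on your plot, so adjust it to match what you see.

```python
# attribute names paraphrased from the dataset details file, in column order
feature_names = [
    "times pregnant",
    "plasma glucose concentration",
    "diastolic blood pressure",
    "triceps skin fold thickness",
    "2-hour serum insulin",
    "body mass index",
    "diabetes pedigree function",
    "age",
]

# map the index-based labels (assumed to look like "f0" ... "f7") to names
labels = {"f%d" % i: name for i, name in enumerate(feature_names)}

print(labels["f5"])
print(labels["f3"])
```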

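Feature Selection With XGBoost Feature Importance Scores

Feature importance scores can be used for feature selection in scikit-learn. This is done using the SelectFromModel class, which takes a model and can transform a dataset into a subset containing only the selected features. This class can take a pre-trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is applied when you call the transform() method on the SelectFromModel instance, so that the same features are consistently selected on the training dataset and the test dataset.

In the example below, we first train an XGBoost model on the entire training dataset and evaluate it on the test dataset. Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use this to select features on the training dataset, train a model on the selected subset of features, and then evaluate that model on the test set, which is reduced using the same feature selection scheme.

For example: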

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

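For interest, we can test multiple thresholds for selecting features by feature importance. Specifically, we use the feature importance of each input variable as a threshold, which essentially lets us test each subset of features by importance, starting with all features and ending with a subset containing only the most important feature.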
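To see why using each importance as a threshold produces subsets of 8, 7, ... down to 1 feature, here is a plain-Python sketch of the selection rule: a feature is kept when its importance is greater than or equal to the threshold. The helper `select_columns` and the placeholder data row are hypothetical, and the scores are the ones printed in the earlier manual-plot example.

```python
# importance scores copied from the earlier printed output
importances = [0.089701, 0.17109634, 0.08139535, 0.04651163,
               0.10465116, 0.2026578, 0.1627907, 0.14119601]

def select_columns(rows, importances, threshold):
    """Keep the columns whose importance is >= threshold (hypothetical helper)."""
    keep = [i for i, imp in enumerate(importances) if imp >= threshold]
    return [[row[i] for i in keep] for row in rows], keep

rows = [list(range(8))]  # one placeholder row with 8 columns, not real data

counts = []
for threshold in sorted(importances):
    selected, keep = select_columns(rows, importances, threshold)
    counts.append(len(keep))
    print("Thresh=%.3f, n=%d, kept columns=%s" % (threshold, len(keep), keep))
```

Because the thresholds are tried in ascending order, each step drops the least important remaining feature.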

The complete code listing is provided below:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
 # select features using threshold
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X_train = selection.transform(X_train)
 # train model
 selection_model = XGBClassifier()
 selection_model.fit(select_X_train, y_train)
 # eval model
 select_X_test = selection.transform(X_test)
 y_pred = selection_model.predict(select_X_test)
 predictions = [round(value) for value in y_pred]
 accuracy = accuracy_score(y_test, predictions)
 print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Note that if you are using XGBoost 1.0.2 (and perhaps other versions), there is a bug in the XGBClassifier class that results in the error:

KeyError: 'weight'

This can be worked around by using a custom XGBClassifier subclass that returns None for the coef_ property. The complete example is listed below.

# use feature importance for feature selection, with fix for xgboost 1.0.2
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
 
# define custom class to fix bug in xgboost 1.0.2
class MyXGBClassifier(XGBClassifier):
 @property
 def coef_(self):
  return None
 
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = MyXGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
 # select features using threshold
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X_train = selection.transform(X_train)
 # train model
 selection_model = XGBClassifier()
 selection_model.fit(select_X_train, y_train)
 # eval model
 select_X_test = selection.transform(X_test)
 predictions = selection_model.predict(select_X_test)
 accuracy = accuracy_score(y_test, predictions)
 print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

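Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example prints the following output.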

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

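We can see that the performance of the model generally decreases as the number of selected features decreases.

On this problem there is a trade-off between the number of features and test set accuracy. We could decide to take a less complex model (fewer attributes, such as n=4) and accept a modest decrease in estimated accuracy, from 77.95% down to 76.38%.

This is likely to be a wash on such a small dataset, but it may be a more useful strategy on a larger dataset when cross validation is used as the model evaluation scheme.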
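The trade-off above can be made explicit with a simple rule: pick the smallest feature subset whose accuracy stays within some tolerance of the best result. The sketch below applies that rule to the results printed above. The 2-point tolerance is a hypothetical choice for illustration, not a recommendation.

```python
# (threshold, n_features, accuracy %) copied from the printed output above
results = [
    (0.071, 8, 77.95), (0.073, 7, 76.38), (0.084, 6, 77.56),
    (0.090, 5, 76.38), (0.128, 4, 76.38), (0.160, 3, 74.80),
    (0.186, 2, 71.65), (0.208, 1, 63.78),
]

tolerance = 2.0  # hypothetical: accept up to 2 accuracy points below the best

best = max(acc for _, _, acc in results)
acceptable = [r for r in results if r[2] >= best - tolerance]
choice = min(acceptable, key=lambda r: r[1])  # fewest features among acceptable

print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % choice)
```

With this tolerance the rule lands on the n=4 subset discussed above. A tighter tolerance would keep more features.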

Author: 沂水寒城, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV

Blog: http://yishuihancheng.blog.csdn.net

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/278015.html