??本文介紹基于Python的隨機森林（Random Forest，RF）回歸代碼，以及模型超引數（包括決策樹個數與最大深度、最小分離樣本數、最小葉子節點樣本數、最大分離特征數等）自動優化的代碼，

??本文是在上一篇文章Python實作隨機森林RF并對比自變數的重要性的基礎上完成的，因此本次僅對隨機森林模型超引數自動擇優部分的代碼加以詳細解釋；而資料準備、模型建立、精度評定等其他部分的代碼詳細解釋，大家直接點擊上述文章Python實作隨機森林RF并對比自變數的重要性查看即可，

??其中，關于基于MATLAB實作同樣程序的代碼與實戰，大家可以點擊查看文章MATLAB實作隨機森林（RF）回歸與自變數影響程度分析，

??本文分為兩部分，第一部分為代碼的分段講解，第二部分為完整代碼，

1 代碼分段講解

1.1 資料與模型準備

??本部分是對隨機森林演算法的資料與模型準備，由于在之前的博客中已經詳細介紹過了，本文就不再贅述~大家直接查看文章Python實作隨機森林RF并對比自變數的重要性即可，

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Attention! Data Partition
# Attention! One-Hot Encoding

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

# Data import

train_data=https://www.cnblogs.com/fkxxgis/p/pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

1.2 超引數范圍給定

??首先，我們需要對隨機森林模型超引數各自的范圍加以確定，之后我們將在這些范圍內確定各個超引數的最終最優取值，換句話說，我們現在先給每一個需要擇優的超引數劃定一個很大很大的范圍（例如對于“決策樹個數”這個超引數，我們可以將其范圍劃定在10到5000這樣一個很大的范圍），然后后期將用擇優演算法在每一個超引數的這個范圍內進行搜索，

??在此，我們先要確定對哪些超引數進行擇優，本文選擇在隨機森林演算法中比較重要的幾個超引數進行調優，分別是：決策樹個數n_estimators，決策樹最大深度max_depth，最小分離樣本數（即拆分決策樹節點所需的最小樣本數）min_samples_split，最小葉子節點樣本數（即一個葉節點所需包含的最小樣本數）min_samples_leaf，最大分離特征數（即尋找最佳節點分割時要考慮的特征變數數量）max_features，以及是否進行隨機抽樣bootstrap等六種，關于上述超引數如果大家不是太了解具體的含義，可以查看文章Python實作隨機森林RF并對比自變數的重要性的1.5部分，可能就會比較好理解了（不過其實不理解也不影響接下來的操作），

??這里提一句，其實隨機森林的超引數并不止上述這些，我這里也是結合資料情況與最終的精度需求，選擇了相對比較常用的幾個超引數；大家依據各自實際需要，選擇需要調整的超引數，并用同樣的代碼思路執行即可，

# Search optimal hyperparameter

n_estimators_range=[int(x) for x in np.linspace(start=50,stop=3000,num=60)]
max_features_range=['auto','sqrt']
max_depth_range=[int(x) for x in np.linspace(10,500,num=50)]
max_depth_range.append(None)
min_samples_split_range=[2,5,10]
min_samples_leaf_range=[1,2,4,8]
bootstrap_range=[True,False]

random_forest_hp_range={'n_estimators':n_estimators_range,
                        'max_features':max_features_range,
                        'max_depth':max_depth_range,
                        'min_samples_split':min_samples_split_range,
                        'min_samples_leaf':min_samples_leaf_range
                        # 'bootstrap':bootstrap_range
                        }
pprint(random_forest_hp_range)

??可以看到，上述代碼首先是對六種超引數劃定了一個范圍，然后將其分別存入了一個超引數范圍字典random_forest_hp_range，在這里大家可以看到，我在存入字典時，將bootstrap的范圍這一句注釋掉了，這是由于當時運行后我發現bootstrap還是處于True這個狀態比較好（也就是得到的結果精度比較高），因此就取消了這一超引數的擇優；大家依據個人資料與模型的實際情況來即可~

??我們可以看一下random_forest_hp_range變數的取值情況：

??沒錯，它是一個字典，鍵就是超引數的名稱，值就是超引數的范圍，因為我將bootstrap注釋掉了，因此這個字典里就沒有bootstrap這一項了~

1.3 超引數隨機匹配擇優

??上面我們確定了每一種超引數各自的范圍，那么接下來我們就將他們分別組合，對比每一個超引數取值組合所得到的模型結果，從而確定最優超引陣列合，

??其實大家會發現，我們上面劃定六種超引數（除去我后來洗掉的bootstrap的話是五種），如果按照排列組合來計算的話，會有很多很多種組合方式，如果要一一嘗試未免也太麻煩了，因此，我們用到RandomizedSearchCV這一功能——其將隨機匹配每一種超引陣列合，并輸出最優的組合，換句話說，我們用RandomizedSearchCV來進行隨機的排列，而不是對所有的超引數排列組合方法進行遍歷，這樣子確實可以節省很多時間，

random_forest_model_test_base=RandomForestRegressor()
random_forest_model_test_random=RandomizedSearchCV(estimator=random_forest_model_test_base,
                                                   param_distributions=random_forest_hp_range,
                                                   n_iter=200,
                                                   n_jobs=-1,
                                                   cv=3,
                                                   verbose=1,
                                                   random_state=random_forest_seed
                                                   )
random_forest_model_test_random.fit(train_X,train_Y)

best_hp_now=random_forest_model_test_random.best_params_
pprint(best_hp_now)

??由代碼可以看到，我們首先建立一個隨機森林模型random_forest_model_test_base，并將其帶入到RandomizedSearchCV中；其中，RandomizedSearchCV的引陣列合就是剛剛我們看的random_forest_hp_range，n_iter就是具體隨機搭配超引陣列合的次數（這個次數因此肯定是越大涵蓋的組合數越多，效果越好，但是也越費時間），cv是交叉驗證的折數（RandomizedSearchCV衡量每一種組合方式的效果就是用交叉驗證來進行的），n_jobs與verbose是關于模型執行緒、日志相關的資訊，大家不用太在意，random_state是隨機森林中隨機抽樣的亂數種子，

??之后，我們對random_forest_model_test_random加以訓練，并獲取其所得到的最優超引數匹配組合best_hp_now，在這里，模型的訓練次數就是n_iter與cv的乘積（因為交叉驗證有幾折，那么就需要運行幾次；而一共有n_iter個引數匹配組合，因此總次數就是二者相乘），例如，用上述代碼那么一共就需要運行600次，運行程序在程式中將自動顯示，如下圖，

??可以看到，一共有600次fit，我這里共花了11.7min完成，具體速度和電腦配置、自變數與因變數資料量大小，以及電腦此時記憶體等等都有關，

??運行完畢，我們來看看找到的最有超引陣列合best_hp_now，

??可以看到，經過200種組合匹配方式的計算，目前五種超引數最優的組合搭配方式已經得到了，其實每一次得到的超引數最優組合結果差距也是蠻大的——例如同一批資料，有的時候我得到的n_estimators最優值是如圖所示的100，有的時候也會是2350；所以大家依據實際情況來判斷即可~

??那么接下來，我們就繼續對這一best_hp_now所示的結果進行更進一步的擇優，

1.4 超引數遍歷匹配擇優

??剛剛我們基于RandomizedSearchCV，實作了200次的超引數隨機匹配與擇優；但是此時的結果是一個隨機不完全遍歷后所得的結果，因此其最優組合可能并不是全域最優的，而只是一個大概的最優范圍，因此接下來，我們需要依據上述所得到的隨機最優匹配結果，進行遍歷全部組合的匹配擇優，

??遍歷匹配即在隨機匹配最優結果的基礎上，在其臨近范圍內選取幾個數值，并通過GridSearchCV對每一種匹配都遍歷，從而選出比較好的超引數最終取值結果，

# Grid Search

random_forest_hp_range_2={'n_estimators':[60,100,200],
                          'max_features':[12,13],
                          'max_depth':[350,400,450],
                          'min_samples_split':[2,3] # Greater than 1
                          # 'min_samples_leaf':[1,2]
                          # 'bootstrap':bootstrap_range
                          }
random_forest_model_test_2_base=RandomForestRegressor()
random_forest_model_test_2_random=GridSearchCV(estimator=random_forest_model_test_2_base,
                                               param_grid=random_forest_hp_range_2,
                                               cv=3,
                                               verbose=1,
                                               n_jobs=-1)
random_forest_model_test_2_random.fit(train_X,train_Y)

best_hp_now_2=random_forest_model_test_2_random.best_params_
pprint(best_hp_now_2)

??大家可以看到，本部分代碼其實和1.3部分比較類似，我們著重講解random_forest_hp_range_2，其中，n_estimators設定為了[60,100,200]，這是由于我們剛剛得到的best_hp_now中n_estimators為100，那么我們就在100附近選取幾個值，作為新的n_estimators范圍；max_features也是一樣的，因為best_hp_now中max_features為'sqrt'，也就是輸入資料特征（自變數）的個數的平方根，而我這里自變數個數大概是150多個，因此其開平方之后就是12.24左右，那么就選擇其附近的兩個數（需要為整數），因此就選擇了[12,13]，其他的超引數取值也是類似的，這里我將'min_samples_leaf'也給注釋掉了是因為我跑了很多次發現，'min_samples_leaf'還是取1最好，那么就直接選擇為默認1（'min_samples_leaf'在不指定的情況下默認為1）即可，因為超引數范圍越小，程式跑的就越快，

??這里程式運行的次數就是每一種超引數取值個數的排列組合次數乘以交叉驗證的折數，也就是(2*3*2*3)*3=108次，我們來看看是不是108次：

??很明顯，沒有問題，就是108個fit，和前面的600次fit比起來，這樣就快很多了（這也是為什么我直接將'min_samples_leaf'與'bootstrap'注釋掉的原因；要是這兩個超引數也參與的話，那么假設他們兩個各有2個取值的話，總時間至少就要翻2*2=4倍），

??再來看看經過遍歷擇優后的最優超引數匹配取值best_hp_now_2，

??以上就是我們經過一次隨機擇優、一次遍歷擇優之后的超引數結果（不要忘記了'min_samples_leaf'與'bootstrap'還要分別取1和True，也就是默認值），如果大家感覺這個組合搭配還不是很好，那么可以繼續執行本文“1.4 超引數遍歷匹配擇優”部分1到2次，精度可能會有更進一步的提升，

1.5 模型運行與精度評定

??結束了上述超引數擇優程序，我們就可以進行模型運行、精度評定與結果輸出等操作，本部分內容除了第一句代碼（將最優超引陣列合分配給模型）之外，其余部分由于在之前的博客中已經詳細介紹過了，本文就不再贅述~大家直接查看文章Python實作隨機森林RF并對比自變數的重要性即可，

# Build RF regression model with optimal hyperparameters

random_forest_model_final=random_forest_model_test_2_random.best_estimator_

# Predict test set data

random_forest_predict=random_forest_model_test_2_random.predict(test_X)
random_forest_error=random_forest_predict-test_Y

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,
                     random_seed,random_forest_seed]
for i in range(len(excel_write_content)):
        exec("excel_write_sheet.cell(max_row+1,i+1).value=https://www.cnblogs.com/fkxxgis/p/excel_write_content[i]")
excel_file.save(write_excel_path)

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model_final.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

# Calculate the importance of variables

random_forest_importance=list(random_forest_model_final.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,
                                                                 random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

2 完整代碼

??本文所用完整代碼如下，

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 21 22:05:37 2021

@author: fkxxgis
"""

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Attention! Data Partition
# Attention! One-Hot Encoding

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

# Data import

train_data=https://www.cnblogs.com/fkxxgis/p/pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

# Search optimal hyperparameter

n_estimators_range=[int(x) for x in np.linspace(start=50,stop=3000,num=60)]
max_features_range=['auto','sqrt']
max_depth_range=[int(x) for x in np.linspace(10,500,num=50)]
max_depth_range.append(None)
min_samples_split_range=[2,5,10]
min_samples_leaf_range=[1,2,4,8]
bootstrap_range=[True,False]

random_forest_hp_range={'n_estimators':n_estimators_range,
                        'max_features':max_features_range,
                        'max_depth':max_depth_range,
                        'min_samples_split':min_samples_split_range,
                        'min_samples_leaf':min_samples_leaf_range
                        # 'bootstrap':bootstrap_range
                        }
pprint(random_forest_hp_range)

random_forest_model_test_base=RandomForestRegressor()
random_forest_model_test_random=RandomizedSearchCV(estimator=random_forest_model_test_base,
                                                   param_distributions=random_forest_hp_range,
                                                   n_iter=200,
                                                   n_jobs=-1,
                                                   cv=3,
                                                   verbose=1,
                                                   random_state=random_forest_seed
                                                   )
random_forest_model_test_random.fit(train_X,train_Y)

best_hp_now=random_forest_model_test_random.best_params_
pprint(best_hp_now)

# Grid Search

random_forest_hp_range_2={'n_estimators':[60,100,200],
                          'max_features':[12,13],
                          'max_depth':[350,400,450],
                          'min_samples_split':[2,3] # Greater than 1
                          # 'min_samples_leaf':[1,2]
                          # 'bootstrap':bootstrap_range
                          }
random_forest_model_test_2_base=RandomForestRegressor()
random_forest_model_test_2_random=GridSearchCV(estimator=random_forest_model_test_2_base,
                                               param_grid=random_forest_hp_range_2,
                                               cv=3,
                                               verbose=1,
                                               n_jobs=-1)
random_forest_model_test_2_random.fit(train_X,train_Y)

best_hp_now_2=random_forest_model_test_2_random.best_params_
pprint(best_hp_now_2)

# Build RF regression model with optimal hyperparameters

random_forest_model_final=random_forest_model_test_2_random.best_estimator_

# Predict test set data

random_forest_predict=random_forest_model_test_2_random.predict(test_X)
random_forest_error=random_forest_predict-test_Y

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,
                     random_seed,random_forest_seed]
for i in range(len(excel_write_content)):
        exec("excel_write_sheet.cell(max_row+1,i+1).value=https://www.cnblogs.com/fkxxgis/p/excel_write_content[i]")
excel_file.save(write_excel_path)

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model_final.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

# Calculate the importance of variables

random_forest_importance=list(random_forest_model_final.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,
                                                                 random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

??至此，大功告成，

轉載請註明出處，本文鏈接：https://www.uj5u.com/houduan/544177.html

標籤：Python

上一篇：AcWing 787.歸并排序（Java）

下一篇：用Python做一個小說下載器，從獲取資料到撰寫GUI界面