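This article walks through the code and analysis workflow for random forest (RF) regression and for variable importance analysis and ranking in Python. For the same workflow implemented in MATLAB, see this blog post [1].
The article has two parts: the first explains the code section by section, and the second gives the complete code.
1 Code Walkthrough
1.1 Modules and Data Preparation
First, import the required modules. Two of them, pydot and graphviz, are less commonly used; even with Anaconda I had to download and install them separately. If you also use Anaconda, you can follow this blog post for the download and installation steps.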
import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
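Next, we define the main variables the code uses later. There is no need to dwell on this part; skim it and keep reading, and the meaning of each variable will become clear when it is used.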
train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'
random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)
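Next, we need to import the input data.
Note that this article does not explain the following two data-processing steps in detail. When I wrote it, I had already run a deep learning regression on the same dataset, so I simply reused the input data prepared at that time. For these two steps, see the other blog posts referenced below.
- Splitting the initial data into training and test sets
- One-hot encoding of categorical variables
Regarding these two steps: first, splitting the data into training and test sets is indispensable in machine learning and deep learning; see section 2.4 of this blog post [2] or section 2.3 of this blog post [3]. Second, one-hot encoding of categorical variables is just as important for traditional machine learning methods such as random forest; see this blog post [4].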
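For readers who do not have those posts at hand, the two preparation steps can be sketched roughly as follows. This is a minimal illustration, not the preprocessing actually used for the data in this article: the toy table, the categorical column 'SoilType', and the 25% test share are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table; 'SoilType' is an invented categorical column.
raw_data = pd.DataFrame({
    'ID': range(8),
    'Temp06': [18.2, 19.5, 17.8, 20.1, 18.9, 19.0, 17.5, 20.4],
    'SoilType': ['clay', 'sand', 'loam', 'clay', 'sand', 'loam', 'clay', 'sand'],
    'Yield': [5200, 6100, 4800, 6500, 5600, 5900, 4700, 6600],
})

# One-hot encode the categorical column: one indicator column per category.
encoded_data = pd.get_dummies(raw_data, columns=['SoilType'])

# Hold out 25% of the rows as the test set; the seed makes the split repeatable.
train_part, test_part = train_test_split(encoded_data, test_size=0.25, random_state=44)

print(list(encoded_data.columns))
print(len(train_part), len(test_part))
```

Encoding before splitting is one simple way to make sure the training and test sets end up with the same columns.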
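As mentioned above, here we directly import data stored in .csv files that has already been split into training and test sets and whose categorical variables are already one-hot encoded. The first row of my data is the header, that is, the name of each column. For a detailed explanation of the .csv import code, see the data import part of this blog post [5].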
# Data import
'''
column_name=['EVI0610','EVI0626','EVI0712','EVI0728','EVI0813','EVI0829','EVI0914','EVI0930','EVI1016',
'Lrad06','Lrad07','Lrad08','Lrad09','Lrad10',
'Prec06','Prec07','Prec08','Prec09','Prec10',
'Pres06','Pres07','Pres08','Pres09','Pres10',
'SIF161','SIF177','SIF193','SIF209','SIF225','SIF241','SIF257','SIF273','SIF289',
'Shum06','Shum07','Shum08','Shum09','Shum10',
'Srad06','Srad07','Srad08','Srad09','Srad10',
'Temp06','Temp07','Temp08','Temp09','Temp10',
'Wind06','Wind07','Wind08','Wind09','Wind10',
'Yield']
'''
train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)
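1.2 Separating Features and Labels
Features and labels are, in other words, the independent and dependent variables. We separate the features from the labels in both the training set and the test set.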
# Separate independent and dependent variables
train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)
test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)
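As you can see, drop removes the label 'Yield' from the original data. It also removes 'ID', the sample number in the initial data, which has no further use and is therefore dropped along with the label. The variable train_X_column_name stores the title of each feature column (the feature names) for later use.
1.3 Building, Training, and Predicting with the RF Model
Next, we build the random forest model, train it, and then predict on the test set. Note that several important random forest hyperparameters (such as n_estimators below) have to be tuned by repeated trials to find the best values. For hyperparameter tuning in MATLAB, see section 1.1 of this blog post [1]; I will cover the Python approach in a later post.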
# Build RF regression model
random_forest_model=RandomForestRegressor(n_estimators=200,random_state=random_forest_seed)
random_forest_model.fit(train_X,train_Y)
# Predict test set data
random_forest_predict=random_forest_model.predict(test_X)
random_forest_error=random_forest_predict-test_Y
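Here, RandomForestRegressor builds the model and n_estimators is the number of trees. random_state is the random seed used when each tree draws its training samples by Bootstrap (random sampling with replacement) under the Bagging strategy; in scikit-learn it also seeds the random choice of candidate features at each split when max_features is smaller than the number of features. fit trains the model, predict makes the predictions, and the last line computes the prediction error.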
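Until that later post on tuning in Python, a simple grid search with cross-validation gives a rough idea of how such a search could look. This is only a sketch under assumptions: the data is synthetic (demo_X and demo_Y stand in for train_X and train_Y), and the candidate values in param_grid are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_X / train_Y (an assumption for this sketch).
demo_X, demo_Y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=44)

# Candidate hyperparameter values; deliberately small so the search runs quickly.
param_grid = {
    'n_estimators': [20, 50],
    'max_depth': [None, 5],
}

# 3-fold cross-validation, scored by negative RMSE (higher is better).
search = GridSearchCV(
    RandomForestRegressor(random_state=44),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(demo_X, demo_Y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

After the search, search.best_estimator_ holds a model refitted on all the data passed to fit, using the best combination found.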
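1.4 Plotting Predictions, Computing and Saving Accuracy Metrics
First, plot the predictions: a scatter plot of predicted against true values and a histogram of the error distribution. For an explanation of this code, see section 2.9 of this blog post [2].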
# Draw test plot
plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)
The two resulting figures are shown below.


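Next, compute and save the accuracy metrics. Here we use the Pearson correlation coefficient, the coefficient of determination, and RMSE, and save the metrics from each model run to an Excel file. Again, see section 2.9 of this blog post [2].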
# Verify the accuracy
random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
random_forest_RMSE))
# Save key parameters
excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,random_seed,random_forest_seed]
for i in range(len(excel_write_content)):
    # Write each value into the first empty row, one column per item
    excel_write_sheet.cell(max_row+1,i+1).value=excel_write_content[i]
excel_file.save(write_excel_path)
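To make the three metrics concrete, here is a small NumPy sketch that computes them from their definitions on made-up numbers. The helper names and the toy arrays are assumptions for illustration; the article's own code uses scipy.stats.pearsonr and sklearn.metrics instead.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    # Linear correlation between true and predicted values.
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    # Root mean squared error, in the same unit as the label.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy data: every prediction is too high by exactly 0.5.
toy_true = np.array([1.0, 2.0, 3.0, 4.0])
toy_pred = toy_true + 0.5

toy_r = pearson_r(toy_true, toy_pred)
toy_r2 = r2(toy_true, toy_pred)
toy_rmse = rmse(toy_true, toy_pred)
print(toy_r, toy_r2, toy_rmse)
```

In this toy case the Pearson coefficient is 1 while R2 is 0.8 and RMSE is 0.5: a constant bias leaves the correlation untouched but lowers R2, which is one reason to report the metrics together.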
1.5 Decision Tree Visualization
In this part, we use DOT, a graph description language, to draw one decision tree from the random forest.
# Draw decision tree visualizing plot
random_forest_tree=random_forest_model.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)
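Here, estimators_[5] is the 6th tree in the random forest (indices start from 0). In other words, we pick one tree out of the many (the number of trees is the hyperparameter n_estimators mentioned earlier) and draw it as a demonstration, as shown in the figure below.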

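As you can see, this single tree alone is already very large. Zooming in on the topmost part of the figure above (the root node) reveals the information stored in each node, as shown in the figure below.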
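Because a full tree is this large, a depth-limited text dump can be a handy complement to the picture when only the top few splits matter. The sketch below uses scikit-learn's export_text on a small synthetic model; the data, the feature names, and the depth of 2 are assumptions for illustration, and this is a different view from the PNG produced above, not a replacement for it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic stand-in for the real training data (an assumption for this sketch).
demo_X, demo_Y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=44)
demo_names = ['Feature{0}'.format(i) for i in range(demo_X.shape[1])]

demo_model = RandomForestRegressor(n_estimators=10, random_state=44)
demo_model.fit(demo_X, demo_Y)

# Print only the first levels of one tree; deeper branches are summarized.
tree_text = export_text(demo_model.estimators_[5], feature_names=demo_names, max_depth=2)
print(tree_text)
```

export_graphviz also accepts a max_depth argument, so the same idea can be applied to the DOT output if a smaller picture is preferred.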

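One note: the root node in the figure above shows samples=151, but I have 315 samples in total. Why is the sample count of this tree not the full sample count?
This is the essence of random forest: the input data of each tree (the data in its root node) is drawn at random, using the Bootstrap sampling under the Bagging strategy mentioned above. The results of all trees are then aggregated (this step is Aggregation; Bagging is short for Bootstrap plus Aggregation) to form the final result of the random forest.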
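The effect of Bootstrap sampling on the sample count can be checked with a quick simulation. Drawing n times with replacement from n samples leaves each sample out with probability (1 - 1/n)^n, which approaches e^-1 (about 0.368), so roughly 63.2% of the samples appear at least once. As far as I understand scikit-learn's implementation, the samples value in the exported tree counts these distinct samples, which would explain why the root shows fewer samples than the training set contains. The sketch below uses only NumPy; the sample size of 1000 and the 20 repetitions are arbitrary choices.

```python
import numpy as np

def bootstrap_unique_fraction(n_samples, seed):
    # Draw n_samples indices with replacement and measure the share of distinct ones.
    rng = np.random.default_rng(seed)
    drawn = rng.integers(low=0, high=n_samples, size=n_samples)
    return np.unique(drawn).size / n_samples

# Repeat the draw for 20 hypothetical trees.
fractions = [bootstrap_unique_fraction(1000, seed) for seed in range(20)]
mean_fraction = float(np.mean(fractions))

# Compare the simulated share with the theoretical limit 1 - 1/e.
print(round(mean_fraction, 3), round(1 - np.exp(-1), 3))
```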
1.6 Variable Importance Analysis
Here, we analyze variable importance and visualize it as a plot.
# Calculate the importance of variables
random_forest_importance=list(random_forest_model.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8))
for feature, importance in zip(train_X_column_name,random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')
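The resulting figure is shown below. Because I have so many features (independent variables), more than 150, the x-axis labels (the variable names) overlap; with a typical, smaller number of variables this would not be a problem.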
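When the labels overlap like this, one option is to plot only the most important variables, sorted, as horizontal bars. The sketch below is one possible way to do that and is not part of the original code: plot_top_importances is a hypothetical helper, and the 150 feature names and importance values are randomly generated stand-ins for train_X_column_name and random_forest_importance.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_top_importances(feature_names, importances, top_n=20, png_path=None):
    # Sort (name, importance) pairs from most to least important and keep the top_n.
    pairs = sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True)[:top_n]
    names = [name for name, _ in pairs]
    values = [value for _, value in pairs]
    fig, ax = plt.subplots(figsize=(6, 0.3 * len(pairs) + 1))
    # Reverse the order so the most important variable is drawn at the top.
    ax.barh(names[::-1], values[::-1])
    ax.set_xlabel('Importance')
    ax.set_ylabel('Variable')
    ax.set_title('Top {0} Variable Importances'.format(len(pairs)))
    fig.tight_layout()
    if png_path is not None:
        fig.savefig(png_path, dpi=150)
    plt.close(fig)
    return pairs

# Made-up names and importances that sum to 1, mimicking about 150 features.
rng = np.random.default_rng(44)
demo_names = ['Feature{0:03d}'.format(i) for i in range(150)]
demo_importances = rng.dirichlet(np.ones(150))

top_pairs = plot_top_importances(demo_names, demo_importances, top_n=15)
print(top_pairs[:3])
```

With the real model, the same helper could be called with train_X_column_name and random_forest_importance, which would also put the sorted list built in the code above to use.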

That concludes the section-by-section walkthrough of the code.
2 Complete Code
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 21 22:05:37 2021
@author: fkxxgis
"""
import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
# Attention! Data Partition
# Attention! One-Hot Encoding
train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'
random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)
# Data import
'''
column_name=['EVI0610','EVI0626','EVI0712','EVI0728','EVI0813','EVI0829','EVI0914','EVI0930','EVI1016',
'Lrad06','Lrad07','Lrad08','Lrad09','Lrad10',
'Prec06','Prec07','Prec08','Prec09','Prec10',
'Pres06','Pres07','Pres08','Pres09','Pres10',
'SIF161','SIF177','SIF193','SIF209','SIF225','SIF241','SIF257','SIF273','SIF289',
'Shum06','Shum07','Shum08','Shum09','Shum10',
'Srad06','Srad07','Srad08','Srad09','Srad10',
'Temp06','Temp07','Temp08','Temp09','Temp10',
'Wind06','Wind07','Wind08','Wind09','Wind10',
'Yield']
'''
train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)
# Separate independent and dependent variables
train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)
test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)
# Build RF regression model
random_forest_model=RandomForestRegressor(n_estimators=200,random_state=random_forest_seed)
random_forest_model.fit(train_X,train_Y)
# Predict test set data
random_forest_predict=random_forest_model.predict(test_X)
random_forest_error=random_forest_predict-test_Y
# Draw test plot
plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)
# Verify the accuracy
random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
random_forest_RMSE))
# Save key parameters
excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,random_seed,random_forest_seed]
for i in range(len(excel_write_content)):
    # Write each value into the first empty row, one column per item
    excel_write_sheet.cell(max_row+1,i+1).value=excel_write_content[i]
excel_file.save(write_excel_path)
# Draw decision tree visualizing plot
random_forest_tree=random_forest_model.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)
# Calculate the importance of variables
random_forest_importance=list(random_forest_model.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8))
for feature, importance in zip(train_X_column_name,random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

