Author: 韓信子@ShowMeAI
Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
Article URL: https://www.showmeai.tech/article-detail/316
Notice: All rights reserved. To repost, please contact the platform and the author and credit the source.

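When traveling, accommodation is one of the things people care about most. Abroad, the home-sharing model represented by Airbnb has reshaped the hotel industry, and many travelers now prefer booking an Airbnb to a hotel. In China, platforms such as Meituan and Fliggy also host a large number of homestay listings.

In an era of open, accessible information, can we collect data and build a machine learning model that predicts listing prices and helps us plan trips more intelligently? We can. Below, ShowMeAI uses Airbnb listing data for the Greater Manchester area (as of March 2022) to walk through the full process of data analysis, mining, and modeling. The same approach applies to the domestic platforms readers may be more familiar with.

The project and the Airbnb homestay data come from Inside Airbnb, a site that provides data and advocacy about Airbnb's impact on residential communities. The data is available from that source, or you can get the copy we have archived on ShowMeAI's Baidu Netdisk.

Dataset download (Baidu Netdisk): reply 『實戰』 to the official account 『ShowMeAI研究中心』 to get the 『Airbnb民宿資料』 dataset for this article, [22] Homestay price prediction model based on Airbnb data.

ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub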

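Business questions

Before mining and modeling, we need a solid understanding of the business scenario and the data. We first list the business questions we care about here and then answer them through data analysis:

Which districts or towns have the most Airbnb listings?
Which property types are the most popular?
What do Airbnb listing prices look like in Greater Manchester?
How are listings distributed across hosts?
Which room types are available in Greater Manchester?
How would a machine learning model go about predicting Airbnb prices in this area?
Which features matter most when predicting Airbnb prices in Greater Manchester?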

Loading and first look at the data

We start by importing the analysis, mining, and modeling libraries used in this article.

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Next we read the listing data for Greater Manchester.

gm_listings = pd.read_csv('gm_listings-2.csv')
gm_calendar = pd.read_csv('calendar-2.csv')
gm_reviews = pd.read_csv('reviews-2.csv')

A first look at the data:

gm_listings.head()

gm_listings.shape
# (3584, 74)
gm_listings.columns

gm_calendar.head()

gm_reviews.head()

From this first look, the Greater Manchester listings dataset has 3584 rows and 74 columns, covering information about hosts, listing types, areas, and ratings.

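Data cleaning

Data cleaning is a core step of the feature engineering stage in applied machine learning. The methods involved are covered in the corresponding ShowMeAI tutorial:

Machine Learning in Practice | A complete guide to feature engineering

Column cleanup

The dataset has many columns, and some of them are messy, so some cleanup is needed. Columns that hold URLs, for example, contribute little to the final prediction. We drop them together with the other columns listed in the code.

# Drop URL columns and other columns not used for prediction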
def drop_function(df):
    df = df.drop(columns=['listing_url', 'description', 'host_thumbnail_url', 'host_picture_url', 'latitude', 'longitude', 'picture_url', 'host_url', 'host_location', 'neighbourhood', 'neighbourhood_cleansed', 'host_about', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped'])
    
    return df

gm_df = drop_function(gm_listings)

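After dropping these columns, the data is much cleaner.

Handling missing values

The data also contains missing values, which we analyze and handle:

# Percentage of missing values per column
(gm_df.isnull().sum()/gm_df.shape[0])* 100

The result: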

id                                                0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
neighborhood_overview                            41.266741
host_id                                           0.000000
host_name                                         0.000000
host_since                                        0.000000
host_response_time                               10.212054
host_response_rate                               10.212054
host_acceptance_rate                              5.636161
host_is_superhost                                 0.000000
host_neighbourhood                               91.657366
host_listings_count                               0.000000
host_total_listings_count                         0.000000
host_verifications                                0.000000
host_has_profile_pic                              0.000000
host_identity_verified                            0.000000
neighbourhood_group_cleansed                      0.000000
property_type                                     0.000000
room_type                                         0.000000
accommodates                                      0.000000
bathrooms                                       100.000000
bathrooms_text                                    0.306920
bedrooms                                          4.687500
beds                                              2.120536
amenities                                         0.000000
price                                             0.000000
minimum_nights                                    0.000000
maximum_nights                                    0.000000
minimum_minimum_nights                            0.000000
maximum_minimum_nights                            0.000000
minimum_maximum_nights                            0.000000
maximum_maximum_nights                            0.000000
minimum_nights_avg_ntm                            0.000000
maximum_nights_avg_ntm                            0.000000
calendar_updated                                100.000000
number_of_reviews                                 0.000000
number_of_reviews_ltm                             0.000000
number_of_reviews_l30d                            0.000000
first_review                                     19.810268
last_review                                      19.810268
review_scores_rating                             19.810268
review_scores_accuracy                           20.089286
review_scores_cleanliness                        20.089286
review_scores_checkin                            20.089286
review_scores_communication                      20.089286
review_scores_location                           20.089286
review_scores_value                              20.089286
license                                         100.000000
instant_bookable                                  0.000000
calculated_host_listings_count                    0.000000
calculated_host_listings_count_entire_homes       0.000000
calculated_host_listings_count_private_rooms      0.000000
calculated_host_listings_count_shared_rooms       0.000000
reviews_per_month                                19.810268
dtype: float64

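We handle missing values according to how much is missing:

Columns with a high share of missing values: license, calendar_updated, bathrooms, and host_neighbourhood are more than 90% NaN, and neighborhood_overview is 41% NaN and holds free text. We drop these columns outright.
Numeric columns with few missing values: we fill them with the column mean, which leaves each column's mean unchanged. These include bedrooms, beds, review_scores_rating, review_scores_accuracy, and the other review score columns.
Categorical columns such as bathrooms_text and host_response_time: we fill them with the mode.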

# Drop columns with a high share of missing values
def drop_function_2(df):
    df = df.drop(columns=['license', 'calendar_updated', 'bathrooms', 'host_neighbourhood', 'neighborhood_overview'])
    
    return df

gm_df = drop_function_2(gm_df)

# Fill with the column mean
def input_mean(df, column_list):
    for column in column_list:
        df[column] = df[column].fillna(df[column].mean())
    
    return df

column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
              'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
              'review_scores_value', 'reviews_per_month',
              'bedrooms', 'beds']
gm_df = input_mean(gm_df, column_list)

# Fill with the mode (most frequent value)
def input_mode(df, column_list):
    for column in column_list:
        df[column] = df[column].fillna(df[column].mode()[0])
    
    return df

column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate', 
               'host_response_rate', 'host_response_time']

gm_df = input_mode(gm_df, column_list)

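Column encoding

Columns such as host_is_superhost and instant_bookable store true/false as the strings 't' and 'f'. We replace them with 1 and 0. (has_availability was already dropped in the column cleanup above, so only the remaining columns are affected.)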

gm_df = gm_df.replace({'host_is_superhost': 't', 'host_has_profile_pic': 't', 'host_identity_verified': 't', 'has_availability': 't', 'instant_bookable': 't'}, 1)

gm_df = gm_df.replace({'host_is_superhost': 'f', 'host_has_profile_pic': 'f', 'host_identity_verified': 'f', 'has_availability': 'f', 'instant_bookable': 'f'}, 0)

Check the distribution after replacement:

gm_df['host_is_superhost'].value_counts()

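Format conversion

Price columns are still strings containing symbols such as "$". We clean them and convert them to numeric values.

def string_to_int(df, column):
    # Strip the currency symbol and thousands separators as literal characters
    df[column] = df[column].str.replace("$", "", regex=False)
    df[column] = df[column].str.replace(",", "", regex=False)
    
    # Convert to a numeric type
    df[column] = pd.to_numeric(df[column]).astype(int)
    
    return df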

gm_df = string_to_int(gm_df, 'price')
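The conversion can be checked quickly on toy data. This is a minimal sketch: the series `prices` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy price strings in the same format as the 'price' column (values are made up)
prices = pd.Series(['$79.00', '$1,250.00', '$8.00'])

# regex=False treats "$" as a literal character; read as a regular expression,
# "$" means "end of string", so the symbol itself would not be matched
cleaned = prices.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
parsed = pd.to_numeric(cleaned).astype(int)

print(parsed.tolist())
```

Passing `regex=False` explicitly keeps the behavior the same across pandas versions, whose default for this argument has changed over time.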

Encoding list-valued columns

Columns such as host_verifications and amenities hold list-formatted values. We encode them as dummy variables.

# Inspect the list-valued columns
gm_df_copy = gm_df.copy()
gm_df_copy['amenities'].head()

gm_df_copy['host_verifications'].head()

# Dummy-variable encoding; regex=False so that "[" and "]" are treated as literal characters
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('"', '', regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace(']', "", regex=False)
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('[', "", regex=False)

df_amenities = gm_df_copy['amenities'].str.get_dummies(sep = ",")

gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace("'", "", regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace(']', "", regex=False)
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace('[', "", regex=False)

df_host_ver = gm_df_copy['host_verifications'].str.get_dummies(sep = ",")

The encoded result:

df_amenities.head()
df_host_ver.head()

# Drop the original columns
gm_df = gm_df.drop(['host_verifications', 'amenities'], axis=1)
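The list encoding can be sketched on toy data as follows. The series `amenities` and its values are invented for illustration. Splitting on a comma followed by a space is a choice made in this sketch (the code above splits on the comma alone); it avoids dummy column names that start with a space.

```python
import pandas as pd

# Toy amenities strings in the raw list format (values are made up)
amenities = pd.Series(['["Wifi", "Kitchen"]', '["Wifi", "Gym"]', '[]'])

# Remove brackets and double quotes with a single character-class regex
flat = amenities.str.replace(r'[\[\]"]', '', regex=True)

# Split on ", " (comma plus space) so that column names carry no leading space
dummies = flat.str.get_dummies(sep=', ')
print(dummies)
```

Each distinct amenity becomes one 0/1 column, and a listing with an empty list gets a row of zeros.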

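Data exploration

Next we run a more thorough exploratory data analysis.

For the libraries used in this EDA section, see ShowMeAI's cheat sheets and tutorials:

Data Science Cheat Sheets | Pandas cheat sheet

Illustrated Data Analysis: from beginner to expert tutorial series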

Which districts have the most listings?

gm_df['neighbourhood_group_cleansed'].value_counts()

# Listing counts per district, sorted ascending so the largest bar is drawn at the top
counts = gm_df['neighbourhood_group_cleansed'].value_counts().sort_values()

# Build a new dataframe from the counts
bar_data = pd.DataFrame({'Towns': counts.index, 'number_of_listings': counts.values})
bar_data['fraction_of_total'] = bar_data['number_of_listings'] / counts.sum()

# Plot
bar_data.plot(kind='barh', x='Towns', y='fraction_of_total', figsize=(8,6))
plt.title('Towns with the Most listings');
plt.xlabel('Fraction of Total Listings');

Manchester holds most of the listings in Greater Manchester, with 53% of the total (1849), followed by Salford with 17% and Trafford with 9%.

Price distribution of Airbnb listings in Greater Manchester

gm_df['price'].mean(), gm_df['price'].min(), gm_df['price'].max(),gm_df['price'].median()
# (143.47600446428572, 8, 7372, 79.0)

The mean listing price is $143 and the median is $79. The highest price in the dataset is $7372.

# Define price bands
labels = ['$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400', '$400 - $500', '$500 - $1000', '$1000 - $8000']
price_cuts = pd.cut(gm_df['price'], bins = [0, 100, 200, 300, 400, 500, 1000, 8000], right=True, labels= labels)

# Build a dataframe from the price bands
price_clusters = pd.DataFrame(price_cuts).rename(columns={'price': 'price_clusters'})

# Append it to the original dataframe
gm_df = pd.concat([gm_df, price_clusters], axis=1)

# Plot the distribution, annotating each bar with its share of listings
def price_cluster_plot(df, column, title):    
    plt.figure(figsize=(8,6));
    yx = sb.histplot(data=df[column]);

    total = float(df[column].count())
    for p in yx.patches:
        height = p.get_height()
        yx.text(p.get_x() + p.get_width()/2.,height+5,'{:1.1f}%'.format((height/total)*100), ha='center')
    yx.set_title(title);
    plt.xticks(rotation=90)
    
    return yx

price_cluster_plot(gm_df, column='price_clusters', 
                   title='Price distribution of Airbnb Listings in the Greater Manchester Area');

From the analysis and chart above, 65.4% of listings are priced between $0 and $100, and 23.4% between $100 and $200. The distribution also has a pronounced long tail. The very expensive listings can be treated as outliers, and they may affect the analysis.
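How the banding treats prices that sit exactly on a bin edge can be seen on toy data. This is a sketch only: `toy_prices` is made up, while the bins and labels are the ones used above.

```python
import pandas as pd

# Toy prices (made up); 100 sits exactly on a bin edge
toy_prices = pd.Series([8, 79, 100, 101, 250, 7372])

bins = [0, 100, 200, 300, 400, 500, 1000, 8000]
labels = ['$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400',
          '$400 - $500', '$500 - $1000', '$1000 - $8000']

# right=True makes each bin (low, high], so a price of exactly 100 falls in '$0 - $100'
bands = pd.cut(toy_prices, bins=bins, right=True, labels=labels)

# Share of listings per band, in percent
share = bands.value_counts(normalize=True).mul(100).round(1)
print(share[share > 0])
```

With `right=True` a price of 0 would fall outside every bin and become NaN. That does not matter here because the minimum price in the dataset is 8.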

Which property types are the most popular?

# Aggregate by property type and sort by review count
ax = gm_df.groupby('property_type').agg(
    median_rating=('review_scores_rating', 'median'),number_of_reviews=('number_of_reviews', 'max')).sort_values(
by='number_of_reviews', ascending=False).reset_index()

ax.head()

Among the 10 property types with the most reviews, Entire rental unit has the most, followed by Private room in rental unit.

# Visualize the top 10 rows (iloc is end-exclusive, whereas loc[:10] would return 11 rows)
bx = ax.iloc[:10]
bx = sb.boxplot(data=bx, x='median_rating', y='property_type')
bx.set_xlim(4.5, 5)
plt.title('Most Enjoyed Property types');
plt.xlabel('Median Rating');
plt.ylabel('Property Type')

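Hosts and their listings

# Hosts with the most listings, as a percentage of all listings
host_share = gm_df['host_name'].value_counts(normalize=True) * 100
host_df = pd.DataFrame({'name': host_share.index, 'perc_count': host_share.values})
host_df.head(10)

# Note: .loc slicing is label-inclusive, so this sums rows 0 through 10 (11 hosts)
host_df['perc_count'].loc[:10].sum()

From this analysis, the hosts with the most listings (the first 11 rows summed above) account for 13.6% of all listings.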

Room types offered in Greater Manchester

gm_df['room_type'].value_counts()

# Plot the distribution, annotating each bar with its share of listings
zx = sb.countplot(data=gm_df, x='room_type')

total = float(gm_df['room_type'].count())
for p in zx.patches:
    height = p.get_height()
    zx.text(p.get_x() + p.get_width()/2.,height+5, '{:1.1f}%'.format((height/total)*100), ha='center')
zx.set_title('Plot showing different type of rooms available');
plt.xlabel('Room')

Most listings are entire homes/apartments, at 60% of the total, followed by private rooms at 39%. Shared rooms and hotel rooms account for 0.7% and 0.5% respectively.

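Machine learning modeling

Next we use regression models to estimate listing prices.

Feature engineering

For more on feature engineering, see the corresponding ShowMeAI tutorial:

Machine Learning in Practice | A complete guide to feature engineering

We first apply feature engineering to the raw data to obtain features suitable for modeling.

# Look at the dataset at this point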
gm_df.head()

# Dataset for regression
gm_regression_df = gm_df.copy()

# Drop columns that are not useful for modeling
gm_regression_df = gm_regression_df.drop(columns=['id', 'scrape_id', 'last_scraped', 'name', 'host_id', 'host_since', 'first_review', 'last_review', 'price_clusters', 'host_name'])

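# Look at the data again
gm_regression_df.head()

The host_response_rate and host_acceptance_rate columns still carry a percent sign, so a little more cleaning is needed.

# Remove the percent sign before converting to numeric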
gm_regression_df['host_response_rate'] =  gm_regression_df['host_response_rate'].str.replace("%", "")

gm_regression_df['host_acceptance_rate'] =  gm_regression_df['host_acceptance_rate'].str.replace("%", "")
   
# convert to int
gm_regression_df['host_response_rate'] = pd.to_numeric(gm_regression_df['host_response_rate']).astype(int)
gm_regression_df['host_acceptance_rate'] =  pd.to_numeric(gm_regression_df['host_acceptance_rate']).astype(int)

# Check the converted result
gm_regression_df['host_response_rate'].head()

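The bathrooms_text column mixes numbers and text, so we process it further.

# Inspect the original column
gm_regression_df['bathrooms_text'].value_counts()

# Copy the rows whose text matches a keyword into a new column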
def split_bathroom(df, column, text, new_column):
    df_2 = df[df[column].str.contains(text, case=False)]
    df.loc[df[column].str.contains(text, case=False), new_column] = df_2[column]
    return df

# Apply the function above
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='shared', new_column='shared_bath')
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='private', new_column='private_bath')
# Inspect the shared_bath column
gm_regression_df['shared_bath'].value_counts()

# Inspect the private_bath column
gm_regression_df['private_bath'].value_counts()

gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("private bath", "pb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("private baths", "pbs", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("shared bath", "sb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("shared baths", "sb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("shared half-bath", "sb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("private half-bath", "sb", case=False)

gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')

# Keep only the first token of each entry (the count, or a word such as "Shared")
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].str.split(" ").str[0]
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].str.split(" ").str[0]
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].str.split(" ").str[0]

# Fill missing values with 0
gm_regression_df = gm_regression_df.fillna(0)

# Entries without a leading number describe a half bath
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)

# Convert to numeric; note that astype(int) truncates fractional counts such as 0.5 or 1.5
gm_regression_df['shared_bath'] = pd.to_numeric(gm_regression_df['shared_bath']).astype(int)
gm_regression_df['private_bath'] = pd.to_numeric(gm_regression_df['private_bath']).astype(int)
gm_regression_df['bathrooms_new'] =  pd.to_numeric(gm_regression_df['bathrooms_new']).astype(int)

# Inspect the processed columns
gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()
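For comparison, the same three columns can be derived more compactly with a regular expression. This is an alternative sketch, not the method used above: the series `text` is made up, and unlike the code above it keeps fractional counts as floats instead of truncating them to integers.

```python
import pandas as pd

# Toy bathrooms_text values (made up, in the style of the real column)
text = pd.Series(['1 bath', '1.5 shared baths', 'Shared half-bath',
                  '2 private baths', 'Half-bath'])

# Leading number if present; entries such as "Half-bath" have none and count as 0.5
count = pd.to_numeric(text.str.extract(r'^(\d+(?:\.\d+)?)')[0])
count = count.fillna(0.5)

is_shared = text.str.contains('shared', case=False)
is_private = text.str.contains('private', case=False)

# Route each count to exactly one of the three columns
parsed = pd.DataFrame({
    'shared_bath': count.where(is_shared, 0.0),
    'private_bath': count.where(is_private, 0.0),
    'bathrooms_new': count.where(~is_shared & ~is_private, 0.0),
})
print(parsed)
```

Whether half baths should count as 0.5 or be dropped is a modeling decision, and the choice changes the feature values the models see.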

Next we encode the categorical columns. Depending on what each column means, we use ordinal encoding or one-hot encoding.

# Ordinal encoding
def encoder(df):
    for column in df[['neighbourhood_group_cleansed', 'property_type']].columns:
        labels = df[column].astype('category').cat.categories.tolist()
        replace_map = {column : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
        df.replace(replace_map, inplace=True)
        print(replace_map)
    
    return df 

gm_regression_df = encoder(gm_regression_df)
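The mapping built by `encoder` can be reproduced on toy data with pandas category codes. This is a minimal sketch: the frame `toy` and the helper column `property_type_code` are hypothetical names used only here.

```python
import pandas as pd

# Toy column (values are made up)
toy = pd.DataFrame({'property_type': ['Entire rental unit', 'Private room in rental unit',
                                      'Entire rental unit', 'Entire condo']})

# Categories are sorted alphabetically; codes start at 0, so add 1 to start at 1
cat = toy['property_type'].astype('category')
toy['property_type_code'] = cat.cat.codes + 1

# The label-to-number mapping, equivalent to the replace_map printed by encoder
mapping = {label: i + 1 for i, label in enumerate(cat.cat.categories)}
print(mapping)
```

The numbers follow alphabetical order, so they impose an ordering on the categories that has no real meaning. Tree-based models tolerate this reasonably well, while linear models may be misled by it.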

For the host_response_time and room_type columns we use one-hot encoding (dummy variables).

host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response')
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type')

# Append the encoded columns
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)

# Drop the original columns
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'])

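We also process the df_amenities frame built earlier and append it to the features.

# Keep the 150 most frequent amenities (assumption: the original snippet indexed a
# column named 'amenities' that does not exist in df_3, so the selection rule is a guess)
amenity_counts = df_amenities.sum().sort_values(ascending=False)
features = amenity_counts.index[:150].to_list()
amenities_updated = df_amenities.filter(items=features)
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)

Check the dimensions of the final data:

gm_regression_df.shape
# (3584, 198)

We end up with 198 columns. To limit multicollinearity between features, we use the variance inflation factor (VIF) to select features for the models. Features with a VIF above 10 are dropped, because most of their variance can be explained by other features in the dataset.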

# Compute the VIF of every feature
vif_model = gm_regression_df.drop(['price'], axis=1)
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]

# Keep features whose VIF is at most 10
vif_df_new = vif_df[vif_df['VIF']<=10]
feature_list =  vif_df_new['feature'].to_list()

# Select the data for these features
model_df = gm_regression_df.filter(items=(feature_list))
model_df.head()
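The idea behind VIF is that feature i is regressed on all other features, and VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared of that regression. The sketch below computes it by hand with NumPy on synthetic data. The helper `vif` and the columns a, b, c are hypothetical, and the sketch includes an intercept term, so its numbers need not match the statsmodels call above exactly.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns (plus an intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

Columns a and c explain each other almost completely, so both get a VIF far above 10, while the independent column b stays close to 1.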

We then join the price target column to build the complete dataset.

price_col = gm_regression_df['price']
model_df = model_df.join(price_col)

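Machine learning algorithms

We use several typical regression algorithms here: linear regression, RandomForestRegressor, Lasso regression, and GradientBoostingRegressor.

For how to apply these algorithms, see the corresponding ShowMeAI tutorials and articles:

Machine Learning in Practice: hands-on machine learning series

Machine Learning in Practice | Getting started with SKLearn and simple use cases

Machine Learning in Practice | The complete SKLearn application guide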

Linear regression

def linear_reg(df, test_size=0.3, random_state=42):
    '''
    Build the model and return its evaluation results.
    Input: data dataframe
    Output: feature importance and evaluation metrics (RMSE and R-squared)
    '''
    
    X = df.drop(columns=['price'])
    y = df[['price']]
    X_columns = X.columns
    
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state=random_state)

    # Linear regression model
    clf = LinearRegression()
    
    # Candidate parameters (n_jobs only controls parallelism and does not change the fit)
    parameters = {
                  'n_jobs': [1, 2, 5, 10, 100],
                  'fit_intercept': [True, False]
                 
                  }
    
    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, verbose=3)  
    cv.fit(X_train,y_train)
    
    # Predict on the test set
    pred = cv.predict(X_test)
    
    # Evaluate the model
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse **.5
     
    # Best parameters and fitted coefficients
    best_par = cv.best_params_
    coefficients = cv.best_estimator_.coef_
        
    # Feature importance as the absolute value of each coefficient
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, columns=X_columns).T
    feature_importance.columns = ['importance']
    feature_importance = feature_importance.sort_values('importance', ascending=False)
    
    print("The model performance for testing set")
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))
    print("\n")
    
    return feature_importance, rmse, r2
    
linear_feat_importance, linear_rmse, linear_r2 = linear_reg(model_df)
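The two metrics returned by every model function can be computed by hand, which makes clear what they measure. This is a small worked example on made-up numbers: `y_true` and `y_pred` are hypothetical and not model output.

```python
# Toy true and predicted prices (made up)
y_true = [50.0, 80.0, 120.0, 150.0]
y_pred = [60.0, 70.0, 110.0, 160.0]

n = len(y_true)
mean_true = sum(y_true) / n

ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares

rmse = (ss_res / n) ** 0.5      # typical error, in the same unit as the price
r2 = 1 - ss_res / ss_tot        # share of the price variance the predictions explain
print(rmse, r2)
```

Because RMSE squares each error before averaging, a handful of very expensive listings can dominate it. That is worth keeping in mind when reading the results below.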

Random forest

def random_forest(df):
    '''
    Build the model and return its evaluation results.
    Input: data dataframe
    Output: evaluation metrics (RMSE and R-squared), best parameters, and feature importance
    '''
    
    X = df.drop(['price'], axis=1)
    X_columns = X.columns
    
    y = df['price']

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Random forest model
    clf = RandomForestRegressor()
    
    # Candidate parameters: 5 x 3 = 15 combinations
    parameters = {
                'n_estimators': [50, 100, 200, 300, 400],
                'max_depth': [80, 90, 100]
                     }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)

    # Predict on the test set
    pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, pred)
    rmse = mse**.5
    r2 = r2_score(y_test, pred)
      
    # Best hyperparameters
    best_par = model.best_params_

    # Feature importance via permutation importance on the test set
    r = permutation_importance(model, X_test, y_test,
                           n_repeats=10,
                           random_state=0)
    perm = pd.DataFrame(columns=['AVG_Importance'], index=[i for i in X_train.columns])
    perm['AVG_Importance'] = r.importances_mean
    perm = perm.sort_values(by='AVG_Importance', ascending=False);
    
    return rmse, r2, best_par, perm

# Run the model
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(model_df)

The output:

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................max_depth=80, n_estimators=50; total time=   2.4s
[CV 2/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 4/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 5/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 1/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 2/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 3/5] END .................max_depth=80, n_estimators=100; total time=   3.9s
[CV 4/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 5/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 1/5] END .................max_depth=80, n_estimators=200; total time=   7.5s
[CV 2/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 3/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 4/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 5/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 1/5] END .................max_depth=80, n_estimators=300; total time=  11.3s
[CV 2/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 3/5] END .................max_depth=80, n_estimators=300; total time=  11.7s
[CV 4/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 5/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 1/5] END .................max_depth=80, n_estimators=400; total time=  15.1s
[CV 2/5] END .................max_depth=80, n_estimators=400; total time=  16.4s
[CV 3/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 4/5] END .................max_depth=80, n_estimators=400; total time=  15.2s
[CV 5/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 1/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 2/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 4/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 5/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 1/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 2/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 3/5] END .................max_depth=90, n_estimators=100; total time=   4.0s
[CV 4/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 5/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 1/5] END .................max_depth=90, n_estimators=200; total time=   8.7s
[CV 2/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 3/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 4/5] END .................max_depth=90, n_estimators=200; total time=   7.7s
[CV 5/5] END .................max_depth=90, n_estimators=200; total time=   8.0s
[CV 1/5] END .................max_depth=90, n_estimators=300; total time=  11.6s
[CV 2/5] END .................max_depth=90, n_estimators=300; total time=  11.8s
[CV 3/5] END .................max_depth=90, n_estimators=300; total time=  12.2s
[CV 4/5] END .................max_depth=90, n_estimators=300; total time=  12.0s
[CV 5/5] END .................max_depth=90, n_estimators=300; total time=  13.2s
[CV 1/5] END .................max_depth=90, n_estimators=400; total time=  15.6s
[CV 2/5] END .................max_depth=90, n_estimators=400; total time=  15.9s
[CV 3/5] END .................max_depth=90, n_estimators=400; total time=  16.1s
[CV 4/5] END .................max_depth=90, n_estimators=400; total time=  15.7s
[CV 5/5] END .................max_depth=90, n_estimators=400; total time=  15.8s
[CV 1/5] END .................max_depth=100, n_estimators=50; total time=   1.9s
[CV 2/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 3/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 4/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 5/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 1/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 2/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 3/5] END ................max_depth=100, n_estimators=100; total time=   4.1s
[CV 4/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 5/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 1/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 2/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 3/5] END ................max_depth=100, n_estimators=200; total time=   8.1s
[CV 4/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 5/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 1/5] END ................max_depth=100, n_estimators=300; total time=  11.8s
[CV 2/5] END ................max_depth=100, n_estimators=300; total time=  12.0s
[CV 3/5] END ................max_depth=100, n_estimators=300; total time=  12.8s
[CV 4/5] END ................max_depth=100, n_estimators=300; total time=  11.4s
[CV 5/5] END ................max_depth=100, n_estimators=300; total time=  11.5s
[CV 1/5] END ................max_depth=100, n_estimators=400; total time=  15.1s
[CV 2/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 3/5] END ................max_depth=100, n_estimators=400; total time=  15.6s
[CV 4/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 5/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
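The first line of this log can be checked by counting grid combinations. The sketch below uses only the standard library. It also shows a Python detail worth knowing when writing grids: if a key such as 'max_depth' appears twice in a dict literal, only the last value is kept, which is why the log contains only depths of 80, 90, and 100.

```python
from itertools import product

# A repeated key in a dict literal keeps only the last value
parameters = {
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': [2, 3, 4, 5],
    'max_depth': [80, 90, 100],
}

# Every combination of one value per parameter is one candidate
candidates = list(product(*parameters.values()))
n_folds = 5
total_fits = len(candidates) * n_folds
print(parameters['max_depth'], len(candidates), total_fits)
```

To search both depth ranges, the two lists would have to be merged into a single 'max_depth' entry.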

The final random forest results:

r_forest_rmse, r_forest_r2
# (218.7941962807868, 0.4208644494689676)

GBDT

def GBDT_model(df):
    '''
    Build the model and return its evaluation results.
    Input: data dataframe
    Output: evaluation metrics (R-squared, MSE, RMSE) and the top 10 feature importances
    '''
    
    X = df.drop(['price'], axis=1)
    Y = df['price']
    X_columns = X.columns

    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)
    

        
    clf = GradientBoostingRegressor()
    
    
    parameters = {

                'learning_rate': [0.1, 0.5, 1],
                'min_samples_leaf': [10, 20, 40 , 60]
                
        
                     }
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    
    model = cv
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse**.5
        
    coefficients = model.best_estimator_.feature_importances_

    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, index= X_columns,
                                      columns=['importance']).sort_values('importance', ascending=False)[:10]
    
    return r2, mse, rmse, feature_importance

GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(model_df)
GBDT_r2, GBDT_rmse
# (0.46352992147034244, 210.58063809645563)

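Results and analysis

So far the random forest behaves most stably. The gradient boosting model reaches a slightly higher R2 (about 0.46 versus 0.42), but its RMSE (about 210.6) is still large next to a median price of 79. Boosting models are sensitive to outliers, so the outliers in the dataset are a likely cause.

Next we try an optimization: remove the outliers from the dataset and see whether model performance improves.

Improving the results

The outliers were already identified earlier, in the price distribution. We handle them with a statistical rule.

# Compute the upper price bound with the IQR rule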
q3, q1 = np.percentile(model_df['price'], [75, 25])
iqr = q3 - q1
q3 + (iqr*1.5)

# Result: 245.0

We treat any price above $245 as an outlier and drop it.

new_model_df = model_df[model_df['price']<=245]

# Plot the price distribution after removing outliers
sb.histplot(new_model_df['price'])
plt.title('New price distribution in the dataset')
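The rule used here is the common IQR fence: any value above Q3 + 1.5 × IQR is treated as an outlier. The sketch below applies it to made-up prices; the array `prices` is hypothetical.

```python
import numpy as np

# Toy prices (made up) with one extreme value
prices = np.array([40, 55, 60, 79, 90, 110, 150, 7372])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Keep everything at or below the fence
kept = prices[prices <= upper_fence]
print(upper_fence, kept.tolist())
```

Only the upper fence is needed for prices. The lower fence, Q1 - 1.5 × IQR, is usually negative for a right-skewed price distribution.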

Re-run the algorithms:

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(new_model_df)
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(new_model_df)
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(new_model_df)

This gives a new set of results.

Attribution analysis

Based on our models, which factors matter most when predicting Airbnb listing prices in Greater Manchester?

r_feature_importance = r_forest_importance.reset_index()
r_feature_importance = r_feature_importance.rename(columns={'index':'Feature'})
r_feature_importance[:15]

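# Plot the 15 most important features
r_feature_importance[:15].sort_values(by='AVG_Importance').plot(kind='barh', x='Feature', y='AVG_Importance', figsize=(8,6));
plt.title('Top 15 Most Important Features');

The important features reported by our model include:

accommodates: the maximum number of guests the listing can host.
bathrooms_new: the number of bathrooms not marked as shared or private.
minimum_nights: the minimum number of nights that can be booked.
number_of_reviews: the total number of reviews.
Free street parking: the presence of free street parking is the amenity with the greatest influence on the model's pricing.
Gym: gym facilities.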
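These rankings come from permutation importance: shuffle one feature at a time and measure how much the model's score drops. The sketch below illustrates the idea with NumPy only. It uses synthetic data and a plain least-squares model instead of the random forest, and it scores on the training data for brevity, so it is a simplified illustration and not a reproduction of the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
# Synthetic target: column 0 matters a lot, column 1 a little, column 2 not at all
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit ordinary least squares with an intercept
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def r2(X_eval):
    """R-squared of the fitted model on the given feature matrix."""
    pred = np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

baseline = r2(X)
importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(10):                      # plays the role of n_repeats
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r2(X_perm))
    importances.append(float(np.mean(drops)))

print([round(v, 3) for v in importances])
```

A feature the model does not rely on shows a drop close to zero. When two features carry the same information, shuffling one of them may hurt less than expected, so importances of correlated features should be read with care.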

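Summary and outlook

By mining, analyzing, and modeling the Airbnb data, we built an understanding of the homestay rental scenario and a model that estimates prices. Several follow-up steps could improve model performance and make the estimates more accurate, for example:

More thorough feature engineering, building more effective features from the business scenario.
Using models such as xgboost, lightgbm, and catboost.
Tuning hyperparameters more deeply with methods such as Bayesian optimization.
Bringing in deep learning and neural network methods.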

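References

Data Science Cheat Sheets | Pandas cheat sheet: https://www.showmeai.tech/article-detail/101
Illustrated Data Analysis: from beginner to expert tutorial series: https://www.showmeai.tech/tutorials/33
Machine Learning in Practice: hands-on machine learning series: https://www.showmeai.tech/tutorials/41
Machine Learning in Practice | Getting started with SKLearn and simple use cases: https://www.showmeai.tech/article-detail/202
Machine Learning in Practice | The complete SKLearn application guide: https://www.showmeai.tech/article-detail/203
Machine Learning in Practice | A complete guide to feature engineering: https://www.showmeai.tech/article-detail/208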


AI helps you travel on a budget: accurately predicting homestay listing prices