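Predicting Hotel Booking Demand

Author | Dimas Adnan
Compiled by | VK
Source | Towards Data Science

In this article, I show how to build a predictive model with Python and Jupyter Notebook. The data for this experiment is the Hotel Booking Demand dataset from Kaggle: https://www.kaggle.com/jessemostipak/hotel-booking-demand

I will only walk through the modeling stage here, using a Logistic Regression model alone. The full files, including data cleaning, preprocessing, and exploratory data analysis, are available on Github.

Importing libraries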

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")

Loading the dataset

df = pd.read_csv('hotel_bookings.csv')
df = df.iloc[0:2999]
df.head()

The output of df.head() shows what the dataset looks like.

It has 32 columns. The full list is:

['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date']

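According to the information from running the Notebook, NaN values appear in three columns: 'country', 'agent', and 'company'.

Based on the 'lead_time' feature, I replaced the NaN values in 'country' with PRT (Portugal), since PRT is the most common value.

I tried to replace the NaN values in 'agent' based on lead_time, arrival_date_month, and arrival_date_week_number, but in most cases the result was '240', the most common agent.

After reading the description and explanation of the dataset available online, I found that the authors describe 'agent' as the 'ID of the travel agency that made the booking'. So the rows with an 'agent' value are the bookings made through a travel agency, and the rows where 'agent' is NaN are the bookings made without one. I therefore decided it was better to fill the NaN values with 0 than with the most common agent, which would have made the dataset differ from the original.

Last but not least, I chose to drop the whole 'company' feature, because about 96% of its values are NaN. Imputing that many values could distort the data considerably.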
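The three decisions above can be sketched in a few lines of pandas. This is only an illustration, not the author's original cleaning code: the helper name fill_missing and the three-row sample frame are made up, and the real frame would come from hotel_bookings.csv.

```python
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three missing-value decisions described in the text."""
    out = df.copy()
    out["country"] = out["country"].fillna("PRT")  # PRT is the most common country
    out["agent"] = out["agent"].fillna(0)          # 0 = booked without a travel agency
    out = out.drop(columns=["company"])            # about 96% NaN, so the column is dropped
    return out

# Tiny made-up frame standing in for hotel_bookings.csv
sample = pd.DataFrame({
    "country": ["PRT", np.nan, "GBR"],
    "agent": [240.0, np.nan, 9.0],
    "company": [np.nan, np.nan, 110.0],
})
cleaned = fill_missing(sample)
print(cleaned.isna().sum())  # remaining NaN count per column
```

Filling 'agent' with 0 keeps "no agency" as its own value, whereas filling with 240 would merge those bookings into a real agency.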

Splitting the dataset

df_new = df.copy()[['required_car_parking_spaces','lead_time','booking_changes','adr','adults', 'is_canceled']]
df_new.head()

x = df_new.drop(['is_canceled'], axis=1)
y = df_new['is_canceled']

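I selected the five features most strongly correlated with the target (is_canceled): 'required_car_parking_spaces', 'lead_time', 'booking_changes', 'adr', and 'adults'. The target column 'is_canceled' is kept in df_new and then separated out as y.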
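One way to produce such a ranking is to sort the numeric columns by the absolute value of their correlation with the target. The article does not say which correlation measure was used, so the sketch below assumes Pearson correlation (the pandas default). The helper name top_correlated_features and the small frame are hypothetical.

```python
import pandas as pd

def top_correlated_features(df: pd.DataFrame, target: str = "is_canceled", n: int = 5) -> pd.Series:
    """Return up to n numeric columns ranked by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")  # skip text columns such as 'hotel'
    corr = numeric.corr()[target].drop(target)    # correlation of each feature with the target
    return corr.abs().sort_values(ascending=False).head(n)

# Made-up data: 'lead_time' tracks the target closely, 'noise' does not
toy = pd.DataFrame({
    "hotel": ["Resort Hotel"] * 6,
    "is_canceled": [0, 0, 0, 1, 1, 1],
    "lead_time": [1, 2, 3, 10, 11, 12],
    "noise": [5, 1, 4, 2, 6, 3],
})
top = top_correlated_features(toy)
print(top)
```

On the real dataset the same call with n=5 would be applied to the cleaned frame.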

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)

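The data is split into 80% training and 20% test. Because shuffle=False, the split keeps the original row order, so the test set is the last 20% of the rows.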

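Fitting the model

model_LogReg_Asli is the original Logistic Regression model, before any hyperparameter tuning. It is fitted on the training set and used to predict on the test set.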
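The fitting code is not reproduced in the text, so here is a minimal sketch of what this step could look like. The data is a synthetic stand-in for df_new with a made-up cancellation rule, and max_iter=1000 is an assumption added so that the solver converges on unscaled features. Only the variable names follow the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_new; the real values come from hotel_bookings.csv
rng = np.random.default_rng(0)
n = 200
x = pd.DataFrame({
    "required_car_parking_spaces": rng.integers(0, 2, n),
    "lead_time": rng.integers(0, 400, n),
    "booking_changes": rng.integers(0, 4, n),
    "adr": rng.uniform(40, 250, n),
    "adults": rng.integers(1, 4, n),
})
# Made-up rule so that the target depends on the features
y = (x["lead_time"] > 150).astype(int).rename("is_canceled")

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, shuffle=False)

# Baseline model: default hyperparameters apart from max_iter
model_LogReg_Asli = LogisticRegression(max_iter=1000)
model_LogReg_Asli.fit(x_train, y_train)
y_pred = model_LogReg_Asli.predict(x_test)
print(y_pred[:10])
```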

Model performance

The Logistic Regression model reaches an accuracy of about 69.3%.
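Accuracy is the share of test rows whose predicted label matches the true label. The short example below uses ten made-up labels, not the article's results, to show the calculation with scikit-learn's accuracy_score.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 0 = not canceled, 1 = canceled
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_hat = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

# 7 of the 10 predictions match the true labels
accuracy = accuracy_score(y_true, y_hat)
print(accuracy)
```

For a fitted classifier, model.score(x_test, y_test) returns the same quantity.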

Model parameters

Logistic Regression with Randomized Search CV

model_LR_RS is the Logistic Regression model with hyperparameter tuning (randomized search).
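The search space used for model_LR_RS is not given in the text. The sketch below therefore assumes a log-uniform range for C and two solvers, and runs on synthetic data from make_classification; none of these choices come from the article.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic five-feature data standing in for the hotel bookings
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

# Assumed search space: regularization strength and solver
param_distributions = {
    "C": loguniform(1e-3, 1e2),
    "solver": ["liblinear", "lbfgs"],
}
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled parameter combinations
    cv=5,               # 5-fold cross-validation on the training set
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

model_LR_RS = search.best_estimator_  # refitted on the whole training set
print(search.best_params_)
print(model_LR_RS.score(X_test, y_test))
```

Randomized search samples a fixed number of combinations (n_iter), so its cost does not grow with the size of the search space.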

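The Logistic Regression model with Randomized Search CV gives exactly the same result as the model without it: 69.3%.

Logistic Regression with Grid Search CV

model_LR2_GS is the Logistic Regression model with hyperparameter tuning (grid search).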
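As with the randomized search, the grid used for model_LR2_GS is not given in the text. The grid below is an assumption, and the data is synthetic. The sketch only shows how GridSearchCV tries every combination in the grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic five-feature data standing in for the hotel bookings
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)

# Assumed grid: 4 values of C x 2 solvers = 8 combinations
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["liblinear", "lbfgs"],
}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation on the training set
    scoring="accuracy",
)
grid.fit(X_train, y_train)

model_LR2_GS = grid.best_estimator_  # refitted on the whole training set
print(grid.best_params_)
print(model_LR2_GS.score(X_test, y_test))
```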

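The Logistic Regression model with Grid Search CV has the same accuracy, 69.3%.

Model evaluation

Confusion matrix

TN is true negative, FN is false negative, FP is false positive, and TP is true positive. Class 0 means not canceled and class 1 means canceled. The model is also summarized with a classification report.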
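The four counts can be read directly from scikit-learn's confusion_matrix, which for binary labels is laid out as [[TN, FP], [FN, TP]]. The labels below are made up for illustration and are not the article's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 0 = not canceled, 1 = canceled
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_hat = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision, recall, and F1 per class
print(classification_report(y_true, y_hat, target_names=["not canceled", "canceled"]))
```

Here the model misses two of the four cancellations (FN=2) and flags one kept booking as canceled (FP=1).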

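In this article I again used Logistic Regression for testing, but you can use other types of models, such as Random Forest or Decision Tree. On my Github I also tried a Random Forest Classifier, and the results were very similar.

That's all for this article. Thank you, and have a nice day.

Original article: https://towardsdatascience.com/predicting-a-hotel-booking-demand-7608a7dbf5a4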
