
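Preface

Boston house price prediction is a classic machine learning task, something like the "Hello World" of the programming world. Working through it with standard machine learning methods shows the basic steps and methods involved in solving a simple real-world problem.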

I. Experimental Steps and Results

1. Data Analysis

① Analyzing each feature that influences house prices

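import numpy as np
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston  # load the dataset

boston = load_boston()
x = boston['data']  # features that influence house prices
y = boston['target']  # house prices
name = boston['feature_names']

for i in range(x.shape[1]):  # one scatter plot per feature (13 in total)
    plt.figure(figsize=(10, 7))
    plt.grid()
    plt.scatter(x[:, i], y, s=5)  # x values, y values, and marker size
    plt.title(name[i])
    # Pearson correlation between this feature and the house price
    print(name[i], np.corrcoef(x[:, i], y)[0, 1])
plt.show()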
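
Because `load_boston` is no longer available in scikit-learn 1.2 and later, the listing above only runs on older releases. One workaround is to read the raw data file directly; in that file each record is wrapped over two lines, so the halves have to be stitched back together. The sketch below shows only the stitching step, on a small fabricated array. The helper name `reassemble_boston` and the toy values are assumptions made for illustration and are not part of the original article.

```python
import numpy as np


def reassemble_boston(raw):
    """Rebuild (features, target) from rows that alternate between the
    first 11 columns of a record and its remaining 3 columns.

    `raw` is assumed to have shape (2 * n_samples, 11), with the unused
    trailing cells of every second row filled with NaN.
    """
    raw = np.asarray(raw, dtype=float)
    if raw.shape[0] % 2 != 0 or raw.shape[1] != 11:
        raise ValueError("expected an even number of rows with 11 columns")
    first_half = raw[::2, :]      # columns 0..10 of each record
    second_half = raw[1::2, :2]   # columns 11..12 of each record
    features = np.hstack([first_half, second_half])
    target = raw[1::2, 2]         # third value on the wrapped line
    return features, target


# Toy input: two fabricated records, each split across two rows.
nan = np.nan
toy = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    [12, 13, 24.0] + [nan] * 8,
    [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
    [32, 33, 21.6] + [nan] * 8,
])
features, target = reassemble_boston(toy)
print(features.shape, target)
```

The raw rows themselves could be read with any whitespace-separated parser after skipping the descriptive header of the file; that step is left out here because it depends on where the file is obtained.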

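Results:
Crime rate (CRIM): high-priced homes are concentrated in areas with a low crime rate.
[Figure: CRIM vs. house price]
Proportion of residential land (ZN): no clear linear relationship with house price.

Proportion of non-retail business land in the town (INDUS): no clear linear relationship with house price; prices only show some pattern within certain ranges.
[Figure: INDUS vs. house price]

Whether the tract borders the Charles River (CHAS; 1 = on the river, 0 = not on the river): being on the Charles River has no obvious effect on house price either.
[Figure: CHAS vs. house price]

Nitric oxide concentration (NOX): shows only a very weak linear relationship with house price. Where the concentration is below 0.5, the great majority of prices are above 15.
[Figure: NOX vs. house price]

Number of rooms per dwelling (RM): has a fairly strong linear relationship with house price.
[Figure: RM vs. house price]

Proportion of owner-occupied units built before 1940 (AGE): has little effect on house price.
[Figure: AGE vs. house price]

Distance to five Boston employment centres (DIS): where the distance is small, house prices also tend to be low.
[Figure: DIS vs. house price]

Index of accessibility to highways (RAD): homes priced above 30 are almost all in areas with a low accessibility index.
[Figure: RAD vs. house price]

Property tax rate per 10,000 dollars (TAX): only weakly linearly related to house price.
[Figure: TAX vs. house price]

Pupil-teacher ratio in the town (PTRATIO): has little effect on house price and shows a weak linear relationship.
[Figure: PTRATIO vs. house price]

Proportion of Black residents (B): has little effect on Boston house prices, and the effect becomes even smaller toward the upper end of the range.
[Figure: B vs. house price]

Proportion of lower-income population (LSTAT): has a fairly strong linear relationship with house price and is an important factor.
[Figure: LSTAT vs. house price]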
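
The visual readings above can be backed up numerically by ranking features by the absolute value of their Pearson correlation with the target. The following is a minimal sketch on synthetic stand-in data, not on the Boston data; the function name `feature_target_correlations` and the three made-up features are assumptions for illustration.

```python
import numpy as np


def feature_target_correlations(x, y, names):
    """Return (name, Pearson r) pairs sorted by |r|, strongest first."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pairs = []
    for i, feature_name in enumerate(names):
        # corrcoef returns a 2x2 matrix; the off-diagonal entry is r
        r = np.corrcoef(x[:, i], y)[0, 1]
        pairs.append((feature_name, float(r)))
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


# Synthetic stand-in data: one strongly related feature, one weakly
# related feature, and one unrelated feature.
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
unrelated = rng.normal(size=n)
y = 3.0 * strong + 0.3 * weak + 0.5 * rng.normal(size=n)
x = np.column_stack([strong, weak, unrelated])

ranking = feature_target_correlations(x, y, ["strong", "weak", "unrelated"])
for feature_name, r in ranking:
    print(f"{feature_name}: r = {r:+.3f}")
```

Pearson correlation only measures linear association, so a low value does not rule out a curved relationship; the scatter plots remain worth inspecting.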

② Analyzing the house prices

# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston  # load the dataset
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

boston = load_boston()
y = boston['target']  # house prices

plt.figure(figsize=(20, 15))
y_major_locator = MultipleLocator(5)  # tick spacing of 5 for the y axis
ax = plt.gca()  # the current Axes object
ax.yaxis.set_major_locator(y_major_locator)  # put major y ticks at multiples of 5
plt.ylim(0, 51)
plt.grid()
plt.scatter(range(len(y)), y, s=20)  # sample index against house price

plt.show()

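Results:
[Figure: house price of each sample, plotted against sample index]
Based on this plot, samples with a house price of 46 or above are treated as outliers and are removed from the dataset before it is split into training and test sets.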

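2. Data Processing

Following the analysis above, samples with a house price of 46 or above are removed. Of the features, only NOX, RM, AGE, DIS, LSTAT, INDUS and PTRATIO are kept; all the others are dropped.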
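
Both processing steps amount to boolean masks: one over columns for the features to keep, one over rows for the samples below the price threshold. Here is a minimal sketch on a tiny fabricated array; the helper name `select_and_filter` and the toy numbers are assumptions for illustration only.

```python
import numpy as np


def select_and_filter(x, y, names, keep, price_cap):
    """Keep only the columns listed in `keep` and the rows whose
    target is strictly below `price_cap`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    names = np.asarray(names)
    col_mask = np.isin(names, keep)   # True for features to keep
    row_mask = y < price_cap          # True for non-outlier samples
    return x[row_mask][:, col_mask], y[row_mask], names[col_mask]


# Fabricated toy data: 3 samples, 4 features.
names = ["CRIM", "NOX", "RM", "LSTAT"]
x = np.array([
    [0.1, 0.5, 6.0, 5.0],
    [0.2, 0.6, 7.5, 3.0],
    [0.3, 0.4, 5.5, 12.0],
])
y = np.array([24.0, 50.0, 19.0])

x_sel, y_sel, kept = select_and_filter(
    x, y, names, keep=["NOX", "RM", "LSTAT"], price_cap=46
)
print(kept, x_sel.shape, y_sel)
```

Applying the row mask to both `x` and `y` in one place keeps the features and targets aligned, which is easy to get wrong when the two arrays are filtered separately.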

3. Building, Testing and Running the Model

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model  # needed for the Ridge and Lasso variants below
from sklearn.linear_model import LinearRegression  # linear model
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston  # load the dataset
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split  # dataset splitting

boston = load_boston()
x = boston['data']  # feature data
y = boston['target']  # house prices
name = boston['feature_names']

# Data processing
keep = ['NOX', 'RM', 'AGE', 'DIS', 'LSTAT', 'INDUS', 'PTRATIO']
x = x[:, np.isin(name, keep)]  # drop the secondary features

normal = y < 46  # prices of 46 or above are treated as outliers
x = x[normal]  # drop the outlier samples
y = y[normal]  # drop the outlier prices

# Split the data into one part for training and one part for testing.
# 30% is held out as the test set so the model is scored on data it has not seen.
# random_state fixes the split so every run uses the same training and test sets;
# otherwise the same model would score differently from run to run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Linear regression model
lf = LinearRegression()
lf.fit(x_train, y_train)  # fit the model parameters on the training data
y_predict = lf.predict(x_test)  # predict

# Ridge regression model
# rr = linear_model.Ridge()  # ridge regression
# rr.fit(x_train, y_train)  # train the model
# y_predict = rr.predict(x_test)  # predict


# Lasso model
# lassr = linear_model.Lasso(alpha=.0001)
# lassr.fit(x_train, y_train)
# y_predict = lassr.predict(x_test)

# Compare predictions with the held-out values
mse = round(mean_squared_error(y_test, y_predict), 5)  # mean squared error
score = round(r2_score(y_test, y_predict), 5)  # coefficient of determination R^2

# Line plot of actual and predicted values
plt.figure(figsize=(13, 7))
plt.plot(range(y_test.shape[0]), y_test, color='red', linewidth=1, linestyle='-')
plt.plot(range(y_test.shape[0]), y_predict, color='blue', linewidth=1, linestyle='dashdot')
plt.legend(['Actual', 'Predicted'])
plt.title("190512213", fontsize=20)
plt.xlabel("MSE=" + str(mse) + "\n" + "R^2=" + str(score), size=18, color="green")
plt.grid()
plt.show()

# Scatter plot of predicted against actual values;
# points on the diagonal are perfect predictions
plt.figure()
plt.title('190512213', fontsize=24)
xx = np.arange(0, 40)
plt.xlabel('* truth *', fontsize=14)
plt.ylabel('* predict *', fontsize=14)
plt.plot(xx, xx)
plt.scatter(y_test, y_predict, color='red')
plt.grid()
plt.show()

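Results:
Linear regression:
[Figure: linear regression, actual and predicted values as line plot]

[Figure: linear regression, predicted against actual values as scatter plot]

Ridge regression:
[Figure: ridge regression, actual and predicted values as line plot]

[Figure: ridge regression, predicted against actual values as scatter plot]

Lasso model:

[Figure: Lasso regression result]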
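
Rather than commenting models in and out, the three regressors can be fitted in a loop and scored with the same two metrics. The sketch below does this on synthetic data from `make_regression`, because the Boston loader is unavailable in current scikit-learn; the dataset, the dictionary names and the `Ridge` alpha are assumptions, and the numbers it prints say nothing about the article's actual results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 7 selected features (NOT the Boston data).
x, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.0001),  # same alpha as in the listing above
}

results = {}
for label, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    results[label] = (
        mean_squared_error(y_test, y_pred),  # lower is better
        r2_score(y_test, y_pred),            # closer to 1 is better
    )
    print(f"{label}: MSE = {results[label][0]:.3f}, R^2 = {results[label][1]:.3f}")
```

Because all three models see exactly the same split, any difference in the scores comes from the models themselves and not from the data they happened to receive.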

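II. Analysis of the Results

1. The plots comparing predicted and actual values show that none of the models predicts very well. The main cause is still a problem in the data analysis and processing steps. Before predicting, the data should be analyzed and processed further, for example by examining how the data varies within particular ranges and by handling extreme values.
2. The experiment uses two measures to judge prediction quality: the coefficient of determination (R^2) and the mean squared error. The closer R^2 is to 1, the better the chosen regression model fits and the better its predictions. When solving a real problem, several models should be tested and the best one used for prediction.
3. Beyond the three models chosen in this experiment, the three SVR models built on support vector machine kernel functions could also be used for prediction. Support vector machines are widely used and often perform very well, but they can be costly in memory and training time, so whether to use them depends on the situation at hand.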
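
Points 2 and 3 can be combined: try several SVR kernels and compare them by cross-validated R^2. The original does not list the three SVR models, so the sketch assumes they are the linear, polynomial and RBF kernels. The synthetic data, the value `C=100.0` and the 5-fold setting are likewise assumptions, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the Boston data).
x, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

cv_scores = {}
for kernel in ("linear", "poly", "rbf"):
    # SVR is sensitive to feature scale, so standardize inside the pipeline;
    # this way the scaler is refitted on each training fold only.
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel, C=100.0))
    scores = cross_val_score(model, x, y, cv=5, scoring="r2")
    cv_scores[kernel] = float(scores.mean())
    print(f"{kernel}: mean R^2 over 5 folds = {cv_scores[kernel]:.3f}")

best = max(cv_scores, key=cv_scores.get)
print("best kernel on this synthetic data:", best)
```

Cross-validation averages over several splits, so the comparison depends less on one particular `random_state` than a single train/test split does.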

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/247660.html


Hands-On Introduction to Machine Learning: Boston House Price Prediction (Simple Prediction)
