Notes on Hands-On Machine Learning with Scikit-Learn and TensorFlow
Reference: the author's Jupyter Notebook
Chapter 2 – End-to-end Machine Learning project
-
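Download the data
- Open VS Code, create a new Python file, and enter the following code. It downloads housing.tgz and extracts housing.csv into the target directory (datasets/housing).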
    import os
    import tarfile
    import urllib.request  # standard library; replaces the legacy `from six.moves import urllib`

    download_root = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
    HOUSING_PATH = "datasets/housing"
    HOUSING_URL = download_root + HOUSING_PATH + "/housing.tgz"

    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        # Create the target directory if it does not exist yet
        if not os.path.isdir(housing_path):
            os.makedirs(housing_path)
        tgz_path = os.path.join(housing_path, "housing.tgz")
        urllib.request.urlretrieve(housing_url, tgz_path)
        # Extract housing.csv next to the downloaded archive
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=housing_path)
        housing_tgz.close()

    fetch_housing_data()

Once the data has been downloaded, the fetch_housing_data() call can be commented out.
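Instead of commenting the call out by hand, the download could be skipped automatically when the CSV is already on disk. The following is only a minimal sketch of that idea: the helper name `needs_download` is hypothetical (it is not part of the notes or the book), and the demonstration uses a temporary directory so it runs without network access.

```python
import os
import tempfile


def needs_download(housing_path, filename="housing.csv"):
    """Return True when the extracted CSV is not on disk yet (hypothetical helper)."""
    return not os.path.isfile(os.path.join(housing_path, filename))


# Demonstration in a throwaway directory: no file yet, then a file present.
with tempfile.TemporaryDirectory() as tmp_dir:
    before = needs_download(tmp_dir)
    open(os.path.join(tmp_dir, "housing.csv"), "w").close()  # simulate the extracted file
    after = needs_download(tmp_dir)

print(before, after)
```

In the script above, the call would then become `if needs_download(HOUSING_PATH): fetch_housing_data()`.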
-
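Take a quick look at the data structure
- Load the data with pandas
    import pandas as pd

    def load_housing_data(housing_path=HOUSING_PATH):
        csv_path = os.path.join(housing_path, "housing.csv")
        return pd.read_csv(csv_path)

The function returns a pandas DataFrame containing all the data.
- Call the DataFrame's head() method to look at the first five rows (the output looks different from the book because this runs in VS Code); the call can be commented out afterwards
    housing = load_housing_data()
    print(housing.head())

There are 10 attributes in total.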
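The attribute count can also be read off programmatically rather than counted in the printed table. A small sketch on a toy DataFrame follows; the three columns and their values are illustrative stand-ins, not the real housing data.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame (illustrative values only)
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24],
    "median_income": [8.3, 8.3, 7.3],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY"],
})

n_rows, n_cols = toy.shape          # (number of rows, number of attributes)
column_names = list(toy.columns)    # attribute names in order

print(n_rows, n_cols)
print(column_names)
```

Applied to the real data, `housing.shape[1]` should give the attribute count mentioned above.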
-
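The info() method gives a quick description of the dataset, in particular the total number of rows, each attribute's type, and the number of non-null values
    housing.info()  # info() prints by itself; wrapping it in print() would also print "None"
-
Use the value_counts() method to see which categories exist and how many districts belong to each one
    print(housing["ocean_proximity"].value_counts())
-
The describe() method shows a summary of the numerical attributes
    print(housing.describe())
-
Call hist() on the whole dataset to plot a histogram for each attribute
    import matplotlib.pyplot as plt

    housing.hist(bins=50, figsize=(50, 15))
    plt.show()
-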
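Create a test set
- In theory, creating a test set is simple: pick some instances at random, typically 20% of the dataset, and set them aside:
    import numpy as np

    def split_train_test(data, test_ratio):
        # Shuffle the row positions, then cut off the first test_ratio share as the test set
        shuffled_indices = np.random.permutation(len(data))
        test_set_size = int(len(data) * test_ratio)
        test_indices = shuffled_indices[:test_set_size]
        train_indices = shuffled_indices[test_set_size:]
        return data.iloc[train_indices], data.iloc[test_indices]

    train_set, test_set = split_train_test(housing, 0.2)
    print(len(train_set), "train +", len(test_set), "test")
- This is not perfect, though: run it again and it produces a different test set. Over time you (or your machine learning algorithm) will get to see the whole dataset, which is exactly what a test set is meant to prevent. A common solution is to use each instance's identifier to decide whether it goes into the test set (assuming every instance has a unique, immutable identifier):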
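Before the full implementation, the property the identifier approach relies on can be checked on toy identifiers: an instance's assignment depends only on its own identifier, so adding new instances never moves an old one between sets. This is a minimal sketch using only the standard library; the helper name `in_test_set` and the integer identifiers are assumptions for illustration, and the rule (last byte of the MD5 digest compared with 256 × ratio) mirrors the one used in these notes.

```python
import hashlib


def in_test_set(identifier, test_ratio=0.2):
    """Decide test-set membership from the identifier alone (hypothetical helper)."""
    # Encode the integer id as 8 bytes, hash it, and look at the last digest byte (0..255)
    digest = hashlib.md5(int(identifier).to_bytes(8, "little", signed=True)).digest()
    return digest[-1] < 256 * test_ratio


ids_v1 = list(range(1000))   # first version of the dataset
ids_v2 = list(range(1200))   # same instances plus 200 new ones

test_v1 = {i for i in ids_v1 if in_test_set(i)}
test_v2 = {i for i in ids_v2 if in_test_set(i)}

# Old instances keep their assignment after the dataset grows
stable = test_v1 == {i for i in test_v2 if i < 1000}
test_fraction = len(test_v1) / len(ids_v1)

print(stable, test_fraction)
```

The test fraction is only approximately the requested ratio, since it depends on how the hash values happen to fall.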
    import hashlib

    def test_set_check(identifier, test_ratio, hash):
        # Put the instance in the test set when the last byte of its hash is below 256 * test_ratio
        return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

    def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
        return data.loc[~in_test_set], data.loc[in_test_set]

    # housing_with_id = housing.reset_index()
    # housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
    # train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

    from sklearn.model_selection import train_test_split

    # The keyword is random_state, not random
    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
- Stratified sampling
    housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
    # Merge every category above 5 into category 5 (plain assignment instead of inplace=True on a column selection)
    housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

    from sklearn.model_selection import StratifiedShuffleSplit

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]

    print(housing["income_cat"].value_counts() / len(housing))

    # Remove the helper attribute so the data returns to its original state
    for set_ in (strat_train_set, strat_test_set):  # set_ avoids shadowing the built-in set
        set_.drop(["income_cat"], axis=1, inplace=True)
-
Explore and visualize the data
- Create a copy
    housing = strat_train_set.copy()
- Visualize the geographical data
    # housing.plot(kind="scatter", x="longitude", y="latitude")
    # housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"] / 100, label="population",
                 c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
    plt.legend()
    plt.show()
- Look for correlations
    # corr_matrix = housing.corr(numeric_only=True)  # numeric_only is needed on pandas 2.0+ because of the text column
    # print(corr_matrix["median_house_value"].sort_values(ascending=False))
    from pandas.plotting import scatter_matrix  # the book's pandas.tools.plotting is now pandas.plotting

    attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
    scatter_matrix(housing[attributes], figsize=(12, 8))
    housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
    plt.show()
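To make the correlation step concrete without the housing file, here is a small sketch on synthetic data. The column names and the numbers are invented for illustration: `value` is built as a noisy linear function of `income`, while `age` is unrelated, so the sorted correlation column should rank them accordingly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
income = rng.uniform(1, 10, n)

toy = pd.DataFrame({
    "income": income,
    "value": 40_000 * income + rng.normal(0, 20_000, n),  # linear in income plus noise
    "age": rng.uniform(1, 50, n),                         # generated independently of value
})

# Pearson correlation of every column with the target, strongest first
corr = toy.corr()["value"].sort_values(ascending=False)
print(corr)
```

The coefficient only measures linear relationships, which is why the notes also look at the scatter plots.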
-
Experiment with attribute combinations
    housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
    housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
    housing["population_per_household"] = housing["population"] / housing["households"]

    # numeric_only=True skips the text column ocean_proximity, which pandas 2.0+ would otherwise reject
    corr_matrix = housing.corr(numeric_only=True)
    print(corr_matrix["median_house_value"].sort_values(ascending=False))
-
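Prepare the data for machine learning algorithms
    # Separate the predictors from the labels, starting again from a clean training set
    housing = strat_train_set.drop("median_house_value", axis=1)
    housing_labels = strat_train_set["median_house_value"].copy()
-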
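Data cleaning (pick one of four options)
    # housing.dropna(subset=["total_bedrooms"])    # option 1: drop the districts with missing values
    # housing.drop("total_bedrooms", axis=1)       # option 2: drop the whole attribute
    # median = housing["total_bedrooms"].median()
    # housing["total_bedrooms"].fillna(median)     # option 3: fill missing values with the median
    # option 4: Scikit-Learn's imputer, told to replace each attribute's missing values with that attribute's median
    from sklearn.impute import SimpleImputer  # differs from the book: the old Imputer class was replaced by SimpleImputer

    imputer = SimpleImputer(strategy="median")             # create an imputer instance
    housing_num = housing.drop("ocean_proximity", axis=1)  # copy of the data without the text attribute ocean_proximity
    imputer.fit(housing_num)                               # fit the imputer instance to the training set
    # print(imputer.statistics_)
    # print(housing_num.median().values)
    X = imputer.transform(housing_num)                     # replace the missing values
    housing_tr = pd.DataFrame(X, columns=housing_num.columns)  # put the result back into a pandas DataFrame
-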
Handle text and categorical attributes
    # First convert the text labels to numbers; Scikit-Learn provides the LabelEncoder transformer for this:
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    housing_cat = housing["ocean_proximity"]
    housing_cat_encoded = encoder.fit_transform(housing_cat)
    # print(housing_cat_encoded)
    # print(encoder.classes_)

    # Scikit-Learn provides a OneHotEncoder that converts integer category values into one-hot vectors
    from sklearn.preprocessing import OneHotEncoder

    encoder = OneHotEncoder()
    housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
    # print(housing_cat_1hot.toarray())

    # The LabelBinarizer class performs both transformations in one step
    from sklearn.preprocessing import LabelBinarizer

    encoder = LabelBinarizer()
    housing_cat_1hot = encoder.fit_transform(housing_cat)
    print(housing_cat_1hot)
-
Custom transformers
    from sklearn.base import BaseEstimator, TransformerMixin

    # Column positions of the attributes used below
    rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
            self.add_bedrooms_per_room = add_bedrooms_per_room

        def fit(self, X, y=None):
            return self  # nothing else to do

        def transform(self, X, y=None):
            rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
            population_per_household = X[:, population_ix] / X[:, household_ix]
            if self.add_bedrooms_per_room:
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_household, population_per_household]

    attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
    housing_extra_attribs = attr_adder.transform(housing.values)
-
Transformation pipelines
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Numerical pipeline: impute, add the combined attributes, then standardize
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
    housing_num_tr = num_pipeline.fit_transform(housing_num)
    # print(housing_num_tr)

    from sklearn.compose import ColumnTransformer

    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    # Apply the numerical pipeline and the one-hot encoder to their respective columns
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
    housing_prepared = full_pipeline.fit_transform(housing)
    # print(housing_prepared)
    # print(housing_prepared.shape)
-
Select and train a model
- Train a linear regression model:
    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)
    # print(lin_reg)

    # Try it out on a few instances
    some_data = housing.iloc[:5]
    some_labels = housing_labels.iloc[:5]
    some_data_prepared = full_pipeline.transform(some_data)
    # print("Predictions:", lin_reg.predict(some_data_prepared))
    # print("Labels:", list(some_labels))
    # print(some_data_prepared)
- Use Scikit-Learn's mean_squared_error function to measure the regression model's RMSE on the whole training set:
    from sklearn.metrics import mean_squared_error

    housing_predictions = lin_reg.predict(housing_prepared)
    lin_mse = mean_squared_error(housing_labels, housing_predictions)
    lin_rmse = np.sqrt(lin_mse)
    # print(lin_rmse)

    from sklearn.metrics import mean_absolute_error

    lin_mae = mean_absolute_error(housing_labels, housing_predictions)
    # print(lin_mae)
- Next, train a decision tree, DecisionTreeRegressor:
    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(housing_prepared, housing_labels)
    housing_predictions = tree_reg.predict(housing_prepared)
    tree_mse = mean_squared_error(housing_labels, housing_predictions)
    tree_rmse = np.sqrt(tree_mse)
    # print(tree_rmse)  # the model has probably badly overfit the data
- Use cross-validation for a better evaluation
    from sklearn.model_selection import cross_val_score

    # Scikit-Learn expects a utility (higher is better), hence the negated MSE
    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    tree_rmse_scores = np.sqrt(-scores)

    def display_scores(scores):
        print("Scores:", scores)
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())

    # display_scores(tree_rmse_scores)
- Compute the same scores for the linear regression model
    lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                                 scoring="neg_mean_squared_error", cv=10)
    lin_rmse_scores = np.sqrt(-lin_scores)
    # display_scores(lin_rmse_scores)
- Random forest model, RandomForestRegressor
    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
    forest_reg.fit(housing_prepared, housing_labels)
    housing_predictions = forest_reg.predict(housing_prepared)
    forest_mse = mean_squared_error(housing_labels, housing_predictions)
    forest_rmse = np.sqrt(forest_mse)
    # print(forest_rmse)

    forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                    scoring="neg_mean_squared_error", cv=10)
    forest_rmse_scores = np.sqrt(-forest_scores)
    # display_scores(forest_rmse_scores)

    scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    # print(pd.Series(np.sqrt(-scores)).describe())
-
Fine-tune the model
-
Grid search
    # Scikit-Learn's GridSearchCV can do the exploring for you: tell it which hyperparameters to experiment with
    # and which values to try, and it evaluates every possible combination using cross-validation.
    # The code below searches for the best combination of hyperparameter values for RandomForestRegressor.
    # When you have no idea what value a hyperparameter should take, a simple approach is to try consecutive powers of 10.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # try 12 (3×4) combinations of hyperparameters
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # then try 6 (2×3) combinations with bootstrap set as False
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(housing_prepared, housing_labels)
    # print(grid_search.best_params_)
    # print(grid_search.best_estimator_)

    # RMSE of every tested combination
    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)
    print(pd.DataFrame(grid_search.cv_results_))
    # Randomized search
    # Ensemble methods
-
Analyze the best model and its errors
    feature_importances = grid_search.best_estimator_.feature_importances_
    # print(feature_importances)

    # Show the importance scores next to their corresponding attribute names:
    extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
    # cat_encoder = cat_pipeline.named_steps["cat_encoder"]  # old solution
    cat_encoder = full_pipeline.named_transformers_["cat"]
    cat_one_hot_attribs = list(cat_encoder.categories_[0])
    attributes = num_attribs + extra_attribs + cat_one_hot_attribs
    sorted(zip(feature_importances, attributes), reverse=True)  # shows nothing in a script; use the print below
    # print(sorted(zip(feature_importances, attributes), reverse=True))

    # Evaluate the system on the test set
    from sklearn.metrics import mean_squared_error

    final_model = grid_search.best_estimator_
    X_test = strat_test_set.drop("median_house_value", axis=1)
    y_test = strat_test_set["median_house_value"].copy()
    X_test_prepared = full_pipeline.transform(X_test)  # transform only; never fit on the test set
    final_predictions = final_model.predict(X_test_prepared)
    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)
    # print(final_rmse)
-
Launch, monitor, and maintain the system
