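"Python Machine Learning Handbook: From Data Preprocessing to Deep Learning"
This book reads like a reference manual or dictionary: it describes clearly how specific Python calls are made and when to use them. Even though it is a reference work, working through it hands-on should give a deeper understanding of the Python libraries commonly used in machine learning and make applying them more fluent.
What follows is hands-on practice based on the book's code. The comments state what each statement does (written before the statement) and what each print outputs (written after the print). It does not strictly follow the book: I adjusted the order slightly depending on how the code actually ran, and added some of my own understanding.
Copying the code into your own environment and running it should make everything clearer still.
Each code block in this post represents one complete run, so it can be copied and run as a unit.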
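04 - Handling numerical data
Contents:
- Feature scaling
- Normalizing observations
- Polynomial and interaction features
- Custom feature transformations
- Outliers
- Discretization and grouping
- Handling missing values
These notes mainly use the sklearn package to process numerical features.
04-1 Feature scaling
Covers three cases: min-max scaling (normalization), standardization, and scaling data that contains outliers.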
from sklearn import preprocessing
import numpy as np
# Create the feature
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
print(feature)
# [[-500.5]
# [-100.1]
# [ 0. ]
# [ 100.1]
# [ 900.9]]
# -- Create a min-max scaler: the feature's minimum and maximum are mapped to 0 and 1
minmax_scale = preprocessing.MinMaxScaler(feature_range = (0, 1))
# Scale the feature
scaled_feature = minmax_scale.fit_transform(feature)
print(scaled_feature)
# [[0. ]
# [0.28571429]
# [0.35714286]
# [0.42857143]
# [1. ]]
# Print the mean and standard deviation
print(scaled_feature.mean())
print(scaled_feature.std())
# 0.41428571428571426
# 0.32701494692170274
# -- Create a standardizing scaler: the result has mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler()
# Standardize the feature
scaled_feature = scaler.fit_transform(feature)
print(scaled_feature)
# [[-1.26687088]
# [-0.39316683]
# [-0.17474081]
# [ 0.0436852 ]
# [ 1.79109332]]
# Print the mean and standard deviation
print(scaled_feature.mean())
print(scaled_feature.std())
# 0.0
# 1.0
# -- Create a scaler for data that contains outliers
scaler = preprocessing.RobustScaler()
# Scale the feature using the median and the interquartile range
scaled_feature = scaler.fit_transform(feature)
print(scaled_feature)
# [[-2.5]
# [-0.5]
# [ 0. ]
# [ 0.5]
# [ 4.5]]
# Print the mean and standard deviation
print(scaled_feature.mean())
print(scaled_feature.std())
# 0.4
# 2.2891046284519194
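The three scalers above can be reproduced by hand, which shows what each one computes. The sketch below assumes the usual definitions: min-max scaling as (x - min) / (max - min), standardization as (x - mean) / std with the population standard deviation, and robust scaling as (x - median) / IQR. The variable names are illustrative and do not come from the book.

```python
import numpy as np

feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Min-max scaling: map the minimum to 0 and the maximum to 1
minmax_manual = (feature - feature.min()) / (feature.max() - feature.min())

# Standardization: subtract the mean, divide by the population standard deviation
standard_manual = (feature - feature.mean()) / feature.std()

# Robust scaling: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(feature, [25, 75])
robust_manual = (feature - np.median(feature)) / (q3 - q1)

print(minmax_manual.ravel())
print(standard_manual.ravel())
print(robust_manual.ravel())
```

The printed values should agree with the MinMaxScaler, StandardScaler, and RobustScaler outputs listed above, up to floating-point rounding.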
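04-2 Normalizing observations
The difference from feature scaling: feature scaling is computed per feature (column), whereas normalizing observations is computed per sample (row), rescaling each observation so that its norm equals 1.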
from sklearn.preprocessing import Normalizer
import numpy as np
# Create the feature matrix
feature = np.array([[0.5, 0.5], [1.1, 3.4], [1.5, 20.2], [1.63, 34.4], [10.9, 3.3]])
print(feature)
# [[ 0.5 0.5 ]
# [ 1.1 3.4 ]
# [ 1.5 20.2 ]
# [ 1.63 34.4 ]
# [10.9 3.3 ]]
# Create a normalizer using the L2 norm
normalizer = Normalizer(norm = 'l2')
# Transform the feature matrix
print(normalizer.transform(feature))
# [[0.70710678 0.70710678]
# [0.30782029 0.95144452]
# [0.07405353 0.99725427]
# [0.04733062 0.99887928]
# [0.95709822 0.28976368]]
# Create a normalizer using the L1 norm
normalizer = Normalizer(norm = 'l1')
# Transform the feature matrix
print(normalizer.transform(feature))
# [[0.5 0.5 ]
# [0.24444444 0.75555556]
# [0.06912442 0.93087558]
# [0.04524008 0.95475992]
# [0.76760563 0.23239437]]
# Create a normalizer using the max norm
normalizer = Normalizer(norm = 'max')
# Transform the feature matrix
print(normalizer.transform(feature))
# [[1. 1. ]
# [0.32352941 1. ]
# [0.07425743 1. ]
# [0.04738372 1. ]
# [1. 0.30275229]]
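To make the row-wise computation concrete, the sketch below divides each observation by its own norm using plain NumPy. It assumes the standard definitions of the L2, L1, and max norms; the variable names are illustrative.

```python
import numpy as np

feature = np.array([[0.5, 0.5], [1.1, 3.4], [1.5, 20.2], [1.63, 34.4], [10.9, 3.3]])

# One norm per row (axis=1); keepdims=True keeps the shape (5, 1) for broadcasting
l2_norms = np.sqrt((feature ** 2).sum(axis=1, keepdims=True))
l1_norms = np.abs(feature).sum(axis=1, keepdims=True)
max_norms = np.abs(feature).max(axis=1, keepdims=True)

l2_normalized = feature / l2_norms    # each row has Euclidean length 1
l1_normalized = feature / l1_norms    # absolute values in each row sum to 1
max_normalized = feature / max_norms  # largest absolute value in each row is 1

print(l2_normalized)
print(l1_normalized)
print(max_normalized)
```

Because every division uses only values from the same row, adding or removing other observations does not change the result for a given row, unlike the column-wise scalers in 04-1.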
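04-3 Polynomial and interaction features
- Polynomial features address a nonlinear relationship between a feature and the target
- Interaction features address a target that depends on several features jointly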
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Create the feature matrix
features = np.array([[2, 3], [2, 3], [2, 3]])
print(features)
# [[2 3]
# [2 3]
# [2 3]]
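# Create the PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree = 2, include_bias = False)
# -- Create polynomial features for a nonlinear relationship between feature and target; degree is the maximum order
# Output columns: x1, x2, x1^2, x1*x2, x2^2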
print(polynomial_interaction.fit_transform(features))
# [[2. 3. 4. 6. 9.]
# [2. 3. 4. 6. 9.]
# [2. 3. 4. 6. 9.]]
polynomial_interaction = PolynomialFeatures(degree = 3, include_bias = False)
# degree = 3 additionally appends the third-order terms: x1^3, x1^2*x2, x1*x2^2, x2^3
print(polynomial_interaction.fit_transform(features))
# [[ 2. 3. 4. 6. 9. 8. 12. 18. 27.]
# [ 2. 3. 4. 6. 9. 8. 12. 18. 27.]
# [ 2. 3. 4. 6. 9. 8. 12. 18. 27.]]
interaction = PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False)
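# -- Create interaction features for a target that depends on several features jointly; degree is the maximum order and interaction_only = True keeps only products of distinct features
# Output columns: x1, x2, x1*x2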
print(interaction.fit_transform(features))
# [[2. 3. 6.]
# [2. 3. 6.]
# [2. 3. 6.]]
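The column order produced above can be reproduced by enumerating every product of features up to the chosen degree. The helper below is a hypothetical illustration (polynomial_terms is not a scikit-learn function) and assumes the ordering seen in the outputs above: all first-order terms, then second-order, and so on.

```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_terms(X, degree):
    """Return all products of up to `degree` columns of X, without a bias column."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    columns = []
    for d in range(1, degree + 1):
        # Each combination names the columns multiplied together, e.g. (0, 1) -> x1 * x2
        for combo in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)


features = np.array([[2, 3], [2, 3], [2, 3]])
print(polynomial_terms(features, 2))  # columns: x1, x2, x1^2, x1*x2, x2^2
print(polynomial_terms(features, 3))  # adds x1^3, x1^2*x2, x1*x2^2, x2^3
```

For interaction-only features, the same idea works with itertools.combinations in place of combinations_with_replacement, since that never repeats a column within a product.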
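04-4 Custom feature transformations
Sometimes a feature has to be transformed to suit your own needs, for example by taking its logarithm. There are two ways to do this: scikit-learn's FunctionTransformer() or the apply() method in pandas.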
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Create the feature matrix
features = np.array([[2, 3], [2, 3], [2, 3]])
print(features)
# [[2 3]
# [2 3]
# [2 3]]
# Define a custom function
def add_ten(x):
    return x + 10
# Create the transformer
ten_transformer = FunctionTransformer(add_ten)
print(ten_transformer.transform(features))
# [[12 13]
# [12 13]
# [12 13]]
# The same transformation can be done with pandas
import pandas as pd
df = pd.DataFrame(features, columns = ['feature_1', 'feature_2'])
print(df.apply(add_ten))
# feature_1 feature_2
# 0 12 13
# 1 12 13
# 2 12 13
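The logarithm case mentioned at the start of this section can be sketched the same way. The example below assumes np.log1p as the forward function and np.expm1 as its inverse (passed through FunctionTransformer's inverse_func parameter); the square_feet array is made-up illustration data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical right-skewed feature with one very large value
square_feet = np.array([[1500.0], [2500.0], [1500.0], [48000.0]])

# log1p(x) = log(1 + x) is defined at 0; expm1 undoes it exactly
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

logged = log_transformer.fit_transform(square_feet)
restored = log_transformer.inverse_transform(logged)

print(logged.ravel())
print(restored.ravel())
```

Supplying inverse_func is optional, but it lets predictions or transformed values be mapped back to the original units later.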
04-5 Outliers
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
import numpy as np
# Create a simulated clustered dataset
features,_ = make_blobs(n_samples = 10, n_features = 2, centers = 1, random_state = 1)
print(features)
# [[-1.83198811 3.52863145]
# [-2.76017908 5.55121358]
# [-1.61734616 4.98930508]
# [-0.52579046 3.3065986 ]
# [ 0.08525186 3.64528297]
# [-0.79415228 2.10495117]
# [-1.34052081 4.15711949]
# [-1.98197711 4.02243551]
# [-2.18773166 3.33352125]
# [-0.19745197 2.34634916]]
# Replace two values with extreme values
features[0,1] = 10000
features[1,1] = 10000
print(features)
# [[-1.83198811e+00 1.00000000e+04]
# [-2.76017908e+00 1.00000000e+04]
# [-1.61734616e+00 4.98930508e+00]
# [-5.25790464e-01 3.30659860e+00]
# [ 8.52518583e-02 3.64528297e+00]
# [-7.94152277e-01 2.10495117e+00]
# [-1.34052081e+00 4.15711949e+00]
# [-1.98197711e+00 4.02243551e+00]
# [-2.18773166e+00 3.33352125e+00]
# [-1.97451969e-01 2.34634916e+00]]
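# ---- Option 1: EllipticEnvelope()
# Create the outlier detector; contamination is the assumed proportion of outliers
outlier_detector = EllipticEnvelope(contamination = .1)
# Fit the detector
outlier_detector.fit(features)
# Predict outliers (-1 marks an outlier, 1 marks an inlier)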
print(outlier_detector.predict(features))
# [-1 1 1 1 1 1 1 1 1 1]
# Change the contamination value
outlier_detector = EllipticEnvelope(contamination = .3)
# Fit the detector
outlier_detector.fit(features)
# Predict outliers
print(outlier_detector.predict(features))
# [-1 -1 1 1 -1 1 1 1 1 1]
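# ---- Option 2: detection with the interquartile range (IQR)
# A single feature can also be checked for outliers on its own, using the IQR
# IQR = third quartile (Q3) minus first quartile (Q1)
# Outliers are commonly defined as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR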
feature = features[:,1]
print(feature)
# [1.00000000e+04 1.00000000e+04 4.98930508e+00 3.30659860e+00
# 3.64528297e+00 2.10495117e+00 4.15711949e+00 4.02243551e+00
# 3.33352125e+00 2.34634916e+00]
# Define a function that returns the indices of outliers found by the IQR rule
def indicies_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))
# Find the indices of the outliers
print(indicies_of_outliers(feature))
# (array([0, 1]),)
# ---- Handling outliers
# ----- Option 1: scale features that contain outliers with RobustScaler()
from sklearn import preprocessing
scaler = preprocessing.RobustScaler()
scaled_feature = scaler.fit_transform(features)
print(scaled_feature)
# [[-2.61212566e-01 6.80970487e+03]
# [-9.47948061e-01 6.80970487e+03]
# [-1.02406616e-01 7.87126291e-01]
# [ 7.05196630e-01 -3.59186642e-01]
# [ 1.15728512e+00 -1.28464128e-01]
# [ 5.06645267e-01 -1.17778692e+00]
# [ 1.02406616e-01 2.20215119e-01]
# [-3.72184092e-01 1.28464128e-01]
# [-5.24414566e-01 -3.40846083e-01]
# [ 9.48122608e-01 -1.01333897e+00]]
# ----- Option 2: investigate why the outlying values occur and handle them accordingly
import pandas as pd
# Create a DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116] # number of bathrooms
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
print(houses)
# Price Bathrooms Square_Feet
# 0 534433 2.0 1500
# 1 392333 3.5 2500
# 2 293222 2.0 1500
# 3 4322032 116.0 48000
# Observations can be filtered directly with a known condition
print(houses[houses['Bathrooms'] < 20])
# Price Bathrooms Square_Feet
# 0 534433 2.0 1500
# 1 392333 3.5 2500
# 2 293222 2.0 1500
# Or mark them as outliers and keep the flag as a feature of the dataset
houses['Outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)
print(houses)
# Price Bathrooms Square_Feet Outlier
# 0 534433 2.0 1500 0
# 1 392333 3.5 2500 0
# 2 293222 2.0 1500 0
# 3 4322032 116.0 48000 1
# Transform the feature to dampen the influence of the outlier
# Take the logarithm of the feature
houses['log_of_square_feet'] = np.log(houses['Square_Feet'])
print(houses)
# Price Bathrooms Square_Feet Outlier log_of_square_feet
# 0 534433 2.0 1500 0 7.313220
# 1 392333 3.5 2500 0 7.824046
# 2 293222 2.0 1500 0 7.313220
# 3 4322032 116.0 48000 1 10.778956
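The IQR rule used earlier in this section is easier to follow on a small, made-up array where the bounds can be checked by hand. This is a minimal sketch, assuming NumPy's default linear interpolation for percentiles; the data is illustrative only.

```python
import numpy as np

# Nine ordinary values and one extreme value
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask: True where a value lies outside the two bounds
is_outlier = (x < lower_bound) | (x > upper_bound)

print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_bound, upper_bound)  # -3.5 14.5
print(np.where(is_outlier)[0])   # [9]
```

The boolean mask can be used directly for filtering (x[~is_outlier]) or stored as a flag column, mirroring the two pandas approaches shown with the houses data.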
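04-6 Discretization and grouping
from sklearn.preprocessing import Binarizer
import numpy as np
age = np.array([[6], [12], [20], [36], [65]])
# -- Option 1: two intervals, binarization
# Create the binarizer; recent scikit-learn versions require threshold to be passed by keyword
binarizer = Binarizer(threshold = 18)
# Binarize the feature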
print(binarizer.fit_transform(age))
# [[0]
# [0]
# [1]
# [1]
# [1]]
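# -- Option 2: several intervals, discretization
# Discretize the feature; bins lists the bin edges, and a value that falls in the i-th interval (0 to n) is returned as i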
print(np.digitize(age, bins = [18]))
# [[0]
# [0]
# [1]
# [1]
# [1]]
print(np.digitize(age, bins = [20, 30, 64]))
# [[0]
# [0]
# [1]
# [2]
# [3]]
# -- Option 3: no explicit bin boundaries, so group observations by clustering
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Create a simulated feature matrix
features, _ = make_blobs(n_samples = 50, n_features = 2, centers = 3, random_state = 1)
print(features[:5])
# [[-9.87755355 -3.33614544]
# [-7.28721033 -8.35398617]
# [-6.94306091 -7.0237442 ]
# [-7.44016713 -8.79195851]
# [-6.64138783 -8.07588804]]
# Create a DataFrame
dataframe = pd.DataFrame(features, columns = ['feature_1', 'feature_2'])
print(dataframe.head(5))
# feature_1 feature_2
# 0 -9.877554 -3.336145
# 1 -7.287210 -8.353986
# 2 -6.943061 -7.023744
# 3 -7.440167 -8.791959
# 4 -6.641388 -8.075888
# Create the K-Means clusterer
clusterer = KMeans(3, random_state = 0)
# Fit the clusterer to the features
clusterer.fit(features)
# Predict the cluster of each observation
dataframe['group'] = clusterer.predict(features)
print(dataframe.head(5))
# feature_1 feature_2 group
# 0 -9.877554 -3.336145 0
# 1 -7.287210 -8.353986 2
# 2 -6.943061 -7.023744 2
# 3 -7.440167 -8.791959 2
# 4 -6.641388 -8.075888 2
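One detail of np.digitize used in this section is which side of each bin edge is inclusive. By default the left edge is included, so a value equal to an edge falls into the higher bin; passing right=True includes the right edge instead. A short sketch with the same ages:

```python
import numpy as np

age = np.array([[6], [12], [20], [36], [65]])
bins = [20, 30, 64]

# Default: bins[i-1] <= x < bins[i], so 20 lands in bin 1
left_closed = np.digitize(age, bins=bins)

# right=True: bins[i-1] < x <= bins[i], so 20 lands in bin 0
right_closed = np.digitize(age, bins=bins, right=True)

print(left_closed.ravel())   # [0 0 1 2 3]
print(right_closed.ravel())  # [0 0 0 2 3]
```

Which convention is appropriate depends on how the cut-offs are defined, for example "under 20" versus "20 and under".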
04-7 Handling missing values
import numpy as np
# Create the feature matrix
features = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])
print(features)
# [[ 1.1 11.1]
# [ 2.2 22.2]
# [ 3.3 33.3]
# [ 4.4 44.4]
# [ nan 55. ]]
# -- Option 1: keep only the observations without missing values (~ negates the boolean mask)
print(features[~np.isnan(features).any(axis = 1)])
# [[ 1.1 11.1]
# [ 2.2 22.2]
# [ 3.3 33.3]
# [ 4.4 44.4]]
# -- Option 2: pandas DataFrame.dropna()
import pandas as pd
dataframe = pd.DataFrame(features, columns = ['feature_1', 'feature_2'])
# Drop the observations that contain missing values
print(dataframe.dropna())
# feature_1 feature_2
# 0 1.1 11.1
# 1 2.2 22.2
# 2 3.3 33.3
# 3 4.4 44.4
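# -- Imputing missing values
# --- Option 1: the fancyimpute package
from fancyimpute import KNN
# Imputation algorithm: k-nearest neighbors; samples are weighted by the mean squared difference over the features that both rows have observed, and the weighted result fills in the missing value
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Create a simulated feature matrix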
features, _ = make_blobs(n_samples = 1000, n_features = 2, random_state = 1)
print(features[:5])
# [[-3.05837272 4.48825769]
# [-8.60973869 -3.72714879]
# [ 1.37129721 5.23107449]
# [-9.33917563 -2.9544469 ]
# [-8.63895561 -8.05263469]]
# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)
print(standardized_features[:5])
# [[ 0.87301861 1.31426523]
# [-0.67073178 -0.22369263]
# [ 2.1048424 1.45332359]
# [-0.87357709 -0.07903966]
# [-0.67885655 -1.03344137]]
# Save the true value, then replace it with a missing value
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan
print(standardized_features[:5])
# [[ nan 1.31426523]
# [-0.67073178 -0.22369263]
# [ 2.1048424 1.45332359]
# [-0.87357709 -0.07903966]
# [-0.67885655 -1.03344137]]
# Predict the missing values in the feature matrix
features_knn_imputed = KNN(k = 5, verbose = 0).fit_transform(standardized_features)
# Compare the true value with the imputed value
print('True:', true_value)
print('Imputed:', features_knn_imputed[0,0])
# True: 0.8730186113995938
# Imputed: 1.0955332713113226
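# --- Option 2: scikit-learn's SimpleImputer (which replaced the older Imputer class)
# Fill missing values with the feature's mean, median, or mode; this usually performs worse than KNN
from sklearn.impute import SimpleImputer
# Create the imputer
mean_imputer = SimpleImputer(strategy = 'mean')
# Impute the missing values
features_mean_imputed = mean_imputer.fit_transform(standardized_features)
# Compare the true value with the imputed value
print('True:', true_value)
print('Imputed:', features_mean_imputed[0,0])
# True: 0.8730186113995938
# Imputed: approximately -0.000874 (the mean of the 999 remaining standardized values)
# When imputing, it is best to add a new binary feature that records whether the observation contains an imputed value, because the fact that a value was missing can itself be informative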
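The advice about recording which observations were imputed can be sketched with NumPy alone. The example below assumes mean imputation per column and appends one 0/1 indicator column; the variable names are illustrative.

```python
import numpy as np

features = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])

# True wherever a value is missing
missing_mask = np.isnan(features)

# Column means computed while ignoring NaN
column_means = np.nanmean(features, axis=0)

# Replace each missing entry with the mean of its column
filled = np.where(missing_mask, column_means, features)

# 1 if the observation had at least one imputed value, else 0
was_imputed = missing_mask.any(axis=1).astype(int)

# Append the indicator as an extra feature column
features_with_flag = np.column_stack([filled, was_imputed])
print(features_with_flag)
```

scikit-learn's SimpleImputer offers a similar facility through its add_indicator parameter, which appends indicator columns for the features that contained missing values during fitting.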
On installing and calling these packages, thanks to those who documented the pitfalls before me:
fancyimpute
Imputer
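If fancyimpute proves difficult to install, one possible substitute is scikit-learn's own KNNImputer from sklearn.impute (available in scikit-learn 0.22 and later). This is a swapped-in technique, not the one used in the book, and its neighbor weighting differs from fancyimpute's KNN, so the imputed numbers will not match those shown above. The data below is randomly generated for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Simulated data: the second feature is strongly correlated with the first
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Hide one known value so the imputation can be compared against it
true_value = X[0, 0]
X_missing = X.copy()
X_missing[0, 0] = np.nan

# Impute from the 5 nearest rows, measured on the features that are present
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_missing)

print('True:', true_value)
print('Imputed:', X_imputed[0, 0])
```

As with the fancyimpute example, features should generally be on comparable scales before KNN imputation, because the neighbors are chosen by distance.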
