特征縮放
原因:
- 數量級的差異將導致量級較大的屬性占據主導地位
- 數量級的差異將導致迭代收斂速度減慢
- 依賴于樣本距離的演算法對于資料的數量級非常敏感
好處:
- 提升模型的精度:在機器學習演算法的目標函式中使用的許多元素(例如支持向量機的 RBF 內核或線性模型的 l1 和 l2 正則化),都是假設所有的特征都是零均值并且具有同一階級上的方差,如果某個特征的方差比其他特征大幾個數量級,那么它就會在學習演算法中占據主導位置,導致學習器并不能像我們期望的那樣,從其他特征中學習
- 提升收斂速度:對于線性模型來說,資料歸一化后,尋找最優解的程序明顯會變得平緩,更容易正確地收斂到最優解
Standardization (Z-score normalization)
通過減去均值然后除以標準差,將資料按比例縮放,使之落入一個小的特定區間,處理后的資料均值為0,標準差為1
x ′ = x ? m e a n ( x ) s t d ( x ) x^{\prime} = {{x - mean(x)} \over std(x)} x′=std(x)x?mean(x)?
適用范圍:
- 資料本身的分布就服從正太分布
- 最大值和最小值未知的情況,或有超出取值范圍的離群資料的情況
- 在分類、聚類演算法中需要使用距離來度量相似性、或者使用PCA(協方差分析)技術進行降維時,使用該方法表現更好
Rescaling (min-max normalization)
將原始資料線性變換到用戶指定的最大-最小值之間,處理后的資料會被壓縮到 [0,1] 區間上
x ′ = x ? m i n ( x ) m a x ( x ) ? m i n ( x ) x^{\prime} = {{x - min(x)} \over {max(x) - min(x)}} x′=max(x)?min(x)x?min(x)?
適用范圍:
- 對輸出范圍有要求
- 資料較為穩定,不存在極端的最大最小值
- 在不涉及距離度量、協方差計算、資料不符合正太分布的時候,可以使用該方法
Mean normalization
x ′ = x ? m e a n ( x ) m a x ( x ) ? m i n ( x ) x^{\prime} = {{x - mean(x)} \over {max(x) - min(x)}} x′=max(x)?min(x)x?mean(x)?
Scaling to unit length
將某一特征的模長轉化為1
x ′ = x ∣ ∣ x ∣ ∣ x^{\prime} = {x \over ||x||} x′=∣∣x∣∣x?
案例
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
%matplotlib inline
df = pd.read_csv(
'wine_data.csv', # 葡萄酒資料集
header=None, # 自定義列名
usecols=[0,1,2] # 回傳一個子集
)
df.columns = ['Class label', 'Alcohol', 'Malic acid'] # 類別標簽、酒精、蘋果酸
df.head()

如上表所示,酒精(百分含量/體積含量)和蘋果酸(克/升)的測量是在不同的尺度上進行的,因此在對這些資料進行任何比較或組合之前,必須先進行特征縮放
# Standardization
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])
# Min-Max scaling
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])
# 均值、標準差
print('Mean after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
.format(df_std[:,0].mean(), df_std[:,1].mean()))
print('\nStandard deviation after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
.format(df_std[:,0].std(), df_std[:,1].std()))
Mean after standardization:
Alcohol=-0.00, Malic acid=-0.00
Standard deviation after standardization:
Alcohol=1.00, Malic acid=1.00
# 最小值、最大值
print('Min-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
.format(df_minmax[:,0].min(), df_minmax[:,1].min()))
print('\nMax-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
.format(df_minmax[:,0].max(), df_minmax[:,1].max()))
Min-value after min-max scaling:
Alcohol=0.00, Malic acid=0.00
Max-value after min-max scaling:
Alcohol=1.00, Malic acid=1.00
def plot():
plt.figure(figsize=(8,6))
plt.scatter(df['Alcohol'], df['Malic acid'],
color='green', label='input scale', alpha=0.5)
plt.scatter(df_std[:,0], df_std[:,1], color='red',
label='Standardized [$N (\mu=0, \; \sigma=1)$]', alpha=0.3)
plt.scatter(df_minmax[:,0], df_minmax[:,1],
color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)
plt.title('Alcohol and Malic Acid content of the wine dataset')
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.legend(loc='upper left')
plt.grid()
plt.tight_layout()
plot()
plt.show()

上面的圖包括所有三個不同尺度上的葡萄酒資料點:非標準化資料(綠色),z-score標準化后的資料(紅色)和max-min標準化后的資料(藍色)
df['Class label'].value_counts().sort_index()
1 59
2 71
3 48
Name: Class label, dtype: int64
fig, ax = plt.subplots(3, figsize=(6,14))
for a,d,l in zip(range(len(ax)),
(df[['Alcohol', 'Malic acid']].values, df_std, df_minmax),
('Input scale',
'Standardized [$N (\mu=0, \; \sigma=1)$]',
'min-max scaled [min=0, max=1]')
):
for i,c in zip(range(1,4), ('red', 'blue', 'green')):
ax[a].scatter(d[df['Class label'].values == i, 0],
d[df['Class label'].values == i, 1],
alpha=0.5,
color=c,
label='Class %s' %i
)
ax[a].set_title(l)
ax[a].set_xlabel('Alcohol')
ax[a].set_ylabel('Malic Acid')
ax[a].legend(loc='upper left')
ax[a].grid()
plt.tight_layout()
plt.show()

在機器學習中,如果我們對訓練集做了上述處理,那么測驗集也必須要經過相同的處理
std_scale = preprocessing.StandardScaler().fit(X_train) X_train = std_scale.transform(X_train) X_test = std_scale.transform(X_test)
標準化處理對PCA的影響
以主成分分析(PCA)為例,標準化是至關重要的,因為它可以分析不同特征的差異
讀取資料集
df = pd.read_csv(
'wine_data.csv',
header=None,
)
df.head()

劃分訓練集和測驗集
# 將70%的樣本作為訓練集,30%作為測驗集
X_wine = df.values[:,1:]
y_wine = df.values[:,0]
X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine,
test_size=0.30, random_state=12345)
特征縮放
# Standardization
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
使用PCA進行降維
對標準化和非標準化資料集執行PCA,將資料集轉換為二維特征子空間
# on non-standardized data
pca = PCA(n_components=2).fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
# om standardized data
pca_std = PCA(n_components=2).fit(X_train_std)
X_train_std = pca_std.transform(X_train_std)
X_test_std = pca_std.transform(X_test_std)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10,4))
for l,c,m in zip(range(1,4), ('blue', 'red', 'green'), ('^', 's', 'o')):
ax1.scatter(X_train[y_train==l, 0], X_train[y_train==l, 1],
color=c,
label='class %s' %l,
alpha=0.5,
marker=m
)
for l,c,m in zip(range(1,4), ('blue', 'red', 'green'), ('^', 's', 'o')):
ax2.scatter(X_train_std[y_train==l, 0], X_train_std[y_train==l, 1],
color=c,
label='class %s' %l,
alpha=0.5,
marker=m
)
ax1.set_title('Transformed NON-standardized training dataset after PCA')
ax2.set_title('Transformed standardized training dataset after PCA')
for ax in (ax1, ax2):
ax.set_xlabel('1st principal component')
ax.set_ylabel('2nd principal component')
ax.legend(loc='upper right')
ax.grid()
plt.tight_layout()
plt.show()

訓練樸素的貝葉斯分類器
貝葉斯公式:
p
(
w
j
∣
x
)
=
p
(
x
∣
w
j
)
?
p
(
w
j
)
p
(
x
)
p(w_j|x) = {{p(x|w_j) * p(w_j)} \over p(x)}
p(wj?∣x)=p(x)p(x∣wj?)?p(wj?)?
# on non-standardized data
gnb = GaussianNB()
fit = gnb.fit(X_train, y_train)
# on standardized data
gnb_std = GaussianNB()
fit_std = gnb_std.fit(X_train_std, y_train)
評估有無標準化的分類準確性
pred_train = gnb.predict(X_train)
print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))
pred_test = gnb.predict(X_test)
print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
Prediction accuracy for the training dataset
81.45%
Prediction accuracy for the test dataset
64.81%
pred_train_std = gnb_std.predict(X_train_std)
print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_std)))
pred_test_std = gnb_std.predict(X_test_std)
print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))
Prediction accuracy for the training dataset
96.77%
Prediction accuracy for the test dataset
98.15%
本文到此結束,后續將會不斷更新,如果發現上述有誤,請各位大佬及時指正!
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/279598.html
標籤:python
