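Data Preprocessing
Data preprocessing is broken down here into 6 basic steps. The dataset used in this example (Data.csv) comes from the 100-Days-Of-ML-Code project cited at the end.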
Step 1: Import the libraries
In [ ]:
import numpy as np
import pandas as pd
Step 2: Import the dataset
In [ ]:
dataset = pd.read_csv('./data/Data.csv')
# Features: every column except the last one
X = dataset.iloc[ : , :-1].values
# Target: the column at index 3
Y = dataset.iloc[ : , 3].values
print(X)
print(Y)
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
Step 3: Handle missing data
In [ ]:
from sklearn.impute import SimpleImputer
# Replace each NaN in columns 1 and 2 with the mean of that column
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
X[ : , 1:3] = imputer.fit_transform(X[ : , 1:3])
Deprecated usage
In [ ]:
import sklearn
from sklearn.preprocessing import Imputer
print(sklearn.__version__)
import warnings
warnings.filterwarnings("ignore")
# Imputer was replaced by impute.SimpleImputer in sklearn 0.20 and removed in 0.22,
# so this cell only runs on older versions
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
0.21.2
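To make the effect of mean imputation concrete, the following is a minimal sketch that rebuilds the two numeric columns from the output printed in Step 2 (assumed here to be age and salary) and checks that each NaN is filled with the mean of the observed values in its column. The names `numeric`, `filled` and `col_means` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The two numeric columns copied from the Step 2 output
# (assumed to be age and salary; the column names are not shown in the post)
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# The per-column means that the imputer learned, ignoring NaN entries
col_means = np.nanmean(numeric, axis=0)
print(imputer.statistics_)

# The two previously missing cells now hold those column means
print(filled[6, 0], filled[4, 1])
```

With `strategy="mean"` the fill values depend only on the rows passed to `fit`, so in a real project the imputer would normally be fitted on the training rows only.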
Step 4: Encode categorical data
In [ ]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Map the string categories in column 0 to integer labels
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])
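Create dummy variables
In [ ]:
# Note: categorical_features was deprecated in sklearn 0.20 and removed in 0.22;
# newer versions select columns with sklearn.compose.ColumnTransformer instead
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encode the target labels as integers
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)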
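On scikit-learn 0.22 and later, `OneHotEncoder` no longer accepts `categorical_features`. One possible replacement, sketched below under the assumption that column 0 holds the categorical feature, is to wrap the encoder in a `ColumnTransformer`. The arrays `X_demo` and `Y_demo` are a hypothetical four-row excerpt of the data from Step 2, not the full dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical excerpt in the same layout as X: one categorical column, two numeric
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)
Y_demo = np.array(["No", "Yes", "No", "No"])

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks for a dense array as the result.
ct = ColumnTransformer(
    transformers=[("category", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X_demo), dtype=float)

# The target still goes through LabelEncoder, as in the original cell
Y_encoded = LabelEncoder().fit_transform(Y_demo)

print(X_encoded)
print(Y_encoded)
```

`OneHotEncoder` handles string categories directly, so the separate `LabelEncoder` pass over column 0 is not needed in this variant. The dummy columns come first, in sorted category order, followed by the passed-through numeric columns.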
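Step 5: Split the dataset into a training set and a test set
In [ ]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)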
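As a quick sanity check on the split sizes, here is a small sketch on placeholder arrays with the same number of rows (10) as the dataset above. The arrays `X_demo` and `Y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 10 rows, matching the row count of the example dataset
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

# With test_size=0.2 and 10 rows, 2 rows are held out and 8 remain for training
print(X_tr.shape, X_te.shape)
```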
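Step 6: Feature scaling
In [ ]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Fit the scaler on the training set only
X_train = sc_X.fit_transform(X_train)
# Reuse the training mean and standard deviation for the test set;
# calling fit_transform here would rescale the test set with its own statistics
X_test = sc_X.transform(X_test)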
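The reason for calling `transform` rather than `fit_transform` on the test set can be seen in a small sketch. The arrays below are hypothetical rows drawn from the numeric values printed in Step 2, and `manual` is an illustrative name for a hand-computed check.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows (two numeric features each)
X_train_demo = np.array([
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [44.0, 72000.0],
])
X_test_demo = np.array([
    [35.0, 58000.0],
    [50.0, 83000.0],
])

scaler = StandardScaler()
# fit_transform learns mean_ and scale_ from the training rows and applies them
X_train_scaled = scaler.fit_transform(X_train_demo)
# transform applies the same training statistics to the test rows
X_test_scaled = scaler.transform(X_test_demo)

# Equivalent manual computation using the statistics learned from the training set
manual = (X_test_demo - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))
```

After scaling, the training columns have mean 0 and variance 1. The test columns generally do not, because they are measured against the training distribution.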
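Feature processing summary
- First, establish how many features there are, and which are continuous and which are categorical.
- Check for missing values, and choose an appropriate way to fill in the features that have them so that the data is complete.
- Standardize continuous numeric features so that the mean is 0 and the variance is 1.
- Apply one-hot encoding to categorical features.
- Binarize continuous data that needs to be converted into categorical data.
- To prevent overfitting, or for other reasons, decide whether to normalize (regularize) the data.
- If results are poor after an initial exploration of the data, try polynomial features to look for non-linear relationships.
- Depending on the actual problem, consider whether the features need a corresponding function transformation.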
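Three of the items above (binarization, polynomial features and function transformation) are not covered by the six steps, so here is a minimal sketch of each using scikit-learn's preprocessing module. The `values` array, the threshold of 40 and the choice of `np.log1p` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, FunctionTransformer, PolynomialFeatures

# A single hypothetical continuous feature
values = np.array([[27.0], [35.0], [44.0], [50.0]])

# Binarization: entries strictly greater than the threshold become 1, the rest 0
over_40 = Binarizer(threshold=40.0).fit_transform(values)

# Polynomial features of degree 2: columns are 1, x and x^2
poly = PolynomialFeatures(degree=2).fit_transform(values)

# Function transformation: apply log(1 + x) element-wise
logged = FunctionTransformer(np.log1p).fit_transform(values)

print(over_40.ravel())
print(poly)
print(logged.ravel())
```

Whether any of these transformations helps depends on the model and the data, so they are best compared through validation rather than applied by default.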
References
100-Days-Of-ML-Code
