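NumPy Study Notes (3)
This exercise uses the Iris dataset, .\iris.txt. The dataset covers three species of iris: Iris Setosa, Iris Versicolour, and Iris Virginica. Each species contributes 50 samples, so the dataset holds 150 samples in total. Each sample records four features:
- sepallength: sepal length
- sepalwidth: sepal width
- petallength: petal length
- petalwidth: petal width
All four features are measured in centimeters (cm).
Every operation below is wrapped as a function in the irisData file.
First, import this module: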
>>> from irisData import *
>>>
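The file imports the numpy library. To make later operations easier, the names of the dataset's five columns are mapped to their column indices:
import numpy as np
# Global constants: the attribute stored in each column
sepallength = 0  # sepal length
sepalwidth = 1   # sepal width
petallength = 2  # petal length
petalwidth = 3   # petal width
species = 4      # species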
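1. Import the Iris dataset, keeping the text fields unchanged.
[Topic: input and output]
- How do you import a dataset that mixes numbers and text?
# Read the data
# The argument is the path to the dataset
def loadData(dataPath):
    global irisData
    # dtype=object keeps every field as text; skiprows=1 skips the header row
    irisData = np.loadtxt(dataPath, dtype=object, delimiter=',', skiprows=1)
    return irisData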
>>> irisData = loadData("iris.txt")
>>> print(irisData[0:10])
[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
['4.9' '3.0' '1.4' '0.2' 'Iris-setosa']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa']
['5.0' '3.6' '1.4' '0.2' 'Iris-setosa']
['5.4' '3.9' '1.7' '0.4' 'Iris-setosa']
['4.6' '3.4' '1.4' '0.3' 'Iris-setosa']
['5.0' '3.4' '1.5' '0.2' 'Iris-setosa']
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa']
['4.9' '3.1' '1.5' '0.1' 'Iris-setosa']]
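Once everything is loaded as text, the numeric columns still have to be converted before any arithmetic. The sketch below is one possible way to split the array into a float matrix and a label vector. The three-row in-memory sample and the names `raw`, `features`, and `labels` are assumptions made for illustration; they are not part of irisData.

```python
import io
import numpy as np

# Hypothetical three-row sample standing in for iris.txt (header + data rows).
sample = io.StringIO(
    "sepallength,sepalwidth,petallength,petalwidth,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# dtype=str keeps every field as text, so the species column survives intact.
raw = np.loadtxt(sample, dtype=str, delimiter=",", skiprows=1)

# Split into a float matrix for the measurements and a label vector.
features = raw[:, :4].astype(float)
labels = raw[:, 4]

print(features.shape)  # (3, 4)
print(labels)
```

Splitting once up front avoids repeating `.astype(float)` in every function, at the cost of tracking two arrays instead of one.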
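2. Compute the mean, median, and standard deviation of the iris sepal length (column 1, sepallength).
[Topic: statistics]
- How do you compute the mean, median, and standard deviation of a numpy array?
# Compute the mean
# The argument is the attribute index, i.e. 0 to 3
def average(num):
    global irisData
    datas = irisData[:, num].astype(float)
    result = np.mean(datas)
    return result
# Compute the standard deviation (population form, ddof=0)
def stddev(num):
    global irisData
    datas = irisData[:, num].astype(float)
    result = np.std(datas)
    return result
# Compute the median
def median(num):
    global irisData
    datas = irisData[:, num].astype(float)
    result = np.median(datas)
    return result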
>>> ave = average(sepallength)
>>> print(ave)
5.843333333333334
>>> std = stddev(sepallength)
>>> print(std)
0.8253012917851409
>>> med = median(sepallength)
>>> print(med)
5.8
>>>
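One detail worth knowing: np.std divides by n by default (the population standard deviation). Passing ddof=1 divides by n - 1 instead, which gives the sample standard deviation. A minimal sketch on the first five sepal lengths printed earlier (the variable names are illustrative only):

```python
import numpy as np

# First five sepal lengths from the printed rows above.
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

mean = np.mean(values)
median = np.median(values)
pop_std = np.std(values)             # divides by n (ddof=0, the default)
sample_std = np.std(values, ddof=1)  # divides by n - 1

print(mean, median)
print(pop_std, sample_std)
```

The sample version is always slightly larger; for 150 samples the difference is small, but it matters when comparing against tools that default to ddof=1.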
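3. Create a normalized version of the iris sepal length whose values lie between 0 and 1, so that the minimum maps to 0 and the maximum maps to 1 (column 1, sepallength).
[Topic: statistics]
- How do you normalize an array?
# Normalize the data; this is min-max scaling, so the result falls in [0, 1]
def normalization(num):
    global irisData
    datas = irisData[:, num].astype(float)
    aMax = np.amax(datas)
    aMin = np.amin(datas)
    result = (datas - aMin) / (aMax - aMin)
    return result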
>>> X = normalization(sepallength)
>>> print(X[0:10])
[0.22222222 0.16666667 0.11111111 0.08333333 0.19444444 0.30555556
0.08333333 0.19444444 0.02777778 0.16666667]
>>>
For more on normalization methods, see "Three Common Data Normalization Methods".
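For comparison, here is a small sketch that puts min-max scaling next to z-score standardization, another commonly used method. It uses np.ptp (max minus min) for the range; the sample values and variable names are chosen only for illustration.

```python
import numpy as np

values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
min_max = (values - values.min()) / np.ptp(values)

# Z-score standardization: zero mean and unit (population) standard deviation.
z_score = (values - values.mean()) / values.std()

print(min_max)
print(z_score)
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range, while z-scores do not bound the result to a fixed interval.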
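4. Replace the values at 20 random positions in the iris_data dataset with np.nan.
[Topic: random sampling]
- How do you modify values at random positions in an array?
# Replace n randomly chosen values in the data with np.nan
def swap(datas, n):
    # Draw n random row indices and n random column indices
    # (drawn with replacement, so a position can occasionally repeat)
    rows = np.random.choice(datas.shape[0], size=n)
    cols = np.random.choice(datas.shape[1], size=n)
    datas[rows, cols] = np.nan
    return datas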
>>> X = swap(irisData, 20)
>>> print(X[0:10])
[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
['4.9' '3.0' '1.4' nan 'Iris-setosa']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa']
['5.0' '3.6' nan '0.2' 'Iris-setosa']
['5.4' '3.9' '1.7' '0.4' 'Iris-setosa']
['4.6' nan '1.4' '0.3' 'Iris-setosa']
['5.0' '3.4' '1.5' '0.2' 'Iris-setosa']
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa']
['4.9' nan '1.5' '0.1' 'Iris-setosa']]
>>>
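Drawing row and column indices independently can pick the same cell twice, in which case fewer than n values end up replaced. If exactly n distinct positions are needed, one option is to sample flat indices without replacement and convert them back to (row, column) pairs. This is a sketch on a small made-up float array, not the irisData implementation:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the run repeatable

# Made-up 5x4 float array standing in for the numeric part of the dataset.
data = np.arange(20, dtype=float).reshape(5, 4)
n = 6

# Sample n distinct flat positions, then map them back to 2D indices.
flat_idx = rng.choice(data.size, size=n, replace=False)
rows, cols = np.unravel_index(flat_idx, data.shape)
data[rows, cols] = np.nan

print(np.isnan(data).sum())  # 6
```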
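5. Compute the correlation coefficient between sepalLength (column 1) and petalLength (column 3) in iris_data.
[Topic: statistics]
- How do you compute the correlation coefficient between two columns of a numpy array?
# Compute the correlation coefficient; this is the Pearson coefficient
# The arguments are the attribute indices of the two columns
def pearson(x, y):
    global irisData
    X = irisData[:, x].astype(float)
    Y = irisData[:, y].astype(float)
    xMean = np.mean(X)
    yMean = np.mean(Y)
    # Root of the summed squared deviations; the 1/n factors cancel in the ratio
    xStd = np.sqrt(np.dot(X-xMean, X-xMean))
    yStd = np.sqrt(np.dot(Y-yMean, Y-yMean))
    result = np.dot(X-xMean, Y-yMean) / (xStd * yStd)
    return result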
>>> pear = pearson(sepallength, petallength)
>>> print(pear)
0.8717541573048712
For more on correlation coefficients, see "The Three Major Statistical Correlation Coefficients".
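NumPy also ships np.corrcoef, which returns the full correlation matrix; the off-diagonal entry is the Pearson coefficient. The sketch below checks the hand-written formula against it on the first five rows printed earlier (variable names are illustrative):

```python
import numpy as np

# Sepal and petal lengths of the first five rows shown above.
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
y = np.array([1.4, 1.4, 1.3, 1.5, 1.4])

# Hand-written Pearson coefficient: covariance over the product of spreads.
dx, dy = x - x.mean(), y - y.mean()
manual = np.dot(dx, dy) / np.sqrt(np.dot(dx, dx) * np.dot(dy, dy))

# Built-in: entry [0, 1] of the 2x2 correlation matrix.
builtin = np.corrcoef(x, y)[0, 1]

print(manual, builtin)
```

Both routes should agree up to floating-point rounding, so the built-in is a convenient cross-check for a manual implementation.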
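6. Display the petal length of iris_data (column 3) as a categorical variable.
[Topic: statistics]
- How do you convert numbers into a categorical (text) array?
# Display one column as a categorical variable; the cut points split the value range into thirds
def clfied(num):
    global irisData
    datas = irisData[:, num].astype(float)
    aMax = np.amax(datas)
    aMin = np.amin(datas)
    # Cut points at one third and two thirds of the way from aMin to aMax
    div1 = aMin + (aMax - aMin) / 3
    div2 = aMin + 2 * (aMax - aMin) / 3
    # Two interior cut points yield bin indices 0, 1, 2; the maximum lands in bin 2
    binData = np.digitize(datas, [div1, div2])
    label_map = {0: 'small', 1: 'medium', 2: 'large'}
    result = [label_map[x] for x in binData]
    return result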
>>> petal_length_cat = clfied(petallength)
>>> print(petal_length_cat[0:10])
['small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small']
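To see how np.digitize assigns bins, here is a small worked sketch on made-up petal lengths. With two interior cut points, values below the first cut get index 0, values between the cuts get 1, and the rest get 2. The sample values and names are assumptions for illustration.

```python
import numpy as np

# Made-up petal lengths spanning the full range.
values = np.array([1.0, 1.4, 3.0, 4.5, 5.1, 6.9])

lo, hi = values.min(), values.max()
# Two interior cut points at 1/3 and 2/3 of the range.
edges = lo + (hi - lo) * np.array([1, 2]) / 3

bins = np.digitize(values, edges)  # bin index 0, 1, or 2 per value
labels = np.array(["small", "medium", "large"])[bins]

print(labels)  # ['small' 'small' 'medium' 'medium' 'large' 'large']
```

Indexing a label array with the bin indices avoids the Python-level loop over a dictionary and keeps the result as a NumPy array.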
7. Create a new column in iris_data, where volume is (pi x petallength x sepallength ^ 2) / 3.
[Topic: array operations]
- How do you create a new column from the existing columns of a numpy array?
# Add a column to irisData whose value is (pi * petallength * sepallength ^ 2) / 3
def newCul():
    global irisData
    sepalLength = irisData[:, 0].astype(float)
    petalLength = irisData[:, 2].astype(float)
    volume = (np.pi * petalLength * sepalLength ** 2) / 3
    # Reshape to a column vector so it can be appended along axis 1
    volume = volume[:, np.newaxis]
    irisData = np.concatenate([irisData, volume], axis=1)
    return irisData
>>> Z = newCul()
>>> print(Z[0:10])
[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa' 38.13265162927291]
['4.9' '3.0' '1.4' '0.2' 'Iris-setosa' 35.200498485922445]
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa' 30.0723720777127]
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa' 33.238050274980004]
['5.0' '3.6' '1.4' '0.2' 'Iris-setosa' 36.65191429188092]
['5.4' '3.9' '1.7' '0.4' 'Iris-setosa' 51.911677007917746]
['4.6' '3.4' '1.4' '0.3' 'Iris-setosa' 31.022180256648003]
['5.0' '3.4' '1.5' '0.2' 'Iris-setosa' 39.269908169872416]
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa' 28.38324242763259]
['4.9' '3.1' '1.5' '0.1' 'Iris-setosa' 37.714819806345474]]
>>>
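An alternative to reshaping with np.newaxis and calling np.concatenate is np.column_stack, which treats a 1D array as a column automatically. The following is a minimal sketch on the first two rows printed above; `data` and `extended` are illustrative names.

```python
import numpy as np

# First two rows of the dataset, stored as text like irisData.
data = np.array(
    [["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
     ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"]],
    dtype=object,
)

sepal = data[:, 0].astype(float)
petal = data[:, 2].astype(float)
volume = np.pi * petal * sepal ** 2 / 3

# column_stack turns the 1D volume array into a new last column.
extended = np.column_stack([data, volume])

print(extended.shape)  # (2, 6)
```

Because the original array has dtype object, the result keeps the text fields as strings and stores the new column as floats, matching the mixed output shown above.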
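8. Randomly sample iris species so that Iris-setosa appears twice as often as Iris-versicolor and as Iris-virginica.
[Topic: random sampling]
- How do you do probability-weighted sampling in numpy?
# Randomly sample n species names so that Iris-setosa is twice as likely as each of the other two
def pickSpecies(n=20):
    names = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
    # p gives the probability of each name; the ratio is 2 : 1 : 1
    speciesOut = np.random.choice(names, n, p=[0.5, 0.25, 0.25])
    return speciesOut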
>>> out = pickSpecies(20)
>>> print(out)
['Iris-setosa' 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor'
'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
'Iris-versicolor' 'Iris-setosa' 'Iris-setosa']
>>>
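With only 20 draws the 2:1:1 ratio holds only approximately, as the output above shows. A quick way to check the weights is to draw a large sample and count each name with np.unique. This sketch uses the newer np.random.default_rng generator; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only for repeatability
names = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# Draw many samples so the observed frequencies approach the weights.
draws = rng.choice(names, size=10_000, p=[0.5, 0.25, 0.25])

kinds, counts = np.unique(draws, return_counts=True)
freq = dict(zip(kinds, counts / draws.size))

print(freq)
```

The observed frequencies should land close to 0.5, 0.25, and 0.25, with the gap shrinking as the sample grows.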
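9. Sort the dataset by the sepallength column.
[Topic: sorting]
- How do you sort a 2D array by one column?
# Sort the dataset by one attribute
def sort(num):
    global irisData
    # Convert to float first; sorting the raw text would compare strings, not numbers
    datas = irisData[:, num].astype(float)
    index = np.argsort(datas)
    result = irisData[index]
    return result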
>>> result = sort(sepallength)
>>> print(result[0:10])
[['4.3' '3.0' '1.1' '0.1' 'Iris-setosa']
['4.4' '3.2' '1.3' '0.2' 'Iris-setosa']
['4.4' '3.0' '1.3' '0.2' 'Iris-setosa']
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa']
['4.5' '2.3' '1.3' '0.3' 'Iris-setosa']
['4.6' '3.6' '1.0' '0.2' 'Iris-setosa']
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa']
['4.6' '3.4' '1.4' '0.3' 'Iris-setosa']
['4.6' '3.2' '1.4' '0.2' 'Iris-setosa']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']]
>>>
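Because the dataset is stored as text, it matters whether the sort key is a string or a number. The iris values all have a single digit before the decimal point, so text order and numeric order happen to agree, but that does not hold in general. A small sketch with made-up values shows the difference:

```python
import numpy as np

# Made-up rows whose first column mixes one- and two-digit numbers.
data = np.array([["9.5", "a"], ["10.0", "b"], ["4.3", "c"]], dtype=object)

# Sorting the raw strings compares character by character: "10.0" < "4.3" < "9.5".
text_order = np.argsort(data[:, 0])

# Converting to float first gives the numeric order: 4.3 < 9.5 < 10.0.
num_order = np.argsort(data[:, 0].astype(float))

print(data[text_order][:, 0])  # ['10.0' '4.3' '9.5']
print(data[num_order][:, 0])   # ['4.3' '9.5' '10.0']
```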
