我有一個自定義資料集,我想使用 kmeans 進行磁區。這是我的 MCVE:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
samples = np.array([[3.2736001e 03, 1.7453293e 00], [3.7256001e 03, 5.2359879e-02], [3.2960000e 03, 1.7366025e 00], [3.7112000e 03, 4.3633230e-02],
[3.7136001e 03, 4.3633230e-02], [6.8240002e 02, 1.4137167e 00], [6.9279999e 02, 1.4049901e 00], [3.2944001e 03, 1.7366025e 00], [3.7480000e 03, 6.1086524e-02],
[7.4640002e 02, 1.2217305e-01], [3.2712000e 03, 1.7453293e 00], [7.8320001e 02, 1.3962634e-01], [8.0240002e 02, 1.4835298e-01], [3.7176001e 03, 5.2359879e-02],
[7.1520001e 02, 1.3875368e 00], [7.0079999e 02, 1.3962634e 00], [7.6640002e 02, 1.3962634e-01], [7.3440002e 02, 1.1344640e-01], [3.3272000e 03, 1.7278759e 00],
[7.3840002e 02, 1.3788100e 00], [7.6079999e 02, 1.3089970e-01], [7.6240002e 02, 1.3089970e-01], [7.8160004e 02, 1.3613569e 00], [7.6079999e 02, 1.3700835e 00],
[6.0800000e 02, 1.4224433e 00], [6.3600000e 02, 1.4137167e 00], [6.7040002e 02, 1.4224433e 00], [7.8879999e 02, 1.6580628e-01], [7.3840002e 02, 1.2217305e-01],
[8.1920001e 02, 1.5707964e-01], [7.5760004e 02, 1.3089970e-01], [6.6240002e 02, 9.5993109e-02], [7.7520001e 02, 1.4835298e-01], [6.9040002e 02, 1.3962634e 00],
[3.7544001e 03, 6.1086524e-02], [7.8240002e 02, 1.5707964e-01], [7.1520001e 02, 1.1344640e-01], [7.9840002e 02, 1.3526301e 00], [7.4079999e 02, 1.3788100e 00],
[3.7200000e 03, 5.2359879e-02], [3.7168000e 03, 4.3633230e-02], [7.1760004e 02, 1.3875368e 00], [6.6479999e 02, 1.4049901e 00]])
numCluster = 4
kmeans = KMeans(n_clusters=numCluster).fit(samples)
for i in range(0,numCluster):
res = samples[kmeans.labels_ == i]
plt.scatter(res[:,0], res[:,1])
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], color='black')
輸出

對我來說,左側兩個集群的輸出沒有任何意義。根據我對 kmeans 的理解,在下一次迭代中,中心將在集群內移動。但即使有給定的中心,所分配的標簽也沒有意義。
到目前為止我檢查過的內容:
- 使用“kmeans ”的輸出看起來相同
- 調整“tol”
- 測驗了kmeans的opencv實作
有人可以解釋這個輸出,以及如何解決它嗎?
uj5u.com熱心網友回復:
資料未正確縮放。x 維度在 500 到 3500 之間縮放,而 y 維度在 0 到 1.75 之間縮放。
該圖隱藏了這一點,但左側的兩個“斑點”在適當縮放時看起來更像平行線。
如果我們關注x維度小于1000的部分資料,這個問題就更加明顯了。k-means 收斂的中心現在看起來很合理:
left_side = samples[samples[:, 0] < 1000]
plt.scatter(left_side[:, 0], left_side[:, 1])
kmeans = KMeans(n_clusters=2).fit(left_side)

將此與我們首先通過移除均值和方差 ( ) 來轉換資料的情況進行對比StandardScaler:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
samples = np.array([[3.2736001e 03, 1.7453293e 00], [3.7256001e 03, 5.2359879e-02], [3.2960000e 03, 1.7366025e 00], [3.7112000e 03, 4.3633230e-02],
[3.7136001e 03, 4.3633230e-02], [6.8240002e 02, 1.4137167e 00], [6.9279999e 02, 1.4049901e 00], [3.2944001e 03, 1.7366025e 00], [3.7480000e 03, 6.1086524e-02],
[7.4640002e 02, 1.2217305e-01], [3.2712000e 03, 1.7453293e 00], [7.8320001e 02, 1.3962634e-01], [8.0240002e 02, 1.4835298e-01], [3.7176001e 03, 5.2359879e-02],
[7.1520001e 02, 1.3875368e 00], [7.0079999e 02, 1.3962634e 00], [7.6640002e 02, 1.3962634e-01], [7.3440002e 02, 1.1344640e-01], [3.3272000e 03, 1.7278759e 00],
[7.3840002e 02, 1.3788100e 00], [7.6079999e 02, 1.3089970e-01], [7.6240002e 02, 1.3089970e-01], [7.8160004e 02, 1.3613569e 00], [7.6079999e 02, 1.3700835e 00],
[6.0800000e 02, 1.4224433e 00], [6.3600000e 02, 1.4137167e 00], [6.7040002e 02, 1.4224433e 00], [7.8879999e 02, 1.6580628e-01], [7.3840002e 02, 1.2217305e-01],
[8.1920001e 02, 1.5707964e-01], [7.5760004e 02, 1.3089970e-01], [6.6240002e 02, 9.5993109e-02], [7.7520001e 02, 1.4835298e-01], [6.9040002e 02, 1.3962634e 00],
[3.7544001e 03, 6.1086524e-02], [7.8240002e 02, 1.5707964e-01], [7.1520001e 02, 1.1344640e-01], [7.9840002e 02, 1.3526301e 00], [7.4079999e 02, 1.3788100e 00],
[3.7200000e 03, 5.2359879e-02], [3.7168000e 03, 4.3633230e-02], [7.1760004e 02, 1.3875368e 00], [6.6479999e 02, 1.4049901e 00]])
scaler = StandardScaler()
samples = scaler.fit_transform(samples)
numCluster = 4
kmeans = KMeans(n_clusters=numCluster).fit(samples)
for i in range(0, numCluster):
res = samples[kmeans.labels_ == i]
plt.scatter(res[:, 0], res[:, 1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color="black")

轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/520807.html
下一篇:誤差平方和的數學概念的梯度下降
