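Grouping Supermarket Customers with K-Means
Main steps:
- 1. Import packages
- 2. Import the dataset
- 3. Choose the optimal K with the elbow method
- 4. Cluster with K=5
- 5. Visualize the clusters
- 6. Take action
- 7. Generate a Swiss roll dataset and cluster it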
1. Import packages
In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Import the dataset
In [2]:
# Import the dataset
dataset = pd.read_csv('Mall_Customers.csv')
dataset
Out[2]:
| | CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
| ... | ... | ... | ... | ... | ... |
| 195 | 196 | Female | 35 | 120 | 79 |
| 196 | 197 | Female | 45 | 126 | 28 |
| 197 | 198 | Male | 32 | 126 | 74 |
| 198 | 199 | Male | 32 | 137 | 18 |
| 199 | 200 | Male | 30 | 137 | 83 |
200 rows × 5 columns
So that the clusters can be plotted in two dimensions, only two columns are used: Annual Income (k$) and Spending Score (1-100).
In [3]:
X = dataset.iloc[:, [3, 4]].values
X[:3, :]
Out[3]:
array([[15, 39],
       [15, 81],
       [16,  6]], dtype=int64)
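Positional indices such as `[3, 4]` break silently if the column order ever changes. A minimal sketch of the same selection by column name, using only the first five rows shown in the table above (the small `sample` frame is a stand-in for the full CSV, built here so the snippet runs on its own):

```python
import pandas as pd

# Stand-in for Mall_Customers.csv: the first five rows shown in Out[2]
sample = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Genre': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'Age': [19, 21, 20, 23, 31],
    'Annual Income (k$)': [15, 15, 16, 16, 17],
    'Spending Score (1-100)': [39, 81, 6, 77, 40],
})

# Select the two features by name instead of by position
feature_cols = ['Annual Income (k$)', 'Spending Score (1-100)']
X_named = sample[feature_cols].values

# Positional selection as used in In [3], for comparison
X_positional = sample.iloc[:, [3, 4]].values

print(X_named[:3, :])
print((X_named == X_positional).all())
```

With the real data, replacing `sample` by `dataset` should give the same `X` as the positional version.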
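3. Choose the optimal K with the elbow method
In [4]:
# Choose the optimal K with the elbow method
from sklearn.cluster import KMeans
wcss = []  # within-cluster sum of squares (WCSS) for each K
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', n_init=10, max_iter=300, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model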
In [5]:
# Plot number of clusters vs. WCSS
plt.figure()
plt.plot(range(1, 11), wcss, 'ro-')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

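From K=5 onward the WCSS curve flattens: adding more clusters reduces WCSS only marginally. K=5 is therefore taken as the elbow and used as the number of clusters.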
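WCSS (exposed as `inertia_` in scikit-learn) is the sum of squared distances from each point to its assigned cluster centre. A minimal sketch that recomputes it by hand and checks it against `inertia_`. It runs on synthetic data from `make_blobs` with five centres, which is an assumption made so the snippet is self-contained; it is not the mall dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 200 points around 5 centres
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=0)

wcss_demo = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=0)
    km.fit(X_demo)

    # WCSS by hand: squared distance from each point to its own centre
    assigned_centres = km.cluster_centers_[km.labels_]
    manual_wcss = ((X_demo - assigned_centres) ** 2).sum()

    # Should agree with scikit-learn's inertia_ up to floating-point error
    assert np.isclose(manual_wcss, km.inertia_)
    wcss_demo.append(km.inertia_)

# Relative drop in WCSS when going from K to K+1; the elbow is where this flattens
drops = -np.diff(wcss_demo) / np.array(wcss_demo[:-1])
for k, d in zip(range(1, 10), drops):
    print(f'K={k} -> K={k + 1}: WCSS falls by {d:.1%}')
```

Printing the relative drops is one way to read the elbow without relying on the plot alone. It remains a heuristic, so the chosen K is still a judgment call.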
4. Cluster with K=5
In [6]:
# Run K-Means with the chosen K
kmeans = KMeans(n_clusters = 5, init = 'k-means++', n_init=10, max_iter=300, random_state = 0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)  # cluster label (0-4) for each customer
In [7]:
y_kmeans
Out[7]:
array([3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 0,
3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 2, 0, 2, 4, 2, 4, 2,
0, 2, 4, 2, 4, 2, 4, 2, 4, 2, 0, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])
5. Visualize the clusters
In [8]:
# Visualize the clusters
plt.figure()
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
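The five near-identical `plt.scatter` calls can be folded into a loop. A sketch under the assumption that labels run from 0 to K-1. `plot_clusters` is a hypothetical helper name, and the small random dataset exists only to make the snippet runnable on its own:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_clusters(X, labels, centers,
                  colors=('red', 'blue', 'green', 'cyan', 'magenta')):
    """Scatter each cluster in its own colour and overlay the centroids."""
    fig, ax = plt.subplots()
    for k in np.unique(labels):
        members = X[labels == k]
        ax.scatter(members[:, 0], members[:, 1], s=100,
                   c=colors[k % len(colors)], label=f'Cluster {k + 1}')
    ax.scatter(centers[:, 0], centers[:, 1], s=300, c='yellow', label='Centroids')
    ax.set_title('Clusters of customers')
    ax.set_xlabel('Annual Income (k$)')
    ax.set_ylabel('Spending Score (1-100)')
    ax.legend()
    return ax

# Stand-in data (assumption): random points in the same value ranges as the two columns
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(15, 137, 200), rng.uniform(1, 100, 200)])
km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X_demo)
ax = plot_clusters(X_demo, km.labels_, km.cluster_centers_)
# Call plt.show() to display the figure in an interactive session
```

With the real data the call would be `plot_clusters(X, y_kmeans, kmeans.cluster_centers_)`.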
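6. Take action
- Cluster 1: medium income, medium spending;
- Cluster 2: low income, high spending; look into which products this group mainly buys;
- Cluster 3: high income, high spending;
- Cluster 4: low income, low spending;
- Cluster 5: high income, low spending; offer these customers coupons or discount cards to encourage them to spend more.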
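Descriptions such as "low income, high spending" can be backed by numbers rather than read off the plot. A minimal sketch that tabulates each cluster's size and mean features. It runs on stand-in random data (an assumption, so the snippet is self-contained); with the real data, `X` and `y_kmeans` would take the place of `X_demo` and `labels`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in data (assumption): random values in the same ranges as the two columns
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(15, 137, 200), rng.uniform(1, 100, 200)])
labels = KMeans(n_clusters=5, init='k-means++', n_init=10,
                random_state=0).fit_predict(X_demo)

df = pd.DataFrame(X_demo, columns=['Annual Income (k$)', 'Spending Score (1-100)'])
df['Cluster'] = labels + 1  # match the 1-based names used in the plot legend

# Size and mean income / spending score per cluster
profile = df.groupby('Cluster').agg(
    customers=('Annual Income (k$)', 'size'),
    mean_income=('Annual Income (k$)', 'mean'),
    mean_score=('Spending Score (1-100)', 'mean'),
)
print(profile.round(1))
```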
7. Generate a Swiss roll dataset and cluster it
In [10]:
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions
from sklearn.cluster import KMeans
from sklearn import datasets
import matplotlib.pyplot as plt

# Generate the Swiss roll dataset (noise is left at its default of 0.0)
X, color = datasets.make_swiss_roll(n_samples=3000)

# Partition the data into 3 K-Means clusters
clusters_swiss_roll = KMeans(n_clusters=3, random_state=1).fit_predict(X)

# 3D scatter plot, coloured by cluster label
fig2 = plt.figure(figsize=(10, 10))
ax = fig2.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=clusters_swiss_roll, cmap='Spectral')
plt.show()
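`make_swiss_roll` adds no noise unless asked to, and without a seed each run produces a different roll. A sketch of passing `noise` explicitly and fixing the seed for reproducibility. The values `noise=0.2` and `random_state=0` are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

# noise is the standard deviation of the Gaussian noise added to the points
# (0.2 is an assumed value); random_state makes the data reproducible
X_noisy, t = make_swiss_roll(n_samples=3000, noise=0.2, random_state=0)

# Same clustering as above, with n_init set explicitly
labels_noisy = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_noisy)

print(X_noisy.shape)              # (3000, 3)
print(np.bincount(labels_noisy))  # number of points in each cluster
```

The second return value `t` is each point's position along the roll. Colouring the scatter plot by `t` instead of by cluster label shows how far the K-Means partition follows the manifold.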

