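I am looking at a high-dimensional dataset of recipes (180K) * ingredients (~8000). I made the values binary, depending on whether a recipe contains a given ingredient. Apparently, when using KModes, if I replace the NaNs with 0s '''data = data.replace(np.nan, 0)''', I end up with one dense category (made up of the zeros) and a single value in each of the other clusters (similarity is based on both the 1s and the 0s). So the question is: how can I handle these NaNs so that KModes does not take them into account?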
from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=20, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data)
fitClusters_cao
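
To illustrate why the zero-filled matrix tends to collapse into one dense cluster, here is a minimal sketch of simple matching dissimilarity, which counts the positions where two rows differ. The helper `matching_dissimilarity` and the two toy recipes are made up for illustration and are not part of kmodes or of the data above.

```python
def matching_dissimilarity(row_a, row_b):
    """Count the positions at which two equal-length category rows differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    return sum(x != y for x, y in zip(row_a, row_b))


n_ingredients = 8000  # roughly the number of ingredient columns in the question

# Two hypothetical recipes with five ingredients each and none in common.
recipe_a = [0] * n_ingredients
recipe_b = [0] * n_ingredients
for i in range(0, 5):
    recipe_a[i] = 1
for i in range(5, 10):
    recipe_b[i] = 1

# A cluster mode that is 0 in every column.
all_zero_mode = [0] * n_ingredients

d_ab = matching_dissimilarity(recipe_a, recipe_b)       # 10 of 8000 columns differ
d_a0 = matching_dissimilarity(recipe_a, all_zero_mode)  # 5 of 8000 columns differ

print(d_ab, d_a0)
```

Even though the two recipes share no ingredient, they agree on 7990 of 8000 columns, and each of them sits closer to the all-zero mode than to the other. Under this kind of measure the shared 0s carry most of the similarity.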

Example:
import pandas as pd
import numpy as np
data_as_dict = {'recipe_id': {0: 424415, 1: 424415, 2: 424415, 3: 424415, 4: 424415, 5: 146223, 6: 146223, 7: 146223, 8: 146223, 9: 146223, 10: 146223, 11: 146223, 12: 146223, 13: 146223, 14: 146223, 15: 146223, 16: 146223, 17: 312329, 18: 312329, 19: 312329}, 'ingredient_ids': {0: 389, 1: 7655, 2: 6270, 3: 1527, 4: 3406, 5: 2683, 6: 4969, 7: 800, 8: 5298, 9: 840, 10: 2499, 11: 6632, 12: 7022, 13: 1511, 14: 3248, 15: 4964, 16: 6270, 17: 1257, 18: 7655, 19: 6270}}
df = pd.DataFrame.from_dict(data_as_dict)
# Number of rows in which each ingredient appears, aligned to the original rows
df['counts'] = df.groupby('ingredient_ids')['ingredient_ids'].transform('count')
data_exploded = df[['recipe_id', 'ingredient_ids', 'counts']].copy()
data_exploded['count'] = 1
data_exploded = data_exploded.drop('counts', axis = 1)
data_exploded = data_exploded.pivot_table(values = 'count', index = 'recipe_id', columns='ingredient_ids')
data_exploded = data_exploded.replace(np.nan, 0)
from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=20, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data_exploded)
fitClusters_cao
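Answer from a uj5u.com user:
The way around this is to convert everything to string values (which kmodes can evidently handle). So, in pivot_table(), set fill_value = '', and if you are using binary data, also convert the 1s (and 0s) to string values.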
'''
data_exploded[['count']] = '1'
data_exploded = data_exploded.drop('counts', axis = 1)
#data_exploded[['count']] = data_exploded[['count']].astype(int)
data_exploded = data_exploded.pivot_table(index = 'recipe_id', columns='ingredient_ids', values = 'count', fill_value = ' ', aggfunc='sum')
data_exploded
#data_exploded = data_exploded.replace(0, Na)
#data_exploded = data_exploded.replace(np.nan, 0)
from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=25, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data_exploded)
fitClusters_cao
'''
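
For readers who want to try the string conversion in isolation, here is a minimal sketch on a toy table. The recipe and ingredient ids are made up, and `aggfunc="first"` is my own choice for the sketch (the snippet above uses `'sum'`); it assumes each recipe lists an ingredient at most once.

```python
import pandas as pd

# Toy long-format data: one row per (recipe, ingredient) pair. Ids are hypothetical.
pairs = pd.DataFrame({
    "recipe_id":      [1, 1, 2, 2, 3],
    "ingredient_ids": [10, 20, 20, 30, 10],
})

# Store the presence flag as a string rather than the integer 1.
pairs["count"] = "1"

# Pivot to a recipe-by-ingredient matrix; missing pairs become a blank string, not NaN or 0.
wide = pairs.pivot_table(
    index="recipe_id",
    columns="ingredient_ids",
    values="count",
    aggfunc="first",
    fill_value=" ",
)

print(wide)
```

The resulting frame contains only the strings `"1"` and `" "`, so it can be passed to `fit_predict` in the same way as `data_exploded` above. Note that the blank string is still a category value in its own right, so it is worth checking the resulting cluster sizes.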
