Handling NA (NaN) values in kmodes

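I am looking at a high-dimensional dataset of recipes (180K) by ingredients (~8000). I made the values binary, depending on whether a recipe contains a given ingredient. When using KModes, if I replace the NaNs with 0s (`data = data.replace(np.nan, 0)`), I end up with one dense cluster (made up of the zeros) and a single value in each of the other clusters, because similarity is computed over both the 1s and the 0s. So the question is: how can I keep those entries as NaN in such a way that KModes does not take them into account?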

from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=20, init = "Cao", n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(data)
fitClusters_cao
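To make the problem concrete, here is a rough sketch of why the zero-filled matrix collapses into one dense cluster. It assumes rows are compared by counting mismatching columns (simple matching); the helper `matching_dissimilarity` and the two recipes are made up for illustration and are not part of the kmodes API.

```python
import numpy as np

n_ingredients = 8000  # roughly the number of ingredient columns in the question

# Two hypothetical recipes that share no ingredient at all
recipe_a = np.zeros(n_ingredients, dtype=int)
recipe_b = np.zeros(n_ingredients, dtype=int)
recipe_a[[389, 1527, 3406]] = 1
recipe_b[[2683, 4969, 800]] = 1

# A cluster mode that is 0 in every column
all_zero_mode = np.zeros(n_ingredients, dtype=int)


def matching_dissimilarity(x, y):
    """Count the columns in which two rows differ (illustrative helper)."""
    return int(np.sum(x != y))


d_ab = matching_dissimilarity(recipe_a, recipe_b)       # recipes vs. each other
d_a0 = matching_dissimilarity(recipe_a, all_zero_mode)  # recipe A vs. all-zero mode
d_b0 = matching_dissimilarity(recipe_b, all_zero_mode)  # recipe B vs. all-zero mode

print(d_ab, d_a0, d_b0)
```

Under this assumption both recipes are closer to the all-zero mode than to each other, and they still agree on almost every column, which is consistent with the single dense cluster described above.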

Example:

import pandas as pd
import numpy as np

data_as_dict = {'recipe_id': {0: 424415, 1: 424415, 2: 424415, 3: 424415, 4: 424415, 5: 146223, 6: 146223, 7: 146223, 8: 146223, 9: 146223, 10: 146223, 11: 146223, 12: 146223, 13: 146223, 14: 146223, 15: 146223, 16: 146223, 17: 312329, 18: 312329, 19: 312329}, 'ingredient_ids': {0: 389, 1: 7655, 2: 6270, 3: 1527, 4: 3406, 5: 2683, 6: 4969, 7: 800, 8: 5298, 9: 840, 10: 2499, 11: 6632, 12: 7022, 13: 1511, 14: 3248, 15: 4964, 16: 6270, 17: 1257, 18: 7655, 19: 6270}}

df = pd.DataFrame.from_dict(data_as_dict)


# Number of rows in which each ingredient appears, aligned to the original rows
df['counts'] = df\
.groupby(by = ['ingredient_ids'])['ingredient_ids'].transform('count')

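# Copy the slice so the assignment below does not write into a view of df
data_exploded = df[['recipe_id', 'ingredient_ids', 'counts']].copy()

# Presence marker: 1 for every (recipe, ingredient) pair that occurs
data_exploded['count'] = 1

data_exploded = data_exploded.drop('counts', axis = 1)

# Wide recipe x ingredient matrix; pairs that never occur come out as NaN
data_exploded = data_exploded.pivot_table(values = 'count', index = 'recipe_id', columns='ingredient_ids')

# Fill the missing pairs with 0
data_exploded = data_exploded.replace(np.nan, 0)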

from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=20, init = "Cao", n_init = 1, verbose=1)

fitClusters_cao = km_cao.fit_predict(data_exploded)
fitClusters_cao
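As a side note, the same 0/1 recipe-by-ingredient matrix can be built more directly. The following is a minimal sketch using `pd.crosstab` on a small subset of the sample rows above; the names `sample` and `binary` are placeholders introduced here, and this is an alternative to the pivot steps in the example, not the code from the question.

```python
import pandas as pd

# A few (recipe, ingredient) pairs taken from the sample data above
sample = pd.DataFrame({
    "recipe_id": [424415, 424415, 146223, 146223, 312329],
    "ingredient_ids": [389, 7655, 2683, 6270, 7655],
})

# crosstab counts occurrences of each pair; clip caps repeated pairs at 1
binary = pd.crosstab(sample["recipe_id"], sample["ingredient_ids"]).clip(upper=1)

print(binary)
```

Pairs that never occur are already 0 in the result, so no separate NaN replacement step is needed with this approach.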

Answer from a uj5u.com user:

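The way around this is to convert everything to string values (which kmodes can apparently handle). So, in pivot_table(), set fill_value = '', and, if you are working with binary data, also convert the 1s (and 0s) to string values.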

```
data_exploded[['count']] = '1'

data_exploded = data_exploded.drop('counts', axis = 1)

#data_exploded[['count']] = data_exploded[['count']].astype(int)
data_exploded = data_exploded.pivot_table(index = 'recipe_id', columns='ingredient_ids', values = 'count', fill_value = ' ', aggfunc='sum')
data_exploded
#data_exploded = data_exploded.replace(0, Na)

#data_exploded = data_exploded.replace(np.nan, 0)

from kmodes.kmodes import KModes
km_cao = KModes(n_clusters=25, init = "Cao", n_init = 1, verbose=1)

fitClusters_cao = km_cao.fit_predict(data_exploded)
fitClusters_cao
```
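The string-conversion idea can be tried on a tiny table first. This is a minimal sketch under a few assumptions: the toy rows and the names `long_df` and `wide` are made up for illustration, and `aggfunc="first"` is used here instead of the answer's `'sum'` because each (recipe, ingredient) pair occurs only once in the toy data.

```python
import pandas as pd

# Toy long-format data (hypothetical values, not the asker's dataset)
long_df = pd.DataFrame({
    "recipe_id": [1, 1, 2, 2, 3],
    "ingredient_ids": [10, 20, 20, 30, 10],
})

# String marker for "ingredient present"
long_df["count"] = "1"

wide = long_df.pivot_table(
    index="recipe_id",
    columns="ingredient_ids",
    values="count",
    aggfunc="first",  # assumption: every pair is unique, so take the value as-is
    fill_value="",    # empty string for "ingredient absent"
)

# Make sure every cell is a plain string before clustering
wide = wide.astype(str)

print(wide)
```

The resulting frame can then be passed to `km_cao.fit_predict(wide)` as in the answer. Whether the string encoding actually changes the cluster assignments compared with the 0/1 encoding is worth checking on the real data.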

Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/479922.html

Tags: Python, pandas, machine learning
