Redefining the categories of a categorical variable, ignoring case

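I have a dataset with a categorical variable that is not well encoded. The same category sometimes appears in upper case and sometimes in lower case (along with several mixed-case variants). Because the dataset is large, I would like to take advantage of the category dtype to harmonize the categories, which rules out any replace-based solution. The only solutions I found are this and this, but I feel they implicitly use replace.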

Below is a toy example together with the solution I tried.

from pandas import Series

# Create dataset
df = Series(["male", "female","Male", "FEMALE", "MALE", "MAle"], dtype="category", name = "NEW_TEST")

# Define the old, the "new" and the desired categories
original_categories = list(df.cat.categories)
standardised_categories = list(map(lambda x: x.lower(), df.cat.categories)) 
desired_new_cat = list(set(standardised_categories))

# Failed attempt to change categories   
df.cat.categories = standardised_categories
df = df.cat.rename_categories(standardised_categories)
# Error message: Categorical categories must be unique

Reply from a uj5u.com user:

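You should not try to harmonize after converting to category. Doing so defeats the purpose of the categorical dtype, because every distinct exact string becomes its own category.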
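To make this concrete, here is a minimal sketch (assuming pandas is installed; the variable names `raw` and `lowered` are illustrative and not from the original post). It shows that the six spellings in the toy example become six separate categories, and that lowercasing those labels produces duplicates, which is why renaming the categories in place is rejected:

```python
import pandas as pd

raw = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
                dtype="category", name="NEW_TEST")

# Every distinct spelling is stored as its own category.
print(len(raw.cat.categories))

# Lowercasing the labels collapses them onto two values, so the list of
# new labels contains duplicates and cannot serve as category names.
lowered = [label.lower() for label in raw.cat.categories]
print(sorted(set(lowered)))
```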

Instead, harmonize the case with str.capitalize first, and then convert to categorical:

import pandas as pd

s = (pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
               name="NEW_TEST")
       .str.capitalize().astype('category')
     )

If you already have a categorical, convert it back to strings and start over:

s = s.astype(str).str.capitalize().astype('category')

Output:

0      Male
1    Female
2      Male
3    Female
4      Male
5      Male
Name: NEW_TEST, dtype: category
Categories (2, object): ['Female', 'Male']
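If round-tripping a large categorical through strings is a concern, one alternative that may be worth trying is Series.map. As far as I know, on a categorical Series pandas applies the function to the categories rather than to every row, and when the mapping is not one-to-one (as it is here) the result is no longer categorical, so it has to be converted back. This is only a sketch under that assumption; `raw` and `harmonised` are illustrative names:

```python
import pandas as pd

raw = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
                dtype="category", name="NEW_TEST")

# str.lower is applied to the category labels; because several labels map
# to the same value, the mapped result is not categorical any more, so we
# convert it back explicitly.
harmonised = raw.map(str.lower).astype("category")

print(harmonised.cat.categories.tolist())
```

Whether this is actually faster than going through .str on your data is something to measure rather than assume.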

Reply from a uj5u.com user:

Given the Series df that the OP creates in the code sample shared in the question, one approach is to use pandas.Series.str.lower followed by .astype("category"), as follows:

df = df.str.lower().astype("category")

[Out]:

0      male
1    female
2      male
3    female
4      male
5      male

Printing the dtype gives:

CategoricalDtype(categories=['female', 'male'], ordered=False)
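One detail to watch with real data is missing values. A small sketch, assuming the column may contain None/NaN (the sample values and the names `with_gaps` and `cleaned` are made up for illustration): calling .str.lower() leaves missing entries as missing, so they should not turn into a category of their own.

```python
import pandas as pd

with_gaps = pd.Series(["male", None, "FEMALE", "Male"],
                      dtype="category", name="NEW_TEST")

# Lowercase the labels, then rebuild the categorical from the result.
cleaned = with_gaps.str.lower().astype("category")

print(cleaned.cat.categories.tolist())  # only the real labels
print(int(cleaned.isna().sum()))        # count of entries still missing
```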

