Convert a DataFrame string column into multiple columns and rearrange each column by label

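I want to convert a string column that holds several labels into a separate column per label, and rearrange the DataFrame so that identical labels end up in the same column. For example: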

ID	Label
0	apple, tom, car
1	apple, car
2	tom, apple

to

ID	Label	0	1	2
0	apple, tom, car	apple	car	tom
1	apple, car	apple	car	None
2	tom, apple	apple	None	tom

df["Label"].str.split(',', n=3, expand=True)

0	1	2
apple	tom	car
apple	car	None
tom	apple	None

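I know how to split the string column, but I can't work out how to sort the resulting label columns, especially because each sample has a different number of labels.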

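A uj5u.com user replied:

Here is one way to do it.

First, call df['Label'].apply() to replace each CSV string with a list, and at the same time fill a Python dict that maps each label to a new column index.

Then create a second DataFrame, df2, that holds the new label columns specified in the question.

Finally, concatenate the two DataFrames horizontally and drop the "Label" column.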

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID' : [0,1,2],
    'Label' : ['apple, tom, car', 'apple, car', 'tom, apple']
})

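# labelInfo holds the label -> column index dict and the next free index
labelInfo = [labels := {}, curLabelIdx := 0]
def foo(x, labelInfo):
    # Split the CSV string into a list of stripped labels
    theseLabels = [s.strip() for s in x.split(',')]
    labels, curLabelIdx = labelInfo
    for label in theseLabels:
        if label not in labels:
            # First time this label is seen: assign it the next column index
            labels[label] = curLabelIdx
            curLabelIdx += 1
    labelInfo[1] = curLabelIdx
    return theseLabels
df['Label'] = df['Label'].apply(foo, labelInfo=labelInfo)
# One row per sample; each known label goes in its own slot, or 'None' if absent
df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()), 
    columns = list(labels.values()))
df = pd.concat([df, df2], axis=1).drop(columns=['Label'])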

print(df)

Output:

   ID      0     1     2
0   0  apple   tom   car
1   1  apple  None   car
2   2  apple   tom  None

If you would rather name the new columns after the labels they contain, replace the df2 assignment line with:

df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()), 
    columns = list(labels))

Now the output is:

   ID  apple   tom   car
0   0  apple   tom   car
1   1  apple  None   car
2   2  apple   tom  None

A uj5u.com user replied:

Try:

df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
    df.set_index(["ID", "Label", "Col"])
    .unstack(2)
    .droplevel(0, axis=1)
    .reset_index()
)
df.columns.name = None
print(df)

Prints:

   ID            Label      0    1    2
0   0  apple, tom, car  apple  car  tom
1   1       apple, car  apple  car  NaN
2   2       tom, apple  apple  NaN  tom
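
To see why this works, it can help to look at the intermediate long-format frame before the unstack. The following is only a sketch of that intermediate step, rebuilt on the question's sample data; the names long_df and label_to_col are introduced here for illustration and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Label": ["apple, tom, car", "apple, car", "tom, apple"],
})

# One row per (ID, label) pair; "xxx" is the helper column name used in the answer
long_df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")

# ngroup() numbers the groups in sorted key order, so each label gets one
# fixed column index no matter where it appears in the original string
long_df["Col"] = long_df.groupby("xxx").ngroup()

# Hypothetical helper mapping, only for inspecting the label -> column assignment
label_to_col = dict(zip(long_df["xxx"], long_df["Col"]))

print(long_df)
print(label_to_col)
```

Because every label maps to a single "Col" value, unstacking on "Col" places identical labels in the same column and leaves NaN wherever a row lacks that label.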

A uj5u.com user replied:

I believe what you want is something like this:

import pandas as pd

data = {'Label': ['apple, tom, car', 'apple, car', 'tom, apple']}
df = pd.DataFrame(data)
print(f"df: \n{df}")

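def norm_sort(series):
    # Collect every distinct label across all rows, sorted alphabetically
    mask = []
    for line in series:
        mask.extend([l.strip() for l in line.split(',')])
    mask = sorted(set(mask))
    labels = []
    for line in series:
        # Compare whole labels rather than substrings, so 'car' cannot match 'carpet'
        present = {l.strip() for l in line.split(',')}
        labels.append(', '.join([m if m in present else 'None' for m in mask]))
    return labels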

df.Label = norm_sort(df.loc[:, 'Label'])
df = df.Label.str.split(', ', expand=True)
print(f"df: \n{df}")

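A uj5u.com user replied:

The goal of your program is unclear. If you are curious about which elements appear across the different rows, we can collect them all by stacking the DataFrame like this: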

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)

>>> final_df
0     apple
1    banana
2     grape
3     apple
4    banana
5    banana
6     grape

At this point we can drop the duplicates or count the occurrences of each element, depending on your use case.
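
As a rough sketch of those two follow-ups, assuming the same sample frame as above: drop_duplicates() gives the distinct labels and value_counts() gives how often each one appears. The names unique_labels and label_counts are made up for this example, and the explicit dropna() is an added precaution so that padding values from the split are excluded whatever stack() does with them.

```python
import pandas as pd

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})

# Split into columns, stack into one long Series, and discard the padding
# values that expand=True produces for rows with fewer labels
final_df = (
    df['label'].str.split(', ', expand=True)
    .stack()
    .dropna()
    .reset_index(drop=True)
)

# Distinct labels, in order of first appearance
unique_labels = final_df.drop_duplicates().tolist()

# Number of occurrences of each label
label_counts = final_df.value_counts().to_dict()

print(unique_labels)
print(label_counts)
```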

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/462150.html

Tags: Python, pandas, DataFrame, sorting
