How to split a DataFrame into several DataFrames

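How can I split a large DataFrame that has many categorical columns, each containing several labels or classes?

For example, I have 1 million rows and 100 columns, 50 of which hold categorical data with different labels.

How can I split the DataFrame into 2 or 3 parts (subsets) so that every label in the categorical columns appears in each of the 2 or 3 subsets? Is this feasible for a large dataset?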

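import numpy as np

# df (the full DataFrame) and df_cat_cols (list of categorical column names)
# are assumed to be defined elsewhere.
def rec():
    print('#rec Started')
    # Shuffle all rows, then assign each row to one of two halves at random
    shuf_data = df.sample(frac=1)
    ran_data = np.random.rand(len(shuf_data)) < 0.5
    p_d = shuf_data[ran_data]
    d = shuf_data[~ran_data]

    def rrec(p_d, d):
        print('#rrec Started')
        for col in df_cat_cols:
            p_dcol = p_d[col].unique()
            dcol = d[col].unique()
            # Check that every label seen in d also occurs in p_d
            outcome = all(elem in p_dcol for elem in dcol)
            if outcome:
                print("Yes, list1 contains all elements in list2")
            else:
                print("No, list1 does not contain all elements in list2")
                # Start over with a fresh random split
                return rec()
        return p_d, d

    return rrec(p_d, d)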

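Because the dataset is very large (1 million records), the code above causes the process to be terminated. Please suggest a better, more efficient approach. Thank you.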

Here is an example:

    Fruits  Color   Price
0   Banana  Yellow  60
1   Grape   Black   100
2   Apple   Red     200
3   Papaya  Yellow  50
4   Dragon  Pink    150
5   Mango   Yellow  400
6   Banana  Yellow  75
7   Grape   Black   106
8   Apple   Red     190
9   Papaya  Yellow  60
10  Dragon  Pink    120
11  Mango   Yellow  390

Expected 50:50 split:

df1:
3   Papaya  Yellow  50
4   Dragon  Pink    150
5   Mango   Yellow  400
6   Banana  Yellow  75
7   Grape   Black   106
8   Apple   Red     190

df2:
0   Banana  Yellow  60
1   Grape   Black   100
2   Apple   Red     200
9   Papaya  Yellow  60
10  Dragon  Pink    120
11  Mango   Yellow  390
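
The requirement illustrated above (every label present in every subset) can be checked without Python-level loops over labels. The following is a minimal sketch, assuming the example table from the question; the helper name covers_all_labels is hypothetical and not part of the original post.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple", "Papaya", "Dragon", "Mango"] * 2,
    "Color": ["Yellow", "Black", "Red", "Yellow", "Pink", "Yellow"] * 2,
    "Price": [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

def covers_all_labels(full, part, cat_cols):
    """Return True if `part` (a subset of `full`) has every label of each column."""
    # Because part is a subset of full, equal distinct counts imply equal label sets
    return all(part[col].nunique() == full[col].nunique() for col in cat_cols)

cat_cols = ["Fruits", "Color"]
df1 = df.loc[3:8]          # rows 3 through 8, as in the expected df1
df2 = df.drop(df1.index)   # the remaining rows 0-2 and 9-11

print(covers_all_labels(df, df1, cat_cols))
print(covers_all_labels(df, df2, cat_cols))
```

A check like this only validates a split after the fact; it does not produce one.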

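A uj5u.com user replied:

Why not try scikit-learn's train_test_split() method, together with OneHotEncoder() to break down the categorical columns? This is more of a machine-learning approach. I have used it before to split a million-row dataset, so it should work.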
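
As a rough illustration of this suggestion, the sketch below uses only train_test_split() with its stratify parameter on the question's example table; OneHotEncoder() is left out because no encoding is needed just to split rows. Stratifying on the single column "Fruits" is an assumption made for this example. With many categorical columns, stratification would need some combined key, and each label must occur at least twice for a two-way split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Example data copied from the question
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple", "Papaya", "Dragon", "Mango"] * 2,
    "Color": ["Yellow", "Black", "Red", "Yellow", "Pink", "Yellow"] * 2,
    "Price": [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

# Stratify on one categorical column so both halves keep its label proportions.
# random_state is fixed only to make the sketch reproducible.
df1, df2 = train_test_split(
    df, test_size=0.5, stratify=df["Fruits"], random_state=0
)

print(df1.sort_index())
print(df2.sort_index())
```

In this small example each fruit occurs exactly twice, so a 50:50 stratified split places one row of each fruit in each half.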

A uj5u.com user replied:

Yes. One way is to enumerate all rows that share the same categories:

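# Names of the categorical columns to group on
cat_cols = ['cat_col1', 'cat_col2']

# Number the rows within each category combination (0, 1, 2, ...), then
# integer-divide so every run of 3 rows per combination shares a chunk id
groups = df.groupby(cat_cols).cumcount() // 3

# Map each chunk id to the sub-DataFrame holding its rows
sub_df = {g: d for g, d in df.groupby(groups)}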
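
Note that with cumcount() // 3 the number of resulting chunks grows with the size of the largest group. If a fixed number of subsets is wanted, as in the question, one possible variant is to take the per-group row number modulo the number of parts. The sketch below is an assumption-laden illustration on the question's example table, not the answerer's code: n_parts, part_id, and parts are hypothetical names, and every category combination must have at least n_parts rows for each part to receive it.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "Fruits": ["Banana", "Grape", "Apple", "Papaya", "Dragon", "Mango"] * 2,
    "Color": ["Yellow", "Black", "Red", "Yellow", "Pink", "Yellow"] * 2,
    "Price": [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

cat_cols = ["Fruits", "Color"]
n_parts = 2  # hypothetical: number of subsets wanted

# Shuffle first so that which row lands in which part is random
shuffled = df.sample(frac=1, random_state=0)

# Row number within each category combination, dealt round-robin to the parts
part_id = shuffled.groupby(cat_cols).cumcount() % n_parts

# Map each part id (0 .. n_parts - 1) to its sub-DataFrame
parts = {g: d.sort_index() for g, d in shuffled.groupby(part_id)}

for g, d in parts.items():
    print(g, len(d), sorted(d["Fruits"].unique()))
```

This is fully vectorized and makes a single pass over the data, so it avoids the repeated re-sampling of the recursive attempt in the question.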
