如何根據列名將資料框拆分為多個資料框？-有解無憂

我有一個包含如下列的資料框：

['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 
 't_energy1', 't_energy2']

我有一個代碼，它回傳同名的三列的平均值：

# Takes in a dataframe with three columns and returns a dataframe with one column of their means as integers
def average_column(dataframe):

    dataframe = dataframe.copy()                  # To avoid SettingWithCopyWarning

    # Create new column name without integers
    temp = dataframe.columns.tolist()[0]
    col_name = temp.rstrip(temp[2:-1])

    dataframe[col_name] = dataframe.mean(axis=1)  # Add column to the dataframe (axis=1 means the mean() is applied row-wise)
    mean_df = dataframe.iloc[: , -1:]             # Isolated column of the mean by selecting all rows (:) for the last column (-1:)

    print("Original:\n{}\nAverage columns:\n{}".format(dataframe, mean_df))

    return mean_df.astype(float)

這個函式給了我這個輸出：

Original:
      t_dance0  t_dance1  t_dance2     dance
0        0.549     0.623    0.5190  0.563667
1        0.871     0.702    0.4160  0.663000
2        0.289     0.328    0.2340  0.283667
3        0.886     0.947    0.8260  0.886333
4        0.724     0.791    0.7840  0.766333
...        ...       ...       ...       ...

Average columns:
         dance
0     0.563667
1     0.663000
2     0.283667
3     0.886333
4     0.766333
...        ...

我問了這個關于如何將其拆分為唯一列和重復列的問題。這導致我看到這段代碼：

# Function that splits dataframe into two separate dataframes, one with all unique 
columns and one with all duplicates
def sub_dataframes(dataframe):

    # Extract common prefix -> remove trailing digits
    cols = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()

    # Split columns
    unq_cols = cols[cols == 1].index
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]  # All columns from dataframe that is not in unq_cols

    return dataframe[unq_cols], dataframe[dup_cols]

unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df, dup_df))

這給了我這個輸出：

Unique columns:

          id
0      22352
1     106534
2      23608
3       8655
4      49670
...      ...

Duplicate columns:

      t_dur0    t_dur1    t_dur2  t_dance0  t_dance1  t_dance2
0     292720  293760.0  292733.0     0.549     0.623    0.5190
1     213760  181000.0  245973.0     0.871     0.702    0.4160
2     157124  130446.0  152450.0     0.289     0.328    0.2340 
3     127896  176351.0  166968.0     0.886     0.947    0.8260
4     210320  226253.0  211880.0     0.724     0.791    0.7840
...      ...       ...       ...       ...       ...       ...
2828   70740  262400.0  220680.0     0.224     0.609    0.7110 
2829  252226  222400.0  214973.0     0.526     0.623    0.4820
2830  269146  251560.0  172760.0     0.551     0.756    0.7820
2831  344764  425613.0  249652.0     0.473     0.572    0.8230
2832  210955  339869.0  304124.0     0.112     0.523    0.0679

我試圖將這些函陣列合在另一個函式中，該函式接收一個資料幀并回傳資料幀，其中所有重復的列都被它們的平均值替換，但是我無法將它們拆分dups_df成更小的資料幀。有沒有更簡單的方法可以做到這一點？

所需輸出的示例：

Original: 
      total_tracks  t_dur0    t_dur1    t_dur2  t_dance0  t_dance1  t_dance2  \
0                4  292720  293760.0  292733.0     0.549     0.623    0.5190   
1               12  213760  181000.0  245973.0     0.871     0.702    0.4160   
2               59  157124  130446.0  152450.0     0.289     0.328    0.2340   
3                8  127896  176351.0  166968.0     0.886     0.947    0.8260   
4               17  210320  226253.0  211880.0     0.724     0.791    0.7840   
...            ...     ...       ...       ...       ...       ...       ...   

After function: 
    total_tracks popularity    duration     dance   
0        4          21      293071.000000  0.563667
1       12          14      213577.666667  0.663000
2       59          41      146673.333333  0.283667
3        8           1      157071.666667  0.886333
4       17          47      216151.000000  0.766333
        ...         ...         ...          ...

uj5u.com熱心網友回復：

用于wide_to_long先重塑原始 DataFrame，然后再聚合mean：

cols = ['total_tracks']
df1 = (pd.wide_to_long(df, 
                     stubnames=['t_dur','t_dance'], 
                     i=cols,
                     j='tmp')
        .reset_index()
        .drop('tmp', 1)
        .groupby(cols, as_index=False)
        .mean())

print (df1)
   total_tracks          t_dur   t_dance
0             4  293071.000000  0.563667
1             8  157071.666667  0.886333
2            12  213577.666667  0.663000
3            17  216151.000000  0.766333
4            59  146673.333333  0.283667

詳情：

cols = ['total_tracks']
print(pd.wide_to_long(df, 
                     stubnames=['t_dur','t_dance'], 
                     i=cols,
                     j='tmp'))

                     t_dur  t_dance
total_tracks tmp                   
4            0    292720.0    0.549
12           0    213760.0    0.871
59           0    157124.0    0.289
8            0    127896.0    0.886
17           0    210320.0    0.724
4            1    293760.0    0.623
12           1    181000.0    0.702
59           1    130446.0    0.328
8            1    176351.0    0.947
17           1    226253.0    0.791
4            2    292733.0    0.519
12           2    245973.0    0.416
59           2    152450.0    0.234
8            2    166968.0    0.826
17           2    211880.0    0.784

轉載請註明出處，本文鏈接：https://www.uj5u.com/gongcheng/455785.html

標籤：Python 熊猫数据框

上一篇：在R中構建JSON資料

下一篇：移動資料框并填充NaN