我有一個包含如下列的資料框:
['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0',
't_energy1', 't_energy2']
我有一個代碼,它回傳同名的三列的平均值:
# Takes in a dataframe with three columns and returns a dataframe with one column of their means as integers
def average_column(dataframe):
dataframe = dataframe.copy() # To avoid SettingWithCopyWarning
# Create new column name without integers
temp = dataframe.columns.tolist()[0]
col_name = temp.rstrip(temp[2:-1])
dataframe[col_name] = dataframe.mean(axis=1) # Add column to the dataframe (axis=1 means the mean() is applied row-wise)
mean_df = dataframe.iloc[: , -1:] # Isolated column of the mean by selecting all rows (:) for the last column (-1:)
print("Original:\n{}\nAverage columns:\n{}".format(dataframe, mean_df))
return mean_df.astype(float)
這個函式給了我這個輸出:
Original:
t_dance0 t_dance1 t_dance2 dance
0 0.549 0.623 0.5190 0.563667
1 0.871 0.702 0.4160 0.663000
2 0.289 0.328 0.2340 0.283667
3 0.886 0.947 0.8260 0.886333
4 0.724 0.791 0.7840 0.766333
... ... ... ... ...
Average columns:
dance
0 0.563667
1 0.663000
2 0.283667
3 0.886333
4 0.766333
... ...
我問了這個關于如何將其拆分為唯一列和重復列的問題。這導致我看到這段代碼:
# Function that splits dataframe into two separate dataframes, one with all unique
columns and one with all duplicates
def sub_dataframes(dataframe):
# Extract common prefix -> remove trailing digits
cols = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
# Split columns
unq_cols = cols[cols == 1].index
dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)] # All columns from dataframe that is not in unq_cols
return dataframe[unq_cols], dataframe[dup_cols]
unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df, dup_df))
這給了我這個輸出:
Unique columns:
id
0 22352
1 106534
2 23608
3 8655
4 49670
... ...
Duplicate columns:
t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2
0 292720 293760.0 292733.0 0.549 0.623 0.5190
1 213760 181000.0 245973.0 0.871 0.702 0.4160
2 157124 130446.0 152450.0 0.289 0.328 0.2340
3 127896 176351.0 166968.0 0.886 0.947 0.8260
4 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ...
2828 70740 262400.0 220680.0 0.224 0.609 0.7110
2829 252226 222400.0 214973.0 0.526 0.623 0.4820
2830 269146 251560.0 172760.0 0.551 0.756 0.7820
2831 344764 425613.0 249652.0 0.473 0.572 0.8230
2832 210955 339869.0 304124.0 0.112 0.523 0.0679
我試圖將這些函陣列合在另一個函式中,該函式接收一個資料幀并回傳資料幀,其中所有重復的列都被它們的平均值替換,但是我無法將它們拆分dups_df成更小的資料幀。有沒有更簡單的方法可以做到這一點?
所需輸出的示例:
Original:
total_tracks t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2 \
0 4 292720 293760.0 292733.0 0.549 0.623 0.5190
1 12 213760 181000.0 245973.0 0.871 0.702 0.4160
2 59 157124 130446.0 152450.0 0.289 0.328 0.2340
3 8 127896 176351.0 166968.0 0.886 0.947 0.8260
4 17 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ... ...
After function:
total_tracks popularity duration dance
0 4 21 293071.000000 0.563667
1 12 14 213577.666667 0.663000
2 59 41 146673.333333 0.283667
3 8 1 157071.666667 0.886333
4 17 47 216151.000000 0.766333
... ... ... ...
uj5u.com熱心網友回復:
用于wide_to_long先重塑原始 DataFrame,然后再聚合mean:
cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
stubnames=['t_dur','t_dance'],
i=cols,
j='tmp')
.reset_index()
.drop('tmp', 1)
.groupby(cols, as_index=False)
.mean())
print (df1)
total_tracks t_dur t_dance
0 4 293071.000000 0.563667
1 8 157071.666667 0.886333
2 12 213577.666667 0.663000
3 17 216151.000000 0.766333
4 59 146673.333333 0.283667
詳情:
cols = ['total_tracks']
print(pd.wide_to_long(df,
stubnames=['t_dur','t_dance'],
i=cols,
j='tmp'))
t_dur t_dance
total_tracks tmp
4 0 292720.0 0.549
12 0 213760.0 0.871
59 0 157124.0 0.289
8 0 127896.0 0.886
17 0 210320.0 0.724
4 1 293760.0 0.623
12 1 181000.0 0.702
59 1 130446.0 0.328
8 1 176351.0 0.947
17 1 226253.0 0.791
4 2 292733.0 0.519
12 2 245973.0 0.416
59 2 152450.0 0.234
8 2 166968.0 0.826
17 2 211880.0 0.784
轉載請註明出處,本文鏈接:https://www.uj5u.com/gongcheng/455785.html
上一篇:在R中構建JSON資料
下一篇:移動資料框并填充NaN
