How to split data by a delimiter into corresponding name and value columns

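I'm trying to work with an Excel file that was put together in a rather annoying format (I didn't create it; it's an existing resource I'm using). The values of interest sit in a column called (something like) All_Values, separated by periods, while the measures those values correspond to are given in a separate column, All_Measures, also separated by periods and different for each row. For example, with a toy dataset: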

Object        All_Measures  All_Values     (additional columns that are not like this)
     1       Height.Weight      20.50      ...
     2       Weight.Height      65.30      ...
     3  Height.Width.Depth   22.30.10      ...

What I want to do is reformat the data like this, filling missing values with 0 (the final order of the columns doesn't matter):

Object  Height  Weight  Width  Depth  (additional columns)
     1      20      50      0      0  ...
     2      30      65      0      0  ...
     3      22       0     30     10  ...

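One way I could do this (very slowly, since it's a large dataset) is to create a new blank dataframe, then iterate over every row of the existing dataframe, building a one-row dataframe whose columns come from splitting All_Measures on "." and whose values come from splitting All_Values on ".". I would then drop All_Measures and All_Values from the row, attach the new one-row dataframe to the end of it, and append the result to the blank dataframe. But this is very clunky, and it would be nice if there were a faster, more elegant way to do it.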

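Since there's no error here I don't have an MWE, but here is some code you can copy to create a toy dataset like the one above, in case it comes in handy.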

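import pandas as pd

# the rows go first; columns is passed as a keyword argument after them
df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)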

A uj5u.com community member replied:

Use str.split, explode, and pivot_table:

# split the "All" columns into lists
df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

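# explode the lists into rows (exploding several columns at once needs pandas 1.3+)
df = df.explode(['All_Measures', 'All_Values'])

# the split values are still strings; make them numeric so pivot_table can aggregate them
df['All_Values'] = df['All_Values'].astype(int)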

# pivot the measures into columns
df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)

Output:

All_Measures       Depth Height Weight Width
Object Object_Name                          
1      First           0     20     50     0
2      Second          0     30     65     0
3      Third          10     22      0    30

Detailed breakdown

str.split turns the "All" columns into lists:

df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

#    Object            All_Measures    All_Values Object_Name
# 0       1        [Height, Weight]      [20, 50]       First
# 1       2        [Weight, Height]      [65, 30]      Second
# 2       3  [Height, Width, Depth]  [22, 30, 10]       Third

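explode expands the lists into rows:

df = df.explode(['All_Measures', 'All_Values'])
df['All_Values'] = df['All_Values'].astype(int)  # strings -> ints so pivot_table can aggregate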

#    Object All_Measures All_Values Object_Name
# 0       1       Height         20       First
# 0       1       Weight         50       First
# 1       2       Weight         65      Second
# 1       2       Height         30      Second
# 2       3       Height         22       Third
# 2       3        Width         30       Third
# 2       3        Depth         10       Third
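
One thing worth checking before the explode step: exploding two columns together only works when every row has as many measures as values, and pandas raises a ValueError when the counts differ. Below is a minimal sketch of a pre-check on the toy frame from the question. The helper name find_mismatched_rows is made up for illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)


def find_mismatched_rows(frame):
    """Return the rows whose number of measures differs from their number of values."""
    n_measures = frame['All_Measures'].str.split('.').str.len()
    n_values = frame['All_Values'].str.split('.').str.len()
    return frame[n_measures != n_values]


# Empty for the toy data, so the multi-column explode is safe here.
bad_rows = find_mismatched_rows(df)
```

If bad_rows is not empty, those rows probably need fixing or dropping before explode is called.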

pivot_table pivots the measures into columns:

df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)

# All_Measures       Depth Height Weight Width
# Object Object_Name                          
# 1      First           0     20     50     0
# 2      Second          0     30     65     0
# 3      Third          10     22      0    30
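
The three steps can also be wrapped into one function that leaves any additional columns untouched, which is closer to the layout asked for in the question. This is only a sketch under a few assumptions: Object uniquely identifies a row, no measure is repeated within a row (pivot raises on duplicates), and every value is an integer. The name widen_measures is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)


def widen_measures(frame, key='Object'):
    """Spread the dotted measure/value pairs into one column per measure."""
    long = frame[[key, 'All_Measures', 'All_Values']].copy()
    long['All_Measures'] = long['All_Measures'].str.split('.')
    long['All_Values'] = long['All_Values'].str.split('.')
    # Exploding several columns at once needs pandas 1.3 or newer.
    long = long.explode(['All_Measures', 'All_Values'])
    long['All_Values'] = long['All_Values'].astype(int)

    wide = (
        long.pivot(index=key, columns='All_Measures', values='All_Values')
            .fillna(0)      # measures a row does not have become 0
            .astype(int)
            .reset_index()
    )
    wide.columns.name = None  # drop the leftover 'All_Measures' axis label

    # Every column other than the two dotted ones is carried over as-is.
    others = frame.drop(columns=['All_Measures', 'All_Values'])
    return others.merge(wide, on=key)


result = widen_measures(df)
```

Here result has one row per Object, with Object_Name kept as an ordinary column next to Depth, Height, Weight and Width.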

A uj5u.com community member replied:

There may be a way to do this without a loop or apply(), but I can't think of one. Here is what I came up with:

import pandas as pd
df = pd.DataFrame(
    [[1,'Height.Weight','20.50','First'], 
     [2,'Weight.Height','65.30','Second'], 
     [3,'Height.Width.Depth','22.30.10','Third']],
    columns = ['Object','All_Measures','All_Values','Object_Name'],
)

def parse_combined_measure(row):
    keys = row["All_Measures"].split(".")
    values = row["All_Values"].split(".")
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    return pd.concat([row, pd.Series(dict(zip(keys, values)))])

df2 = df.apply(parse_combined_measure, axis=1)
df2 = df2.fillna(0)
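
A variation on the same idea that avoids apply(): build one dict per row with a plain comprehension and let the DataFrame constructor line the keys up into columns. This is a sketch rather than a benchmarked replacement. It assumes all values are integers, and the names records, parsed and df3 are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# One {measure: value} dict per row; zip pairs each name with its own value,
# so the order of measures within a row does not matter.
records = [
    {name: int(value) for name, value in zip(measures.split('.'), values.split('.'))}
    for measures, values in zip(df['All_Measures'], df['All_Values'])
]

# Keys missing from a row come out as NaN, which is then replaced by 0.
parsed = pd.DataFrame(records, index=df.index).fillna(0).astype(int)

# Swap the two dotted columns for the parsed ones.
df3 = df.drop(columns=['All_Measures', 'All_Values']).join(parsed)
```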

A uj5u.com community member replied:

# Create a new DataFrame with just the values extracted from the All_Values column
In [24]: new_df = df['All_Values'].str.split('.').apply(pd.Series)
Out[24]:
    0   1    2
0  20  50  NaN
1  65  30  NaN
2  22  30   10

# Figure out the names those columns should have
In [37]: df.loc[df['All_Measures'].str.count(r'\.').idxmax(), 'All_Measures']
Out[37]: 'Height.Width.Depth'

In [38]: new_df.columns = df.loc[df['All_Measures'].str.count(r'\.').idxmax(), 'All_Measures'].split('.')

In [39]: new_df
Out[39]:
  Height Width Depth
0     20    50   NaN
1     65    30   NaN
2     22    30    10

# Join the new DF with the original, except the columns we've expanded.
In [41]: df[['Object', 'Object_Name']].join(new_df)
Out[41]:
   Object Object_Name Height Width Depth
0       1       First     20    50   NaN
1       2      Second     65    30   NaN
2       3       Third     22    30    10
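
Note that this approach names the columns by position, taking the names from the longest All_Measures entry, so it only lines up when every row lists its measures in the same order. In the toy data that is not the case: in Out[41] the 50 in row 0 is a Weight but ends up under Width. A rough sketch of a check for when positional naming is safe (shares_measure_order is a hypothetical helper, not a pandas function):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)


def shares_measure_order(frame):
    """True only if every row's measures are a prefix of the longest row's measures."""
    split = frame['All_Measures'].str.split('.')
    longest = split.loc[split.str.len().idxmax()]
    return all(names == longest[:len(names)] for names in split)


# False for the toy data: row 0 is Height.Weight, but the longest row starts Height.Width.
safe = shares_measure_order(df)
```

When the check fails, pairing each name with its value per row (as the other answers do) is the safer route.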

