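How to group a Pandas DataFrame by one column or another

Dear Pandas DataFrame experts,

I have been using pandas DataFrames to help rewrite the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).

I have been grouping and aggregating data over fields such as study_name and x_ray_system_name.

An example DataFrame might contain the following data: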

study_name   request_name   total_dlp   x_ray_system_name
      head           head        50.0         All systems
      head           head       100.0         All systems
      head            NaN       200.0         All systems
     blank            NaN        75.0         All systems
     blank            NaN       125.0         All systems
     blank           head       400.0         All systems
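
For anyone who wants to follow along, the table above can be rebuilt roughly like this. This is only a sketch: the variable name df and the use of np.nan for the missing request_name values are assumptions, since the question shows the printed frame rather than the code that made it.

```python
import numpy as np
import pandas as pd

# Rebuild the example table; NaN marks a missing request_name.
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})
print(df)
```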

The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:

df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})

The result is:

                                 total_dlp
                                     count         mean
x_ray_system_name   study_name   
All systems         blank                3   200.000000
                    head                 3   116.666667

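I now need to be able to calculate the mean of the total_dlp data grouped over entries in either study_name or request_name. So in the example above, I would like the "head" mean to include the three rows whose study_name is "head", plus the one row whose request_name is "head" but whose study_name is not.

I would like the result to look like this: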

                                 total_dlp
                                     count         mean
x_ray_system_name   name   
All systems         blank                3   200.000000
                    head                 4   187.500000
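
As a sanity check on the numbers above, the "head" figures can be reproduced by hand with a boolean mask that is true when "head" appears in either column. This is a sketch under the assumption that the sample data is held in a frame called df, built as shown in the code. Note that the last row (study_name "blank", request_name "head", total_dlp 400.0) counts toward both groups, which is why the counts add up to 7 rather than 6.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# A row belongs to "head" if the name appears in either column.
# Comparing NaN with "head" gives False, so missing names never match.
is_head = (df["study_name"] == "head") | (df["request_name"] == "head")
head_dlp = df.loc[is_head, "total_dlp"]

print(head_dlp.count(), head_dlp.mean())  # 4 187.5
```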

Does anyone know how I can group by the categories in one field or the other?

Any help you can offer would be much appreciated.

Kind regards,

David

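A uj5u.com user replied:

Your (groupby) data is essentially the union of the following:

the rows where study_name == request_name, taken once
the rows where study_name != request_name, duplicated: one copy for study_name, one for request_name

We can do the duplication with melt: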

(pd.concat([df.query('study_name==request_name')    # equal part
              .drop('request_name', axis=1),        # remove so `melt` doesn't duplicate this data
            df.query('study_name!=request_name')])  # not equal part
   .melt(['x_ray_system_name','total_dlp'])         # melt to duplicate
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)
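
To see what melt is doing in the snippet above, it may help to look at the intermediate long-format frame on a tiny input. The two-row frame small below is made up purely for illustration. Each input row appears once per name column, and the later groupby skips rows whose key is missing because groupby drops NaN keys by default.

```python
import pandas as pd

# Hypothetical two-row frame, only to show the reshaping.
small = pd.DataFrame({
    "study_name": ["blank", "head"],
    "request_name": ["head", None],
    "total_dlp": [400.0, 200.0],
    "x_ray_system_name": ["All systems", "All systems"],
})

# id_vars stay as they are; every other column is stacked into
# a pair of columns named "variable" and "value".
long = small.melt(["x_ray_system_name", "total_dlp"])
print(long)
```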

Update: editing the code above helped me realize that we can simplify:

# mask `request_name` with `NaN` where they equal `study_name`
# so they are ignored when duplicate/mean
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
   .melt(['x_ray_system_name','total_dlp']) 
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Output:

x_ray_system_name  value
All systems        blank    200.0
                   head     187.5
Name: total_dlp, dtype: float64
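
The output above only has the mean, whereas the question asked for both count and mean under an index level called name. One possible adaptation, sketched here on the assumption that df holds the sample data, is to pass value_name="name" to melt and swap the final mean() for agg:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

result = (
    # Blank out request_name where it repeats study_name, so that
    # such rows are only counted once after melting.
    df.assign(request_name=df["request_name"].mask(df["study_name"] == df["request_name"]))
      .melt(["x_ray_system_name", "total_dlp"], value_name="name")
      .groupby(["x_ray_system_name", "name"])
      .agg({"total_dlp": ["count", "mean"]})
)
print(result)
```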

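A uj5u.com user replied:

I have a similar approach to @QuangHoang's, but with a different order of operations.

Here I am using the original (range) index to decide which duplicated rows to drop.

You can melt, drop_duplicates, dropna and groupby: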

(df.reset_index()
   .melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
   .drop_duplicates(['index', 'value'])
   .dropna(subset=['value'])
   .groupby(["x_ray_system_name", 'value'])
   .agg({"total_dlp": ["count", "mean"]})
)

Output:

                        total_dlp       
                            count   mean
x_ray_system_name value                 
All systems       blank         3  200.0
                  head          4  187.5
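
To illustrate why the drop_duplicates step on ['index', 'value'] matters, here is a small comparison, again assuming df holds the sample data. The names long, naive and deduped are made up for this sketch. Without the de-duplication, the two rows that have "head" in both columns are counted twice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Keep the original row number as a column so it survives the melt.
long = (
    df.reset_index()
      .melt(id_vars=["index", "total_dlp", "x_ray_system_name"])
      .dropna(subset=["value"])
)

# Without de-duplication: rows 0 and 1 contribute to "head" twice.
naive = long.groupby("value")["total_dlp"].agg(["count", "mean"])

# With de-duplication: one entry per (original row, name) pair.
deduped = (
    long.drop_duplicates(["index", "value"])
        .groupby("value")["total_dlp"]
        .agg(["count", "mean"])
)

print(naive.loc["head"].tolist())    # [6.0, 150.0]
print(deduped.loc["head"].tolist())  # [4.0, 187.5]
```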
