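I use both dplyr and data.table, so I'm interested in a solution using either package.
I often want to summarize data with summary statistics both by some grouping variable and over the whole dataset (the grand mean); let's take the mean as an example. I then usually combine the two into one output data frame for display, labeling the grand-mean row as "total" or "overall" in the grouping variable column.
Here is how I usually do this, with both dplyr and data.table: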
dplyr
library(dplyr)
d <- tibble(grp = rep(letters[1:3], 10), v = 1:30)
group_means <- d %>%
  group_by(grp) %>%
  summarize(v_mean = mean(v))
grand_means <- d %>%
  summarize(v_mean = mean(v)) %>%
  mutate(grp = 'overall')
bind_rows(group_means, grand_means)
data.table
library(data.table)
d <- data.table(grp = rep(letters[1:3], 10), v = 1:30)
group_means <- d[, .(v_mean = mean(v)), by = .(grp)]
grand_means <- d[, .(v_mean = mean(v))]
grand_means[, grp := 'overall']
rbindlist(list(group_means, grand_means), use.names = TRUE)
My issue is that this isn't very concise. It's not that bad in this example, but if I have to calculate a large number of summary statistics, I have to repeat the same code twice. My question is, is there an idiomatic and concise way to get both grouped and overall summary statistics in either dplyr or data.table?
Reply from a uj5u.com user:
Here is a concise dplyr approach:
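# Appends one row holding the grand mean; this works for the mean only,
# because the mean of that single row is the row's own value
d %>%
  add_row(grp = 'overall', v = mean(.$v)) %>%
  group_by(grp) %>%
  summarise(mean_v = mean(v))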
Another option, which avoids computing the summary statistic twice:
d %>%
  bind_rows(mutate(., grp = 'overall')) %>%
  group_by(grp) %>%
  summarise(mean_v = mean(v))
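The second pattern carries over to several statistics at once, because the stacked copy holds the full data under the "overall" label. The following is a minimal sketch, assuming the same `d` as in the question; the output column names `v_mean`, `v_sd`, and `n` and the result name `res` are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Same example data as in the question
d <- tibble(grp = rep(letters[1:3], 10), v = 1:30)

res <- d %>%
  # Stack a copy of the data, relabelled "overall", under the original rows
  bind_rows(mutate(d, grp = "overall")) %>%
  group_by(grp) %>%
  # Each statistic is written once and applies to every group,
  # including the "overall" pseudo-group
  summarise(
    v_mean = mean(v),
    v_sd   = sd(v),
    n      = n()
  )

res
```

The trade-off is that the data is temporarily duplicated, which may matter for large inputs.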
Reply from a uj5u.com user:
I deleted my previous answer when I discovered this neat data.table function:
data.table::cube(d, mean(v), by = c("grp"))
This gives you the (sub)totals for your groups:
    grp   V1
1:    a 14.5
2:    b 15.5
3:    c 16.5
4: <NA> 15.5
Including replacement of the NA and a proper column name:
data.table::cube(d, .(v_mean = mean(v)), by = c("grp"))[is.na(grp), grp := "overall"][]
       grp v_mean
1:       a   14.5
2:       b   15.5
3:       c   16.5
4: overall   15.5
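Since the question mentions computing many summary statistics, here is a minimal sketch of the same `cube()` call with several statistics in `j`. It assumes the `d` from the question; the column names `v_sd` and `n` and the result name `res` are illustrative and do not come from the original answer.

```r
library(data.table)

# Same example data as in the question
d <- data.table(grp = rep(letters[1:3], 10), v = 1:30)

# cube() evaluates j once per grouping set: here by grp, and over all rows
res <- cube(d, .(v_mean = mean(v), v_sd = sd(v), n = .N), by = "grp")

# The grand-total row comes back with NA in grp; label it "overall"
res[is.na(grp), grp := "overall"]

res[]
```

If `grp` can itself contain NA, relabelling by `is.na(grp)` becomes ambiguous; in that case the `id = TRUE` argument of `cube()`, which adds a `grouping` column identifying the grouping set of each row, is probably the safer way to pick out the total row.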
More information can be found here: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/groupingsets
Tags: r dplyr data.table
