大家好!我很少提出新問題,因為在這個論壇上已經說了這么多,但是這次我找不到足夠的材料來解決我的性能問題。
我基本上有調查資料,我想從中計算品牌層面的各種指標。訣竅是我需要為回圈的每個元素創建向量的變體,方法是排除與測驗的第 i 個元素相關的所有元素。目前我還沒有找到一種方法來矢量化我的代碼。因此,我的lapply回圈非常緩慢(到目前為止,這是較大腳本中最慢的部分)。
我的資料集有 800 萬行長,我回圈了 70 多個品牌,因此此時性能開始變得重要。請參閱您自己的測驗的更短的可重現示例:
(編輯:添加到腳本中的注釋以便更好地理解。)
# Small sample size to experiment
N <- 200L
# Table with survey data :
# - each observation is the answer of a person about a brand
# - each observation is associated to a weight, used to compute statistics (frequencies, means...)
# - each person is the described by few socio-demographic variables (country, gender, age)
# - brands are given a grade ('score' variable), ranging from 0 to 10
repex_DT <- data.table (
country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
brand = factor(sample(c(toupper(letters[1:26])), size = N, replace=TRUE)),
score = sample(x = c(0:10), size = N, replace=TRUE),
weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)
# The loop computes for each "country x gender x age_group x brand" combination :
# - the "country x gender x age_group" socio-demographic group size (cases_total, i.e. all brands included)
# - the "country x gender x age_group" group size, excluding the people assessing the current 'brand_' argument
# - Same logic for mean and standard deviation indicators
current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
# Calculations done for each 'brand_' of each "country x gender x age_group" combination
out <- repex_DT[ , .(
cases_total = sum(x=weight, na.rm=TRUE),
cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
sd_others = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
), by = .(country, gender, age_group)]
out[, brand := brand_]
data.table::setcolorder(x=out, neworder="brand")
return(data.table::copy(x=out))})) %>%
# Sorting at the end for better readability
.[order(., country, gender, age_group, brand)]
So far I have read plenty of other SO questions like this one, this other one and others on the same topic, so I am well aware that loops extending a data.table is both memory and time consuming. Yet I haven't found an other way to get me where I want. Hope you R experts can :-)
And by the way, I use expss to compute weighted means and standard deviations because I also use the package to compute tables here and there, but but I certainly could use other packages if that could help performance-wise.
uj5u.com熱心網友回復:
這是一個更快的方法
- 獲取總計的 n,mean,sd
cases_total = repex_DT[, .(cases_total = sum(weight, na.rm=T),
mean_total = expss::w_mean(score, weight, na.rm=T),
sd_total = expss::w_sd(score, weight, na.rm=T)), .(country, gender, age_group)]
- 獲取每個品牌的 n、mean、sd
cases_brand = repex_DT[, .(cases_brand = sum(weight, na.rm=T),
mean_brand = expss::w_mean(score, weight, na.rm=T),
sd_brand = expss::w_sd(score, weight, na.rm=T)), .(brand,country, gender, age_group)]
- 將這些合并在一起
result = cases_brand[cases_total, on=.(country, gender, age_group)][, cases_others:=cases_total-cases_brand]
- 很容易對“其他人”(即非品牌)變得刻薄
result[, mean_others:= (cases_total*mean_total - cases_brand*mean_brand)/cases_others]
- 現在,一個小函式來獲取“其他人”的 sd
sd_other <- function(n1,n2,total_sd,sub_sd,total_mean, sub_mean ,other_mean) {
sqrt(
(total_sd^2*(n1 n2-1) - (n1-1)*sub_sd^2 - n1*(sub_mean-total_mean)^2 - n2*(other_mean-total_mean)^2)/(n2-1)
)
}
- 應用該功能來獲取其他人的 sd
result[, sd_others:= sd_other(cases_brand, cases_others,sd_total,sd_brand,mean_total,mean_brand, mean_others)]
- 洗掉不必要的列并設定順序
result[, `:=`(cases_brand=NULL, mean_brand=NULL, sd_brand=NULL, mean_total=NULL, sd_total=NULL)]
setorder(result, country, gender, age_group, brand)
比較:
> microbenchmark::microbenchmark(list=cmds, times=10)
Unit: milliseconds
expr min lq mean median uq max neval
current_loop 3684.4233 3700.1134 3775.4322 3729.8387 3855.5353 3938.4605 10
new_approach 155.9486 158.2265 164.1699 165.9736 167.5279 174.0746 10
uj5u.com熱心網友回復:
雖然沒有解決矢量化問題,但使用該collapse包可以更有效并帶來很好的加速(YMMV,取決于可用內核的數量):
invisible(suppressPackageStartupMessages(
lapply(c("magrittr", "expss", "data.table", "collapse"), require, character.only=TRUE)))
options(datatable.optimize = 3L)
N <- 1E7
repex_DT <- data.table (
country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
brand = factor(sample(LETTERS, size = N, replace=TRUE)),
score = sample(x = c(0:10), size = N, replace=TRUE),
weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)
# your version
system.time({
current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
# Calculations done for each 'brand_' of each "country x gender x age_group" combination
out <- repex_DT[ , .(
cases_total = sum(x=weight, na.rm=TRUE),
cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
sd_others = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
), by = .(country, gender, age_group)]
out[, brand := brand_]
data.table::setcolorder(x=out, neworder="brand")
return(data.table::copy(x=out))})) %>%
# Sorting at the end for better readability
.[order(., country, gender, age_group, brand)]
})
#> user system elapsed
#> 95.612 1.557 49.309
# version using collapse
system.time({
cols <- c("country", "gender", "age_group")
ot <- repex_DT[ , .(cases_total = sum(weight, na.rm=TRUE)), by = cols]
ot2 <- data.table::setorder(rbindlist(lapply(setNames(levels(repex_DT$brand), levels(repex_DT$brand)),
\(i) repex_DT[brand != i][, .(cases_others = fsum(x=weight, na.rm=TRUE),
mean_others = fmean(score, w=weight, na.rm=TRUE),
sd_others = fsd(score, w=weight, na.rm=TRUE)), by = cols]),
idcol="brand"), country, gender, age_group, brand)
out <- data.table::setcolorder(ot[ot2, on=cols], "brand")
out[, brand := factor(brand)]
})
#> user system elapsed
#> 49.836 3.478 11.543
all.equal(current_loop, out)
#> [1] TRUE
由reprex 包于 2022-03-07 創建(v2.0.1)
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/439918.html
標籤:r performance loops data.table
上一篇:我正在嘗試制作一個迷你游戲。運動在python中的X和Y坐標范圍內,范圍在-5和5之間。我如何把它放在一個回圈中
