向量子集的Rlapply回圈性能-有解無憂

大家好！我很少提出新問題，因為在這個論壇上已經說了這么多，但是這次我找不到足夠的材料來解決我的性能問題。

我基本上有調查資料，我想從中計算品牌層面的各種指標。訣竅是我需要為回圈的每個元素創建向量的變體，方法是排除與測驗的第 i 個元素相關的所有元素。目前我還沒有找到一種方法來矢量化我的代碼。因此，我的lapply回圈非常緩慢（到目前為止，這是較大腳本中最慢的部分）。

我的資料集有 800 萬行長，我回圈了 70 多個品牌，因此此時性能開始變得重要。請參閱您自己的測驗的更短的可重現示例：

（編輯：添加到腳本中的注釋以便更好地理解。）

# Small sample size to experiment
N <- 200L 

# Table with survey data :
# - each observation is the answer of a person about a brand
# - each observation is associated to a weight, used to compute statistics (frequencies, means...)
# - each person is the described by few socio-demographic variables (country, gender, age)
# - brands are given a grade ('score' variable), ranging from 0 to 10
repex_DT <- data.table (
  country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
  gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
  brand = factor(sample(c(toupper(letters[1:26])), size = N, replace=TRUE)),
  score = sample(x = c(0:10), size = N, replace=TRUE),
  weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)

# The loop computes for each "country x gender x age_group x brand" combination :
# - the "country x gender x age_group" socio-demographic group size (cases_total, i.e. all brands included)
# - the "country x gender x age_group" group size, excluding the people assessing the current 'brand_' argument
# - Same logic for mean and standard deviation indicators

current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  
  # Calculations done for each 'brand_' of each "country x gender x age_group" combination
  out <- repex_DT[ , .(
    cases_total = sum(x=weight, na.rm=TRUE),
    cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
    mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
    sd_others  = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
  ), by = .(country, gender, age_group)]
  
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  return(data.table::copy(x=out))})) %>% 
  
  # Sorting at the end for better readability
  .[order(., country, gender, age_group, brand)]

So far I have read plenty of other SO questions like this one, this other one and others on the same topic, so I am well aware that loops extending a data.table is both memory and time consuming. Yet I haven't found an other way to get me where I want. Hope you R experts can :-) And by the way, I use expss to compute weighted means and standard deviations because I also use the package to compute tables here and there, but but I certainly could use other packages if that could help performance-wise.

uj5u.com熱心網友回復：

這是一個更快的方法

獲取總計的 n,mean,sd

cases_total = repex_DT[, .(cases_total = sum(weight, na.rm=T),
                           mean_total = expss::w_mean(score, weight, na.rm=T),
                           sd_total = expss::w_sd(score, weight, na.rm=T)), .(country, gender, age_group)]

獲取每個品牌的 n、mean、sd

cases_brand = repex_DT[, .(cases_brand = sum(weight, na.rm=T),
                           mean_brand = expss::w_mean(score, weight, na.rm=T),
                           sd_brand = expss::w_sd(score, weight, na.rm=T)), .(brand,country, gender, age_group)]

將這些合并在一起

result = cases_brand[cases_total, on=.(country, gender, age_group)][, cases_others:=cases_total-cases_brand]

很容易對“其他人”（即非品牌）變得刻薄

result[, mean_others:= (cases_total*mean_total - cases_brand*mean_brand)/cases_others]

現在，一個小函式來獲取“其他人”的 sd

sd_other <- function(n1,n2,total_sd,sub_sd,total_mean, sub_mean ,other_mean) {
  sqrt(
    (total_sd^2*(n1 n2-1) - (n1-1)*sub_sd^2 - n1*(sub_mean-total_mean)^2 - n2*(other_mean-total_mean)^2)/(n2-1)
  )
}

應用該功能來獲取其他人的 sd

result[, sd_others:= sd_other(cases_brand, cases_others,sd_total,sd_brand,mean_total,mean_brand, mean_others)]

洗掉不必要的列并設定順序

result[, `:=`(cases_brand=NULL, mean_brand=NULL, sd_brand=NULL, mean_total=NULL, sd_total=NULL)]
setorder(result, country, gender, age_group, brand)

比較：

> microbenchmark::microbenchmark(list=cmds, times=10)
Unit: milliseconds
         expr       min        lq      mean    median        uq       max neval
 current_loop 3684.4233 3700.1134 3775.4322 3729.8387 3855.5353 3938.4605    10
 new_approach  155.9486  158.2265  164.1699  165.9736  167.5279  174.0746    10

uj5u.com熱心網友回復：

雖然沒有解決矢量化問題，但使用該collapse包可以更有效并帶來很好的加速（YMMV，取決于可用內核的數量）：

invisible(suppressPackageStartupMessages(
    lapply(c("magrittr", "expss", "data.table", "collapse"), require, character.only=TRUE)))
options(datatable.optimize = 3L)
N <- 1E7
repex_DT <- data.table (
  country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
  gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
  brand = factor(sample(LETTERS, size = N, replace=TRUE)),
  score = sample(x = c(0:10), size = N, replace=TRUE),
  weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)

# your version
system.time({
    current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  # Calculations done for each 'brand_' of each "country x gender x age_group" combination
  out <- repex_DT[ , .(
    cases_total = sum(x=weight, na.rm=TRUE),
    cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
    mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
    sd_others  = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
  ), by = .(country, gender, age_group)]
  
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  return(data.table::copy(x=out))})) %>% 
  
  # Sorting at the end for better readability
  .[order(., country, gender, age_group, brand)]
})
#>    user  system elapsed 
#>  95.612   1.557  49.309

# version using collapse
system.time({
cols <- c("country", "gender", "age_group")
ot <- repex_DT[ , .(cases_total = sum(weight, na.rm=TRUE)), by = cols]
ot2 <- data.table::setorder(rbindlist(lapply(setNames(levels(repex_DT$brand), levels(repex_DT$brand)), 
    \(i) repex_DT[brand != i][, .(cases_others = fsum(x=weight, na.rm=TRUE),
    mean_others = fmean(score, w=weight, na.rm=TRUE),
    sd_others = fsd(score, w=weight, na.rm=TRUE)), by = cols]), 
    idcol="brand"), country, gender, age_group, brand)
out <- data.table::setcolorder(ot[ot2, on=cols], "brand")
out[, brand := factor(brand)]
})
#>    user  system elapsed 
#>  49.836   3.478  11.543

all.equal(current_loop, out)
#> [1] TRUE

^{由reprex 包于 2022-03-07 創建(v2.0.1)}

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/439918.html

標籤：r performance loops data.table

上一篇：我正在嘗試制作一個迷你游戲。運動在python中的X和Y坐標范圍內，范圍在-5和5之間。我如何把它放在一個回圈中

下一篇：使用不同的引數postgressql多次執行相同的查詢