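I want to write a function that takes the name of the data table and the name of the column as variables.
This function does what I want, and it works: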
library(data.table)

n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
dataName <- "d"
colName <- "x"

# Objective: look up both the table and the column by name
FOO <- function(dataName = "d", colName = "x") {
  get(dataName)[, mean(get(colName)), by = grp]
}
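The problem is that get() is evaluated for every group, which is very time consuming. On my real data it takes 14 times longer than the equivalent with static names. I would like to reach the same execution time as when the column name is static.
What I have tried: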
(cl <- substitute(mean(eval(parse(text = colName))), list(colName = as.name(colName))))
microbenchmark::microbenchmark(
  # 1) works and quick but does not use variable names of columns (654ms)
  (t1 <- d[, mean(x), by = grp]),
  # 2) works but slow (1006ms)
  (t2 <- get(dataName)[, mean(get(colName)), by = grp]),
  # 3) works but slow (4075ms)
  (t3 <- eval(parse(text = dataName))[, mean(eval(parse(text = colName))), by = grp]),
  # 4) works but very slow (37202ms)
  (t4 <- get(dataName)[, eval(cl), by = grp]),
  # 5) double dot syntax doesn't work because I haven't mastered it
  # (t5 <- get(dataName)[, mean(..colName), by = grp]),
  times = 10)
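Is the double-dot syntax appropriate here? Why is 4) so slow? I took it from this post, where it was the best option. I adapted the double-dot syntax from this post.
Thank you very much for your help!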
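Answer from a uj5u.com user:
It is better to pass the dataset d itself to FOO rather than the string "d". In addition, you can use lapply together with .SD, so that you benefit from data.table's internal optimization, instead of using mean(get(colName)).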
FOO2 = function(dataName=d, colName = "x") { # d instead of "d" passed to the first argument!
dataName[, lapply(.SD, mean), by=grp, .SDcols=colName]
}
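On the two open questions, here is a hedged sketch rather than a definitive answer. As far as I can tell, the `cl` built in the question becomes `mean(eval(parse(text = x)))` after substitution, so each group's values are converted to text and parsed again, which would explain why 4) is so slow. The `..` prefix looks a variable up in the calling scope, so `..colName` is just the string `"x"`, not the column itself. The sketch below builds the call `mean(x)` directly instead. `FOO3` and the small table `dt_small` are hypothetical names introduced for illustration.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the benchmark set)
dt_small <- data.table(x = 1:6, grp = c(1L, 1L, 2L, 2L, 3L, 3L))

# FOO3 is a hypothetical name: it builds the call mean(<column>) once,
# then asks data.table to evaluate that call in j, grouped by grp.
FOO3 <- function(data, colName = "x") {
  cl <- substitute(mean(col), list(col = as.name(colName)))
  data[, eval(cl), by = grp]
}

res_dynamic <- FOO3(dt_small, "x")
res_static  <- dt_small[, mean(x), by = grp]

# Both versions should give the same group means
stopifnot(isTRUE(all.equal(res_dynamic, res_static)))
```

No timings are claimed for this variant; it is worth benchmarking on your own data next to FOO2.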
Benchmark: FOO vs FOO2
set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times = 5L
)
Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893     5
 FOO2()  255.0828  267.1322  297.0389  275.4467  281.9873  405.5456     5
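Newer data.table releases (1.15.0 and later, to my knowledge) also accept an `env` argument to `[` that substitutes names into the query before it is evaluated. This is a minimal sketch assuming such a version is installed. `FOO4`, `dt_small`, and the placeholder `col` are hypothetical names, and it was not part of the benchmark above.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the benchmark set)
dt_small <- data.table(x = 1:6, grp = c(1L, 1L, 2L, 2L, 3L, 3L))

# FOO4 is a hypothetical name. `col` is only a placeholder in j;
# env = list(col = colName) replaces it with the column named by colName
# before the query runs (assumes data.table >= 1.15.0).
FOO4 <- function(data, colName = "x") {
  data[, mean(col), by = grp, env = list(col = colName)]
}

res_env <- FOO4(dt_small, "x")

# Should match the static-name query
stopifnot(isTRUE(all.equal(res_env, dt_small[, mean(x), by = grp])))
```

Because the substitution happens before evaluation, the query data.table sees is the same as the static `mean(x)` one, so it is a reasonable candidate to benchmark against FOO2.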
