.SD in data.table join: referencing an arbitrary list of columns in i

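Problem: compute a weighted average across the columns of one table, using weights taken from another table that is matched on a join key.

Here are the steps as a reprex: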

library(data.table)
#DT1 table of values - here just 2 columns, but may be an arbitrary number
DT1 <- data.table(k1 = c('A1','A2','A3'), 
                  k2 = c('X','X','Y'), 
                  v1 = c(10,11,12), 
                  v2 = c(.5, .6, 1.7))
#DT2 table of weights - columns correspond to value columns in table 1
DT2 <- data.table(k2 = c('X','Y'), 
                  w1 = c(5,2), 
                  w2 = c(1,7))
#Vectors of corresponding column names (could be any number of columns)
vals <- c('v1','v2')
weights <- c('w1','w2')
i.weights <- paste0('i.', weights)

#1. This returns all columns
DT1[DT2, on=.(k2)]
#>    k1 k2 v1  v2 w1 w2
#> 1: A1  X 10 0.5  5  1
#> 2: A2  X 11 0.6  5  1
#> 3: A3  Y 12 1.7  2  7
#2. This use of SD is standard
DT1[DT2, on=.(k2), .SD, .SDcols = vals, by=.(k1)]
#>    k1 v1  v2
#> 1: A1 10 0.5
#> 2: A2 11 0.6
#> 3: A3 12 1.7
#3. But refer to the columns of i (DT2) and it fails, both without and with the i. prefix
DT1[DT2, on=.(k2), .SD, .SDcols = weights, by=.(k1)]
#> Error in `[.data.table`(DT1, DT2, on = .(k2), .SD, .SDcols = weights, : Some items of .SDcols are not column names: [w1, w2]
DT1[DT2, on=.(k2), .SD, .SDcols = i.weights, by=.(k1)]
#> Error in `[.data.table`(DT1, DT2, on = .(k2), .SD, .SDcols = i.weights, : Some items of .SDcols are not column names: [i.w1, i.w2]
#4. So following suggestion in https://stackoverflow.com/questions/43257664/sd-and-sdcols-for-the-i-expression-in-data-table-join
# turn to mget() - in one command it fails
DT1[DT2, on=.(k2), c(mget(vals), mget(weights)), by=.(k1,k2)]
#> Error: value for 'w1' not found
#5. But by exploiting 1. above and splitting into chained queries we get success!
DT1[DT2, on=.(k2),][, c(mget(vals), mget(weights)), by=.(k1,k2)]
#>    k1 k2 v1  v2 w1 w2
#> 1: A1  X 10 0.5  5  1
#> 2: A2  X 11 0.6  5  1
#> 3: A3  Y 12 1.7  2  7
#6. Now we can turn to the original intention, but no luck
DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(mget(vals), mget(weights))), by=.(k1,k2)]
#> Error in x * w: non-numeric argument to binary operator
#7. One more step - turn the lists returned by mget to data.tables - hurrahh!
DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(setDT(mget(vals)), setDT(mget(weights)))), by=.(k1,k2)]
#>    k1 k2    wmean
#> 1: A1  X 8.416667
#> 2: A2  X 9.266667
#> 3: A3  Y 3.988889

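Created on 2021-11-26 by the reprex package (v2.0.0)

Should this really be so hard to do? Is there a more direct (and preferably more efficient) way to do it?

Corollary: I actually want to use this calculation to create a new column in DT1, but because this ends up as two chained queries, I cannot assign within this command. I have to create a new table and join it back to the original to add the column. Is there a solution to the problem above that avoids this extra step?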

Answer:

Another approach is to melt both tables from wide to long and then join them to each other.

molten_dt1 = melt(DT1, measure.vars = vals)[, variable := as.integer(substring(variable, 2))]
molten_dt2 = melt(DT2, measure.vars = weights)[, variable := as.integer(substring(variable, 2))]

molten_dt1[molten_dt2, 
           on = .(k2, variable)
           ][,
             weighted.mean(value, i.value),
             by = .(k1, k2)]

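The reason this is not straightforward is that whenever we need parallel column lookups (i.e. v1 * w1 and v2 * w2), complexity always increases because we have to account for the relationship between the columns. Melting the data simplifies the approach: the long structure lets us join on the column pairing, and we pass vectors to weighted.mean instead of data.frames.

Another note: if you create a new weighted.mean() method for lists, which lets us skip the setDT requirement, you may be able to simplify your original approach.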
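As a sketch of how the long-format result could be written back into DT1 (the corollary in the question), the grouped result can feed an update join with `:=`. This is not part of the original answer. The names `long_vals`, `long_wts`, `res`, and the helper column `idx` are illustrative, and the sketch assumes each `k2` appears once in DT2 and that value and weight columns pair up by their numeric suffix.

```r
library(data.table)

DT1 <- data.table(k1 = c("A1", "A2", "A3"),
                  k2 = c("X", "X", "Y"),
                  v1 = c(10, 11, 12),
                  v2 = c(0.5, 0.6, 1.7))
DT2 <- data.table(k2 = c("X", "Y"),
                  w1 = c(5, 2),
                  w2 = c(1, 7))
vals    <- c("v1", "v2")
weights <- c("w1", "w2")

# Melt both tables to long form; idx (hypothetical helper column) is the
# numeric suffix that pairs v1 with w1 and v2 with w2
long_vals <- melt(DT1, measure.vars = vals)[, idx := as.integer(substring(variable, 2))]
long_wts  <- melt(DT2, measure.vars = weights)[, idx := as.integer(substring(variable, 2))]

# Join on the key and the suffix, then compute one weighted mean per DT1 row;
# the weight column from long_wts arrives as i.value because the name clashes
res <- long_vals[long_wts, on = .(k2, idx)][
  , .(wmean = weighted.mean(value, i.value)), by = .(k1, k2)]

# Update join: add the result to DT1 by reference, without rebuilding the table
DT1[res, on = .(k1, k2), wmean := i.wmean]
```

The update join still needs the intermediate `res`, but DT1 itself is modified in place, so no second copy of the original table is created and joined back.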

## slight changes made to stats:::weighted.mean.default
weighted.mean.list = function (x, w, ..., na.rm = FALSE) 
{
  x = unlist(x)
  if (missing(w)) {
    if (na.rm) 
      x <- x[!is.na(x)]
    return(sum(x)/length(x))
  }
  w = unlist(w)
  if (length(w) != length(x)) 
    stop("'x' and 'w' must have the same length")
  if (na.rm) {
    i <- !is.na(x)
    w <- w[i]
    x <- x[i]
  }
  sum((x * w)[w != 0])/sum(w)
}

DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(mget(vals), mget(weights))), by=.(k1,k2)]
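To see why flattening with unlist() is enough, here is a small base R sketch for a single group (row A1), written as the kind of plain list mget() returns. The variable names `x`, `w`, `xv`, `wv`, `err`, `manual`, and `builtin` are illustrative only.

```r
# One group as mget() would return it for row A1: a plain list of columns
x <- list(v1 = 10, v2 = 0.5)
w <- list(w1 = 5, w2 = 1)

# Lists do not support arithmetic, which is why the default method fails in step 6
err <- tryCatch(x * w, error = function(e) e)
conditionMessage(err)

# Flattening both lists to numeric vectors makes the usual formula work
xv <- unlist(x)
wv <- unlist(w)
manual  <- sum(xv * wv) / sum(wv)   # (10*5 + 0.5*1) / (5 + 1)
builtin <- weighted.mean(xv, wv)    # same value via stats::weighted.mean
```

Both `manual` and `builtin` come out to about 8.416667, matching the A1 row in step 7 of the question. Because values and weights are flattened in the same column order, each value stays paired with its own weight.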
