加快R中向量的setdiff()、intersect()、union()操作-有解無憂

我有一個資料表dt1：

id1	id2	V1	V2
1	一個	c(1, 2, 3, 4)	c(1, 3, 6)
2	b	c(2, 6, 9, 8)	c(8, 5)

我想添加新列，這些列是setdiff(),intersect()和union()對V1和V2變數的操作的結果。

預期輸出：

id1	id2	V1	V2	diff_V1_V2	相交_V1_V2	union_V1_V2
1	一個	c(1, 2, 3, 4)	c(1, 3, 6)	c(2, 4)	c(1, 3)	c(1, 2, 3, 4, 6)
2	b	c(2, 6, 9, 8)	c(8, 5)	c(2, 6, 9)	c(8)	c(2, 5, 6, 8, 9)

我試過了：

dt_new <- dt1[, c("diff_V1_V2", "intersect_V1_V2", "union_V1_V2") := list(
              Map(setdiff, V1, V2),
              Map(intersect, V1, V2),
              Map(union, V1, V2))]

但是我的真實向量很長，所以這些操作需要很長時間。

那么，如何加快這些操作或如何使用其他功能/方法獲得類似的結果？我正在尋找最有效的方法。
或者是否可以并行計算？

uj5u.com熱心網友回復：

天真的方法：

Naive <- function (a, b) {
  list(intersect = intersect(a, b),
       union = union(a, b),
       adiffb = setdiff(a, b))
}

您可以利用基礎數學在一次掃描向量a和中完成所有三個操作b，而不是通過三個函式呼叫進行三次掃描。此外，unique如果您確定任一向量中沒有重復值，則可以跳過昂貴的。

SetOp <- function (a, b, no.dup.guaranteed = FALSE) {
  if (no.dup.guaranteed) {
    au <- a
    bu <- b
  } else {
    au <- unique(a)
    bu <- unique(b)
  }
  ind <- match(bu, au, nomatch = 0)
  INTERSECT <- au[ind]
  DIFF <- au[-c(ind, length(au)   1)]  ## https://stackoverflow.com/a/52772380
  UNION <- c(bu, DIFF)
  list(intersect = INTERSECT, union = UNION, adiffb = DIFF)
}

SetOp(a = c(1, 2, 3, 4), b = c(1, 3, 6))
SetOp(a = c(2, 6, 9, 8), b = c(8, 5))

一些基準：

## no duplicated values in either a or b; can set no.dup.guaranteed = TRUE
a <- sample.int(11000, size = 10000, replace = FALSE)
b <- sample.int(11000, size = 10000, replace = FALSE)

microbenchmark::microbenchmark(naive = Naive(a, b),
                               better = SetOp(a, b),
                               fly = SetOp(a, b, no.dup.guaranteed = TRUE))
#Unit: milliseconds
#   expr      min       lq     mean   median       uq      max neval
#  naive 6.457302 6.489710 6.751996 6.511399 6.567941 8.571623   100
# better 3.251701 3.268873 3.377910 3.277086 3.306723 3.880755   100
#    fly 1.734898 1.749300 1.805163 1.755927 1.767114 3.326179   100

## lots of duplicated values in both a and b; must have no.dup.guaranteed = FALSE
a <- sample.int(100, size = 10000, replace = TRUE)
b <- sample.int(100, size = 10000, replace = TRUE)

microbenchmark::microbenchmark(naive = Naive(a, b),
                               better = SetOp(a, b))
#Unit: microseconds
#   expr      min       lq      mean   median       uq      max neval
#  naive 1421.702 1431.023 1653.1339 1443.147 1483.255 3809.031   100
# better  396.193  398.695  446.7062  400.293  412.046 1995.294   100

如果你想進一步加速，你需要考慮如何加速unique()。這并不容易，因為您可能無法擊敗R使用的內部演算法。

我看到了用 R fast 替換 unique() 的額外速度改進Rfast::sort_unique()。

謝謝你，@ M.Viking。很高興看到你的回答。我沒有時間在我的作業系統上安裝 GNU Scientific Library (GSL)，所以我自己無法安裝和嘗試Rfast。

以下是對您的基準測驗結果的一些評論。

“better2”比“fly”更快的原因是因為match在排序向量上更快。所以是的，即使 and 中沒有重復值，a應用b仍然是一個好主意sort_unique。
您可能還想嘗試RfastMatch中的功能。我在包的檔案中發現了這個函式，但不知道它與R的基本版本相比有多快。此外，檔案沒有明確說明如何處理不匹配。相比之下，基本版本有一個有用的引數，我將其設定為 0 以避免 NA 索引。Matchmatchnomatch

好主意。不幸Rfast::Match()的是不是一個替代品base::match()。然而，幸運的是，fastmatch::fmatch()它是一個快速下降的替代品match()。

我們在這里進行了一次非常鼓舞人心的迭代！很高興知道！擁有如此多有用工具的令人驚嘆的R社區！

myV1和V2變數不包含重復項，所以unique如果我理解正確，就不需要使用該函式？

直覺上，是的，因為我們不想做額外的作業。但有趣的是，M.Viking 答案中的基準測驗結果表明，值得對向量進行排序以加速match。所以你實際上可以接受SetOp2()M.Viking 給出的。

我不認為在您的應用程式SetOp3()中base::match替換為fastmatch::fmatch值得。根據其檔案，fmatch比match僅在我們執行重復匹配時更快，例如match(a, key)，match(b, key)等，其中key被重用。M.Viking 的基準有利于這一點，因為microbenchmark()重復SetOp3(a, b)了 100 次。在第一次運行中，與;fmatch一樣快那么它比接下來的 99 次運行match要快得多。match但是，在您的應用程式中，每個向量僅使用一次。由于不存在重用，我們很樂意繼續使用match.

那么，我如何將您的解決方案應用于我的每一行資料（V1并且V2那里有串列）？

我們必須使用回圈或類似回圈的函式，就像Map您在問題中使用的那樣。唯一的問題是我們需要一些后處理來提取結果。見下文。

V1 <- list(a1 = c(1, 2, 3, 4), a2 = c(2, 6, 9, 8))
V2 <- list(b1 = c(1, 3, 6), b2 = c(8, 5))

## or: ans <- Map(SetOp2, V1, V2)
ans <- Map(SetOp, V1, V2, no.dup.guaranteed = TRUE)
## post-processing
INTERSECT <- lapply(ans, "[[", 1)
UNION <- lapply(ans, "[[", 2)
SETDIFF <- lapply(ans, "[[", 3)

關于并行處理的其他想法

jblood94 的答案通過并行計算更進一步。好作業！這是一個很好的練習parallel。然而原則上，我們不想并行這個任務，因為它受記憶體限制而不是 CPU 限制。我們只是從記憶體中掃描資料而沒有進行復雜的 CPU 運算。眾所周知，并行處理對于這類作業沒有希望，我預計不會有很大的加速。似乎jblood94能夠獲得 82.89 / 34.21 = 2.42 對串行處理的加速。但是，他/她沒有提及使用了多少 CPU 內核。例如，如果使用 8 個內核，那么 2.42 的加速非常差。

uj5u.com熱心網友回復：

建立在@Zheyuan Li 的回答基礎上的一種可能的并行化方法：

library(data.table)
library(Rfast)
library(parallel)

SetOp <- function(V1, V2) {
  n <- length(V1)
  DIFF <- vector("list", n)
  INTERSECT <- vector("list", n)
  UNION <- vector("list", n)
  for (i in 1:n) {
    u1 <- sort_unique(V1[[i]])
    u2 <- sort_unique(V2[[i]])
    ind <- match(u2, u1, nomatch = 0)
    DIFF[[i]] <- u1[-c(ind, length(u1)   1)]
    INTERSECT[[i]] <- u1[ind]
    UNION[[i]] <- c(u2, DIFF[[i]])
  }
  
  list(DIFF, INTERSECT, UNION)
}

SetOpParallel <- function(V1, V2, ncl = detectCores() - 1L) {
  ncl <- min(length(V1), ncl)
  cl <- makeCluster(ncl)
  on.exit(stopCluster(cl))
  node <- rep(c(1:ncl, ncl:1), ceiling(length(V1)/ncl/2))[1:length(V1)][rank(-as.numeric(lengths(V1))*as.numeric(lengths(V2)), ties.method = "first")]
  clusterEvalQ(cl, {library(data.table); library(Rfast)})
  rowidx <- integer(length(V1))
  lastidx <- 0L
  for (i in 1:ncl) {
    # pass only the needed data to each node
    nodeidx <- which(node == i)
    rowidx[nodeidx] <- (lastidx   1L):(lastidx   length(nodeidx))
    lastidx <- lastidx   length(nodeidx)
    v1 <- V1[nodeidx]
    v2 <- V2[nodeidx]
    clusterExport(cl[i], c("v1", "v2"), environment())
  }
  rm("v1", "v2")
  clusterExport(cl, "SetOp")
  rbindlist(clusterEvalQ(cl, SetOp(v1, v2)))[rowidx]
}

一個較小的資料集來檢查兩個函式的等效性：

n <- 1e2L
dt <- data.table(id1 = 1:n,
                 id2 = n:1,
                 V1 = replicate(n, sample.int(1e5, sample(5e4:6e4)), FALSE),
                 V2 = replicate(n, sample.int(1e5, sample(3e4:4e4)), FALSE))
identical(copy(dt)[, c("diff_V1_V2", "intersect_V1_V2", "union_V1_V2") := SetOp(V1, V2)],
          copy(dt)[, c("diff_V1_V2", "intersect_V1_V2", "union_V1_V2") := SetOpParallel(V1, V2)])
#> [1] TRUE

用于速度比較的更大資料集：

n <- 1e4L
dt <- data.table(id1 = 1:n,
                 id2 = n:1,
                 V1 = replicate(n, sample.int(1e5, sample(5e4:6e4)), FALSE),
                 V2 = replicate(n, sample.int(1e5, sample(3e4:4e4)), FALSE))
dt2 <- copy(dt)
system.time(dt[, c("diff_V1_V2", "intersect_V1_V2", "union_V1_V2") := SetOpParallel(V1, V2)])
#>    user  system elapsed 
#>    8.30   11.22   34.21
rm(dt)
invisible(gc())
system.time(dt2[, c("diff_V1_V2", "intersect_V1_V2", "union_V1_V2") := SetOp(V1, V2)])
#>    user  system elapsed 
#>   75.11    5.76   82.89

uj5u.com熱心網友回復：

@Zheyuan Li 答案的一個版本，使用Rfast::sort_unique()和fastmatch::fmatch()

#install.packages("Rfast")  ## Takes time to compile all the Rfast functions https://github.com/RfastOfficial/Rfast
library(Rfast)
#install.packages("fastmatch") ## https://github.com/s-u/fastmatch
library(fastmatch)

SetOp2 <- function (a, b, no.dup.guaranteed = FALSE) {
  if (no.dup.guaranteed) {
    au <- a
    bu <- b
  } else {
    au <- sort_unique(a) #Only difference from @Zheyuan Li version
    bu <- sort_unique(b) #""
  }
  ind <- match(bu, au, nomatch = 0)
  INTERSECT <- au[ind]
  DIFF <- au[-c(ind, length(au)   1)]
  UNION <- c(bu, DIFF)
  list(intersect = INTERSECT, union = UNION, adiffb = DIFF)
}

SetOp3 <- function (a, b, no.dup.guaranteed = FALSE) {
  if (no.dup.guaranteed) {
    au <- a
    bu <- b
  } else {
    au <- sort_unique(a)  ## Updated
    bu <- sort_unique(b)  ## Updated
  }
  ind <- fastmatch::fmatch(bu, au, nomatch = 0)   ## Updated
  INTERSECT <- au[ind]
  DIFF <- au[-c(ind, length(au)   1)]
  UNION <- c(bu, DIFF)
  list(intersect = INTERSECT, union = UNION, adiffb = DIFF)
}

a <- sample.int(11000, size = 10000, replace = FALSE)
b <- sample.int(11000, size = 10000, replace = FALSE)

microbenchmark::microbenchmark(naive = Naive(a, b),
                               better = SetOp(a, b),
                               better2 = SetOp2(a, b),
                               better3 = SetOp3(a, b),
                               fly = SetOp(a, b, no.dup.guaranteed = TRUE),
                               times=10000)

Unit: microseconds
    expr      min        lq      mean    median        uq       max neval
   naive 2970.053 3110.7045 3409.1796 3215.7650 3352.5110 396391.39 10000
  better 1749.386 1819.7195 1987.1681 1877.7340 1984.3705  39385.95 10000
 better2  840.537  873.3440  977.6893  902.5495  969.4585  35723.44 10000
 better3  421.832  460.3975  530.7420  480.0845  519.7660  38398.51 10000
     fly  935.782  972.0125 1074.5115  993.6900 1062.4290  39954.88 10000

x <- sample.int(100, size = 10000, replace = TRUE)
y <- sample.int(100, size = 10000, replace = TRUE)

microbenchmark::microbenchmark(naive = Naive(x, y),
                               better = SetOp(x, y),
                               better2 = SetOp2(x, y),
                               better3 = SetOp3(x, y),
                               times=10000)

Unit: microseconds
    expr     min      lq      mean   median       uq       max neval
   naive 595.526 623.245 771.52724 654.2590 718.4740 38635.977 10000
  better 188.642 196.369 255.91779 206.0725 227.1735 41866.460 10000
 better2  54.856  59.675  71.53553  63.6070  74.8945   549.182 10000
 better3  63.609  70.503  94.74702  80.8270 100.8745 25584.469 10000

Nota Bene： R 版本 4.0.4 用-mtune=native -march=native -O3 -fopenmp -fpic和 openblas BLAS/LAPACK 編譯。

轉載請註明出處，本文鏈接：https://www.uj5u.com/yidong/491640.html

標籤：r 表现数据表

上一篇：C與Perl的速度和記憶體管理

下一篇：Swift讀寫圖片占用大量記憶體