Why is a tibble slower than a data.frame in row-wise comparisons?

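I am converting an old codebase to the tidyverse and noticed a performance drop in one particular step. Because I now use readr (read_delim) to read my data, I end up with a tibble instead of the base R data.frame I used to get (read.delim), which is fine.

However, row-wise comparisons on a tibble are much slower than on a regular data.frame.

Here is my code: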

library(tidyverse)

# Data
df <- tribble(
  ~x_pos, ~y_pos,
  0.0,  5.0,
  NA,   NA,
  0.1,  0.9,
  1.1,  1.5,
  1.7,  2.0,
  3.2,  1.0,
  4.0,  1.5,
  4.1,  5.0,
)

# Defining Regions of interest
roi_set_top <- list(
  roi_list = list(
    roi1 = list(
      hit_name = "left",
      x1 = 1.0,
      y1 = 1.0,
      x2 = 2.0,
      y2 = 2.0
    ),
    roi2 = list(
      hit_name = "right",
      x1 = 3.0,
      y1 = 1.0,
      x2 = 4.0,
      y2 = 2.0
    )
  )
)

# ?? UNCOMMENT THIS LINE to convert the `tibble` to a `data.frame`, then source the file again
# df <- as.data.frame(df)

start.time <- Sys.time()

for (bench in 1:1000) {
  roi_vector <- rep("NO EVAL", times = nrow(df))
  
  # loop over rows
  for (i in 1:nrow(df)) {
    
    # loop over the aoilist
    for (roi in roi_set_top$roi_list) {
      
      # check if either x or y is NA (or both) if so return NA
      if (is.na(df[i, "x_pos"]) || is.na(df[i, "y_pos"])) {
        roi_vector[i] <- "No X/Y"
        break
      }
      
      # check the hit area
      if (df[i, "x_pos"] >= roi$x1 && df[i, "y_pos"] >= roi$y1 &&
          df[i, "x_pos"] <= roi$x2 && df[i, "y_pos"] <= roi$y2) {
        roi_vector[i] <- roi$hit_name
        break
      }
      
      # Finally, if current row’s x and y is neither NA nor in hit range assign Outside ROI
      roi_vector[i] <- "Outside ROI"
    }
  }
}

end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

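Comparison

When you source the code as is, it takes about 10 times as long as when you uncomment the line marked with ??, which converts the tibble to a data.frame.

I can recover the data.frame performance if I extract the columns as vectors, like this: x_pos <- df$x_pos; y_pos <- df$y_pos, and use those vectors in the loop instead of df. However, that leaves me with a fundamental question.

Question

Why is a tibble slower than a base R data.frame in row-wise comparisons?

As a follow-up on best-practice style: using a df when one only needs vectors seems like bad practice. So should one always iterate over vectors rather than over the columns of a df?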

uj5u.com user reply:

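The main reason is that tibbles return tibbles when subset, whereas data frames sometimes return vectors. In your example this shows up when evaluating df[i, "x_pos"]: the result is a one-cell tibble if df is a tibble, but a numeric scalar if df is a data frame. That makes computing is.na(df[i, "x_pos"]) much slower.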
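A minimal sketch of that difference, assuming only the tibble package is installed. The names tbl, dfr, tbl_cell, dfr_cell, tbl_drop and tbl_elem are made up for this illustration and do not appear in the original post.

```r
library(tibble)

# Small stand-in for the question's data
tbl <- tibble(x_pos = c(0.0, NA, 0.1), y_pos = c(5.0, NA, 0.9))
dfr <- as.data.frame(tbl)

# Single-cell subsetting with `[` keeps the tibble structure ...
tbl_cell <- tbl[1, "x_pos"]
print(class(tbl_cell))

# ... while a base data.frame drops down to a bare numeric value.
dfr_cell <- dfr[1, "x_pos"]
print(class(dfr_cell))

# Asking for drop = TRUE, or extracting the column first,
# also yields a bare value from the tibble.
tbl_drop <- tbl[1, "x_pos", drop = TRUE]
tbl_elem <- tbl[["x_pos"]][1]
```

Every comparison in the question's loop therefore goes through tibble's subsetting method and then operates on a one-cell tibble rather than on a plain number.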

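You get some speedup by adding drop = TRUE wherever you really want a vector or scalar (I saw the time taken fall by about 25%), but a better idea is to convert the columns to vectors outside the loop, which avoids all those individual accesses to the tibble. For example, this code: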

start.time <- Sys.time()

for (bench in 1:1000) {
  roi_vector <- rep("NO EVAL", times = nrow(df))
  # extract the columns as plain vectors before looping over rows
  x_pos <- df$x_pos
  y_pos <- df$y_pos
  # loop over rows
  for (i in 1:nrow(df)) {
    # loop over the aoilist
    for (roi in roi_set_top$roi_list) {
      # check if either x or y is NA (or both) if so return NA
      if (is.na(x_pos[i]) || is.na(y_pos[i])) {
        roi_vector[i] <- "No X/Y"
        break
      }
      # check the hit area
      if (x_pos[i] >= roi$x1 && y_pos[i] >= roi$y1 &&
          x_pos[i] <= roi$x2 && y_pos[i] <= roi$y2) {
        roi_vector[i] <- roi$hit_name
        break
      }
      # Finally, if current row’s x and y is neither NA nor in hit range assign Outside ROI
      roi_vector[i] <- "Outside ROI"
    }
  }
}
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

This runs 60 times faster than the original code on my system.
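On the follow-up question about iterating over vectors: once the columns are plain vectors, the row loop can be dropped as well. The following is a sketch only, not part of the original answer. It uses base R, reuses the sample data and ROI definitions from the question, and classify_roi() is a hypothetical helper name. It assumes at least one ROI is defined; with an empty ROI list the original loop would leave "NO EVAL" instead.

```r
# Sample data and ROI definitions copied from the question
x_pos <- c(0.0, NA, 0.1, 1.1, 1.7, 3.2, 4.0, 4.1)
y_pos <- c(5.0, NA, 0.9, 1.5, 2.0, 1.0, 1.5, 5.0)
roi_list <- list(
  roi1 = list(hit_name = "left",  x1 = 1.0, y1 = 1.0, x2 = 2.0, y2 = 2.0),
  roi2 = list(hit_name = "right", x1 = 3.0, y1 = 1.0, x2 = 4.0, y2 = 2.0)
)

# Hypothetical helper: classify all rows at once, looping only over the ROIs
classify_roi <- function(x, y, rois) {
  result <- rep("Outside ROI", length(x))

  # Rows with a missing coordinate are labelled first and never re-labelled
  missing_xy <- is.na(x) | is.na(y)
  result[missing_xy] <- "No X/Y"
  assigned <- missing_xy

  for (roi in rois) {
    # First matching ROI wins, mirroring the `break` in the original loop
    hit <- !assigned &
      x >= roi$x1 & y >= roi$y1 &
      x <= roi$x2 & y <= roi$y2
    result[hit] <- roi$hit_name
    assigned <- assigned | hit
  }
  result
}

roi_vector <- classify_roi(x_pos, y_pos, roi_list)
print(roi_vector)
```

With a tibble or data.frame named df, the call would be classify_roi(df$x_pos, df$y_pos, roi_set_top$roi_list), so the choice of container no longer affects the inner work.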


Tags: r, dataframe, dplyr, tidyverse, tibble
