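I have a tibble with more than 2 million rows. One of its columns, size, holds values in which M stands for millions and k for thousands; it also contains some <NA> values. The column type is character, as shown below: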
size
1.3M
5k
302
8.6M
<NA>
4.4k
21
...and so on.
I tried the following code:
for (i in 1:length(example$size)) {
  if (!is.na(example$size[i])) {
    if (str_sub(example$size[i],-1,-1) == "M") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000000
    } else if (str_sub(example$size[i],-1,-1) == "k") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000
    }
  }
}
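But it was still running after more than half an hour, so I interrupted it, because I wasn't sure whether my code was wrong and stuck in an infinite loop. Is there a mistake in it, or is there a more efficient way to write this?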
Answer from a uj5u.com user:
You can do it all with str_replace_all and as.numeric:
as.numeric(stringr::str_replace_all(size, c(M = "e6", k = "e3")))
[1] 1300000 5000 NA 21 4400
Edit:
A faster approach is to call base R's sub function twice:
as.numeric(sub("k", "e3", sub("M", "e6", bigsize,fixed = TRUE), fixed = TRUE))
A quick microbenchmark check suggests that this approach is the fastest:
microbenchmark::microbenchmark(
a = as.numeric(sub("k", "e3", sub("M", "e6", bigsize,fixed = TRUE), fixed = TRUE)),
b = as.numeric(str_replace_all(bigsize, c(M = "e6", k = "e3"))),
rep1 = rep1(bigsize),
rep2 = rep2(bigsize),
rep3 = rep3(bigsize),
rep4 = rep4(bigsize),
rep5 = rep5(bigsize_df), times=3)
Unit: milliseconds
expr min lq mean median uq max neval
a 621.1582 638.9055 664.4689 656.6529 686.1242 715.5955 3
b 1102.8758 1108.1215 1118.1558 1113.3673 1125.7958 1138.2244 3
rep1 1450.3998 1478.7379 1547.1752 1507.0761 1595.5629 1684.0497 3
rep2 6144.4160 6419.0407 8411.8940 6693.6654 9545.6329 12397.6005 3
rep3 19224.9825 19225.2984 19427.0457 19225.6143 19528.0773 19830.5402 3
rep4 1188.0552 1310.4584 1368.6480 1432.8616 1458.9444 1485.0273 3
rep5 3056.1525 3177.7098 3672.9781 3299.2671 3981.3909 4663.5148 3
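Both one-liners above rely on the same trick: as.numeric() understands scientific notation, so rewriting the suffix M as e6 and k as e3 turns "1.3M" into "1.3e6", which parses directly. The following is a minimal base R sketch of the intermediate step, assuming each value carries at most one suffix; the names as_sci and parsed are illustrative only.

```r
# Illustrative input; the bare 21 is coerced to "21" by c()
size <- c("1.3M", "5k", NA, 21, "4.4k")

# Rewrite each suffix as an exponent; fixed = TRUE matches the letter literally
as_sci <- sub("k", "e3", sub("M", "e6", size, fixed = TRUE), fixed = TRUE)
print(as_sci)

# as.numeric() parses the scientific notation; NA stays NA
parsed <- as.numeric(as_sci)
print(parsed)
```

Because sub() and as.numeric() both operate on the whole vector at once, no explicit loop over rows is needed.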
Answer from a uj5u.com user:
For 2M rows this takes about 3 seconds for me, which sounds like roughly a 600x improvement.
# example data
size <- c("1.3M","5k",NA,21,"4.4k")
bigsize <- c(replicate(4e5, size)) # big(ish) example for benchmarking
bigsize_df <- data.frame(bigsize) # 2,000,000 rows
# split out k/M
library(dplyr)
rep5 <- function(df) {
  df %>%
    mutate(num = readr::parse_number(bigsize),
           suffix = stringr::str_match_all(bigsize, "k|M"),
           num2 = num * case_when(suffix == "M" ~ 1E6,
                                  suffix == "k" ~ 1E3,
                                  TRUE ~ 1))
}
#3.003 sec
tictoc::tic()
rep5(bigsize_df)
tictoc::toc()
Result:
bigsize num suffix num2
1 1.3M 1.3 M 1300000
2 5k 5.0 k 5000
3 <NA> NA NA NA
4 21 21.0 21
5 4.4k 4.4 k 4400
6 1.3M 1.3 M 1300000
etc.
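Answer from a uj5u.com user:
tl;dr: vectorizing speeds things up by about 5x, and trying to be clever about avoiding repeated processing gives about a 30x speed gain. A vector of length 50,000 still takes about 1.5 seconds (so expect roughly 1 minute for 2 million entries...)
- The original approach and @KacZdr's suggestion both produce character vectors, because replacing values in a character vector with numeric values coerces them back to character (you can always use as.numeric() at the end); @KacZdr's solution also emits warnings.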
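The coercion described above can be seen in a tiny base R sketch. The vector x is an illustrative stand-in, not the asker's data: assigning a number into a character vector stores it as a string, so the vector's type never changes until as.numeric() is applied.

```r
# Illustrative character vector
x <- c("1.3M", "5k", "21")

# Assigning a numeric value into a character vector coerces it to character
x[1] <- 1.3 * 1e6
print(class(x))

# Elements that already look like plain numbers convert cleanly afterwards
converted <- as.numeric(x[c(1, 3)])
print(converted)
```

This is why an element-by-element loop that writes results back into the same column leaves a character column behind.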
size <- c("1.3M","5k",NA,21,"4.4k")
bigsize <- c(replicate(1e4, size)) # big(ish) example for benchmarking
## process outside of function to avoid repetition
prefixes <- c("M"=1e6, "k"=1e3)
re <- sprintf("[%s]", paste(names(prefixes), collapse =""))
rep1 <- function(size) {
  rx <- regexpr(re, size) ## find matches
  w <- which(!is.na(rx) & rx > 0) ## indices for replacement
  sw <- size[w]
  vals <- prefixes[substr(sw, rx[w], rx[w])] ## find letter values
  result <- numeric(length(size)) ## allocate result vector
  result[-w] <- as.numeric(size[-w]) ## assign non-suffixed values
  result[w] <- as.numeric(sub(re, "", sw))*vals ## assign suffixed values
  result
}
Wrapping the other two approaches in functions for benchmarking:
rep2 <- function(size) {
  size <- ifelse(!is.na(size) & grepl("M",size),as.numeric(sub("M.*", "", size))*1000000,size)
  size <- ifelse(!is.na(size) & grepl("k",size),as.numeric(sub("k.*", "", size))*1000,size)
  return(size)
}
The original:
library(stringr)
rep3 <- function(size) {
  for (i in 1:length(size)) {
    if (!is.na(size[i])) {
      if (str_sub(size[i],-1,-1) == "M") {
        size[i] = as.numeric(str_sub(size[i], 1,-2)) * 1000000
      } else if (str_sub(size[i],-1,-1) == "k") {
        size[i] = as.numeric(str_sub(size[i], 1,-2)) * 1000
      }
    }
  }
  size
}
library(rbenchmark)
benchmark(rep1(bigsize), rep2(bigsize), rep3(bigsize))[,1:5]
test replications elapsed relative user.self
1 rep1(bigsize) 100 1.451 1.000 1.452
2 rep2(bigsize) 100 7.812 5.384 7.807
3 rep3(bigsize) 100 41.489 28.593 41.485
Here is another idea; I did not expect it to beat rep1(), but it is actually faster:
rep4 <- function(size) {
  lastchar <- stringr::str_sub(size, -1, -1)
  w <- grep(re, lastchar)
  sw <- size[w]
  vals <- prefixes[lastchar[w]] ## find letter values
  result <- numeric(length(size)) ## allocate result vector
  result[-w] <- as.numeric(size[-w]) ## assign non-suffixed values
  result[w] <- as.numeric(sub(re, "", sw))*vals ## assign suffixed values
  result
}
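The named-vector lookup that rep1() and rep4() build on can be looked at in isolation. The sketch below is a simplified base R variant, not the answerer's code: it assumes the suffix is always the last character, and the names suffix, mult, num, and result are illustrative.

```r
# Multipliers keyed by suffix letter
prefixes <- c(M = 1e6, k = 1e3)
size <- c("1.3M", "5k", NA, "21", "4.4k")

# Last character of each value (NA stays NA)
suffix <- substr(size, nchar(size), nchar(size))

# Look up the multiplier by name; anything that is not a known suffix gives NA
mult <- unname(prefixes[suffix])
mult[is.na(mult)] <- 1

# Strip a trailing M or k, convert, and scale
num <- as.numeric(sub("[Mk]$", "", size))
result <- num * mult
print(result)
```

Indexing a named vector with a character vector does the whole lookup in one vectorized step, which is what lets these functions avoid a per-row if/else.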
Answer from a uj5u.com user:
Try this:
size <- c("1.3M","5k",NA,21,"4.4k")
size <- ifelse(!is.na(size) & grepl("M",size),as.numeric(sub("M.*", "", size))*1000000,size)
size <- ifelse(!is.na(size) & grepl("k",size),as.numeric(sub("k.*", "", size))*1000,size)
Edit (to avoid the warnings):
size <- ifelse(!is.na(size) & grepl("M",size),suppressWarnings(as.numeric(sub("M.*", "", size)))*1000000,size)
size <- ifelse(!is.na(size) & grepl("k",size),suppressWarnings(as.numeric(sub("k.*", "", size)))*1000,size)
Output:
> size
[1] "1300000" "5000" NA "21" "4400"
