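I have a dataset like the one below (the real dataset has 5M rows and no gaps), and I am trying to filter out rows where the sum of all numeric columns for the row itself, its previous row, and its next row equals zero.
N.B.
- Time is a dttm column in the real data.
- The number of consecutive zero rows can exceed 3, in which case multiple rows will be filtered out.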
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <int> <dbl> <dbl>
1 A 1 0 0
2 B 1 0.1 0
3 A 3 0 0
4 B 3 0 0
5 A 2 0 0
6 B 2 0.2 0.2
7 B 4 0 0
8 A 4 0 0.1
9 A 5 0 0
10 A 6 0 0
11 B 6 0.1 0.5
12 B 5 0.1 0.2
13 A 7 0 0
See the example below for the desired result:
# A tibble: 13 x 8
group Time Val1 Val2 rowsum leadsum lagsum sum
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 0 0 NA NA
2 A 2 0 0 0 0 0 0 This will get filtered out!
3 A 3 0 0 0 0.1 0 0.1
4 A 4 0 0.1 0.1 0 0 0.1
5 A 5 0 0 0 0 0.1 0.1
6 A 6 0 0 0 0 0 0 This will get filtered out!
7 A 7 0 0 0 NA 0 NA
8 B 1 0.1 0 0.1 0.4 NA NA
9 B 2 0.2 0.2 0.4 0 0.1 0.5
10 B 3 0 0 0 0 0.4 0.4
11 B 4 0 0 0 0.3 0 0.3
12 B 5 0.1 0.2 0.3 0.6 0 0.9
13 B 6 0.1 0.5 0.6 NA 0.3 NA
So far I have tried to do this with dplyr::lag() and dplyr::lead(), but this is very inefficient and throws a memory allocation error on the real dataset:
> Error in Sys.getenv("TESTTHAT") : > could not allocate memory (0 Mb) in C function 'R_AllocStringBuffer'
This is what I have so far. I could first compute the sum of Val1 and Val2 and then apply lead and lag to it, but that would not solve the problem.
df0 %>%
##arrange by group is not necessary since we're grouping by that var
arrange(group, Time) %>%
group_by(group) %>%
mutate(sum = Val1 + Val2 + lag(Val1) + lag(Val2) + lead(Val1) + lead(Val2)) # %>%
# filter(is.na(sum) | sum != 0)
## commenting out filter to show the full results
# > # A tibble: 13 x 5
# > # Groups: group [2]
# > group Time Val1 Val2 sum
# > <chr> <int> <dbl> <dbl> <dbl>
# > 1 A 1 0 0 NA
# ! - A 2 0 0 0
# > 2 A 3 0 0 0.1
# > 3 A 4 0 0.1 0.1
# > 4 A 5 0 0 0.1
# ! - A 6 0 0 0
# > 5 A 7 0 0 NA
# > 6 B 1 0.1 0 NA
# > 7 B 2 0.2 0.2 0.5
# > 8 B 3 0 0 0.4
# > 9 B 4 0 0 0.3
# > 10 B 5 0.1 0.2 0.9
# > 11 B 6 0.1 0.5 NA
Toy dataset:
df0 <- structure(list(group = c("A", "B", "A", "B", "A", "B",
"B", "A", "A", "A", "B", "B", "A"),
Time = c(1L, 1L, 3L, 3L, 2L, 2L, 4L, 4L, 5L, 6L, 6L, 5L, 7L),
Val1 = c(0, 0.1, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.1, 0.1, 0),
Val2 = c(0, 0, 0, 0, 0, 0.2, 0, 0.1, 0, 0, 0.5, 0.2, 0)),
row.names = c(NA, -13L),
class = c("tbl_df", "tbl", "data.frame"))
A uj5u.com user replied:
We can use rle from base R, or its faster implementation rlenc from the purler package.
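To make the run-length idea concrete, here is a minimal base R sketch that uses only rle. It assumes the row sums of a single group are already computed and ordered by Time (the vector below is group A from the toy data), and it compares to zero exactly, which may need a tolerance on real data. The variable names are illustrative and not taken from the answer.

```r
# Row sums for group A of the toy data, ordered by Time.
rs <- c(0, 0, 0, 0.1, 0, 0, 0)

runs   <- rle(rs == 0)               # run-length encoding of the "is zero" flag
ends   <- cumsum(runs$lengths)       # last row index of each run
starts <- ends - runs$lengths + 1L   # first row index of each run

# Only runs of at least three zeros have interior rows,
# i.e. rows whose previous and next rows are also zero.
long_zero <- runs$values & runs$lengths >= 3L

# Interior rows of each long zero run: everything except its first and last row.
drop <- unlist(Map(function(s, e) seq(s + 1L, e - 1L),
                   starts[long_zero], ends[long_zero]))
drop
```

For this vector, drop contains rows 2 and 6, matching the rows marked for removal in the question. With several groups, the same steps would have to be applied per group so that runs do not span a group boundary.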
library(tidyverse)
library(purler)
subsetter <- function(df){
df %>%
select(where(is.double)) %>%
rowSums() %>%
purler::rlenc() %>%
filter(lengths >= 3L & values == 0L) %>%
transmute(ids = map2(start, start + lengths, ~ (.x + 1) : (.y - 2))) %>%
unlist(use.names = F)
}
# to get data as shown in example
df0 <- df0 %>%
mutate(Time = as.character(Time)) %>%
arrange(group, Time)
edge_cases <- tribble(
~group, ~Time, ~Val1, ~Val2,
"C", "1", 0, 0,
"C", "2", 0, 0,
"C", "3", 0, 0,
"C", "4", 0, 0,
)
df1 <- rbind(df0, edge_cases)
df1 %>%
`[`(-subsetter(.),)
# A tibble: 13 x 4
group Time Val1 Val2
<chr> <chr> <dbl> <dbl>
1 A 1 0 0
2 A 3 0 0
3 A 4 0 0.1
4 A 5 0 0
5 A 7 0 0
6 B 1 0.1 0
7 B 2 0.2 0.2
8 B 3 0 0
9 B 4 0 0
10 B 5 0.1 0.2
11 B 6 0.1 0.5
12 C 1 0 0
13 C 4 0 0
bench::mark(df1 %>% `[`(-subsetter(.),))[,c(3,5,7)]
# A tibble: 1 x 3
median mem_alloc n_itr
<bch:tm> <bch:byt> <int>
1 3.91ms 9.38KB 93
A uj5u.com user replied:
Since you tagged data.table, here is a data.table-native solution:
library(data.table)
dt0 <- as.data.table(df0)
setorder(dt0, Time) # add 'group' if you want
isnum <- names(which(sapply(dt0, function(z) is.numeric(z) & !is.integer(z))))
isnum
# [1] "Val1" "Val2"
dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
][, .SD[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3,], by = .(group)
][, sum0 := NULL ][]
# group Time Val1 Val2
# <char> <int> <num> <num>
# 1: A 1 0.0 0.0
# 2: A 3 0.0 0.0
# 3: A 4 0.0 0.1
# 4: A 5 0.0 0.0
# 5: A 7 0.0 0.0
# 6: B 1 0.1 0.0
# 7: B 2 0.2 0.2
# 8: B 3 0.0 0.0
# 9: B 4 0.0 0.0
# 10: B 5 0.1 0.2
# 11: B 6 0.1 0.5
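As requested in your comments, rows A-2 and A-6 have both been removed.
Efficiency:
- rowSums is fast and efficient;
- we shift by direct indexing with a default of 0; data.table handles this very efficiently, and it does not incur the (admittedly small) overhead of lead/lag/shift calls;
- having summed each row once, we shift only that one value, instead of doing four shifts per row.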
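The direct-indexing shift described in the list above can be sketched in base R as follows. This is an illustration for a single group only; the flag vector is group A of the toy data and the variable names are assumptions, not part of the answer.

```r
# TRUE where the row sum is (numerically) zero; group A of the toy data.
sum0 <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
n <- length(sum0)

prev_flag <- c(0, sum0[-n])   # flag of the previous row, 0 for the first row
next_flag <- c(sum0[-1], 0)   # flag of the next row, 0 for the last row

# A row is dropped only when it and both of its neighbours are zero,
# i.e. when the three flags add up to 3.
keep <- (prev_flag + sum0 + next_flag) < 3
which(!keep)
```

Padding with 0 at both ends means the first and last rows of a group can never reach a count of 3, so they are always kept.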
Edit: a slight performance improvement (15-20%):
dt0[
dt0[, sum0 := abs(rowSums(.SD)) < 1e-9, .SDcols = isnum
][, .I[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3], by=group ]$V1
][, sum0 := NULL][]
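Admittedly this may be a bit harder to follow, but it produces the same result in roughly 82% of the time (with this dataset). Thanks to @Henrik for helping me learn more about .I and its benefits.
A uj5u.com user replied:
You can try the following data.table option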
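For readers unfamiliar with .I, here is a small sketch of the pattern on a hypothetical four-row table (the table and its column names are made up for illustration). Inside a grouped j expression, .I holds the row numbers of the current group within the whole table, so the filter can be evaluated per group while the subsetting itself happens once on the full table.

```r
library(data.table)

# Hypothetical example data, not from the question.
dt <- data.table(g = c("a", "a", "b", "b"), x = c(1, 5, 2, 7))

# Per group, collect the row numbers (in the full table) that pass the filter.
idx <- dt[, .I[x > 1.5], by = g]$V1

# Subset the full table once, using the collected row numbers.
res <- dt[idx]
res
```

Here idx is 2, 3, 4, so only the first row is dropped.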
setorder(setDT(df0), group, Time)[
,
rs := rowSums(Filter(is.double, .SD))
][, .SD[!(rs == 0 & .N > 2 & (!rowid(rs) %in% c(1, .N)))], rleid(rs)][
,
rleid := NULL
][]
This gives
group Time Val1 Val2
1: A 1 0.0 0.0
2: A 3 0.0 0.0
3: A 4 0.0 0.1
4: A 5 0.0 0.0
5: A 7 0.0 0.0
6: B 1 0.1 0.0
7: B 2 0.2 0.2
8: B 3 0.0 0.0
9: B 4 0.0 0.0
10: B 5 0.1 0.2
11: B 6 0.1 0.5
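A uj5u.com user replied:
This solution is largely inspired by @r2evans. It uses Reduce, `+`, and shift, instead of the rowSums and c functions that @r2evans's solution is based on. I think the improvement comes from using Reduce(`+`, .SD) instead of rowSums(.SD) on a data.frame/data.table, and from avoiding the [, .SD[...], ...] data.table syntax; it is faster (at least on my PC) and more memory-efficient (no conversion to a matrix). Caveat: there is no direct equivalent of rowSums(.SD, na.rm=TRUE).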
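The relationship between the two row-sum approaches can be checked on a small data frame. The sketch below uses made-up values; add_na_rm is a hypothetical helper (not part of the answer) showing one way to approximate na.rm = TRUE by treating missing values as 0.

```r
# Hypothetical example data with one missing value.
df <- data.frame(Val1 = c(0, 0.1, NA), Val2 = c(0, 0.2, 0.5))

via_reduce  <- Reduce(`+`, df)        # adds the columns element-wise, no matrix conversion
via_rowsums <- unname(rowSums(df))    # converts to a matrix first

# Both propagate NA in the same way when na.rm is FALSE.
same_default <- isTRUE(all.equal(via_reduce, via_rowsums))

# Hypothetical helper: replace NA by 0 on both sides before adding.
add_na_rm <- function(x, y) ifelse(is.na(x), 0, x) + ifelse(is.na(y), 0, y)
via_reduce_narm <- Reduce(add_na_rm, df)

same_narm <- isTRUE(all.equal(via_reduce_narm,
                              unname(rowSums(df, na.rm = TRUE))))
```

The extra ifelse calls add work per column, so whether this keeps the speed advantage over rowSums on large data is something to benchmark rather than assume.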
n = 1e7
dt0 = setDT(df0[sample(nrow(df0), n, replace=TRUE), ])
setorder(dt0, group, Time)
isnum = sapply(dt0, function(x) is.numeric(x) && !is.integer(x))
eps = sqrt(.Machine$double.eps)
# New solution
f1 = function() {
ans = dt0[, is0 := {sum0 = abs(Reduce(`+`, .SD)) < eps; Reduce(`+`, shift(sum0, -1:1, fill=0)) < 3},
by=group, .SDcols=isnum][(is0), !"is0"]
dt0[, is0 := NULL] # remove is0 from the initial dataset
ans
}
# similar to f1: easily adaptable to rowSums(.SD, na.rm=TRUE).
f2 = function() {
# here I replace Reduce(`+`, .SD) with rowSums(.SD) just in case its na.rm argument is needed.
ans = dt0[, is0 := {sum0 = abs(rowSums(.SD)) < eps; Reduce(`+`, shift(sum0, -1:1, fill=0)) < 3},
by=group, .SDcols=isnum][(is0), !"is0"]
dt0[, is0:=NULL] # remove is0 from the initial dataset
ans
}
# r2evans first solution
f3 = function() {
ans = dt0[
dt0[, sum0 := abs(rowSums(.SD)) < eps, .SDcols = isnum
][, .I[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3], by=group ]$V1
][, sum0 := NULL][]
dt0[, sum0 := NULL] # remove sum0 from the initial dataset
ans
}
# r2evans second solution
f4 = function() {
ans = dt0[, sum0 := abs(rowSums(.SD)) < eps, .SDcols = isnum
][, .SD[(c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3,], by = .(group)
][, sum0 := NULL ][]
dt0[, sum0:=NULL] # remove sum0 from the initial dataset
ans
}
# modified version of r2evans second solution: similar to f4 but avoid [, .SD[...], by=group]
f5 = function() {
ans = dt0[, sum0 := abs(rowSums(.SD)) < eps, .SDcols = isnum
][, sum0 := (c(0,sum0[-.N]) + sum0 + c(sum0[-1],0)) < 3, by = .(group)
][(sum0), !"sum0"][]
dt0[, sum0:=NULL] # remove sum0 from the initial dataset
ans
}
Benchmark
bench::mark(
f1(),
f2(),
f3(),
f4(),
f5(),
iterations=5L, check=FALSE
)
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 f1() 347ms 406ms 2.49 698.47MB 5.48 5 11 2.01s
2 f2() 529ms 578ms 1.69 851.02MB 4.06 5 12 2.96s
3 f3() 717ms 821ms 1.22 1.25GB 3.40 5 14 4.12s
4 f4() 889ms 956ms 1.04 1.57GB 5.01 5 24 4.79s
5 f5() 642ms 677ms 1.40 1.07GB 3.37 5 12 3.56s
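Based on these results, the first solution (f1) is about 2x faster than f3 and f4, and it is also more memory-efficient.
I am using the development version of data.table (data.table 1.14.3).
A uj5u.com user replied: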
library(tidyverse)
df0 %>%
arrange(group, Time) %>% # EDIT to arrange by time (and group for clarity)
rowwise() %>%
mutate(sum = sum(c_across(Val1:Val2))) %>%
group_by(group) %>%
filter( !(sum == 0 & lag(sum, default = 1) == 0 & lead(sum, default = 1) == 0)) %>%
ungroup()
# A tibble: 11 x 5
group Time Val1 Val2 sum
<chr> <int> <dbl> <dbl> <dbl>
1 A 1 0 0 0
2 A 3 0 0 0
3 A 4 0 0.1 0.1
4 A 5 0 0 0
5 A 7 0 0 0
6 B 1 0.1 0 0.1
7 B 2 0.2 0.2 0.4
8 B 3 0 0 0
9 B 4 0 0 0
10 B 5 0.1 0.2 0.3
11 B 6 0.1 0.5 0.6
