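I want to keep distinct rows in a data frame, using an algorithm that chooses the last value in each group (unlike dplyr::distinct(), which keeps the first row of each group by default), but only if that value is not NA. I saw this nice answer on SO that relies on data.table, but I can't extend it to data with more than one grouping variable.
To demonstrate the problem, I start with a minimal example that works and then scale it up. First, consider the following data: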
library(tibble)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
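I want to get a distinct value of type per id by choosing the last value, unless it is NA. If the last value is NA, move up until a non-NA value is found. This answer shows how to do that with data.table: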
library(data.table)
dt_id_and_type <- as.data.table(df_id_and_type)
dt_id_and_type$typena <- is.na(dt_id_and_type$type)
setorderv(dt_id_and_type, c("typena","id"), order = c(-1, 1))
dt_id_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE]
#> id type
#> 1: 1 A
#> 2: 2 B
#> 3: 3 D
#> 4: 4 C
#> 5: 5 A
#> 6: 6 B
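The selection rule described above (take the last value, walking back past any NA) can be written as a small base R helper. This is only a sketch to make the rule explicit; last_non_na is a hypothetical name, not part of the linked answer or of any package.

```r
# Hypothetical helper: return the last non-NA element of x,
# or an NA of the same type when every element is NA.
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) x[NA_integer_] else keep[length(keep)]
}

# Toy data (assumed for illustration): the first three ids from the example
id   <- c(1, 1, 2, 3, 3, 3, 3)
type <- c("A", NA, "B", "A", NA, "D", NA)

# Apply the rule per id
per_id <- vapply(split(type, id), last_non_na, character(1))
per_id
```

For these ids the helper picks A, B and D, which agrees with the first three rows of the data.table output.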
But what if we have more than one grouping variable (that is, not only id)? In the following example I add a year variable:
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
df_id_year_and_type
#> # A tibble: 13 x 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 1 2002 <NA>
#> 3 2 2008 B
#> 4 3 2010 A
#> 5 3 2010 <NA>
#> 6 3 2010 D
#> 7 3 2013 <NA>
#> 8 4 2020 <NA>
#> 9 4 2020 C
#> 10 5 2009 A
#> 11 6 2010 <NA>
#> 12 6 2010 B
#> 13 6 2012 <NA>
My expected output is:
## # A tibble: 8 x 3
## id year type
## <dbl> <dbl> <chr>
## 1 1 2002 A
## 2 2 2008 B
## 3 3 2010 D
## 4 3 2013 NA # for id 3 in year 2013 there was only `NA`, so that's what we get
## 5 4 2020 C
## 6 5 2009 A
## 7 6 2010 B
## 8 6 2012 NA # same as comment above
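Any idea how to extend the solution that works in the one-grouping-variable case to the current data? The first two lines of code are straightforward; the last two are where I am stuck: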
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
dt_id_year_and_type$typena <- is.na(dt_id_year_and_type$type)
setorderv(dt_id_year_and_type, c("typena","id"), order = c(-1, 1)) # <--- how to account for `year`?
dt_id_year_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE] # <--- here too...
Answer:
I would propose this solution, which relies on unique. If all observations in a group are NA, sum(is.na(x)) / .N equals 1, and we go from there:
library(tibble)
library(data.table)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
# convert to data.table
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
# define grouping vars
grouping_vars <- c("id", "year")
# are all types na for a group?
dt_id_year_and_type[, na_ratio := sum(is.na(type)) / .N,
by = c(grouping_vars)]
# remove all rows where type is NA, unless they belong to a group in which
# all observations are NA
dt_id_year_and_type <- dt_id_year_and_type[na_ratio == 1 | !is.na(type)]
# sort correctly
setorderv(dt_id_year_and_type, grouping_vars)
dt_id_year_and_type
#> id year type na_ratio
#> 1: 1 2002 A 0.5000000
#> 2: 2 2008 B 0.0000000
#> 3: 3 2010 A 0.3333333
#> 4: 3 2010 D 0.3333333
#> 5: 3 2013 <NA> 1.0000000
#> 6: 4 2020 C 0.5000000
#> 7: 5 2009 A 0.0000000
#> 8: 6 2010 B 0.5000000
#> 9: 6 2012 <NA> 1.0000000
# keep only the last observation of each group
dt_unique <- unique(dt_id_year_and_type, by = grouping_vars, fromLast = TRUE)
# remove the helper column, which is no longer needed
dt_unique[, na_ratio := NULL]
dt_unique
#> id year type
#> 1: 1 2002 A
#> 2: 2 2008 B
#> 3: 3 2010 D
#> 4: 3 2013 <NA>
#> 5: 4 2020 C
#> 6: 5 2009 A
#> 7: 6 2010 B
#> 8: 6 2012 <NA>
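The final step of this answer depends on unique() for data.table keeping the last row of each group when fromLast = TRUE. A minimal sketch on toy data (the table toy and its columns g and v are made up for illustration):

```r
library(data.table)

# Toy table: group 1 has two rows, group 2 has one
toy <- data.table(g = c(1, 1, 2), v = c("a", "b", "c"))

# Keep one row per value of g, scanning from the bottom,
# so the last row of each group survives
last_rows <- unique(toy, by = "g", fromLast = TRUE)
last_rows
```

Because the NA rows were already dropped from every group that has at least one non-NA value, the last surviving row of each group is the last non-NA observation.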
Answer:
Another possible solution:
library(tidyverse)
df_id_year_and_type %>%
group_by(id, year) %>%
fill(type, .direction = "downup") %>%
summarise(type = last(type), .groups = "drop")
#> # A tibble: 8 × 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 2 2008 B
#> 3 3 2010 D
#> 4 3 2013 <NA>
#> 5 4 2020 C
#> 6 5 2009 A
#> 7 6 2010 B
#> 8 6 2012 <NA>
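To see why filling before taking the last value works, it may help to look at a single group in isolation. The sketch below uses a made-up one-column tibble (one_group) and tidyr::fill() without grouping:

```r
library(tidyr)
library(tibble)

# One group from the example: id 3, with the 2010 and 2013 rows merged
# purely for illustration
one_group <- tibble(type = c("A", NA, "D", NA))

# "downup" first carries values downward, then fills any leading NA upward
filled <- fill(one_group, type, .direction = "downup")
filled$type

# A group that contains only NA has nothing to carry, so it stays NA
all_na <- fill(tibble(type = c(NA_character_, NA_character_)), type,
               .direction = "downup")
all_na$type
```

After filling, the last element of the group is the last non-NA value whenever one exists, so last(type) returns it.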
Answer:
Here are some data.table-based solutions.
library(data.table)
# convert the tibble to a data.table by reference
setDT(df_id_year_and_type)
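Method 1
na.omit(df_id_year_and_type, cols="type") drops the rows that are NA in column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all groups. By joining the two (keeping the last match with mult="last"), we obtain the desired output. Groups that contain only NA have no match in the filtered table, so the join returns NA for them.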
na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
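The join in Method 1 can be tried on a smaller table to see how mult="last" and unmatched groups behave. This is a sketch on made-up data (dt and groups are illustrative names) with a single grouping column:

```r
library(data.table)

# Toy data: id 1 has two non-NA values, id 2 has only NA
dt <- data.table(id = c(1, 1, 2), type = c("A", "B", NA))

# Every group that exists in the data
groups <- unique(dt[, .(id)])

# Drop NA rows, then look up each group, keeping the last matching row;
# groups without a match get NA in type
res <- na.omit(dt, cols = "type")[groups, on = "id", mult = "last"]
res
```

Here id 1 resolves to B, its last non-NA value, and id 2 is kept with type NA.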
Method 2
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
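The expression which.max(cumsum(!is.na(type))) is the core of Methods 2 and 3. A short sketch on plain vectors (the values are made up to mirror one group) shows what it computes:

```r
# One group with a trailing NA
type <- c("A", NA, "D", NA)

# Running count of non-NA values seen so far: 1 1 2 2
running_count <- cumsum(!is.na(type))

# which.max() returns the FIRST position of the maximum,
# which is the position of the last non-NA value
pos <- which.max(running_count)
type[pos]

# If every value is NA the running count is all zeros,
# so which.max() returns 1 and the selected value is NA
all_na <- c(NA_character_, NA_character_)
pos_all_na <- which.max(cumsum(!is.na(all_na)))
all_na[pos_all_na]
```

This is why a group that holds only NA still contributes exactly one row to the result.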
Method 3
(possibly slower because of the overhead of [)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
