Using "data.table" to select non-"NA" values from duplicated rows when there is more than one grouping variable

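I want to keep distinct rows in a data frame, using an algorithm that picks the last value in each group (dplyr::distinct() keeps the first row per group by default), but only if that value is not NA. I saw this great answer on SO that relies on data.table, but I can't extend it to data with more than one grouping variable.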

To demonstrate the problem, I start with a minimal example that works and then scale it up. First, consider the following data:

library(tibble)

df_id_and_type <-
  tibble::tribble(
        ~id, ~type,
          1,   "A",
          1,    NA,
          2,   "B",
          3,   "A",
          3,    NA,
          3,   "D",
          3,    NA,
          4,    NA,
          4,   "C",
          5,   "A",
          6,    NA,
          6,   "B",
          6,    NA
        )

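I want to get one distinct value of type per id by picking the last value, unless it is NA. If the last one is NA, go up until there is a non-NA value. This answer shows how to do that with data.table: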

library(data.table)

dt_id_and_type        <- as.data.table(df_id_and_type)
dt_id_and_type$typena <- is.na(dt_id_and_type$type)
setorderv(dt_id_and_type, c("typena","id"), order = c(-1, 1))
dt_id_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE]
#>    id type
#> 1:  1    A
#> 2:  2    B
#> 3:  3    D
#> 4:  4    C
#> 5:  5    A
#> 6:  6    B

But what if we have more than one grouping variable (i.e., not only id)? In the following example I add a year variable:

df_id_year_and_type <-
  df_id_and_type %>%
  add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012), 
             .before = "type")

df_id_year_and_type
#> # A tibble: 13 x 3
#>       id  year type 
#>    <dbl> <dbl> <chr>
#>  1     1  2002 A    
#>  2     1  2002 <NA> 
#>  3     2  2008 B    
#>  4     3  2010 A    
#>  5     3  2010 <NA> 
#>  6     3  2010 D    
#>  7     3  2013 <NA> 
#>  8     4  2020 <NA> 
#>  9     4  2020 C    
#> 10     5  2009 A    
#> 11     6  2010 <NA> 
#> 12     6  2010 B    
#> 13     6  2012 <NA>

My expected output is:

## # A tibble: 8 x 3
##      id  year type 
##   <dbl> <dbl> <chr>
## 1     1  2002 A    
## 2     2  2008 B    
## 3     3  2010 D    
## 4     3  2013 NA   # for id 3 in year 2013 there was only `NA`, so that's what we get
## 5     4  2020 C    
## 6     5  2009 A    
## 7     6  2010 B    
## 8     6  2012 NA   # same as comment above

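Any idea how to extend the solution that works in the one-grouping-variable case to the current data? The first two lines of code are straightforward: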

dt_id_year_and_type        <- as.data.table(df_id_year_and_type)
dt_id_year_and_type$typena <- is.na(dt_id_year_and_type$type)
setorderv(dt_id_year_and_type, c("typena","id"), order = c(-1, 1)) # <--- how to account for `year`?
dt_id_year_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE] # <--- here too...

Answer:

I would propose this solution, in which you can leverage unique. If all observations of a group are NA, sum(is.na(x)) / .N equals 1, and we go from there:

library(tibble)
library(data.table)

df_id_and_type <-
  tibble::tribble(
    ~id, ~type,
    1,   "A",
    1,    NA,
    2,   "B",
    3,   "A",
    3,    NA,
    3,   "D",
    3,    NA,
    4,    NA,
    4,   "C",
    5,   "A",
    6,    NA,
    6,   "B",
    6,    NA
  )


df_id_year_and_type <-
  df_id_and_type %>%
  add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012), 
             .before = "type")

# convert to data.table
dt_id_year_and_type <- as.data.table(df_id_year_and_type)

# define grouping vars
grouping_vars <- c("id", "year")

# are all types na for a group?
dt_id_year_and_type[, na_ratio := sum(is.na(type)) / .N, 
                    by = c(grouping_vars)]

# remove all rows where type is NA, unless they come from a group in which
# all observations are NA
dt_id_year_and_type <- dt_id_year_and_type[na_ratio == 1 | !is.na(type)]

# sort correctly
setorderv(dt_id_year_and_type, grouping_vars) 
dt_id_year_and_type
#>    id year type  na_ratio
#> 1:  1 2002    A 0.5000000
#> 2:  2 2008    B 0.0000000
#> 3:  3 2010    A 0.3333333
#> 4:  3 2010    D 0.3333333
#> 5:  3 2013 <NA> 1.0000000
#> 6:  4 2020    C 0.5000000
#> 7:  5 2009    A 0.0000000
#> 8:  6 2010    B 0.5000000
#> 9:  6 2012 <NA> 1.0000000

# keep only the last observation of each group
dt_unique <- unique(dt_id_year_and_type, by = grouping_vars, fromLast = TRUE)

# remove the helper column, which is no longer needed
dt_unique[, na_ratio := NULL]
dt_unique
#>    id year type
#> 1:  1 2002    A
#> 2:  2 2008    B
#> 3:  3 2010    D
#> 4:  3 2013 <NA>
#> 5:  4 2020    C
#> 6:  5 2009    A
#> 7:  6 2010    B
#> 8:  6 2012 <NA>
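
The same sort-then-deduplicate idea can also be written closer to the code in the question. The following is only a sketch, not a tested general solution: it assumes that setorderv() keeps the original row order within ties, rebuilds the data directly as a data.table, and uses `dt` and `res` as illustrative names.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  id   = c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6),
  year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013,
           2020, 2020, 2009, 2010, 2010, 2012),
  type = c("A", NA, "B", "A", NA, "D", NA, NA, "C", "A", NA, "B", NA)
)

grouping_vars <- c("id", "year")

# Flag NA rows and sort them to the top (typena descending), then sort by
# the grouping columns; rows that tie keep their original relative order
dt[, typena := is.na(type)]
setorderv(dt, c("typena", grouping_vars),
          order = c(-1, rep(1, length(grouping_vars))))

# The last row of each (id, year) group is now the last non-NA row,
# or an NA row if the group has nothing else
res <- unique(dt, by = grouping_vars, fromLast = TRUE)

# Restore the group order and drop the helper column
setorderv(res, grouping_vars)
res[, typena := NULL]
res
```

Because the NA rows are sorted ahead of everything else, the two all-NA groups (id 3 in 2013 and id 6 in 2012) end up at the top after unique(), which is why the final setorderv() call is needed.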

Answer:

Another possible solution:

library(tidyverse) 

df_id_year_and_type %>% 
  group_by(id, year) %>% 
  fill(type, .direction = "downup") %>% 
  summarise(type = last(type), .groups = "drop")

#> # A tibble: 8 × 3
#>      id  year type 
#>   <dbl> <dbl> <chr>
#> 1     1  2002 A    
#> 2     2  2008 B    
#> 3     3  2010 D    
#> 4     3  2013 <NA> 
#> 5     4  2020 C    
#> 6     5  2009 A    
#> 7     6  2010 B    
#> 8     6  2012 <NA>

Answer:

Here are some data.table-based solutions.

setDT(df_id_year_and_type)

Method 1

na.omit(df_id_year_and_type, cols="type") drops the rows in which column type is NA. unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all groups. By joining the two (keeping the last match with mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols="type"
        )[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE), 
          on=c('id', 'year'), 
          mult="last"]

#       id  year   type
#    <num> <num> <char>
# 1:     1  2002      A
# 2:     2  2008      B
# 3:     3  2010      D
# 4:     3  2013   <NA>
# 5:     4  2020      C
# 6:     5  2009      A
# 7:     6  2010      B
# 8:     6  2012   <NA>
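
To see why the join returns NA for the all-NA groups, it may help to look at the two pieces separately. This is a sketch that rebuilds the question's data as a data.table; the names `dt`, `non_na`, `groups` and `res` are illustrative and not part of the answer above.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  id   = c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6),
  year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013,
           2020, 2020, 2009, 2010, 2010, 2012),
  type = c("A", NA, "B", "A", NA, "D", NA, NA, "C", "A", NA, "B", NA)
)

# Piece 1: only the rows that carry a usable type
non_na <- na.omit(dt, cols = "type")

# Piece 2: one row per (id, year) group, including the all-NA groups
groups <- unique(dt[, .(id, year)])

# For every group, look up its last matching row in non_na; groups without
# any match are kept and get NA in type (the default nomatch behaviour)
res <- non_na[groups, on = c("id", "year"), mult = "last"]
res
```

The groups (3, 2013) and (6, 2012) have no row in `non_na`, so the join fills their type with NA instead of dropping them.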

Method 2

df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
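
The index expression which.max(cumsum(!is.na(type))) is compact, so here is a small base R sketch of what it does on a single group. The helper name `last_non_na_pos` is made up for illustration.

```r
# Hypothetical helper: position of the last non-NA element of x,
# or 1 if every element is NA
last_non_na_pos <- function(x) {
  # cumsum() counts the non-NA values seen so far; the count reaches its
  # maximum at the last non-NA element, and which.max() returns the first
  # position at which that maximum occurs
  which.max(cumsum(!is.na(x)))
}

type <- c("A", NA, "D", NA)
cumsum(!is.na(type))      # 1 1 2 2
last_non_na_pos(type)     # 3, the position of "D"

all_na <- c(NA_character_, NA_character_)
cumsum(!is.na(all_na))    # 0 0
last_non_na_pos(all_na)   # 1, so an all-NA group still keeps one (NA) row
```

Within each (id, year) group, .I[...] turns that position into a row number of the full table, and the outer subset then picks those rows.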

Method 3

(may be slower because of the overhead of [)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
