Subsetting to the first row of duplicated data based on multiple conditions in R

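I have repeated rows of ethnicity data for random people; the date is the date on which the ethnicity was recorded. I want to assign each person one ethnicity category (white, asian, black, other, mixed) based on these conditions: (1) if the person has more than one ethnicity, assign the most common one (highest ethnicity_n); (2) if the person has several ethnicities with the same count (for example, 1 mixed, 1 asian, 1 other), assign the most recent one. I organized the data so that each ethnicity is counted per person, and wrote code to sort the ethnicity dates in descending order. However, when I run the code to take the first row of the organized table, I end up with a seemingly random ethnicity for each person.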

person	ethnicity	ethnicity_n	ethnicity_date
1	white	4	04/09/2019
1	white	4	04/09/2018
1	white	4	04/09/2017
1	white	4	04/09/2016
1	other	1	04/09/2015
2	asian	1	04/09/2019
2	other	1	04/09/2018
2	mixed	1	04/09/2017
3	black	2	04/09/2016
3	black	2	04/09/2015

I produced the table above with this code:

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date))

I would like the final table to look like this:

person	ethnicity	ethnicity_n	ethnicity_date
1	white	4	04/09/2019
2	asian	1	04/09/2019
3	black	2	04/09/2016

I tried all of the following to get the second table, but each time the ethnicity returned does not meet the conditions I want:

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  slice(1L)

df %>%
  group_by(person) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  slice(1L)

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  filter(row_number()==1)

df %>%
  group_by(person) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  filter(row_number()==1)

Data:

df <- structure(list(person = c(
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
    3L
), ethnicity = c(
    "white", "white", "white", "white", "other",
    "asian", "other", "mixed", "black", "black"
), ethnicity_n = c(
    4L,
    4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L
), ethnicity_date = c(
    "04/09/2019",
    "04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015", "04/09/2019",
    "04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015"
)), class = "data.frame", row.names = c(
    NA,
    -10L
))

Answer from a uj5u.com user:

Your main problem is that ethnicity_date is a character vector, not a Date.

I have assumed it is in month-day-year format, but if it is day-month-year you can change format = "%m/%d/%Y" to format = "%d/%m/%Y".
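To make the point about character dates concrete, here is a minimal sketch in base R. The two date strings in dates_chr are hypothetical and chosen only to expose the problem; they are not taken from the question's data, and the month-day-year reading is the same assumption the answer makes.

```r
# Hypothetical dates chosen to show the problem; not from the question's data.
dates_chr <- c("04/09/2019", "12/01/2015")

# As character strings, max() compares text, so the 2015 entry is returned.
max_chr <- max(dates_chr)

# Parsed as Date (assuming month/day/year), max() compares real dates.
dates_mdy <- as.Date(dates_chr, format = "%m/%d/%Y")
max_mdy <- max(dates_mdy)

# The same string parsed under the two candidate formats gives different dates.
as_mdy <- as.Date("04/09/2019", format = "%m/%d/%Y")  # 9 April 2019
as_dmy <- as.Date("04/09/2019", format = "%d/%m/%Y")  # 4 September 2019

print(max_chr)
print(max_mdy)
print(c(as_mdy, as_dmy))
```

In the question's sample data every date shares the same day and month, so either format gives the same ordering; the choice only matters for which calendar dates end up in the result.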

Note that because we apply max() to ethnicity_n and ethnicity_date, there is no need to arrange() the data first. If either column contains any NA values, you will need to pass na.rm = TRUE to max().

library(dplyr)

df |>
    mutate(
        ethnicity_date = as.Date(
            ethnicity_date,
            format = "%m/%d/%Y"
        )
    ) |>
    group_by(person) |>
    filter(
        ethnicity_n == max(ethnicity_n)
    ) |>
    filter(
        ethnicity_date == max(ethnicity_date)
    ) |>
    slice(1L) # in case there are still ties

# # A tibble: 3 x 4
# # Groups:   person [3]
#   person ethnicity ethnicity_n ethnicity_date
#    <int> <chr>           <int> <date>
# 1      1 white               4 2019-04-09
# 2      2 asian               1 2019-04-09
# 3      3 black               2 2016-04-09

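I have kept your slice(1L) at the end in case a person has multiple rows with the same ethnicity_n and ethnicity_date, but you can remove it if you would rather keep both rows in that case.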
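The remark above about NA values can be illustrated with a small base R sketch. The dates vector is hypothetical and exists only for this illustration.

```r
# Hypothetical vector with one missing date, for illustration only.
dates <- as.Date(c("2019-04-09", NA, "2017-04-09"))

# With the default na.rm = FALSE, a single NA makes the result NA.
max_default <- max(dates)

# With na.rm = TRUE the missing value is ignored.
max_no_na <- max(dates, na.rm = TRUE)

print(max_default)
print(max_no_na)
```

Inside filter(), a comparison against an NA maximum evaluates to NA for every row, and filter() drops rows whose condition is NA, so a person with one missing date would lose all rows unless na.rm = TRUE is supplied.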

Answer from a uj5u.com user:

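Edit: thanks for adding the sample dataset, that makes this easier.

I think slice_max() is what you are looking for; see below. It is best to format the date properly first.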

library(dplyr)
df %>%
  mutate(ethnicity_date = as.Date(ethnicity_date, format = "%d/%m/%Y")) %>% 
  group_by(person) %>% 
  slice_max(ethnicity_n) %>% 
  slice_max(ethnicity_date) %>% 
  ungroup()

# A tibble: 3 × 4
# person ethnicity ethnicity_n ethnicity_date
# <int> <chr>           <int> <date>        
# 1      1 white               4 2019-09-04    
# 2      2 asian               1 2019-09-04    
# 3      3 black               2 2016-09-04
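For comparison, the original arrange-then-take-first-row approach can also be made to work. In the question's attempts, ethnicity_n is sorted in ascending order, so the first row per person is the least common ethnicity, and grouping by person, ethnicity_n and ethnicity_date makes almost every row its own group. The sketch below is one possible fix, not the method of either answer: it sorts both columns in descending order and groups by person only. The data frame is rebuilt from the question's sample data, and the month-day-year format is an assumption.

```r
library(dplyr)

# Sample data from the question, rebuilt compactly.
df <- data.frame(
  person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  ethnicity = c("white", "white", "white", "white", "other",
                "asian", "other", "mixed", "black", "black"),
  ethnicity_n = c(4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L),
  ethnicity_date = c("04/09/2019", "04/09/2018", "04/09/2017", "04/09/2016",
                     "04/09/2015", "04/09/2019", "04/09/2018", "04/09/2017",
                     "04/09/2016", "04/09/2015")
)

result <- df %>%
  # Assumption: dates are month/day/year; switch to "%d/%m/%Y" if not.
  mutate(ethnicity_date = as.Date(ethnicity_date, format = "%m/%d/%Y")) %>%
  # Most common ethnicity first, then most recent date, within each person.
  arrange(person, desc(ethnicity_n), desc(ethnicity_date)) %>%
  group_by(person) %>%
  slice_head(n = 1) %>%
  ungroup()

print(result)
```

Because slice_head(n = 1) always keeps exactly one row per person, any remaining ties are broken by row order rather than kept, which matches the effect of the trailing slice(1L) in the first answer.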


Tags: r, date, dplyr
