I have a dataset in R with more than 800k observations. It looks like this:
id <- c("58497484", "58544005", "74766653", "74764718", "62824455", "58497484", "58497484", "74766653", "58544005")
key <- c("5718368_09/06/1981_3_2014", "2077485_02/06/1977_8_2014", "2091585_23/10/1982_1_2014", "2077388_30/01/2000_11_2017", "2082225_02/07/1998_10_2017",
"2077450_04/05/2001_1_2016", "2077477_03/03/1978_8_2017",
"2077388_30/01/2020_11_2019", "5718368_08/06/1982_3_2012")
out <- c("2.1 - Reason 1", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.3 - Reason 3" ,
"1.2 - Reason 2", "3.6 - Reason 8", "3.2 - Reason 10")
flag1 <- c("1", "0", "1", "1", "1", "1", "0", "1", "0")
flag2 <- c("0", "0", "0", "1", "0", "1", "0", "1", "0")
data <- as.data.frame(cbind(id, key, out, flag1, flag2))
Some of my IDs are duplicated. What I need to do is group those IDs so that I can see all the observations in each group. I use this code to do that:
library(dplyr)
data <- data %>%
  arrange(id) %>%
  mutate(id = ifelse(duplicated(id), "", id))
When I do this, my df looks like this:
id key out flag1 flag2
58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0
2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1
2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0
58544005 2077485_02/06/1977_8_2014 1.2 - Reason 2 0 0
5718368_08/06/1982_3_2012 3.2 - Reason 10 0 0
62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0
74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1
74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0
2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1
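What I need to do is delete whole groups, using the "out", "flag1", and "flag2" variables as conditions. The rule is: if any row in a grouped id has an "out" value starting with 2 or 3 and both "flag1" and "flag2" equal to zero (0), that id's group should be removed. My desired output is as follows: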
id key out flag1 flag2
58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0
2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1
2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0
62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0
74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1
74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0
2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1
Note that the ID "58544005" is not present in the output, because it has a row with both flags equal to zero and "out" reason 3.2. Thanks in advance.
Reply from a uj5u.com user:
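library(tidyverse)
data %>%
  arrange(id) %>%
  # convert the character columns (e.g. the flags) to their natural types
  type.convert(as.is = TRUE) %>%
  group_by(id) %>%
  # drop a group when both flags sum to 0 over the whole group
  # and any of its rows has `out` starting with 2 or 3
  filter(!(if_all(flag1:flag2, ~ sum(.x) == 0) &
             any(substr(out, 1, 1) %in% 2:3))) %>%
  ungroup() %>%
  # blank out repeated ids for display
  mutate(id = replace(id, duplicated(id), ''))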
# A tibble: 7 x 5
id key out flag1 flag2
<chr> <chr> <chr> <int> <int>
1 "58497484" 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0
2 "" 2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1
3 "" 2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0
4 "62824455" 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0
5 "74764718" 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1
6 "74766653" 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0
7 "" 2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1
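One thing to watch: the filter above tests the flags at the group level, so an id is dropped only when `flag1` and `flag2` both sum to zero over the whole group. That matches the sample data, but it can differ from a row-wise reading of the question. Below is a minimal sketch of a row-wise variant, assuming the rule is "drop the id if any single row has `out` starting with 2 or 3 and both flags equal to zero". The `toy` data frame and the name `kept` are made up for illustration.

```r
library(dplyr)

# Hypothetical data: id "A" has one row that meets the row-wise condition
# (out starts with 3, both flags 0) and one row with both flags set.
toy <- data.frame(
  id    = c("A", "A", "B"),
  out   = c("3.1 - Reason 9", "1.2 - Reason 2", "1.2 - Reason 2"),
  flag1 = c(0L, 1L, 0L),
  flag2 = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Row-wise rule: drop every id that has at least one offending row.
kept <- toy %>%
  group_by(id) %>%
  filter(!any(substr(out, 1, 1) %in% c("2", "3") & flag1 == 0 & flag2 == 0)) %>%
  ungroup()

kept$id
```

Here only id "B" survives, whereas the group-level filter would keep "A" as well, because `flag1` and `flag2` do not sum to zero within that group.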
Reply from a uj5u.com user:
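Based on your clarification in the comments, it seems you want to delete all records for an id when any record for that id meets the three conditions. Here is the strategy I would use:
1. For each row, check whether the deletion condition is met.
2. For each id, check whether any record has been marked for deletion. If so, delete that id.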
library(tidyverse)
data %>%
# Step 1: Create an indicator for deletion for each record
mutate(
delete_ind =
str_sub(out, 1, 1) %in% c("2", "3") &
flag1 == 0 &
flag2 == 0
) %>%
# Step 2: Filter out all id's that satisfied the condition at least once
group_by(id) %>%
filter(sum(delete_ind) == 0) %>%
ungroup()
# A tibble: 7 x 6
id key out flag1 flag2 delete_ind
<chr> <chr> <chr> <chr> <chr> <lgl>
1 58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0 FALSE
2 74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0 FALSE
3 74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1 FALSE
4 62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0 FALSE
5 58497484 2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1 FALSE
6 58497484 2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0 FALSE
7 74766653 2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1 FALSE
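With 800k+ rows, the same two steps can also be written without grouping, by collecting the offending ids first and then excluding them. This is a base R sketch, assuming the flags are stored as "0"/"1" strings as in the question. The names `cond`, `bad_ids`, and `result` are illustrative, and the `key` column is left out for brevity.

```r
# Sample data from the question (without the key column).
data <- data.frame(
  id    = c("58497484", "58544005", "74766653", "74764718", "62824455",
            "58497484", "58497484", "74766653", "58544005"),
  out   = c("2.1 - Reason 1", "1.2 - Reason 2", "1.2 - Reason 2",
            "1.2 - Reason 2", "1.2 - Reason 2", "1.3 - Reason 3",
            "1.2 - Reason 2", "3.6 - Reason 8", "3.2 - Reason 10"),
  flag1 = c("1", "0", "1", "1", "1", "1", "0", "1", "0"),
  flag2 = c("0", "0", "0", "1", "0", "1", "0", "1", "0"),
  stringsAsFactors = FALSE
)

# Step 1: mark the rows that meet the deletion condition.
cond <- substr(data$out, 1, 1) %in% c("2", "3") &
  data$flag1 == "0" & data$flag2 == "0"

# Step 2: collect ids with at least one marked row, then drop all their rows.
bad_ids <- unique(data$id[cond])
result  <- data[!data$id %in% bad_ids, ]
```

On the sample data, `bad_ids` contains only "58544005", and `result` keeps the remaining seven rows in their original order.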
