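In the given dataset, case_control indicates whether a row is a case or a control, id is an identifier that is unique for cases but can be repeated among controls, and group indicates the cluster. Within each group I need to select one control for each case, but a control that has already been selected for an earlier case (judged by the id variable) cannot be selected for a later case. If no control is available, the case must be dropped.
How can I do this quickly on a very large dataset of about 10 million rows (2 million cases and 8 million controls)?
The dataset looks like this (https://docs.google.com/spreadsheets/d/1MpjKv9Fm_Hagb11h_dqtDX4hV7G7sZrt/edit#gid=1801722229):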
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25
The expected output must look like this:
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_2 case 12
cluster_2 control 22
cluster_3 case 13
cluster_3 control 25
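A uj5u.com user replied:
Here is a data.table approach.
The code could be shortened (a lot), but I chose to keep each step separate (and commented) so that you can see which actions are taken and inspect the intermediate results.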
library(data.table)
# initialise vector for used ids
id.used <- numeric(0)
# split by group and loop
L <- lapply(split(DT, by = "group"), function(x) {
  # select first row (assumes the case is the first row of each group)
  caserow <- x[1]
  # select second to last row (empty if the group has no controls)
  controlrow <- x[-1]
  # keep only controls whose id is not already in use
  controlrow.new <- controlrow[!id %in% id.used]
  # no control available: drop this case
  if (nrow(controlrow.new) == 0L) return(NULL)
  # sample one random row from the ids not already used
  controlrow.sample <- controlrow.new[sample(.N, 1L)]
  # fill id.used (be careful with <<-: it modifies id.used outside the function)
  id.used <<- c(id.used, controlrow.sample$id)
  # rowbind the sampled row to the caserow
  rbind(caserow, controlrow.sample)
})
# rowbind the list back together and cast to wide
# (controls are sampled at random, so the result can differ between runs)
dcast(rbindlist(L), group ~ case_control, value.var = "id")
# group case control
# 1: cluster_1 11 21
# 2: cluster_2 12 24
# 3: cluster_3 13 25
Sample data used:
DT <- fread("group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25")
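For the 10-million-row case, growing id.used with c() and rescanning it with %in% for every group may become the bottleneck. The following is a minimal base R sketch of a greedy variant, not the answer's method: it takes the first free control instead of a random one and tracks used ids in an environment used as a hash set. The function name match_controls and the data frame df are hypothetical, it assumes one case per group, and it has not been benchmarked on data of that size.

```r
# Sample data as a plain data.frame (same rows as DT above)
df <- read.table(header = TRUE, text = "
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25")

# Hypothetical helper: greedy 1:1 matching, assuming one case per group
match_controls <- function(df) {
  used <- new.env(hash = TRUE)  # hash set of control ids already taken
  # keep groups in their order of first appearance
  grp <- factor(df$group, levels = unique(df$group))
  groups <- split(seq_len(nrow(df)), grp)
  keep <- vector("list", length(groups))
  for (g in seq_along(groups)) {
    idx <- groups[[g]]
    case_idx <- idx[df$case_control[idx] == "case"][1]
    ctrl_idx <- idx[df$case_control[idx] == "control"]
    if (is.na(case_idx)) next  # group without a case
    for (j in ctrl_idx) {
      key <- as.character(df$id[j])
      if (!exists(key, envir = used, inherits = FALSE)) {
        assign(key, TRUE, envir = used)  # mark this control as taken
        keep[[g]] <- c(case_idx, j)
        break
      }
    }
    # if no free control was found, keep[[g]] stays NULL and the case is dropped
  }
  df[unlist(keep), , drop = FALSE]
}

res <- match_controls(df)
print(res)
```

On the sample data this picks controls 21, 22 and 25, which matches the expected output in the question. Being greedy, it does not try to maximise the number of matched cases; a different group order can lead to a different set of dropped cases.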
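Another uj5u.com user replied:
Base R:
# df is the sample data as a data.frame; \(x, y) and split(df, ~group) need R >= 4.1
# df[-(3:4), ] drops the two extra controls of cluster_1 so that the first group is already one case plus one control (specific to this sample)
Reduce(\(x, y) rbind(x, y[which(!y$id %in% x$id)[1:2], ]), split(df[-(3:4), ], ~group))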
group case_control id
1 cluster_1 case 11
2 cluster_1 control 21
5 cluster_2 case 12
7 cluster_2 control 22
9 cluster_3 case 13
12 cluster_3 control 25
Note that we only need the first case and the first control that has not been used yet in each cluster, hence the slice 1:2.
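To make the 1:2 slice concrete, here is a small sketch of that step in isolation, followed by a variant that does not rely on removing rows by hand. The helper name pick_pair is hypothetical, and the sketch assumes that the case is the first row of each group and that case ids never collide with control ids. It also adds a guard that the one-liner lacks: when fewer than two rows are free, indexing with [1:2] would produce an NA row, so the helper returns no rows and the case is dropped.

```r
df <- read.table(header = TRUE, text = "
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25")

# Hypothetical helper: rows of group y to append, given the rows x kept so far
pick_pair <- function(x, y) {
  free <- which(!y$id %in% x$id)         # rows of y whose id is not used yet
  if (length(free) < 2L) return(y[0, ])  # no free control: drop the case
  y[free[1:2], ]                         # the case plus the first free control
}

# One step in isolation: cluster_2 after cluster_1 has taken control 21
x <- df[1:2, ]                  # case 11, control 21
y <- df[df$group == "cluster_2", ]
step <- pick_pair(x, y)         # case 12 and control 22

# All groups, starting from an empty accumulator instead of df[-(3:4), ]
res <- Reduce(function(acc, y) rbind(acc, pick_pair(acc, y)),
              split(df, df$group), df[0, ])
print(res)
```

Starting Reduce from the empty data frame df[0, ] means the first group goes through the same selection as the others, so no rows have to be removed manually. Note that rbind inside Reduce copies the accumulated result at every step, which is likely to be slow with millions of groups.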
tidyverse:
library(dplyr)
library(purrr)
df %>%
  slice(-(3:4)) %>%   # sample-specific: drops the two extra controls of cluster_1
  group_split(group) %>%
  reduce(~rbind(.x, slice(anti_join(.y, .x, by = c("case_control", "id")), 1:2)))
# A tibble: 6 x 3
group case_control id
<chr> <chr> <int>
1 cluster_1 case 11
2 cluster_1 control 21
3 cluster_2 case 12
4 cluster_2 control 22
5 cluster_3 case 13
6 cluster_3 control 25
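Whichever approach is used, it can help to check the result against the two rules in the question. The sketch below is one possible check, with check_matching as a hypothetical helper name; it verifies that no control id is reused and that every group in the result has exactly one case and one control. It does not verify that each control actually came from its own group in the original data.

```r
# Hypothetical helper: sanity checks on a long-format matching result
check_matching <- function(res) {
  is_case <- res$case_control == "case"
  n_case <- tapply(is_case, res$group, sum)   # cases per group
  n_ctrl <- tapply(!is_case, res$group, sum)  # controls per group
  c(
    no_control_reused  = anyDuplicated(res$id[!is_case]) == 0,
    one_pair_per_group = all(n_case == 1 & n_ctrl == 1)
  )
}

# The expected output from the question
res <- data.frame(
  group = rep(c("cluster_1", "cluster_2", "cluster_3"), each = 2),
  case_control = rep(c("case", "control"), times = 3),
  id = c(11L, 21L, 12L, 22L, 13L, 25L)
)
print(check_matching(res))

# A deliberately broken result: control 21 is used twice
bad <- res
bad$id[4] <- 21L
print(check_matching(bad))
```

For the expected output both checks are TRUE; for the broken result the first check is FALSE.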
