I have a table like this:
test <- data.frame(chr=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), start=c(1,1,1,2,2,10), end=c(5,5,5,7,7,20), gene=c("g1", "g1", "g1", "g2", "g2", "g3"), chrQ=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), startq=c(1,1,1,2,3, 10), endq=c(5,5,6,7,7, 20), geneq=c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q"))
> test
chr start end gene chrQ startq endq geneq
1 chr1 1 5 g1 chr1 1 5 g1q
2 chr1 1 5 g1 chr1 1 5 g2q
3 chr1 1 5 g1 chr1 1 6 g3q
4 chr2 2 7 g2 chr2 2 7 g4q
5 chr2 2 7 g2 chr2 3 7 g5q
6 chr2 10 20 g3 chr2 10 20 g6q
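I want to remove the duplicate rows based on the gene column, and collapse the values of the following columns in this example: chrQ, startq, endq, geneq.
I want to convert the table into this: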
chr start end gene matched matched_total
1 chr1 1 5 g1 chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q 3
2 chr2 2 7 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
3 chr2 10 20 g3 chr2 10 20 g6q 1
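I want to add a column named matched that holds the columns listed above joined into a single row, separated by ; (or any other character), plus a matched_total column with the number of duplicated rows that were collapsed.
I know I can remove the duplicate rows like this: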
test %>% distinct(gene, .keep_all = TRUE)
And I can add the count with something like this:
test_s <- test %>% group_by(gene) %>% summarize(Total=n())
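using the dplyr package, but I don't know how to collapse the other columns. Can you tell me how I can do this?
Answer from a uj5u.com user:
You can't use distinct, because you would then lose the data needed to build the matched column. Use summarise instead to collapse the data from all rows belonging to one gene: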
library(tidyverse)
test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)
test %>%
  group_by(chr, start, end, gene) %>%
  unite("matched", chrQ, startq, endq, geneq, sep = " ") %>%
  summarise(
    matched = matched %>% paste0(collapse = "; "),
    matched_total = n()
  )
#> `summarise()` has grouped output by 'chr', 'start', 'end'. You can override
#> using the `.groups` argument.
#> # A tibble: 3 × 6
#> # Groups: chr, start, end [3]
#> chr start end gene matched matched_total
#> <chr> <dbl> <dbl> <chr> <chr> <int>
#> 1 chr1 1 5 g1 chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q 3
#> 2 chr2 2 7 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
#> 3 chr2 10 20 g3 chr2 10 20 g6q 1
Created on 2022-04-01 by the reprex package (v2.0.0)
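The output above is still grouped by chr, start and end, which is what the `summarise()` message is pointing out. As a minimal sketch of the same approach that returns an ungrouped tibble, one could pass `.groups = "drop"` to `summarise()`. The variable name `result` is only an illustrative choice, and moving `unite()` ahead of `group_by()` is a stylistic assumption rather than something the answer requires.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

# `result` is a hypothetical name used only for this sketch
result <- test %>%
  # Join the four query columns into one space-separated string per row
  unite("matched", chrQ, startq, endq, geneq, sep = " ") %>%
  # One group per distinct gene record
  group_by(chr, start, end, gene) %>%
  summarise(
    matched = paste(matched, collapse = "; "),  # collapse the rows of a group
    matched_total = n(),                        # number of collapsed rows
    .groups = "drop"                            # return an ungrouped tibble
  )

print(result)
```

Dropping the groups mainly matters if further `mutate()` or `summarise()` calls follow, since those would otherwise run per group.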
Answer from a uj5u.com user:
Another possible solution:
library(tidyverse)
test %>%
  mutate(across(where(is.numeric), as.character)) %>%
  rowwise() %>%
  mutate(matched = str_c(c_across(chrQ:geneq), collapse = " ")) %>%
  group_by(gene) %>%
  summarise(matched = str_c(matched, collapse = "; "), matched_total = n())
#> # A tibble: 3 × 3
#> gene matched matched_total
#> <chr> <chr> <int>
#> 1 g1 chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q 3
#> 2 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
#> 3 g3 chr2 10 20 g6q 1
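Note that grouping by gene alone drops the chr, start and end columns that the question's desired output includes. A possible variation, sketched here under the assumption that those three columns are constant within each gene, groups by all four key columns and builds the per-row string with the vectorised `paste()` instead of `rowwise()`. The name `result2` is hypothetical.

```r
library(dplyr)

# Same example data as in the question
test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

# `result2` is a hypothetical name used only for this sketch
result2 <- test %>%
  # paste() works element-wise across columns, so rowwise() is not needed;
  # numeric columns are converted to character automatically
  mutate(matched = paste(chrQ, startq, endq, geneq)) %>%
  # Keep the coordinate columns by including them in the grouping
  group_by(chr, start, end, gene) %>%
  summarise(
    matched = paste(matched, collapse = "; "),
    matched_total = n(),
    .groups = "drop"
  )

print(result2)
```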
