我使用單序列讀取分類,并希望根據分類質量進行過濾。但是,需要更改輸出格式才能執行此操作。對于每次讀取,我都有一個如下所示的分類統計資訊(分數),它代表 ["taxonomy":"kmers assigned to that taxonomy" "taxonomy":"kmers assignment to that taxonomy" 等],并且每個分類可以出現多個次。
classification_stats<-c("3:1 7:4 0:34 3:7 0:27",
"0:110 561:19 0:37",
"0:3 562:5 0:7 543:55 0:47")
read_ID<-c("read1", "read2", "read3")
df<-data.frame(read_ID, classification_stats)
> df
read_ID classification_stats
1 read1 3:1 7:4 0:34 3:7 0:27
2 read2 0:110 561:19 0:37
3 read3 0:3 562:5 0:7 543:55 0:47
對于每次讀取(每一行),我想計算分配給分類的 kmer 總數(在分類統計中),但由于每個分類不連續多次出現,這變得更加困難。這意味著例如 read1 分類 3 有 1 7 kmers,分類 7 有 4 kmers,分類 0 有 34 27 kmers。
我想要的輸出看起來像這樣,最好進行排序,以便 tax1 是 kmers 數量最多的分類法。
read_ID classification_stats tax1 kmer1 tax2 kmer2 tax3 kmer3
read1 3:1 7:4 0:34 3:7 0:27 0 61 3 8 7 4
read2 0:110 561:19 0:37 0 147 561 19 NA NA
read3 0:3 562:5 0:7 543:55 0:47 0 57 543 55 562 5
R 或 bash 解決方案都很有趣。
uj5u.com熱心網友回復:
這是 R 中的另一種解決方案。請注意,我已經進行了一些更改read 3以測驗我的代碼(543:55到543:80)
輸入
df
read_ID classification_stats
1 read1 3:1 7:4 0:34 3:7 0:27
2 read2 0:110 561:19 0:37
3 read3 0:3 562:5 0:7 543:80 0:47
代碼和輸出
- 將
classification_stats列分成單獨的行,并以空格作為分隔符。 - 將生成的列分成兩列,用冒號“:”分隔。
- 然后創建一
unique_tax列,其中包含tax每個read_ID. 這將用作列名的一部分(告訴我們應該生成pivot_wider多少對kmer和)。tax group_by(read_ID, tax)讓一切都在那個層面上作業summarise匯總相同和kmer下所有值的列read_IDtax- 把資料先排列
read_ID然后 bykmer,這樣我們就可以row_number()用來生成正確的索引對了 - 這個版本的代碼一個
pivot_wider就夠了 left_join合并classification_stats列- 將列重新排序到您想要的位置
library(tidyverse)
left_join(
df %>%
separate_rows(!read_ID, sep = " ") %>%
separate(classification_stats, into = c("tax", "kmer")) %>%
group_by(read_ID, tax) %>%
summarize(kmer = sum(as.numeric(kmer))) %>%
arrange(read_ID, desc(kmer)) %>%
mutate(unique_tax = row_number()) %>%
pivot_wider(everything(), names_from = "unique_tax", values_from = c(tax, kmer)),
df,
by = "read_ID"
) %>%
select(read_ID, classification_stats, tax_1, kmer_1, tax_2, kmer_2, tax_3, kmer_3)
# A tibble: 3 × 8
# Groups: read_ID [3]
read_ID classification_stats tax_1 kmer_1 tax_2 kmer_2 tax_3 kmer_3
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 read1 3:1 7:4 0:34 3:7 0:27 0 61 3 8 7 4
2 read2 0:110 561:19 0:37 0 147 561 19 NA NA
3 read3 0:3 562:5 0:7 543:80 … 543 80 0 57 562 5
uj5u.com熱心網友回復:
使用 data.table 的解決方案
library(data.table)
setDT(df) # make the tibble a data.table
dt <- df[, .(tax_kmer = unlist(str_split(classification_stats, " "))), by = read_ID]
dt[, c("tax", "kmer") := tstrsplit(tax_kmer, ":")]
dt <- dt[, .(kmer = sum(as.numeric(kmer))), by = .(read_ID, tax)]
dt[, order := frankv(kmer, ties.method = "first", order = c(-1L)), by = read_ID]
dt <- dcast(dt, read_ID ~ order, value.var = c("kmer", "tax"), sep = "")
setcolorder(dt, c("read_ID", paste0(c("tax", "kmer"), rep(1:((ncol(dt)-1)/2), each = 2))))
結果
dt
# read_ID tax1 kmer1 tax2 kmer2 tax3 kmer3
# 1: read1 0 61 3 8 7 4
# 2: read2 0 147 561 19 <NA> NA
# 3: read3 0 57 543 55 562 5
資料
classification_stats <- c(
"3:1 7:4 0:34 3:7 0:27",
"0:110 561:19 0:37",
"0:3 562:5 0:7 543:55 0:47"
)
read_ID <- c("read1", "read2", "read3")
df <- tibble(read_ID, classification_stats)
分解代碼
dt <- df[, .(tax_kmer = unlist(str_split(classification_stats, " "))), by = read_ID]
在這里,我們為您的輸出創建一個新表,并確保每一行都包含您的每個 ID 組的“tax:kmer”對。
dt[, c("tax", "kmer") := tstrsplit(tax_kmer, ":")]
在這里,我們將您的 tax_kmer 對分成兩個不同的列,以保存稅值和 kmer 值。
dt <- dt[, .(kmer = sum(as.numeric(kmer))), by = .(read_ID, tax)]
匯總每個稅號組合的 kmer 值。
dt[, order := frankv(kmer, ties.method = "first", order = c(-1L)), by = read_ID]
確定定義什么稅變成 tax1、tax2 等的順序的最安全方法是使用我存盤在“訂單”列中的 kmer 總和(-1 排名從高到低)的排名號
dt <- dcast(dt, read_ID ~ order, value.var = c("kmer", "tax"), sep = "")
dcast 是“pivot_wider”的 data.table 變體,注意我們使用 order 進行 dcast 并使用來自 kmer 和 tax 的值。
setcolorder(dt, c("read_ID", paste0(c("tax", "kmer"), rep(1:((ncol(dt)-1)/2), each = 2))))
由于我們的 dcast 程式使用 id 對您的表列進行排序,然后是所有 tax 列,然后是所有 kmer 列,我們根據您的喜好動態設定順序。我們查看列的總數(我們事先不知道我們找到了多少稅?)并為 ID 減去一個,根據定義,我們將其除以 2,因為它是交替的稅/kmer 對。這將創建這個向量[1] "read_ID" "tax1" "kmer1" "tax2" "kmer2" "tax3" "kmer3",這正是我們要設定的列的順序。
uj5u.com熱心網友回復:
library(tidyverse)
classification_stats <- c(
"3:1 7:4 0:34 3:7 0:27",
"0:110 561:19 0:37",
"0:3 562:5 0:7 543:55 0:47"
)
read_ID <- c("read1", "read2", "read3")
df <- tibble(read_ID, classification_stats)
df %>%
separate_rows(classification_stats, sep = " ") %>%
separate(classification_stats, into = c("tax", "kmer")) %>%
type_convert() %>%
arrange(-kmer) %>%
nest(tax) %>%
mutate(id = row_number()) %>%
unnest(data) %>%
pivot_wider(names_from = id, values_from = c(kmer, tax))
#> Warning: All elements of `...` must be named.
#> Did you want `data = tax`?
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> read_ID = col_character(),
#> tax = col_double(),
#> kmer = col_double()
#> )
#> # A tibble: 3 × 27
#> read_ID kmer_1 kmer_2 kmer_3 kmer_4 kmer_5 kmer_6 kmer_7 kmer_8 kmer_9 kmer_10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 read2 110 NA NA 37 NA NA 19 NA NA NA
#> 2 read3 NA 55 47 NA NA NA NA NA 7 5
#> 3 read1 NA NA NA NA 34 27 NA 7 NA NA
#> # … with 16 more variables: kmer_11 <dbl>, kmer_12 <dbl>, kmer_13 <dbl>,
#> # tax_1 <dbl>, tax_2 <dbl>, tax_3 <dbl>, tax_4 <dbl>, tax_5 <dbl>,
#> # tax_6 <dbl>, tax_7 <dbl>, tax_8 <dbl>, tax_9 <dbl>, tax_10 <dbl>,
#> # tax_11 <dbl>, tax_12 <dbl>, tax_13 <dbl>
由reprex 包于 2022-03-10 創建 (v2.0.0 )
這里計算了每次讀取時出現 kmer 或分類單元的頻率:
df %>%
separate_rows(classification_stats, sep = " ") %>%
separate(classification_stats, into = c("tax", "kmer")) %>%
type_convert() %>%
arrange(-kmer) %>%
nest(-tax) %>%
mutate(id = row_number()) %>%
unnest(data) %>%
pivot_wider(names_from = id, values_from = c(kmer, tax), values_fn = length)
#> Warning: All elements of `...` must be named.
#> Did you want `data = -tax`?
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> read_ID = col_character(),
#> tax = col_double(),
#> kmer = col_double()
#> )
#> # A tibble: 3 × 13
#> read_ID kmer_1 kmer_2 kmer_3 kmer_4 kmer_5 kmer_6 tax_1 tax_2 tax_3 tax_4
#> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 read2 2 NA 1 NA NA NA 2 NA 1 NA
#> 2 read3 3 1 NA NA 1 NA 3 1 NA NA
#> 3 read1 2 NA NA 2 NA 1 2 NA NA 2
#> # … with 2 more variables: tax_5 <int>, tax_6 <int>
轉載請註明出處,本文鏈接:https://www.uj5u.com/yidong/441516.html
