split function returns no observations with a large data set

I have a data frame like this:

seqnames       pos     strand    nucleotide     count
    id1         12                   A            13
    id1         13                   C            25
    id2         24                   G            10
    id2         25                   T            25
    id2         26                   A            10
    id3         10                   C            5

But it has more than 100,000 rows in total, and seqnames has 3138 levels. I want to split it into a list of data frames by seqnames, so I used the split function:

data_list <- split(data,data$seqnames)

But it only returns the following:

List of 3138
 $ id1:'data.frame':    0 obs. of  6 variables:
  ..$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 
  ..$ pos       : int(0) 
  ..$ strand    : Factor w/ 3 levels " ","-","*": 
  ..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 
  ..$ count     : int(0) 
  ..$ sample_id : chr(0) 
 $ id2:'data.frame':    0 obs. of  6 variables:
  ..$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 
  ..$ pos       : int(0) 
  ..$ strand    : Factor w/ 3 levels " ","-","*": 
  ..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 
  ..$ count     : int(0) 
  ..$ sample_id : chr(0)

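I don't know why this happens, because I have already used split on a made-up data frame containing only numbers (with far fewer rows, of course) and it worked. How can I fix this?

Answer: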

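There are simply many unused levels, because the column 'seqnames' is a factor. split has a drop argument (FALSE by default); setting drop = TRUE removes the list elements that belong to unused levels. Otherwise, those levels are returned as data.frames with 0 rows. If we would rather replace those elements with NULL, find the elements whose number of rows (nrow) is 0 and assign NULL to them. With the default drop = FALSE: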

data_list <- split(data,data$seqnames)
> str(data_list)
List of 5
 $ id1:'data.frame':    2 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 1 1
  ..$ pos       : int [1:2] 12 13
  ..$ strand    : chr [1:2] " " " "
  ..$ nucleotide: chr [1:2] "A" "C"
  ..$ count     : int [1:2] 13 25
 $ id2:'data.frame':    3 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
  ..$ pos       : int [1:3] 24 25 26
  ..$ strand    : chr [1:3] " " " " " "
  ..$ nucleotide: chr [1:3] "G" "T" "A"
  ..$ count     : int [1:3] 10 25 10
 $ id3:'data.frame':    1 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 3
  ..$ pos       : int 10
  ..$ strand    : chr " "
  ..$ nucleotide: chr "C"
  ..$ count     : int 5
 $ id4:'data.frame':    0 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 
  ..$ pos       : int(0) 
  ..$ strand    : chr(0) 
  ..$ nucleotide: chr(0) 
  ..$ count     : int(0) 
 $ id5:'data.frame':    0 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 
  ..$ pos       : int(0) 
  ..$ strand    : chr(0) 
  ..$ nucleotide: chr(0) 
  ..$ count     : int(0)

Assign NULL to the empty elements:

data_list[sapply(data_list, nrow) == 0] <- list(NULL)

Check again:

> str(data_list)
List of 5
 $ id1:'data.frame':    2 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 1 1
  ..$ pos       : int [1:2] 12 13
  ..$ strand    : chr [1:2] " " " "
  ..$ nucleotide: chr [1:2] "A" "C"
  ..$ count     : int [1:2] 13 25
 $ id2:'data.frame':    3 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
  ..$ pos       : int [1:3] 24 25 26
  ..$ strand    : chr [1:3] " " " " " "
  ..$ nucleotide: chr [1:3] "G" "T" "A"
  ..$ count     : int [1:3] 10 25 10
 $ id3:'data.frame':    1 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 3
  ..$ pos       : int 10
  ..$ strand    : chr " "
  ..$ nucleotide: chr "C"
  ..$ count     : int 5
 $ id4: NULL
 $ id5: NULL
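
The drop = TRUE option mentioned in the answer is not shown in the output above, so here is a minimal sketch of it. It rebuilds a small stand-in data frame (three columns only, which is an assumption made for brevity) and also shows droplevels() as an alternative way to get the same list elements.

```r
# Small stand-in for the real data: a factor with two unused levels (id4, id5).
data <- data.frame(
  seqnames = factor(c("id1", "id1", "id2", "id2", "id2", "id3"),
                    levels = c("id1", "id2", "id3", "id4", "id5")),
  pos   = c(12L, 13L, 24L, 25L, 26L, 10L),
  count = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# Default (drop = FALSE): one list element per level, including empty ones.
all_levels <- split(data, data$seqnames)
length(all_levels)   # 5

# drop = TRUE: levels that never occur are left out of the result.
used_only <- split(data, data$seqnames, drop = TRUE)
names(used_only)     # "id1" "id2" "id3"

# Alternative: drop the unused levels from the grouping factor before splitting.
used_only2 <- split(data, droplevels(data$seqnames))
names(used_only2)    # "id1" "id2" "id3"
```

Note that drop = TRUE only omits the empty list elements. The seqnames column inside each piece still carries all of the original factor levels unless droplevels() is applied to the data itself.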

Data

data <- structure(list(seqnames = structure(c(1L, 1L, 2L, 2L, 2L, 
3L), .Label = c("id1", 
"id2", "id3", "id4", "id5"), class = "factor"), pos = c(12L, 
13L, 24L, 25L, 26L, 10L), strand = c(" ", " ", " ", " ", " ", 
" "), nucleotide = c("A", "C", "G", "T", "A", "C"), count = c(13L, 
25L, 10L, 25L, 10L, 5L)), row.names = c(NA, -6L), class = "data.frame")
