Extracting characters and digits from a string in R

This is part of my data frame.

> df
    Group Direction cytoband  q value residual q value      wide peak boundaries
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554

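I want to extract the characters or digits that follow "chr" in the "wide peak boundaries" column. I tried the code below, but the second row came back as NA.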

library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'), 
              '(\\d+):(\\d+)-(\\d+)', remove = F, convert = T)
df
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  NA        NA        NA
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554

Data

structure(list(Group = c("All", "All", "All", "All", "All"), 
    Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25", 
    "Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43", 
    "3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39", 
    "1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622", 
    "chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503", 
    "chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L), 
    start = c(130906630L, NA, 87745632L, 33050952L, 3230287L), 
    end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29", 
"V30", "V31", "V32", "V33"))

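Answer from a uj5u.com user:

The idea is to split on ":" and "-", but for the chr column you leave the leading "chr" string out of the capture. So you can use:

(Updated following @Chris Ruehlemann's comment)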

df %>%
  extract("wide peak boundaries",
          into = c("chr", "start", "end"),
          regex = "((?<=chr).*):(.*)-(.*)",
          remove = FALSE)

This gives:

    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
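
The same split-on-":"-and-"-" idea can also be sketched in base R, without tidyr. This is only an illustration under stated assumptions: the vector x is toy input copied from two rows of the question, and the names parts and res are made up for the example.

```r
# Toy input: two values copied from the question's "wide peak boundaries" column
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# Drop the leading "chr", then split what remains on ":" or "-"
parts <- strsplit(sub("^chr", "", x), "[:-]")

# Bind the pieces row-wise into a three-column data frame
res <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(res) <- c("chr", "start", "end")

# start and end are numeric; chr stays character because of values like "X"
res$start <- as.integer(res$start)
res$end   <- as.integer(res$end)
res
```

Keeping chr as character is deliberate here: converting it to integer would turn "X" into NA again.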

Answer from a uj5u.com user:

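library(data.table)
# mydata is the data frame from the question; split on ":" or "-" into three columns.
# Note that the chr column keeps its "chr" prefix with this approach.
setDT(mydata)[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]", perl = TRUE)]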

   Group Direction cytoband  q value residual q value      wide peak boundaries   chr     start       end
1:   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622 chr11 130906630 135086622
2:   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  chrX  23277186  26139553
3:   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602 chr10  87745632  87859602
4:   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503 chr22  33050952  34766503
5:   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554 chr11   3230287   3799554

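Answer from a uj5u.com user:

You only need to change \\d to \\w in the first capture group (\\d matches digits only, whereas \\w matches letters, digits, and the underscore):

Edit: (?<=chr) is a positive lookbehind; it ensures that \\w only starts matching immediately after the string "chr":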

df %>% 
  extract(col = 'wide peak boundaries', 
          into = c('chr', 'start', 'end'),
          regex = '((?<=chr)\\w+):(\\d+)-(\\d+)', 
          remove = FALSE, convert = TRUE)
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
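
To see the difference between \\d and \\w, and what the lookbehind contributes, here is a small base R sketch. It assumes toy input copied from two rows of the question; the variable names are hypothetical and used only for illustration.

```r
# Toy input: two values copied from the question's "wide peak boundaries" column
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# \\d+ requires digits directly before ":", so the "chrX" row does not match
has_digit_chr <- grepl("\\d+:\\d+-\\d+", x)

# \\w+ also accepts letters, so both rows match
has_word_chr <- grepl("\\w+:\\d+-\\d+", x)

# Without a lookbehind, \\w+ starts at the beginning and includes "chr"
with_prefix <- regmatches(x, regexpr("\\w+", x))

# With the lookbehind (base R needs perl = TRUE), matching starts right after "chr"
m <- regmatches(x, regexpr("(?<=chr)\\w+", x, perl = TRUE))

print(has_digit_chr)
print(has_word_chr)
print(with_prefix)
print(m)
```

In this sketch only the lookbehind version returns the bare chromosome label, which is why the answer pairs \\w with (?<=chr).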
