This is part of my data frame.
> df
Group Direction cytoband q value residual q value wide peak boundaries
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554
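I want to extract the characters or digits that follow "chr" in the 'wide peak boundaries' column. I tried the code below, but the second row comes back as NA.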
library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'),
'(\\d+):(\\d+)-(\\d+)', remove = F, convert = T)
df
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 NA NA NA
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
Data
structure(list(Group = c("All", "All", "All", "All", "All"),
Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25",
"Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43",
"3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39",
"1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622",
"chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503",
"chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L),
start = c(130906630L, NA, 87745632L, 33050952L, 3230287L),
end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29",
"V30", "V31", "V32", "V33"))
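Reply from a uj5u.com user:
The idea is to split on : and -, but for the chr column you do not want to capture the string "chr" itself. So you can use:
(Updated following @Chris Ruehlemann's comment)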
df %>%
extract("wide peak boundaries",
into = c("chr", "start", "end"),
regex = "((?<=chr).*):(.*)-(.*)",
remove = FALSE)
This gives:
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 X 23277186 26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
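The same idea (match "chr" but capture only what follows it, then the two coordinates) can also be sketched in base R without tidyr. The snippet below is only an illustration under assumptions: it uses utils::strcapture(), works on a small vector x copied from the question rather than on the full data frame, and the names x, proto and parts are made up for the example.

```r
# Sample values copied from the question's `wide peak boundaries` column.
x <- c("chr11:130906630-135086622",
       "chrX:23277186-26139553",
       "chr10:87745632-87859602")

# The prototype fixes column names and types: chr stays character
# (it can be "X"), while start and end are converted to integer.
proto <- data.frame(chr = character(), start = integer(), end = integer(),
                    stringsAsFactors = FALSE)

# "chr" sits outside the first capture group, so it is matched but not kept.
parts <- utils::strcapture("^chr(\\w+):(\\d+)-(\\d+)$", x, proto, perl = TRUE)
parts
```

Because the literal "chr" is placed before the first group rather than inside it, no lookbehind is needed here; the columns of parts could then be bound onto the original data frame with cbind().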
Reply from a uj5u.com user:
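library(data.table)
# mydata is the data frame from the question (called df there).
# Splitting on ":" or "-" keeps the "chr" prefix in the first piece.
setDT(mydata)[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]", perl = TRUE)]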
Group Direction cytoband q value residual q value wide peak boundaries chr start end
1: All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 chr11 130906630 135086622
2: All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 chrX 23277186 26139553
3: All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 chr10 87745632 87859602
4: All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 chr22 33050952 34766503
5: All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 chr11 3230287 3799554
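In the output above the chr column still carries the "chr" prefix (chr11, chrX, ...). If the bare chromosome label is wanted, one possible follow-up is to strip the prefix after splitting. The sketch below shows that step in base R on a small vector copied from the question; x, parts and res are illustrative names, not part of the answer's code.

```r
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# Split each string on ":" or "-" and stack the pieces into a 3-column matrix.
parts <- do.call(rbind, strsplit(x, "[:-]"))

res <- data.frame(
  chr   = sub("^chr", "", parts[, 1]),  # drop the leading "chr"
  start = as.integer(parts[, 2]),
  end   = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
res
```

The same sub("^chr", "", ...) call could be applied to the chr column created by the data.table code.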
uj5u.com熱心網友回復:
您只需要\\d在第一個捕獲組中更改為\\w(\\d僅匹配數字,而\\w匹配字母字符和數字以及下劃線):
編輯:
(?<=chr)是正向后視,它確保\\w僅在出現字串后才開始匹配chr:
df %>%
extract(col = 'wide peak boundaries',
into = c('chr', 'start', 'end'),
regex = '((?<=chr)\\w+):(\\d+)-(\\d+)',
remove = FALSE, convert = TRUE)
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 X 23277186 26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
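To see why each change matters, the three patterns can be compared side by side in base R. This is a small sketch, not the answer's own code: first_group() is a hypothetical helper written for this example, and x holds two values copied from the question.

```r
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# Hypothetical helper: first capture group of `pattern` for each string,
# or NA when the pattern does not match at all.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

# \\d+ needs digits right before ":", so "chrX" cannot match.
digits_only <- first_group("(\\d+):(\\d+)-(\\d+)", x)

# \\w+ matches "X", but it also swallows the "chr" prefix.
word_chars <- first_group("(\\w+):(\\d+)-(\\d+)", x)

# The lookbehind forces the group to start right after "chr".
lookbehind <- first_group("((?<=chr)\\w+):(\\d+)-(\\d+)", x)
```

Here digits_only reproduces the NA from the question, word_chars keeps the unwanted prefix, and lookbehind returns only the part after "chr".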
Tags: r
