I have a file, for example:
>Nscaffold_033778.1_22 [24674 - 24880] some information
ACCATTAAGAGAGAAAAGAGAGGAGAGAGAGAGAGGAGAGAGAGAGAGAGGagAGGAAGA
AGGAGAGAGA
>NC_0337652.1_23 [26291 - 26443] some other informations boring
MMDOODODODODNJBCIOICICVOCICVCPCCM
>contig_033652.1_24 [25507 - 26529] species with informations
AJGSIVPDYVPDYVDPYVDPYDVDYVPVYVIYVDPIDVYPDIVYDPIYVDPIDVYPDIVP
PUDVPIYDVPDIVPDIDVPDVPDIVDPVDIVPDIVPDIVDPIDVIDVPDDIVPDDVPDVD
DDGGDDGDDIDIDDFDUDUDTTUDDUCDUDCDCC
I only want to extract the following information, in list format:
List:
[[1]]
[1] "Nscaffold_033778.1_22" "24674" "24880"
[[2]]
[1] "NC_0337652.1_23" "26291" "26443"
[[3]]
[1] "contig_033652.1_24" "25507" "26529"
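Does anyone have an idea?
- The first element of each list entry is the part right after the ">" symbol,
- the second element is the first number inside the [],
- the third element is the second number inside the [].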
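A uj5u.com user replied:
We can read the file with readLines, keep the header lines with grep, extract the relevant fields with sub, and split the result with strsplit:
strsplit(sub("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\].*", "\\1,\\2,\\3",
    grep(">", lines, value = TRUE)), ",")
Output: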
[[1]]
[1] "Nscaffold_033778.1_22" "24674" "24880"
[[2]]
[1] "NC_0337652.1_23" "26291" "26443"
[[3]]
[1] "contig_033652.1_24" "25507" "26529"
Data:
lines <- readLines('file.txt')
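The one-liner above packs three steps into a single expression. As a rough sketch, here it is unrolled with the sample header lines typed in directly instead of read from file.txt. The variable names headers, csv and result are only illustrative, and the grep pattern is anchored as "^>" here, which is my own small tightening of the original ">".

```r
# Sample lines copied from the question (only part of the sequence data kept).
lines <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  "ACCATTAAGAGAGAAAAGAGAGGAGAGAGAGAGAGGAGAGAGAGAGAGAGGagAGGAAGA",
  ">NC_0337652.1_23 [26291 - 26443] some other informations boring",
  "MMDOODODODODNJBCIOICICVOCICVCPCCM",
  ">contig_033652.1_24 [25507 - 26529] species with informations"
)

# Step 1: keep only the header lines, i.e. those that start with ">".
headers <- grep("^>", lines, value = TRUE)

# Step 2: rewrite each header as "id,start,end" using three capture groups:
#   (\\S+) the id, (\\d+) the first number, (\\d+) the second number.
csv <- sub("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\].*", "\\1,\\2,\\3", headers)

# Step 3: split on the comma to get one character vector per header.
result <- strsplit(csv, ",")
```

Note that the comma is a safe separator here only because the ids and numbers in the sample contain no commas themselves.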
A uj5u.com user replied:
If vec holds the file contents,
vec <- readLines(...)
then
strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
           vec[grepl("^>", vec)],
           list(x="",y="",z=""))
#                        x     y     z
# 1 Nscaffold_033778.1_22  24674 24880
# 2       NC_0337652.1_23  26291 26443
# 3    contig_033652.1_24  25507 26529
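I admit this is not strictly the format requested. I offer it as an alternative because it gives ready access to all of the fields. Also, if you intend to convert columns y and z to integer, you can get that built in by replacing the third argument (proto=) with list(x="", y=1L, z=1L), as in
str(
  strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
             vec[grepl("^>", vec)],
             list(x="",y="",z=""))
)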
# 'data.frame': 3 obs. of 3 variables:
# $ x: chr "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
# $ y: chr "24674" "26291" "25507"
# $ z: chr "24880" "26443" "26529"
str(
  strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
             vec[grepl("^>", vec)],
             list(x="",y=1L,z=1L))
)
# 'data.frame': 3 obs. of 3 variables:
# $ x: chr "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
# $ y: int 24674 26291 25507
# $ z: int 24880 26443 26529
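Two things follow from the str() output above: the greedy (.*) leaves a trailing space on x, and the result is a data frame, not the list the question asked for. Here is a hedged sketch that addresses both. Swapping (.*) for (\\S+) is my own change, not part of the answer above, and it assumes the ids never contain spaces. The names df and as_list are only illustrative.

```r
# Header lines copied from the question; sequence lines are left out.
vec <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  ">NC_0337652.1_23 [26291 - 26443] some other informations boring",
  ">contig_033652.1_24 [25507 - 26529] species with informations"
)

# (\\S+) stops at the first whitespace, so x carries no trailing space.
df <- strcapture("^>(\\S+)\\s*\\[(\\d+)\\D*(\\d+).*",
                 vec[grepl("^>", vec)],
                 list(x = "", y = "", z = ""))

# Combine the three columns row by row into a list of character vectors.
# unname() drops the element names that Map() derives from df$x.
as_list <- unname(Map(c, df$x, df$y, df$z))
```

Keeping the data frame around is still handy if you want the numbers as integers, since the proto= trick shown above applies unchanged.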
A uj5u.com user replied:
You can use {unglue}:
data <- c(">Nscaffold_033778.1_22 [24674 - 24880] some information",
">NC_0337652.1_23 [26291 - 26443] some other information",
">contig_033652.1_24 [25507 - 26529] species with informations")
unglue::unglue_data(data, ">{id} [{n1} - {n2}]{}")
#>                      id    n1    n2
#> 1 Nscaffold_033778.1_22 24674 24880
#> 2       NC_0337652.1_23 26291 26443
#> 3    contig_033652.1_24 25507 26529
Created on 2021-12-17 by the reprex package (v2.0.1)
