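I have a very messy file with multiple delimiters, and the order and number of fields differ from line to line.
Columns V1 to V5 are fine, but from V9 I want to extract the "Variant_seq" and "Reference_seq" values and the rsxxxx number from "Dbxref".
Another complication is that the "Variant_seq" and "Reference_seq" fields can be a single character ("A", "T", "C" or "G") or several comma-separated strings (for example "TTTT,TTC,GGGGGC"). These fields can appear anywhere in V9, at the end or in the middle.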
V1 V2 V3 V4 V5 V9
9 dbSNP SNV 10007 10007 ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T
9 dbSNP SNV 10009 10009 ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A
9 dbSNP SNV 14824990 14824990 ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211
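I first thought of an awk -F '{print }' with multiple delimiters, but quickly realized that is not a workable solution because the fields are not consistent across lines. dplyr::separate does not really fit here either.
I tried to extract each field separately into a new column, but this command does not handle the case where the field is at the end of the line:
gsub("Reference_seq[=]([^.]+)[;].*", "\\1", df$V9)
I could not find a way to grab only the field in capture group 1 and have the match stop correctly when no ";" follows. Thank you for your help.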
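Answer from a uj5u.com user:
Since this is a somewhat complicated extraction, I find the str_extract() function from the stringr package easier to use here.
In this case I use 3 separate calls to extract the text of interest, with the lookbehind operator (?<=) to leave out the leading text.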
df<- read.table(header=TRUE, text="V1 V2 V3 V4 V5 V9
9 dbSNP SNV 10007 10007 ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T
9 dbSNP SNV 10009 10009 ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A
9 dbSNP SNV 14824990 14824990 ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211"
)
library(stringr)
# (?<=key=) is a lookbehind: match starts right after "key=" and runs lazily to the next ";"
Variant_seq <- str_extract(df[,"V9"], "(?<=Variant_seq=).+?;")
Reference_seq <- str_extract(df[,"V9"], "(?<=Reference_seq=).+?;")
Db_ref <- str_extract(df[,"V9"], "(?<=Dbxref=).+?;")
data.frame(Variant_seq, Reference_seq, Db_ref)
#   Variant_seq Reference_seq                  Db_ref
# 1          C;          <NA> dbSNP_154:rs1449034754;
# 2        C,G;          <NA> dbSNP_154:rs1587255763;
# 3 GGGC,CCCCG;            C;  dbSNP_154:rs140144319;
This final data frame can now be attached back to your original data with cbind.
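Note that in the output above every value keeps its trailing ";" and Reference_seq comes back as NA when it is the last field in V9, because the pattern insists on a ";" after the value. A minimal base R sketch of the same lookbehind idea that stops at the next ";" or at the end of the string, assuming values never contain ";" themselves. The names `v9` and `extract_field` are hypothetical and used only for illustration:

```r
# Shortened sample of V9; field order varies and a field may end the string.
v9 <- c(
  "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;Reference_seq=T",
  "ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;Reference_seq=A",
  "ID=30545117;Reference_seq=C;Variant_seq=GGGC,CCCCG;Dbxref=dbSNP_154:rs140144319"
)

# Hypothetical helper: return the text after "<key>=" up to the next ";" or
# the end of the string, or NA when the key is absent.
extract_field <- function(x, key) {
  m <- regexpr(paste0("(?<=", key, "=)[^;]+"), x, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)   # regmatches() returns matched rows only
  out
}

res <- data.frame(
  Variant_seq   = extract_field(v9, "Variant_seq"),
  Reference_seq = extract_field(v9, "Reference_seq"),
  stringsAsFactors = FALSE
)
print(res)
```

Using `[^;]+` instead of `.+?;` is the design choice here: the match ends wherever the value ends, so no separate case is needed for the last field.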
Answer from a uj5u.com user:
You can do
stringr::str_match(df$V9, "Reference_seq=([^;]+);")[, 2L]
stringr::str_match(df$V9, "Variant_seq=([^;]+);")[, 2L]
stringr::str_match(df$V9, "Dbxref=([^;]+);")[, 2L]
Output
> stringr::str_match(df$V9, "Reference_seq=([^;]+);")[, 2L]
[1] NA  NA  "C"
> stringr::str_match(df$V9, "Variant_seq=([^;]+);")[, 2L]
[1] "C"          "C,G"        "GGGC,CCCCG"
> stringr::str_match(df$V9, "Dbxref=([^;]+);")[, 2L]
[1] "dbSNP_154:rs1449034754" "dbSNP_154:rs1587255763" "dbSNP_154:rs140144319"
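The question asks for the bare rsxxxx number, whereas the Dbxref values above still carry the dbSNP_154: prefix. A hedged base R sketch that narrows the capture group to the rs number, assuming every Dbxref value has the form source:rs followed by digits, as in the sample rows. The names `v9`, `hits` and `rs_id` are illustrative and not from the original answer:

```r
# Two shortened sample rows; Dbxref sits in the middle of one and at the end of the other.
v9 <- c(
  "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;Reference_seq=T",
  "ID=30545117;Reference_seq=C;Variant_seq=GGGC,CCCCG;Dbxref=dbSNP_154:rs140144319"
)

# regexec() records the full match plus each capture group;
# group 1 is the rs number that follows the ":" inside the Dbxref value.
hits <- regmatches(v9, regexec("Dbxref=[^;]*:(rs[0-9]+)", v9))

# Element 2 of each result is the first capture group; rows with no match give NA.
rs_id <- vapply(hits, function(h) h[2], character(1))
print(rs_id)
```

Because the pattern does not require a ";" after the value, it behaves the same whether Dbxref is in the middle of V9 or is the last field.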
Answer from a uj5u.com user:
A base R solution using strsplit on the strings, paired with gsub on the substrings
# keep V1..V5 only, then pull each wanted field out of the ";"-split pieces of V9
cbind(df1[, 1:5], sapply(c("Variant","Reference_seq","Dbxref"), function(str)
  sapply(strsplit(df1[,"V9"],";"), function(x)
    gsub(".*=|dbSNP_.*:","",x[grep(str,x)]))))
  V1    V2  V3       V4       V5    Variant Reference_seq       Dbxref
1  9 dbSNP SNV    10007    10007          C             T rs1449034754
2  9 dbSNP SNV    10009    10009        C,G             A rs1587255763
3  9 dbSNP SNV 14824990 14824990 GGGC,CCCCG             C  rs140144319
Data
df1 <- structure(list(V1 = c(9L, 9L, 9L), V2 = c("dbSNP", "dbSNP", "dbSNP"
), V3 = c("SNV", "SNV", "SNV"), V4 = c(10007L, 10009L, 14824990L
), V5 = c(10007L, 10009L, 14824990L), V9 = c("ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
"ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A",
"ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211"
), new = list(c("Variant_seq=C", "Dbxref=rs1449034754", "Reference_seq=T"
), c("Variant_seq=C,G", "Dbxref=rs1587255763", "Reference_seq=A"
), c("Reference_seq=C", "Variant_seq=GGGC,CCCCG", "Dbxref=rs140144319"
))), row.names = c(NA, -3L), class = "data.frame")
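One thing to keep in mind with the strsplit approach above is that grep(str, x) matches by substring, so a key such as "Variant" would also select any other piece whose name or value happens to contain that text. A minimal sketch that splits each piece into an exact key and value instead, assuming every piece has the form key=value. `parse_attrs` and `get_attr` are hypothetical helper names introduced for illustration:

```r
# Hypothetical helper: turn one "k1=v1;k2=v2" string into a named character vector.
parse_attrs <- function(s) {
  parts <- strsplit(s, ";", fixed = TRUE)[[1]]
  keys  <- sub("=.*$", "", parts)      # text before the first "="
  vals  <- sub("^[^=]*=", "", parts)   # text after the first "="
  setNames(vals, keys)
}

# Hypothetical helper: look up one exact key in every row; NA where it is absent.
get_attr <- function(x, key) {
  vapply(x, function(s) unname(parse_attrs(s)[key]), character(1),
         USE.NAMES = FALSE)
}

# Shortened sample rows taken from V9.
v9 <- c(
  "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;Reference_seq=T",
  "ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG"
)
print(get_attr(v9, "Variant_seq"))
print(get_attr(v9, "clinical_significance"))
```

Looking keys up by name also means a row that lacks a field yields NA rather than a shorter result, which keeps the columns the same length when binding them back to the data frame.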
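Answer from a uj5u.com user:
This extracts all the subfields of V9 into separate columns without using regular expressions or packages. It uses paste and chartr to convert V9 to dcf format and then reads it in with read.dcf. Finally, we append the created columns to DF.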
m <- DF$V9 |>
paste(collapse = "\n\n") |>
chartr(old = "=;", new = ":\n") |>
textConnection() |>
read.dcf()
DF2 <- cbind(DF, m)
> str(DF2)
'data.frame': 3 obs. of 14 variables:
$ V1 : int 9 9 9
$ V2 : chr "dbSNP" "dbSNP" "dbSNP"
$ V3 : chr "SNV" "SNV" "SNV"
$ V4 : int 10007 10009 14824990
$ V5 : int 10007 10009 14824990
$ V9 : chr "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T" "ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A" "ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP"| __truncated__
$ ID : chr "1" "2" "30545117"
$ Variant_seq : chr "C" "C,G" "GGGC,CCCCG"
$ Dbxref : chr "dbSNP_154:rs1449034754" "dbSNP_154:rs1587255763" "dbSNP_154:rs140144319"
$ evidence_values : chr "Frequency,TOPMed" "Frequency" "Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD"
$ Reference_seq : chr "T" "A" "C"
$ clinical_significance : chr NA NA "benign"
$ ancestral_allele : chr NA NA "C"
$ global_minor_allele_frequency: chr NA NA "1|0.004193|211"
Or written as a single expression:
cbind(
DF,
read.dcf(textConnection(chartr("=;", ":\n", paste(DF$V9, collapse = "\n\n"))))
)
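To see why this works, it can help to look at the intermediate text: after paste and chartr, each row of V9 becomes a block of key:value lines, blocks are separated by a blank line, and read.dcf fills in NA for keys that a block lacks. A small sketch on shortened rows, assuming values contain no "=" or ";" of their own. The names `v9`, `dcf_text` and `con` are illustrative:

```r
# Shortened sample rows taken from V9.
v9 <- c(
  "ID=1;Variant_seq=C;Reference_seq=T",
  "ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG"
)

# Join records with a blank line, then map "=" to ":" and ";" to a newline.
dcf_text <- chartr("=;", ":\n", paste(v9, collapse = "\n\n"))
cat(dcf_text, "\n")   # inspect the DCF-style text that read.dcf will parse

# read.dcf returns a character matrix: one row per record, one column per key.
con <- textConnection(dcf_text)
m <- read.dcf(con)
close(con)
print(m)
```

The result is a character matrix, so numeric subfields would still need an explicit conversion after cbind.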
Note
The input DF in reproducible form.
DF <- structure(list(V1 = c(9L, 9L, 9L), V2 = c("dbSNP", "dbSNP", "dbSNP"
), V3 = c("SNV", "SNV", "SNV"), V4 = c(10007L, 10009L, 14824990L
), V5 = c(10007L, 10009L, 14824990L), V9 = c("ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
"ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A",
"ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211"
)), class = "data.frame", row.names = c(NA, -3L))
When reposting, please credit the source. Link to this article: https://www.uj5u.com/qiye/404945.html
