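I want to extract data from an SDF file.
I want to save the > <Name> and > <SCORE.INTER> values in a .tsv file. Is there a quick way to do this, for example with awk? Thanks in advance.
The SDF file consists of thousands of blocks. One block of the file looks like this: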
ZINC000169748276
38 39 0 0 0 0 0 0 0 0999 V2000
11.2318 3.6419 22.3134 C 0 0 0 0 0 0
12.5621 3.7685 22.2617 C 0 0 0 0 0 0
13.0725 5.1806 22.3121 C 0 0 0 0 0 0
10.8850 6.0303 22.4462 C 0 0 0 0 0 0
13.4310 2.6268 22.1614 C 0 0 0 0 0 0
12.9848 1.3691 22.0592 C 0 0 0 0 0 0
8.2548 4.7608 21.1375 C 0 0 0 0 0 0
7.1479 3.7322 21.1132 C 0 0 0 0 0 0
7.7728 2.5366 21.8185 C 0 0 0 0 0 0
8.9539 4.4605 22.4534 C 0 0 0 0 0 0
13.8873 0.1824 21.9500 C 0 0 0 0 0 0
8.5117 1.6060 20.8656 C 0 0 0 0 0 0
12.2544 6.2009 22.3970 N 0 0 0 0 0 0
10.3635 4.7178 22.4055 N 0 0 0 0 0 0
14.4254 5.4429 22.2718 N 0 0 0 0 0 0
13.7646 -0.5167 20.6443 N 0 3 0 0 0 0
6.5529 -4.6019 19.9460 O 0 5 0 0 0 0
8.2203 -4.0310 21.8048 O 0 5 0 0 0 0
6.8149 1.6459 17.3793 O 0 5 0 0 0 0
5.4231 -2.1179 18.5726 O 0 5 0 0 0 0
10.1403 7.0090 22.5243 O 0 0 0 0 0 0
5.7155 -3.6365 22.1679 O 0 0 0 0 0 0
5.6431 1.8811 19.7228 O 0 0 0 0 0 0
5.0295 -0.6218 20.7059 O 0 0 0 0 0 0
8.7342 3.0736 22.7475 O 0 0 0 0 0 0
6.0324 4.2091 21.8626 O 0 0 0 0 0 0
8.1857 1.9631 19.5323 O 0 0 0 0 0 0
7.0232 -2.2197 20.5667 O 0 0 0 0 0 0
7.0081 -0.1966 19.1450 O 0 0 0 0 0 0
6.8632 -3.7464 21.1697 P 0 0 0 0 0 0
6.7991 1.4009 18.8725 P 0 0 0 0 0 0
5.9605 -1.3044 19.7288 P 0 0 0 0 0 0
15.0444 4.6730 22.2089 H 0 0 0 0 0 0
14.7148 6.3890 22.3078 H 0 0 0 0 0 0
14.3405 -1.3642 20.6292 H 0 0 0 0 0 0
14.0706 0.0896 19.8769 H 0 0 0 0 0 0
12.7928 -0.7891 20.4667 H 0 0 0 0 0 0
5.3352 3.5319 21.8055 H 0 0 0 0 0 0
1 2 2 0 0 0
1 14 1 0 0 0
2 3 1 0 0 0
2 5 1 0 0 0
3 13 2 0 0 0
3 15 1 0 0 0
4 13 1 0 0 0
4 14 1 0 0 0
4 21 2 0 0 0
5 6 2 0 0 0
6 11 1 0 0 0
7 8 1 0 0 0
7 10 1 0 0 0
8 9 1 0 0 0
8 26 1 0 0 0
9 12 1 0 0 0
9 25 1 0 0 0
10 14 1 0 0 0
10 25 1 0 0 0
11 16 1 0 0 0
12 27 1 0 0 0
17 30 1 0 0 0
18 30 1 0 0 0
19 31 1 0 0 0
20 32 1 0 0 0
22 30 2 0 0 0
23 31 2 0 0 0
24 32 2 0 0 0
27 31 1 0 0 0
28 30 1 0 0 0
28 32 1 0 0 0
29 31 1 0 0 0
29 32 1 0 0 0
15 33 1 0 0 0
15 34 1 0 0 0
16 35 1 0 0 0
16 36 1 0 0 0
16 37 1 0 0 0
26 38 1 0 0 0
M END
> <CHROM.1>
2.74804207,-114.83879868,178.63419806,-11.86097681,-104.18799792,-175.61867989
-82.60305529,-167.43897154,58.52671946,-50.63759561,-111.24083331,101.74294800
8.69431853,1.29062552,20.98254072,-0.89039136,0.27787279,-3.08051579
> <Name>
ZINC000169748276
> <RI>
1.76083e+07
> <Rbt.Executable>
rbdock/0.1.0
> <Rbt.Library>
librxdock.so/0.1.0
> <SCORE>
-41.7582
> <SCORE.INTER>
-41.8551
> <SCORE.INTER.CONST>
1
> <SCORE.INTER.POLAR>
-4.96496
> <SCORE.INTER.REPUL>
0
> <SCORE.INTER.ROT>
10
> <SCORE.INTER.VDW>
-40.3742
> <SCORE.INTER.norm>
-1.30797
> <SCORE.INTRA>
0.0969082
> <SCORE.INTRA.DIHEDRAL>
-5.79141
> <SCORE.INTRA.DIHEDRAL.0>
19.5819
> <SCORE.INTRA.POLAR>
0
> <SCORE.INTRA.POLAR.0>
0
> <SCORE.INTRA.REPUL>
0
> <SCORE.INTRA.REPUL.0>
0
> <SCORE.INTRA.VDW>
2.99261
> <SCORE.INTRA.VDW.0>
-5.2787
> <SCORE.INTRA.norm>
0.00302838
> <SCORE.RESTR>
0
> <SCORE.RESTR.CAVITY>
0
> <SCORE.RESTR.norm>
0
> <SCORE.SYSTEM>
0
> <SCORE.SYSTEM.CONST>
0
> <SCORE.SYSTEM.DIHEDRAL>
0
> <SCORE.SYSTEM.norm>
0
> <SCORE.heavy>
32
> <SCORE.norm>
-1.30494
$$$$
The .tsv file should look like this:
ZINC000169748276 -41.8551
ZINC000079214514 -41.7892
ZINC000195993528 -40.9293
Reply from a uj5u.com user:
Using any awk:
$ awk -v OFS='\t' '
/^>/ { tag=$2; next }
NF { f[tag]=$1 }
$0 == "$$$$" { print f["<Name>"], f["<SCORE.INTER>"] }
' file
ZINC000169748276 -41.8551
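The above assumes that lines containing $$$$ separate the input records.
Note that with this approach of first building an array (f[] above) that maps tags/names to their values, you can print whichever values you like in whatever order you like, convert the whole thing to CSV, compare values with each other by name, and so on. For example, you could write something like the following to analyze the data and produce reports: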
awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    NF   { f[tag]=$1 }
    $0 == "$$$$" {
        if ( (f["<SCORE.INTRA.POLAR>"] >= f["<SCORE.INTRA.REPUL>"]) &&
             (f["<SCORE.RESTR.CAVITY>"] == 27) ) {
            print f["<Name>"]
            for ( tag in f ) {
                if ( tag ~ /SCORE/ ) {
                    print f[tag]
                }
            }
        }
    }
' file
If you are ever considering using getline, then see http://awk.freeshell.org/AllAboutGetline to learn why it is usually the wrong approach.
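One caveat with the array approach above: f[] is never emptied, so a record that lacks a tag would silently print the value left over from the previous record. A minimal sketch of one way to guard against that follows. The names sample.sdf and scores.tsv, the two MOL_* records and the NA placeholder are made up for illustration and are not from the question. The sketch handles the $$$$ line before storing any value, then clears the array with split("", f), which should work in any POSIX awk.

```shell
# Build a tiny two-record sample (hypothetical data).
# The second record deliberately has no SCORE.INTER tag.
cat > sample.sdf <<'EOF'
MOL_A
M  END
> <Name>
MOL_A

> <SCORE.INTER>
-41.8551

$$$$
MOL_B
M  END
> <Name>
MOL_B

$$$$
EOF

awk -v OFS='\t' '
    /^>/ { tag = $2; next }                  # remember the current tag
    $0 == "$$$$" {                           # end of record: report, then reset
        score = ("<SCORE.INTER>" in f) ? f["<SCORE.INTER>"] : "NA"
        print f["<Name>"], score
        split("", f)                         # portable way to empty the array
        tag = ""
        next
    }
    NF { f[tag] = $1 }                       # first field of the value line
' sample.sdf > scores.tsv

cat scores.tsv
```

With this sample, the second row should carry NA rather than repeating the first record's score.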
Reply from a uj5u.com user:
Why awk?
Prompt> grep -A 1 -i "<NAME>" test.txt | tail -n 1
ZINC000169748276
Prompt> grep -A 1 -i "<SCORE.INTER>" test.txt | tail -n 1
-41.8551
As you can see, grep is much easier.
-A 1 means "also print the 1 line that follows each match".
After some discussion, this is the final solution:
grep -A 1 -i "<SCORE.INTER>" test.sdf | grep -v '^>' | grep -v '^--' >> results
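The final command above collects only the scores. Assuming every record carries both tags and > <Name> comes before > <SCORE.INTER>, one way the same grep idea could produce the two-column file is to match both tags at once and let paste join each pair of lines with a tab. This is only a sketch: sample.sdf and results.tsv are hypothetical names, the MOL_* records are made up, and -A is a GNU/BSD grep extension rather than POSIX.

```shell
# Build a tiny two-record sample (hypothetical data).
cat > sample.sdf <<'EOF'
MOL_A
M  END
> <Name>
MOL_A

> <SCORE.INTER>
-41.8551

$$$$
MOL_B
M  END
> <Name>
MOL_B

> <SCORE.INTER>
-40.9293

$$$$
EOF

# 1. Keep each wanted tag line plus the line after it.
# 2. Drop the tag lines and the "--" group separators that grep -A emits.
# 3. Join every two remaining lines (name, score) with a tab.
grep -A 1 -E '^> +<(Name|SCORE\.INTER)>' sample.sdf |
    grep -v -e '^>' -e '^--$' |
    paste - - > results.tsv

cat results.tsv
```

If a record were missing one of the two tags, the pairing would shift for every later record, so this is best kept for files known to be regular.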
Reply from a uj5u.com user:
"I want to save the > <NAME> and > <SCORE.INTER> values in a .tsv file. Is there a quick way to do this, for example with awk?"
Your file has > <Name>, not > <NAME> (an important difference if you match case-sensitively). I would use GNU AWK for this task as follows. This assumes that > <Name> always comes before > <SCORE.INTER> and that every > <SCORE.INTER> has a corresponding > <Name>. Let file.txt contain the same single record shown in the question (from the ZINC000169748276 header line through $$$$), then
awk '/^> <Name>/{getline;printf "%s\t",$0}/^> <SCORE\.INTER>/{getline;print $0}' file.txt
Output
ZINC000169748276 -41.8551
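Explanation: getline makes GNU AWK load the next line, so $0 becomes the content of the line after the current one. When > <Name> is found at the start of a line (^), the next line is loaded and printed followed by a TAB; when a line starting with > <SCORE.INTER> is found, the next line is loaded and printed. Note that . must be escaped because it has a special meaning in regular expressions.
(tested in gawk 4.2.1)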
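If you would rather avoid getline, the same pairing can be sketched with a flag variable that remembers which kind of value the next line holds. This is only an illustration under the same ordering assumption as above (> <Name> before > <SCORE.INTER> in every record). The names sample.sdf and pairs.tsv and the MOL_* records are made up.

```shell
# Build a tiny two-record sample (hypothetical data).
cat > sample.sdf <<'EOF'
MOL_A
M  END
> <Name>
MOL_A

> <SCORE.INTER>
-41.8551

$$$$
MOL_B
M  END
> <Name>
MOL_B

> <SCORE.INTER>
-40.9293

$$$$
EOF

awk -v OFS='\t' '
    want == "name"  { name = $0; want = ""; next }       # line right after the Name tag
    want == "score" { print name, $0; want = ""; next }  # line right after the SCORE.INTER tag
    /^> *<Name>/         { want = "name" }
    /^> *<SCORE\.INTER>/ { want = "score" }
' sample.sdf > pairs.tsv

cat pairs.tsv
```

Because no line is read ahead by hand, a tag on the very last line of the file simply produces no output instead of printing the tag line itself.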
