This is a follow-up to the question outlined below. I have the following three strings (ignore the lines starting with >):
>chain A
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKG
>chain B
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKG
>chain C
MGRDQRAVAGPALRRWLLLGTVTVGFLAQSVLAGVKKFDVPCGGRDCSGGCQCYPEKGGRGQPGPVGPQGYNGPPGLQGFPGLQGRKGDKGERGAPGVTGPKGDVGARGVSGFPGADGIPGHPGQGGPRGRPGYDGCNGTQGDSGPQGPPGSEGFTGPPGPQGPKGQKGEP-YALPKEERDRYRGEPGEPGLVGFQGPPGRPGHVGQMGPVGAPGRPGPPGPPGPKGQQGNRGLGFYGVKGEKGDVGQPGPNGIPSDTLHPIIAPTGVTFH
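I want to find the character positions of every R and D/E in the three chains that satisfy the following relationships:
Ri (chain A) - Di+2 (chain B)
Ri (chain B) - Di+2 (chain C)
Ri (chain C) - Di+5 (chain A)
Explanation: iterate over every R in chain A and, for an R at position i, check whether position i+2 of chain B holds a D or an E. If it does, print the character positions of each such R and D/E pair. Do the same for chains B and C (offset 2) and for chains C and A (offset 5).
Catch: dashes must be counted when evaluating the relationship, but ignored when printing the positions.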
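The catch above amounts to keeping two coordinates per character: its column in the alignment (dashes counted) and its position in the sequence (dashes ignored). A minimal sketch of the conversion, assuming a hypothetical shell helper named `ungapped_pos` that is not part of the original script:

```shell
# ungapped_pos is a hypothetical helper, for illustration only.
# $1 = aligned chain (dashes included), $2 = 1-based column in the alignment
ungapped_pos() {
    awk -v s="$1" -v col="$2" 'BEGIN {
        head = substr(s, 1, col)        # everything up to and including the column
        dashes = gsub(/-/, "", head)    # gsub() returns the number of dashes it removed
        print col - dashes              # same character, counted without dashes
    }'
}

ungapped_pos "--MG-R" 6    # prints 3: the R in column 6 is the 3rd non-dash character
```

The relationship test (i versus i+2 or i+5) would use the column, while the printed value would use the converted position.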
Using the script posted in the original question, I get the following output:
B-C 187 R E
The output should be:
B-C 175-188 R E
I modified the code posted in the original question to include a correction:
awk '
    { chain_id[++c]=$2                               # save chain id, eg, "A", "B", "C"
      getline                                        # read next line from input file
      chains[c]=$0                                   # save associated chain
    }
END { i_char="R"                                     # character to search for in 1st chain
      for (i=1;i<=c;i++) {                           # loop through list of chains
          j= (i==c ? 1 : i+1)                        # determine index of 2nd chain
          offset= (i==c ? 5 : 2)                     # 2 for A-B, B-C; 5 for C-A
          chain_i=chains[i]                          # copy chains as we are going to cut them up as we process them
          chain_j=chains[j]
          chain_pair= chain_id[i] "-" chain_id[j]    # build output label, eg, "A-B"
          pos=0                                      # reset position
          while (length(chain_i)>0) {
              n=index(chain_i,i_char)                # look for i_char ("R")
              if (n==0) break                        # if not found we are done with this chain pair so break out of loop else ...
              pos=pos+n                              # update our position in the chain and ...
              j_char=substr(chain_j,n+offset,1)      # find character from 2nd chain at location n+offset
              if (j_char ~ /D|E/) {                  # if 2nd chain character is one of "D" or "E" then ...
                  corr_i=substr(chain_i,1,n)
                  corr=gsub(/-/,"",corr_i)           # number of dashes removed from corr_i
                  corr_pos=pos-corr
                  print chain_pair,corr_pos,i_char,j_char   # print our finding
              }
              chain_i=substr(chain_i,n+1)            # strip off 1st n characters
              chain_j=substr(chain_j,n+1)
          }
      }
    }
' file
But this does not help, and the output is still incorrect:
B-C 187 R E
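One likely reason the modified script still prints 187: `corr` is computed from `chain_i` after earlier chunks have already been cut off, so it only sees dashes in the current chunk. In chain B as posted, the 12 dashes all sit within the first 48 columns and the R at column 187 follows another R at column 144, so that chunk appears to contain no dashes and nothing is subtracted, while 187 minus all 12 gives the expected 175. The sketch below contrasts the two counts on a toy string; the function name `r_positions` and the test string are assumptions made for illustration.

```shell
# r_positions is a hypothetical helper, for illustration only.
# $1 = aligned chain; for each R it prints three numbers:
#   column, column minus dashes in the current chunk, column minus all dashes so far
r_positions() {
    awk -v s="$1" 'BEGIN {
        pos = total = 0
        while ((n = index(s, "R")) > 0) {
            pos += n                        # column in the full alignment
            head = substr(s, 1, n)          # current chunk, up to and including this R
            chunk = gsub(/-/, "", head)     # dashes in this chunk only
            total += chunk                  # dashes in every chunk processed so far
            print pos, pos - chunk, pos - total
            s = substr(s, n + 1)            # cut off the processed chunk
        }
    }'
}

r_positions "--RAAR"
# prints:
# 3 1 1
# 6 6 4
```

For the second R the per-chunk correction leaves column 6 unchanged, whereas the running total yields 4, which is its position in `RAAR`.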
Answer from a uj5u.com user:
Add some logic to keep a running count of dashes:
awk '
    { chain_id[++c]=$2; getline; chains[c]=$0 }
END { i_char="R"
      for (i=1;i<=c;i++) {
          j= (i==c ? 1 : i+1)
          offset= (i==c ? 5 : 2)
          chain_i=chains[i]
          chain_j=chains[j]
          chain_pair= chain_id[i] "-" chain_id[j]
          pos=dash_cnt_i=dash_cnt_j=0
          while (length(chain_i)>0) {
              n=index(chain_i,i_char)
              if (n==0) break
              pos=pos+n
              head_i = substr(chain_i,1,n)          # copy everything up to matching character
              head_j = substr(chain_j,1,n)          # copy the same columns from the 2nd chain
              dash_cnt_i += gsub(/-/,"",head_i)     # add count of dashes in head_i; gsub() returns number of substitutions which in this case is also the number of dashes in head_i
              dash_cnt_j += gsub(/-/,"",head_j)     # add count of dashes in head_j
              j_char=substr(chain_j,n+offset,1)
              if (j_char ~ /E|D/)
                  print chain_pair,(pos-dash_cnt_i) "-" (pos+offset-dash_cnt_j),i_char,j_char
              chain_i=substr(chain_i,n+1)
              chain_j=substr(chain_j,n+1)
          }
      }
    }
' file.txt
This produces:
A-B 355-357 R E
A-B 390-392 R E
A-B 597-599 R D
A-B 781-783 R E
A-B 917-919 R D
A-B 968-970 R D
A-B 1063-1065 R E
A-B 1516-1518 R D
A-B 1638-1640 R E
B-C 175-188 R E # OP's expected result
B-C 346-364 R D
B-C 355-373 R E
B-C 396-414 R D
B-C 500-519 R D
B-C 585-602 R D
B-C 917-963 R E
B-C 1063-1108 R E
B-C 1173-1218 R D
B-C 1516-1562 R D
C-A 334-321 R E
C-A 400-389 R E
C-A 471-459 R E
C-A 740-706 R D
C-A 1228-1190 R E
C-A 1589-1552 R E
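Each output line can be spot-checked by reading the residue back from the chain with the dashes removed. A minimal sketch, assuming a hypothetical helper named `residue_at` (not part of the answer's script):

```shell
# residue_at is a hypothetical helper, for illustration only.
# $1 = aligned chain, $2 = 1-based position counted without dashes
residue_at() {
    awk -v s="$1" -v p="$2" 'BEGIN {
        gsub(/-/, "", s)         # drop the alignment dashes
        print substr(s, p, 1)    # character at the dash-free position
    }'
}

residue_at "-RGAEAAA" 4    # prints E
```

For the line `B-C 175-188 R E`, passing the chain B string with 175 should print R, and passing the chain C string with 188 should print E.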
