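There are many posts similar to this one. After several hours of trying to solve this I am desperate, because it looks like it should be simple.
I have a file that looks like this: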
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
tig00000005 28923 29432 XP_012148395.1 NW_003797444.1 LOC100881617 PREDICTED: eukaryotic translation initiation factor 4 gamma 3-like isoform X12
tig00000005 32415 34324 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
And a second file that looks like this:
tig00000005 maker gene 15310 16162 . . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker gene 16764 17237 . . ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker gene 23339 23974 . . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
tig00000005 maker gene 25472 26900 . . ID=snap_masked-tig00000005-processed-gene-0.5;Name=snap_masked-tig00000005-processed-gene-0.5
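I want to match columns 1, 2 and 3 of the first file against columns 1, 4 and 5 of the second file and, where they match, append the second file's line to the first file's line, like this: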
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
Some sample code that does not work:
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file 1 file 2
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=($1,$4,$5)} {print $0,a[$0]}' file 1 file 2
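The first command outputs every line of file 1, followed by file 2 with nothing appended; the second throws an error related to the = assignment. I have tried every permutation I can think of. Thanks for any help.
Answer from a uj5u.com user:
A few small changes to OP's first awk script: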
# old:
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file1 file2
# new - add BEGIN block, modify print statement:
awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print a[$1,$4,$5],$0}' file1 file2
The modified awk script generates:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
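To try this without the real data, here is a minimal sketch that builds two small tab-separated files from the sample rows and runs the modified script. The temporary directory, the shortened `ID=` values and the `joined` variable are made up for the illustration, and it assumes the real files are tab-delimited, which is what `FS="\t"` expects. It also hints at why the original misbehaved: written outside a `BEGIN` block, `OFS="\t"` and `FS="\t"` act as patterns that are always true, so awk prints every input line.

```shell
#!/bin/sh
# Illustration only: workdir, the shortened IDs and "joined" are assumptions.
workdir=$(mktemp -d)

# file1: contig, start, end, then annotation columns (tab-separated).
printf 'tig00000005\t15310\t16162\tXP_012153921.1\tNW_003797090.1\tLOC105664333\tPREDICTED: elastin-like\n' > "$workdir/file1"
printf 'tig00000005\t2685\t4511\tXP_012144644.1\tNW_003797249.1\tLOC105662970\tPREDICTED: fibrinogen alpha chain-like isoform X2\n' >> "$workdir/file1"

# file2: contig, source, type, start, end, then remaining columns (tab-separated).
printf 'tig00000005\tmaker\tgene\t15310\t16162\t.\t.\tID=gene-0.2\n' > "$workdir/file2"
printf 'tig00000005\tmaker\tgene\t16764\t17237\t.\t.\tID=gene-0.3\n' >> "$workdir/file2"

# Pass 1 (NR==FNR): store each file1 line under its (contig,start,end) key.
# Pass 2: for file2 lines whose (contig,start,end) key was stored, print both lines.
joined=$(awk 'BEGIN {FS=OFS="\t"}
  NR==FNR {a[$1,$2,$3]=$0; next}
  (($1,$4,$5) in a) {print a[$1,$4,$5], $0}' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Only the 15310/16162 row should come out, because it is the only key present in both files; unmatched rows from either file are dropped.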
Answer from a uj5u.com user:
Like this?
awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
To write to a new file, just run awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2 > file3
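A small sketch of this variant on space-separated input, in case it helps to see it end to end. The temporary directory, the shortened `ID=` values and the `result` variable are assumptions for the illustration. It relies on awk's default field splitting on runs of whitespace, so the key columns are still `$1`, `$2`, `$3` and `$1`, `$4`, `$5` even though the description at the end of each file1 line contains spaces.

```shell
#!/bin/sh
# Illustration only: workdir, the shortened IDs and "result" are assumptions.
workdir=$(mktemp -d)

# Space-separated sample rows; the file1 description contains spaces.
cat > "$workdir/file1" <<'EOF'
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
EOF
cat > "$workdir/file2" <<'EOF'
tig00000005 maker gene 23339 23974 . . ID=gene-0.4
tig00000005 maker gene 25472 26900 . . ID=gene-0.5
EOF

# The key is a space-joined string; the parentheses only make the
# grouping before "in" explicit, the behaviour is unchanged.
awk 'NR==FNR {a[$1" "$2" "$3]=$0; next}
     (($1" "$4" "$5) in a) {print a[$1" "$4" "$5], $0}' \
    "$workdir/file1" "$workdir/file2" > "$workdir/file3"

result=$(cat "$workdir/file3")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Because `OFS` is left at its default, the two matched lines are joined by a single space rather than a tab; set `OFS` in a `BEGIN` block if tab-separated output is needed.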
