Joining two files by matching a specific column

I am trying to join two files that are already sorted.

File 1

70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2

File 2

0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2

Desired output

Gene Symbol  Gene Description         Target Score mirDB   Target Score Diana
HOXC11       centrosomal protein 57   70                   0.700229616476825
CHD4         chromodomain helicase    70                   0.700328646327188
LUZP2        leucine zipper protein 2 70                   0.700328951649384

To do this I tried the following script, but it produces an empty file:

join -j 2 -o 1.1,1.2,1.3,1.4,2.4 File1 File2 | column -t | sed '1i Gene Symbol, Gene Description, Target Score mirDB, Target Score Diana' > Output

Any help with awk or the join command would be appreciated.

A uj5u.com community member replied:

You can try this awk:

$ awk 'BEGIN {OFS="\t"; print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"} NR==FNR{array[$2]=$1; next} $0!~array[$2]{print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]}' file2 file1

Gene Symbol     Gene Description        Target Score mirDB      Target Score Diana
HOXC11          centrosomal protein 57          70              0.700229616476825
CHD4            chromodomain helicase           70              0.700328646327188
LUZP2           leucine zipper protein  2       70              0.700328951649384

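BEGIN {
    OFS="\t"                 # tab-separated output
    print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
} NR==FNR {                  # first file on the command line (file2)
    array[$2]=$1             # map gene symbol to its Diana score
    next
} $0!~array[$2] {            # file1: skips genes not stored above, whose empty pattern matches every line
    print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]
}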

A uj5u.com community member replied:

Update: the awk code now strips Windows line endings (\r), since these came up as a problem in OP's comments and other questions.
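One way to check whether a file actually carries Windows line endings before joining is to count the lines that contain a carriage return. This is only a sketch: the two-line sample file is made up from the question's data, and `tr -d '\r'` is one of several ways to strip the character.

```shell
tmp=$(mktemp -d)
# Hypothetical sample: two lines of File2 saved with Windows (CRLF) line endings.
printf '0.700229616476825 HOXC11\r\n0.700328646327188 CHD4\r\n' > "$tmp/File2"

cr=$(printf '\r')                                  # a literal carriage return
before=$(grep -c "$cr" "$tmp/File2" || true)       # lines containing "\r"

# Strip every carriage return and count again.
tr -d '\r' < "$tmp/File2" > "$tmp/File2.unix"
after=$(grep -c "$cr" "$tmp/File2.unix" || true)

echo "lines with CR before: $before, after: $after"
rm -r "$tmp"
```

If the first count is non-zero, the last field on each line ends in an invisible "\r", which is enough to stop a gene symbol in one file from matching the same symbol in the other.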

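Issues:

OP's current code requires both files to be sorted on the join field before calling join
Because File1 has a variable number of whitespace-delimited columns, it will be hard (impossible?) for join to generate a format that is not scrambled by the follow-on column call
column cannot distinguish between spaces used as field delimiters and spaces that are part of a field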
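The first point can be tried out on the question's sample data. The sketch below assumes the two files are recreated in a temporary directory, sorts both on field 2, and uses `LC_ALL=C` so that sort and join agree on ordering. Only the gene symbol and the two scores are printed, because the description has a varying number of words and cannot be named with a fixed `-o` list.

```shell
tmp=$(mktemp -d)
cat > "$tmp/File1" <<'EOF'
70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2
EOF
cat > "$tmp/File2" <<'EOF'
0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2
EOF

export LC_ALL=C                       # same collation for sort and join
sort -k2,2 "$tmp/File1" > "$tmp/File1.sorted"
sort -k2,2 "$tmp/File2" > "$tmp/File2.sorted"

# Join on field 2 of both files; print the key, the mirDB score, the Diana score.
joined=$(join -1 2 -2 2 -o 0,1.1,2.1 "$tmp/File1.sorted" "$tmp/File2.sorted")
printf '%s\n' "$joined"
rm -r "$tmp"
```

The rows come out in gene-symbol order rather than in the original file order, which is a side effect of the required sort.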

Given these issues, I think an awk solution, combined with column for the "simple" reformatting, is easier to implement and understand, e.g.:

awk '
BEGIN      { OFS="|"                              # "|" will be used as the input delimiter for a follow-on "column" call
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }                       # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i++) {                # pull fields #3 to #NF into a single variable
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1

This produces:

Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384

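While more code could be added to have awk print the output in "pretty" columns, I have opted for the simpler approach of letting column do the extra work: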

awk '
BEGIN      { OFS="|" 
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }                       # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i++) {
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1 | column -s'|' -t

This produces:

Gene Symbol  Gene Description          Target Score mirDB  Target Score Diana
HOXC11       centrosomal protein 57    70                  0.700229616476825
CHD4         chromodomain helicase     70                  0.700328646327188
LUZP2        leucine zipper protein 2  70                  0.700328951649384
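The awk-only formatting mentioned above is not shown in the answer. A rough sketch of what it could look like uses printf with fixed column widths. The widths (12, 26, 20) are assumptions picked to fit the sample data, and a general version would have to compute them from the input.

```shell
tmp=$(mktemp -d)
cat > "$tmp/File1" <<'EOF'
70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2
EOF
cat > "$tmp/File2" <<'EOF'
0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2
EOF

out=$(awk '
BEGIN      { fmt = "%-12s %-26s %-20s %s\n"      # assumed fixed widths
             printf fmt, "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r$/, "") }                    # drop a trailing Windows "\r", if any
FNR==NR    { gene[$2] = $1; next }               # File2: gene symbol -> Diana score
$2 in gene { desc = $3                           # File1: rebuild the description from fields 3..NF
             for (i = 4; i <= NF; i++) desc = desc " " $i
             printf fmt, $2, desc, $1, gene[$2]
           }
' "$tmp/File2" "$tmp/File1")

printf '%s\n' "$out"
rm -r "$tmp"
```

The trade-off is the one the answer points at: column works out the widths by itself, whereas here a longer description would push the later columns out of alignment.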

A uj5u.com community member replied:

This might work for you (GNU sed, join and column):

( echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana';
join -j2 -t@ --no -o 0,1.3,1.1,2.1 <(sed 's/ /@/;s//@/' file1) <(sed 's/ /@/' file2) ) |
column -s@ -t

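Build the final header, join the two input files, and pass the combined output to the column command, which tabulates the result.

Note that the header is delimited by @, or by any other character that does not occur in the header or the joined files. The input files are modified so that their field delimiters match the header's, and the column command uses the same delimiter to tabulate the final result. The --no option (short for --nocheck-order) suppresses warning messages.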
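Since join is documented as expecting input sorted on the join field, a more defensive variant of the same idea sorts both converted files first, so --nocheck-order is no longer needed. This is a sketch under a few assumptions: the sample files are recreated in a temporary directory, temporary files stand in for the process substitution, and `LC_ALL=C` keeps sort and join consistent.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2
EOF
cat > "$tmp/file2" <<'EOF'
0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2
EOF

export LC_ALL=C
# Turn the first two spaces (file1) or the first space (file2) into "@",
# then sort on the second "@"-delimited field, the gene symbol.
sed 's/ /@/;s/ /@/' "$tmp/file1" | sort -t@ -k2,2 > "$tmp/f1"
sed 's/ /@/'        "$tmp/file2" | sort -t@ -k2,2 > "$tmp/f2"

result=$( {
  echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana'
  join -t@ -1 2 -2 2 -o 0,1.3,1.1,2.1 "$tmp/f1" "$tmp/f2"
} )

printf '%s\n' "$result"      # pipe this through: column -s@ -t
rm -r "$tmp"
```

The rows are emitted in gene-symbol order. If the original File1 order matters, the awk approach in the earlier answer keeps it.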

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/315243.html
