I am trying to join two files that are already sorted.
File 1
70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2
File 2
0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2
Output
Gene Symbol  Gene Description          Target Score mirDB  Target Score Diana
HOXC11       centrosomal protein 57    70                  0.700229616476825
CHD4         chromodomain helicase     70                  0.700328646327188
LUZP2        leucine zipper protein 2  70                  0.700328951649384
To do this I tried the following script, but it returns an empty file:
join -j 2 -o 1.1,1.2,1.3,1.4,2.4 File1 File2 | column -t | sed '1i Gene Symbol, Gene Description, Target Score mirDB, Target Score Diana' > Output
Any help with awk or the join command would be appreciated.
A uj5u.com user replied:
You can try this awk:
$ awk 'BEGIN {OFS="\t"; print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"} NR==FNR{array[$2]=$1; next} $0!~array[$2]{print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]}' file2 file1
Gene Symbol Gene Description Target Score mirDB Target Score Diana
HOXC11 centrosomal protein 57 70 0.700229616476825
CHD4 chromodomain helicase 70 0.700328646327188
LUZP2 leucine zipper protein 2 70 0.700328951649384
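BEGIN {
    OFS = "\t"
    print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
}
# First file (file2): remember the Diana score of each gene, keyed by gene symbol
NR == FNR {
    array[$2] = $1
    next
}
# Second file (file1): for a gene missing from file2, array[$2] is empty and an
# empty regex matches every line, so only genes that have a score are printed
$0 !~ array[$2] {
    print $2, OFS $3 " " $4 " " $5, $6, $1, OFS array[$2]
}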
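The `$0!~array[$2]` condition can look odd at first. As far as I can tell, it works because an unknown gene yields an empty string, and an empty regex matches any line. Below is a minimal sketch comparing it with the more conventional `in` test; the two sample lines and the `score` array name are only assumptions for illustration.

```shell
# Hypothetical two-line sample; only HOXC11 has an entry in the lookup table.
out=$(printf '%s\n' '70 CBLB Cbl proto-oncogene B' '70 HOXC11 centrosomal protein 57' |
  awk '
    BEGIN { score["HOXC11"] = "0.700229616476825" }
    {
        # Test membership first: merely referencing score[$2] below creates the entry.
        by_in    = ($2 in score)     ? "keep" : "skip"
        # For an unknown gene score[$2] is "", and an empty regex matches any line.
        by_regex = ($0 !~ score[$2]) ? "keep" : "skip"
        print $2, by_in, by_regex
    }')
echo "$out"
```

Both tests agree on this sample, but the `in` test does not depend on the score never appearing as text inside the line.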
A uj5u.com user replied:
Update: the awk code now removes Windows line endings (\r), since this came up as an issue in the OP's comments and another question.
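Issues:
- OP's current code requires both files to be presorted before calling join
- due to the variable number of whitespace-delimited columns in File1, it will be difficult (impossible?) for join to generate a format that is not scrambled by the follow-on column call
- column cannot distinguish between spaces used as field delimiters and spaces that are part of a field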
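These issues can be checked on the sample data. The following is only a rough sketch: it recreates File1 in a temporary directory (the path and variable names are assumptions), checks whether the file is sorted on the join field, and counts the fields on each line.

```shell
# Recreate File1 from the question in a scratch directory (hypothetical location).
tmp=$(mktemp -d)
cat > "$tmp/File1" <<'EOF'
70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2
EOF

# join expects its input sorted on the join field (field 2 here).
# sort -c only checks the order and exits non-zero when the file is unsorted.
if LC_ALL=C sort -c -k2,2 "$tmp/File1" 2>/dev/null; then
  sorted=yes
else
  sorted=no
fi

# The number of whitespace-separated fields varies per line,
# because the description itself contains spaces.
field_counts=$(awk '{ printf "%s%d", sep, NF; sep = " " } END { print "" }' "$tmp/File1")

echo "sorted on field 2: $sorted"
echo "fields per line: $field_counts"
rm -rf "$tmp"
```

On this sample the gene column is not in sorted order (HOXC11 comes before CHD4), and the field count differs from line to line, so a fixed `-o` field list cannot capture the whole description.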
Because of these issues, I think an awk solution, combined with column for "simple" reformatting, is easier to implement and understand, e.g.:
awk '
BEGIN      { OFS="|"    # "|" will be used as the input delimiter for a follow-on "column" call
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }    # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i++) {    # pull fields #3 to #NF into a single variable
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1
This produces:
Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384
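The `sub(/\r/,"")` line matters because a trailing carriage return becomes part of the last field, so the gene symbol read from File2 would no longer equal the one in File1. A small sketch of the effect, using one made-up CRLF-terminated line (the `key` array name is an assumption, and the regex is anchored to the end of the line here):

```shell
# Simulate a File2 line saved with Windows (CRLF) line endings.
with_cr=$(printf '0.700229616476825 HOXC11\r\n' |
  awk '{ key[$2] } END { print (("HOXC11" in key) ? "found" : "missing") }')

# Same input, with the trailing "\r" stripped before the key is stored.
without_cr=$(printf '0.700229616476825 HOXC11\r\n' |
  awk '{ sub(/\r$/, ""); key[$2] } END { print (("HOXC11" in key) ? "found" : "missing") }')

echo "with CR: $with_cr"
echo "CR removed: $without_cr"
```

Without the cleanup the stored key is `HOXC11` followed by a carriage return, which is why the lookup comes back empty on files edited under Windows.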
While more code could be added so that awk prints the output in "pretty" columns, I opted for the simpler approach of letting column do the extra work:
awk '
BEGIN      { OFS="|"
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }    # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i++) {
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1 | column -s'|' -t
This produces:
Gene Symbol  Gene Description          Target Score mirDB  Target Score Diana
HOXC11       centrosomal protein 57    70                  0.700229616476825
CHD4         chromodomain helicase     70                  0.700328646327188
LUZP2        leucine zipper protein 2  70                  0.700328951649384
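For comparison, here is a rough sketch of what doing the alignment inside awk might look like, in case column is not available. It is not part of the solution above: the `rows`, `cell` and `width` names are assumptions, and the input is simply the pipe-delimited output shown earlier.

```shell
# Pipe-delimited rows, as produced by the awk script without the column call.
rows='Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384'

aligned=$(printf '%s\n' "$rows" | awk -F'|' '
    {   # while reading: remember every cell and the widest cell per column
        for (i = 1; i <= NF; i++) {
            cell[NR, i] = $i
            if (length($i) > width[i]) width[i] = length($i)
        }
        if (NF > ncol) ncol = NF
    }
    END {   # afterwards: left-justify each cell to its column width, two spaces apart
        for (r = 1; r <= NR; r++) {
            line = ""
            for (i = 1; i < ncol; i++)
                line = line sprintf("%-" width[i] "s  ", cell[r, i])
            print line cell[r, ncol]
        }
    }')
echo "$aligned"
```

This needs all rows in memory before anything is printed, which is part of why handing the job to column is the simpler choice.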
A uj5u.com user replied:
This might work for you (GNU sed, join and column):
( echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana';
join -j2 -t@ --no -o 0,1.3,1.1,2.1 <(sed 's/ /@/;s//@/' file1) <(sed 's/ /@/' file2) ) |
column -s@ -t
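Formulate the final headings, join the two input files, and pass the combined output to the column command, which tabulates the result.
N.B. The headings are delimited by @, or by any other character not found in the headings or the joined files. The input files are amended so that their field delimiters match that of the headings, and the column command uses the same delimiter to tabulate the final result. The --no option (short for --nocheck-order) prevents warning messages.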
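To see what the two sed substitutions do, and how the same join behaves when the inputs are sorted on the gene symbol first, here is a minimal sketch. Sorting beforehand is my own assumption rather than part of the answer above, and the temporary file names are made up. It uses the portable `-1 2 -2 2` spelling instead of `-j2`.

```shell
tmp=$(mktemp -d)   # scratch directory; the file names below are assumptions
printf '%s\n' '70 HOXC11 centrosomal protein 57' '70 CHD4 chromodomain helicase' \
              '70 LUZP2 leucine zipper protein 2' > "$tmp/file1"
printf '%s\n' '0.700229616476825 HOXC11' '0.700328646327188 CHD4' \
              '0.700328951649384 LUZP2' > "$tmp/file2"

# s/ /@/ replaces the first space; the empty pattern in s//@/ reuses the previous
# regex, so it replaces the next space. Spaces inside the description survive.
first=$(sed 's/ /@/;s//@/' "$tmp/file1" | head -n 1)
echo "$first"

# Sort both rewritten files on the gene symbol (field 2) before joining them.
sed 's/ /@/;s//@/' "$tmp/file1" | LC_ALL=C sort -t@ -k2,2 > "$tmp/f1.sorted"
sed 's/ /@/'       "$tmp/file2" | LC_ALL=C sort -t@ -k2,2 > "$tmp/f2.sorted"
joined=$(LC_ALL=C join -1 2 -2 2 -t@ -o 0,1.3,1.1,2.1 "$tmp/f1.sorted" "$tmp/f2.sorted")
echo "$joined"
rm -rf "$tmp"
```

With sorted input the rows come out ordered by gene symbol rather than in File1 order, and the order check no longer needs to be disabled.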
