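My task is to summarize several files into a single TSV file. I have to pick specific data from a list of files and write it to the TSV as one row of tab-separated columns. Every line in a file has a "name" as its first column, so filtering the data is easy ($1 == "NAME"). One file == one row in the TSV. So far I have written this: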
#! /bin/bash
cat > newFile.txt
for f in *.pdb; do
awk '$1 == "ACCESSION" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "DEFINITION" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "SOURCE" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "LOCUS" {print$4}' ORS="/r" "$f" >> newFile.txt
done
Obviously, this atrocity of code does not work. Is it possible to modify what I wrote and get the job done with awk?
Example file:
LOCUS \t NM_123456 \t 2000bp \t mRNA
DEFINITION \t Very nice gene from a very nice mouse
ACCESSION \t NM_123456
VERSION \t 1.000
SOURCE \t Very nice mouse
Desired result:
NM_123456 /t Very nice gene from a very nice mouse /t Very nice mouse /t mRNA
NM_345678 /t Not so nice gene from an angry elephant /t Angry Elephant /t mRNA
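"/t" stands for a tab character (sorry, I don't know how to write it down). Also, the example file contains more information; I have only shown the "header", so to speak.
A uj5u.com user replied:
In plain bash: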
for file in *.pdb; do
acc=
def=
src=
loc=
while IFS=$'\t' read -ra fields; do
if [[ ${fields[0]} = "ACCESSION" ]]; then
acc=${fields[1]}
elif [[ ${fields[0]} = "DEFINITION" ]]; then
def=${fields[1]}
elif [[ ${fields[0]} = "SOURCE" ]]; then
src=${fields[1]}
elif [[ ${fields[0]} = "LOCUS" ]]; then
loc=${fields[3]}
fi
done < "$file"
printf '%s\t%s\t%s\t%s\n' "$acc" "$def" "$src" "$loc" >> newFile.txt
done
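For comparison, the per-file loop from the question can also be kept, with awk still doing the field extraction. The following is only a sketch under a few assumptions: the fields really are tab-separated, each tag occurs exactly once per file, and the scratch directory and `sample.pdb` are made up for the demo. It changes three things in the original script: the bare `cat > newFile.txt` (which waits for input on stdin) becomes `: > newFile.txt`, the literal strings "/t" and "/r" become the escapes `\t` and a newline, and `-F'\t'` keeps multi-word values in a single field.

```shell
# Scratch directory with one hypothetical input file (contents from the question's sample).
cd "$(mktemp -d)" || exit 1
printf 'LOCUS\tNM_123456\t2000bp\tmRNA\nDEFINITION\tVery nice gene from a very nice mouse\nACCESSION\tNM_123456\nVERSION\t1.000\nSOURCE\tVery nice mouse\n' > sample.pdb

: > newFile.txt                       # create/truncate the output without reading stdin
for f in *.pdb; do
    [ -e "$f" ] || continue           # nothing matched the pattern
    # ORS='\t' ends each printed value with a tab instead of a newline.
    awk -F'\t' '$1 == "ACCESSION"  {print $2}' ORS='\t' "$f" >> newFile.txt
    awk -F'\t' '$1 == "DEFINITION" {print $2}' ORS='\t' "$f" >> newFile.txt
    awk -F'\t' '$1 == "SOURCE"     {print $2}' ORS='\t' "$f" >> newFile.txt
    # The last column keeps the default ORS (newline), which ends the row.
    awk -F'\t' '$1 == "LOCUS"      {print $4}' "$f" >> newFile.txt
done
cat newFile.txt
```

Because each tag gets its own awk call, the column order does not depend on the line order in the file, but every file is read four times, and a missing or repeated tag shifts the columns of that row.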
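A uj5u.com user replied:
If every file has these lines in the same order (ACCESSION, DEFINITION, SOURCE, then LOCUS, because the rows are printed in the order the lines are read), and each of them appears exactly once per file (no more, no less), you can do this: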
awk -F'\t' '
$1 == "ACCESSION" {printf "%s\t", $2}
$1 == "DEFINITION" {printf "%s\t", $2}
$1 == "SOURCE" {printf "%s\t", $2}
$1 == "LOCUS" {print $4}' *.pdb > table.tsv
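However, if the lines come in a different order, or some files do not have every line, or some files have the same line more than once (for example, SOURCE foo appearing twice), you will need something more involved, like this: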
awk -F'\t' '
function print_row(cols) {
for (i=0; i<3; i++) {
printf "%s\t", cols[i]
cols[i] = ""
}
print cols[3]
cols[3] = ""
}
NR!=FNR && FNR==1 {print_row(cols)}
$1 == "ACCESSION" {cols[0] = $2}
$1 == "DEFINITION" {cols[1] = $2}
$1 == "SOURCE" {cols[2] = $2}
$1 == "LOCUS" {cols[3] = $4}
END {print_row(cols)}' *.pdb > table.tsv
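This always prints a tidy table with the columns lined up correctly, regardless of the order of the lines within each file, even if some lines are missing or appear more than once. If a line appears more than once, the last occurrence is used.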
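A quick way to check that behaviour is to run the script on two small files that break the "same order, exactly once" assumption. This is only a sketch: the scratch directory, the file names `a.pdb` and `b.pdb`, and the second file's contents are made up for the demo, and the input is assumed to be tab-separated (hence `-F'\t'`). The loop counter is also declared as a local variable here, which is optional.

```shell
# Hypothetical demo data in a scratch directory.
dir=$(mktemp -d)
# First file: the question's sample, with LOCUS first and SOURCE last.
printf 'LOCUS\tNM_123456\t2000bp\tmRNA\nDEFINITION\tVery nice gene from a very nice mouse\nACCESSION\tNM_123456\nVERSION\t1.000\nSOURCE\tVery nice mouse\n' > "$dir/a.pdb"
# Second file: no SOURCE line, and ACCESSION appears twice.
printf 'ACCESSION\tNM_000000\nACCESSION\tNM_345678\nLOCUS\tNM_345678\t1500bp\tmRNA\nDEFINITION\tNot so nice gene from an angry elephant\n' > "$dir/b.pdb"

awk -F'\t' '
function print_row(cols,    i) {      # i is a local variable
    for (i = 0; i < 3; i++) {
        printf "%s\t", cols[i]
        cols[i] = ""                  # reset for the next file
    }
    print cols[3]
    cols[3] = ""
}
NR != FNR && FNR == 1 { print_row(cols) }   # a new file starts: flush the previous row
$1 == "ACCESSION"  { cols[0] = $2 }
$1 == "DEFINITION" { cols[1] = $2 }
$1 == "SOURCE"     { cols[2] = $2 }
$1 == "LOCUS"      { cols[3] = $4 }
END { print_row(cols) }                     # flush the last file
' "$dir"/*.pdb > "$dir/table.tsv"

cat "$dir/table.tsv"
```

The second row should come out with NM_345678 (the later ACCESSION line) in the first column and an empty third column where SOURCE is missing.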
A uj5u.com user replied:
If gawk, which supports the ENDFILE block, is available, please try:
awk -F'\t' -v OFS='\t' ' # assign input/output field separator to a tab character
BEGIN {
split("ACCESSION,DEFINITION,SOURCE,LOCUS", names, ",")
# assign an array "names" to the list of names
}
{
if ($1 == "LOCUS") a[$1] = $4
else a[$1] = $2
}
ENDFILE { # this block is invoked after reading each file
print a[names[1]], a[names[2]], a[names[3]], a[names[4]]
# print a["ACCESSION"], a["DEFINITION"], .. in order as a tsv
delete a # clear array "a"
}' *.pdb
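Where gawk is not available, the same per-file flush can be approximated in POSIX awk by printing the previous file's row whenever a new file starts (FNR == 1) and once more in END. The following is a sketch, not a drop-in replacement: the function name `emit_row`, the scratch directory, and the demo files are made up, and unlike the ENDFILE version it prints nothing for an empty input file.

```shell
# Hypothetical demo data in a scratch directory.
work=$(mktemp -d)
printf 'LOCUS\tNM_123456\t2000bp\tmRNA\nDEFINITION\tVery nice gene from a very nice mouse\nACCESSION\tNM_123456\nVERSION\t1.000\nSOURCE\tVery nice mouse\n' > "$work/a.pdb"
printf 'LOCUS\tNM_345678\t1500bp\tmRNA\nDEFINITION\tNot so nice gene from an angry elephant\nACCESSION\tNM_345678\nSOURCE\tAngry Elephant\n' > "$work/b.pdb"

awk -F'\t' -v OFS='\t' '
function emit_row() {
    print a["ACCESSION"], a["DEFINITION"], a["SOURCE"], a["LOCUS"]
    split("", a)                      # portable way to clear the whole array
}
FNR == 1 && NR > 1 { emit_row() }     # first line of a later file: flush the previous file
{ a[$1] = ($1 == "LOCUS" ? $4 : $2) } # LOCUS keeps its 4th field, other tags their 2nd
END { if (NR > 0) emit_row() }        # flush the last file, if anything was read
' "$work"/*.pdb > "$work/table.tsv"

cat "$work/table.tsv"
```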
A uj5u.com user replied:
This may be what you are looking for, using any awk in any shell on every Unix box (untested):
awk '
BEGIN { FS=OFS="\t" }
{ f[$1] = ($1 == "LOCUS" ? $4 : $2) }
$1 == "SOURCE" {
print f["ACCESSION"], f["DEFINITION"], f["SOURCE"], f["LOCUS"]
}
' *.pdb > newFile.txt
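The above assumes that every input file has the same tag-value pairs as the input file shown in the question, and that SOURCE is always the last of them.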
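If the first half of that assumption does not hold, the array `f` still carries the previous file's values, so a file that lacks, say, DEFINITION would silently repeat the previous file's definition. One possible variation, sketched here with made-up demo files, clears `f` after each row so that a missing tag becomes an empty cell instead. It still relies on SOURCE being the last relevant line in each file.

```shell
# Hypothetical demo data in a scratch directory.
work=$(mktemp -d)
printf 'LOCUS\tNM_123456\t2000bp\tmRNA\nDEFINITION\tVery nice gene from a very nice mouse\nACCESSION\tNM_123456\nSOURCE\tVery nice mouse\n' > "$work/a.pdb"
# Second file has no DEFINITION line.
printf 'LOCUS\tNM_345678\t1500bp\tmRNA\nACCESSION\tNM_345678\nSOURCE\tAngry Elephant\n' > "$work/b.pdb"

awk '
BEGIN { FS = OFS = "\t" }
{ f[$1] = ($1 == "LOCUS" ? $4 : $2) }
$1 == "SOURCE" {
    print f["ACCESSION"], f["DEFINITION"], f["SOURCE"], f["LOCUS"]
    split("", f)                      # forget this file before the next one starts
}
' "$work"/*.pdb > "$work/newFile.txt"

cat "$work/newFile.txt"
```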
