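I need to concatenate some files that sit in directories created by a Snakefile. I tried writing the following rule to concatenate all the files in each of those directories: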
# concatenate output per hmm
rule concatenate:
    input:
        output_{hmm}/* ,
    output:
        output_{hmm}/cat_{hmm}.txt,
    params:
        cmd='cat'
    shell:
        '{params.cmd} {input} > {output} '
It does not work and produces the following error:
"SyntaxError in line 62 of /scratch/data1/agalvez/domains/Snakefile_ecdf:
invalid syntax (Snakefile_ecdf, line 62)"
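I don't know what is wrong with the rule. I suspect that using * is not enough, but I can't think of another way to do what I want.
Edit: the question may have been missing some information, so here is the complete Snakefile as well: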
ARCHIVE_FILE = 'output.tar.gz'
# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'
# a single input file
INPUT_FILE = 'proteins/{species}.fasta'
# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'
# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'
# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species
# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm
# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)
# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)
# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE ,
        hmm=HMM_FILE
    output:
        OUTPUT_FILE,
    params:
        cmd='hmmsearch --noali -E 99 --tblout'
    shell:
        '{params.cmd} {output} {input.hmm} {input.species} '

# concatenate output per hmm
from glob import glob
rule concatenate:
    input:
        files = glob("output_{hmm}/*") ,
    output:
        CAT_FILE,
    params:
        cmd='cat'
    shell:
        '{params.cmd} {input.files} {output} '

# create an archive with all results
rule create_archive:
    input: OUT, CAT,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
Answer:
For testing purposes, let's create a sample set of files (working with dummy files is enough to check that the workflow runs correctly). In a terminal, I ran:
mkdir proteins && touch proteins/1.fasta proteins/2.fasta
mkdir hmm && touch hmm/A.hmm hmm/B.hmm
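Now, your workflow is mostly correct, except for rule concatenate. The input of this rule is created by rule hmm, and the output of this rule is specific to one value of the hmm wildcard. So you are interested in all values of species for a given hmm. The way to get that is to use expand while keeping hmm in wildcard form, i.e. expand(OUTPUT_FILE, species=INP, hmm="{hmm}"):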
rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "
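To see why passing hmm="{hmm}" works, here is a minimal plain-Python sketch. expand_sketch is a hypothetical stand-in that only approximates what Snakemake's expand does for this simple case (it is not Snakemake's implementation), and the species list is assumed to be the two dummy files created above.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Rough stand-in for Snakemake's expand(): fill `pattern` with every
    combination of the given values (each a string or a list of strings)."""
    names = list(wildcards)
    value_lists = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]


OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"
INP = ["1", "2"]  # assumed: species taken from proteins/1.fasta and proteins/2.fasta

# Substituting the literal text "{hmm}" for hmm leaves that wildcard in place,
# so only species is expanded here.
partially_expanded = expand_sketch(OUTPUT_FILE, species=INP, hmm="{hmm}")
print(partially_expanded)
# ['output_{hmm}/1_{hmm}.out', 'output_{hmm}/2_{hmm}.out']

# Later, when a job for hmm = "A" is created, the remaining wildcard is filled in.
files_for_A = [p.format(hmm="A") for p in partially_expanded]
print(files_for_A)
# ['output_A/1_A.out', 'output_A/2_A.out']
```

The rule therefore asks for every species file of a single hmm value, which is exactly the set of outputs that rule hmm can produce for that value.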
In the workflow below, I modified rule hmm so that it runs quickly for testing; the complete workflow then looks like this:
ARCHIVE_FILE = "output.tar.gz"
# a single output file
OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"
# a single input file
INPUT_FILE = "proteins/{species}.fasta"
# a single hmm file
HMM_FILE = "hmm/{hmm}.hmm"
# a single cat file
CAT_FILE = "cat/cat_{hmm}.txt"
# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species
# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm
# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)
# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)
# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input:
        ARCHIVE_FILE,

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE,
        hmm=HMM_FILE,
    output:
        touch(OUTPUT_FILE),
    params:
        # cmd='hmmsearch --noali -E 99 --tblout'
        cmd="echo",
    shell:
        "{params.cmd} {output} {input.hmm} {input.species} "

rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "

# create an archive with all results
rule create_archive:
    input:
        CAT,
    output:
        ARCHIVE_FILE,
    shell:
        "tar -czvf {output} {input}"
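Answer:
A Snakefile is Python code, so all file references should be strings or path-like objects (or variables/functions that return such objects). Also, input files should normally be specific files rather than directories.
There are several ways to solve this; one of them is to glob the files in Python and pass them explicitly: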
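As a rough illustration of the point about strings, the snippet below checks whether the path from the question parses as a Python expression with and without quotes. This only approximates how Snakemake handles the contents of an input section, and is_valid_python is a hypothetical helper written for this demo.

```python
def is_valid_python(expr):
    """Return True if `expr` compiles as a single Python expression."""
    try:
        compile(expr, "<snakefile-fragment>", "eval")
        return True
    except SyntaxError:
        return False


unquoted = "output_{hmm}/*"    # what the original rule wrote under input:
quoted = '"output_{hmm}/*"'    # the same path as a string literal

print(is_valid_python(unquoted))  # False
print(is_valid_python(quoted))    # True
```

The unquoted form is rejected before any file matching happens, which is consistent with the "invalid syntax" error reported in the question.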
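from glob import glob

rule concatenate:
    input:
        # An input function runs per job, once the hmm wildcard is known.
        # A bare glob("output_{hmm}/*") would run when the Snakefile is
        # parsed, with the braces taken literally.
        files = lambda wildcards: glob("output_{}/*".format(wildcards.hmm)),
    output:
        combined = "output_{hmm}/cat_{hmm}.txt",
    params:
        cmd='cat'
    shell:
        # redirect cat's output into the combined file
        '{params.cmd} {input.files} > {output.combined}'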
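Note that, as it stands, this rule will cause problems on repeated runs: the concatenated file ("output_{hmm}/cat_{hmm}.txt") is written into the same directory that is being globbed, so it will be overwritten when the rule runs again.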
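One caveat when globbing for rule inputs: Python's glob treats braces literally, so a pattern such as "output_{hmm}/*" only matches a directory that is literally named output_{hmm}. The sketch below shows this in a temporary directory, together with a hypothetical input function, concat_inputs, that globs after the wildcard value is known. The root parameter exists only so the demo can run in a temporary directory, and the "*.out" pattern is an assumption based on the file names in the question.

```python
import os
import tempfile
from glob import glob
from types import SimpleNamespace


def concat_inputs(wildcards, root="."):
    """Hypothetical input function: glob only once the hmm value is known.

    Matching '*.out' instead of '*' leaves out a cat_{hmm}.txt file that
    sits in the same directory.
    """
    pattern = os.path.join(root, "output_{}".format(wildcards.hmm), "*.out")
    return sorted(glob(pattern))


with tempfile.TemporaryDirectory() as root:
    # Dummy layout: two search results plus an earlier concatenated file.
    os.makedirs(os.path.join(root, "output_A"))
    for name in ("1_A.out", "2_A.out", "cat_A.txt"):
        open(os.path.join(root, "output_A", name), "w").close()

    # The braces are not a wildcard for glob, so nothing matches.
    literal_matches = glob(os.path.join(root, "output_{hmm}", "*"))

    # SimpleNamespace stands in for the wildcards object Snakemake would pass.
    found = [
        os.path.basename(p)
        for p in concat_inputs(SimpleNamespace(hmm="A"), root)
    ]

print(literal_matches)  # []
print(found)            # ['1_A.out', '2_A.out']
```

Keep in mind that a glob only sees files that already exist when it is evaluated, so files that an upstream rule has yet to create will not be picked up this way.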
