An array of values as input in a Snakemake workflow

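I have started migrating my workflows from Nextflow to Snakemake and am already hitting a wall at the very start of my pipelines, which usually begin with a list of numbers (representing the "run numbers" of our detector).

For example, I have a run-list.txt like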

# detector_id run_number
75 63433
75 67325
42 57584
42 57899
42 58998

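which then needs to be passed line by line to a process that queries a database or a data storage system and retrieves a file to the local system.

This means that, for example, the line 75 63433 would generate the output RUN_00000075_00063433.h5 via a rule that receives detector_id=75 and run_number=63433 as input parameters.

With Nextflow this is fairly easy: you just define a process that emits a tuple of these values.

I don't quite understand how I can do something like this in Snakemake, since inputs and outputs always seem to need to be files (remote or local). In fact, some of the files are indeed accessible via iRODS and/or XRootD, but even then I would need to start with a run selection first, which is defined in a list like run-list.txt above.

My question is now: what is the Snakemake-style approach to this problem?

A non-working pseudocode would be: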

rule:
    input:
        [line for line in open("run-list.txt").readlines()]
    output:
        "{detector_id}_{run_number}.h5"
    shell:
        "detector_id, run_number = line.split()"
        "touch "{detector_id}_{run_number}.h5""

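uj5u.com user replied:

In Snakemake, you would use this file to generate the list of values to feed into the workflow. You would parse the detector IDs and run numbers outside of any rule. If you are willing to use an external library, your run list looks like it could be handled neatly with pandas.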

import pandas as pd

run_list = pd.read_csv("run-list.txt", header=0, names=["detector_id", "run_number"], sep=" ")
detector_ids = list(run_list["detector_id"])
run_numbers = list(run_list["run_number"])

Then, assuming your file names do not need zero padding, the rule that fetches a single file would be:

rule do_something:
    output: "{detector_id}_{run_number}.h5"
    shell: "do_something_with {wildcards.detector_id} {wildcards.run_number}"

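With this rule alone, detector_id and run_number could in theory be anything, so you need something that tells Snakemake to run it in a way that produces the outputs you want. To run it for every line in your file, you set up a rule that takes as input all of the potential outputs defined by the file.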

rule run_all:
    input: expand("{detector_id}_{run_number}.h5", zip, detector_id=detector_ids, run_number=run_numbers)

The zip part ensures that the first detector ID goes with the first run number, and so on.
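To make the effect of zip concrete, here is a plain-Python sketch of the two ways the values can be combined. It does not call Snakemake; it only imitates the pairing with the standard library, and the two-entry lists are shortened sample data taken from the run list in the question.

```python
from itertools import product

# Shortened sample data (two lines of the run list from the question).
detector_ids = [75, 42]
run_numbers = [63433, 57584]
pattern = "{detector_id}_{run_number}.h5"

# Pairwise combination: first ID with first run number, and so on.
# This mirrors what passing zip to expand() asks for.
zipped = [
    pattern.format(detector_id=d, run_number=r)
    for d, r in zip(detector_ids, run_numbers)
]

# Every ID with every run number, which is what expand() does
# when no combinator such as zip is given.
crossed = [
    pattern.format(detector_id=d, run_number=r)
    for d, r in product(detector_ids, run_numbers)
]

print(zipped)
print(crossed)
```

Without zip, the list would also contain files such as 75_57584.h5, which do not correspond to any line of run-list.txt.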

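Finally, you run it by specifying the name of the rule to run, so snakemake run_all (recent Snakemake versions also require a core count, for example snakemake --cores 1 run_all).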

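uj5u.com user replied:

To make this work, you need two ingredients:

a rule that specifies the logic for generating a single file (defining any file dependencies, if necessary)
a rule that defines which files should be computed; by convention, this rule is called all.

Here is a rough sketch of the code: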

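def process_lines(file_name):
    """generates id/run, ignoring non-numeric lines"""
    with open(file_name, "r") as f:
        for line in f:
            fields = line.split()
            # Skip blank or short lines, which would break the unpacking below.
            if len(fields) < 2:
                continue
            detector_id, run_number, *_ = fields
            # The "# detector_id run_number" header fails this check and is skipped.
            if detector_id.isnumeric() and run_number.isnumeric():
                # Pad both values with zeros to 8 digits.
                detector_id = detector_id.zfill(8)
                run_number = run_number.zfill(8)
                yield detector_id, run_number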


out_file_format = "{detector_id}_{run_number}.h5"
final_files = [
    out_file_format.format(detector_id=detector_id, run_number=run_number)
    for detector_id, run_number in process_lines("run-list.txt")
]


rule all:
    """Collect all outputs."""
    input:
        final_files,


rule:
    """Generate an output"""
    output:
        out_file_format,
    shell:
        """
        echo {wildcards[detector_id]}
        echo {wildcards[run_number]}
        echo {output}
        # Create the declared output so Snakemake does not report it as missing.
        touch {output}
        """

uj5u.com user replied:

There are already some good answers, but since I put code together in the meantime, here is my 2p. Save it as Snakefile and it should run.

import pandas

# In reality you read this from file using pandas.read_csv.
# Or use a solution other than pandas dataframes.
run_list = [(75, 63433),
(75, 67325),
(42, 57584),
(42, 57899),
(42, 58998)]

run_list = pandas.DataFrame(run_list, columns= ['detector_id', 'run_id'])

rule all:
    input:
        expand('RUN_{detector_id}_{run_id}.h5', zip, detector_id= run_list.detector_id, run_id= run_list.run_id),

rule make_run:
    output:
        'RUN_{detector_id}_{run_id}.h5',
    shell:
        r"""
        touch {output}
        """

You will need some string manipulation for the zero padding, but that is a Python matter, not a Snakemake one.
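As a minimal sketch of that string manipulation, assuming the 8-digit padding implied by the RUN_00000075_00063433.h5 example in the question, a format specification is enough. The helper name run_file_name is made up for this illustration.

```python
def run_file_name(detector_id: int, run_number: int) -> str:
    """Build a name such as RUN_00000075_00063433.h5 (8-digit zero padding assumed)."""
    return f"RUN_{detector_id:08d}_{run_number:08d}.h5"


# Sample data: three lines of the run list from the question.
run_list = [(75, 63433), (75, 67325), (42, 57584)]
targets = [run_file_name(d, r) for d, r in run_list]

print(targets)
```

A list like targets could then be passed as the input of rule all. Inside the generating rule the wildcards would hold the padded strings, so something like int(wildcards.detector_id) would recover the plain number if the download command needs it.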
