How do I use wildcards inside the arguments of Snakemake's expand function?

I have a JSON file like this:

{
    "foo": {
        "bar1": 
            {"A1": {"name": "A1", "path": "/path/to/A1"}, 
             "B1": {"name": "B1", "path": "/path/to/B1"},
             "C1": {"name": "C1", "path": "/path/to/C1"},
             "D1": {"name": "D1", "path": "/path/to/D1"}},
        "bar2": 
            {"A2": {"name": "A2", "path": "/path/to/A2"}, 
             "B2": {"name": "B2", "path": "/path/to/B2"},
             "C2": {"name": "C2", "path": "/path/to/C2"},
             "D2": {"name": "D2", "path": "/path/to/D2"}}}
}

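I'm trying to run my Snakemake pipeline on the samples in sample sets 'bar1' and 'bar2' separately, putting the results into their own folders. When I expand my wildcards I don't want every combination of sample set and sample; I only want each sample within its own set, like this: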

tmp/bar1/A1.bam
tmp/bar1/B1.bam
tmp/bar1/C1.bam
tmp/bar1/D1.bam
tmp/bar2/A2.bam
tmp/bar2/B2.bam
tmp/bar2/C2.bam
tmp/bar2/D2.bam

Hopefully my Snakefile helps explain. I've tried writing it like this:

sample_sets = [ i for i in config['foo'] ]

samples_dict = config['foo'] #cleans it up

def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = get_samples), sample_set = sample_sets),

This doesn't work; my file names end up with "<function get_samples at 0x7f6e00544320>" in them! I also tried:

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = list(samples_dict["{{sample_set}}"].keys()), sample_set = sample_sets),

But that raises a KeyError. I also tried this:

rule all:
    input:
        [ ["tmp/{{sample_set}}/{sample}.aligned_bam.core.bam".format( sample = sample ) for sample in list(samples_dict[sample_set].keys())] for sample_set in sample_sets ]

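That one gives a "Wildcards in input files cannot be determined from output files: 'sample_set'" error.

I feel like there must be a simple way to do this, and maybe I'm just being an idiot.

Any help would be greatly appreciated! Let me know if I've left out any details.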

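A uj5u.com user replied:

expand can take a custom combinatoric function, most commonly zip. In your case, however, the nested dictionary shape would need a purpose-built function. A simpler solution is to use plain Python to build the list of required files.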

d = {
    "foo": {
        "bar1": {
            "A1": {"name": "A1", "path": "/path/to/A1"},
            "B1": {"name": "B1", "path": "/path/to/B1"},
            "C1": {"name": "C1", "path": "/path/to/C1"},
            "D1": {"name": "D1", "path": "/path/to/D1"},
        },
        "bar2": {
            "A2": {"name": "A2", "path": "/path/to/A2"},
            "B2": {"name": "B2", "path": "/path/to/B2"},
            "C2": {"name": "C2", "path": "/path/to/C2"},
            "D2": {"name": "D2", "path": "/path/to/D2"},
        },
    }
}

list_files = []

for key in d["foo"]:
    for nested_key in d["foo"][key]:
        _tmp = f"tmp/{key}/{nested_key}.bam"
        list_files.append(_tmp)

print(*list_files, sep="\n")
#tmp/bar1/A1.bam
#tmp/bar1/B1.bam
#tmp/bar1/C1.bam
#tmp/bar1/D1.bam
#tmp/bar2/A2.bam
#tmp/bar2/B2.bam
#tmp/bar2/C2.bam
#tmp/bar2/D2.bam
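
As for why the attempts in the question fail, all three can be reproduced in plain Python. This is a minimal sketch with no Snakemake import, on the assumption that expand fills its pattern with ordinary string formatting (which the function repr in the file names suggests). The get_samples stub and the one-entry samples_dict are stand-ins for the question's objects.

```python
def get_samples(wildcards):
    # Stand-in for the input function from the question; never called here.
    return ["A1", "B1"]

samples_dict = {"bar1": {"A1": {}}}  # trimmed stand-in for config['foo']
pattern = "tmp/{{sample_set}}/{sample}.bam"

# Attempt 1: the function object is passed as a value, so it is formatted
# with its repr instead of being called.
first = pattern.format(sample=get_samples)

# Attempt 2: the subscript is evaluated by Python before any formatting,
# so the literal string "{{sample_set}}" is used as the dictionary key.
second_key_exists = "{{sample_set}}" in samples_dict

# Attempt 3: doubled braces are an escape, so a single .format() call
# leaves a literal {sample_set} behind in the path.
third = pattern.format(sample="A1")

print(first)
print(second_key_exists)  # False
print(third)              # tmp/{sample_set}/A1.bam
```

In the third case nothing fills {sample_set} afterwards, and rule all has no output from which Snakemake could infer it, which matches the reported error.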

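A uj5u.com user replied:

@SultanOrazbayev is right; I'm just offering a couple of alternatives.

If you like the loop, the pythonic way to write it is a list comprehension. With a huge list of files you may notice a performance improvement.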

list_files = [
    f"tmp/{key}/{nested_key}.bam"
    for key in d["foo"]
    for nested_key in d["foo"][key]
]
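
In a Snakefile the dictionary would normally come from config rather than being written out. Here is a minimal sketch of the same comprehension wrapped in a helper. The name target_files and its pattern parameter are hypothetical, and the rule all usage in the trailing comment assumes the JSON from the question is loaded as the config file.

```python
def target_files(samples_dict, pattern="tmp/{sample_set}/{sample}.bam"):
    """Return one path per (sample_set, sample) pair, without cross-combining."""
    return [
        pattern.format(sample_set=sample_set, sample=sample)
        for sample_set, samples in samples_dict.items()
        for sample in samples
    ]

# Small stand-in for config["foo"]
example = {"bar1": {"A1": {}, "B1": {}}, "bar2": {"A2": {}}}
print(target_files(example))
# ['tmp/bar1/A1.bam', 'tmp/bar1/B1.bam', 'tmp/bar2/A2.bam']

# Assumed usage inside the Snakefile:
# rule all:
#     input:
#         target_files(config["foo"])
```

Because the helper is called while the Snakefile is parsed, rule all receives a plain list of concrete paths and no wildcard needs to be resolved.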

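The only way I can think of to use expand is basically to build the same list. I pass it as dicts so the wildcard names are kept as well, although tuples would be more efficient. expand is worth it if you keep the file name pattern in a config variable and can't easily format it, want to keep meaningful wildcard names, or use allow_missing for other wildcards: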

wcs = [{'sample_set': sample_set, 'sample': sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
    ]


list_files = expand("tmp/{sample_set}/{sample}.bam", zip, 
        sample_set=[wc['sample_set'] for wc in wcs],
        sample=[wc['sample'] for wc in wcs],
        )
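
To see what zip changes, here is a plain-Python emulation of the two combination modes. It is only a sketch of the behaviour: expand_like is a made-up helper, not Snakemake's implementation. The default mode combines every value with every other value, while zip pairs values position by position.

```python
from itertools import product

def expand_like(pattern, combinator=product, **wildcards):
    """Hypothetical stand-in for expand: format pattern once per combination."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in combinator(*wildcards.values())
    ]

pattern = "tmp/{sample_set}/{sample}.bam"

# zip: the two lists are walked in lockstep, so they must be pre-paired.
paired = expand_like(
    pattern, zip,
    sample_set=["bar1", "bar1", "bar2"],
    sample=["A1", "B1", "A2"],
)

# product (the default here): every sample set meets every sample.
crossed = expand_like(pattern, sample_set=["bar1", "bar2"], sample=["A1", "A2"])

print(paired)   # ['tmp/bar1/A1.bam', 'tmp/bar1/B1.bam', 'tmp/bar2/A2.bam']
print(crossed)  # also contains tmp/bar1/A2.bam and tmp/bar2/A1.bam
```

This is why the two keyword lists above are built from the same wcs list: with zip, position i of sample_set must belong with position i of sample.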

Sometimes the Snakemake way isn't the pythonic way!

