A New Generative Model That Beats GANs: Score-Based Models (Diffusion Models), Their Principles, Network Architecture, Applications, Code, Experiments, and Outlook

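Preface: Over the past two years, some twenty to thirty papers on score-based generative models have appeared at top conferences such as NeurIPS, ICCV, and CVPR. This is a new class of generative model, and some papers have openly claimed to "beat GANs". The new way of generating samples, together with results that lead GANs and VAEs in some domains, has drawn more and more people into this line of research.

Could it be the next GANs? Can it solve the problems GANs currently face?
What are its advantages and disadvantages compared with existing generative models?
What do current network architectures look like?
How is it implemented in code?
Which datasets are commonly used?
Which evaluation metrics are commonly used?
Which fields can it be applied to?
What problems has it run into?
What are the bottlenecks to its development?
How will it develop in the future?

This article explores these questions.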

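Overview of Principles

Why Is It Called Score-Based?

Langevin Dynamics

Score-Based Models and Diffusion Models

3D Point Cloud Reconstruction Task

Network Architecture

UNet

Denoising Score Matching

GANs, DPM, and DDPM

Advantages of GANs

Disadvantages of GANs

Advantages of DDPM/DPM

Disadvantages of DDPM/DPM

Common Evaluation Metrics

Common Datasets

1D Sketches

2D Images

3D Models

Application Areas

References: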

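Overview of Principles

The idea is to estimate the score function from data and generate new samples with Langevin dynamics. The core physical background of both score-based models and diffusion models is therefore Langevin dynamics.

The estimated score function is inaccurate in regions with no training data, so when a sampling trajectory enters such regions, Langevin dynamics may fail to converge correctly. As a remedy, the data is perturbed with Gaussian noise of different intensities, and the score functions of all the noise-perturbed data distributions are estimated jointly. At inference time, the information from all noise scales is combined with Langevin dynamics by sampling from each noise-perturbed distribution in sequence.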
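To make the multi-scale perturbation concrete, here is a minimal NumPy sketch. It assumes a geometric sequence of noise levels, which is a common choice but is not specified in the text above. The function names `geometric_sigmas` and `perturb`, the toy 2-D data, and the values of `sigma_max`, `sigma_min`, and `num_scales` are all illustrative assumptions.

```python
import numpy as np


def geometric_sigmas(sigma_max, sigma_min, num_scales):
    """Noise levels decreasing geometrically from sigma_max to sigma_min (assumed schedule)."""
    return np.geomspace(sigma_max, sigma_min, num_scales)


def perturb(data, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma to every sample."""
    return data + sigma * rng.standard_normal(data.shape)


rng = np.random.default_rng(0)
# Toy 2-D dataset: a tight cluster around (2, -1).
data = 0.1 * rng.standard_normal((1000, 2)) + np.array([2.0, -1.0])

sigmas = geometric_sigmas(sigma_max=10.0, sigma_min=0.01, num_scales=10)
# One perturbed copy of the dataset per noise level, from heavily to lightly corrupted.
perturbed = [perturb(data, sigma, rng) for sigma in sigmas]
```

Large noise levels fill the low-density regions where the score would otherwise be poorly estimated, while small noise levels stay close to the original data distribution.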

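Compared with GANs, the most significant advantages are:

GAN-level sample quality without adversarial training. It is well known that GANs are hard to train, mainly because implicit generative models such as GANs require adversarial training, which is often very unstable. (PS: training score-based models is not simple either.)
Flexible model architectures.
Exact log-likelihood computation.
Solving inverse problems without retraining the model: the trained model can be used directly for sampling-based reconstruction, without training a separate feature network as StyleGAN-style models require.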

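Why Is It Called Score-Based?

GANs are implicit generative models and depend on adversarial training. Likelihood-based models such as VAEs, autoregressive models, and normalizing flows instead have to keep the normalizing constant tractable (more on this below) so that the likelihood is easy to compute. This usually places strong restrictions on the network architecture, so the network cannot be organized and designed freely the way NAS would allow, or else the model must rely on surrogate objectives to approximate maximum likelihood training.

A score-based model, by contrast, models the gradient of the log PDF, a quantity known as the score function, and therefore does not have to deal with the normalizing constant that likelihood-based models require.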
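A quick numerical check shows why the normalizing constant drops out. The sketch below is a toy illustration, not part of any library: it differentiates an unnormalized 1-D Gaussian log density with two different constant factors and compares the resulting scores. The helper names and the finite-difference step `eps` are assumptions made for this example.

```python
import numpy as np


def unnormalized_log_density(x, scale=1.0):
    """log of scale * exp(-x^2 / 2); `scale` stands in for an unknown normalizing constant."""
    return np.log(scale) - 0.5 * x ** 2


def numerical_score(log_density, x, eps=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + eps) - log_density(x - eps)) / (2.0 * eps)


x = np.linspace(-3.0, 3.0, 7)
score_a = numerical_score(lambda v: unnormalized_log_density(v, scale=1.0), x)
score_b = numerical_score(lambda v: unnormalized_log_density(v, scale=123.0), x)
# Both estimates agree with the analytic score -x, whatever the constant factor is.
```

Because the constant only adds a term that does not depend on x to the log density, its derivative vanishes and the two scores coincide.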

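The score function is written as $s_\theta(x) \approx \nabla_x \log p(x)$. Our task is to minimize the Fisher divergence between the model and the data distribution:

$$\mathbb{E}_{p(x)}\left[\left\lVert \nabla_x \log p(x) - s_\theta(x) \right\rVert_2^2\right]$$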

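Langevin Dynamics

Langevin dynamics performs Markov chain Monte Carlo (MCMC) sampling from the true data distribution $p(x)$ using only the score function. The iteration is:

$$x_{i+1} = x_i + \epsilon \, \nabla_x \log p(x_i) + \sqrt{2\epsilon} \, z_i, \qquad z_i \sim \mathcal{N}(0, I), \quad i = 0, 1, \dots, K$$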
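The Langevin update (move along the score by a step size, then add Gaussian noise scaled by the square root of twice the step size) can be sketched in a few lines. This is a toy example that assumes the score is known analytically for a 1-D Gaussian target. In a real model a trained score network would take the place of `gaussian_score`. The step size and the number of steps are illustrative assumptions.

```python
import numpy as np


def gaussian_score(x, mean=2.0, std=0.5):
    """Analytic score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / std ** 2


def langevin_sample(score_fn, x0, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + step * score(x) + sqrt(2 * step) * z."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x


rng = np.random.default_rng(0)
x0 = rng.uniform(-5.0, 5.0, size=5000)  # 5000 independent chains, arbitrary start
samples = langevin_sample(gaussian_score, x0, step_size=1e-3, num_steps=2000, rng=rng)
# The sample mean and standard deviation end up close to the target's 2.0 and 0.5.
```

Only the score is evaluated here. The density itself, and hence its normalizing constant, is never needed.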

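Score-Based Models and Diffusion Models

Score-based models and diffusion models are much the same in principle. Interested readers can refer to the previous article in this series:

"Diffusion Models and Deep Learning (with Python Examples)"

That article focuses on the path from the physical background to deep learning, the mathematical derivations, and a code example of a general diffusion process, so those topics are not repeated here.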

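3D Point Cloud Reconstruction Task

1. This is a conditional generation problem, because the Markov chain under consideration generates point clouds conditioned on a shape latent. This conditioning leads to training and sampling schemes that differ significantly from earlier work on diffusion probabilistic models.
2. DDPMs for 2D images cannot be generalized directly to point clouds, because points in 3D space are sampled in an irregular pattern rather than on the regular grid structure that underlies images.
3. Since a point cloud consists of discrete points in 3D space, the points are treated as particles in a non-equilibrium thermodynamic system in contact with a heat bath. Under the effect of the heat bath, the particle positions evolve stochastically, so that the particles diffuse and eventually spread over the space.
4. By adding noise at each time step, the initial distribution of the particles is transformed into a simple noise distribution.
5. The diffusion process connects the point distribution of the point cloud to the noise distribution. To model the point distribution for point cloud generation, the reverse diffusion process is considered, which recovers the target point distribution from the noise distribution.
6. The reverse diffusion process is modeled as a Markov chain that transforms the noise distribution into the target distribution. The goal is to learn its transition kernel so that the Markov chain can reconstruct the desired shape. Because the Markov chain only models the point distribution, it cannot generate point clouds of different shapes on its own. A shape latent is therefore introduced as the condition of the transition kernel. In the generation setting, the shape latent follows a prior distribution that is parameterized with normalizing flows to increase the model's expressiveness. In the auto-encoding setting, the shape latent is learned end-to-end.
7. The training objective is formulated as maximizing the variational lower bound of the point cloud likelihood conditioned on the shape latent, and is further rewritten as a tractable closed-form expression.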
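The forward (noising) half of steps 3 to 5 can be sketched as follows. This is a minimal NumPy illustration that assumes the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps and a linear beta schedule. The schedule values, the number of steps, and the toy sphere "shape" are assumptions for the example, not the settings of the point cloud paper. The conditional reverse process with a shape latent is not shown.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)


def diffuse_points(points, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) for a point cloud of shape [N, 3]."""
    noise = rng.standard_normal(points.shape)
    noisy = np.sqrt(alpha_bars[t]) * points + np.sqrt(1.0 - alpha_bars[t]) * noise
    return noisy, noise


rng = np.random.default_rng(0)
# Toy shape: 2048 points on the unit sphere.
points = rng.standard_normal((2048, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)

betas = linear_beta_schedule(num_steps=1000)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

x_early, _ = diffuse_points(points, 10, alpha_bars, rng)   # still close to the sphere
x_late, _ = diffuse_points(points, 999, alpha_bars, rng)   # close to pure Gaussian noise
```

Each point is noised independently, which is what lets the particles "spread over the space" regardless of the irregular sampling pattern of the cloud.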

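Network Architecture

UNet

U-Net is famous in medical imaging. Its strength is that it captures richer, multi-scale information, and the original paper, "U-Net: Convolutional Networks for Biomedical Image Segmentation", is well worth a careful read. The UNet model uses a stack of residual layers and downsampling convolutions, followed by a stack of residual layers with upsampling convolutions, with skip connections linking layers of the same spatial size. It also uses a single-head global attention layer at the 16×16 resolution and adds a projection of the timestep embedding to each residual block.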
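Before it is projected into each residual block, the timestep is usually turned into a vector first. The sketch below shows one common way to do this, a Transformer-style sinusoidal embedding, written in NumPy. It is an assumption about how such an embedding can be built, not the implementation used by the code in this article. The function name and the `max_period` default are illustrative.

```python
import numpy as np


def sinusoidal_timestep_embedding(timesteps, dim, max_period=10000.0):
    """Map a 1-D batch of timesteps to [N, dim] sinusoidal features."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(timesteps, dtype=float)[:, None] * freqs[None, :]
    emb = np.concatenate([np.cos(args), np.sin(args)], axis=-1)
    if dim % 2 == 1:
        # Pad with a zero column so the output width is exactly `dim`.
        emb = np.concatenate([emb, np.zeros((emb.shape[0], 1))], axis=-1)
    return emb


emb = sinusoidal_timestep_embedding([0, 1, 500, 999], dim=128)  # shape (4, 128)
```

In a network, a small MLP would typically map this vector to the embedding width, and each residual block would add its own linear projection of the result to its feature maps.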

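The first paper to use U-Net in score-based models was: Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models, 2020.

Most follow-up work tweaks the network architecture proposed in that paper. The classic UNet model class is shown below. To reuse it, simply subclass it.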

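import torch as th
import torch.nn as nn

# NOTE: SiLU, linear, conv_nd, normalization, zero_module, timestep_embedding,
# convert_module_to_f16, convert_module_to_f32, ResBlock, AttentionBlock,
# Downsample, Upsample and TimestepEmbedSequential are not shown in this
# article; they are assumed to be defined elsewhere in the same codebase.


class UNetModel(nn.Module):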

    def __init__(
        self,
        in_channels,
        model_channels,
        out_channels,
        num_res_blocks,
        attention_resolutions,
        dropout=0,
        channel_mult=(1, 2, 4, 8),
        conv_resample=True,
        dims=2,
        num_classes=None,
        use_checkpoint=False,
        num_heads=1,
        num_heads_upsample=-1,
        use_scale_shift_norm=False,
    ):
        super().__init__()

        if num_heads_upsample == -1:
            num_heads_upsample = num_heads

        self.in_channels = in_channels
        self.model_channels = model_channels
        self.out_channels = out_channels
        self.num_res_blocks = num_res_blocks
        self.attention_resolutions = attention_resolutions
        self.dropout = dropout
        self.channel_mult = channel_mult
        self.conv_resample = conv_resample
        self.num_classes = num_classes
        self.use_checkpoint = use_checkpoint
        self.num_heads = num_heads
        self.num_heads_upsample = num_heads_upsample

        time_embed_dim = model_channels * 4
        self.time_embed = nn.Sequential(
            linear(model_channels, time_embed_dim),
            SiLU(),
            linear(time_embed_dim, time_embed_dim),
        )

        if self.num_classes is not None:
            self.label_emb = nn.Embedding(num_classes, time_embed_dim)

        self.input_blocks = nn.ModuleList(
            [
                TimestepEmbedSequential(
                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
                )
            ]
        )
        input_block_chans = [model_channels]
        ch = model_channels
        ds = 1
        for level, mult in enumerate(channel_mult):
            for _ in range(num_res_blocks):
                layers = [
                    ResBlock(
                        ch,
                        time_embed_dim,
                        dropout,
                        out_channels=mult * model_channels,
                        dims=dims,
                        use_checkpoint=use_checkpoint,
                        use_scale_shift_norm=use_scale_shift_norm,
                    )
                ]
                ch = mult * model_channels
                if ds in attention_resolutions:
                    layers.append(
                        AttentionBlock(
                            ch, use_checkpoint=use_checkpoint, num_heads=num_heads
                        )
                    )
                self.input_blocks.append(TimestepEmbedSequential(*layers))
                input_block_chans.append(ch)
            if level != len(channel_mult) - 1:
                self.input_blocks.append(
                    TimestepEmbedSequential(Downsample(ch, conv_resample, dims=dims))
                )
                input_block_chans.append(ch)
                ds *= 2

        self.middle_block = TimestepEmbedSequential(
            ResBlock(
                ch,
                time_embed_dim,
                dropout,
                dims=dims,
                use_checkpoint=use_checkpoint,
                use_scale_shift_norm=use_scale_shift_norm,
            ),
            AttentionBlock(ch, use_checkpoint=use_checkpoint, num_heads=num_heads),
            ResBlock(
                ch,
                time_embed_dim,
                dropout,
                dims=dims,
                use_checkpoint=use_checkpoint,
                use_scale_shift_norm=use_scale_shift_norm,
            ),
        )

        self.output_blocks = nn.ModuleList([])
        for level, mult in list(enumerate(channel_mult))[::-1]:
            for i in range(num_res_blocks + 1):
                layers = [
                    ResBlock(
                        ch + input_block_chans.pop(),
                        time_embed_dim,
                        dropout,
                        out_channels=model_channels * mult,
                        dims=dims,
                        use_checkpoint=use_checkpoint,
                        use_scale_shift_norm=use_scale_shift_norm,
                    )
                ]
                ch = model_channels * mult
                if ds in attention_resolutions:
                    layers.append(
                        AttentionBlock(
                            ch,
                            use_checkpoint=use_checkpoint,
                            num_heads=num_heads_upsample,
                        )
                    )
                if level and i == num_res_blocks:
                    layers.append(Upsample(ch, conv_resample, dims=dims))
                    ds //= 2
                self.output_blocks.append(TimestepEmbedSequential(*layers))

        self.out = nn.Sequential(
            normalization(ch),
            SiLU(),
            zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),
        )

    def convert_to_fp16(self):
        """
        Convert the torso of the model to float16.
        """
        self.input_blocks.apply(convert_module_to_f16)
        self.middle_block.apply(convert_module_to_f16)
        self.output_blocks.apply(convert_module_to_f16)

    def convert_to_fp32(self):
        """
        Convert the torso of the model to float32.
        """
        self.input_blocks.apply(convert_module_to_f32)
        self.middle_block.apply(convert_module_to_f32)
        self.output_blocks.apply(convert_module_to_f32)

    @property
    def inner_dtype(self):
        """
        Get the dtype used by the torso of the model.
        """
        return next(self.input_blocks.parameters()).dtype

    def forward(self, x, timesteps, y=None):
        """
        Apply the model to an input batch.

        :param x: an [N x C x ...] Tensor of inputs.
        :param timesteps: a 1-D batch of timesteps.
        :param y: an [N] Tensor of labels, if class-conditional.
        :return: an [N x C x ...] Tensor of outputs.
        """
        assert (y is not None) == (
            self.num_classes is not None
        ), "must specify y if and only if the model is class-conditional"

        hs = []
        emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))

        if self.num_classes is not None:
            assert y.shape == (x.shape[0],)
            emb = emb + self.label_emb(y)

        h = x.type(self.inner_dtype)
        # At this point h has the same shape as the input batch.
        # Downsampling path: keep every intermediate activation for the skip connections.
        for module in self.input_blocks:
            h = module(h, emb)  # residual/attention blocks or a downsampling layer
            hs.append(h)
        # Middle (bottleneck) block.
        h = self.middle_block(h, emb)
        # Upsampling path: concatenate the matching skip activation along the channel axis.
        for module in self.output_blocks:
            cat_in = th.cat([h, hs.pop()], dim=1)
            h = module(cat_in, emb)
        h = h.type(x.dtype)
        return self.out(h)

    def get_feature_vectors(self, x, timesteps, y=None):
        """
        Apply the model and return all of the intermediate tensors.

        :param x: an [N x C x ...] Tensor of inputs.
        :param timesteps: a 1-D batch of timesteps.
        :param y: an [N] Tensor of labels, if class-conditional.
        :return: a dict with the following keys:
                 - 'down': a list of hidden state tensors from downsampling.
                 - 'middle': the tensor of the output of the lowest-resolution
                             block in the model.
                 - 'up': a list of hidden state tensors from upsampling.
        """
        hs = []
        emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
        if self.num_classes is not None:
            assert y.shape == (x.shape[0],)
            emb = emb + self.label_emb(y)
        result = dict(down=[], up=[])
        h = x.type(self.inner_dtype)
        for module in self.input_blocks:
            h = module(h, emb)
            hs.append(h)
            result["down"].append(h.type(x.dtype))
        h = self.middle_block(h, emb)
        result["middle"] = h.type(x.dtype)
        for module in self.output_blocks:
            cat_in = th.cat([h, hs.pop()], dim=1)
            h = module(cat_in, emb)
            result["up"].append(h.type(x.dtype))
        return result

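Denoising Score Matching

U-Net arrived in this field relatively late: the seminal paper was only published in 2020. Before that, the field generally relied on denoising score matching.

This approach first learns the score function through denoising score matching. Intuitively, this means training a neural network (called the score network) to denoise images blurred by Gaussian noise. A key point is to perturb the data with multiple noise scales so that the score network captures both coarse-grained and fine-grained image features. However, choosing these noise scales is a very tricky problem.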
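The denoising score matching objective for a single noise level can be sketched as a plain regression: perturb a clean sample with Gaussian noise, then regress the model output onto the score of the perturbation kernel, -(noisy - clean) / sigma^2. To keep the example self-contained, a linear model fitted by least squares stands in for the score network. That substitution, the 1-D Gaussian toy data, and all names below are assumptions for illustration.

```python
import numpy as np


def make_dsm_pairs(clean, sigma, rng):
    """Perturb clean samples and return (noisy samples, regression targets).

    The target is the score of the Gaussian perturbation kernel:
    grad_noisy log N(noisy; clean, sigma^2) = -(noisy - clean) / sigma^2.
    """
    noisy = clean + sigma * rng.standard_normal(clean.shape)
    target = -(noisy - clean) / sigma ** 2
    return noisy, target


def dsm_loss(score_fn, noisy, target):
    """Mean squared error between the model score and the DSM target."""
    return float(np.mean((score_fn(noisy) - target) ** 2))


rng = np.random.default_rng(0)
data_mean, data_std, sigma = 1.0, 0.5, 0.5
clean = data_mean + data_std * rng.standard_normal(200_000)
noisy, target = make_dsm_pairs(clean, sigma, rng)

# Toy "score network": s(x) = a * x + b, fitted by least squares, which
# minimizes the DSM loss within this linear model family.
a, b = np.polyfit(noisy, target, deg=1)
fitted_loss = dsm_loss(lambda x: a * x + b, noisy, target)
zero_loss = dsm_loss(lambda x: np.zeros_like(x), noisy, target)

# For Gaussian data the noisy distribution is N(mean, data_std^2 + sigma^2),
# so its true score is -(x - mean) / (data_std^2 + sigma^2).
true_a = -1.0 / (data_std ** 2 + sigma ** 2)
true_b = data_mean / (data_std ** 2 + sigma ** 2)
```

Although the regression targets only involve the noise that was added, the fitted slope and intercept land close to `true_a` and `true_b`, the score of the noise-perturbed data distribution. That is the point of the method: the data density is never evaluated.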

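Second, samples are generated by running Langevin dynamics: starting from white noise, the score network is used to gradually denoise it into an image.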
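Sampling across several noise scales is often done with annealed Langevin dynamics: run a few Langevin steps at the largest noise level, then reuse the result as the starting point for the next, smaller level. The sketch below follows that general recipe on a 1-D toy problem where the score of the noise-perturbed data is known in closed form. In practice a trained score network would replace `perturbed_gaussian_score`. The noise levels, step size rule, and step counts are illustrative assumptions.

```python
import numpy as np


def perturbed_gaussian_score(x, sigma, mean=2.0, std=0.5):
    """Exact score of N(mean, std^2) data after adding N(0, sigma^2) noise."""
    return -(x - mean) / (std ** 2 + sigma ** 2)


def annealed_langevin(score_fn, x0, sigmas, steps_per_scale, base_step, rng):
    """Run Langevin dynamics at each noise level, from the largest to the smallest."""
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        # Step size scaled with the noise level: larger noise, larger steps.
        step = base_step * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_scale):
            noise = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * noise
    return x


rng = np.random.default_rng(0)
sigmas = np.geomspace(5.0, 0.1, 10)        # decreasing noise levels
x0 = rng.uniform(-10.0, 10.0, size=5000)   # start far from the data
samples = annealed_langevin(
    perturbed_gaussian_score, x0, sigmas, steps_per_scale=100, base_step=5e-3, rng=rng
)
# The samples end up distributed roughly like the data perturbed by the smallest noise.
```

The large noise levels pull the chains towards the data from anywhere in space, and the small ones refine the samples once they are already in a high-density region.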

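GANs, DPM, and DDPM

Advantages of GANs

1. Faster sampling in wall-clock time.

Disadvantages of GANs

1. Hard to train: training collapses without carefully chosen hyperparameters and regularizers.
2. GANs can trade diversity for fidelity, producing high-quality samples without covering the whole distribution.
3. Because of the adversarial loss, GAN training can be unstable. (Autoregressive models, for their part, assume a generation order that is unnatural, which may limit the model's flexibility.)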

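Advantages of DDPM/DPM

DDPM = DPM + denoising score matching (denoising autoencoders)

1. They capture more diversity and are usually easier to scale and train than GANs.
2. Distribution coverage, a stationary training objective, and easy scalability.

Disadvantages of DDPM/DPM

1. Sampling is slower than GANs in wall-clock time.
2. They still fall short in visual sample quality.
3. They use multiple denoising steps (and therefore multiple forward passes), so they remain slower than GANs at sampling time.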

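Common Evaluation Metrics

Most papers compare against GANs, so the evaluation metrics are similar to those used for GANs.

FID ("GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium"): captures diversity better than IS and agrees better with human judgment. It is a symmetric measure of the distance between two image distributions in the Inception latent space.
Inception Score ("Improved Techniques for Training GANs"): measures how well a model captures the full ImageNet class distribution while still producing convincing samples of individual classes. One drawback is that it does not reward covering the whole distribution or capturing diversity within a class, and a model that memorizes a small subset of the full dataset can still get a high IS.
Precision ("Improved Precision and Recall Metric for Assessing Generative Models"): mainly describes precision, that is, model fidelity.
Recall: mainly describes recall, measuring diversity and distribution coverage.
Retrieval: comparing retrieval results is also a common way to demonstrate reconstruction quality.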
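The formulas behind FID and IS are short enough to sketch directly. The NumPy code below computes the Fréchet distance between two sets of feature vectors and the Inception Score from a matrix of class probabilities. In real evaluations the features and probabilities come from a pretrained Inception network, which is left out here. The random toy inputs and the function names are assumptions for illustration, so the resulting numbers are not comparable to published scores.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)) for [N, D] feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of the square roots of the eigenvalues of C_a C_b.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def inception_score(probs, eps=1e-12):
    """exp(E_x[KL(p(y|x) || p(y))]) for an [N, num_classes] matrix of class probabilities."""
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))   # stand-in for features of real images
fake = real + 3.0                      # same spread, shifted mean

fid_same = frechet_distance(real, real)    # close to 0
fid_shift = frechet_distance(real, fake)   # close to 4 * 3^2 = 36
is_confident = inception_score(np.eye(10))           # confident and diverse: close to 10
is_uniform = inception_score(np.full((5, 10), 0.1))  # uninformative predictions: close to 1
```

The toy cases mirror the descriptions above: FID grows as the two feature distributions move apart, while IS is high only when each prediction is confident and the predictions together cover many classes.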

Common Datasets

1D Sketches

https://quickdraw.withgoogle.com/

2D Images

ImageNet
LSUN (LMDB format)
FFHQ
CelebA

CIFAR-10, which can be downloaded with the following code:

import os
import tempfile

import torchvision
from tqdm.auto import tqdm

CLASSES = (
    "plane",
    "car",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
)


def main():
    for split in ["train", "test"]:
        out_dir = f"cifar_{split}"
        if os.path.exists(out_dir):
            print(f"skipping split {split} since {out_dir} already exists.")
            continue

        print("downloading...")
        with tempfile.TemporaryDirectory() as tmp_dir:
            dataset = torchvision.datasets.CIFAR10(
                root=tmp_dir, train=split == "train", download=True
            )

        print("dumping images...")
        os.mkdir(out_dir)
        for i in tqdm(range(len(dataset))):
            image, label = dataset[i]
            filename = os.path.join(out_dir, f"{CLASSES[label]}_{i:05d}.png")
            image.save(filename)


if __name__ == "__main__":
    main()

3D Models

ShapeNet: see "Introduction to ShapeNet, Downloading, and a Python Example for binvox Files" (沉迷單車的追風少年, CSDN blog)

Application Areas

Audio modeling:
DiffWave: A Versatile Diffusion Model for Audio Synthesis
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior
Speech synthesis:
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Time series forecasting:
Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting
2D image generation:
Diffusion Models Beat GANs on Image Synthesis
Improved Denoising Diffusion Probabilistic Models
Denoising Diffusion Probabilistic Models
Improved Techniques for Training Score-Based Generative Models
3D point cloud reconstruction:
Diffusion Probabilistic Models for 3D Point Cloud Generation

References:

[New Directions in Generative Models]: score-based generative models (g11d111, CSDN blog)
Diffusion Models and Deep Learning (with Python Examples) (沉迷單車的追風少年, CSDN blog)
Introduction to ShapeNet, Downloading, and a Python Example for binvox Files (沉迷單車的追風少年, CSDN blog)
Yang Song | Generative Modeling by Estimating Gradients of the Data Distribution
Improved Techniques for Training Score-Based Generative Models
Diffusion Models Beat GANs on Image Synthesis
