1. Paper Overview

"Denoising Diffusion Probabilistic Models" (DDPM) was published at NeurIPS 2020, co-authored by **Jonathan Ho**, **Ajay Jain**, and **Pieter Abbeel** from UC Berkeley. This paper is a landmark study that empirically demonstrated that high-quality image synthesis is achievable through **diffusion probabilistic models**.

The core idea is surprisingly simple. Define a Forward Process that **gradually adds Gaussian noise** to data, and learn a Reverse Process that **step-by-step removes** this noise to recover the original data. The final training objective reduces to a **simple MSE loss** between "model-predicted noise" and "actually added noise."

DDPM achieved **FID 3.17** and **Inception Score 9.46** on CIFAR-10, showing performance comparable to or surpassing GAN-based models of the time. More importantly, this paper became the foundation of modern image generation AI including **DALL-E 2**, **Imagen**, **Stable Diffusion**, and **Midjourney**.

> **Paper Information**

> - Title: Denoising Diffusion Probabilistic Models

> - Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel

> - Venue: NeurIPS 2020

> - arXiv: [2006.11239](https://arxiv.org/abs/2006.11239)

> - Official Code: [hojonathanho/diffusion](https://github.com/hojonathanho/diffusion)

2. Background: From Thermodynamics to Generative Models

2.1 Inspiration from Non-equilibrium Thermodynamics

The intellectual origin of Diffusion Models lies in **non-equilibrium statistical mechanics**. In physics, diffusion refers to the process where particles randomly move from high-concentration regions to low-concentration regions, eventually reaching a state of **thermal equilibrium** (maximum entropy). The key insight of this process is:

- **Forward**: A state with complex structure $\rightarrow$ disordered equilibrium state (information destruction)

- **Reverse**: Equilibrium state $\rightarrow$ restoration to a structured state (information creation)

Sohl-Dickstein et al. (2015) first applied this idea to machine learning, publishing "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." By defining a diffusion process that transforms a complex data distribution into a simple known distribution (Gaussian), and learning its reverse process, one obtains a generative model.

2.2 Connection with Score Matching

Another theoretical pillar of Diffusion Models is **Score Matching**. The score function is defined as the gradient of the log probability density.

\nabla_x \log p(x)

If this score function can be estimated, samples can be generated through **Langevin Dynamics**.

x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon} \, z, \quad z \sim \mathcal{N}(0, I)
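As a rough illustration of the update rule above, the sketch below runs Langevin dynamics on a one-dimensional Gaussian target whose score is known in closed form. The target parameters `MU` and `SIGMA`, the step size, and the helper names are assumptions chosen for this example only; in practice the score would come from a trained network.

```python
import math
import random


def langevin_sample(score, x_init, step_size=0.01, n_steps=1000, rng=random):
    """Run (unadjusted) Langevin dynamics from x_init using the given score function."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        # x_{t+1} = x_t + (eps / 2) * score(x_t) + sqrt(eps) * z
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x


# Toy target (an assumption for illustration): p(x) = N(MU, SIGMA^2).
MU, SIGMA = 2.0, 0.5


def gaussian_score(x):
    """Exact score of the toy target: d/dx log p(x) = -(x - MU) / SIGMA^2."""
    return -(x - MU) / SIGMA**2


rng = random.Random(0)
samples = [langevin_sample(gaussian_score, rng.uniform(-5.0, 5.0), rng=rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"mean {sample_mean:.2f}, std {sample_std:.2f}")
```

Although the chains start from arbitrary points in $[-5, 5]$, the samples should end up distributed close to the target, with a small bias from the finite step size.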

Yang Song and Stefano Ermon (2019) proposed **Noise Conditional Score Networks** (NCSN) in "Generative Modeling by Estimating Gradients of the Data Distribution," presenting a method for estimating the score function at various noise levels. Ho et al.'s DDPM is deeply connected to this Score Matching perspective: the paper's abstract describes its weighted variational bound as designed according to "a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics."

2.3 SDE Perspective: A Unified Framework

Song et al. (2021) unified DDPM and Score Matching under the framework of **Stochastic Differential Equations** (SDE) in "Score-Based Generative Modeling through Stochastic Differential Equations." The Forward Process described as a continuous-time SDE takes the form:

dx = f(x, t) \, dt + g(t) \, dw

where $f$ is the drift coefficient, $g$ is the diffusion coefficient, and $w$ is a standard Wiener process. A corresponding **Reverse-time SDE** exists:

dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{w}

The key insight is that solving the reverse SDE requires only the **time-dependent score function** $\nabla_x \log p_t(x)$. DDPM's noise prediction network $\epsilon_\theta$ is essentially equivalent to estimating this score function.

\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t} \, \nabla_{x_t} \log p(x_t)

This relationship is the key link that theoretically unifies DDPM and Score Matching.
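The relationship can be checked by hand in the simplest case. The sketch below assumes a single known $x_0$, so it uses the conditional score of $q(x_t | x_0)$ rather than the marginal score that the network actually approximates; the numeric values are arbitrary.

```python
import math
import random


def conditional_score(x_t, x_0, alpha_bar):
    """Score of q(x_t | x_0) = N(sqrt(alpha_bar) * x_0, 1 - alpha_bar) with respect to x_t."""
    return -(x_t - math.sqrt(alpha_bar) * x_0) / (1.0 - alpha_bar)


rng = random.Random(0)
x_0, alpha_bar = 0.7, 0.3      # arbitrary illustrative values
eps = rng.gauss(0.0, 1.0)      # the noise actually added
x_t = math.sqrt(alpha_bar) * x_0 + math.sqrt(1.0 - alpha_bar) * eps

# Rescaling the score by -sqrt(1 - alpha_bar) recovers the added noise.
eps_from_score = -math.sqrt(1.0 - alpha_bar) * conditional_score(x_t, x_0, alpha_bar)
print(eps, eps_from_score)
```

Averaging this identity over all $x_0$ compatible with a given $x_t$ is what links the noise-prediction objective to the marginal score.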

3. Forward Process: Systematically Adding Noise

3.1 Forward Process as a Markov Chain

The Forward Process (or Diffusion Process) is a **fixed Markov Chain** that gradually adds Gaussian noise to original data $x_0$. It has no learnable parameters and is entirely determined by a predefined **Variance Schedule** $\{\beta_1, \beta_2, ..., \beta_T\}$.

The transition probability at each time step $t$ is defined as:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t I)

In plain terms, at each step the data from the previous time step is scaled down by $\sqrt{1 - \beta_t}$ and Gaussian noise with variance $\beta_t$ is added.

x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)

**Why scale by $\sqrt{1 - \beta_t}$?** To preserve the total variance at each step. If the variance of $x_{t-1}$ is 1, then the variance of $\sqrt{1-\beta_t} \cdot x_{t-1}$ is $1-\beta_t$, and adding noise with variance $\beta_t$ gives a total variance of $(1-\beta_t) + \beta_t = 1$.

When $T$ is sufficiently large and $\beta_t$ is appropriately set, $x_T$ converges to nearly pure **isotropic Gaussian noise** $\mathcal{N}(0, I)$.
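The variance argument can be traced numerically. This minimal sketch propagates only the scalar variance through the chain, using the linear schedule values this article gives for DDPM ($\beta$ from $10^{-4}$ to $0.02$, $T = 1000$); the function names are illustrative.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def variance_after_forward(var_0, betas):
    """Propagate Var[x_t] = (1 - beta_t) * Var[x_{t-1}] + beta_t through the chain."""
    var = var_0
    for beta in betas:
        var = (1.0 - beta) * var + beta
    return var


betas = linear_betas()
var_unit = variance_after_forward(1.0, betas)    # unit-variance data stays at 1
var_wide = variance_after_forward(25.0, betas)   # other variances are pulled toward 1
print(var_unit, var_wide)
```

Starting from unit variance the value never moves, and a much larger starting variance is still pulled to within about 0.1% of 1 by the end of the chain.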

3.2 Complete Forward Process

The joint distribution of the complete Forward Process over $T$ steps is:

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})

This follows from the Markov property, where each step depends only on the immediately preceding step. In DDPM, $T = 1000$ is used, with $\beta_t$ increasing **linearly** from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.

4. Core Mathematics: Reparameterization Trick

4.1 Jumping to Arbitrary Time $t$ in One Step

The most powerful mathematical property of the Forward Process is that $x_t$ at any arbitrary time $t$ can be **computed directly from $x_0$ without going through intermediate steps**. This is what makes training efficient.

First, define the notation:

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

$\bar{\alpha}_t$ is the cumulative product of $\alpha_s$, representing how much of the original signal is preserved up to time $t$.

4.2 Derivation

Starting from $x_1$ and deriving inductively:

x_1 = \sqrt{\alpha_1} \, x_0 + \sqrt{1 - \alpha_1} \, \epsilon_0

x_2 = \sqrt{\alpha_2} \, x_1 + \sqrt{1 - \alpha_2} \, \epsilon_1

Substituting $x_1$ into $x_2$:

x_2 = \sqrt{\alpha_2} \left( \sqrt{\alpha_1} \, x_0 + \sqrt{1 - \alpha_1} \, \epsilon_0 \right) + \sqrt{1 - \alpha_2} \, \epsilon_1

= \sqrt{\alpha_1 \alpha_2} \, x_0 + \sqrt{\alpha_2(1-\alpha_1)} \, \epsilon_0 + \sqrt{1-\alpha_2} \, \epsilon_1

Applying the **sum of independent Gaussians rule**: the sum of two independent Gaussians $\mathcal{N}(0, \sigma_1^2 I)$ and $\mathcal{N}(0, \sigma_2^2 I)$ follows $\mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)I)$.

Summing the noise variances:

\alpha_2(1-\alpha_1) + (1-\alpha_2) = \alpha_2 - \alpha_1\alpha_2 + 1 - \alpha_2 = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2

Therefore:

x_2 = \sqrt{\bar{\alpha}_2} \, x_0 + \sqrt{1 - \bar{\alpha}_2} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Generalizing this yields the following.

4.3 Final Result: Closed-form Expression

\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, \, (1 - \bar{\alpha}_t) I)}

That is, $x_t$ at any time $t$ can be sampled **in one step**:

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
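To make the induction concrete, the following sketch tracks the coefficient on $x_0$ and the accumulated noise variance step by step, then compares them with the closed form. It works on scalars for brevity and assumes the linear schedule; all function names are hypothetical.

```python
import math


def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def coefficients_by_recursion(betas, t):
    """Apply x_s = sqrt(alpha_s) x_{s-1} + sqrt(1 - alpha_s) eps for s = 1..t,
    tracking the coefficient on x_0 and the variance of the accumulated noise."""
    signal, noise_var = 1.0, 0.0
    for beta in betas[:t]:
        alpha = 1.0 - beta
        signal *= math.sqrt(alpha)
        noise_var = alpha * noise_var + (1.0 - alpha)
    return signal, noise_var


def coefficients_closed_form(betas, t):
    """Return sqrt(alpha_bar_t) and 1 - alpha_bar_t directly."""
    alpha_bar = math.prod(1.0 - beta for beta in betas[:t])
    return math.sqrt(alpha_bar), 1.0 - alpha_bar


def q_sample(x_0, t, betas, eps):
    """Draw x_t from q(x_t | x_0) in one step, given eps ~ N(0, 1)."""
    signal, noise_var = coefficients_closed_form(betas, t)
    return signal * x_0 + math.sqrt(noise_var) * eps


betas = linear_betas()
rec = coefficients_by_recursion(betas, 300)
closed = coefficients_closed_form(betas, 300)
print(rec, closed)
```

Both routes give the same coefficients up to floating-point error, which is why training can sample any $t$ directly instead of simulating the chain.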

The intuitive interpretation of this formula is:

| Term | Meaning | Change Over Time |

| --------------------------------------- | --------------- | -------------------------------------------------------------- |

| $\sqrt{\bar{\alpha}_t} \, x_0$ | Original signal | As $t \uparrow$, $\bar{\alpha}_t \downarrow$, signal decreases |

| $\sqrt{1 - \bar{\alpha}_t} \, \epsilon$ | Added noise | As $t \uparrow$, $1-\bar{\alpha}_t \uparrow$, noise increases |

At $t = 0$, $\bar{\alpha}_0 = 1$ so we get $x_0$ as-is, and at $t = T$, $\bar{\alpha}_T \approx 0$ so it becomes nearly pure noise. This gradual decrease in **Signal-to-Noise Ratio (SNR)** is the essence of the Forward Process.

\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}
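A quick way to get a feel for the SNR formula is to tabulate it. The sketch below does so for the linear schedule (an assumption; any schedule can be substituted) and prints a few time steps.

```python
import math


def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def snr_curve(betas):
    """Return SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for t = 1..T."""
    snrs, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


snrs = snr_curve(linear_betas())
for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  SNR={snrs[t - 1]:.3e}  log-SNR={math.log(snrs[t - 1]):+.2f}")
```

The SNR falls monotonically over many orders of magnitude, from the thousands at $t = 1$ to well below $10^{-3}$ at $t = T$.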

5. Reverse Process: Recovering Images from Noise

5.1 Definition of the Reverse Process

The Reverse Process starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and progressively removes noise to generate data $x_0$. The key assumption is that when each Forward step is a small Gaussian perturbation (that is, $\beta_t$ is sufficiently small), the corresponding reverse transition is also well approximated by a Gaussian.

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Here, $\mu_\theta$ and $\Sigma_\theta$ are the **mean** and **variance** that the neural network must learn. In DDPM, the variance $\Sigma_\theta$ is not learned but fixed as $\sigma_t^2 I$, where either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ is used.

5.2 Derivation of the Posterior $q(x_{t-1}|x_t, x_0)$

The key to training is that the **reverse conditional distribution** (posterior) given $x_0$ is computable in closed form. Applying Bayes' theorem:

q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) \, q(x_{t-1} | x_0)}{q(x_t | x_0)}

By the Markov property, $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1})$, so all three terms are Gaussian. Since the product of Gaussians is also Gaussian, expanding the exponents and rearranging as a quadratic in $x_{t-1}$ yields:

q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)

where the **posterior mean** is:

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t

and the **posterior variance** is:

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t

5.3 Replacing $x_0$ with $\epsilon$

Since the model cannot directly know $x_0$, we solve the Reparameterization formula $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ in reverse to express $x_0$:

x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon \right)

Substituting this into the posterior mean $\tilde{\mu}_t$:

\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \epsilon \right)

If the model learns a network $\epsilon_\theta(x_t, t)$ that predicts the noise $\epsilon$, the Reverse Process mean is computed as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)

This is why **noise prediction is equivalent to mean prediction** in DDPM's Reverse Process.
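The substitution in this section is easy to get wrong by a factor, so a numeric check can help. The sketch below evaluates the posterior mean in both forms for arbitrary scalar values; the only constraint assumed is $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$.

```python
import math


def posterior_mean_from_x0(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Posterior mean written in terms of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_x0 * x_0 + coef_xt * x_t


def posterior_mean_from_eps(x_t, eps, alpha_t, alpha_bar_t):
    """Same mean after substituting x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)


# Arbitrary illustrative values; alpha_bar_t must equal alpha_t * alpha_bar_prev.
alpha_t, alpha_bar_prev = 0.98, 0.60
alpha_bar_t = alpha_t * alpha_bar_prev
x_0, eps = 0.4, -1.3
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

mean_a = posterior_mean_from_x0(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev)
mean_b = posterior_mean_from_eps(x_t, eps, alpha_t, alpha_bar_t)
print(mean_a, mean_b)
```

The two values agree to floating-point precision, confirming that predicting $\epsilon$ carries the same information as predicting the mean.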

6. Deriving the Training Objective: From ELBO to Simplified Loss

6.1 Maximum Likelihood and ELBO

The ultimate goal of a generative model is to maximize the data log-likelihood $\log p_\theta(x_0)$. However, since this is intractable to compute directly, we optimize the **Evidence Lower Bound** (ELBO).

Applying Jensen's inequality:

\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right] = \text{ELBO}

6.2 Decomposition of the ELBO

Decomposing the ELBO into KL divergence terms:

\text{ELBO} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0: \text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q(x_T | x_0) \| p(x_T))}_{L_T: \text{Prior matching term}} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_q \left[ D_{\text{KL}}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) \right]}_{L_{t-1}: \text{Denoising matching term}}

Analyzing the meaning of each term:

**$L_T$ (Prior Matching)**: Measures how well $q(x_T|x_0)$ matches the prior distribution $p(x_T) = \mathcal{N}(0, I)$. When $T$ is sufficiently large, this term converges to 0, and since it has no learnable parameters, it is **ignored as a constant**.

**$L_0$ (Reconstruction)**: Measures the ability to reconstruct $x_0$ from $x_1$. Since $x_0$ and $x_1$ are very similar, its impact on overall training is small.

**$L_{t-1}$ (Denoising Matching)**: The **core training signal** that measures how well the model's Reverse transition $p_\theta(x_{t-1}|x_t)$ matches the true posterior $q(x_{t-1}|x_t, x_0)$.

6.3 KL Divergence Computation

The KL divergence between two Gaussians is computable in closed form. Since $q(x_{t-1}|x_t, x_0) = \mathcal{N}(\tilde{\mu}_t, \tilde{\beta}_t I)$ and $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I)$:

D_{\text{KL}}(q \| p_\theta) = \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2 + C

where $C$ is a constant related to the variances. With fixed variance, **only the difference in means** becomes the training objective.
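The claim that only the mean difference matters can be verified with the standard closed-form KL between two scalar Gaussians. The numbers below are made up for illustration; the point is that the gap between the full KL and the squared-mean term does not change when $\mu_\theta$ changes.

```python
import math


def kl_gaussian_1d(mean_q, var_q, mean_p, var_p):
    """KL( N(mean_q, var_q) || N(mean_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mean_q - mean_p) ** 2) / var_p - 1.0)


# Illustrative numbers: posterior N(mu_tilde, beta_tilde), model N(mu_theta, sigma2).
mu_tilde, beta_tilde, sigma2 = 0.30, 0.010, 0.012


def mean_term(mu_theta):
    """The squared-mean part of the KL: ||mu_tilde - mu_theta||^2 / (2 sigma^2)."""
    return (mu_tilde - mu_theta) ** 2 / (2.0 * sigma2)


# The remainder C depends only on the two variances, not on mu_theta.
const_a = kl_gaussian_1d(mu_tilde, beta_tilde, 0.10, sigma2) - mean_term(0.10)
const_b = kl_gaussian_1d(mu_tilde, beta_tilde, 0.55, sigma2) - mean_term(0.55)
print(const_a, const_b)
```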

6.4 Reparameterization to Noise Prediction

Substituting the expressions for $\tilde{\mu}_t$ and $\mu_\theta$ derived earlier:

\|\tilde{\mu}_t - \mu_\theta\|^2 = \frac{\beta_t^2}{(1-\bar{\alpha}_t)\alpha_t} \|\epsilon - \epsilon_\theta(x_t, t)\|^2

The **Simplified Loss** with the weighting coefficient removed is:

\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]}

where $t \sim \text{Uniform}(\{1, ..., T\})$, $x_0 \sim q(x_0)$, and $\epsilon \sim \mathcal{N}(0, I)$.

This is arguably DDPM's most influential practical result. Starting from the full ELBO, training reduces to an **MSE between the actual noise $\epsilon$ and the predicted noise $\epsilon_\theta$**. Ho et al. report that this simplified loss also produces better sample quality than the weighted variational bound.
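To see what removing the weighting coefficient changes, the sketch below computes the per-step weight $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}$ that the variational bound places on the noise error. It assumes the linear schedule and the choice $\sigma_t^2 = \beta_t$; other choices shift the numbers.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def vlb_noise_weights(betas):
    """Weight on ||eps - eps_theta||^2 in L_{t-1}, taking sigma_t^2 = beta_t."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        sigma2 = beta
        weights.append(beta**2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar)))
    return weights


weights = vlb_noise_weights(linear_betas())
for t in (1, 10, 100, 1000):
    print(f"t={t:4d}  weight={weights[t - 1]:.4f}")
```

Under these assumptions the weight is largest at $t = 1$ (about 0.5) and far smaller at noisy time steps, so setting every weight to 1 shifts relative emphasis away from the smallest $t$ and toward the harder, noisier denoising steps.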

6.5 Training Algorithm Summary

    Algorithm 1: Training
    ─────────────────────────────────
    repeat
        x_0 ~ q(x_0)                        # Sample from dataset
        t ~ Uniform({1, ..., T})            # Select random time step
        ε ~ N(0, I)                         # Sample standard Gaussian noise
        x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε     # Generate noisy image
        Take a gradient descent step on ∇_θ ||ε - ε_θ(x_t, t)||²   # Update θ
    until converged

7. Noise Scheduling: Design of $\beta_t$

7.1 Linear Schedule (Original DDPM)

Ho et al. used a schedule where $\beta_t$ increases **linearly** from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.

\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)

The intuition behind this schedule is to add small noise initially to gradually destroy data structure, and larger noise in later stages to rapidly converge to a Gaussian.

7.2 Problems with the Linear Schedule

Nichol & Dhariwal (2021, "Improved Denoising Diffusion Probabilistic Models") identified two issues with the Linear Schedule.

**First, information is destroyed too quickly in the early stages.** $\bar{\alpha}_t$ decreases rapidly in the beginning, so significant noise is added even at low values of $t$. Nichol & Dhariwal report that the linear schedule works well for high-resolution images but is sub-optimal at **lower resolutions** such as $64 \times 64$ and $32 \times 32$.

**Second, late time steps are wasted.** At large values of $t$, $\bar{\alpha}_t \approx 0$, meaning $x_t$ is already close to pure noise and contributes little to meaningful training.

7.3 Cosine Schedule

The **Cosine Schedule** proposed by Nichol & Dhariwal defines $\bar{\alpha}_t$ directly.

\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2

where $s = 0.008$ is a small offset to prevent $\beta_t$ from becoming too small near $t=0$.

The key characteristics of the Cosine Schedule are:

- $\bar{\alpha}_t$ decreases **nearly linearly in the middle range**, providing uniformly useful training signals across all time steps

- Prevents excessive noise addition in the early stages, preserving fine details

- Ensures smooth transition to complete noise in the later stages

    import math

    import torch


    def cosine_beta_schedule(timesteps, s=0.008):
        """Cosine schedule as proposed in Nichol & Dhariwal (2021)."""
        steps = timesteps + 1
        # Normalized time grid t/T with T + 1 points, so that T ratios can be taken below.
        t = torch.linspace(0, timesteps, steps) / timesteps
        alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        # Clip to avoid degenerate steps near t = 0 and t = T.
        return torch.clip(betas, 0.0001, 0.9999)


    def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
        """Linear schedule as proposed in Ho et al. (2020)."""
        return torch.linspace(beta_start, beta_end, timesteps)

7.4 Schedule Comparison

| Property | Linear Schedule | Cosine Schedule |

| ------------------------------ | -------------------------------- | ----------------------- |

| $\bar{\alpha}_t$ decay pattern | Rapid early, gradual late | Nearly linear in middle |

| Early information preservation | Low | High |

| Late time step utilization | Inefficient (already pure noise) | Efficient |

| Low-resolution ($32 \times 32$, $64 \times 64$) suitability | Sub-optimal | Better |

| Used in original DDPM | Yes | No |

| Used in Improved DDPM | No | Yes |
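The difference between the two schedules is easiest to see by printing $\bar{\alpha}_t$ side by side. The sketch below computes both curves directly from their definitions in plain Python; it omits the clipping of $\beta_t$, which only affects the cosine curve very close to $t = T$.

```python
import math


def alpha_bar_linear(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for the linear schedule, t = 1..T."""
    out, alpha_bar = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out


def alpha_bar_cosine(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    return [f(t) / f(0) for t in range(1, T + 1)]


linear, cosine = alpha_bar_linear(), alpha_bar_cosine()
for t in (100, 250, 500, 750, 900):
    print(f"t={t:4d}  linear={linear[t - 1]:.4f}  cosine={cosine[t - 1]:.4f}")
```

At the midpoint the cosine schedule still retains roughly half of the signal, while the linear schedule has already dropped below a tenth.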

8. Sampling Algorithm

8.1 DDPM Sampling

After training is complete, the DDPM sampling algorithm for generating new images is:

    Algorithm 2: Sampling
    ─────────────────────────────────
    x_T ~ N(0, I)                              # Start from pure noise
    for t = T, T-1, ..., 1:
        z ~ N(0, I) if t > 1, else z = 0       # No noise added at the last step
        x_{t-1} = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
    return x_0

8.2 Step-by-Step Interpretation

**Step 1: Initialization.** Sample pure Gaussian noise from $x_T \sim \mathcal{N}(0, I)$. This is the starting point of the generation process.

**Step 2: Noise Prediction.** Feed the current noisy image $x_t$ and time step $t$ into the network $\epsilon_\theta$ to predict the noise contained in $x_t$.

**Step 3: Mean Computation.** Compute the mean of the Reverse transition using the predicted noise.

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)

**Step 4: Stochastic Transition.** Generate $x_{t-1}$ by adding scaled Gaussian noise $\sigma_t z$ to the computed mean. No noise is added at the final step ($t=1$).

**Step 5: Repeat.** Repeat the above process from $t = T$ to $t = 1$.
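The steps above can be sketched end to end on a toy problem. Here the "data" is assumed to be a one-dimensional Gaussian, for which the ideal noise predictor $\mathbb{E}[\epsilon \mid x_t]$ has a closed form and can stand in for a trained network. The data parameters, the choice $\sigma_t^2 = \beta_t$, and all names are assumptions for illustration.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

# Toy assumption: x_0 ~ N(DATA_MEAN, DATA_STD^2), so E[eps | x_t] is known exactly.
DATA_MEAN, DATA_STD = 2.0, 0.5


def eps_theta(x_t, t):
    """Exact noise predictor for the toy Gaussian data; t is 1-indexed."""
    ab = alpha_bars[t - 1]
    var_xt = ab * DATA_STD**2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * DATA_MEAN) / var_xt


def ddpm_sample(rng):
    """Ancestral sampling (Algorithm 2) with sigma_t^2 = beta_t."""
    x = rng.gauss(0.0, 1.0)                          # Step 1: x_T ~ N(0, 1)
    for t in range(T, 0, -1):                        # Step 5: repeat for t = T..1
        beta, alpha, ab = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        eps_pred = eps_theta(x, t)                   # Step 2: noise prediction
        mean = (x - beta / math.sqrt(1.0 - ab) * eps_pred) / math.sqrt(alpha)  # Step 3
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0    # no noise at the final step
        x = mean + math.sqrt(beta) * z               # Step 4: stochastic transition
    return x


rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(400)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"sample mean {mean:.2f}, sample std {std:.2f}")
```

Starting from standard normal noise, the generated samples should land close to the assumed data distribution, which is a useful sanity check before plugging in a learned $\epsilon_\theta$.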

8.3 Limitations of Sampling

The biggest drawback of DDPM sampling is **speed**. Sequential denoising over $T = 1000$ steps requires **1000 neural network forward passes** for a single image. This is extremely slow compared to GAN's single forward pass, spurring subsequent research on accelerated samplers such as DDIM and DPM-Solver.

9. Architecture: Time-conditioned U-Net

9.1 U-Net Based Design

DDPM's noise prediction network $\epsilon_\theta(x_t, t)$ is based on the **U-Net** architecture. U-Net was originally proposed by Ronneberger et al. (2015) for medical image segmentation, featuring an Encoder-Decoder structure with **Skip Connections** that combine features at various resolutions.

DDPM's U-Net is based on the PixelCNN++ structure with the following modifications.

9.2 Key Components

**Time Embedding**: To inject the time step $t$ into the network, Transformer-style **Sinusoidal Positional Encoding** is used.

\text{TE}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{TE}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)

This embedding passes through an MLP and is injected into each ResNet Block. Specifically, the time embedding is linearly transformed and then either **added** (additive) or **scaled** (FiLM conditioning) onto the intermediate feature maps of the ResNet Block.

**ResNet Block**: Each block consists of the following sequence:

1. Group Normalization

2. SiLU (Swish) Activation

3. Convolution

4. Time Embedding injection

5. Group Normalization

6. SiLU Activation

7. Dropout

8. Convolution

9. Residual Connection

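**Self-Attention**: **Self-Attention** is applied at feature maps of $16 \times 16$ resolution. The original implementation uses a single attention head; multi-head variants were explored in later work such as Dhariwal & Nichol (2021). The spatial dimensions $(h, w)$ are flattened to sequence length $h \times w$ to perform standard Scaled Dot-Product Attention.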

**Group Normalization**: Group Normalization is used instead of Batch Normalization. It is independent of batch size and provides more stable training for generative models.

9.3 Specific Architecture Specifications

    Input: x_t ∈ R^(C×H×W), t ∈ {1,...,T}

    Encoder:
        [128] → [128] → ↓2 →
        [256] → [256] → ↓2 →
        [256] → [256] → ↓2 → (+ Self-Attention at 16×16)
        [512] → [512] → ↓2

    Bottleneck:
        [512] → Self-Attention → [512]

    Decoder (with skip connections):
        [512] → [512] → ↑2 →
        [256] → [256] → ↑2 → (+ Self-Attention at 16×16)
        [256] → [256] → ↑2 →
        [128] → [128] → ↑2

    Output: ε_θ ∈ R^(C×H×W) (predicted noise with same dimensions as input)

DDPM used approximately **114M parameters** at $256 \times 256$ resolution.

10. Experimental Results

10.1 Quantitative Evaluation

DDPM was evaluated on the following benchmarks.

**CIFAR-10 (Unconditional, $32 \times 32$)**:

| Model | FID ($\downarrow$) | IS ($\uparrow$) |

| --------------- | ------------------ | --------------- |

| DDPM | **3.17** | **9.46** |

| StyleGAN2 + ADA | 2.92 | 9.83 |

| NCSN | 25.32 | 8.87 |

| ProgressiveGAN | 15.52 | 8.80 |

| NVAE | 23.5 | - |

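In the paper's own comparison, DDPM reported the best FID among unconditional CIFAR-10 models at the time. Against the StyleGAN2 + ADA figures listed above, its quality is comparable rather than strictly better (FID 3.17 vs. 2.92).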

**LSUN ($256 \times 256$)**:

| Dataset | FID |

| ------------ | ---- |

| LSUN Bedroom | 4.90 |

| LSUN Cat | - |

| LSUN Church | 7.89 |

10.2 Qualitative Analysis

DDPM samples exhibited several distinct characteristics compared to GANs.

**High diversity**: While GANs suffer from limited generation diversity due to mode collapse, DDPM covers diverse modes of the data distribution in a balanced manner.

**Gradual generation**: The progressive transformation from noise to image can be visualized, confirming a **coarse-to-fine** generation pattern where the model first forms global structure and then adds fine details.

**Stable training**: Training avoids the instabilities that are common in GANs (mode collapse, training oscillation) and converges stably with a simple MSE loss.

10.3 Progressive Lossy Compression Interpretation

Ho et al. interpreted DDPM sampling as a form of **Progressive Lossy Decompression**. Information is progressively added at each Reverse step, which can be viewed as a **generalization of Autoregressive Decoding**. Their Rate-Distortion analysis shows distortion dropping steeply in the low-rate region, indicating that the majority of bits are spent on imperceptible details rather than on overall structure.

11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion

11.1 DDIM (Denoising Diffusion Implicit Models)

> Song et al., 2021 | [arXiv: 2010.02502](https://arxiv.org/abs/2010.02502)

DDIM addresses DDPM's biggest limitation: **slow sampling speed**. The core idea is to generalize the Forward Process to be **Non-Markovian**.

DDIM uses the same trained model $\epsilon_\theta$ while modifying only the sampling process.

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t

Setting $\sigma_t = 0$ makes sampling **completely deterministic**, which provides:

- **Accelerated sampling**: Similar quality images with only 50-100 steps instead of $T=1000$ (10-20x speedup)

- **Semantic interpolation**: Thanks to the deterministic mapping, interpolation in latent space leads to meaningful image transformations

- **Consistency**: Always generates the same image from the same initial noise, ensuring reproducible results
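The DDIM update can be sketched as a small function. The example below works on scalars and assumes a perfect noise prediction, which makes the deterministic case ($\sigma_t = 0$) easy to verify; `ddim_step` and the numeric values are illustrative, not the authors' implementation.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, sigma_t=0.0, z=0.0):
    """One DDIM update from time t to an earlier time (possibly many steps back)."""
    # Predicted x_0 implied by the current noise estimate.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Deterministic part pointing back toward x_t, plus optional fresh noise.
    direction = math.sqrt(1.0 - alpha_bar_prev - sigma_t**2) * eps_pred
    return math.sqrt(alpha_bar_prev) * x0_pred + direction + sigma_t * z


# Illustrative values: jump from a noisy level (alpha_bar = 0.2) to a cleaner one (0.7).
x_0, eps = 0.5, -0.8
alpha_bar_t, alpha_bar_prev = 0.2, 0.7
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

# With a perfect noise prediction, the deterministic step lands exactly on the
# same (x_0, eps) pair at the new noise level.
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
expected = math.sqrt(alpha_bar_prev) * x_0 + math.sqrt(1.0 - alpha_bar_prev) * eps
print(x_prev, expected)
```

Because the step only needs $\bar{\alpha}$ at the source and destination, it can skip over intermediate time steps, which is where the speedup comes from.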

11.2 Improved DDPM

> Nichol & Dhariwal, 2021 | [arXiv: 2102.09672](https://arxiv.org/abs/2102.09672)

Improved DDPM refined three aspects of the original DDPM.

**Learnable variance**: While DDPM fixed $\sigma_t^2$ as either $\beta_t$ or $\tilde{\beta}_t$, Improved DDPM makes it learnable. Specifically, $\sigma_t^2$ is parameterized as an interpolation between $\beta_t$ and $\tilde{\beta}_t$.

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)

where $v$ is a value output by the network.
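The interpolation is linear in log space, which a few evaluations make clear. In this sketch $v$ is restricted to $[0, 1]$ and the two variances are made-up numbers; in the actual model $v$ is produced per dimension by the network.

```python
import math


def interpolated_variance(v, beta_t, beta_tilde_t):
    """Sigma = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))


beta_t, beta_tilde_t = 0.010, 0.008  # illustrative values with beta_tilde_t <= beta_t

upper = interpolated_variance(1.0, beta_t, beta_tilde_t)   # recovers beta_t
lower = interpolated_variance(0.0, beta_t, beta_tilde_t)   # recovers beta_tilde_t
middle = interpolated_variance(0.5, beta_t, beta_tilde_t)  # geometric mean of the two
print(upper, lower, middle)
```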

**Cosine Schedule**: Introduced the Cosine Variance Schedule described earlier, which improves results especially for lower-resolution images such as $64 \times 64$ and $32 \times 32$, where the linear schedule destroys information too quickly.

**Hybrid Loss**: Adding a small amount of the variational lower bound $L_\text{vlb}$ to $L_\text{simple}$ also improved log-likelihood.

L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}

11.3 Classifier Guidance

> Dhariwal & Nichol, 2021 | [arXiv: 2105.05233](https://arxiv.org/abs/2105.05233)

A technique proposed in "Diffusion Models Beat GANs on Image Synthesis" that injects the **gradient of a pre-trained classifier** into the Reverse Process for conditional generation.

\hat{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - s \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log p_\phi(y|x_t)

where $s$ is the guidance scale and $p_\phi$ is a classifier trained on noisy images. Increasing $s$ **reduces diversity but increases fidelity to a specific class**. In this paper, Diffusion Models first **surpassed GANs in FID** on ImageNet (FID 2.97 at $128 \times 128$ and FID 4.59 at $256 \times 256$).

**Limitation**: A separate classifier must be trained on noisy data, complicating the training pipeline.

11.4 Classifier-Free Guidance (CFG)

> Ho & Salimans, 2022 | [arXiv: 2207.12598](https://arxiv.org/abs/2207.12598)

An innovative technique that achieves guidance effects without a separate classifier, and has become the **de facto standard** in modern Diffusion Models.

The core idea is for a single network to learn **both conditional and unconditional** generation. During training, condition information $c$ is replaced with a null token $\varnothing$ with a certain probability (typically 10-20%).

At inference, conditional and unconditional predictions are linearly combined.

\hat{\epsilon}_\theta(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)

where $w$ is the guidance weight. When $w = 0$, standard conditional generation occurs; when $w > 0$, fidelity to the condition increases.

Rearranging gives the following interpretation:

\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing) + (1 + w) \cdot \underbrace{(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))}_{\text{shift toward condition}}

This can be interpreted as **pushing away from the unconditional prediction toward the conditional direction**, with larger $w$ increasing the pushing force. Nearly all state-of-the-art Text-to-Image models including DALL-E 2, Stable Diffusion, and Imagen use CFG.
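The guidance combination is a one-line operation on the two network outputs. The sketch below uses plain lists as stand-ins for the predicted noise tensors; the values are hypothetical and the function names are not from any library.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Guided prediction: (1 + w) * eps_cond - w * eps_uncond, element-wise."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


def cfg_combine_rearranged(eps_cond, eps_uncond, w):
    """Equivalent form: eps_uncond + (1 + w) * (eps_cond - eps_uncond)."""
    return [u + (1.0 + w) * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical network outputs for one tiny 3-element "image".
eps_cond = [0.20, -0.50, 1.00]
eps_uncond = [0.10, -0.30, 0.40]

guided = cfg_combine(eps_cond, eps_uncond, w=2.0)
guided_alt = cfg_combine_rearranged(eps_cond, eps_uncond, w=2.0)
plain = cfg_combine(eps_cond, eps_uncond, w=0.0)  # reduces to the conditional prediction
print(guided, guided_alt, plain)
```

In a real sampler both predictions come from the same network, usually evaluated in one batched forward pass, so guidance roughly doubles the per-step cost.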

11.5 Latent Diffusion Models (LDM) / Stable Diffusion

> Rombach et al., 2022 | [arXiv: 2112.10752](https://arxiv.org/abs/2112.10752)

LDM dramatically improved computational efficiency by performing the Diffusion Process in **latent space rather than pixel space**.

**Key Architecture:**

1. **Perceptual Compression**: A pre-trained Autoencoder (VQ-VAE or KL-regularized VAE) Encoder $\mathcal{E}$ compresses image $x$ into low-dimensional latent $z = \mathcal{E}(x)$. Typically, a $256 \times 256 \times 3$ image is compressed to $32 \times 32 \times 4$ latent (approximately 48x dimensionality reduction).

2. **Latent Diffusion**: DDPM's Forward/Reverse Process is performed in this latent space. Computation is significantly reduced compared to pixel space.

3. **Cross-Attention Conditioning**: Condition information such as text and segmentation maps is injected into the U-Net via **Cross-Attention**. For text, CLIP or BERT embeddings are used.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V

where $Q = W_Q \cdot \varphi(z_t)$, $K = W_K \cdot \tau_\theta(y)$, $V = W_V \cdot \tau_\theta(y)$, and $\tau_\theta(y)$ is the encoding of the condition information.

**Stable Diffusion** is trained by combining this LDM architecture with a CLIP text encoder and a large-scale dataset (LAION-5B), becoming the de facto standard for open-source Text-to-Image models.

11.6 Score SDE

> Song et al., 2021 | [arXiv: 2011.13456](https://arxiv.org/abs/2011.13456)

This ICLR 2021 Oral presentation connected DDPM and Score Matching under the unified framework of **Stochastic Differential Equations** (SDE).

Key contributions:

- **Variance Exploding (VE) SDE**: Corresponds to the NCSN/SMLD family

- **Variance Preserving (VP) SDE**: Corresponds to DDPM

- **Sub-VP SDE**: A variant providing better likelihood

\text{VP-SDE}: \quad dx = -\frac{1}{2}\beta(t) x \, dt + \sqrt{\beta(t)} \, dw

The extension to continuous time enables **exact log-likelihood computation** (via ODE), **more flexible sampler design**, and **conditional generation** tasks such as Inpainting and Colorization.

11.7 Consistency Models

> Song et al., 2023 | [arXiv: 2303.01469](https://arxiv.org/abs/2303.01469)

Consistency Models, proposed by Yang Song at OpenAI, represent an attempt to **fundamentally solve the multi-step sampling problem** of Diffusion Models.

The core idea is to learn a function $f_\theta$ that **maps all points on an ODE trajectory to the same starting point (original data)**.

f_\theta(x_t, t) = x_0, \quad \forall t \in [0, T]

By this self-consistency property, data can be recovered from a noisy sample at any time $t$ with a single network evaluation. That is, **1-step generation** is possible.

Two training approaches exist:

- **Consistency Distillation (CD)**: Distilling from a pre-trained Diffusion Model

- **Consistency Training (CT)**: Training independently without pre-training

In 2024, **Easy Consistency Models (ECM)** emerged, achieving better 2-step generation performance at 33% of the training cost compared to iCT.

11.8 Flow Matching / Rectified Flow

> Lipman et al., 2023; Liu et al., 2023 | [arXiv: 2210.02747](https://arxiv.org/abs/2210.02747), [arXiv: 2209.03003](https://arxiv.org/abs/2209.03003)

Flow Matching is an alternative approach to Diffusion Models that directly learns the **probability flow** connecting data and noise distributions.

**Core Idea**: Define **straight paths** from noise $x_1 \sim \mathcal{N}(0, I)$ to data $x_0$.

x_t = (1-t) x_0 + t \, \epsilon, \quad t \in [0, 1]

Learn a **velocity field** $v_\theta(x_t, t)$ along this path by regressing onto its time derivative, $\frac{dx_t}{dt} = \epsilon - x_0$.

L_{\text{FM}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| v_\theta(x_t, t) - (\epsilon - x_0) \|^2 \right]
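A minimal sketch of the straight-path construction follows. It takes the velocity to be the time derivative of the path $x_t = (1-t) x_0 + t \, \epsilon$, so generation integrates from $t = 1$ (noise) back to $t = 0$ (data). A single known $(x_0, \epsilon)$ pair stands in for a trained $v_\theta$, which is an assumption made purely for illustration.

```python
def interpolate(x_0, eps, t):
    """Point on the straight path x_t = (1 - t) * x_0 + t * eps."""
    return (1.0 - t) * x_0 + t * eps


def target_velocity(x_0, eps):
    """Time derivative of the path above: d x_t / dt = eps - x_0 (constant in t)."""
    return eps - x_0


def euler_generate(eps, velocity_fn, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) back to t = 0 (data)."""
    x, dt = eps, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x


# Illustration with a single known (x_0, eps) pair, where the ideal velocity is exact.
x_0, eps = 1.5, -0.4
recovered = euler_generate(eps, lambda x, t: target_velocity(x_0, eps))
one_step = euler_generate(eps, lambda x, t: target_velocity(x_0, eps), n_steps=1)
print(recovered, one_step)
```

Because the path is straight and the velocity constant, even a single Euler step recovers $x_0$ here; with a learned velocity field the paths are only approximately straight, which is the motivation for straightening them further.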

**Rectified Flow** repeatedly "straightens" these paths (reflow), producing high-quality samples even with few steps.

**Stable Diffusion 3** adopted Rectified Flow, presenting a new paradigm for Diffusion Models alongside the transition from U-Net to Transformer.

11.9 DiT (Diffusion Transformer)

> Peebles & Xie, 2023 | [arXiv: 2212.09748](https://arxiv.org/abs/2212.09748)

DiT replaced the U-Net backbone of Diffusion Models with a **Vision Transformer (ViT)**-style architecture.

Key design choices:

- Images are divided into **patches** and processed as tokens

- Time step $t$ and class label $y$ are injected via **Adaptive Layer Normalization (adaLN-Zero)**

- Composed of $L$ Transformer Blocks

DiT, combined with Latent Diffusion, achieved **FID 2.27** on ImageNet $256 \times 256$ class-conditional generation, surpassing all previous Diffusion Models.

**Significance of DiT**: It empirically demonstrated that Transformer **scaling laws** can be applied to Diffusion Models. Performance consistently improves with increased model size and training compute. This finding directly influenced the architectural choices of the latest large-scale generative models such as **Sora** (OpenAI, Video generation) and **Stable Diffusion 3**.

12. PyTorch Code Examples: Simple DDPM Implementation

Below is a simplified PyTorch implementation of DDPM's core components. A more sophisticated U-Net and hyperparameter tuning would be needed for actual training.

12.1 Noise Schedule and Forward Process

    import math

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class DDPMScheduler:
        """Scheduler managing DDPM's Forward Process."""

        def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
            self.num_timesteps = num_timesteps

            if schedule == 'linear':
                self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
            elif schedule == 'cosine':
                self.betas = self._cosine_schedule(num_timesteps)
            else:
                raise ValueError(f"Unknown schedule: {schedule}")

            # Pre-compute key variables
            self.alphas = 1.0 - self.betas
            self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)  # ᾱ_t
            self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)

            # Forward process coefficients
            self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)  # √ᾱ_t
            self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)  # √(1-ᾱ_t)

            # Reverse process coefficients
            self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)  # 1/√α_t
            self.posterior_variance = (
                self.betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
            )  # β̃_t

        def _cosine_schedule(self, timesteps, s=0.008):
            steps = timesteps + 1
            t = torch.linspace(0, timesteps, steps) / timesteps
            alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
            alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
            betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
            return torch.clip(betas, 0.0001, 0.9999)

        def add_noise(self, x_0, t, noise=None):
            """Forward process: sample x_t ~ q(x_t | x_0) in one step."""
            if noise is None:
                noise = torch.randn_like(x_0)

            # Move the coefficient tables to x_0's device before indexing,
            # so that timesteps living on the GPU do not index CPU tensors
            sqrt_alpha_cumprod = self.sqrt_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_cumprod = (
                self.sqrt_one_minus_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
            )

            # x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
            x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise
            return x_t
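As a sanity check on the one-step formula, the snippet below verifies numerically that applying the single-step transition $x_t = \sqrt{\alpha_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon$ for all $T$ steps gives the same mean coefficient and variance as the closed form $q(x_t \mid x_0)$. It is a standalone sketch that assumes the linear schedule with the default values used above and tracks only the scalar coefficients, with $x_0$ treated as fixed.

```python
import math

import torch

# Linear schedule with the same defaults as above (float64 for a tight check)
betas = torch.linspace(1e-4, 0.02, 1000, dtype=torch.float64)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# Iterate the single-step transition, tracking the coefficient on x_0
# and the accumulated noise variance
mean_coef, var = 1.0, 0.0
for alpha_t, beta_t in zip(alphas.tolist(), betas.tolist()):
    mean_coef *= math.sqrt(alpha_t)
    var = alpha_t * var + beta_t

# Closed form at the final step: mean √ᾱ_T · x_0, variance 1 - ᾱ_T
closed_mean = math.sqrt(alphas_cumprod[-1].item())
closed_var = 1.0 - alphas_cumprod[-1].item()

print(f"iterated: mean_coef={mean_coef:.6e}, var={var:.6f}")
print(f"closed:   mean_coef={closed_mean:.6e}, var={closed_var:.6f}")
```

Both lines should agree up to floating-point error, and the final variance is very close to 1, meaning $x_T$ is almost pure Gaussian noise.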

12.2 Simplified U-Net

    class SinusoidalPositionEmbedding(nn.Module):
        """Transformer-style Sinusoidal Time Embedding."""

        def __init__(self, dim):
            super().__init__()
            self.dim = dim

        def forward(self, t):
            device = t.device
            half_dim = self.dim // 2
            emb = math.log(10000) / (half_dim - 1)
            emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
            emb = t[:, None].float() * emb[None, :]
            emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
            return emb

    class ResBlock(nn.Module):
        """Time-conditioned Residual Block."""

        def __init__(self, in_ch, out_ch, time_emb_dim):
            super().__init__()
            # GroupNorm requires the channel count to be divisible by the number
            # of groups, so fall back to one group for inputs such as 3-channel RGB
            self.norm1 = nn.GroupNorm(8 if in_ch % 8 == 0 else 1, in_ch)
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.time_mlp = nn.Sequential(
                nn.SiLU(),
                nn.Linear(time_emb_dim, out_ch),
            )
            self.norm2 = nn.GroupNorm(8, out_ch)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
            self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

        def forward(self, x, t_emb):
            h = self.conv1(F.silu(self.norm1(x)))
            h = h + self.time_mlp(t_emb)[:, :, None, None]  # Inject time embedding
            h = self.conv2(F.silu(self.norm2(h)))
            return h + self.skip(x)  # Residual connection

    class SimpleUNet(nn.Module):
        """Simplified U-Net for DDPM training."""

        def __init__(self, in_channels=3, base_channels=64, time_emb_dim=256):
            super().__init__()

            # Time embedding
            self.time_mlp = nn.Sequential(
                SinusoidalPositionEmbedding(base_channels),
                nn.Linear(base_channels, time_emb_dim),
                nn.SiLU(),
                nn.Linear(time_emb_dim, time_emb_dim),
            )

            # Encoder
            self.enc1 = ResBlock(in_channels, base_channels, time_emb_dim)
            self.enc2 = ResBlock(base_channels, base_channels * 2, time_emb_dim)
            self.enc3 = ResBlock(base_channels * 2, base_channels * 4, time_emb_dim)
            self.pool = nn.MaxPool2d(2)

            # Bottleneck
            self.bot = ResBlock(base_channels * 4, base_channels * 4, time_emb_dim)

            # Decoder (with skip connections)
            self.dec3 = ResBlock(base_channels * 8, base_channels * 2, time_emb_dim)
            self.dec2 = ResBlock(base_channels * 4, base_channels, time_emb_dim)
            self.dec1 = ResBlock(base_channels * 2, base_channels, time_emb_dim)
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

            # Output
            self.out = nn.Conv2d(base_channels, in_channels, 1)

        def forward(self, x, t):
            t_emb = self.time_mlp(t)

            # Encoder
            e1 = self.enc1(x, t_emb)
            e2 = self.enc2(self.pool(e1), t_emb)
            e3 = self.enc3(self.pool(e2), t_emb)

            # Bottleneck
            b = self.bot(self.pool(e3), t_emb)

            # Decoder with skip connections
            d3 = self.dec3(torch.cat([self.up(b), e3], dim=1), t_emb)
            d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1), t_emb)
            d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1), t_emb)

            return self.out(d1)  # Predicted noise ε_θ

12.3 Training Loop

    def train_ddpm(model, dataloader, scheduler, epochs=100, lr=2e-4, device='cuda'):
        """DDPM training loop (Algorithm 1 implementation)."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.train()

        for epoch in range(epochs):
            total_loss = 0

            for batch_idx, (x_0, _) in enumerate(dataloader):
                x_0 = x_0.to(device)

                # 1. Select random time step: t ~ Uniform({1, ..., T})
                #    (stored 0-indexed in code, i.e. 0 .. T-1)
                t = torch.randint(0, scheduler.num_timesteps, (x_0.shape[0],), device=device)

                # 2. Sample noise: ε ~ N(0, I)
                noise = torch.randn_like(x_0)

                # 3. Forward process: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
                x_t = scheduler.add_noise(x_0, t, noise)

                # 4. Predict noise: ε_θ(x_t, t)
                noise_pred = model(x_t, t)

                # 5. Simplified loss: L = ||ε - ε_θ(x_t, t)||²
                loss = F.mse_loss(noise_pred, noise)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

12.4 Sampling

    @torch.no_grad()
    def sample_ddpm(model, scheduler, image_shape, device='cuda'):
        """DDPM sampling (Algorithm 2 implementation)."""
        model.eval()

        # x_T ~ N(0, I)
        x = torch.randn(image_shape, device=device)

        for t in reversed(range(scheduler.num_timesteps)):
            t_batch = torch.full((image_shape[0],), t, device=device, dtype=torch.long)

            # Predict noise
            predicted_noise = model(x, t_batch)

            # Reverse process coefficients
            alpha_t = scheduler.alphas[t]
            alpha_cumprod_t = scheduler.alphas_cumprod[t]
            beta_t = scheduler.betas[t]

            # Compute mean: μ_θ = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ)
            mean = (1.0 / torch.sqrt(alpha_t)) * (
                x - (beta_t / torch.sqrt(1.0 - alpha_cumprod_t)) * predicted_noise
            )

            if t > 0:
                # Add stochastic noise (except at the last step)
                noise = torch.randn_like(x)
                sigma_t = torch.sqrt(scheduler.posterior_variance[t])
                x = mean + sigma_t * noise
            else:
                x = mean

        return x
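The mean used in the sampling loop is the posterior mean of $q(x_{t-1} \mid x_t, x_0)$ rewritten in terms of the noise. The standalone check below, a sketch assuming the linear schedule and an arbitrarily chosen step `t = 500`, confirms numerically that the two parameterizations agree when the true noise $\epsilon$ is substituted for $\epsilon_\theta$. The variable names are illustrative and not part of the implementation above.

```python
import torch

torch.manual_seed(0)

T = 1000
betas = torch.linspace(1e-4, 0.02, T, dtype=torch.float64)
alphas = 1.0 - betas
a_bar = torch.cumprod(alphas, dim=0)                                      # ᾱ_t
a_bar_prev = torch.cat([torch.ones(1, dtype=torch.float64), a_bar[:-1]])  # ᾱ_{t-1}

t = 500  # arbitrary step chosen for the check
x_0 = torch.randn(4, 3, 8, 8, dtype=torch.float64)
eps = torch.randn_like(x_0)
x_t = a_bar[t].sqrt() * x_0 + (1 - a_bar[t]).sqrt() * eps

# Posterior mean of q(x_{t-1} | x_t, x_0), written with x_0 and x_t
mean_posterior = (
    a_bar_prev[t].sqrt() * betas[t] / (1 - a_bar[t]) * x_0
    + alphas[t].sqrt() * (1 - a_bar_prev[t]) / (1 - a_bar[t]) * x_t
)

# Same mean in the ε-parameterization used by the sampler
mean_eps = (x_t - betas[t] / (1 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt()

max_diff = (mean_posterior - mean_eps).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The difference should be at floating-point precision, which is why a network that predicts $\epsilon$ accurately is enough to take the reverse step.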

12.5 Usage Example

    from torchvision import datasets, transforms

    # Hyperparameters
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    image_size = 32
    batch_size = 128
    num_timesteps = 1000

    # Initialize scheduler and model
    scheduler = DDPMScheduler(num_timesteps=num_timesteps, schedule='cosine')
    model = SimpleUNet(in_channels=3, base_channels=64).to(device)

    # Dataset (e.g., CIFAR-10)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # Normalize to [-1, 1]
    ])
    dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Train
    train_ddpm(model, dataloader, scheduler, epochs=100, device=device)

    # Sample
    samples = sample_ddpm(model, scheduler, (16, 3, image_size, image_size), device=device)
    # samples: 16 generated images in [-1, 1] range
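Because training data is normalized to $[-1, 1]$, generated samples need to be mapped back to pixel values before they can be viewed or saved. A minimal sketch of that step follows. `to_uint8_images` is a hypothetical helper name, and the clamp is an assumption added here because raw sampler outputs can fall slightly outside the range.

```python
import torch


def to_uint8_images(samples):
    """Map a (B, C, H, W) float tensor in [-1, 1] to uint8 (B, H, W, C) in [0, 255]."""
    x = samples.detach().cpu().clamp(-1.0, 1.0)  # guard against out-of-range values
    x = ((x + 1.0) * 127.5).round().to(torch.uint8)
    return x.permute(0, 2, 3, 1).contiguous()    # channels-last for image libraries


# Example with a dummy batch standing in for sampler output
dummy = torch.rand(16, 3, 32, 32) * 2.0 - 1.0
images = to_uint8_images(dummy)
print(images.shape, images.dtype)
```

The resulting channels-last array can be passed to common image libraries after calling `.numpy()`.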

13. Diffusion Model vs GAN vs VAE: Comparative Analysis

13.1 Comprehensive Comparison Table

| Property | Diffusion Model (DDPM) | GAN | VAE |

| -------------------------- | --------------------------------- | ------------------------------------- | ---------------------------- |

13.2 When to Choose Which Model?

**Choose Diffusion Models when:**

- Both generation quality and diversity are important

- Complex conditional generation like text-to-image is needed

- Training stability is critical

- Generation speed is not the top priority

**Choose GANs when:**

- Real-time generation is needed

- High-quality images for a specific domain are needed (faces, landscapes, etc.)

- The dataset is relatively small and uniform

**Choose VAEs when:**

- Meaningful Latent Space manipulation is needed

- Likelihood-based anomaly detection is needed

- Fast encoding/decoding is required

- Semi-supervised learning or representation learning is the main purpose

14. Present and Future of Diffusion Models

14.1 Major Trends in 2024-2025

**Architecture Transition: From U-Net to Transformer.** The latest models such as Stable Diffusion 3, FLUX, and Sora adopt DiT-based architectures. Transformer scaling laws have been confirmed to apply to Diffusion Models, and model scale expansion (8B+ parameters) is actively underway.

**Sampling Efficiency.** Fast solvers such as DPM-Solver reduce sampling to roughly 10-20 steps, while Consistency Models and distilled Flow Matching models make 1-4 step generation possible. Rectified Flow learns straight paths, achieving high quality even with few steps.

**Multimodal Expansion.** Diffusion Models are expanding beyond images to **video** (Sora, Runway Gen-3), **audio** (AudioLDM), **3D** (DreamFusion, Zero-1-to-3), **robotics** (Diffusion Policy), and other domains.

**Acceleration and Optimization.** Techniques such as Distillation, Quantization, and Caching have greatly improved inference speed, approaching real-time image generation.

14.2 Historical Significance of DDPM

DDPM represents a turning point in generative model history in the following ways:

1. Demonstrated the **competitiveness of Likelihood-based models** in the image generation space dominated by GANs

2. Showed that high-quality generation is possible with an extremely simple training objective ($L_\text{simple}$)

3. Established a **theoretical framework** connecting thermodynamics and Score Matching

4. Became the direct foundation of modern image generation systems including Stable Diffusion, DALL-E 2, and Midjourney

15. References

1. **Ho, J., Jain, A., & Abbeel, P.** (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. [arXiv:2006.11239](https://arxiv.org/abs/2006.11239)

2. **Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S.** (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. [arXiv:1503.03585](https://arxiv.org/abs/1503.03585)

3. **Song, J., Meng, C., & Ermon, S.** (2021). Denoising Diffusion Implicit Models (DDIM). ICLR 2021. [arXiv:2010.02502](https://arxiv.org/abs/2010.02502)

4. **Nichol, A. & Dhariwal, P.** (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. [arXiv:2102.09672](https://arxiv.org/abs/2102.09672)

5. **Dhariwal, P. & Nichol, A.** (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021. [arXiv:2105.05233](https://arxiv.org/abs/2105.05233)

6. **Ho, J. & Salimans, T.** (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop 2021. [arXiv:2207.12598](https://arxiv.org/abs/2207.12598)

7. **Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B.** (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. [arXiv:2112.10752](https://arxiv.org/abs/2112.10752)

8. **Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B.** (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. [arXiv:2011.13456](https://arxiv.org/abs/2011.13456)

9. **Song, Y., Dhariwal, P., Chen, M., & Sutskever, I.** (2023). Consistency Models. ICML 2023. [arXiv:2303.01469](https://arxiv.org/abs/2303.01469)

10. **Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M.** (2023). Flow Matching for Generative Modeling. ICLR 2023. [arXiv:2210.02747](https://arxiv.org/abs/2210.02747)

11. **Peebles, W. & Xie, S.** (2023). Scalable Diffusion Models with Transformers (DiT). ICCV 2023. [arXiv:2212.09748](https://arxiv.org/abs/2212.09748)

12. **Song, Y. & Ermon, S.** (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. [arXiv:1907.05600](https://arxiv.org/abs/1907.05600)

13. **Weng, L.** (2021). What are Diffusion Models? [lilianweng.github.io](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)

14. **Hugging Face.** The Annotated Diffusion Model. [huggingface.co/blog/annotated-diffusion](https://huggingface.co/blog/annotated-diffusion)