Complete Analysis of the DDPM Paper: The Mathematics and Principles of Diffusion Models that Create Images from Noise


1. Paper Overview

"Denoising Diffusion Probabilistic Models" (DDPM) was published at NeurIPS 2020, co-authored by Jonathan Ho, Ajay Jain, and Pieter Abbeel from UC Berkeley. This paper is a landmark study that empirically demonstrated that high-quality image synthesis is achievable through diffusion probabilistic models.

The core idea is surprisingly simple. Define a Forward Process that gradually adds Gaussian noise to data, and learn a Reverse Process that step-by-step removes this noise to recover the original data. The final training objective reduces to a simple MSE loss between "model-predicted noise" and "actually added noise."

DDPM achieved FID 3.17 and Inception Score 9.46 on CIFAR-10, showing performance comparable to or surpassing GAN-based models of the time. More importantly, this paper became the foundation of modern image generation AI including DALL-E 2, Imagen, Stable Diffusion, and Midjourney.

Paper Information

  • Title: Denoising Diffusion Probabilistic Models
  • Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
  • Venue: NeurIPS 2020
  • arXiv: 2006.11239
  • Official Code: hojonathanho/diffusion

2. Background: From Thermodynamics to Generative Models

2.1 Inspiration from Non-equilibrium Thermodynamics

The intellectual origin of Diffusion Models lies in non-equilibrium statistical mechanics. In physics, diffusion refers to the process where particles randomly move from high-concentration regions to low-concentration regions, eventually reaching a state of thermal equilibrium (maximum entropy). The key insight of this process is:

  • Forward: a state with complex structure → a disordered equilibrium state (information destruction)
  • Reverse: the equilibrium state → restoration to a structured state (information creation)

Sohl-Dickstein et al. (2015) first applied this idea to machine learning, publishing "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." By defining a diffusion process that transforms a complex data distribution into a simple known distribution (Gaussian), and learning its reverse process, one obtains a generative model.

2.2 Connection with Score Matching

Another theoretical pillar of Diffusion Models is Score Matching. The score function is defined as the gradient of the log probability density.

\nabla_x \log p(x)

If this score function can be estimated, samples can be generated through Langevin Dynamics.

x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon}\, z, \quad z \sim \mathcal{N}(0, I)

Yang Song and Stefano Ermon (2019) proposed Noise Conditional Score Networks (NCSN) in "Generative Modeling by Estimating Gradients of the Data Distribution," presenting a method for estimating the score function at various noise levels. Ho et al.'s DDPM is deeply connected to this Score Matching perspective, and the paper explicitly cites "a new connection with denoising score matching with Langevin dynamics" as a core contribution.
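To make the Langevin update above concrete, here is a minimal NumPy sketch (not from the paper) that samples from a toy 1-D standard Gaussian, whose score is known in closed form as \nabla_x \log p(x) = -x; the function name `langevin_sample` is illustrative only.

```python
import numpy as np

def langevin_sample(score_fn, x0, n_steps=1000, eps=0.01, seed=0):
    """Run the Langevin dynamics update x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eps * score_fn(x) + np.sqrt(eps) * z
    return x

# 5000 independent chains targeting N(0, 1); the score of N(0, 1) is -x.
# Starting far from the mode (at 3.0), the chains relax to the target,
# so the empirical mean and std should approach 0 and 1.
samples = langevin_sample(lambda x: -x, x0=np.full(5000, 3.0))
print(samples.mean(), samples.std())
```

With a learned score network in place of the analytic score, the same loop becomes a generative sampler.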

2.3 SDE Perspective: A Unified Framework

Song et al. (2021) unified DDPM and Score Matching under the framework of Stochastic Differential Equations (SDE) in "Score-Based Generative Modeling through Stochastic Differential Equations." The Forward Process described as a continuous-time SDE takes the form:

dx = f(x, t)\, dt + g(t)\, dw

where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process. A corresponding reverse-time SDE exists:

dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}

The key insight is that solving the reverse SDE requires only the time-dependent score function \nabla_x \log p_t(x). DDPM's noise prediction network \epsilon_\theta is essentially equivalent to estimating this score function.

\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log p(x_t)

This relationship is the key link that theoretically unifies DDPM and Score Matching.


3. Forward Process: Systematically Adding Noise

3.1 Forward Process as a Markov Chain

The Forward Process (or Diffusion Process) is a fixed Markov Chain that gradually adds Gaussian noise to the original data x_0. It has no learnable parameters and is entirely determined by a predefined Variance Schedule \{\beta_1, \beta_2, ..., \beta_T\}.

The transition probability at each time step t is defined as:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \, \beta_t I)

In plain terms, at each step the data from the previous time step is scaled down by \sqrt{1 - \beta_t} and Gaussian noise with variance \beta_t is added.

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)

Why scale by \sqrt{1 - \beta_t}? To preserve the total variance at each step. If the variance of x_{t-1} is 1, then the variance of \sqrt{1-\beta_t}\, x_{t-1} is 1-\beta_t, and adding noise with variance \beta_t gives a total variance of (1-\beta_t) + \beta_t = 1.

When T is sufficiently large and \beta_t is appropriately set, x_T converges to nearly pure isotropic Gaussian noise \mathcal{N}(0, I).
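The variance-preservation argument above can be checked numerically; this NumPy sketch (not from the paper) pushes 100,000 unit-variance scalar samples through 1000 forward steps under the paper's linear schedule.

```python
import numpy as np

# Numerical check of variance preservation: start with unit-variance data and
# repeatedly apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)              # variance ~ 1
betas = np.linspace(1e-4, 0.02, 1000)         # linear schedule from the paper
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
print(x.var())  # remains close to 1 after all 1000 steps
```

The final variance stays near 1, which is exactly why the \sqrt{1-\beta_t} scaling is used.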

3.2 Complete Forward Process

The joint distribution of the complete Forward Process over T steps is:

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})

This follows from the Markov property: each step depends only on the immediately preceding step. In DDPM, T = 1000 is used, with \beta_t increasing linearly from \beta_1 = 10^{-4} to \beta_T = 0.02.


4. Core Mathematics: Reparameterization Trick

4.1 Jumping to an Arbitrary Time t in One Step

The most powerful mathematical property of the Forward Process is that x_t at any arbitrary time t can be computed directly from x_0 without passing through the intermediate steps. This is what makes training efficient.

First, define the notation:

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

\bar{\alpha}_t is the cumulative product of the \alpha_s, representing how much of the original signal is preserved up to time t.

4.2 Derivation

Starting from x_1 and deriving inductively:

x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon_0
x_2 = \sqrt{\alpha_2}\, x_1 + \sqrt{1 - \alpha_2}\, \epsilon_1

Substituting x_1 into x_2:

x_2 = \sqrt{\alpha_2}\left( \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon_0 \right) + \sqrt{1 - \alpha_2}\, \epsilon_1 = \sqrt{\alpha_1 \alpha_2}\, x_0 + \sqrt{\alpha_2(1-\alpha_1)}\, \epsilon_0 + \sqrt{1-\alpha_2}\, \epsilon_1

Applying the sum rule for independent Gaussians: the sum of two independent Gaussians \mathcal{N}(0, \sigma_1^2 I) and \mathcal{N}(0, \sigma_2^2 I) follows \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)I).

Summing the noise variances:

\alpha_2(1-\alpha_1) + (1-\alpha_2) = \alpha_2 - \alpha_1\alpha_2 + 1 - \alpha_2 = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2

Therefore:

x_2 = \sqrt{\bar{\alpha}_2}\, x_0 + \sqrt{1 - \bar{\alpha}_2}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Generalizing this yields the following.

4.3 Final Result: Closed-form Expression

\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, \, (1 - \bar{\alpha}_t) I)}

That is, x_t at any time t can be sampled in one step:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

The intuitive interpretation of this formula is:

| Term | Meaning | Change over time |
|------|---------|------------------|
| \sqrt{\bar{\alpha}_t}\, x_0 | Original signal | As t increases, \bar{\alpha}_t decreases: the signal fades |
| \sqrt{1 - \bar{\alpha}_t}\, \epsilon | Added noise | As t increases, 1-\bar{\alpha}_t increases: the noise grows |

At t = 0, \bar{\alpha}_0 = 1, so we get x_0 as-is; at t = T, \bar{\alpha}_T \approx 0, so x_T is nearly pure noise. This gradual decrease in the Signal-to-Noise Ratio (SNR) is the essence of the Forward Process.

\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}
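The closed-form expression can be verified empirically. This NumPy sketch (not from the paper; the helper name `q_sample` is illustrative) jumps directly to t = 500 and checks that the empirical mean and variance match \sqrt{\bar{\alpha}_t}\, x_0 and 1 - \bar{\alpha}_t.

```python
import numpy as np

# One-step forward sampling from q(x_t | x_0):
# x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(100_000)                      # toy "data": constant signal 1
xt = q_sample(x0, t=500, rng=rng)
# Empirical moments should match sqrt(abar_500) and 1 - abar_500.
print(xt.mean(), np.sqrt(alphas_cumprod[500]))
print(xt.var(), 1 - alphas_cumprod[500])
```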

5. Reverse Process: Recovering Images from Noise

5.1 Definition of the Reverse Process

The Reverse Process starts from pure noise x_T \sim \mathcal{N}(0, I) and progressively removes noise to generate data x_0. If each step of the Forward Process is a small Gaussian perturbation, the key assumption is that its reverse can also be approximated as Gaussian (when \beta_t is sufficiently small).

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)
p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Here, \mu_\theta and \Sigma_\theta are the mean and variance that the neural network must learn. In DDPM, the variance \Sigma_\theta is not learned but fixed as \sigma_t^2 I, where either \sigma_t^2 = \beta_t or \sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t is used.

5.2 Derivation of the Posterior q(x_{t-1} | x_t, x_0)

The key to training is that the reverse conditional distribution (posterior) given x_0 is computable in closed form. Applying Bayes' theorem:

q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0)\, q(x_{t-1} | x_0)}{q(x_t | x_0)}

By the Markov property, q(x_t | x_{t-1}, x_0) = q(x_t | x_{t-1}), so all three terms are Gaussian. Since the product of Gaussians is also Gaussian, expanding the exponents and rearranging as a quadratic in x_{t-1} yields:

q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)

where the posterior mean is:

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t

and the posterior variance is:

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t

5.3 Replacing x_0 with \epsilon

Since the model cannot directly observe x_0, we invert the reparameterization formula x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon to express x_0:

x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left( x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon \right)

Substituting this into the posterior mean \tilde{\mu}_t:

\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right)

If the model learns a network \epsilon_\theta(x_t, t) that predicts the noise \epsilon, the Reverse Process mean is computed as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)

This is why noise prediction is equivalent to mean prediction in DDPM's Reverse Process.
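The x_0 \leftrightarrow \epsilon relationship above is a simple algebraic inversion, which this NumPy sketch (not from the paper) verifies: given the exact noise that produced x_t, inverting the reparameterization recovers x_0 exactly.

```python
import numpy as np

# Sanity check: invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0.
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 700

xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps
x0_rec = (xt - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
print(np.allclose(x0_rec, x0))
```

In practice \epsilon is unknown at sampling time, so the trained network's prediction \epsilon_\theta(x_t, t) takes its place, giving only an estimate of x_0.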


6. Deriving the Training Objective: From ELBO to Simplified Loss

6.1 Maximum Likelihood and ELBO

The ultimate goal of a generative model is to maximize the data log-likelihood \log p_\theta(x_0). However, since this is intractable to compute directly, we optimize the Evidence Lower Bound (ELBO).

Applying Jensen's inequality:

\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right] = \text{ELBO}

6.2 Decomposition of the ELBO

Decomposing the ELBO into KL divergence terms:

\text{ELBO} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0:\ \text{reconstruction}} - \underbrace{D_{\text{KL}}(q(x_T | x_0)\, \|\, p(x_T))}_{L_T:\ \text{prior matching}} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_q\left[ D_{\text{KL}}(q(x_{t-1}|x_t, x_0)\, \|\, p_\theta(x_{t-1}|x_t)) \right]}_{L_{t-1}:\ \text{denoising matching}}

Analyzing the meaning of each term:

L_T (Prior Matching): Measures how well q(x_T | x_0) matches the prior distribution p(x_T) = \mathcal{N}(0, I). When T is sufficiently large, this term converges to 0, and since it has no learnable parameters, it is ignored as a constant.

L_0 (Reconstruction): Measures the ability to reconstruct x_0 from x_1. Since x_0 and x_1 are very similar, its impact on overall training is small.

L_{t-1} (Denoising Matching): The core training signal, measuring how well the model's reverse transition p_\theta(x_{t-1} | x_t) matches the true posterior q(x_{t-1} | x_t, x_0).

6.3 KL Divergence Computation

The KL divergence between two Gaussians is computable in closed form. Since q(x_{t-1} | x_t, x_0) = \mathcal{N}(\tilde{\mu}_t, \tilde{\beta}_t I) and p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I):

D_{\text{KL}}(q\, \|\, p_\theta) = \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2 + C

where C is a constant depending only on the variances. With the variance fixed, only the difference in means remains as the training objective.

6.4 Reparameterization to Noise Prediction

Substituting the expressions for \tilde{\mu}_t and \mu_\theta derived earlier:

\|\tilde{\mu}_t - \mu_\theta\|^2 = \frac{\beta_t^2}{(1-\bar{\alpha}_t)\alpha_t} \|\epsilon - \epsilon_\theta(x_t, t)\|^2

The Simplified Loss with the weighting coefficient removed is:

\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]}

where t \sim \text{Uniform}(\{1, ..., T\}), x_0 \sim q(x_0), and \epsilon \sim \mathcal{N}(0, I).

This is DDPM's most important contribution. Starting from the complex ELBO, it ultimately arrives at the MSE between the actual noise \epsilon and the predicted noise \epsilon_\theta, one of the simplest loss functions in machine learning. Experimentally, this simplified loss also produces better sample quality than the weighted variational bound.

6.5 Training Algorithm Summary

Algorithm 1: Training
─────────────────────────────────
repeat
    x_0 ~ q(x_0)                    # Sample from dataset
    t ~ Uniform({1, ..., T})         # Select random time step
    ε ~ N(0, I)                      # Sample standard Gaussian noise
    x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε  # Generate noisy image
    ∇_θ ||ε - ε_θ(x_t, t)||²        # Compute gradient and update
until converged
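One iteration of Algorithm 1 can be sketched in NumPy (not from the official code; `eps_model` is a stand-in for the real network, and no gradient step is taken since this only evaluates the loss).

```python
import numpy as np

# One iteration of Algorithm 1: sample t, eps, form x_t, compute the MSE.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def training_loss(eps_model, x0, rng):
    t = int(rng.integers(0, len(betas)))                 # t ~ Uniform({1..T})
    eps = rng.standard_normal(x0.shape)                  # eps ~ N(0, I)
    xt = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)        # L_simple

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4096)
# A predictor that always outputs zero should score ~E[eps^2] = 1 per dimension.
loss = training_loss(lambda xt, t: np.zeros_like(xt), x0, rng)
print(loss)
```

In a real PyTorch training loop, `eps_model` would be the U-Net and the loss would be backpropagated with an optimizer step.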

7. Noise Scheduling: Design of \beta_t

7.1 Linear Schedule (Original DDPM)

Ho et al. used a schedule where \beta_t increases linearly from \beta_1 = 10^{-4} to \beta_T = 0.02.

\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)

The intuition behind this schedule is to add small noise initially to gradually destroy data structure, and larger noise in later stages to rapidly converge to a Gaussian.

7.2 Problems with the Linear Schedule

Nichol & Dhariwal (2021, "Improved Denoising Diffusion Probabilistic Models") identified two issues with the Linear Schedule.

First, information is destroyed too quickly in the early stages. \bar{\alpha}_t drops rapidly at the beginning, so significant noise is added even at small t. This is particularly problematic for high-resolution images.

Second, late time steps are wasted. At large t, \bar{\alpha}_t \approx 0, meaning x_t is already close to pure noise and contributes little to meaningful training.

7.3 Cosine Schedule

The Cosine Schedule proposed by Nichol & Dhariwal defines \bar{\alpha}_t directly.

\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2

where s = 0.008 is a small offset that prevents \beta_t from becoming too small near t = 0.

The key characteristics of the Cosine Schedule are:

  • \bar{\alpha}_t decreases nearly linearly in the middle range, providing uniformly useful training signals across all time steps
  • Prevents excessive noise addition in the early stages, preserving fine details
  • Ensures smooth transition to complete noise in the later stages
import torch
import math

def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine schedule as proposed in Nichol & Dhariwal (2021)."""
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule as proposed in Ho et al. (2020)."""
    return torch.linspace(beta_start, beta_end, timesteps)

7.4 Schedule Comparison

| Property | Linear Schedule | Cosine Schedule |
|----------|-----------------|-----------------|
| \bar{\alpha}_t decay pattern | Rapid early, gradual late | Nearly linear in the middle |
| Early information preservation | Low | High |
| Late time step utilization | Inefficient (already pure noise) | Efficient |
| High-resolution suitability | Low | High |
| Used in original DDPM | Yes | No |
| Used in Improved DDPM | No | Yes |

8. Sampling Algorithm

8.1 DDPM Sampling

After training is complete, the DDPM sampling algorithm for generating new images is:

Algorithm 2: Sampling
─────────────────────────────────
x_T ~ N(0, I)                          # Start from pure noise
for t = T, T-1, ..., 1:
    z ~ N(0, I)  if t > 1, else z = 0  # No noise added at the last step
    x_{t-1} = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
return x_0
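Algorithm 2 can be sketched end to end in NumPy (not from the official code). As a stand-in for a trained network, this uses the analytically optimal noise predictor for toy data x_0 \sim \mathcal{N}(0, I), which is \epsilon_\theta(x_t, t) = \sqrt{1-\bar{\alpha}_t}\, x_t; sampling should then return approximately unit-Gaussian samples.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
abar_prev = np.concatenate([[1.0], abar[:-1]])
post_var = betas * (1 - abar_prev) / (1 - abar)      # beta_tilde_t

def ddpm_sample(eps_model, shape, rng):
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at t = 1
        mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(post_var[t]) * z
    return x

rng = np.random.default_rng(0)
# Optimal predictor for x0 ~ N(0, I); samples should have mean ~0 and std ~1.
samples = ddpm_sample(lambda x, t: np.sqrt(1 - abar[t]) * x, (50_000,), rng)
print(samples.mean(), samples.std())
```

Swapping in a trained \epsilon_\theta and an image-shaped `shape` turns this into the actual DDPM sampler.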

8.2 Step-by-Step Interpretation

Step 1: Initialization. Sample pure Gaussian noise x_T \sim \mathcal{N}(0, I). This is the starting point of the generation process.

Step 2: Noise Prediction. Feed the current noisy image x_t and time step t into the network \epsilon_\theta to predict the noise contained in x_t.

Step 3: Mean Computation. Compute the mean of the reverse transition using the predicted noise.

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)

Step 4: Stochastic Transition. Generate x_{t-1} by adding scaled Gaussian noise \sigma_t z to the computed mean. No noise is added at the final step (t = 1).

Step 5: Repeat. Repeat the above process from t = T down to t = 1.

8.3 Limitations of Sampling

The biggest drawback of DDPM sampling is speed. Sequential denoising over T = 1000 steps requires 1000 neural network forward passes for a single image. This is extremely slow compared to a GAN's single forward pass, spurring subsequent research on accelerated samplers such as DDIM and DPM-Solver.


9. Architecture: Time-conditioned U-Net

9.1 U-Net Based Design

DDPM's noise prediction network \epsilon_\theta(x_t, t) is based on the U-Net architecture. U-Net was originally proposed by Ronneberger et al. (2015) for medical image segmentation, featuring an Encoder-Decoder structure with Skip Connections that combine features at various resolutions.

DDPM's U-Net is based on the PixelCNN++ structure with the following modifications.

9.2 Key Components

Time Embedding: To inject the time step t into the network, Transformer-style Sinusoidal Positional Encoding is used.

\text{TE}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{TE}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)

This embedding passes through an MLP and is injected into each ResNet Block. Specifically, the time embedding is linearly transformed and then either added (additive) or scaled (FiLM conditioning) onto the intermediate feature maps of the ResNet Block.
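The sinusoidal embedding formula can be sketched in NumPy as follows (an illustrative implementation, with sines and cosines concatenated rather than interleaved, a common and equivalent-in-spirit layout):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal time embedding: frequencies 1 / 10000^(2i/d) for i < d/2."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.array([0, 1, 500]), dim=128)
print(emb.shape)  # (3, 128): one 128-dim vector per time step
```

Each scalar t becomes a dense vector the ResNet blocks can consume after an MLP.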

ResNet Block: Each block consists of the following sequence:

  1. Group Normalization
  2. SiLU (Swish) Activation
  3. Convolution
  4. Time Embedding injection
  5. Group Normalization
  6. SiLU Activation
  7. Dropout
  8. Convolution
  9. Residual Connection

Self-Attention: Multi-Head Self-Attention is applied at feature maps of 16×16 resolution. The spatial dimensions (h, w) are flattened into a sequence of length h × w to perform standard Scaled Dot-Product Attention.

Group Normalization: Group Normalization is used instead of Batch Normalization. It is independent of batch size and provides more stable training for generative models.

9.3 Specific Architecture Specifications

Input: x_t ∈ R^(C×H×W), t ∈ {1,...,T}

Encoder:
  [128, 128] → ↓2 → [256, 256] → ↓2 → [256, 256] → ↓2 → [512, 512] → ↓2
  (Self-Attention applied at the 16×16 resolution)

Bottleneck:
  [512] → Self-Attention → [512]

Decoder (with skip connections):
  [512, 512] → ↑2 → [256, 256] → ↑2 → [256, 256] → ↑2 → [128, 128] → ↑2
  (Self-Attention applied at the 16×16 resolution)

Output: ε_θ ∈ R^(C×H×W)       (predicted noise with the same dimensions as the input)

DDPM used approximately 114M parameters at 256×256 resolution.


10. Experimental Results

10.1 Quantitative Evaluation

DDPM was evaluated on the following benchmarks.

CIFAR-10 (Unconditional, 32×32):

| Model | FID (↓) | IS (↑) |
|-------|---------|--------|
| DDPM | 3.17 | 9.46 |
| StyleGAN2 + ADA | 2.92 | 9.83 |
| NCSN | 25.32 | 8.87 |
| ProgressiveGAN | 15.52 | 8.80 |
| NVAE | 23.5 | - |

DDPM achieved an FID competitive with the strongest GANs of the time, such as StyleGAN2, and far ahead of prior score-based and likelihood-based models.

LSUN (256×256):

| Dataset | FID |
|---------|-----|
| LSUN Bedroom | 4.90 |
| LSUN Cat | - |
| LSUN Church | 7.89 |

10.2 Qualitative Analysis

DDPM samples exhibited several distinct characteristics compared to GANs.

High diversity: While GANs suffer from limited generation diversity due to mode collapse, DDPM covers diverse modes of the data distribution in a balanced manner.

Gradual generation: The progressive transformation from noise to image can be visualized, confirming a coarse-to-fine generation pattern where the model first forms global structure and then adds fine details.

Stable training: Free from GAN's chronic problems of training instability (mode collapse, training oscillation), converging stably with a simple MSE loss.

10.3 Progressive Lossy Compression Interpretation

Ho et al. interpreted DDPM as naturally implementing a Progressive Lossy Decompression scheme. Information is progressively added at each Reverse step, which can be viewed as a generalization of Autoregressive Decoding. Rate-Distortion curve analysis confirmed that most bits are allocated to overall structure rather than perceptually insignificant details.


11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion

11.1 DDIM (Denoising Diffusion Implicit Models)

Song et al., 2021 | arXiv: 2010.02502

Research that addressed DDPM's biggest limitation: slow sampling speed. The core idea is to generalize the Forward Process to be Non-Markovian.

DDIM uses the same trained model ϵθ\epsilon_\theta while modifying only the sampling process.

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t

Setting \sigma_t = 0 makes sampling completely deterministic, which provides:

  • Accelerated sampling: Similar-quality images with only 50-100 steps instead of T = 1000 (a 10-20x speedup)
  • Semantic interpolation: Thanks to the deterministic mapping, interpolation in latent space leads to meaningful image transformations
  • Consistency: Always generates the same image from the same initial noise, ensuring reproducible results
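The deterministic DDIM update (\sigma_t = 0) can be sketched in NumPy (an illustrative implementation, not the official code). As a sanity check: if the predicted noise equals the exact noise that produced x_t, the update lands exactly on the closed-form q(x_{t'} | x_0) point for the earlier step.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_pred, t, t_prev):
    """One deterministic DDIM step (sigma_t = 0): predict x0, then re-noise to t_prev."""
    x0_pred = (xt - np.sqrt(1 - abar[t]) * eps_pred) / np.sqrt(abar[t])
    return np.sqrt(abar[t_prev]) * x0_pred + np.sqrt(1 - abar[t_prev]) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
xt = np.sqrt(abar[800]) * x0 + np.sqrt(1 - abar[800]) * eps

x_prev = ddim_step(xt, eps, t=800, t_prev=600)
print(np.allclose(x_prev, np.sqrt(abar[600]) * x0 + np.sqrt(1 - abar[600]) * eps))
```

Because the step size is free, a sampler can hop over a sparse subsequence of time steps (e.g., 50 of the 1000), which is where the speedup comes from.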

11.2 Improved DDPM

Nichol & Dhariwal, 2021 | arXiv: 2102.09672

Research that improved two aspects of the original DDPM.

Learnable variance: While DDPM fixed \sigma_t^2 as either \beta_t or \tilde{\beta}_t, Improved DDPM makes it learnable. Specifically, \Sigma_\theta is parameterized as an interpolation between \beta_t and \tilde{\beta}_t in log space.

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)

where v is a value output by the network.

Cosine Schedule: Introduced the Cosine Variance Schedule described earlier, greatly improving training efficiency especially for high-resolution images.

Hybrid Loss: Adding a small amount of the variational lower bound L_{\text{vlb}} to L_{\text{simple}} also improved log-likelihood.

L_{\text{hybrid}} = L_{\text{simple}} + \lambda L_{\text{vlb}}

11.3 Classifier Guidance

Dhariwal & Nichol, 2021 | arXiv: 2105.05233

A technique proposed in "Diffusion Models Beat GANs on Image Synthesis" that injects the gradient of a pre-trained classifier into the Reverse Process for conditional generation.

\hat{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - s \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log p_\phi(y | x_t)

where s is the guidance scale and p_\phi is a classifier trained on noisy images. Increasing s reduces diversity but increases fidelity to a specific class. In this paper, Diffusion Models first surpassed GANs in FID (CIFAR-10 FID 2.97, ImageNet 256×256 FID 4.59).

Limitation: A separate classifier must be trained on noisy data, complicating the training pipeline.

11.4 Classifier-Free Guidance (CFG)

Ho & Salimans, 2022 | arXiv: 2207.12598

An innovative technique that achieves guidance effects without a separate classifier, and has become the de facto standard in modern Diffusion Models.

The core idea is for a single network to learn both conditional and unconditional generation. During training, the condition information c is replaced with a null token \varnothing with a certain probability (typically 10-20%).

At inference, conditional and unconditional predictions are linearly combined.

\hat{\epsilon}_\theta(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)

where w is the guidance weight. When w = 0, standard conditional generation occurs; when w > 0, fidelity to the condition increases.

Rearranging gives the following interpretation:

\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing) + (1 + w) \cdot \underbrace{(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))}_{\text{shift toward the condition}}

This can be interpreted as pushing the prediction away from the unconditional direction and toward the conditional one, with larger w increasing the push. Nearly all state-of-the-art Text-to-Image models, including DALL-E 2, Stable Diffusion, and Imagen, use CFG.
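The CFG combination itself is a one-line operation; this NumPy sketch (illustrative toy values, not from any model) shows both the w = 0 and w > 0 cases:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1 + w) * eps_cond - w * eps_uncond

eps_uncond = np.array([0.0, 0.0])     # stand-in for eps_theta(x_t, t, null)
eps_cond = np.array([1.0, -1.0])      # stand-in for eps_theta(x_t, t, c)

print(cfg_combine(eps_cond, eps_uncond, w=0.0))  # w = 0: plain conditional prediction
print(cfg_combine(eps_cond, eps_uncond, w=2.0))  # w > 0: amplified shift toward c
```

Note that this costs two forward passes per step (conditional and unconditional), which is the price of guidance at inference time.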

11.5 Latent Diffusion Models (LDM) / Stable Diffusion

Rombach et al., 2022 | arXiv: 2112.10752

LDM dramatically improved computational efficiency by performing the Diffusion Process in latent space rather than pixel space.

Key Architecture:

  1. Perceptual Compression: A pre-trained Autoencoder (VQ-VAE or KL-regularized VAE) encoder \mathcal{E} compresses the image x into a low-dimensional latent z = \mathcal{E}(x). Typically, a 256×256×3 image is compressed to a 32×32×4 latent (approximately a 48x dimensionality reduction).

  2. Latent Diffusion: DDPM's Forward/Reverse Process is performed in this latent space. Computation is significantly reduced compared to pixel space.

  3. Cross-Attention Conditioning: Condition information such as text and segmentation maps is injected into the U-Net via Cross-Attention. For text, CLIP or BERT embeddings are used.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V

where Q = W_Q \cdot \varphi(z_t), K = W_K \cdot \tau_\theta(y), V = W_V \cdot \tau_\theta(y), and \tau_\theta(y) is the encoding of the condition information y.

Stable Diffusion is trained by combining this LDM architecture with a CLIP text encoder and a large-scale dataset (LAION-5B), becoming the de facto standard for open-source Text-to-Image models.

11.6 Score SDE

Song et al., 2021 | arXiv: 2011.13456

This ICLR 2021 Oral presentation connected DDPM and Score Matching under the unified framework of Stochastic Differential Equations (SDE).

Key contributions:

  • Variance Exploding (VE) SDE: Corresponds to the NCSN/SMLD family
  • Variance Preserving (VP) SDE: Corresponds to DDPM
  • Sub-VP SDE: A variant providing better likelihood
\text{VP-SDE}: \quad dx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw

The extension to continuous time enables exact log-likelihood computation (via ODE), more flexible sampler design, and conditional generation tasks such as Inpainting and Colorization.

11.7 Consistency Models

Song et al., 2023 | arXiv: 2303.01469

Consistency Models, proposed by Yang Song at OpenAI, represent an attempt to fundamentally solve the multi-step sampling problem of Diffusion Models.

The core idea is to learn a function fθf_\theta that maps all points on an ODE trajectory to the same starting point (original data).

f_\theta(x_t, t) = x_0, \quad \forall t \in [0, T]

By this self-consistency property, data can be recovered from a noisy sample at any time t with a single network evaluation. That is, one-step generation is possible.

Two training approaches exist:

  • Consistency Distillation (CD): Distilling from a pre-trained Diffusion Model
  • Consistency Training (CT): Training independently without pre-training

In 2024, Easy Consistency Models (ECM) emerged, achieving better 2-step generation performance at 33% of the training cost compared to iCT.

11.8 Flow Matching / Rectified Flow

Lipman et al., 2023; Liu et al., 2023 | arXiv: 2210.02747, arXiv: 2209.03003

Flow Matching is an alternative approach to Diffusion Models that directly learns the probability flow connecting data and noise distributions.

Core Idea: Define straight paths from noise \epsilon \sim \mathcal{N}(0, I) (at t = 1) to data x_0 (at t = 0).

x_t = (1-t)\, x_0 + t\, \epsilon, \quad t \in [0, 1]

Learn a velocity field v_\theta(x_t, t) along this path; for a straight path, the target velocity pointing from noise toward data is the constant x_0 - \epsilon.

L_{\text{FM}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| v_\theta(x_t, t) - (x_0 - \epsilon) \|^2 \right]

Rectified Flow repeatedly "straightens" these paths (reflow), producing high-quality samples even with few steps.
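The appeal of straight paths can be seen in a tiny NumPy sketch (illustrative, not from either paper): with the analytically exact velocity for a known pair, which is the constant x_0 - \epsilon, Euler integration from t = 1 down to t = 0 recovers the data point exactly, regardless of step count.

```python
import numpy as np

def euler_sample(v_field, x1, n_steps=10):
    """Integrate dx = v dt from t = 1 down to t = 0 with fixed Euler steps."""
    x, dt = x1.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x + dt * v_field(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)

# Straight path x_t = (1-t) x0 + t eps has constant sampling velocity x0 - eps,
# so Euler integration is exact here.
x_gen = euler_sample(lambda x, t: x0 - eps, x1=eps.copy())
print(np.allclose(x_gen, x0))
```

A learned v_\theta only approximates this field, and real marginal paths are not perfectly straight, which is precisely what Rectified Flow's reflow procedure tries to correct.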

Stable Diffusion 3 adopted Rectified Flow, presenting a new paradigm for Diffusion Models alongside the transition from U-Net to Transformer.

11.9 DiT (Diffusion Transformer)

Peebles & Xie, 2023 | arXiv: 2212.09748

DiT replaced the Diffusion Model backbone from U-Net to Vision Transformer (ViT).

Key design choices:

  • Images are divided into patches and processed as tokens
  • Time step t and class label y are injected via Adaptive Layer Normalization (adaLN-Zero)
  • Composed of L stacked Transformer Blocks

DiT, combined with Latent Diffusion, achieved FID 2.27 on ImageNet 256×256 class-conditional generation, surpassing all previous Diffusion Models.

Significance of DiT: It empirically demonstrated that Transformer scaling laws can be applied to Diffusion Models. Performance consistently improves with increased model size and training compute. This finding directly influenced the architectural choices of the latest large-scale generative models such as Sora (OpenAI, Video generation) and Stable Diffusion 3.


12. PyTorch Code Examples: Simple DDPM Implementation

Below is a simplified PyTorch implementation of DDPM's core components. A more sophisticated U-Net and hyperparameter tuning would be needed for actual training.

12.1 Noise Schedule and Forward Process

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class DDPMScheduler:
    """Scheduler managing DDPM's Forward Process."""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
        self.num_timesteps = num_timesteps

        if schedule == 'linear':
            self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        elif schedule == 'cosine':
            self.betas = self._cosine_schedule(num_timesteps)
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        # Pre-compute key variables
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)          # ᾱ_t
        self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)

        # Forward process coefficients
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)        # √ᾱ_t
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)  # √(1-ᾱ_t)

        # Reverse process coefficients
        self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)           # 1/√α_t
        self.posterior_variance = (
            self.betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
        )  # β̃_t

    def _cosine_schedule(self, timesteps, s=0.008):
        steps = timesteps + 1
        t = torch.linspace(0, timesteps, steps) / timesteps
        alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        return torch.clip(betas, 0.0001, 0.9999)

    def add_noise(self, x_0, t, noise=None):
        """Forward process: compute q(x_t | x_0) in one step."""
        if noise is None:
            noise = torch.randn_like(x_0)

        # Move coefficient tables to t's device before indexing (they live on CPU)
        sqrt_alpha_cumprod = self.sqrt_alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_cumprod = self.sqrt_one_minus_alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)

        # x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
        x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise
        return x_t

12.2 Simplified U-Net

class SinusoidalPositionEmbedding(nn.Module):
    """Transformer-style Sinusoidal Time Embedding."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb


class ResBlock(nn.Module):
    """Time-conditioned Residual Block."""

    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_ch),
        )
        self.norm2 = nn.GroupNorm(8, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # Inject time embedding
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)                           # Residual connection


class SimpleUNet(nn.Module):
    """Simplified U-Net for DDPM training."""

    def __init__(self, in_channels=3, base_channels=64, time_emb_dim=256):
        super().__init__()

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbedding(base_channels),
            nn.Linear(base_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )

        # Encoder
        self.enc1 = ResBlock(in_channels, base_channels, time_emb_dim)
        self.enc2 = ResBlock(base_channels, base_channels * 2, time_emb_dim)
        self.enc3 = ResBlock(base_channels * 2, base_channels * 4, time_emb_dim)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bot = ResBlock(base_channels * 4, base_channels * 4, time_emb_dim)

        # Decoder (with skip connections)
        self.dec3 = ResBlock(base_channels * 8, base_channels * 2, time_emb_dim)
        self.dec2 = ResBlock(base_channels * 4, base_channels, time_emb_dim)
        self.dec1 = ResBlock(base_channels * 2, base_channels, time_emb_dim)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

        # Output
        self.out = nn.Conv2d(base_channels, in_channels, 1)

    def forward(self, x, t):
        t_emb = self.time_mlp(t)

        # Encoder
        e1 = self.enc1(x, t_emb)
        e2 = self.enc2(self.pool(e1), t_emb)
        e3 = self.enc3(self.pool(e2), t_emb)

        # Bottleneck
        b = self.bot(self.pool(e3), t_emb)

        # Decoder with skip connections
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1), t_emb)
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1), t_emb)
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1), t_emb)

        return self.out(d1)  # Predicted noise ε_θ

12.3 Training Loop

def train_ddpm(model, dataloader, scheduler, epochs=100, lr=2e-4, device='cuda'):
    """DDPM training loop (Algorithm 1 implementation)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        total_loss = 0
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)

            # 1. Select random time step: t ~ Uniform({0, ..., T-1}) (0-indexed form of Algorithm 1's {1, ..., T})
            t = torch.randint(0, scheduler.num_timesteps, (x_0.shape[0],), device=device)

            # 2. Sample noise: ε ~ N(0, I)
            noise = torch.randn_like(x_0)

            # 3. Forward process: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
            x_t = scheduler.add_noise(x_0, t, noise)

            # 4. Predict noise: ε_θ(x_t, t)
            noise_pred = model(x_t, t)

            # 5. Simplified loss: L = ||ε - ε_θ(x_t, t)||²
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

12.4 Sampling

@torch.no_grad()
def sample_ddpm(model, scheduler, image_shape, device='cuda'):
    """DDPM sampling (Algorithm 2 implementation)."""
    model.eval()

    # x_T ~ N(0, I)
    x = torch.randn(image_shape, device=device)

    for t in reversed(range(scheduler.num_timesteps)):
        t_batch = torch.full((image_shape[0],), t, device=device, dtype=torch.long)

        # Predict noise
        predicted_noise = model(x, t_batch)

        # Reverse process coefficients
        alpha_t = scheduler.alphas[t]
        alpha_cumprod_t = scheduler.alphas_cumprod[t]
        beta_t = scheduler.betas[t]

        # Compute mean: μ_θ = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ)
        mean = (1.0 / torch.sqrt(alpha_t)) * (
            x - (beta_t / torch.sqrt(1.0 - alpha_cumprod_t)) * predicted_noise
        )

        if t > 0:
            # Add stochastic noise (except at the last step)
            noise = torch.randn_like(x)
            sigma_t = torch.sqrt(scheduler.posterior_variance[t])
            x = mean + sigma_t * noise
        else:
            x = mean

    return x

12.5 Usage Example

# Hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 32
batch_size = 128
num_timesteps = 1000

# Initialize scheduler and model
scheduler = DDPMScheduler(num_timesteps=num_timesteps, schedule='cosine')
model = SimpleUNet(in_channels=3, base_channels=64).to(device)

# Dataset (e.g., CIFAR-10)
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # Normalize to [-1, 1]
])
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Train
train_ddpm(model, dataloader, scheduler, epochs=100, device=device)

# Sample
samples = sample_ddpm(model, scheduler, (16, 3, image_size, image_size), device=device)
# samples: 16 generated images in [-1, 1] range

13. Diffusion Model vs GAN vs VAE: Comparative Analysis

13.1 Comprehensive Comparison Table

| Property | Diffusion Model (DDPM) | GAN | VAE |
|---|---|---|---|
| Training Method | Noise prediction (MSE) | Adversarial training (Min-Max) | Variational inference (ELBO) |
| Training Stability | Very stable | Unstable (mode collapse, oscillation) | Stable |
| Generation Quality | Very high | Very high | Moderate (blurry) |
| Diversity | High (full distribution coverage) | Low (mode collapse risk) | High |
| Generation Speed | Slow (1000 steps) | Very fast (1 step) | Fast (1 step) |
| Log-likelihood | Computable (ELBO) | Not computable | Computable (ELBO) |
| Latent Space | Implicit | None (or limited) | Explicit, continuous |
| Mode Coverage | High | Low | High |
| Conditional Generation | Very effective via CFG | Possible via cGAN | Conditional VAE |
| Resolution Scaling | Efficient via LDM | Progressive training needed | Hierarchical VAE needed |
| Theoretical Basis | Thermodynamics, Score Matching | Game theory | Variational Bayes |
| Representative Models | Stable Diffusion, DALL-E 2 | StyleGAN, BigGAN | VQ-VAE-2, NVAE |
| CIFAR-10 FID | ~2.0 (latest) | ~2.9 (StyleGAN2) | ~23.5 (NVAE) |

13.2 When to Choose Which Model?

Choose Diffusion Models when:

  • Both generation quality and diversity are important
  • Complex conditional generation like text-to-image is needed
  • Training stability is critical
  • Generation speed is not the top priority

Choose GANs when:

  • Real-time generation is needed
  • High-quality images for a specific domain are needed (faces, landscapes, etc.)
  • The dataset is relatively small and uniform

Choose VAEs when:

  • Meaningful Latent Space manipulation is needed
  • Likelihood-based anomaly detection is needed
  • Fast encoding/decoding is required
  • Semi-supervised learning or representation learning is the main purpose

14. Present and Future of Diffusion Models

14.1 Current Trends
Architecture Transition: From U-Net to Transformer. The latest models such as Stable Diffusion 3, FLUX, and Sora adopt DiT-based architectures. Transformer scaling laws have been confirmed to apply to Diffusion Models, and model scale expansion (8B+ parameters) is actively underway.

Sampling Efficiency. With advances in Consistency Models, Flow Matching, and DPM-Solver, 1-4 step generation has become possible. Rectified Flow learns straight paths, achieving high quality even with few steps.

Multimodal Expansion. Diffusion Models are expanding beyond images to video (Sora, Runway Gen-3), audio (AudioLDM), 3D (DreamFusion, Zero-1-to-3), robotics (Diffusion Policy), and other domains.

Acceleration and Optimization. Techniques such as Distillation, Quantization, and Caching have greatly improved inference speed, approaching real-time image generation.

14.2 Historical Significance of DDPM

DDPM represents a turning point in generative model history in the following ways:

  1. Demonstrated the competitiveness of Likelihood-based models in the image generation space dominated by GANs
  2. Showed that high-quality generation is possible with an extremely simple training objective (L_\text{simple})
  3. Established a theoretical framework connecting thermodynamics and Score Matching
  4. Became the direct foundation of the modern AI revolution including Stable Diffusion, DALL-E 2, and Midjourney

15. References

  1. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239

  2. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585

  3. Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models (DDIM). ICLR 2021. arXiv:2010.02502

  4. Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672

  5. Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021. arXiv:2105.05233

  6. Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop. arXiv:2207.12598

  7. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752

  8. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456

  9. Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. arXiv:2303.01469

  10. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747

  11. Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). ICCV 2023. arXiv:2212.09748

  12. Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600

  13. Weng, L. (2021). What are Diffusion Models? lilianweng.github.io

  14. Hugging Face. The Annotated Diffusion Model. huggingface.co/blog/annotated-diffusion