Complete Analysis of the DDPM Paper: The Mathematics and Principles of Diffusion Models that Create Images from Noise


1. Paper Overview

"Denoising Diffusion Probabilistic Models" (DDPM) was published at NeurIPS 2020, co-authored by Jonathan Ho, Ajay Jain, and Pieter Abbeel from UC Berkeley. This paper is a landmark study that empirically demonstrated that high-quality image synthesis is achievable through diffusion probabilistic models.

The core idea is surprisingly simple. Define a Forward Process that gradually adds Gaussian noise to data, and learn a Reverse Process that step-by-step removes this noise to recover the original data. The final training objective reduces to a simple MSE loss between "model-predicted noise" and "actually added noise."

DDPM achieved FID 3.17 and Inception Score 9.46 on CIFAR-10, showing performance comparable to or surpassing GAN-based models of the time. More importantly, this paper became the foundation of modern image generation AI including DALL-E 2, Imagen, Stable Diffusion, and Midjourney.

Paper Information

  • Title: Denoising Diffusion Probabilistic Models
  • Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
  • Venue: NeurIPS 2020
  • arXiv: 2006.11239
  • Official Code: hojonathanho/diffusion

2. Background: From Thermodynamics to Generative Models

2.1 Inspiration from Non-equilibrium Thermodynamics

The intellectual origin of Diffusion Models lies in non-equilibrium statistical mechanics. In physics, diffusion refers to the process where particles randomly move from high-concentration regions to low-concentration regions, eventually reaching a state of thermal equilibrium (maximum entropy). The key insight of this process is:

  • Forward: a state with complex structure → a disordered equilibrium state (information destruction)
  • Reverse: the equilibrium state → restoration to a structured state (information creation)

Sohl-Dickstein et al. (2015) first applied this idea to machine learning, publishing "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." By defining a diffusion process that transforms a complex data distribution into a simple known distribution (Gaussian), and learning its reverse process, one obtains a generative model.

2.2 Connection with Score Matching

Another theoretical pillar of Diffusion Models is Score Matching. The score function is defined as the gradient of the log probability density.

\nabla_x \log p(x)

If this score function can be estimated, samples can be generated through Langevin Dynamics.

x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon}\, z, \quad z \sim \mathcal{N}(0, I)

Yang Song and Stefano Ermon (2019) proposed Noise Conditional Score Networks (NCSN) in "Generative Modeling by Estimating Gradients of the Data Distribution," presenting a method for estimating the score function at various noise levels. Ho et al.'s DDPM is deeply connected to this Score Matching perspective, and the paper explicitly cites "a new connection with denoising score matching with Langevin dynamics" as a core contribution.
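To make the Langevin update above concrete, here is a minimal NumPy sketch (not from the paper) that samples from a toy 1-D standard Gaussian, whose score is known in closed form as \nabla_x \log p(x) = -x; the function name `langevin_sample` is illustrative only.

```python
import numpy as np

def langevin_sample(score_fn, x0, n_steps=1000, eps=0.01, seed=0):
    """Run the Langevin dynamics update x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eps * score_fn(x) + np.sqrt(eps) * z
    return x

# 5000 independent chains targeting N(0, 1); the score of N(0, 1) is -x.
# Starting far from the mode (at 3.0), the chains relax to the target,
# so the empirical mean and std should approach 0 and 1.
samples = langevin_sample(lambda x: -x, x0=np.full(5000, 3.0))
print(samples.mean(), samples.std())
```

With a learned score network in place of the analytic score, the same loop becomes a generative sampler.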

2.3 SDE Perspective: A Unified Framework

Song et al. (2021) unified DDPM and Score Matching under the framework of Stochastic Differential Equations (SDE) in "Score-Based Generative Modeling through Stochastic Differential Equations." The Forward Process described as a continuous-time SDE takes the form:

dx = f(x, t)\, dt + g(t)\, dw

where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process. A corresponding reverse-time SDE exists:

dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}

The key insight is that solving the reverse SDE requires only the time-dependent score function \nabla_x \log p_t(x). DDPM's noise prediction network \epsilon_\theta is essentially equivalent to estimating this score function.

\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log p(x_t)

This relationship is the key link that theoretically unifies DDPM and Score Matching.


3. Forward Process: Systematically Adding Noise

3.1 Forward Process as a Markov Chain

The Forward Process (or Diffusion Process) is a fixed Markov Chain that gradually adds Gaussian noise to the original data x_0. It has no learnable parameters and is entirely determined by a predefined Variance Schedule \{\beta_1, \beta_2, ..., \beta_T\}.

The transition probability at each time step t is defined as:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \, \beta_t I)

In plain terms, at each step the data from the previous time step is scaled down by \sqrt{1 - \beta_t} and Gaussian noise with variance \beta_t is added.

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)

Why scale by \sqrt{1 - \beta_t}? To preserve the total variance at each step. If the variance of x_{t-1} is 1, then the variance of \sqrt{1-\beta_t}\, x_{t-1} is 1-\beta_t, and adding noise with variance \beta_t gives a total variance of (1-\beta_t) + \beta_t = 1.

When T is sufficiently large and \beta_t is appropriately set, x_T converges to nearly pure isotropic Gaussian noise \mathcal{N}(0, I).
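The variance-preservation argument above can be checked numerically; this NumPy sketch (not from the paper) pushes 100,000 unit-variance scalar samples through 1000 forward steps under the paper's linear schedule.

```python
import numpy as np

# Numerical check of variance preservation: start with unit-variance data and
# repeatedly apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)              # variance ~ 1
betas = np.linspace(1e-4, 0.02, 1000)         # linear schedule from the paper
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
print(x.var())  # remains close to 1 after all 1000 steps
```

The final variance stays near 1, which is exactly why the \sqrt{1-\beta_t} scaling is used.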

3.2 Complete Forward Process

The joint distribution of the complete Forward Process over T steps is:

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})

This follows from the Markov property: each step depends only on the immediately preceding step. In DDPM, T = 1000 is used, with \beta_t increasing linearly from \beta_1 = 10^{-4} to \beta_T = 0.02.


4. Core Mathematics: Reparameterization Trick

4.1 Jumping to an Arbitrary Time t in One Step

The most powerful mathematical property of the Forward Process is that x_t at any arbitrary time t can be computed directly from x_0 without passing through the intermediate steps. This is what makes training efficient.

First, define the notation:

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

\bar{\alpha}_t is the cumulative product of the \alpha_s, representing how much of the original signal is preserved up to time t.

4.2 Derivation

Starting from x_1 and deriving inductively:

x_1 = \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon_0
x_2 = \sqrt{\alpha_2}\, x_1 + \sqrt{1 - \alpha_2}\, \epsilon_1

Substituting x_1 into x_2:

x_2 = \sqrt{\alpha_2}\left( \sqrt{\alpha_1}\, x_0 + \sqrt{1 - \alpha_1}\, \epsilon_0 \right) + \sqrt{1 - \alpha_2}\, \epsilon_1 = \sqrt{\alpha_1 \alpha_2}\, x_0 + \sqrt{\alpha_2(1-\alpha_1)}\, \epsilon_0 + \sqrt{1-\alpha_2}\, \epsilon_1

Applying the sum rule for independent Gaussians: the sum of two independent Gaussians \mathcal{N}(0, \sigma_1^2 I) and \mathcal{N}(0, \sigma_2^2 I) follows \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)I).

Summing the noise variances:

\alpha_2(1-\alpha_1) + (1-\alpha_2) = \alpha_2 - \alpha_1\alpha_2 + 1 - \alpha_2 = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2

Therefore:

x_2 = \sqrt{\bar{\alpha}_2}\, x_0 + \sqrt{1 - \bar{\alpha}_2}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Generalizing this yields the following.

4.3 Final Result: Closed-form Expression

\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, \, (1 - \bar{\alpha}_t) I)}

That is, x_t at any time t can be sampled in one step:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

The intuitive interpretation of this formula is:

| Term | Meaning | Change over time |
|------|---------|------------------|
| \sqrt{\bar{\alpha}_t}\, x_0 | Original signal | As t increases, \bar{\alpha}_t decreases: the signal fades |
| \sqrt{1 - \bar{\alpha}_t}\, \epsilon | Added noise | As t increases, 1-\bar{\alpha}_t increases: the noise grows |

At t = 0, \bar{\alpha}_0 = 1, so we get x_0 as-is; at t = T, \bar{\alpha}_T \approx 0, so x_T is nearly pure noise. This gradual decrease in the Signal-to-Noise Ratio (SNR) is the essence of the Forward Process.

\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}
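The closed-form expression can be verified empirically. This NumPy sketch (not from the paper; the helper name `q_sample` is illustrative) jumps directly to t = 500 and checks that the empirical mean and variance match \sqrt{\bar{\alpha}_t}\, x_0 and 1 - \bar{\alpha}_t.

```python
import numpy as np

# One-step forward sampling from q(x_t | x_0):
# x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(100_000)                      # toy "data": constant signal 1
xt = q_sample(x0, t=500, rng=rng)
# Empirical moments should match sqrt(abar_500) and 1 - abar_500.
print(xt.mean(), np.sqrt(alphas_cumprod[500]))
print(xt.var(), 1 - alphas_cumprod[500])
```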

5. Reverse Process: Recovering Images from Noise

5.1 Definition of the Reverse Process

The Reverse Process starts from pure noise x_T \sim \mathcal{N}(0, I) and progressively removes noise to generate data x_0. If each step of the Forward Process is a small Gaussian perturbation, the key assumption is that its reverse can also be approximated as Gaussian (when \beta_t is sufficiently small).

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)
p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Here, \mu_\theta and \Sigma_\theta are the mean and variance that the neural network must learn. In DDPM, the variance \Sigma_\theta is not learned but fixed as \sigma_t^2 I, where either \sigma_t^2 = \beta_t or \sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t is used.

5.2 Derivation of the Posterior q(x_{t-1} | x_t, x_0)

The key to training is that the reverse conditional distribution (posterior) given x_0 is computable in closed form. Applying Bayes' theorem:

q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0)\, q(x_{t-1} | x_0)}{q(x_t | x_0)}

By the Markov property, q(x_t | x_{t-1}, x_0) = q(x_t | x_{t-1}), so all three terms are Gaussian. Since the product of Gaussians is also Gaussian, expanding the exponents and rearranging as a quadratic in x_{t-1} yields:

q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)

where the posterior mean is:

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t

and the posterior variance is:

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t

5.3 Replacing x_0 with \epsilon

Since the model cannot directly observe x_0, we invert the reparameterization formula x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon to express x_0:

x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left( x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon \right)

Substituting this into the posterior mean \tilde{\mu}_t:

\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right)

If the model learns a network \epsilon_\theta(x_t, t) that predicts the noise \epsilon, the Reverse Process mean is computed as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)

This is why noise prediction is equivalent to mean prediction in DDPM's Reverse Process.
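The x_0 \leftrightarrow \epsilon relationship above is a simple algebraic inversion, which this NumPy sketch (not from the paper) verifies: given the exact noise that produced x_t, inverting the reparameterization recovers x_0 exactly.

```python
import numpy as np

# Sanity check: invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0.
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 700

xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps
x0_rec = (xt - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
print(np.allclose(x0_rec, x0))
```

In practice \epsilon is unknown at sampling time, so the trained network's prediction \epsilon_\theta(x_t, t) takes its place, giving only an estimate of x_0.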


6. Deriving the Training Objective: From ELBO to Simplified Loss

6.1 Maximum Likelihood and ELBO

The ultimate goal of a generative model is to maximize the data log-likelihood \log p_\theta(x_0). However, since this is intractable to compute directly, we optimize the Evidence Lower Bound (ELBO).

Applying Jensen's inequality:

\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right] = \text{ELBO}

6.2 Decomposition of the ELBO

Decomposing the ELBO into KL divergence terms:

\text{ELBO} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0:\ \text{reconstruction}} - \underbrace{D_{\text{KL}}(q(x_T | x_0)\, \|\, p(x_T))}_{L_T:\ \text{prior matching}} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_q\left[ D_{\text{KL}}(q(x_{t-1}|x_t, x_0)\, \|\, p_\theta(x_{t-1}|x_t)) \right]}_{L_{t-1}:\ \text{denoising matching}}

Analyzing the meaning of each term:

L_T (Prior Matching): Measures how well q(x_T | x_0) matches the prior distribution p(x_T) = \mathcal{N}(0, I). When T is sufficiently large, this term converges to 0, and since it has no learnable parameters, it is ignored as a constant.

L_0 (Reconstruction): Measures the ability to reconstruct x_0 from x_1. Since x_0 and x_1 are very similar, its impact on overall training is small.

L_{t-1} (Denoising Matching): The core training signal, measuring how well the model's reverse transition p_\theta(x_{t-1} | x_t) matches the true posterior q(x_{t-1} | x_t, x_0).

6.3 KL Divergence Computation

The KL divergence between two Gaussians is computable in closed form. Since q(x_{t-1} | x_t, x_0) = \mathcal{N}(\tilde{\mu}_t, \tilde{\beta}_t I) and p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I):

D_{\text{KL}}(q\, \|\, p_\theta) = \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2 + C

where C is a constant depending only on the variances. With the variance fixed, only the difference in means remains as the training objective.

6.4 Reparameterization to Noise Prediction

Substituting the expressions for \tilde{\mu}_t and \mu_\theta derived earlier:

\|\tilde{\mu}_t - \mu_\theta\|^2 = \frac{\beta_t^2}{(1-\bar{\alpha}_t)\alpha_t} \|\epsilon - \epsilon_\theta(x_t, t)\|^2

The Simplified Loss with the weighting coefficient removed is:

\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]}

where t \sim \text{Uniform}(\{1, ..., T\}), x_0 \sim q(x_0), and \epsilon \sim \mathcal{N}(0, I).

This is DDPM's most important contribution. Starting from the complex ELBO, it ultimately arrives at the MSE between the actual noise \epsilon and the predicted noise \epsilon_\theta, one of the simplest loss functions in machine learning. Experimentally, this simplified loss also produces better sample quality than the weighted variational bound.

6.5 Training Algorithm Summary

Algorithm 1: Training
─────────────────────────────────
repeat
    x_0 ~ q(x_0)                    # Sample from dataset
    t ~ Uniform({1, ..., T})         # Select random time step
    ε ~ N(0, I)                      # Sample standard Gaussian noise
    x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε  # Generate noisy image
    ∇_θ ||ε - ε_θ(x_t, t)||²        # Compute gradient and update
until converged
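One iteration of Algorithm 1 can be sketched in NumPy (not from the official code; `eps_model` is a stand-in for the real network, and no gradient step is taken since this only evaluates the loss).

```python
import numpy as np

# One iteration of Algorithm 1: sample t, eps, form x_t, compute the MSE.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def training_loss(eps_model, x0, rng):
    t = int(rng.integers(0, len(betas)))                 # t ~ Uniform({1..T})
    eps = rng.standard_normal(x0.shape)                  # eps ~ N(0, I)
    xt = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)        # L_simple

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4096)
# A predictor that always outputs zero should score ~E[eps^2] = 1 per dimension.
loss = training_loss(lambda xt, t: np.zeros_like(xt), x0, rng)
print(loss)
```

In a real PyTorch training loop, `eps_model` would be the U-Net and the loss would be backpropagated with an optimizer step.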

7. Noise Scheduling: Design of \beta_t

7.1 Linear Schedule (Original DDPM)

Ho et al. used a schedule where \beta_t increases linearly from \beta_1 = 10^{-4} to \beta_T = 0.02.

\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)

The intuition behind this schedule is to add small noise initially to gradually destroy data structure, and larger noise in later stages to rapidly converge to a Gaussian.

7.2 Problems with the Linear Schedule

Nichol & Dhariwal (2021, "Improved Denoising Diffusion Probabilistic Models") identified two issues with the Linear Schedule.

First, information is destroyed too quickly in the early stages. \bar{\alpha}_t drops rapidly at the beginning, so significant noise is added even at small t. This is particularly problematic for high-resolution images.

Second, late time steps are wasted. At large t, \bar{\alpha}_t \approx 0, meaning x_t is already close to pure noise and contributes little to meaningful training.

7.3 Cosine Schedule

The Cosine Schedule proposed by Nichol & Dhariwal defines \bar{\alpha}_t directly.

\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2

where s = 0.008 is a small offset that prevents \beta_t from becoming too small near t = 0.

The key characteristics of the Cosine Schedule are:

  • \bar{\alpha}_t decreases nearly linearly in the middle range, providing uniformly useful training signals across all time steps
  • Prevents excessive noise addition in the early stages, preserving fine details
  • Ensures smooth transition to complete noise in the later stages
import torch
import math

def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine schedule as proposed in Nichol & Dhariwal (2021)."""
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule as proposed in Ho et al. (2020)."""
    return torch.linspace(beta_start, beta_end, timesteps)

7.4 Schedule Comparison

| Property | Linear Schedule | Cosine Schedule |
|----------|-----------------|-----------------|
| \bar{\alpha}_t decay pattern | Rapid early, gradual late | Nearly linear in the middle |
| Early information preservation | Low | High |
| Late time step utilization | Inefficient (already pure noise) | Efficient |
| High-resolution suitability | Low | High |
| Used in original DDPM | Yes | No |
| Used in Improved DDPM | No | Yes |

8. Sampling Algorithm

8.1 DDPM Sampling

After training is complete, the DDPM sampling algorithm for generating new images is:

Algorithm 2: Sampling
─────────────────────────────────
x_T ~ N(0, I)                          # Start from pure noise
for t = T, T-1, ..., 1:
    z ~ N(0, I)  if t > 1, else z = 0  # No noise added at the last step
    x_{t-1} = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
return x_0
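Algorithm 2 can be sketched end to end in NumPy (not from the official code). As a stand-in for a trained network, this uses the analytically optimal noise predictor for toy data x_0 \sim \mathcal{N}(0, I), which is \epsilon_\theta(x_t, t) = \sqrt{1-\bar{\alpha}_t}\, x_t; sampling should then return approximately unit-Gaussian samples.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
abar_prev = np.concatenate([[1.0], abar[:-1]])
post_var = betas * (1 - abar_prev) / (1 - abar)      # beta_tilde_t

def ddpm_sample(eps_model, shape, rng):
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at t = 1
        mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(post_var[t]) * z
    return x

rng = np.random.default_rng(0)
# Optimal predictor for x0 ~ N(0, I); samples should have mean ~0 and std ~1.
samples = ddpm_sample(lambda x, t: np.sqrt(1 - abar[t]) * x, (50_000,), rng)
print(samples.mean(), samples.std())
```

Swapping in a trained \epsilon_\theta and an image-shaped `shape` turns this into the actual DDPM sampler.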

8.2 Step-by-Step Interpretation

Step 1: Initialization. Sample pure Gaussian noise x_T \sim \mathcal{N}(0, I). This is the starting point of the generation process.

Step 2: Noise Prediction. Feed the current noisy image x_t and time step t into the network \epsilon_\theta to predict the noise contained in x_t.

Step 3: Mean Computation. Compute the mean of the reverse transition using the predicted noise.

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)

Step 4: Stochastic Transition. Generate x_{t-1} by adding scaled Gaussian noise \sigma_t z to the computed mean. No noise is added at the final step (t = 1).

Step 5: Repeat. Repeat the above process from t = T down to t = 1.

8.3 Limitations of Sampling

The biggest drawback of DDPM sampling is speed. Sequential denoising over T = 1000 steps requires 1000 neural network forward passes for a single image. This is extremely slow compared to a GAN's single forward pass, spurring subsequent research on accelerated samplers such as DDIM and DPM-Solver.


9. Architecture: Time-conditioned U-Net

9.1 U-Net Based Design

DDPM's noise prediction network \epsilon_\theta(x_t, t) is based on the U-Net architecture. U-Net was originally proposed by Ronneberger et al. (2015) for medical image segmentation, featuring an Encoder-Decoder structure with Skip Connections that combine features at various resolutions.

DDPM's U-Net is based on the PixelCNN++ structure with the following modifications.

9.2 Key Components

Time Embedding: To inject the time step t into the network, Transformer-style Sinusoidal Positional Encoding is used.

\text{TE}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{TE}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)

This embedding passes through an MLP and is injected into each ResNet Block. Specifically, the time embedding is linearly transformed and then either added (additive) or scaled (FiLM conditioning) onto the intermediate feature maps of the ResNet Block.
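The sinusoidal embedding formula can be sketched in NumPy as follows (an illustrative implementation, with sines and cosines concatenated rather than interleaved, a common and equivalent-in-spirit layout):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal time embedding: frequencies 1 / 10000^(2i/d) for i < d/2."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.array([0, 1, 500]), dim=128)
print(emb.shape)  # (3, 128): one 128-dim vector per time step
```

Each scalar t becomes a dense vector the ResNet blocks can consume after an MLP.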

ResNet Block: Each block consists of the following sequence:

  1. Group Normalization
  2. SiLU (Swish) Activation
  3. Convolution
  4. Time Embedding injection
  5. Group Normalization
  6. SiLU Activation
  7. Dropout
  8. Convolution
  9. Residual Connection

Self-Attention: Multi-Head Self-Attention is applied at feature maps of 16×16 resolution. The spatial dimensions (h, w) are flattened into a sequence of length h × w to perform standard Scaled Dot-Product Attention.

Group Normalization: Group Normalization is used instead of Batch Normalization. It is independent of batch size and provides more stable training for generative models.

9.3 Specific Architecture Specifications

Input: x_t ∈ R^(C×H×W), t ∈ {1,...,T}

Encoder:
  [128, 128] → ↓2 → [256, 256] → ↓2 → [256, 256] → ↓2 → [512, 512] → ↓2
  (Self-Attention applied at the 16×16 resolution)

Bottleneck:
  [512] → Self-Attention → [512]

Decoder (with skip connections):
  [512, 512] → ↑2 → [256, 256] → ↑2 → [256, 256] → ↑2 → [128, 128] → ↑2
  (Self-Attention applied at the 16×16 resolution)

Output: ε_θ ∈ R^(C×H×W)       (predicted noise with the same dimensions as the input)

DDPM used approximately 114M parameters at 256×256 resolution.


10. Experimental Results

10.1 Quantitative Evaluation

DDPM was evaluated on the following benchmarks.

CIFAR-10 (Unconditional, 32×32):

| Model | FID (↓) | IS (↑) |
|-------|---------|--------|
| DDPM | 3.17 | 9.46 |
| StyleGAN2 + ADA | 2.92 | 9.83 |
| NCSN | 25.32 | 8.87 |
| ProgressiveGAN | 15.52 | 8.80 |
| NVAE | 23.5 | - |

DDPM achieved an FID competitive with the strongest GANs of the time, such as StyleGAN2, and far ahead of prior score-based and likelihood-based models.

LSUN (256×256):

| Dataset | FID |
|---------|-----|
| LSUN Bedroom | 4.90 |
| LSUN Cat | - |
| LSUN Church | 7.89 |

10.2 Qualitative Analysis

DDPM samples exhibited several distinct characteristics compared to GANs.

High diversity: While GANs suffer from limited generation diversity due to mode collapse, DDPM covers diverse modes of the data distribution in a balanced manner.

Gradual generation: The progressive transformation from noise to image can be visualized, confirming a coarse-to-fine generation pattern where the model first forms global structure and then adds fine details.

Stable training: Free from GAN's chronic problems of training instability (mode collapse, training oscillation), converging stably with a simple MSE loss.

10.3 Progressive Lossy Compression Interpretation

Ho et al. interpreted DDPM as naturally implementing a Progressive Lossy Decompression scheme. Information is progressively added at each Reverse step, which can be viewed as a generalization of Autoregressive Decoding. Rate-Distortion curve analysis confirmed that most bits are allocated to overall structure rather than perceptually insignificant details.


11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion

11.1 DDIM (Denoising Diffusion Implicit Models)

Song et al., 2021 | arXiv: 2010.02502

Research that addressed DDPM's biggest limitation: slow sampling speed. The core idea is to generalize the Forward Process to be Non-Markovian.

DDIM uses the same trained model ϵθ\epsilon_\theta while modifying only the sampling process.

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t

Setting \sigma_t = 0 makes sampling completely deterministic, which provides:

  • Accelerated sampling: Similar-quality images with only 50-100 steps instead of T = 1000 (a 10-20x speedup)
  • Semantic interpolation: Thanks to the deterministic mapping, interpolation in latent space leads to meaningful image transformations
  • Consistency: Always generates the same image from the same initial noise, ensuring reproducible results
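The deterministic DDIM update (\sigma_t = 0) can be sketched in NumPy (an illustrative implementation, not the official code). As a sanity check: if the predicted noise equals the exact noise that produced x_t, the update lands exactly on the closed-form q(x_{t'} | x_0) point for the earlier step.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_pred, t, t_prev):
    """One deterministic DDIM step (sigma_t = 0): predict x0, then re-noise to t_prev."""
    x0_pred = (xt - np.sqrt(1 - abar[t]) * eps_pred) / np.sqrt(abar[t])
    return np.sqrt(abar[t_prev]) * x0_pred + np.sqrt(1 - abar[t_prev]) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
xt = np.sqrt(abar[800]) * x0 + np.sqrt(1 - abar[800]) * eps

x_prev = ddim_step(xt, eps, t=800, t_prev=600)
print(np.allclose(x_prev, np.sqrt(abar[600]) * x0 + np.sqrt(1 - abar[600]) * eps))
```

Because the step size is free, a sampler can hop over a sparse subsequence of time steps (e.g., 50 of the 1000), which is where the speedup comes from.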

11.2 Improved DDPM

Nichol & Dhariwal, 2021 | arXiv: 2102.09672

Research that improved two aspects of the original DDPM.

Learnable variance: While DDPM fixed \sigma_t^2 as either \beta_t or \tilde{\beta}_t, Improved DDPM makes it learnable. Specifically, \Sigma_\theta is parameterized as an interpolation between \beta_t and \tilde{\beta}_t in log space.

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)

where v is a value output by the network.

Cosine Schedule: Introduced the Cosine Variance Schedule described earlier, greatly improving training efficiency especially for high-resolution images.

Hybrid Loss: Adding a small amount of the variational lower bound L_{\text{vlb}} to L_{\text{simple}} also improved log-likelihood.

L_{\text{hybrid}} = L_{\text{simple}} + \lambda L_{\text{vlb}}

11.3 Classifier Guidance

Dhariwal & Nichol, 2021 | arXiv: 2105.05233

A technique proposed in "Diffusion Models Beat GANs on Image Synthesis" that injects the gradient of a pre-trained classifier into the Reverse Process for conditional generation.

\hat{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - s \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log p_\phi(y | x_t)

where s is the guidance scale and p_\phi is a classifier trained on noisy images. Increasing s reduces diversity but increases fidelity to a specific class. In this paper, Diffusion Models first surpassed GANs in FID (CIFAR-10 FID 2.97, ImageNet 256×256 FID 4.59).

Limitation: A separate classifier must be trained on noisy data, complicating the training pipeline.

11.4 Classifier-Free Guidance (CFG)

Ho & Salimans, 2022 | arXiv: 2207.12598

An innovative technique that achieves guidance effects without a separate classifier, and has become the de facto standard in modern Diffusion Models.

The core idea is for a single network to learn both conditional and unconditional generation. During training, the condition information c is replaced with a null token \varnothing with a certain probability (typically 10-20%).

At inference, conditional and unconditional predictions are linearly combined.

\hat{\epsilon}_\theta(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)

where w is the guidance weight. When w = 0, standard conditional generation occurs; when w > 0, fidelity to the condition increases.

Rearranging gives the following interpretation:

\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing) + (1 + w) \cdot \underbrace{(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))}_{\text{shift toward the condition}}

This can be interpreted as pushing the prediction away from the unconditional direction and toward the conditional one, with larger w increasing the push. Nearly all state-of-the-art Text-to-Image models, including DALL-E 2, Stable Diffusion, and Imagen, use CFG.
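The CFG combination itself is a one-line operation; this NumPy sketch (illustrative toy values, not from any model) shows both the w = 0 and w > 0 cases:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1 + w) * eps_cond - w * eps_uncond

eps_uncond = np.array([0.0, 0.0])     # stand-in for eps_theta(x_t, t, null)
eps_cond = np.array([1.0, -1.0])      # stand-in for eps_theta(x_t, t, c)

print(cfg_combine(eps_cond, eps_uncond, w=0.0))  # w = 0: plain conditional prediction
print(cfg_combine(eps_cond, eps_uncond, w=2.0))  # w > 0: amplified shift toward c
```

Note that this costs two forward passes per step (conditional and unconditional), which is the price of guidance at inference time.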

11.5 Latent Diffusion Models (LDM) / Stable Diffusion

Rombach et al., 2022 | arXiv: 2112.10752

LDM dramatically improved computational efficiency by performing the Diffusion Process in latent space rather than pixel space.

Key Architecture:

  1. Perceptual Compression: A pre-trained Autoencoder (VQ-VAE or KL-regularized VAE) encoder \mathcal{E} compresses the image x into a low-dimensional latent z = \mathcal{E}(x). Typically, a 256×256×3 image is compressed to a 32×32×4 latent (approximately a 48x dimensionality reduction).

  2. Latent Diffusion: DDPM's Forward/Reverse Process is performed in this latent space. Computation is significantly reduced compared to pixel space.

  3. Cross-Attention Conditioning: Condition information such as text and segmentation maps is injected into the U-Net via Cross-Attention. For text, CLIP or BERT embeddings are used.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V

where Q = W_Q \cdot \varphi(z_t), K = W_K \cdot \tau_\theta(y), V = W_V \cdot \tau_\theta(y), and \tau_\theta(y) is the encoding of the condition information y.

Stable Diffusion is trained by combining this LDM architecture with a CLIP text encoder and a large-scale dataset (LAION-5B), becoming the de facto standard for open-source Text-to-Image models.

11.6 Score SDE

Song et al., 2021 | arXiv: 2011.13456

This ICLR 2021 Oral presentation connected DDPM and Score Matching under the unified framework of Stochastic Differential Equations (SDE).

Key contributions:

  • Variance Exploding (VE) SDE: Corresponds to the NCSN/SMLD family
  • Variance Preserving (VP) SDE: Corresponds to DDPM
  • Sub-VP SDE: A variant providing better likelihood
\text{VP-SDE}: \quad dx = -\frac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw

The extension to continuous time enables exact log-likelihood computation (via ODE), more flexible sampler design, and conditional generation tasks such as Inpainting and Colorization.

11.7 Consistency Models

Song et al., 2023 | arXiv: 2303.01469

Consistency Models, proposed by Yang Song at OpenAI, represent an attempt to fundamentally solve the multi-step sampling problem of Diffusion Models.

The core idea is to learn a function fθf_\theta that maps all points on an ODE trajectory to the same starting point (original data).

f_\theta(x_t, t) = x_0, \quad \forall t \in [0, T]

By this self-consistency property, data can be recovered from a noisy sample at any time t with a single network evaluation. That is, one-step generation is possible.

Two training approaches exist:

  • Consistency Distillation (CD): Distilling from a pre-trained Diffusion Model
  • Consistency Training (CT): Training independently without pre-training

In 2024, Easy Consistency Models (ECM) emerged, achieving better 2-step generation performance at 33% of the training cost compared to iCT.

11.8 Flow Matching / Rectified Flow

Lipman et al., 2023; Liu et al., 2023 | arXiv: 2210.02747, arXiv: 2209.03003

Flow Matching is an alternative approach to Diffusion Models that directly learns the probability flow connecting data and noise distributions.

Core Idea: Define straight paths from noise \epsilon \sim \mathcal{N}(0, I) (at t = 1) to data x_0 (at t = 0).

x_t = (1-t)\, x_0 + t\, \epsilon, \quad t \in [0, 1]

Learn a velocity field v_\theta(x_t, t) along this path; for a straight path, the target velocity pointing from noise toward data is the constant x_0 - \epsilon.

L_{\text{FM}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| v_\theta(x_t, t) - (x_0 - \epsilon) \|^2 \right]

Rectified Flow repeatedly "straightens" these paths (reflow), producing high-quality samples even with few steps.
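The appeal of straight paths can be seen in a tiny NumPy sketch (illustrative, not from either paper): with the analytically exact velocity for a known pair, which is the constant x_0 - \epsilon, Euler integration from t = 1 down to t = 0 recovers the data point exactly, regardless of step count.

```python
import numpy as np

def euler_sample(v_field, x1, n_steps=10):
    """Integrate dx = v dt from t = 1 down to t = 0 with fixed Euler steps."""
    x, dt = x1.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x + dt * v_field(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)

# Straight path x_t = (1-t) x0 + t eps has constant sampling velocity x0 - eps,
# so Euler integration is exact here.
x_gen = euler_sample(lambda x, t: x0 - eps, x1=eps.copy())
print(np.allclose(x_gen, x0))
```

A learned v_\theta only approximates this field, and real marginal paths are not perfectly straight, which is precisely what Rectified Flow's reflow procedure tries to correct.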

Stable Diffusion 3 adopted Rectified Flow, presenting a new paradigm for Diffusion Models alongside the transition from U-Net to Transformer.

11.9 DiT (Diffusion Transformer)

Peebles & Xie, 2023 | arXiv: 2212.09748

DiT replaced the Diffusion Model backbone from U-Net to Vision Transformer (ViT).

Key design choices:

  • Images are divided into patches and processed as tokens
  • Time step t and class label y are injected via Adaptive Layer Normalization (adaLN-Zero)
  • Composed of L stacked Transformer Blocks

DiT, combined with Latent Diffusion, achieved FID 2.27 on ImageNet 256×256 class-conditional generation, surpassing all previous Diffusion Models.

Significance of DiT: It empirically demonstrated that Transformer scaling laws can be applied to Diffusion Models. Performance consistently improves with increased model size and training compute. This finding directly influenced the architectural choices of the latest large-scale generative models such as Sora (OpenAI, Video generation) and Stable Diffusion 3.


12. PyTorch Code Examples: Simple DDPM Implementation

Below is a simplified PyTorch implementation of DDPM's core components. A more sophisticated U-Net and hyperparameter tuning would be needed for actual training.

12.1 Noise Schedule and Forward Process

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class DDPMScheduler:
    """Scheduler managing DDPM's Forward Process."""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
        self.num_timesteps = num_timesteps

        if schedule == 'linear':
            self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        elif schedule == 'cosine':
            self.betas = self._cosine_schedule(num_timesteps)
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        # Pre-compute key variables
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)          # ᾱ_t
        self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)

        # Forward process coefficients
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)        # √ᾱ_t
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)  # √(1-ᾱ_t)

        # Reverse process coefficients
        self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)           # 1/√α_t
        self.posterior_variance = (
            self.betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
        )  # β̃_t

    def _cosine_schedule(self, timesteps, s=0.008):
        steps = timesteps + 1
        t = torch.linspace(0, timesteps, steps) / timesteps
        alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        return torch.clip(betas, 0.0001, 0.9999)

    def add_noise(self, x_0, t, noise=None):
        """Forward process: compute q(x_t | x_0) in one step."""
        if noise is None:
            noise = torch.randn_like(x_0)

        # Move coefficient tables to t's device before indexing (they live on CPU)
        sqrt_alpha_cumprod = self.sqrt_alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_cumprod = self.sqrt_one_minus_alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)

        # x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
        x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise
        return x_t

12.2 Simplified U-Net

class SinusoidalPositionEmbedding(nn.Module):
    """Transformer-style Sinusoidal Time Embedding."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb


class ResBlock(nn.Module):
    """Time-conditioned Residual Block."""

    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_ch),
        )
        self.norm2 = nn.GroupNorm(8, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # Inject time embedding
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)                           # Residual connection


class SimpleUNet(nn.Module):
    """Simplified U-Net for DDPM training."""

    def __init__(self, in_channels=3, base_channels=64, time_emb_dim=256):
        super().__init__()

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbedding(base_channels),
            nn.Linear(base_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )

        # Encoder
        self.enc1 = ResBlock(in_channels, base_channels, time_emb_dim)
        self.enc2 = ResBlock(base_channels, base_channels * 2, time_emb_dim)
        self.enc3 = ResBlock(base_channels * 2, base_channels * 4, time_emb_dim)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bot = ResBlock(base_channels * 4, base_channels * 4, time_emb_dim)

        # Decoder (with skip connections)
        self.dec3 = ResBlock(base_channels * 8, base_channels * 2, time_emb_dim)
        self.dec2 = ResBlock(base_channels * 4, base_channels, time_emb_dim)
        self.dec1 = ResBlock(base_channels * 2, base_channels, time_emb_dim)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

        # Output
        self.out = nn.Conv2d(base_channels, in_channels, 1)

    def forward(self, x, t):
        t_emb = self.time_mlp(t)

        # Encoder
        e1 = self.enc1(x, t_emb)
        e2 = self.enc2(self.pool(e1), t_emb)
        e3 = self.enc3(self.pool(e2), t_emb)

        # Bottleneck
        b = self.bot(self.pool(e3), t_emb)

        # Decoder with skip connections
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1), t_emb)
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1), t_emb)
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1), t_emb)

        return self.out(d1)  # Predicted noise ε_θ

12.3 Training Loop

def train_ddpm(model, dataloader, scheduler, epochs=100, lr=2e-4, device='cuda'):
    """DDPM training loop (Algorithm 1 implementation)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        total_loss = 0
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)

            # 1. Select random time step: t ~ Uniform({0, ..., T-1}) (0-indexed form of Algorithm 1's {1, ..., T})
            t = torch.randint(0, scheduler.num_timesteps, (x_0.shape[0],), device=device)

            # 2. Sample noise: ε ~ N(0, I)
            noise = torch.randn_like(x_0)

            # 3. Forward process: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
            x_t = scheduler.add_noise(x_0, t, noise)

            # 4. Predict noise: ε_θ(x_t, t)
            noise_pred = model(x_t, t)

            # 5. Simplified loss: L = ||ε - ε_θ(x_t, t)||²
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

12.4 Sampling

@torch.no_grad()
def sample_ddpm(model, scheduler, image_shape, device='cuda'):
    """DDPM sampling (Algorithm 2 implementation)."""
    model.eval()

    # x_T ~ N(0, I)
    x = torch.randn(image_shape, device=device)

    for t in reversed(range(scheduler.num_timesteps)):
        t_batch = torch.full((image_shape[0],), t, device=device, dtype=torch.long)

        # Predict noise
        predicted_noise = model(x, t_batch)

        # Reverse process coefficients
        alpha_t = scheduler.alphas[t]
        alpha_cumprod_t = scheduler.alphas_cumprod[t]
        beta_t = scheduler.betas[t]

        # Compute mean: μ_θ = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ)
        mean = (1.0 / torch.sqrt(alpha_t)) * (
            x - (beta_t / torch.sqrt(1.0 - alpha_cumprod_t)) * predicted_noise
        )

        if t > 0:
            # Add stochastic noise (except at the last step)
            noise = torch.randn_like(x)
            sigma_t = torch.sqrt(scheduler.posterior_variance[t])
            x = mean + sigma_t * noise
        else:
            x = mean

    return x

12.5 Usage Example

# Hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 32
batch_size = 128
num_timesteps = 1000

# Initialize scheduler and model
scheduler = DDPMScheduler(num_timesteps=num_timesteps, schedule='cosine')
model = SimpleUNet(in_channels=3, base_channels=64).to(device)

# Dataset (e.g., CIFAR-10)
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # Normalize to [-1, 1]
])
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Train
train_ddpm(model, dataloader, scheduler, epochs=100, device=device)

# Sample
samples = sample_ddpm(model, scheduler, (16, 3, image_size, image_size), device=device)
# samples: 16 generated images in [-1, 1] range

13. Diffusion Model vs GAN vs VAE: Comparative Analysis

13.1 Comprehensive Comparison Table

| Property | Diffusion Model (DDPM) | GAN | VAE |
|---|---|---|---|
| Training Method | Noise prediction (MSE) | Adversarial training (Min-Max) | Variational inference (ELBO) |
| Training Stability | Very stable | Unstable (mode collapse, oscillation) | Stable |
| Generation Quality | Very high | Very high | Moderate (blurry) |
| Diversity | High (full distribution coverage) | Low (mode collapse risk) | High |
| Generation Speed | Slow (1000 steps) | Very fast (1 step) | Fast (1 step) |
| Log-likelihood | Computable (ELBO) | Not computable | Computable (ELBO) |
| Latent Space | Implicit | None (or limited) | Explicit, continuous |
| Mode Coverage | High | Low | High |
| Conditional Generation | Very effective via CFG | Possible via cGAN | Conditional VAE |
| Resolution Scaling | Efficient via LDM | Progressive training needed | Hierarchical VAE needed |
| Theoretical Basis | Thermodynamics, Score Matching | Game theory | Variational Bayes |
| Representative Models | Stable Diffusion, DALL-E 2 | StyleGAN, BigGAN | VQ-VAE-2, NVAE |
| CIFAR-10 FID | ~2.0 (latest) | ~2.9 (StyleGAN2) | ~23.5 (NVAE) |

13.2 When to Choose Which Model?

Choose Diffusion Models when:

  • Both generation quality and diversity are important
  • Complex conditional generation like text-to-image is needed
  • Training stability is critical
  • Generation speed is not the top priority

Choose GANs when:

  • Real-time generation is needed
  • High-quality images for a specific domain are needed (faces, landscapes, etc.)
  • The dataset is relatively small and uniform

Choose VAEs when:

  • Meaningful Latent Space manipulation is needed
  • Likelihood-based anomaly detection is needed
  • Fast encoding/decoding is required
  • Semi-supervised learning or representation learning is the main purpose

14. Present and Future of Diffusion Models

14.1 Current Trends
Architecture Transition: From U-Net to Transformer. The latest models such as Stable Diffusion 3, FLUX, and Sora adopt DiT-based architectures. Transformer scaling laws have been confirmed to apply to Diffusion Models, and model scale expansion (8B+ parameters) is actively underway.

Sampling Efficiency. With advances in Consistency Models, Flow Matching, and DPM-Solver, 1-4 step generation has become possible. Rectified Flow learns straight paths, achieving high quality even with few steps.

Multimodal Expansion. Diffusion Models are expanding beyond images to video (Sora, Runway Gen-3), audio (AudioLDM), 3D (DreamFusion, Zero-1-to-3), robotics (Diffusion Policy), and other domains.

Acceleration and Optimization. Techniques such as Distillation, Quantization, and Caching have greatly improved inference speed, approaching real-time image generation.

14.2 Historical Significance of DDPM

DDPM represents a turning point in generative model history in the following ways:

  1. Demonstrated the competitiveness of Likelihood-based models in the image generation space dominated by GANs
  2. Showed that high-quality generation is possible with an extremely simple training objective (L_\text{simple})
  3. Established a theoretical framework connecting thermodynamics and Score Matching
  4. Became the direct foundation of the modern AI revolution including Stable Diffusion, DALL-E 2, and Midjourney

15. References

  1. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239

  2. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585

  3. Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models (DDIM). ICLR 2021. arXiv:2010.02502

  4. Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672

  5. Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021. arXiv:2105.05233

  6. Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop. arXiv:2207.12598

  7. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752

  8. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456

  9. Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. arXiv:2303.01469

  10. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747

  11. Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). ICCV 2023. arXiv:2212.09748

  12. Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600

  13. Weng, L. (2021). What are Diffusion Models? lilianweng.github.io

  14. Hugging Face. The Annotated Diffusion Model. huggingface.co/blog/annotated-diffusion