- 1. Paper Overview
- 2. Background: From Thermodynamics to Generative Models
- 3. Forward Process: Systematically Adding Noise
- 4. Core Mathematics: Reparameterization Trick
- 5. Reverse Process: Recovering Images from Noise
- 6. Deriving the Training Objective: From ELBO to Simplified Loss
- 7. Noise Scheduling: Design of $\beta_t$
- 8. Sampling Algorithm
- 9. Architecture: Time-conditioned U-Net
- 10. Experimental Results
- 11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion
- 12. PyTorch Code Examples: Simple DDPM Implementation
- 13. Diffusion Model vs GAN vs VAE: Comparative Analysis
- 14. Present and Future of Diffusion Models
- 15. References
1. Paper Overview
"Denoising Diffusion Probabilistic Models" (DDPM) was published at NeurIPS 2020, co-authored by Jonathan Ho, Ajay Jain, and Pieter Abbeel from UC Berkeley. This paper is a landmark study that empirically demonstrated that high-quality image synthesis is achievable through diffusion probabilistic models.
The core idea is surprisingly simple. Define a Forward Process that gradually adds Gaussian noise to data, and learn a Reverse Process that step-by-step removes this noise to recover the original data. The final training objective reduces to a simple MSE loss between "model-predicted noise" and "actually added noise."
DDPM achieved FID 3.17 and Inception Score 9.46 on CIFAR-10, showing performance comparable to or surpassing GAN-based models of the time. More importantly, this paper became the foundation of modern image generation AI including DALL-E 2, Imagen, Stable Diffusion, and Midjourney.
Paper Information
- Title: Denoising Diffusion Probabilistic Models
- Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
- Venue: NeurIPS 2020
- arXiv: 2006.11239
- Official Code: hojonathanho/diffusion
2. Background: From Thermodynamics to Generative Models
2.1 Inspiration from Non-equilibrium Thermodynamics
The intellectual origin of Diffusion Models lies in non-equilibrium statistical mechanics. In physics, diffusion refers to the process where particles randomly move from high-concentration regions to low-concentration regions, eventually reaching a state of thermal equilibrium (maximum entropy). The key insight of this process is:
- Forward: A state with complex structure → a disordered equilibrium state (information destruction)
- Reverse: An equilibrium state → restoration to a structured state (information creation)
Sohl-Dickstein et al. (2015) first applied this idea to machine learning, publishing "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." By defining a diffusion process that transforms a complex data distribution into a simple known distribution (Gaussian), and learning its reverse process, one obtains a generative model.
2.2 Connection with Score Matching
Another theoretical pillar of Diffusion Models is Score Matching. The score function is defined as the gradient of the log probability density:
$$s(x) = \nabla_x \log p(x)$$
If this score function can be estimated, samples can be generated through Langevin Dynamics:
$$x_{i+1} = x_i + \frac{\epsilon}{2}\,\nabla_x \log p(x_i) + \sqrt{\epsilon}\,z_i, \qquad z_i \sim \mathcal{N}(0, I)$$
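As a concrete illustration, Langevin dynamics can be run on a distribution whose score is known analytically. The sketch below (a toy example, not from the paper) samples a standard Gaussian, for which $\nabla_x \log p(x) = -x$:

```python
import torch

def langevin_sample(score_fn, x0, step_size=0.01, n_steps=1000):
    """Unadjusted Langevin dynamics: x <- x + (eps/2)*score(x) + sqrt(eps)*z."""
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * z
    return x

# For a standard Gaussian the score is -x in closed form.
torch.manual_seed(0)
samples = langevin_sample(lambda x: -x, torch.full((10000,), 5.0))
# After enough steps the chain forgets its start at 5.0 and samples ~ N(0, 1).
print(samples.mean().item(), samples.std().item())
```

In a score-based generative model, the hand-written `score_fn` is replaced by a neural network trained at multiple noise levels.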
Yang Song and Stefano Ermon (2019) proposed Noise Conditional Score Networks (NCSN) in "Generative Modeling by Estimating Gradients of the Data Distribution," presenting a method for estimating the score function at various noise levels. Ho et al.'s DDPM is deeply connected to this Score Matching perspective, and the paper explicitly cites "a new connection with denoising score matching with Langevin dynamics" as a core contribution.
2.3 SDE Perspective: A Unified Framework
Song et al. (2021) unified DDPM and Score Matching under the framework of Stochastic Differential Equations (SDE) in "Score-Based Generative Modeling through Stochastic Differential Equations." The Forward Process described as a continuous-time SDE takes the form:
$$dx = f(x, t)\,dt + g(t)\,dw$$
where $f(x, t)$ is the drift coefficient, $g(t)$ is the diffusion coefficient, and $w$ is a standard Wiener process. A corresponding reverse-time SDE exists:
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$
The key insight is that solving the reverse SDE requires only the time-dependent score function $\nabla_x \log p_t(x)$. DDPM's noise prediction network is essentially equivalent to estimating this score function:
$$\nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}$$
This relationship is the key link that theoretically unifies DDPM and Score Matching.
3. Forward Process: Systematically Adding Noise
3.1 Forward Process as a Markov Chain
The Forward Process (or Diffusion Process) is a fixed Markov Chain that gradually adds Gaussian noise to original data $x_0 \sim q(x_0)$. It has no learnable parameters and is entirely determined by a predefined Variance Schedule $\beta_1, \dots, \beta_T$.
The transition probability at each time step $t$ is defined as:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\right)$$
In plain terms, at each step the data from the previous time step is scaled down by $\sqrt{1 - \beta_t}$ and Gaussian noise with variance $\beta_t$ is added.
Why scale by $\sqrt{1 - \beta_t}$? To preserve the total variance at each step. If the variance of $x_{t-1}$ is 1, then the variance of $\sqrt{1 - \beta_t}\,x_{t-1}$ is $1 - \beta_t$, and adding noise with variance $\beta_t$ gives a total variance of $(1 - \beta_t) + \beta_t = 1$.
When $T$ is sufficiently large and $\beta_t$ is appropriately set, $x_T$ converges to nearly pure isotropic Gaussian noise $\mathcal{N}(0, I)$.
3.2 Complete Forward Process
The joint distribution of the complete Forward Process over $T$ steps is:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
This follows from the Markov property, where each step depends only on the immediately preceding step. In DDPM, $T = 1000$ is used, with $\beta_t$ increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
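The variance-preserving behavior described above can be checked numerically. The following sketch (illustrative, not from the paper) runs the forward chain on toy "data" concentrated at a single point and confirms that after $T$ steps the samples are statistically indistinguishable from $\mathcal{N}(0, 1)$:

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # DDPM's linear schedule

# Start from zero-variance "data": every sample sits at the point 1.0.
x = torch.ones(100000)
for beta in betas:                      # apply q(x_t | x_{t-1}) once per step
    x = torch.sqrt(1 - beta) * x + torch.sqrt(beta) * torch.randn_like(x)

# The sqrt(1 - beta) scaling keeps the total variance bounded by 1, and after
# T steps the original structure (the point mass at 1.0) is fully destroyed.
print(x.mean().item(), x.var().item())
```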
4. Core Mathematics: Reparameterization Trick
4.1 Jumping to Arbitrary Time in One Step
The most powerful mathematical property of the Forward Process is that $x_t$ at any arbitrary time $t$ can be computed directly from $x_0$ without going through intermediate steps. This is what makes training efficient.
First, define the notation:
$$\alpha_t := 1 - \beta_t, \qquad \bar\alpha_t := \prod_{s=1}^{t} \alpha_s$$
$\bar\alpha_t$ is the cumulative product of $\alpha_s$, representing how much of the original signal is preserved up to time $t$.
4.2 Derivation
Starting from the single-step relation and deriving inductively:
$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1 - \alpha_t}\,\epsilon_{t-1}, \qquad \epsilon_{t-1} \sim \mathcal{N}(0, I)$$
Substituting $x_{t-1} = \sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1 - \alpha_{t-1}}\,\epsilon_{t-2}$ into $x_t$:
$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\,\epsilon_{t-2} + \sqrt{1 - \alpha_t}\,\epsilon_{t-1}$$
Applying the sum of independent Gaussians rule: the sum of two independent Gaussians $\mathcal{N}(0, \sigma_1^2 I)$ and $\mathcal{N}(0, \sigma_2^2 I)$ follows $\mathcal{N}(0, (\sigma_1^2 + \sigma_2^2) I)$.
Summing the noise variances:
$$\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$$
Therefore:
$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar\epsilon, \qquad \bar\epsilon \sim \mathcal{N}(0, I)$$
Generalizing this all the way down to $x_0$ yields the following.
4.3 Final Result: Closed-form Expression
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t) I\right)$$
That is, $x_t$ at any time $t$ can be sampled in one step:
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
The intuitive interpretation of this formula is:
| Term | Meaning | Change Over Time |
|---|---|---|
| $\sqrt{\bar\alpha_t}\,x_0$ | Original signal | As $t \to T$, $\bar\alpha_t \to 0$: signal decreases |
| $\sqrt{1 - \bar\alpha_t}\,\epsilon$ | Added noise | As $t \to T$, $1 - \bar\alpha_t \to 1$: noise increases |
At $t = 0$, $\bar\alpha_0 = 1$, so we get $x_0$ as-is, and at $t = T$, $\bar\alpha_T \approx 0$, so $x_T$ becomes nearly pure noise. This gradual decrease in Signal-to-Noise Ratio (SNR) is the essence of the Forward Process.
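The closed-form jump can be verified against the step-by-step chain. In this small numerical sketch (toy data, illustrative only), iterating the one-step kernel produces exactly the mean and variance that $q(x_t \mid x_0)$ predicts:

```python
import torch

torch.manual_seed(0)
T, t_check = 1000, 300
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)[t_check - 1]   # ᾱ at step t_check

# Run the forward chain step by step for t_check steps...
x = torch.full((200000,), 2.0)                     # "data" concentrated at 2.0
for beta in betas[:t_check]:
    x = torch.sqrt(1 - beta) * x + torch.sqrt(beta) * torch.randn_like(x)

# ...and compare against the one-step closed form q(x_t|x_0) = N(√ᾱ_t x_0, (1-ᾱ_t)I).
print(x.mean().item(), (torch.sqrt(abar) * 2.0).item())   # means agree
print(x.var().item(), (1 - abar).item())                  # variances agree
```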
5. Reverse Process: Recovering Images from Noise
5.1 Definition of the Reverse Process
The Reverse Process starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and progressively removes noise to generate data $x_0$. If each step of the Forward Process is a small Gaussian perturbation, the key assumption is that its reverse can also be approximated as Gaussian (when $\beta_t$ is sufficiently small):
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Here, $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance that the neural network must learn. In DDPM, the variance is not learned but fixed as $\Sigma_\theta = \sigma_t^2 I$, where either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t$ is used.
5.2 Derivation of the Posterior
The key to training is that the reverse conditional distribution (posterior) given $x_0$ is computable in closed form. Applying Bayes' theorem:
$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$
By the Markov property, $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$, so all three terms are Gaussian. Since the product of Gaussians is also Gaussian, expanding the exponents and rearranging as a quadratic in $x_{t-1}$ yields:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$
where the posterior mean is:
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\,x_t$$
and the posterior variance is:
$$\tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t$$
5.3 Replacing $x_0$ with $\epsilon$
Since the model cannot directly know $x_0$, we solve the Reparameterization formula in reverse to express $x_0$:
$$x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon\right)$$
Substituting this into the posterior mean $\tilde\mu_t$:
$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon\right)$$
If the model learns a network $\epsilon_\theta(x_t, t)$ that predicts the noise $\epsilon$, the Reverse Process mean is computed as:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$
This is why noise prediction is equivalent to mean prediction in DDPM's Reverse Process.
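The algebraic equivalence of the two mean parameterizations can be checked numerically. The following sketch (illustrative, not from the paper) evaluates both forms with the true noise plugged in and confirms they coincide up to float error:

```python
import torch

torch.manual_seed(0)
T, t = 1000, 500
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)

x0 = torch.randn(1000)
eps = torch.randn(1000)
xt = torch.sqrt(abar[t]) * x0 + torch.sqrt(1 - abar[t]) * eps   # forward jump

# Form 1: posterior mean written in terms of (x_0, x_t).
mu_x0 = (torch.sqrt(abar[t - 1]) * betas[t] / (1 - abar[t]) * x0
         + torch.sqrt(alphas[t]) * (1 - abar[t - 1]) / (1 - abar[t]) * xt)

# Form 2: the same mean written in terms of the noise eps that produced x_t.
mu_eps = (xt - betas[t] / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alphas[t])

print(torch.max(torch.abs(mu_x0 - mu_eps)).item())   # ~0 up to float error
```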
6. Deriving the Training Objective: From ELBO to Simplified Loss
6.1 Maximum Likelihood and ELBO
The ultimate goal of a generative model is to maximize the data log-likelihood $\log p_\theta(x_0)$. However, since this is intractable to compute directly, we optimize the Evidence Lower Bound (ELBO).
Applying Jensen's inequality:
$$\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$
6.2 Decomposition of the ELBO
Decomposing the negative ELBO into KL divergence terms:
$$L = \underbrace{D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)}_{L_{t-1}} \underbrace{-\,\mathbb{E}_q\!\left[\log p_\theta(x_0 \mid x_1)\right]}_{L_0}$$
Analyzing the meaning of each term:
$L_T$ (Prior Matching): Measures how well $q(x_T \mid x_0)$ matches the prior distribution $p(x_T) = \mathcal{N}(0, I)$. When $T$ is sufficiently large, this term converges to 0, and since it has no learnable parameters, it is ignored as a constant.
$L_0$ (Reconstruction): Measures the ability to reconstruct $x_0$ from $x_1$. Since $x_1$ and $x_0$ are very similar, its impact on overall training is small.
$L_{t-1}$ (Denoising Matching): The core training signal that measures how well the model's Reverse transition $p_\theta(x_{t-1} \mid x_t)$ matches the true posterior $q(x_{t-1} \mid x_t, x_0)$.
6.3 KL Divergence Computation
The KL divergence between two Gaussians is computable in closed form. Since $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(\tilde\mu_t, \tilde\beta_t I)$ and $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I)$:
$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C$$
where $C$ is a constant related to the variances. With fixed variance, only the difference in means becomes the training objective.
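As a quick numerical check of this closed form (using `torch.distributions`; the values are chosen arbitrarily for illustration), the equal-variance Gaussian KL reduces exactly to the squared mean difference over $2\sigma^2$:

```python
import torch
from torch.distributions import Normal, kl_divergence

sigma = 0.5
mu_q, mu_p = torch.tensor(1.2), torch.tensor(0.7)

# Equal fixed variances: KL(N(mu_q, s^2) || N(mu_p, s^2)) = (mu_q - mu_p)^2 / (2 s^2)
closed_form = (mu_q - mu_p) ** 2 / (2 * sigma ** 2)
exact = kl_divergence(Normal(mu_q, sigma), Normal(mu_p, sigma))
print(closed_form.item(), exact.item())  # both ~0.5
```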
6.4 Reparameterization to Noise Prediction
Substituting the expressions for $\tilde\mu_t$ and $\mu_\theta$ derived earlier:
$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1 - \bar\alpha_t)}\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$
The Simplified Loss with the weighting coefficient removed is:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t\right)\right\|^2\right]$$
where $t \sim \text{Uniform}\{1, \dots, T\}$, $x_0 \sim q(x_0)$, and $\epsilon \sim \mathcal{N}(0, I)$.
This is DDPM's most important contribution. Starting from the complex ELBO, it ultimately arrives at the simplest possible loss function in machine learning: the MSE between actual noise $\epsilon$ and predicted noise $\epsilon_\theta$. Experimentally, this simplified loss also produces better sample quality than the weighted variational bound.
6.5 Training Algorithm Summary
Algorithm 1: Training
─────────────────────────────────
repeat
x_0 ~ q(x_0) # Sample from dataset
t ~ Uniform({1, ..., T}) # Select random time step
ε ~ N(0, I) # Sample standard Gaussian noise
x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε # Generate noisy image
∇_θ ||ε - ε_θ(x_t, t)||² # Compute gradient and update
until converged
7. Noise Scheduling: Design of $\beta_t$
7.1 Linear Schedule (Original DDPM)
Ho et al. used a schedule where $\beta_t$ increases linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps.
The intuition behind this schedule is to add small noise initially to gradually destroy data structure, and larger noise in later stages to rapidly converge to a Gaussian.
7.2 Problems with the Linear Schedule
Nichol & Dhariwal (2021, "Improved Denoising Diffusion Probabilistic Models") identified two issues with the Linear Schedule.
First, information is destroyed too quickly in the early stages. $\bar\alpha_t$ decreases rapidly in the beginning, so significant noise is added even at low values of $t$. This is particularly problematic for high-resolution images.
Second, late time steps are wasted. At large values of $t$, $\bar\alpha_t \approx 0$, meaning $x_t$ is already close to pure noise and contributes little to meaningful training.
7.3 Cosine Schedule
The Cosine Schedule proposed by Nichol & Dhariwal defines $\bar\alpha_t$ directly:
$$\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$$
where $s = 0.008$ is a small offset to prevent $\beta_t$ from becoming too small near $t = 0$.
The key characteristics of the Cosine Schedule are:
- $\bar\alpha_t$ decreases nearly linearly in the middle range, providing uniformly useful training signals across all time steps
- Prevents excessive noise addition in the early stages, preserving fine details
- Ensures smooth transition to complete noise in the later stages
import torch
import math
def cosine_beta_schedule(timesteps, s=0.008):
"""Cosine schedule as proposed in Nichol & Dhariwal (2021)."""
steps = timesteps + 1
t = torch.linspace(0, timesteps, steps) / timesteps
alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return torch.clip(betas, 0.0001, 0.9999)
def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
"""Linear schedule as proposed in Ho et al. (2020)."""
return torch.linspace(beta_start, beta_end, timesteps)
7.4 Schedule Comparison
| Property | Linear Schedule | Cosine Schedule |
|---|---|---|
| $\bar\alpha_t$ decay pattern | Rapid early, gradual late | Nearly linear in middle |
| Early information preservation | Low | High |
| Late time step utilization | Inefficient (already pure noise) | Efficient |
| High-resolution suitability | Low | High |
| Used in original DDPM | Yes | No |
| Used in Improved DDPM | No | Yes |
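The differences in the table can be reproduced by computing $\bar\alpha_t$ under both schedules, using the same formulas as the code above (a short illustrative sketch):

```python
import math
import torch

T = 1000
# Linear schedule (Ho et al., 2020): beta_t from 1e-4 to 0.02
betas_lin = torch.linspace(1e-4, 0.02, T)
abar_lin = torch.cumprod(1 - betas_lin, dim=0)

# Cosine schedule (Nichol & Dhariwal, 2021): defines abar_t directly
s = 0.008
t = torch.linspace(0, T, T + 1) / T
abar_cos = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
abar_cos = (abar_cos / abar_cos[0])[1:]          # normalize so abar(0) = 1, drop t=0

# Cosine keeps more signal in the early and middle steps, while the linear
# schedule has already collapsed abar_t toward 0 well before t = T.
for step in (100, 300, 500, 900):
    print(step, round(abar_lin[step].item(), 4), round(abar_cos[step].item(), 4))
```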
8. Sampling Algorithm
8.1 DDPM Sampling
After training is complete, the DDPM sampling algorithm for generating new images is:
Algorithm 2: Sampling
─────────────────────────────────
x_T ~ N(0, I) # Start from pure noise
for t = T, T-1, ..., 1:
z ~ N(0, I) if t > 1, else z = 0 # No noise added at the last step
x_{t-1} = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
return x_0
8.2 Step-by-Step Interpretation
Step 1: Initialization. Sample pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$. This is the starting point of the generation process.
Step 2: Noise Prediction. Feed the current noisy image $x_t$ and time step $t$ into the network $\epsilon_\theta$ to predict the noise contained in $x_t$.
Step 3: Mean Computation. Compute the mean $\mu_\theta(x_t, t)$ of the Reverse transition using the predicted noise.
Step 4: Stochastic Transition. Generate $x_{t-1}$ by adding scaled Gaussian noise $\sigma_t z$ to the computed mean. No noise is added at the final step ($t = 1$).
Step 5: Repeat. Repeat the above process from $t = T$ down to $t = 1$.
8.3 Limitations of Sampling
The biggest drawback of DDPM sampling is speed. Sequential denoising over $T = 1000$ steps requires 1000 neural network forward passes for a single image. This is extremely slow compared to a GAN's single forward pass, spurring subsequent research on accelerated samplers such as DDIM and DPM-Solver.
9. Architecture: Time-conditioned U-Net
9.1 U-Net Based Design
DDPM's noise prediction network is based on the U-Net architecture. U-Net was originally proposed by Ronneberger et al. (2015) for medical image segmentation, featuring an Encoder-Decoder structure with Skip Connections that combine features at various resolutions.
DDPM's U-Net is based on the PixelCNN++ structure with the following modifications.
9.2 Key Components
Time Embedding: To inject the time step $t$ into the network, Transformer-style Sinusoidal Positional Encoding is used.
This embedding passes through an MLP and is injected into each ResNet Block. Specifically, the time embedding is linearly transformed and then either added (additive) or scaled (FiLM conditioning) onto the intermediate feature maps of the ResNet Block.
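As an illustration of the scale-and-shift (FiLM-style) variant mentioned above, here is a minimal injection module; the dimensions are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FiLMInjection(nn.Module):
    """Scale-and-shift time conditioning: h <- (1 + scale(t_emb)) * h + shift(t_emb)."""
    def __init__(self, time_emb_dim, channels):
        super().__init__()
        self.proj = nn.Linear(time_emb_dim, 2 * channels)  # one half scale, one half shift

    def forward(self, h, t_emb):                 # h: (B, C, H, W), t_emb: (B, time_emb_dim)
        scale, shift = self.proj(t_emb).chunk(2, dim=1)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

h = torch.randn(4, 64, 16, 16)                   # intermediate feature map
t_emb = torch.randn(4, 256)                      # time embedding after the MLP
out = FiLMInjection(256, 64)(h, t_emb)
print(out.shape)  # torch.Size([4, 64, 16, 16])
```

The purely additive variant used in the simplified code of Section 12 corresponds to keeping only the shift term.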
ResNet Block: Each block consists of the following sequence:
- Group Normalization
- SiLU (Swish) Activation
- Convolution
- Time Embedding injection
- Group Normalization
- SiLU Activation
- Dropout
- Convolution
- Residual Connection
Self-Attention: Multi-Head Self-Attention is applied at feature maps of 16×16 resolution. The spatial dimensions are flattened to a sequence of length $H \times W$ to perform standard Scaled Dot-Product Attention.
Group Normalization: Group Normalization is used instead of Batch Normalization. It is independent of batch size and provides more stable training for generative models.
9.3 Specific Architecture Specifications
Input: x_t ∈ R^(C×H×W), t ∈ {1,...,T}
Encoder:
[128] → [128] → ↓2 →
[256] → [256] → ↓2 →
[256] → [256] → ↓2 → (+ Self-Attention at 16×16)
[512] → [512] → ↓2
Bottleneck:
[512] → Self-Attention → [512]
Decoder (with skip connections):
[512] → [512] → ↑2 →
[256] → [256] → ↑2 → (+ Self-Attention at 16×16)
[256] → [256] → ↑2 →
[128] → [128] → ↑2
Output: ε_θ ∈ R^(C×H×W) (predicted noise with same dimensions as input)
DDPM used approximately 114M parameters at 256×256 resolution.
10. Experimental Results
10.1 Quantitative Evaluation
DDPM was evaluated on the following benchmarks.
CIFAR-10 (Unconditional, 32×32):
| Model | FID (↓) | IS (↑) |
|---|---|---|
| DDPM | 3.17 | 9.46 |
| StyleGAN2 + ADA | 2.92 | 9.83 |
| NCSN | 25.32 | 8.87 |
| ProgressiveGAN | 15.52 | 8.80 |
| NVAE | 23.5 | - |
DDPM achieved SOTA FID among unconditional generative models at the time, showing quality comparable to GAN-based StyleGAN2.
LSUN (256×256):
| Dataset | FID |
|---|---|
| LSUN Bedroom | 4.90 |
| LSUN Cat | - |
| LSUN Church | 7.89 |
10.2 Qualitative Analysis
DDPM samples exhibited several distinct characteristics compared to GANs.
High diversity: While GANs suffer from limited generation diversity due to mode collapse, DDPM covers diverse modes of the data distribution in a balanced manner.
Gradual generation: The progressive transformation from noise to image can be visualized, confirming a coarse-to-fine generation pattern where the model first forms global structure and then adds fine details.
Stable training: Free from GAN's chronic problems of training instability (mode collapse, training oscillation), converging stably with a simple MSE loss.
10.3 Progressive Lossy Compression Interpretation
Ho et al. interpreted DDPM as naturally implementing a Progressive Lossy Decompression scheme. Information is progressively added at each Reverse step, which can be viewed as a generalization of Autoregressive Decoding. Rate-Distortion curve analysis confirmed that most bits are allocated to overall structure rather than perceptually insignificant details.
11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion
11.1 DDIM (Denoising Diffusion Implicit Models)
Song et al., 2021 | arXiv: 2010.02502
Research that addressed DDPM's biggest limitation: slow sampling speed. The core idea is to generalize the Forward Process to be Non-Markovian.
DDIM uses the same trained model while modifying only the sampling process.
Setting $\eta = 0$ (so that $\sigma_t = 0$) makes sampling completely deterministic, which provides:
- Accelerated sampling: Similar quality images with only 50-100 steps instead of $T = 1000$ (10-20x speedup)
- Semantic interpolation: Thanks to the deterministic mapping, interpolation in latent space leads to meaningful image transformations
- Consistency: Always generates the same image from the same initial noise, ensuring reproducible results
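The deterministic update can be sketched as a single-step function (a simplified illustration assuming the $\bar\alpha$ bookkeeping from earlier sections; variable names are illustrative):

```python
import torch

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) using a DDPM-trained eps-network output."""
    # 1. Predict x_0 by inverting x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.
    x0_pred = (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)
    # 2. Re-noise the estimate to the lower noise level -- no fresh randomness.
    return torch.sqrt(abar_prev) * x0_pred + torch.sqrt(1 - abar_prev) * eps_pred

# Sanity check: with the *true* noise, the update lands exactly on the
# lower-noise version of the same (x_0, eps) pair.
torch.manual_seed(0)
x0, eps = torch.randn(10), torch.randn(10)
abar_t, abar_prev = torch.tensor(0.3), torch.tensor(0.7)
x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
expected = torch.sqrt(abar_prev) * x0 + torch.sqrt(1 - abar_prev) * eps
print(torch.allclose(x_prev, expected, atol=1e-5))  # True
```

Because the step only needs $\bar\alpha$ at two noise levels, it can be applied on a coarse subsequence of the original $T$ steps, which is where the speedup comes from.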
11.2 Improved DDPM
Nichol & Dhariwal, 2021 | arXiv: 2102.09672
Research that improved two aspects of the original DDPM.
Learnable variance: While DDPM fixed $\Sigma_\theta$ as either $\beta_t I$ or $\tilde\beta_t I$, Improved DDPM makes it learnable. Specifically, $\Sigma_\theta$ is parameterized as an interpolation between $\beta_t$ and $\tilde\beta_t$:
$$\Sigma_\theta(x_t, t) = \exp\!\left(v \log \beta_t + (1 - v) \log \tilde\beta_t\right)$$
where $v$ is a value output by the network.
Cosine Schedule: Introduced the Cosine Variance Schedule described earlier, greatly improving training efficiency especially for high-resolution images.
Hybrid Loss: Adding a small weighted variational-bound term $\lambda L_{\text{vlb}}$ to $L_{\text{simple}}$ also improved log-likelihood.
11.3 Classifier Guidance
Dhariwal & Nichol, 2021 | arXiv: 2105.05233
A technique proposed in "Diffusion Models Beat GANs on Image Synthesis" that injects the gradient of a pre-trained classifier into the Reverse Process for conditional generation:
$$\tilde\epsilon_\theta(x_t, t) = \epsilon_\theta(x_t, t) - s\,\sqrt{1 - \bar\alpha_t}\,\nabla_{x_t} \log p_\phi(y \mid x_t)$$
where $s$ is the guidance scale and $p_\phi$ is a classifier trained on noisy images. Increasing $s$ reduces diversity but increases fidelity to a specific class. In this paper, Diffusion Models first surpassed GANs in FID (ImageNet 128×128 FID 2.97, ImageNet 256×256 FID 4.59).
Limitation: A separate classifier must be trained on noisy data, complicating the training pipeline.
11.4 Classifier-Free Guidance (CFG)
Ho & Salimans, 2022 | arXiv: 2207.12598
An innovative technique that achieves guidance effects without a separate classifier, and has become the de facto standard in modern Diffusion Models.
The core idea is for a single network to learn both conditional and unconditional generation. During training, the condition $c$ is replaced with a null token $\varnothing$ with a certain probability (typically 10-20%).
At inference, conditional and unconditional predictions are linearly combined:
$$\tilde\epsilon_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing)$$
where $w$ is the guidance weight. When $w = 0$, standard conditional generation occurs; as $w$ increases, fidelity to the condition increases.
Rearranging gives the following interpretation:
$$\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, c) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$
This can be interpreted as pushing the prediction away from the unconditional prediction toward the conditional direction, with larger $w$ increasing the pushing force. Nearly all state-of-the-art Text-to-Image models including DALL-E 2, Stable Diffusion, and Imagen use CFG.
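The combination rule itself is a one-liner in practice (a sketch with made-up tensors; in a real sampler the two predictions come from running the network once with the condition and once with the null token, often in a single batched forward pass):

```python
import torch

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """CFG combination: eps_tilde = (1 + w) * eps_cond - w * eps_uncond."""
    return (1 + w) * eps_cond - w * eps_uncond

eps_c = torch.tensor([1.0, 2.0])   # prediction with the condition
eps_u = torch.tensor([0.5, 0.5])   # prediction with the null token
print(classifier_free_guidance(eps_c, eps_u, 0.0))  # w = 0: plain conditional prediction
print(classifier_free_guidance(eps_c, eps_u, 3.0))  # larger w pushes further from eps_uncond
```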
11.5 Latent Diffusion Models (LDM) / Stable Diffusion
Rombach et al., 2022 | arXiv: 2112.10752
LDM dramatically improved computational efficiency by performing the Diffusion Process in latent space rather than pixel space.
Key Architecture:
Perceptual Compression: A pre-trained Autoencoder (VQ-VAE or KL-regularized VAE) Encoder $\mathcal{E}$ compresses image $x$ into a low-dimensional latent $z = \mathcal{E}(x)$. Typically, a 512×512×3 image is compressed to a 64×64×4 latent (approximately 48x dimensionality reduction).
Latent Diffusion: DDPM's Forward/Reverse Process is performed in this latent space. Computation is significantly reduced compared to pixel space.
Cross-Attention Conditioning: Condition information such as text and segmentation maps is injected into the U-Net via Cross-Attention. For text, CLIP or BERT embeddings are used:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
where $Q = W_Q\,\varphi(z_t)$, $K = W_K\,\tau_\theta(y)$, $V = W_V\,\tau_\theta(y)$, and $\tau_\theta(y)$ is the encoding of the condition information $y$.
Stable Diffusion is trained by combining this LDM architecture with a CLIP text encoder and a large-scale dataset (LAION-5B), becoming the de facto standard for open-source Text-to-Image models.
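A minimal single-head version of this cross-attention mechanism might look as follows (a sketch; the channel and token dimensions are illustrative, loosely modeled on Stable Diffusion's setup rather than taken from it):

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention: image features attend to condition tokens."""
    def __init__(self, feat_dim, cond_dim, d=64):
        super().__init__()
        self.d = d
        self.to_q = nn.Linear(feat_dim, d, bias=False)   # Q from U-Net features
        self.to_k = nn.Linear(cond_dim, d, bias=False)   # K from condition tokens
        self.to_v = nn.Linear(cond_dim, d, bias=False)   # V from condition tokens
        self.to_out = nn.Linear(d, feat_dim)

    def forward(self, z, cond):          # z: (B, HW, feat_dim), cond: (B, L, cond_dim)
        q, k, v = self.to_q(z), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)
        return self.to_out(attn @ v)     # (B, HW, feat_dim)

z = torch.randn(2, 16 * 16, 320)         # flattened 16x16 latent feature map
cond = torch.randn(2, 77, 768)           # e.g. 77 text-token embeddings
out = CrossAttention(320, 768)(z, cond)
print(out.shape)  # torch.Size([2, 256, 320])
```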
11.6 Score SDE
Song et al., 2021 | arXiv: 2011.13456
This ICLR 2021 Oral presentation connected DDPM and Score Matching under the unified framework of Stochastic Differential Equations (SDE).
Key contributions:
- Variance Exploding (VE) SDE: Corresponds to the NCSN/SMLD family
- Variance Preserving (VP) SDE: Corresponds to DDPM
- Sub-VP SDE: A variant providing better likelihood
The extension to continuous time enables exact log-likelihood computation (via ODE), more flexible sampler design, and conditional generation tasks such as Inpainting and Colorization.
11.7 Consistency Models
Song et al., 2023 | arXiv: 2303.01469
Consistency Models, proposed by Yang Song at OpenAI, represent an attempt to fundamentally solve the multi-step sampling problem of Diffusion Models.
The core idea is to learn a function $f_\theta(x_t, t)$ that maps all points on an ODE trajectory to the same starting point (original data):
$$f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \text{ on the same trajectory}$$
By this self-consistency property, data can be recovered from a noisy sample at any time $t$ with a single network evaluation. That is, 1-step generation is possible.
Two training approaches exist:
- Consistency Distillation (CD): Distilling from a pre-trained Diffusion Model
- Consistency Training (CT): Training independently without pre-training
In 2024, Easy Consistency Models (ECM) emerged, achieving better 2-step generation performance at 33% of the training cost compared to iCT.
11.8 Flow Matching / Rectified Flow
Lipman et al., 2023; Liu et al., 2023 | arXiv: 2210.02747, arXiv: 2209.03003
Flow Matching is an alternative approach to Diffusion Models that directly learns the probability flow connecting data and noise distributions.
Core Idea: Define straight paths from noise to data. In the Flow Matching convention, $x_0$ denotes noise and $x_1$ denotes data (the reverse of DDPM's indexing):
$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]$$
Learn a velocity field $v_\theta(x_t, t)$ along this path by regressing onto the constant path velocity:
$$L_{\text{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\!\left[\left\|v_\theta(x_t, t) - (x_1 - x_0)\right\|^2\right]$$
Rectified Flow repeatedly "straightens" these paths (reflow), producing high-quality samples even with few steps.
Stable Diffusion 3 adopted Rectified Flow, presenting a new paradigm for Diffusion Models alongside the transition from U-Net to Transformer.
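A toy 1-D Flow Matching experiment (entirely illustrative; the distributions and network are made up for this sketch) shows the whole pipeline: straight interpolation paths, velocity regression, and few-step Euler sampling:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy setup: noise endpoint x0 ~ N(0, 1), "data" endpoint x1 ~ N(4, 0.5^2).
model = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))  # v_theta(x, t)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(2000):
    x0 = torch.randn(256, 1)                 # noise sample
    x1 = 4 + 0.5 * torch.randn(256, 1)       # data sample
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1               # point on the straight path
    target = x1 - x0                         # constant velocity along that path
    loss = ((model(torch.cat([xt, t], dim=1)) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 (data) with Euler.
with torch.no_grad():
    x = torch.randn(10000, 1)
    for i in range(10):
        t = torch.full((10000, 1), i / 10)
        x = x + 0.1 * model(torch.cat([x, t], dim=1))
print(x.mean().item())  # close to 4, the data mean
```

Only 10 Euler steps are needed here precisely because the learned paths are (near-)straight, which is the intuition behind Rectified Flow's few-step sampling.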
11.9 DiT (Diffusion Transformer)
Peebles & Xie, 2023 | arXiv: 2212.09748
DiT replaced the Diffusion Model backbone from U-Net to Vision Transformer (ViT).
Key design choices:
- Images are divided into patches and processed as tokens
- Time step and class label are injected via Adaptive Layer Normalization (adaLN-Zero)
- Composed of Transformer Blocks
DiT, combined with Latent Diffusion, achieved FID 2.27 on ImageNet 256×256 class-conditional generation, surpassing all previous Diffusion Models.
Significance of DiT: It empirically demonstrated that Transformer scaling laws can be applied to Diffusion Models. Performance consistently improves with increased model size and training compute. This finding directly influenced the architectural choices of the latest large-scale generative models such as Sora (OpenAI, Video generation) and Stable Diffusion 3.
12. PyTorch Code Examples: Simple DDPM Implementation
Below is a simplified PyTorch implementation of DDPM's core components. A more sophisticated U-Net and hyperparameter tuning would be needed for actual training.
12.1 Noise Schedule and Forward Process
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class DDPMScheduler:
"""Scheduler managing DDPM's Forward Process."""
def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
self.num_timesteps = num_timesteps
if schedule == 'linear':
self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
elif schedule == 'cosine':
self.betas = self._cosine_schedule(num_timesteps)
else:
raise ValueError(f"Unknown schedule: {schedule}")
# Pre-compute key variables
self.alphas = 1.0 - self.betas
self.alphas_cumprod = torch.cumprod(self.alphas, dim=0) # ᾱ_t
self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)
# Forward process coefficients
self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod) # √ᾱ_t
self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod) # √(1-ᾱ_t)
# Reverse process coefficients
self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas) # 1/√α_t
self.posterior_variance = (
self.betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
) # β̃_t
def _cosine_schedule(self, timesteps, s=0.008):
steps = timesteps + 1
t = torch.linspace(0, timesteps, steps) / timesteps
alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return torch.clip(betas, 0.0001, 0.9999)
def add_noise(self, x_0, t, noise=None):
"""Forward process: compute q(x_t | x_0) in one step."""
if noise is None:
noise = torch.randn_like(x_0)
        # Index on the same device as x_0 so CUDA index tensors work
        sqrt_alpha_cumprod = self.sqrt_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_cumprod = self.sqrt_one_minus_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
# x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise
return x_t
12.2 Simplified U-Net
class SinusoidalPositionEmbedding(nn.Module):
"""Transformer-style Sinusoidal Time Embedding."""
def __init__(self, dim):
super().__init__()
self.dim = dim
def forward(self, t):
device = t.device
half_dim = self.dim // 2
emb = math.log(10000) / (half_dim - 1)
emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
emb = t[:, None].float() * emb[None, :]
emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
return emb
class ResBlock(nn.Module):
"""Time-conditioned Residual Block."""
def __init__(self, in_ch, out_ch, time_emb_dim):
super().__init__()
        self.norm1 = nn.GroupNorm(8 if in_ch % 8 == 0 else 1, in_ch)  # fall back to 1 group when in_ch (e.g. RGB = 3) is not divisible by 8
self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
self.time_mlp = nn.Sequential(
nn.SiLU(),
nn.Linear(time_emb_dim, out_ch),
)
self.norm2 = nn.GroupNorm(8, out_ch)
self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
def forward(self, x, t_emb):
h = self.conv1(F.silu(self.norm1(x)))
h = h + self.time_mlp(t_emb)[:, :, None, None] # Inject time embedding
h = self.conv2(F.silu(self.norm2(h)))
return h + self.skip(x) # Residual connection
class SimpleUNet(nn.Module):
"""Simplified U-Net for DDPM training."""
def __init__(self, in_channels=3, base_channels=64, time_emb_dim=256):
super().__init__()
# Time embedding
self.time_mlp = nn.Sequential(
SinusoidalPositionEmbedding(base_channels),
nn.Linear(base_channels, time_emb_dim),
nn.SiLU(),
nn.Linear(time_emb_dim, time_emb_dim),
)
# Encoder
self.enc1 = ResBlock(in_channels, base_channels, time_emb_dim)
self.enc2 = ResBlock(base_channels, base_channels * 2, time_emb_dim)
self.enc3 = ResBlock(base_channels * 2, base_channels * 4, time_emb_dim)
self.pool = nn.MaxPool2d(2)
# Bottleneck
self.bot = ResBlock(base_channels * 4, base_channels * 4, time_emb_dim)
# Decoder (with skip connections)
self.dec3 = ResBlock(base_channels * 8, base_channels * 2, time_emb_dim)
self.dec2 = ResBlock(base_channels * 4, base_channels, time_emb_dim)
self.dec1 = ResBlock(base_channels * 2, base_channels, time_emb_dim)
self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
# Output
self.out = nn.Conv2d(base_channels, in_channels, 1)
def forward(self, x, t):
t_emb = self.time_mlp(t)
# Encoder
e1 = self.enc1(x, t_emb)
e2 = self.enc2(self.pool(e1), t_emb)
e3 = self.enc3(self.pool(e2), t_emb)
# Bottleneck
b = self.bot(self.pool(e3), t_emb)
# Decoder with skip connections
d3 = self.dec3(torch.cat([self.up(b), e3], dim=1), t_emb)
d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1), t_emb)
d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1), t_emb)
return self.out(d1) # Predicted noise ε_θ
12.3 Training Loop
def train_ddpm(model, dataloader, scheduler, epochs=100, lr=2e-4, device='cuda'):
"""DDPM training loop (Algorithm 1 implementation)."""
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
model.train()
for epoch in range(epochs):
total_loss = 0
for batch_idx, (x_0, _) in enumerate(dataloader):
x_0 = x_0.to(device)
# 1. Select random time step: t ~ Uniform({1, ..., T})
t = torch.randint(0, scheduler.num_timesteps, (x_0.shape[0],), device=device)
# 2. Sample noise: ε ~ N(0, I)
noise = torch.randn_like(x_0)
# 3. Forward process: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
x_t = scheduler.add_noise(x_0, t, noise)
# 4. Predict noise: ε_θ(x_t, t)
noise_pred = model(x_t, t)
# 5. Simplified loss: L = ||ε - ε_θ(x_t, t)||²
loss = F.mse_loss(noise_pred, noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
12.4 Sampling
@torch.no_grad()
def sample_ddpm(model, scheduler, image_shape, device='cuda'):
"""DDPM sampling (Algorithm 2 implementation)."""
model.eval()
# x_T ~ N(0, I)
x = torch.randn(image_shape, device=device)
for t in reversed(range(scheduler.num_timesteps)):
t_batch = torch.full((image_shape[0],), t, device=device, dtype=torch.long)
# Predict noise
predicted_noise = model(x, t_batch)
# Reverse process coefficients
alpha_t = scheduler.alphas[t]
alpha_cumprod_t = scheduler.alphas_cumprod[t]
beta_t = scheduler.betas[t]
# Compute mean: μ_θ = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ)
mean = (1.0 / torch.sqrt(alpha_t)) * (
x - (beta_t / torch.sqrt(1.0 - alpha_cumprod_t)) * predicted_noise
)
if t > 0:
# Add stochastic noise (except at the last step)
noise = torch.randn_like(x)
sigma_t = torch.sqrt(scheduler.posterior_variance[t])
x = mean + sigma_t * noise
else:
x = mean
return x
12.5 Usage Example
# Hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 32
batch_size = 128
num_timesteps = 1000
# Initialize scheduler and model
scheduler = DDPMScheduler(num_timesteps=num_timesteps, schedule='cosine')
model = SimpleUNet(in_channels=3, base_channels=64).to(device)
# Dataset (e.g., CIFAR-10)
from torchvision import datasets, transforms
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), # Normalize to [-1, 1]
])
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Train
train_ddpm(model, dataloader, scheduler, epochs=100, device=device)
# Sample
samples = sample_ddpm(model, scheduler, (16, 3, image_size, image_size), device=device)
# samples: 16 generated images in [-1, 1] range
13. Diffusion Model vs GAN vs VAE: Comparative Analysis
13.1 Comprehensive Comparison Table
| Property | Diffusion Model (DDPM) | GAN | VAE |
|---|---|---|---|
| Training Method | Noise prediction (MSE) | Adversarial training (Min-Max) | Variational inference (ELBO) |
| Training Stability | Very stable | Unstable (mode collapse, oscillation) | Stable |
| Generation Quality | Very high | Very high | Moderate (blurry) |
| Diversity | High (full distribution coverage) | Low (mode collapse risk) | High |
| Generation Speed | Slow (1000 steps) | Very fast (1 step) | Fast (1 step) |
| Log-likelihood | Computable (ELBO) | Not computable | Computable (ELBO) |
| Latent Space | Implicit | None (or limited) | Explicit, continuous |
| Mode Coverage | High | Low | High |
| Conditional Generation | Very effective via CFG | Possible via cGAN | Conditional VAE |
| Resolution Scaling | Efficient via LDM | Progressive training needed | Hierarchical VAE needed |
| Theoretical Basis | Thermodynamics, Score Matching | Game theory | Variational Bayes |
| Representative Models | Stable Diffusion, DALL-E 2 | StyleGAN, BigGAN | VQ-VAE-2, NVAE |
| CIFAR-10 FID | ~2.0 (latest) | ~2.9 (StyleGAN2) | ~23.5 (NVAE) |
13.2 When to Choose Which Model?
Choose Diffusion Models when:
- Both generation quality and diversity are important
- Complex conditional generation like text-to-image is needed
- Training stability is critical
- Generation speed is not the top priority
Choose GANs when:
- Real-time generation is needed
- High-quality images for a specific domain are needed (faces, landscapes, etc.)
- The dataset is relatively small and uniform
Choose VAEs when:
- Meaningful Latent Space manipulation is needed
- Likelihood-based anomaly detection is needed
- Fast encoding/decoding is required
- Semi-supervised learning or representation learning is the main purpose
14. Present and Future of Diffusion Models
14.1 Major Trends in 2024-2025
Architecture Transition: From U-Net to Transformer. The latest models such as Stable Diffusion 3, FLUX, and Sora adopt DiT-based architectures. Transformer scaling laws have been confirmed to apply to Diffusion Models, and model scale expansion (8B+ parameters) is actively underway.
Sampling Efficiency. With advances in Consistency Models, Flow Matching, and DPM-Solver, 1-4 step generation has become possible. Rectified Flow learns straight paths, achieving high quality even with few steps.
Multimodal Expansion. Diffusion Models are expanding beyond images to video (Sora, Runway Gen-3), audio (AudioLDM), 3D (DreamFusion, Zero-1-to-3), robotics (Diffusion Policy), and other domains.
Acceleration and Optimization. Techniques such as Distillation, Quantization, and Caching have greatly improved inference speed, approaching real-time image generation.
14.2 Historical Significance of DDPM
DDPM represents a turning point in generative model history in the following ways:
- Demonstrated the competitiveness of Likelihood-based models in the image generation space dominated by GANs
- Showed that high-quality generation is possible with an extremely simple training objective ($\|\epsilon - \epsilon_\theta\|^2$)
- Established a theoretical framework connecting thermodynamics and Score Matching
- Became the direct foundation of the modern AI revolution including Stable Diffusion, DALL-E 2, and Midjourney
15. References
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585
Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models (DDIM). ICLR 2021. arXiv:2010.02502
Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672
Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021. arXiv:2105.05233
Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop 2021. arXiv:2207.12598
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. arXiv:2303.01469
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). ICCV 2023. arXiv:2212.09748
Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600
Weng, L. (2021). What are Diffusion Models? lilianweng.github.io
Hugging Face. The Annotated Diffusion Model. huggingface.co/blog/annotated-diffusion