Diffusion Model Paper Survey: Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL


Introduction

In the field of image generation, Diffusion Models have established themselves as a new paradigm, replacing GANs (Generative Adversarial Networks). Since Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, services like Stable Diffusion, DALL-E 2, and Midjourney have emerged within just three years, driving the democratization of image generation.

The core idea behind Diffusion Models is remarkably simple: a fixed Forward Process gradually adds noise to data, and a learned Reverse Process removes this noise step by step to reconstruct the data. Through this training, the model learns which direction to denoise at each noise level.

In this article, we survey the evolution of major models chronologically: from the mathematical foundations of DDPM to DDIM's accelerated sampling, the relationship with score-based models, Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. We comprehensively cover each model's key contributions, implementation code, performance comparisons, and operational considerations.

DDPM: Foundations of Diffusion Models

Forward Process (Adding Noise)

DDPM's Forward Process gradually adds Gaussian noise to the original data x_0 over T steps. The noise schedule at each step t is controlled by beta_t.

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)

Using the reparameterization trick, we can directly compute the noised image at any arbitrary timestep t.

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Here, alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product from alpha_1 to alpha_t.

import torch
import torch.nn as nn
import numpy as np

class DDPMScheduler:
    """DDPM Forward Process Scheduler"""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

    def add_noise(self, x_0, t, noise=None):
        """Generate noised image at arbitrary timestep t"""
        if noise is None:
            noise = torch.randn_like(x_0)

        sqrt_alpha_bar = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)

        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
        return x_t

    def sample_timesteps(self, batch_size):
        """Sample random timesteps for training"""
        return torch.randint(0, self.num_timesteps, (batch_size,))
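As a quick sanity check, the closed-form q(x_t | x_0) can be exercised directly. This minimal sketch reproduces the scheduler's add_noise computation with the same linear beta schedule, and confirms that alpha_bar_t falls from nearly 1 (almost no noise) toward 0 (almost pure noise) over the 1000 steps:

```python
import torch

# Same linear schedule as the scheduler above
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x_0 = torch.randn(8, 3, 32, 32)        # dummy clean images
t = torch.randint(0, 1000, (8,))       # one random timestep per sample
noise = torch.randn_like(x_0)

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
coef_signal = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
coef_noise = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
x_t = coef_signal * x_0 + coef_noise * noise

print(x_t.shape)                  # torch.Size([8, 3, 32, 32])
print(alphas_cumprod[0].item())   # near 1: almost no noise at t=0
print(alphas_cumprod[-1].item())  # near 0: almost pure noise at t=T-1
```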

Reverse Process (Denoising)

In the Reverse Process, starting from x_T ~ N(0, I), the trained model epsilon_theta is used to progressively remove noise step by step.

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)

class DDPMSampler:
    """DDPM Reverse Process Sampler"""
    def __init__(self, scheduler):
        self.scheduler = scheduler

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDPM reverse diffusion sampling"""
        # Start from pure noise
        x = torch.randn(shape, device=device)

        for t in reversed(range(self.scheduler.num_timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

            # Predict noise
            predicted_noise = model(x, t_batch)

            # Compute mean
            alpha = self.scheduler.alphas[t]
            alpha_bar = self.scheduler.alphas_cumprod[t]
            beta = self.scheduler.betas[t]

            mean = (1 / torch.sqrt(alpha)) * (
                x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
            )

            # Add noise only when t > 0
            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(beta)
                x = mean + sigma * noise
            else:
                x = mean

        return x

Training Objective: Simple Loss

DDPM training minimizes the MSE between the model-predicted noise and the actual noise.

L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]

def ddpm_training_step(model, x_0, scheduler, optimizer):
    """DDPM training single step"""
    batch_size = x_0.shape[0]
    device = x_0.device

    # 1. Sample random timesteps
    t = scheduler.sample_timesteps(batch_size).to(device)

    # 2. Generate noise and noised image
    noise = torch.randn_like(x_0)
    x_t = scheduler.add_noise(x_0, t, noise)

    # 3. Model predicts noise
    predicted_noise = model(x_t, t)

    # 4. Compute Simple Loss
    loss = nn.functional.mse_loss(predicted_noise, noise)

    # 5. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

DDIM: Accelerated Sampling

DDPM requires 1000 steps of reverse diffusion, making generation extremely slow. DDIM (Denoising Diffusion Implicit Models) proposed by Song et al. (2020) defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.

The key to DDIM is the eta parameter that controls stochastic/deterministic sampling. When eta=0, the sampling is fully deterministic; when eta=1, it becomes identical to DDPM.

class DDIMSampler:
    """DDIM Accelerated Sampler"""
    def __init__(self, scheduler, ddim_steps=50, eta=0.0):
        self.scheduler = scheduler
        self.ddim_steps = ddim_steps
        self.eta = eta
        # Generate subset timesteps (e.g., 1000 -> 50)
        self.timesteps = np.linspace(
            0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
        )[::-1]

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDIM accelerated sampling - high quality in 50 steps"""
        x = torch.randn(shape, device=device)

        for i in range(len(self.timesteps)):
            t = self.timesteps[i]
            t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0

            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            predicted_noise = model(x, t_batch)

            alpha_bar_t = self.scheduler.alphas_cumprod[t]
            alpha_bar_prev = self.scheduler.alphas_cumprod[t_prev]

            # Predict x_0
            x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
            x_0_pred = torch.clamp(x_0_pred, -1, 1)

            # Compute direction
            sigma = self.eta * torch.sqrt(
                (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
            )
            direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise

            # Compute x_{t-1}
            x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction

            if self.eta > 0 and t > 0:
                x = x + sigma * torch.randn_like(x)

        return x

Relationship with Score-based Models

Song and Ermon (2019) interpreted diffusion models from the Score Matching perspective. The score function is the gradient of the log density of the data distribution.

s_\theta(x) \approx \nabla_x \log p(x)

DDPM's noise prediction epsilon_theta and the score function have the following relationship:

s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}

This relationship was unified through the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:

dx = f(x, t)\,dt + g(t)\,dw
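The epsilon-to-score identity above is easy to exercise numerically. In the sketch below, a random tensor stands in for a trained network's output epsilon_theta(x_t, t); the conversion to a score estimate is just a rescaling by the noise level:

```python
import torch

# Stand-in for a trained model's noise prediction at timestep t
# (assumption: in practice this comes from epsilon_theta(x_t, t))
epsilon_pred = torch.randn(4, 3, 32, 32)

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
t = 500

# s_theta(x_t, t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)
score = -epsilon_pred / torch.sqrt(1.0 - alphas_cumprod[t])
print(score.shape)  # torch.Size([4, 3, 32, 32])
```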

Latent Diffusion Model (Stable Diffusion)

Architecture Overview

Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically reduced computational costs by performing the diffusion process in latent space rather than pixel space. This is the core architecture behind Stable Diffusion.

LDM consists of the following key components:

| Component | Role | Details |
| --- | --- | --- |
| VAE Encoder | Encodes images to latent space | Compresses 512x512 images to 64x64x4 latent representations |
| U-Net (Denoiser) | Predicts noise in latent space | Incorporates text conditions via Cross-Attention |
| VAE Decoder | Decodes latents to images | Reconstructs 64x64x4 latent representations to 512x512 images |
| Text Encoder | Encodes text prompts | Generates 77-token embeddings using CLIP ViT-L/14 |

Core Code Structure

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

class LatentDiffusionInference:
    """Stable Diffusion Inference Pipeline (Simplified)"""

    def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            safety_checker=None
        ).to("cuda")

        # Switch to DDIM scheduler (accelerate with 50 steps)
        self.pipe.scheduler = DDIMScheduler.from_config(
            self.pipe.scheduler.config
        )

    def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
        """Text-to-image generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
        ).images[0]
        return image

    def generate_with_latent_control(self, prompt, seed=42):
        """Direct latent space control"""
        generator = torch.Generator(device="cuda").manual_seed(seed)

        # Generate latent vector directly
        latents = torch.randn(
            (1, 4, 64, 64),
            generator=generator,
            device="cuda",
            dtype=torch.float16
        )

        image = self.pipe(
            prompt=prompt,
            latents=latents,
            num_inference_steps=50,
            guidance_scale=7.5,
        ).images[0]
        return image

Cross-Attention Mechanism

In Stable Diffusion's U-Net, Cross-Attention incorporates text conditions into image generation. Query is generated from the image latent representation, while Key and Value come from the text embeddings.

class CrossAttention(nn.Module):
    """Cross-Attention Layer in Stable Diffusion U-Net"""
    def __init__(self, d_model=320, d_context=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(d_context, d_model, bias=False)
        self.to_v = nn.Linear(d_context, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: Image latent representation (B, H*W, d_model)
        context: Text embeddings (B, seq_len, d_context)
        """
        B, N, C = x.shape

        q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        # Scaled Dot-Product Attention
        scale = self.d_head ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)

        out = out.transpose(1, 2).contiguous().view(B, N, C)
        return self.to_out(out)
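The same query-from-image, key/value-from-text wiring can be sketched with PyTorch's built-in attention module. Dimensions here mirror SD v1.5's first attention level (320-dim latent tokens, 768-dim CLIP embeddings); this is a shape-level sketch, not the pipeline's actual layer:

```python
import torch
import torch.nn as nn

d_model, d_context, n_heads = 320, 768, 8

# Project text embeddings to the latent width so K/V match Q
to_kv = nn.Linear(d_context, d_model, bias=False)
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 64 * 64, d_model)       # image latent tokens (B, H*W, C)
context = torch.randn(2, 77, d_context)    # CLIP text embeddings (B, 77, 768)

kv = to_kv(context)
out, weights = attn(query=x, key=kv, value=kv)
print(out.shape)      # torch.Size([2, 4096, 320]) - one output per image token
print(weights.shape)  # torch.Size([2, 4096, 77]) - each image token attends to 77 text tokens
```

The attention-weight shape makes the mechanism concrete: every spatial position in the latent attends over the 77 text tokens.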

Classifier-free Guidance (CFG)

Classifier-free Guidance proposed by Ho and Salimans (2022) is a key technique for controlling generation quality without a separate classifier.

During training, a single model learns both conditional and unconditional prediction: the text condition is replaced with a null (empty-string) embedding with a fixed probability (typically around 10%). During inference, the unconditional prediction is extrapolated toward the conditional one.

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))

Here, w is the guidance scale. w=1 means pure conditional generation, and higher w values make the generation follow the text condition more strongly (typically 7.5-15).

def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """Classifier-free Guidance single step"""

    # Process conditional/unconditional predictions as a single batch
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([null_embedding, text_embedding], dim=0)

    # Generate both predictions in a single forward pass
    noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)

    # Apply CFG
    noise_pred_guided = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )
    return noise_pred_guided
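On the training side, CFG only requires randomly dropping the condition. A minimal sketch of that dropout follows; the 10% drop probability reflects the common convention, and `null_embedding` stands for the precomputed empty-prompt embedding (a zero tensor is used here only as a placeholder):

```python
import torch

def drop_conditions(text_embedding, null_embedding, drop_prob=0.1):
    """Randomly replace per-sample text embeddings with the null embedding,
    so one model learns both conditional and unconditional prediction."""
    B = text_embedding.shape[0]
    mask = torch.rand(B, device=text_embedding.device) < drop_prob
    # Broadcast the per-sample mask over (seq_len, dim)
    return torch.where(mask.view(B, 1, 1), null_embedding, text_embedding)

text_emb = torch.randn(4, 77, 768)
null_emb = torch.zeros(1, 77, 768)  # placeholder; real pipelines encode ""
out = drop_conditions(text_emb, null_emb, drop_prob=1.0)
print(torch.equal(out, null_emb.expand(4, -1, -1)))  # True: all conditions dropped
```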

DiT: Diffusion Transformer

From U-Net to Transformer

DiT (Diffusion Transformer) by Peebles and Xie (2023) replaced the U-Net backbone of diffusion models with a Transformer. The key finding is that increasing the Transformer's compute (GFLOPs) consistently improves generation quality (lower FID).

| Model | Backbone | Parameters | FID (ImageNet 256) | GFLOPs |
| --- | --- | --- | --- | --- |
| ADM | U-Net | 554M | 10.94 | 1120 |
| LDM-4 | U-Net | 400M | 10.56 | 103 |
| DiT-S/2 | Transformer | 33M | 68.40 | 6 |
| DiT-B/2 | Transformer | 130M | 43.47 | 23 |
| DiT-L/2 | Transformer | 458M | 9.62 | 80 |
| DiT-XL/2 | Transformer | 675M | 2.27 | 119 |

adaLN-Zero Block

The key innovation of DiT is the adaLN-Zero conditioning approach. Timestep and class embeddings are injected as scale/shift parameters of Adaptive Layer Normalization, with gating parameters initialized to zero so that the block acts as an identity function (residual connection) at the start of training.

class DiTBlock(nn.Module):
    """DiT adaLN-Zero Transformer Block"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        # adaLN modulation: 6 parameter sets (shift1, scale1, gate1, shift2, scale2, gate2)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(d_model, 6 * d_model),
        )
        # Zero initialization - acts as identity at start of training
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x, c):
        """
        x: Patch tokens (B, N, D)
        c: Condition embedding - timestep + class (B, D)
        """
        # Generate adaLN parameters
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.adaLN_modulation(c).chunk(6, dim=-1)
        )

        # Self-Attention with adaLN
        h = self.norm1(x)
        h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        x = x + gate1.unsqueeze(1) * h

        # FFN with adaLN
        h = self.norm2(x)
        h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        h = self.mlp(h)
        x = x + gate2.unsqueeze(1) * h

        return x

Patchify Strategy

DiT splits the latent representation into p x p patches to use as Transformer input tokens. Smaller patch sizes result in more tokens, improving performance but increasing computational cost.

class PatchEmbed(nn.Module):
    """DiT Patchify Layer"""
    def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        """(B, C, H, W) -> (B, N, D) patch token sequence"""
        x = self.proj(x)  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D)
        return x
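The token-count tradeoff mentioned above is easy to quantify. For a 32x32x4 latent (a 256x256 image through the SD VAE at downsampling factor 8), halving the patch size quadruples the sequence length:

```python
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 32, 32)  # 256x256 image -> 32x32x4 latent (f=8 VAE)

for p in (8, 4, 2):
    proj = nn.Conv2d(4, 1152, kernel_size=p, stride=p)  # patchify, as in DiT-XL
    tokens = proj(latent).flatten(2).transpose(1, 2)    # (B, N, D)
    print(p, tokens.shape[1])  # number of tokens N = (32/p)**2
# p=8 -> 16 tokens, p=4 -> 64 tokens, p=2 -> 256 tokens
```

Since self-attention cost grows quadratically in N, the jump from p=8 to p=2 multiplies attention FLOPs by roughly 256x, which is why DiT-XL/2 (p=2) achieves the best FID at the highest GFLOPs.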

SDXL: Evolution of Stable Diffusion

Key Improvements

SDXL by Podell et al. (2023) introduced the following core improvements over Stable Diffusion v1.5:

| Feature | SD v1.5 | SDXL Base |
| --- | --- | --- |
| U-Net Parameters | 860M | 2.6B (3x increase) |
| Text Encoder | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text Embedding Dimension | 768 | 2048 |
| Default Resolution | 512x512 | 1024x1024 |
| Attention Blocks | 16 | 70 |
| Refiner Model | None | Dedicated Refiner included |

Dual Text Encoders

One of SDXL's greatest innovations is the use of two text encoders. It combines the rich semantic representations from OpenCLIP ViT-bigG with complementary features from CLIP ViT-L, significantly improving text understanding.

from diffusers import StableDiffusionXLPipeline
import torch

class SDXLInference:
    """SDXL Inference Pipeline"""

    def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )

        # Memory optimization: model cpu offload manages device placement
        # itself, so the pipeline is not moved to CUDA manually
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()

    def generate(self, prompt, negative_prompt="", steps=30):
        """SDXL basic generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=7.5,
            height=1024,
            width=1024,
        ).images[0]
        return image

    def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
        """Base + Refiner two-stage pipeline"""
        # Base model: 80% of total steps
        high_noise_frac = 0.8
        image = base_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_end=high_noise_frac,
            output_type="latent",
        ).images

        # Refiner: remaining 20% (enhance fine details)
        image = refiner_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_start=high_noise_frac,
            image=image,
        ).images[0]
        return image
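The dual-encoder combination can be sketched at the tensor level: SDXL concatenates the per-token hidden states of CLIP ViT-L (768-dim) and OpenCLIP ViT-bigG (1280-dim) along the feature axis to form the 2048-dim cross-attention context. Random tensors stand in for the encoders' outputs here:

```python
import torch

B, seq_len = 2, 77
clip_l_hidden = torch.randn(B, seq_len, 768)       # CLIP ViT-L hidden states
openclip_g_hidden = torch.randn(B, seq_len, 1280)  # OpenCLIP ViT-bigG hidden states

# Feature-axis concatenation: 768 + 1280 = 2048
context = torch.cat([clip_l_hidden, openclip_g_hidden], dim=-1)
print(context.shape)  # torch.Size([2, 77, 2048])
```

This matches the 2048-dim text embedding listed in the comparison table above.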

Size/Crop Conditioning

SDXL provides the original image size and crop coordinates as conditions during training, enabling effective learning of images with diverse aspect ratios. This is implemented using Fourier Feature Encoding.

def get_sdxl_conditioning(original_size, crop_coords, target_size):
    """Generate SDXL size/crop conditioning"""
    # Original size (height, width)
    original_size = torch.tensor(original_size, dtype=torch.float32)
    # Crop coordinates (top, left)
    crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
    # Target size (height, width)
    target_size = torch.tensor(target_size, dtype=torch.float32)

    # Fourier Feature Encoding
    conditioning = torch.cat([original_size, crop_coords, target_size])

    # Sinusoidal embedding
    freqs = torch.exp(
        -torch.arange(0, 128) * np.log(10000) / 128
    )
    emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)

    return emb.flatten()
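Running the computation above on typical values makes the output size concrete: six scalars (original size, crop coordinates, target size), each expanded to a 256-dim sinusoidal embedding, yield a 1536-dim conditioning vector. A self-contained version of the same arithmetic:

```python
import torch
import numpy as np

# Same computation as get_sdxl_conditioning, inlined for a quick check:
# original size (1024, 1024), crop (0, 0), target size (1024, 1024)
values = torch.tensor([1024., 1024., 0., 0., 1024., 1024.])
freqs = torch.exp(-torch.arange(0, 128) * np.log(10000) / 128)
emb = values.unsqueeze(-1) * freqs.unsqueeze(0)            # (6, 128)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)  # (6, 256)
print(emb.flatten().shape)  # torch.Size([1536])
```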

ControlNet: Conditional Generation Control

ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to pretrained diffusion models. The Zero Convolution technique preserves the existing capabilities of the model at the beginning of training while gradually learning new conditions.
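The Zero Convolution itself is just a 1x1 convolution whose weights and bias are initialized to zero, so the ControlNet branch contributes exactly nothing before training begins. A minimal sketch:

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 conv initialized to zero: its output is exactly zero before
    training, so adding it to the frozen U-Net's features is an identity."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

feat = torch.randn(1, 320, 64, 64)     # frozen U-Net feature map
control = torch.randn(1, 320, 64, 64)  # ControlNet branch feature map
zc = zero_conv(320)

out = feat + zc(control)
print(torch.equal(out, feat))  # True: identity at the start of training
```

As training proceeds, the zero-initialized weights move away from zero and the spatial condition is gradually injected without disrupting the pretrained model.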

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch

def controlnet_canny_generation(input_image_path, prompt):
    """ControlNet Canny Edge based image generation"""
    # Load ControlNet model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_canny",
        torch_dtype=torch.float16,
    )

    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Extract Canny Edge
    canny_detector = CannyDetector()
    input_image = Image.open(input_image_path)
    canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)

    # ControlNet-based generation
    output = pipe(
        prompt=prompt,
        image=canny_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,
    ).images[0]

    return output

Training Pipeline and Data Preparation

Dataset Composition

Here is a comparison of major datasets used for training large-scale diffusion models.

| Dataset | Scale | Resolution | Usage |
| --- | --- | --- | --- |
| LAION-5B | 5.8B image-text pairs | Various | Stable Diffusion training |
| LAION-Aesthetics | 120M (filtered) | Various | High-quality fine-tuning |
| ImageNet | 1.3M | 256/512 | DiT training (class-conditional) |
| COYO-700M | 700M | Various | Multilingual training (incl. Korean) |

Fine-tuning Strategies

# LoRA Fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
    --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
    --dataset_name="custom_dataset" \
    --resolution=512 \
    --train_batch_size=4 \
    --gradient_accumulation_steps=4 \
    --learning_rate=1e-4 \
    --lr_scheduler="cosine" \
    --lr_warmup_steps=500 \
    --max_train_steps=10000 \
    --rank=64 \
    --output_dir="./lora_output" \
    --mixed_precision="fp16" \
    --enable_xformers_memory_efficient_attention

# DreamBooth Fine-tuning (specific object/style learning)
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
    --instance_data_dir="./my_images" \
    --instance_prompt="a photo of sks dog" \
    --class_data_dir="./class_images" \
    --class_prompt="a photo of dog" \
    --with_prior_preservation \
    --prior_loss_weight=1.0 \
    --num_class_images=200 \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=5e-6 \
    --max_train_steps=800

Inference Optimization Techniques

Key Optimization Techniques Comparison

| Technique | Speed Improvement | Quality Impact | Memory Savings |
| --- | --- | --- | --- |
| DDIM (50 steps) | 20x | Minimal | - |
| DPM-Solver++ (20 steps) | 50x | Minimal | - |
| xFormers Memory Efficient Attention | 1.5x | None | 30-40% |
| torch.compile | 1.2-1.5x | None | - |
| VAE Tiling | - | Minimal | 70%+ |
| FP16/BF16 | 1.5-2x | Minimal | 50% |
| TensorRT | 2-4x | None | - |

Production Optimization Code

import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

def optimized_sdxl_pipeline():
    """Production-optimized SDXL Pipeline"""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")

    # 1. Apply fast scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )

    # 2. VAE Tiling (memory savings for high-resolution generation)
    pipe.enable_vae_tiling()

    # 3. Attention Slicing (when VRAM is limited)
    pipe.enable_attention_slicing()

    # 4. torch.compile (PyTorch 2.0+)
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

    return pipe

# GPU memory monitoring
def monitor_gpu_memory():
    """Monitor GPU memory usage"""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    max_allocated = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved:  {reserved:.2f} GB")
    print(f"Peak:      {max_allocated:.2f} GB")

Comprehensive Model Comparison

| Model | Year | Key Contribution | Backbone | Conditioning | Resolution |
| --- | --- | --- | --- | --- | --- |
| DDPM | 2020 | Practical diffusion models | U-Net | None (unconditional) | 256 |
| DDIM | 2020 | Accelerated sampling | U-Net | None | 256 |
| LDM (SD) | 2022 | Latent space diffusion | U-Net + VAE | Cross-Attention | 512 |
| DiT | 2023 | Transformer backbone | Transformer | adaLN-Zero | 256/512 |
| SDXL | 2023 | Large-scale U-Net + dual encoders | U-Net + VAE | Cross-Attention + CFG | 1024 |
| ControlNet | 2023 | Spatial condition control | Zero Conv + U-Net | Edge/Depth/Pose | 512 |
| SD3 | 2024 | MMDiT (Multi-Modal DiT) | Transformer | Flow Matching | 1024 |

Operational Considerations

GPU Memory Management

The most common issue when operating Stable Diffusion-based services is GPU OOM (Out of Memory). The following items should be checked:

  1. Batch size limits: For 1024x1024 SDXL generation, approximately 12GB is needed per image on A100 80GB, while V100 16GB will encounter OOM
  2. Concurrent request limits: Rate limiters must be applied to prevent GPU memory overflow
  3. Enable VAE Tiling: Essential for high-resolution (2048x2048+) generation
  4. Memory profiling: Regular GPU memory monitoring to detect memory leaks

Failure Case: GPU OOM Recovery

# Check GPU memory status
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Check GPU memory leaks from Python processes
fuser -v /dev/nvidia*

# Force GPU memory release (without process restart)
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"

# Service recovery procedure after OOM
# 1. Graceful shutdown of the affected worker process
# 2. Verify GPU memory release
# 3. Adjust batch size/concurrent request count
# 4. Restart worker process
# 5. Resume traffic after health check passes

NSFW Filtering

For commercial services, the Safety Checker must always be enabled. Disabling it can result in NSFW content being generated, which may cause legal issues.

# Safety Checker configuration (required for production)
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    safety_checker=None,  # Only disable in development
)

# Must be enabled in production
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor

safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)

Failure Cases and Recovery Procedures

Case 1: Model Loading Failure

Disk I/O timeouts or checkpoint corruption can occur when loading large-scale models.

import os
from diffusers import StableDiffusionXLPipeline

def robust_model_loading(model_id, max_retries=3):
    """Robust model loading (with retries)"""
    for attempt in range(max_retries):
        try:
            pipe = StableDiffusionXLPipeline.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                use_safetensors=True,
                local_files_only=os.path.exists(
                    os.path.join(model_id, "model_index.json")
                ),
            )
            pipe = pipe.to("cuda")
            # Warmup run
            _ = pipe("test", num_inference_steps=1)
            return pipe
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                import time
                time.sleep(10)
                # Clear cache and retry
                torch.cuda.empty_cache()
            else:
                raise RuntimeError(f"Model loading failed after {max_retries} attempts")

Case 2: Image Quality Degradation (Inappropriate CFG Scale)

# CFG Scale Guidelines
guidance_scale_guidelines:
  1.0: 'Condition nearly ignored - generation close to random'
  3.0-5.0: 'Creative and diverse generation'
  7.0-8.5: 'Generally recommended range - quality/diversity balance'
  10.0-15.0: 'High text fidelity - risk of oversaturation'
  20.0+: 'Excessive guidance - artifacts may appear'

# Troubleshooting checklist
troubleshooting:
  blurry_output:
    - 'Increase num_inference_steps (minimum 30+)'
    - 'Switch scheduler to DPM-Solver++'
  oversaturated:
    - 'Lower guidance_scale to 7.0 or below'
    - "Add 'oversaturated, vivid' to negative_prompt"
  wrong_composition:
    - 'Improve prompt structure (clear subject-verb-object)'
    - 'Use ControlNet for composition control'

Conclusion

Diffusion Models have evolved rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's large-scale design, and ControlNet's fine-grained control.

Currently, new paradigms like SD3's MMDiT (Multi-Modal Diffusion Transformer) and Flow Matching, as well as Consistency Models, are emerging to enable even faster and higher-quality image generation. In particular, the DiT architecture serves as the foundation for video generation models like Sora (OpenAI), and the applications of Diffusion Models are expanding beyond images to video, 3D, and audio.

From an engineering perspective, understanding the theoretical background of models is the key to optimization and debugging. Accurately grasping the role of each component, including noise schedules, CFG Scale, scheduler selection, and memory management, is essential for operating stable services in production environments.
