Diffusion Model Paper Survey: Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL
- Introduction
- DDPM: Foundations of Diffusion Models
- DDIM: Accelerated Sampling
- Relationship with Score-based Models
- Latent Diffusion Model (Stable Diffusion)
- Classifier-free Guidance (CFG)
- DiT: Diffusion Transformer
- SDXL: Evolution of Stable Diffusion
- ControlNet: Conditional Generation Control
- Training Pipeline and Data Preparation
- Inference Optimization Techniques
- Comprehensive Model Comparison
- Operational Considerations
- Failure Cases and Recovery Procedures
- Conclusion
- References

Introduction
In the field of image generation, Diffusion Models have established themselves as a new paradigm replacing GANs (Generative Adversarial Networks). Since Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, commercial services like Stable Diffusion, DALL-E 2, and Midjourney emerged within just three years, driving the democratization of image generation.
The core idea behind Diffusion Models is remarkably simple. It involves learning a Forward Process that gradually adds noise to data and a Reverse Process that removes this noise in reverse to reconstruct the data. Through this process, the model learns "which direction to remove noise" at each noise level.
In this article, we survey the evolution of major models chronologically: from the mathematical foundations of DDPM to DDIM's accelerated sampling, the relationship with score-based models, Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. We comprehensively cover each model's key contributions, implementation code, performance comparisons, and operational considerations.
DDPM: Foundations of Diffusion Models
Forward Process (Adding Noise)
DDPM's Forward Process gradually adds Gaussian noise to the original data x_0 over T steps, with the noise schedule at each step t controlled by beta_t:

q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)

Using the reparameterization trick, the noised image at any arbitrary timestep t can be computed directly from x_0:

x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon,  epsilon ~ N(0, I)

Here, alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product of alpha_1 through alpha_t.
```python
import torch
import torch.nn as nn
import numpy as np


class DDPMScheduler:
    """DDPM Forward Process Scheduler"""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

    def add_noise(self, x_0, t, noise=None):
        """Generate noised image at arbitrary timestep t"""
        if noise is None:
            noise = torch.randn_like(x_0)
        sqrt_alpha_bar = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
        return x_t

    def sample_timesteps(self, batch_size):
        """Sample random timesteps for training"""
        return torch.randint(0, self.num_timesteps, (batch_size,))
```
Reverse Process (Denoising)
In the Reverse Process, starting from x_T ~ N(0, I), the trained noise predictor epsilon_theta removes noise step by step:

x_{t-1} = (1 / sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * epsilon_theta(x_t, t)) + sigma_t * z

where z ~ N(0, I), sigma_t = sqrt(beta_t), and no noise is added at the final step (t = 0).
```python
class DDPMSampler:
    """DDPM Reverse Process Sampler"""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDPM reverse diffusion sampling"""
        # Start from pure noise
        x = torch.randn(shape, device=device)
        for t in reversed(range(self.scheduler.num_timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            # Predict noise
            predicted_noise = model(x, t_batch)
            # Compute mean
            alpha = self.scheduler.alphas[t]
            alpha_bar = self.scheduler.alphas_cumprod[t]
            beta = self.scheduler.betas[t]
            mean = (1 / torch.sqrt(alpha)) * (
                x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
            )
            # Add noise only when t > 0
            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(beta)
                x = mean + sigma * noise
            else:
                x = mean
        return x
```
Training Objective: Simple Loss
DDPM training minimizes the MSE between the model-predicted noise and the actual noise:

L_simple = E_{t, x_0, epsilon} [ || epsilon - epsilon_theta(x_t, t) ||^2 ]
```python
def ddpm_training_step(model, x_0, scheduler, optimizer):
    """DDPM training single step"""
    batch_size = x_0.shape[0]
    device = x_0.device
    # 1. Sample random timesteps
    t = scheduler.sample_timesteps(batch_size).to(device)
    # 2. Generate noise and noised image
    noise = torch.randn_like(x_0)
    x_t = scheduler.add_noise(x_0, t, noise)
    # 3. Model predicts noise
    predicted_noise = model(x_t, t)
    # 4. Compute Simple Loss
    loss = nn.functional.mse_loss(predicted_noise, noise)
    # 5. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
DDIM: Accelerated Sampling
DDPM requires 1000 steps of reverse diffusion, making generation extremely slow. DDIM (Denoising Diffusion Implicit Models) proposed by Song et al. (2020) defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.
The key to DDIM is the eta parameter that controls stochastic/deterministic sampling. When eta=0, the sampling is fully deterministic; when eta=1, it becomes identical to DDPM.
```python
class DDIMSampler:
    """DDIM Accelerated Sampler"""

    def __init__(self, scheduler, ddim_steps=50, eta=0.0):
        self.scheduler = scheduler
        self.ddim_steps = ddim_steps
        self.eta = eta
        # Generate subset timesteps (e.g., 1000 -> 50)
        self.timesteps = np.linspace(
            0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
        )[::-1]

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDIM accelerated sampling - high quality in 50 steps"""
        x = torch.randn(shape, device=device)
        for i in range(len(self.timesteps)):
            t = self.timesteps[i]
            t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            predicted_noise = model(x, t_batch)
            alpha_bar_t = self.scheduler.alphas_cumprod[t]
            alpha_bar_prev = self.scheduler.alphas_cumprod[t_prev]
            # Predict x_0
            x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
            x_0_pred = torch.clamp(x_0_pred, -1, 1)
            # Noise level for this step (eta = 0 -> fully deterministic)
            sigma = self.eta * torch.sqrt(
                (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
            )
            # Direction pointing toward x_t
            direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise
            # Compute x_{t-1}
            x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction
            if self.eta > 0 and t > 0:
                x = x + sigma * torch.randn_like(x)
        return x
```
Relationship with Score-based Models
Song and Ermon (2019) interpreted diffusion models from the Score Matching perspective. The score function is the gradient of the log density of the data distribution:

s(x) = grad_x log p(x)

DDPM's noise prediction epsilon_theta and the score function are related by:

s_theta(x_t, t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)

This relationship was unified through the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:

dx = f(x, t) dt + g(t) dw

where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process; DDPM corresponds to a discretization of the Variance Preserving (VP) SDE.
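The epsilon-to-score conversion is straightforward to express in code. The sketch below is illustrative (the function name is ours, not from any library): given the model's noise prediction and the scheduler's alpha_bar_t, it returns the corresponding score estimate.

```python
import torch


def noise_to_score(predicted_noise, alpha_bar_t):
    """Convert a DDPM noise prediction into a score estimate.

    Uses the identity s_theta(x_t, t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t),
    so a trained noise predictor doubles as a score model, e.g. for
    reverse-SDE or probability-flow ODE samplers.
    """
    return -predicted_noise / torch.sqrt(1.0 - alpha_bar_t)
```

The identity can be checked directly: for the Gaussian q(x_t | x_0), the true score is -(x_t - sqrt(alpha_bar_t) * x_0) / (1 - alpha_bar_t), which equals -epsilon / sqrt(1 - alpha_bar_t) exactly.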
Latent Diffusion Model (Stable Diffusion)
Architecture Overview
Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically reduced computational costs by performing the diffusion process in latent space rather than pixel space. This is the core architecture behind Stable Diffusion.
LDM consists of three key components:
| Component | Role | Details |
|---|---|---|
| VAE Encoder | Encode images to latent space | Compress 512x512 images to 64x64x4 latent representations |
| U-Net (Denoiser) | Predict noise in latent space | Incorporates text conditions via Cross-Attention |
| VAE Decoder | Decode latents to images | Reconstruct 64x64x4 latent representations to 512x512 images |
| Text Encoder | Encode text prompts | Generate 77-token embeddings using CLIP ViT-L/14 |
Core Code Structure
```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler


class LatentDiffusionInference:
    """Stable Diffusion Inference Pipeline (Simplified)"""

    def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            safety_checker=None,  # simplified example; keep enabled in production
        ).to("cuda")
        # Switch to DDIM scheduler (accelerate with 50 steps)
        self.pipe.scheduler = DDIMScheduler.from_config(
            self.pipe.scheduler.config
        )

    def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
        """Text-to-image generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
        ).images[0]
        return image

    def generate_with_latent_control(self, prompt, seed=42):
        """Direct latent space control"""
        generator = torch.Generator(device="cuda").manual_seed(seed)
        # Generate the initial latent vector directly for reproducibility
        latents = torch.randn(
            (1, 4, 64, 64),
            generator=generator,
            device="cuda",
            dtype=torch.float16,
        )
        image = self.pipe(
            prompt=prompt,
            latents=latents,
            num_inference_steps=50,
            guidance_scale=7.5,
        ).images[0]
        return image
```
Cross-Attention Mechanism
In Stable Diffusion's U-Net, Cross-Attention incorporates text conditions into image generation. Query is generated from the image latent representation, while Key and Value come from the text embeddings.
```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Cross-Attention Layer in Stable Diffusion U-Net"""

    def __init__(self, d_model=320, d_context=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(d_context, d_model, bias=False)
        self.to_v = nn.Linear(d_context, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: Image latent representation (B, H*W, d_model)
        context: Text embeddings (B, seq_len, d_context)
        """
        B, N, C = x.shape
        q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled Dot-Product Attention
        scale = self.d_head ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, N, C)
        return self.to_out(out)
```
Classifier-free Guidance (CFG)
Classifier-free Guidance proposed by Ho and Salimans (2022) is a key technique for controlling generation quality without a separate classifier.
During training, the conditional and unconditional models are trained jointly by replacing the text condition with a null (empty) embedding with some probability. During inference, the two predictions are combined by extrapolating away from the unconditional prediction:

epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)

Here, w is the guidance scale: w = 1 recovers pure conditional generation, and larger values of w make the output follow the text condition more strongly (typically 7.5-15).
```python
import torch


def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """Classifier-free Guidance single step"""
    # Process conditional/unconditional predictions as a single batch
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([null_embedding, text_embedding], dim=0)
    # Generate both predictions in a single forward pass
    noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
    # Apply CFG: extrapolate away from the unconditional prediction
    noise_pred_guided = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )
    return noise_pred_guided
```
DiT: Diffusion Transformer
From U-Net to Transformer
DiT (Diffusion Transformer) by Peebles and Xie (2023) replaced the U-Net backbone of diffusion models with a Transformer. The key finding is that scaling up Transformer compute (GFLOPs) consistently improves generation quality (lower FID).
| Model | Backbone | Parameters | FID (ImageNet 256) | GFLOPs |
|---|---|---|---|---|
| ADM | U-Net | 554M | 10.94 | 1120 |
| LDM-4 | U-Net | 400M | 10.56 | 103 |
| DiT-S/2 | Transformer | 33M | 68.40 | 6 |
| DiT-B/2 | Transformer | 130M | 43.47 | 23 |
| DiT-L/2 | Transformer | 458M | 9.62 | 80 |
| DiT-XL/2 | Transformer | 675M | 2.27 | 119 |
adaLN-Zero Block
The key innovation of DiT is the adaLN-Zero conditioning approach. Timestep and class embeddings are injected as scale/shift parameters of Adaptive Layer Normalization, with gating parameters initialized to zero so that the block acts as an identity function (residual connection) at the start of training.
```python
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    """DiT adaLN-Zero Transformer Block"""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        # adaLN modulation: 6 parameters (shift1, scale1, gate1, shift2, scale2, gate2)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(d_model, 6 * d_model),
        )
        # Zero initialization - each block acts as the identity at the start of training
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x, c):
        """
        x: Patch tokens (B, N, D)
        c: Condition embedding - timestep + class (B, D)
        """
        # Generate adaLN parameters
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.adaLN_modulation(c).chunk(6, dim=-1)
        )
        # Self-Attention with adaLN
        h = self.norm1(x)
        h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        x = x + gate1.unsqueeze(1) * h
        # FFN with adaLN
        h = self.norm2(x)
        h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        h = self.mlp(h)
        x = x + gate2.unsqueeze(1) * h
        return x
```
Patchify Strategy
DiT splits the latent representation into p x p patches to use as Transformer input tokens. Smaller patch sizes result in more tokens, improving performance but increasing computational cost.
```python
import torch.nn as nn


class PatchEmbed(nn.Module):
    """DiT Patchify Layer"""

    def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        """(B, C, H, W) -> (B, N, D) patch token sequence"""
        x = self.proj(x)  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D)
        return x
```
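As a quick sanity check on the token count: a 32x32x4 latent (DiT-XL's ImageNet-256 setting) with patch size p=2 yields (32/2)^2 = 256 tokens. The standalone snippet below reproduces the patchify step with a bare Conv2d; the dimensions match DiT-XL/2 but the snippet itself is only illustrative.

```python
import torch
import torch.nn as nn

# Patchify a DiT-style latent: (B, 4, 32, 32) with p=2 -> (B, 256, 1152)
proj = nn.Conv2d(4, 1152, kernel_size=2, stride=2)
latent = torch.randn(1, 4, 32, 32)
tokens = proj(latent).flatten(2).transpose(1, 2)
# tokens.shape: (1, 256, 1152) - halving p to 1 would quadruple N to 1024
```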
SDXL: Evolution of Stable Diffusion
Key Improvements
SDXL by Podell et al. (2023) introduced the following core improvements over Stable Diffusion v1.5:
| Feature | SD v1.5 | SDXL Base |
|---|---|---|
| U-Net Parameters | 860M | 2.6B (3x increase) |
| Text Encoder | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text Embedding Dimension | 768 | 2048 |
| Default Resolution | 512x512 | 1024x1024 |
| Attention Blocks | 16 | 70 |
| Refiner Model | None | Dedicated Refiner included |
Dual Text Encoders
One of SDXL's greatest innovations is the use of two text encoders. It combines the rich semantic representations from OpenCLIP ViT-bigG with complementary features from CLIP ViT-L, significantly improving text understanding.
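At the feature level, the per-token hidden states of the two encoders are concatenated along the channel dimension: 768 from CLIP ViT-L plus 1280 from OpenCLIP ViT-bigG gives the 2048-dim embeddings in the table above. A minimal sketch of that combination step (the function name is ours; the real pipeline additionally uses a pooled ViT-bigG embedding as extra conditioning):

```python
import torch


def combine_text_embeddings(clip_l_states, open_clip_g_states):
    """Concatenate per-token features from the two text encoders.

    clip_l_states: (B, 77, 768), open_clip_g_states: (B, 77, 1280)
    Returns: (B, 77, 2048) joint text embeddings fed to the U-Net.
    """
    return torch.cat([clip_l_states, open_clip_g_states], dim=-1)
```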
```python
from diffusers import StableDiffusionXLPipeline
import torch


class SDXLInference:
    """SDXL Inference Pipeline"""

    def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )
        # Memory optimization: model CPU offload manages device placement
        # itself, so the pipeline is not moved to CUDA manually
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()

    def generate(self, prompt, negative_prompt="", steps=30):
        """SDXL basic generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=7.5,
            height=1024,
            width=1024,
        ).images[0]
        return image

    def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
        """Base + Refiner two-stage pipeline"""
        # Base model: 80% of total steps
        high_noise_frac = 0.8
        image = base_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_end=high_noise_frac,
            output_type="latent",
        ).images
        # Refiner: remaining 20% (enhance fine details)
        image = refiner_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_start=high_noise_frac,
            image=image,
        ).images[0]
        return image
```
Size/Crop Conditioning
SDXL provides the original image size and crop coordinates as conditions during training, enabling effective learning of images with diverse aspect ratios. This is implemented using Fourier Feature Encoding.
```python
import numpy as np
import torch


def get_sdxl_conditioning(original_size, crop_coords, target_size):
    """Generate SDXL size/crop conditioning"""
    # Original size (height, width)
    original_size = torch.tensor(original_size, dtype=torch.float32)
    # Crop coordinates (top, left)
    crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
    # Target size (height, width)
    target_size = torch.tensor(target_size, dtype=torch.float32)
    # Concatenate the six scalars for Fourier Feature Encoding
    conditioning = torch.cat([original_size, crop_coords, target_size])
    # Sinusoidal embedding
    freqs = torch.exp(
        -torch.arange(0, 128) * np.log(10000) / 128
    )
    emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
    return emb.flatten()
```
ControlNet: Conditional Generation Control
ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to pretrained diffusion models. The Zero Convolution technique preserves the existing capabilities of the model at the beginning of training while gradually learning new conditions.
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch


def controlnet_canny_generation(input_image_path, prompt):
    """ControlNet Canny Edge based image generation"""
    # Load ControlNet model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_canny",
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")
    # Extract Canny Edge
    canny_detector = CannyDetector()
    input_image = Image.open(input_image_path)
    canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)
    # ControlNet-based generation
    output = pipe(
        prompt=prompt,
        image=canny_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,
    ).images[0]
    return output
```
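The Zero Convolution itself is simple to sketch: a 1x1 convolution whose weights and bias start at zero, so the ControlNet branch initially contributes nothing to the frozen U-Net's features and the pretrained behavior is preserved. A minimal illustrative module (ours, not the reference implementation):

```python
import torch
import torch.nn as nn


class ZeroConv2d(nn.Module):
    """1x1 convolution initialized to zero, as in ControlNet.

    At the start of training its output is identically zero, so adding it
    to a frozen U-Net feature map leaves the pretrained model unchanged;
    gradients then grow the connection from zero as training progresses.
    """

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)
```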
Training Pipeline and Data Preparation
Dataset Composition
Here is a comparison of major datasets used for training large-scale diffusion models.
| Dataset | Scale | Resolution | Usage |
|---|---|---|---|
| LAION-5B | 5.8B image-text pairs | Various | Stable Diffusion training |
| LAION-Aesthetics | 120M (filtered) | Various | High-quality fine-tuning |
| ImageNet | 1.3M | 256/512 | DiT training (class-conditional) |
| COYO-700M | 700M | Various | Multilingual training (incl. Korean) |
Fine-tuning Strategies
```bash
# LoRA fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --dataset_name="custom_dataset" \
  --resolution=512 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=500 \
  --max_train_steps=10000 \
  --rank=64 \
  --output_dir="./lora_output" \
  --mixed_precision="fp16" \
  --enable_xformers_memory_efficient_attention

# DreamBooth fine-tuning (specific object/style learning)
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --instance_data_dir="./my_images" \
  --instance_prompt="a photo of sks dog" \
  --class_data_dir="./class_images" \
  --class_prompt="a photo of dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=5e-6 \
  --max_train_steps=800
```
Inference Optimization Techniques
Key Optimization Techniques Comparison
| Technique | Speed Improvement | Quality Impact | Memory Savings |
|---|---|---|---|
| DDIM (50 steps) | 20x | Minimal | - |
| DPM-Solver++ (20 steps) | 50x | Minimal | - |
| xFormers Memory Efficient Attention | 1.5x | None | 30-40% |
| torch.compile | 1.2-1.5x | None | - |
| VAE Tiling | - | Minimal | 70%+ |
| FP16/BF16 | 1.5-2x | Minimal | 50% |
| TensorRT | 2-4x | None | - |
Production Optimization Code
```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler


def optimized_sdxl_pipeline():
    """Production-optimized SDXL Pipeline"""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    # 1. Apply fast scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )
    # 2. VAE Tiling (memory savings for high-resolution generation)
    pipe.enable_vae_tiling()
    # 3. Attention Slicing (when VRAM is limited)
    pipe.enable_attention_slicing()
    # 4. torch.compile (PyTorch 2.0+)
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    return pipe


# GPU memory monitoring
def monitor_gpu_memory():
    """Monitor GPU memory usage"""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    max_allocated = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved: {reserved:.2f} GB")
    print(f"Peak: {max_allocated:.2f} GB")
```
Comprehensive Model Comparison
| Model | Year | Key Contribution | Backbone | Conditioning | Resolution |
|---|---|---|---|---|---|
| DDPM | 2020 | Practical diffusion models | U-Net | None (unconditional) | 256 |
| DDIM | 2020 | Accelerated sampling | U-Net | None | 256 |
| LDM (SD) | 2022 | Latent space diffusion | U-Net + VAE | Cross-Attention | 512 |
| DiT | 2023 | Transformer backbone | Transformer | adaLN-Zero | 256/512 |
| SDXL | 2023 | Large-scale U-Net + dual encoders | U-Net + VAE | Cross-Attention + CFG | 1024 |
| ControlNet | 2023 | Spatial condition control | Zero Conv + U-Net | Edge/Depth/Pose | 512 |
| SD3 | 2024 | MMDiT (Multi-Modal DiT) | Transformer | Flow Matching | 1024 |
Operational Considerations
GPU Memory Management
The most common issue when operating Stable Diffusion-based services is GPU OOM (Out of Memory). The following items should be checked:
- Batch size limits: a single 1024x1024 SDXL generation needs roughly 12GB of VRAM, so a 16GB V100 will hit OOM under any batching, while an 80GB A100 can serve several images concurrently
- Concurrent request limits: Rate limiters must be applied to prevent GPU memory overflow
- Enable VAE Tiling: Essential for high-resolution (2048x2048+) generation
- Memory profiling: Regular GPU memory monitoring to detect memory leaks
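The concurrent-request limit above can be enforced with a simple semaphore around the generation call. This is an illustrative sketch (the `GenerationLimiter` name and `slot` API are ours, not a library feature); a production service would more likely lean on its serving framework's rate limiting:

```python
import threading
from contextlib import contextmanager


class GenerationLimiter:
    """Cap concurrent generation requests so GPU memory is never oversubscribed."""

    def __init__(self, max_concurrent=2):
        self._sem = threading.Semaphore(max_concurrent)

    @contextmanager
    def slot(self, timeout=30.0):
        # Block until a slot frees up, or reject the request after `timeout`
        if not self._sem.acquire(timeout=timeout):
            raise RuntimeError("GPU busy: request rejected by rate limiter")
        try:
            yield
        finally:
            self._sem.release()
```

A worker would wrap each call as `with limiter.slot(): image = pipe(prompt).images[0]`, turning potential OOM crashes into fast, explicit rejections.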
Failure Case: GPU OOM Recovery
```bash
# Check GPU memory status
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Check which processes are holding GPU devices (potential leaks)
fuser -v /dev/nvidia*

# Force GPU memory release (without process restart)
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"

# Service recovery procedure after OOM
# 1. Graceful shutdown of the affected worker process
# 2. Verify GPU memory release
# 3. Adjust batch size / concurrent request count
# 4. Restart worker process
# 5. Resume traffic after health check passes
```
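Steps 3-4 of the procedure above can also be automated in-process: catch the OOM, release cached memory, and retry at cheaper settings. The sketch below is illustrative (the settings ladder and function name are ours); note that `torch.cuda.OutOfMemoryError` requires PyTorch 1.13+.

```python
import torch

# Fallback ladder: progressively cheaper settings to retry after an OOM
FALLBACK_SETTINGS = [
    {"height": 1024, "width": 1024, "num_inference_steps": 30},
    {"height": 768, "width": 768, "num_inference_steps": 30},
    {"height": 512, "width": 512, "num_inference_steps": 25},
]


def generate_with_oom_fallback(pipe, prompt):
    """Retry generation at lower resolution when CUDA runs out of memory."""
    last_error = None
    for settings in FALLBACK_SETTINGS:
        try:
            return pipe(prompt=prompt, **settings).images[0]
        except torch.cuda.OutOfMemoryError as e:
            last_error = e
            # Release cached blocks before retrying at a cheaper setting
            torch.cuda.empty_cache()
    raise RuntimeError("Generation failed at all fallback settings") from last_error
```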
NSFW Filtering
For commercial services, the Safety Checker must always be enabled. Disabling it can result in NSFW content being generated, which may cause legal issues.
```python
from diffusers import StableDiffusionPipeline
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor

# Safety Checker configuration (required for production)
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    safety_checker=None,  # Only disable in development
)

# Must be enabled in production
safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)
```
Failure Cases and Recovery Procedures
Case 1: Model Loading Failure
Disk I/O timeouts or checkpoint corruption can occur when loading large-scale models.
import os
from diffusers import StableDiffusionXLPipeline
def robust_model_loading(model_id, max_retries=3):
"""Robust model loading (with retries)"""
for attempt in range(max_retries):
try:
pipe = StableDiffusionXLPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
use_safetensors=True,
local_files_only=os.path.exists(
os.path.join(model_id, "model_index.json")
),
)
pipe = pipe.to("cuda")
# Warmup run
_ = pipe("test", num_inference_steps=1)
return pipe
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
import time
time.sleep(10)
# Clear cache and retry
torch.cuda.empty_cache()
else:
raise RuntimeError(f"Model loading failed after {max_retries} attempts")
Case 2: Image Quality Degradation (Inappropriate CFG Scale)
```yaml
# CFG Scale Guidelines
guidance_scale_guidelines:
  "1.0": "No guidance - pure conditional prediction, weak prompt adherence"
  "3.0-5.0": "Creative and diverse generation"
  "7.0-8.5": "Generally recommended range - quality/diversity balance"
  "10.0-15.0": "High text fidelity - risk of oversaturation"
  "20.0+": "Excessive guidance - artifacts may appear"

# Troubleshooting checklist
troubleshooting:
  blurry_output:
    - "Increase num_inference_steps (minimum 30+)"
    - "Switch scheduler to DPM-Solver++"
  oversaturated:
    - "Lower guidance_scale to 7.0 or below"
    - "Add 'oversaturated, vivid' to negative_prompt"
  wrong_composition:
    - "Improve prompt structure (clear subject-verb-object)"
    - "Use ControlNet for composition control"
```
Conclusion
Diffusion Models have evolved rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's large-scale design, and ControlNet's fine-grained control.
Currently, new paradigms like SD3's MMDiT (Multi-Modal Diffusion Transformer) and Flow Matching, as well as Consistency Models, are emerging to enable even faster and higher-quality image generation. In particular, the DiT architecture serves as the foundation for video generation models like Sora (OpenAI), and the applications of Diffusion Models are expanding beyond images to video, 3D, and audio.
From an engineering perspective, understanding the theoretical background of models is the key to optimization and debugging. Accurately grasping the role of each component, including noise schedules, CFG Scale, scheduler selection, and memory management, is essential for operating stable services in production environments.
References
- Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019.
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
- Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
- Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.
- Weng, L. (2021). What are Diffusion Models? Lil'Log blog post.