Introduction

In the field of image generation, Diffusion Models have established themselves as a new paradigm replacing GANs (Generative Adversarial Networks). Since Ho et al. published **DDPM (Denoising Diffusion Probabilistic Models)** in 2020, commercial services like Stable Diffusion, DALL-E 2, and Midjourney emerged within just three years, driving the democratization of image generation.

The core idea behind Diffusion Models is remarkably simple. A fixed **Forward Process** gradually adds noise to data, and a learned **Reverse Process** removes this noise step by step to reconstruct the data. Through training, the model learns "which direction to remove noise" at each noise level.

In this article, we survey the evolution of major models chronologically: from the mathematical foundations of DDPM to DDIM's accelerated sampling, the relationship with score-based models, Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. We comprehensively cover each model's key contributions, implementation code, performance comparisons, and operational considerations.

DDPM: Foundations of Diffusion Models

Forward Process (Adding Noise)

DDPM's Forward Process gradually adds Gaussian noise to the original data x_0 over T steps. The noise schedule at each step t is controlled by beta_t.

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)

Using the reparameterization trick, we can directly compute the noised image at any arbitrary timestep t.

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Here, alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product from alpha_1 to alpha_t.

    import torch


    class DDPMScheduler:
        """DDPM Forward Process Scheduler"""

        def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
            self.num_timesteps = num_timesteps

            # Linear noise schedule
            self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
            self.alphas = 1.0 - self.betas
            self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
            self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
            self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

        def add_noise(self, x_0, t, noise=None):
            """Generate noised image at arbitrary timestep t"""
            if noise is None:
                noise = torch.randn_like(x_0)

            # Move the lookup tables to x_0's device so that indexing with a GPU tensor t works
            sqrt_alpha_bar = self.sqrt_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_bar = (
                self.sqrt_one_minus_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
            )

            # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
            x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
            return x_t

        def sample_timesteps(self, batch_size):
            """Sample random timesteps for training"""
            return torch.randint(0, self.num_timesteps, (batch_size,))
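To get a feel for how quickly the linear schedule destroys the signal, the following sketch recomputes alpha_bar_t in plain Python (no PyTorch) and prints the signal and noise coefficients at a few timesteps. The helper name `linear_alpha_bars` is an illustrative assumption; the schedule parameters mirror the defaults of `DDPMScheduler`.

```python
import math


def linear_alpha_bars(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Return the cumulative products alpha_bar_t for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for i in range(num_timesteps):
        # Linearly interpolate beta between beta_start and beta_end
        beta = beta_start + (beta_end - beta_start) * i / (num_timesteps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


alpha_bars = linear_alpha_bars()
for t in (0, 250, 500, 999):
    signal = math.sqrt(alpha_bars[t])        # coefficient on x_0
    noise = math.sqrt(1.0 - alpha_bars[t])   # coefficient on epsilon
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

With these defaults the signal coefficient starts close to 1 at t = 0 and shrinks toward 0 at the last timestep, which is why x_T can be treated as (almost) pure Gaussian noise.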

Reverse Process (Denoising)

In the Reverse Process, starting from x_T ~ N(0, I), the trained model epsilon_theta removes noise step by step.

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)
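For reference, the mean and variance that the sampler code in this section computes can be written out explicitly. This is the usual noise-prediction parameterization with the variance choice sigma_t^2 = beta_t; it assumes the model outputs epsilon_theta rather than x_0 directly.

```latex
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right), \qquad \sigma_t^2 = \beta_t
```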

    import torch


    class DDPMSampler:
        """DDPM Reverse Process Sampler"""

        def __init__(self, scheduler):
            self.scheduler = scheduler

        @torch.no_grad()
        def sample(self, model, shape, device):
            """DDPM reverse diffusion sampling"""
            # Start from pure noise
            x = torch.randn(shape, device=device)

            for t in reversed(range(self.scheduler.num_timesteps)):
                t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

                # Predict noise
                predicted_noise = model(x, t_batch)

                # Compute mean
                alpha = self.scheduler.alphas[t]
                alpha_bar = self.scheduler.alphas_cumprod[t]
                beta = self.scheduler.betas[t]

                mean = (1 / torch.sqrt(alpha)) * (
                    x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
                )

                # Add noise only when t > 0
                if t > 0:
                    noise = torch.randn_like(x)
                    sigma = torch.sqrt(beta)
                    x = mean + sigma * noise
                else:
                    x = mean

            return x

Training Objective: Simple Loss

DDPM training minimizes the MSE between the model-predicted noise and the actual noise.

L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]

    import torch
    import torch.nn as nn


    def ddpm_training_step(model, x_0, scheduler, optimizer):
        """DDPM training single step"""
        batch_size = x_0.shape[0]
        device = x_0.device

        # 1. Sample random timesteps
        t = scheduler.sample_timesteps(batch_size).to(device)

        # 2. Generate noise and noised image
        noise = torch.randn_like(x_0)
        x_t = scheduler.add_noise(x_0, t, noise)

        # 3. Model predicts noise
        predicted_noise = model(x_t, t)

        # 4. Compute Simple Loss
        loss = nn.functional.mse_loss(predicted_noise, noise)

        # 5. Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        return loss.item()

DDIM: Accelerated Sampling

DDPM requires 1000 steps of reverse diffusion, making generation extremely slow. **DDIM (Denoising Diffusion Implicit Models)** proposed by Song et al. (2020) defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.

The key to DDIM is the eta parameter, which controls how stochastic the sampling is. When eta=0, sampling is fully deterministic; when eta=1, the injected noise matches DDPM's and the process becomes equivalent to DDPM sampling.
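Written out, one DDIM update consists of three pieces: an estimate of x_0, a noise level sigma_t scaled by eta, and the step itself. The block below is a sketch of the standard formulation; here t-1 stands for the previous timestep in the shortened subsequence, not necessarily the integer t minus one.

```latex
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}

\sigma_t = \eta \sqrt{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}} \sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
```

Setting eta = 0 makes sigma_t vanish, so the last term drops out and the trajectory is fully determined by the initial noise x_T.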

    import numpy as np
    import torch


    class DDIMSampler:
        """DDIM Accelerated Sampler"""

        def __init__(self, scheduler, ddim_steps=50, eta=0.0):
            self.scheduler = scheduler
            self.ddim_steps = ddim_steps
            self.eta = eta

            # Generate subset timesteps (e.g., 1000 -> 50), ordered from high to low noise
            self.timesteps = np.linspace(
                0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
            )[::-1]

        @torch.no_grad()
        def sample(self, model, shape, device):
            """DDIM accelerated sampling - high quality in 50 steps"""
            x = torch.randn(shape, device=device)

            for i in range(len(self.timesteps)):
                t = int(self.timesteps[i])
                # -1 marks "before the first timestep", where alpha_bar is 1 by definition
                t_prev = int(self.timesteps[i + 1]) if i + 1 < len(self.timesteps) else -1

                t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
                predicted_noise = model(x, t_batch)

                alpha_bar_t = self.scheduler.alphas_cumprod[t]
                alpha_bar_prev = (
                    self.scheduler.alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
                )

                # Predict x_0
                x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
                x_0_pred = torch.clamp(x_0_pred, -1, 1)

                # Compute direction
                sigma = self.eta * torch.sqrt(
                    (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
                )
                direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise

                # Compute x_{t-1}
                x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction

                if self.eta > 0 and t > 0:
                    x = x + sigma * torch.randn_like(x)

            return x
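The two ingredients of the sampler, the shortened timestep list and the eta-scaled noise level, can be inspected in isolation. The sketch below uses plain Python; the function names and the two example alpha_bar values are hypothetical and chosen only for illustration.

```python
import math


def ddim_timesteps(num_train_steps=1000, ddim_steps=50):
    """Evenly spaced timesteps from num_train_steps - 1 down to 0."""
    last = num_train_steps - 1
    steps = [int(i * last / (ddim_steps - 1)) for i in range(ddim_steps)]
    return steps[::-1]


def ddim_sigma(alpha_bar_t, alpha_bar_prev, eta):
    """Standard deviation of the noise injected when stepping from t to t_prev."""
    variance = (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
    return eta * math.sqrt(variance)


timesteps = ddim_timesteps()
print(timesteps[:3], "...", timesteps[-3:])

# Example alpha_bar values for two neighbouring steps (hypothetical numbers)
alpha_bar_t, alpha_bar_prev = 0.50, 0.55
print("eta=0.0 ->", ddim_sigma(alpha_bar_t, alpha_bar_prev, eta=0.0))
print("eta=1.0 ->", ddim_sigma(alpha_bar_t, alpha_bar_prev, eta=1.0))
```

With eta = 0 the injected noise is exactly zero, so the same starting noise always yields the same image; any positive eta reintroduces randomness at every step.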

Relationship with Score-based Models

Song and Ermon (2019) approached generative modeling from the Score Matching perspective, a view that also applies to diffusion models. The score function is the gradient of the log density of the data distribution.

s_\theta(x) \approx \nabla_x \log p(x)

DDPM's noise prediction epsilon_theta and the score function have the following relationship:

s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}

This relationship was unified through the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:

dx = f(x, t)dt + g(t)dw
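The relationship between noise prediction and score can be checked numerically in the simplest setting: a single scalar data point x_0 and a perfect noise predictor. Under that assumption, q(x_t | x_0) is Gaussian and its score is known in closed form. The function names below are illustrative, not part of any library.

```python
import math
import random


def score_from_eps(eps, alpha_bar_t):
    """Convert a noise value into a score value: s = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps / math.sqrt(1.0 - alpha_bar_t)


def gaussian_score(x_t, x_0, alpha_bar_t):
    """Exact score of q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, 1 - alpha_bar_t)."""
    mean = math.sqrt(alpha_bar_t) * x_0
    return -(x_t - mean) / (1.0 - alpha_bar_t)


random.seed(0)
x_0, alpha_bar_t = 0.8, 0.3            # arbitrary example values
eps = random.gauss(0.0, 1.0)           # the "true" noise
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

print(score_from_eps(eps, alpha_bar_t), gaussian_score(x_t, x_0, alpha_bar_t))
```

Both printed values agree up to floating-point error, which is the scalar version of the identity between s_theta and epsilon_theta stated above.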

Latent Diffusion Model (Stable Diffusion)

Architecture Overview

**Latent Diffusion Model (LDM)** by Rombach et al. (2022) dramatically reduced computational costs by performing the diffusion process in **latent space** rather than pixel space. This is the core architecture behind Stable Diffusion.

LDM consists of four key components:

| Component        | Role                          | Details                                                      |
| ---------------- | ----------------------------- | ------------------------------------------------------------ |
| VAE Encoder      | Encode images to latent space | Compress 512x512 images to 64x64x4 latent representations    |
| U-Net (Denoiser) | Predict noise in latent space | Incorporates text conditions via Cross-Attention             |
| VAE Decoder      | Decode latents to images      | Reconstruct 64x64x4 latent representations to 512x512 images |
| Text Encoder     | Encode text prompts           | Generate 77-token embeddings using CLIP ViT-L/14             |
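The saving from working in latent space follows directly from the tensor sizes in the table. A quick calculation, assuming a 3-channel RGB input image:

```python
def num_values(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels


pixel_values = num_values(512, 512, 3)    # RGB image (3 channels assumed)
latent_values = num_values(64, 64, 4)     # VAE latent from the table
print(pixel_values, latent_values, pixel_values / latent_values)
```

The denoiser therefore operates on 48 times fewer values per sample than a pixel-space model at the same image resolution.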

Core Code Structure

    import torch
    from diffusers import StableDiffusionPipeline, DDIMScheduler


    class LatentDiffusionInference:
        """Stable Diffusion Inference Pipeline (Simplified)"""

        def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
            self.pipe = StableDiffusionPipeline.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                safety_checker=None,  # disabled for brevity; see NSFW Filtering before production use
            ).to("cuda")

            # Switch to DDIM scheduler (accelerate with 50 steps)
            self.pipe.scheduler = DDIMScheduler.from_config(
                self.pipe.scheduler.config
            )

        def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
            """Text-to-image generation"""
            image = self.pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_inference_steps=num_steps,
                guidance_scale=guidance_scale,
            ).images[0]
            return image

        def generate_with_latent_control(self, prompt, seed=42):
            """Direct latent space control"""
            generator = torch.Generator(device="cuda").manual_seed(seed)

            # Generate latent vector directly
            latents = torch.randn(
                (1, 4, 64, 64),
                generator=generator,
                device="cuda",
                dtype=torch.float16,
            )

            image = self.pipe(
                prompt=prompt,
                latents=latents,
                num_inference_steps=50,
                guidance_scale=7.5,
            ).images[0]
            return image

Cross-Attention Mechanism

In Stable Diffusion's U-Net, **Cross-Attention** incorporates text conditions into image generation. Query is generated from the image latent representation, while Key and Value come from the text embeddings.

    import torch
    import torch.nn as nn


    class CrossAttention(nn.Module):
        """Cross-Attention Layer in Stable Diffusion U-Net"""

        def __init__(self, d_model=320, d_context=768, n_heads=8):
            super().__init__()
            self.n_heads = n_heads
            self.d_head = d_model // n_heads

            # Queries come from image latents; keys and values come from text embeddings
            self.to_q = nn.Linear(d_model, d_model, bias=False)
            self.to_k = nn.Linear(d_context, d_model, bias=False)
            self.to_v = nn.Linear(d_context, d_model, bias=False)
            self.to_out = nn.Linear(d_model, d_model)

        def forward(self, x, context):
            """
            x: Image latent representation (B, H*W, d_model)
            context: Text embeddings (B, seq_len, d_context)
            """
            B, N, C = x.shape

            q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
            k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
            v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

            # Scaled Dot-Product Attention
            scale = self.d_head ** -0.5
            attn = torch.matmul(q, k.transpose(-2, -1)) * scale
            attn = torch.softmax(attn, dim=-1)

            out = torch.matmul(attn, v)
            out = out.transpose(1, 2).contiguous().view(B, N, C)
            return self.to_out(out)

Classifier-free Guidance (CFG)

**Classifier-free Guidance** proposed by Ho and Salimans (2022) is a key technique for controlling generation quality without a separate classifier.

During training, a single network learns the conditional and unconditional models simultaneously (the text condition is replaced with an empty string with some probability). During inference, the two predictions are combined linearly; for w > 1 this is an extrapolation away from the unconditional prediction rather than a plain average.

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))

Here, w is the guidance scale. w=1 means pure conditional generation, and higher w values make the generation follow the text condition more strongly (typically 7.5-15).

    import torch


    def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
        """Classifier-free Guidance single step"""
        # Process conditional/unconditional predictions as a single batch
        x_in = torch.cat([x_t, x_t], dim=0)
        t_in = torch.cat([t, t], dim=0)
        c_in = torch.cat([null_embedding, text_embedding], dim=0)

        # Generate both predictions in a single forward pass
        # (the model is assumed to return the noise prediction tensor directly)
        noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
        noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)

        # Apply CFG
        noise_pred_guided = noise_pred_uncond + guidance_scale * (
            noise_pred_cond - noise_pred_uncond
        )
        return noise_pred_guided
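The effect of the guidance scale is easiest to see on a toy example. The sketch below applies the CFG formula to two hand-picked two-element "predictions" in plain Python; the numbers are hypothetical and only serve to show the three regimes.

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # hypothetical unconditional prediction
eps_cond = [1.0, 3.0]     # hypothetical conditional prediction

print(apply_cfg(eps_uncond, eps_cond, 0.0))   # equals eps_uncond
print(apply_cfg(eps_uncond, eps_cond, 1.0))   # equals eps_cond
print(apply_cfg(eps_uncond, eps_cond, 7.5))   # extrapolates past eps_cond
```

A scale of 0 ignores the condition, 1 reproduces the conditional prediction, and larger values push the result further along the direction from the unconditional to the conditional prediction.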

DiT: Diffusion Transformer

From U-Net to Transformer

**DiT (Diffusion Transformer)** by Peebles and Xie (2023) replaced the U-Net backbone of the diffusion model with a Transformer. The key finding is that increasing the Transformer's compute (GFLOPs) consistently improves generation quality, i.e. lowers FID.

| Model    | Backbone    | Parameters | FID (ImageNet 256) | GFLOPs |
| -------- | ----------- | ---------- | ------------------ | ------ |
| ADM      | U-Net       | 554M       | 10.94              | 1120   |
| LDM-4    | U-Net       | 400M       | 10.56              | 103    |
| DiT-S/2  | Transformer | 33M        | 68.40              | 6      |
| DiT-B/2  | Transformer | 130M       | 43.47              | 23     |
| DiT-L/2  | Transformer | 458M       | 9.62               | 80     |
| DiT-XL/2 | Transformer | 675M       | 2.27               | 119    |

adaLN-Zero Block

The key innovation of DiT is the **adaLN-Zero** conditioning approach. Timestep and class embeddings are injected as scale/shift parameters of Adaptive Layer Normalization, with gating parameters initialized to zero so that the block acts as an identity function (residual connection) at the start of training.

    import torch
    import torch.nn as nn


    class DiTBlock(nn.Module):
        """DiT adaLN-Zero Transformer Block"""

        def __init__(self, d_model, n_heads):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.GELU(),
                nn.Linear(d_model * 4, d_model),
            )

            # adaLN modulation: 6 parameters (shift, scale, gate) for attention and for the MLP
            self.adaLN_modulation = nn.Sequential(
                nn.SiLU(),
                nn.Linear(d_model, 6 * d_model),
            )

            # Zero initialization - acts as identity at start of training
            nn.init.zeros_(self.adaLN_modulation[-1].weight)
            nn.init.zeros_(self.adaLN_modulation[-1].bias)

        def forward(self, x, c):
            """
            x: Patch tokens (B, N, D)
            c: Condition embedding - timestep + class (B, D)
            """
            # Generate adaLN parameters
            shift1, scale1, gate1, shift2, scale2, gate2 = (
                self.adaLN_modulation(c).chunk(6, dim=-1)
            )

            # Self-Attention with adaLN
            h = self.norm1(x)
            h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
            h, _ = self.attn(h, h, h)
            x = x + gate1.unsqueeze(1) * h

            # FFN with adaLN
            h = self.norm2(x)
            h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
            h = self.mlp(h)
            x = x + gate2.unsqueeze(1) * h

            return x

Patchify Strategy

DiT splits the latent representation into p x p patches to use as Transformer input tokens. Smaller patch sizes result in more tokens, improving performance but increasing computational cost.

    import torch.nn as nn


    class PatchEmbed(nn.Module):
        """DiT Patchify Layer"""

        def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
            super().__init__()
            self.patch_size = patch_size
            # A strided convolution embeds each non-overlapping p x p patch
            self.proj = nn.Conv2d(
                in_channels, embed_dim,
                kernel_size=patch_size, stride=patch_size
            )

        def forward(self, x):
            """(B, C, H, W) -> (B, N, D) patch token sequence"""
            x = self.proj(x)  # (B, D, H/p, W/p)
            x = x.flatten(2).transpose(1, 2)  # (B, N, D)
            return x
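The trade-off between patch size and cost can be made concrete by counting tokens. The sketch below assumes a 32x32 latent, which corresponds to a 256x256 image passed through a VAE that downsamples by a factor of 8; `num_tokens` is an illustrative helper.

```python
def num_tokens(latent_size, patch_size):
    """Number of patch tokens for a square latent of side latent_size."""
    assert latent_size % patch_size == 0, "latent size must be divisible by patch size"
    return (latent_size // patch_size) ** 2


latent_size = 32  # assumed: a 256x256 image encoded by an 8x-downsampling VAE
for p in (8, 4, 2):
    n = num_tokens(latent_size, p)
    # Self-attention cost grows with the square of the token count
    print(f"patch={p}  tokens={n}  attention pairs={n * n}")
```

Halving the patch size quadruples the number of tokens and multiplies the number of attention pairs by 16, which is where the extra compute of the "/2" model variants comes from.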

SDXL: Evolution of Stable Diffusion

Key Improvements

**SDXL** by Podell et al. (2023) introduced the following core improvements over Stable Diffusion v1.5:

| Feature                  | SD v1.5       | SDXL Base                      |
| ------------------------ | ------------- | ------------------------------ |
| U-Net Parameters         | 860M          | 2.6B (3x increase)             |
| Text Encoder             | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text Embedding Dimension | 768           | 2048                           |
| Default Resolution       | 512x512       | 1024x1024                      |
| Attention Blocks         | 16            | 70                             |
| Refiner Model            | None          | Dedicated Refiner included     |

Dual Text Encoders

One of SDXL's greatest innovations is the use of **two text encoders**. It combines the rich semantic representations from OpenCLIP ViT-bigG with complementary features from CLIP ViT-L, significantly improving text understanding.

    import torch
    from diffusers import StableDiffusionXLPipeline


    class SDXLInference:
        """SDXL Inference Pipeline"""

        def __init__(self):
            self.pipe = StableDiffusionXLPipeline.from_pretrained(
                "stabilityai/stable-diffusion-xl-base-1.0",
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True,
            )

            # Memory optimization: CPU offload moves submodules to the GPU on demand,
            # so the pipeline is not moved to "cuda" manually
            self.pipe.enable_model_cpu_offload()
            self.pipe.enable_vae_tiling()

        def generate(self, prompt, negative_prompt="", steps=30):
            """SDXL basic generation"""
            image = self.pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_inference_steps=steps,
                guidance_scale=7.5,
                height=1024,
                width=1024,
            ).images[0]
            return image

        def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
            """Base + Refiner two-stage pipeline"""
            # Base model: 80% of total steps
            high_noise_frac = 0.8

            image = base_pipe(
                prompt=prompt,
                num_inference_steps=40,
                denoising_end=high_noise_frac,
                output_type="latent",
            ).images

            # Refiner: remaining 20% (enhance fine details)
            image = refiner_pipe(
                prompt=prompt,
                num_inference_steps=40,
                denoising_start=high_noise_frac,
                image=image,
            ).images[0]
            return image

Size/Crop Conditioning

SDXL provides the original image size and crop coordinates as conditions during training, enabling effective learning of images with diverse aspect ratios. This is implemented using Fourier Feature Encoding.

    import numpy as np
    import torch


    def get_sdxl_conditioning(original_size, crop_coords, target_size):
        """Generate SDXL size/crop conditioning"""
        # Original size (height, width)
        original_size = torch.tensor(original_size, dtype=torch.float32)
        # Crop coordinates (top, left)
        crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
        # Target size (height, width)
        target_size = torch.tensor(target_size, dtype=torch.float32)

        # Fourier Feature Encoding
        conditioning = torch.cat([original_size, crop_coords, target_size])

        # Sinusoidal embedding: 128 frequencies per scalar
        freqs = torch.exp(
            -torch.arange(0, 128) * np.log(10000) / 128
        )
        emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)

        return emb.flatten()
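To see what the conditioning vector looks like without PyTorch, the following plain-Python sketch encodes the same six scalars with 128 sine and 128 cosine features each. The function name and the example sizes are assumptions for illustration.

```python
import math


def sinusoidal_embedding(value, half_dim=128, max_period=10000):
    """Encode one scalar as [sin(value * f_i)] followed by [cos(value * f_i)]."""
    freqs = [math.exp(-i * math.log(max_period) / half_dim) for i in range(half_dim)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


# (height, width) original size, (top, left) crop, (height, width) target size
conditioning = [1024, 1024, 0, 0, 1024, 1024]
embedding = [x for value in conditioning for x in sinusoidal_embedding(value)]
print(len(embedding))
```

Six scalars times 256 features give a 1536-dimensional vector, and a crop coordinate of 0 maps to all-zero sine features and all-one cosine features.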

ControlNet: Conditional Generation Control

**ControlNet** by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to pretrained diffusion models. The **Zero Convolution** technique preserves the existing capabilities of the model at the beginning of training while gradually learning new conditions.

    import torch
    from controlnet_aux import CannyDetector
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    from PIL import Image


    def controlnet_canny_generation(input_image_path, prompt):
        """ControlNet Canny Edge based image generation"""
        # Load ControlNet model
        controlnet = ControlNetModel.from_pretrained(
            "lllyasviel/control_v11p_sd15_canny",
            torch_dtype=torch.float16,
        )

        pipe = StableDiffusionControlNetPipeline.from_pretrained(
            "stable-diffusion-v1-5/stable-diffusion-v1-5",
            controlnet=controlnet,
            torch_dtype=torch.float16,
        ).to("cuda")

        # Extract Canny Edge
        canny_detector = CannyDetector()
        input_image = Image.open(input_image_path)
        canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)

        # ControlNet-based generation
        output = pipe(
            prompt=prompt,
            image=canny_image,
            num_inference_steps=30,
            guidance_scale=7.5,
            controlnet_conditioning_scale=1.0,
        ).images[0]

        return output

Training Pipeline and Data Preparation

Dataset Composition

Here is a comparison of major datasets used for training large-scale diffusion models.

| Dataset          | Number of Images      | Resolution | Primary Use                          |
| ---------------- | --------------------- | ---------- | ------------------------------------ |
| ImageNet         | 1.3M                  | 256/512    | DiT training (class-conditional)     |

Fine-tuning Strategies

    # LoRA Fine-tuning (Stable Diffusion)
    accelerate launch train_text_to_image_lora.py \
      --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
      --dataset_name="custom_dataset" \
      --resolution=512 \
      --train_batch_size=4 \
      --gradient_accumulation_steps=4 \
      --learning_rate=1e-4 \
      --lr_scheduler="cosine" \
      --lr_warmup_steps=500 \
      --max_train_steps=10000 \
      --rank=64 \
      --output_dir="./lora_output" \
      --mixed_precision="fp16" \
      --enable_xformers_memory_efficient_attention

    # DreamBooth Fine-tuning (specific object/style learning)
    accelerate launch train_dreambooth.py \
      --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
      --instance_data_dir="./my_images" \
      --instance_prompt="a photo of sks dog" \
      --class_data_dir="./class_images" \
      --class_prompt="a photo of dog" \
      --with_prior_preservation \
      --prior_loss_weight=1.0 \
      --num_class_images=200 \
      --resolution=512 \
      --train_batch_size=1 \
      --learning_rate=5e-6 \
      --max_train_steps=800

Inference Optimization Techniques

Key Optimization Techniques Comparison

| Optimization Technique              | Speed Improvement | Quality Impact | Memory Savings |
| ----------------------------------- | ----------------- | -------------- | -------------- |
| DDIM (50 steps)                     | 20x               | Minimal        | -              |
| DPM-Solver++ (20 steps)             | 50x               | Minimal        | -              |
| xFormers Memory Efficient Attention | 1.5x              | None           | 30-40%         |
| torch.compile                       | 1.2-1.5x          | None           | -              |
| VAE Tiling                          | -                 | Minimal        | 70%+           |
| FP16/BF16                           | 1.5-2x            | Minimal        | 50%            |
| TensorRT                            | 2-4x              | None           | -              |
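The sampler speedups in the table are consistent with simple step-count ratios against the 1000-step DDPM baseline. The sketch below makes that assumption explicit; it treats every denoising step as equally expensive and ignores fixed costs such as text encoding and VAE decoding.

```python
def step_speedup(baseline_steps, sampler_steps):
    """Speedup from running fewer denoising steps, assuming equal cost per step."""
    return baseline_steps / sampler_steps


print(step_speedup(1000, 50))   # DDIM with 50 steps
print(step_speedup(1000, 20))   # DPM-Solver++ with 20 steps
```

Measured end-to-end speedups will typically be somewhat lower than these ratios because the fixed costs do not shrink with the step count.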

Production Optimization Code

    import torch
    from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler


    def optimized_sdxl_pipeline():
        """Production-optimized SDXL Pipeline"""
        pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        ).to("cuda")

        # 1. Apply fast scheduler
        pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            pipe.scheduler.config,
            algorithm_type="dpmsolver++",
            use_karras_sigmas=True,
        )

        # 2. VAE Tiling (memory savings for high-resolution generation)
        pipe.enable_vae_tiling()

        # 3. Attention Slicing (when VRAM is limited)
        pipe.enable_attention_slicing()

        # 4. torch.compile (PyTorch 2.0+)
        pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

        return pipe


    # GPU memory monitoring
    def monitor_gpu_memory():
        """Monitor GPU memory usage"""
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        max_allocated = torch.cuda.max_memory_allocated() / 1024**3

        print(f"Allocated: {allocated:.2f} GB")
        print(f"Reserved: {reserved:.2f} GB")
        print(f"Peak: {max_allocated:.2f} GB")

Comprehensive Model Comparison

| Model      | Year | Key Contribution                  | Backbone          | Core Technique        | Resolution |
| ---------- | ---- | --------------------------------- | ----------------- | --------------------- | ---------- |
| DiT        | 2023 | Transformer backbone              | Transformer       | adaLN-Zero            | 256/512    |
| SD3        | 2024 | MMDiT (Multi-Modal DiT)           | Transformer       | Flow Matching         | 1024       |

Operational Considerations

GPU Memory Management

The most common issue when operating Stable Diffusion-based services is GPU OOM (Out of Memory). The following items should be checked:

1. **Batch size limits**: SDXL generation at 1024x1024 needs approximately 12GB per image; this fits comfortably on an A100 80GB, while a V100 16GB will encounter OOM

2. **Concurrent request limits**: Rate limiters must be applied to prevent GPU memory overflow

3. **Enable VAE Tiling**: Essential for high-resolution (2048x2048+) generation

4. **Memory profiling**: Regular GPU memory monitoring to detect memory leaks
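One simple way to enforce the concurrent request limit from the checklist is a semaphore around the generation call. The sketch below is a concurrency cap rather than a full rate limiter, uses only the standard library, and replaces the real pipeline with a stand-in function; `GpuRequestLimiter` and `fake_generate` are hypothetical names.

```python
import threading


class GpuRequestLimiter:
    """Caps the number of generation calls that may run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, generate_fn, *args, timeout=None, **kwargs):
        # Wait for a free slot; reject the request if none frees up in time
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("GPU is busy, try again later")
        try:
            return generate_fn(*args, **kwargs)
        finally:
            self._slots.release()


def fake_generate(prompt):
    """Stand-in for a real pipeline call such as pipe(prompt)."""
    return f"image for: {prompt}"


limiter = GpuRequestLimiter(max_concurrent=2)
print(limiter.run(fake_generate, "a cat"))
```

Choosing `max_concurrent` from the per-image memory estimate and the GPU's capacity keeps peak usage bounded, and the timeout turns overload into a fast, explicit error instead of an OOM crash.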

Failure Case: GPU OOM Recovery

    # Check GPU memory status
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv

    # Check GPU memory leaks from Python processes
    fuser -v /dev/nvidia*

    # Force GPU memory release (without process restart)
    # Note: empty_cache() only frees cached blocks of the process that calls it,
    # so run the equivalent calls inside the affected worker
    python -c "
    import gc
    import torch
    gc.collect()
    torch.cuda.empty_cache()
    print('GPU memory cleared')
    print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
    "

    # Service recovery procedure after OOM
    # 1. Graceful shutdown of the affected worker process
    # 2. Verify GPU memory release
    # 3. Adjust batch size/concurrent request count
    # 4. Restart worker process
    # 5. Resume traffic after health check passes

NSFW Filtering

For commercial services, the Safety Checker must always be enabled. Disabling it can result in NSFW content being generated, which may cause legal issues.

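    # Safety Checker configuration (required for production)
    from diffusers import StableDiffusionPipeline
    from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
    from transformers import CLIPImageProcessor

    dev_pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        safety_checker=None,  # Only disable in development
    )

    # Must be enabled in production
    safety_checker = StableDiffusionSafetyChecker.from_pretrained(
        "CompVis/stable-diffusion-safety-checker"
    )
    feature_extractor = CLIPImageProcessor.from_pretrained(
        "openai/clip-vit-base-patch32"
    )

    # Pass both components explicitly so the production pipeline filters its outputs
    pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        safety_checker=safety_checker,
        feature_extractor=feature_extractor,
    )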

Failure Cases and Recovery Procedures

Case 1: Model Loading Failure

Disk I/O timeouts or checkpoint corruption can occur when loading large-scale models.

    import os
    import time

    import torch
    from diffusers import StableDiffusionXLPipeline


    def robust_model_loading(model_id, max_retries=3):
        """Robust model loading (with retries)"""
        for attempt in range(max_retries):
            try:
                pipe = StableDiffusionXLPipeline.from_pretrained(
                    model_id,
                    torch_dtype=torch.float16,
                    use_safetensors=True,
                    # Skip the network when model_id is a local checkpoint directory
                    local_files_only=os.path.exists(
                        os.path.join(model_id, "model_index.json")
                    ),
                )
                pipe = pipe.to("cuda")

                # Warmup run
                _ = pipe("test", num_inference_steps=1)
                return pipe
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(10)
                    # Clear cache and retry
                    torch.cuda.empty_cache()
                else:
                    raise RuntimeError(
                        f"Model loading failed after {max_retries} attempts"
                    ) from e

Case 2: Image Quality Degradation (Inappropriate CFG Scale)

    # CFG Scale Guidelines
    guidance_scale_guidelines:
      1.0: 'No guidance amplification - plain conditional prediction, weak prompt adherence'
      3.0-5.0: 'Creative and diverse generation'
      7.0-8.5: 'Generally recommended range - quality/diversity balance'
      10.0-15.0: 'High text fidelity - risk of oversaturation'
      20.0+: 'Excessive guidance - artifacts may appear'

    # Troubleshooting checklist
    troubleshooting:
      blurry_output:
        - 'Increase num_inference_steps (minimum 30+)'
        - 'Switch scheduler to DPM-Solver++'
      oversaturated:
        - 'Lower guidance_scale to 7.0 or below'
        - "Add 'oversaturated, vivid' to negative_prompt"
      wrong_composition:
        - 'Improve prompt structure (clear subject-verb-object)'
        - 'Use ControlNet for composition control'

Conclusion

Diffusion Models have evolved rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's large-scale design, and ControlNet's fine-grained control.

Currently, new paradigms like SD3's MMDiT (Multi-Modal Diffusion Transformer) and Flow Matching, as well as Consistency Models, are emerging to enable even faster and higher-quality image generation. In particular, the DiT architecture serves as the foundation for video generation models like Sora (OpenAI), and the applications of Diffusion Models are expanding beyond images to video, 3D, and audio.

From an engineering perspective, understanding the theoretical background of models is the key to optimization and debugging. Accurately grasping the role of each component, including noise schedules, CFG Scale, scheduler selection, and memory management, is essential for operating stable services in production environments.

References

- [Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.](https://arxiv.org/abs/2006.11239)

- [Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.](https://arxiv.org/abs/2010.02502)

- [Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.](https://arxiv.org/abs/2112.10752)

- [Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.](https://arxiv.org/abs/2207.12598)

- [Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.](https://arxiv.org/abs/2212.09748)

- [Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.](https://arxiv.org/abs/2307.01952)

- [Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.](https://arxiv.org/abs/2302.05543)

- [Lilian Weng. (2021). What are Diffusion Models?](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)