Diffusion Model Paper Survey: Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL
- Introduction
- DDPM: Foundations of Diffusion Models
- DDIM: Accelerated Sampling
- Relationship with Score-based Models
- Latent Diffusion Model (Stable Diffusion)
- Classifier-free Guidance (CFG)
- DiT: Diffusion Transformer
- SDXL: Evolution of Stable Diffusion
- ControlNet: Conditional Generation Control
- Training Pipeline and Data Preparation
- Inference Optimization Techniques
- Comprehensive Model Comparison
- Operational Considerations
- Failure Cases and Recovery Procedures
- Conclusion
- References

Introduction
In the field of image generation, diffusion models have established themselves as a new paradigm replacing GANs (Generative Adversarial Networks). Since Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, commercial services such as Stable Diffusion, DALL-E 2, and Midjourney emerged within just three years, driving the democratization of image generation.
The core idea behind diffusion models is remarkably simple: learn a forward process that gradually adds noise to data and a reverse process that removes this noise to reconstruct the data. In doing so, the model learns, at each noise level, in which direction the noise should be removed.
This article surveys the evolution of the major models in chronological order: the mathematical foundations of DDPM, DDIM's accelerated sampling, the relationship with score-based models, the Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. For each model we cover its key contributions, implementation code, performance comparisons, and operational considerations.
DDPM: Foundations of Diffusion Models
Forward Process (Adding Noise)
DDPM's forward process gradually adds Gaussian noise to the original data x_0 over T steps. The noise schedule at each step t is controlled by beta_t:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
Using the reparameterization trick, the noised image at an arbitrary timestep t can be computed directly:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, where epsilon ~ N(0, I)
Here alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product of alpha_1 through alpha_t.
import torch
import torch.nn as nn
import numpy as np

class DDPMScheduler:
    """DDPM forward-process scheduler"""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

    def add_noise(self, x_0, t, noise=None):
        """Generate the noised image at an arbitrary timestep t"""
        if noise is None:
            noise = torch.randn_like(x_0)
        sqrt_alpha_bar = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
        return x_t

    def sample_timesteps(self, batch_size):
        """Sample random timesteps for training"""
        return torch.randint(0, self.num_timesteps, (batch_size,))
Reverse Process (Denoising)
In the reverse process, starting from x_T ~ N(0, I), the trained model epsilon_theta is used to remove noise step by step.
class DDPMSampler:
    """DDPM reverse-process sampler"""
    def __init__(self, scheduler):
        self.scheduler = scheduler

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDPM reverse-diffusion sampling"""
        # Start from pure noise
        x = torch.randn(shape, device=device)
        for t in reversed(range(self.scheduler.num_timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            # Predict the noise
            predicted_noise = model(x, t_batch)
            # Compute the posterior mean
            alpha = self.scheduler.alphas[t]
            alpha_bar = self.scheduler.alphas_cumprod[t]
            beta = self.scheduler.betas[t]
            mean = (1 / torch.sqrt(alpha)) * (
                x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
            )
            # Add noise only when t > 0
            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(beta)
                x = mean + sigma * noise
            else:
                x = mean
        return x
Training Objective: Simple Loss
DDPM training minimizes the MSE between the noise predicted by the model and the actual noise:
L_simple = E[ || epsilon - epsilon_theta(x_t, t) ||^2 ]
def ddpm_training_step(model, x_0, scheduler, optimizer):
    """Single DDPM training step"""
    batch_size = x_0.shape[0]
    device = x_0.device
    # 1. Sample random timesteps
    t = scheduler.sample_timesteps(batch_size).to(device)
    # 2. Generate noise and the noised image
    noise = torch.randn_like(x_0)
    x_t = scheduler.add_noise(x_0, t, noise)
    # 3. Model predicts the noise
    predicted_noise = model(x_t, t)
    # 4. Compute the simple loss
    loss = nn.functional.mse_loss(predicted_noise, noise)
    # 5. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
DDIM: Accelerated Sampling
DDPM requires 1000 reverse-diffusion steps, which makes generation very slow. DDIM (Denoising Diffusion Implicit Models), proposed by Song et al. (2020), defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.
The key to DDIM is the eta parameter, which controls stochastic versus deterministic sampling: eta=0 is fully deterministic, while eta=1 recovers DDPM.
class DDIMSampler:
    """DDIM accelerated sampler"""
    def __init__(self, scheduler, ddim_steps=50, eta=0.0):
        self.scheduler = scheduler
        self.ddim_steps = ddim_steps
        self.eta = eta
        # Subset of timesteps (e.g., 1000 -> 50)
        self.timesteps = np.linspace(
            0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
        )[::-1]

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDIM accelerated sampling - high quality in 50 steps"""
        x = torch.randn(shape, device=device)
        for i in range(len(self.timesteps)):
            t = self.timesteps[i]
            t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            predicted_noise = model(x, t_batch)
            alpha_bar_t = self.scheduler.alphas_cumprod[t]
            # Past the final step, treat alpha_bar as 1 (fully denoised)
            alpha_bar_prev = (
                self.scheduler.alphas_cumprod[t_prev]
                if i + 1 < len(self.timesteps) else torch.tensor(1.0)
            )
            # Predict x_0
            x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
            x_0_pred = torch.clamp(x_0_pred, -1, 1)
            # Direction term
            sigma = self.eta * torch.sqrt(
                (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
            )
            direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise
            # Compute x_{t-1}
            x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction
            if self.eta > 0 and t > 0:
                x = x + sigma * torch.randn_like(x)
        return x
Relationship with Score-based Models
Song and Ermon (2019) interpreted diffusion models from the score-matching perspective. The score function is the gradient of the log density of the data distribution:
score(x) = grad_x log p(x)
DDPM's noise prediction epsilon_theta and the score function are related by:
score(x_t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)
This relationship was later unified in the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:
dx = f(x, t) dt + g(t) dw
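To make the conversion concrete, the relationship above can be implemented in a few lines. This is a minimal sketch under the interfaces used in the DDPM code earlier (an `epsilon_model` callable taking `(x, t)` and an `alphas_cumprod` tensor); the function name is ours:

```python
import torch

def noise_pred_to_score(epsilon_model, x_t, t, alphas_cumprod):
    """Reinterpret a DDPM noise predictor as a score model.

    Uses the relation score(x_t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)
    described above.
    """
    predicted_noise = epsilon_model(x_t, t)
    alpha_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    return -predicted_noise / torch.sqrt(1.0 - alpha_bar_t)
```

This is what allows the same trained network to be plugged into score-based samplers such as the probability-flow ODE.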
Latent Diffusion Model (Stable Diffusion)
Architecture Overview
The Latent Diffusion Model (LDM) of Rombach et al. (2022) performs the diffusion process in latent space rather than pixel space, dramatically reducing computational cost. This is the core architecture behind Stable Diffusion.
LDM consists of four key components.
| Component | Role | Details |
|---|---|---|
| VAE Encoder | Encodes images into latent space | Compresses a 512x512 image into a 64x64x4 latent representation |
| U-Net (Denoiser) | Predicts noise in latent space | Incorporates the text condition via cross-attention |
| VAE Decoder | Decodes latents back into images | Reconstructs a 64x64x4 latent representation into a 512x512 image |
| Text Encoder | Encodes the text prompt | Produces 77-token embeddings with CLIP ViT-L/14 |
Core Code Structure
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
class LatentDiffusionInference:
    """Stable Diffusion inference pipeline (simplified)"""
    def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            safety_checker=None
        ).to("cuda")
        # Swap in the DDIM scheduler (50-step acceleration)
        self.pipe.scheduler = DDIMScheduler.from_config(
            self.pipe.scheduler.config
        )

    def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
        """Text-to-image generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
        ).images[0]
        return image

    def generate_with_latent_control(self, prompt, seed=42):
        """Direct control of the latent space"""
        generator = torch.Generator(device="cuda").manual_seed(seed)
        # Create the latent vector directly
        latents = torch.randn(
            (1, 4, 64, 64),
            generator=generator,
            device="cuda",
            dtype=torch.float16
        )
        image = self.pipe(
            prompt=prompt,
            latents=latents,
            num_inference_steps=50,
            guidance_scale=7.5,
        ).images[0]
        return image
Cross-Attention Mechanism
In Stable Diffusion's U-Net, cross-attention injects the text condition into image generation: the Query comes from the image latent representation, while the Key and Value come from the text embeddings.
class CrossAttention(nn.Module):
    """Cross-attention layer of the Stable Diffusion U-Net"""
    def __init__(self, d_model=320, d_context=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(d_context, d_model, bias=False)
        self.to_v = nn.Linear(d_context, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: image latent representation (B, H*W, d_model)
        context: text embeddings (B, seq_len, d_context)
        """
        B, N, C = x.shape
        q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention
        scale = self.d_head ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, N, C)
        return self.to_out(out)
Classifier-free Guidance (CFG)
Classifier-free Guidance, proposed by Ho and Salimans (2022), is a key technique for controlling generation quality without a separate classifier.
During training, the conditional and unconditional models are trained jointly (the text condition is replaced with an empty string with some probability). At inference time, a weighted combination of the two predictions is used:
epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
Here w is the guidance scale: w=1 gives pure conditional generation, and larger w follows the text condition more strongly (typically 7.5-15).
def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """Single classifier-free guidance step"""
    # Process the conditional/unconditional predictions in one batch
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([null_embedding, text_embedding], dim=0)
    # One forward pass produces both predictions
    noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
    # Apply CFG
    noise_pred_guided = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )
    return noise_pred_guided
DiT: Diffusion Transformer
From U-Net to Transformer
DiT (Diffusion Transformer) by Peebles and Xie (2023) replaced the U-Net backbone of diffusion models with a Transformer. The key finding is that scaling up the Transformer (in GFLOPs) consistently improves generation quality (FID).
| Model | Backbone | Parameters | FID (ImageNet 256) | GFLOPs |
|---|---|---|---|---|
| ADM | U-Net | 554M | 10.94 | 1120 |
| LDM-4 | U-Net | 400M | 10.56 | 103 |
| DiT-S/2 | Transformer | 33M | 68.40 | 6 |
| DiT-B/2 | Transformer | 130M | 43.47 | 23 |
| DiT-L/2 | Transformer | 458M | 9.62 | 80 |
| DiT-XL/2 | Transformer | 675M | 2.27 | 119 |
adaLN-Zero Block
DiT's key innovation is the adaLN-Zero conditioning scheme. The timestep and class embeddings are injected as the scale/shift parameters of adaptive layer normalization, and the gating parameters are initialized to zero so that each block acts as the identity function at the start of training.
class DiTBlock(nn.Module):
    """DiT's adaLN-Zero Transformer block"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        # adaLN modulation: 6 parameters (shift1, scale1, gate1, shift2, scale2, gate2)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(d_model, 6 * d_model),
        )
        # Zero init - acts as the identity early in training
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x, c):
        """
        x: patch tokens (B, N, D)
        c: conditioning embedding - timestep + class (B, D)
        """
        # Produce the adaLN parameters
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.adaLN_modulation(c).chunk(6, dim=-1)
        )
        # Self-attention with adaLN
        h = self.norm1(x)
        h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        x = x + gate1.unsqueeze(1) * h
        # FFN with adaLN
        h = self.norm2(x)
        h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        h = self.mlp(h)
        x = x + gate2.unsqueeze(1) * h
        return x
Patchify Strategy
DiT splits the latent representation into p x p patches and uses them as input tokens for the Transformer. Smaller patch sizes yield more tokens, which improves quality but increases compute.
class PatchEmbed(nn.Module):
    """DiT's patchify layer"""
    def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        """(B, C, H, W) -> (B, N, D) patch token sequence"""
        x = self.proj(x)  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D)
        return x
SDXL: Evolution of Stable Diffusion
Key Improvements
SDXL by Podell et al. (2023) introduced the following key improvements over Stable Diffusion v1.5.
| Feature | SD v1.5 | SDXL Base |
|---|---|---|
| U-Net parameters | 860M | 2.6B (3x larger) |
| Text encoder | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text embedding dim | 768 | 2048 |
| Base resolution | 512x512 | 1024x1024 |
| Attention blocks | 16 | 70 |
| Refiner model | None | Dedicated refiner included |
Dual Text Encoders
One of SDXL's biggest innovations is its use of two text encoders. Combining the rich semantic representation of OpenCLIP ViT-bigG with the complementary features of CLIP ViT-L substantially improves text understanding.
from diffusers import StableDiffusionXLPipeline
import torch
class SDXLInference:
    """SDXL inference pipeline"""
    def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        ).to("cuda")
        # Memory optimizations
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()

    def generate(self, prompt, negative_prompt="", steps=30):
        """Basic SDXL generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=7.5,
            height=1024,
            width=1024,
        ).images[0]
        return image

    def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
        """Two-stage base + refiner pipeline"""
        # Base model: first 80% of the steps
        high_noise_frac = 0.8
        image = base_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_end=high_noise_frac,
            output_type="latent",
        ).images
        # Refiner: remaining 20% (sharpens fine detail)
        image = refiner_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_start=high_noise_frac,
            image=image,
        ).images[0]
        return image
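The combination step of the two encoders can be sketched as follows. This is a simplified illustration (the function name is ours): the encoders' per-token hidden states are concatenated along the feature dimension, which is where the table's 768 vs 2048 embedding dimensions come from; the actual pipeline additionally uses the bigG pooled embedding for conditioning.

```python
import torch

def combine_text_embeddings(clip_l_hidden, open_clip_g_hidden):
    """Concatenate the two encoders' hidden states along the feature dim.

    SDXL's U-Net cross-attention receives a 2048-dim context built from
    CLIP ViT-L (768-dim) and OpenCLIP ViT-bigG (1280-dim) hidden states.
    """
    assert clip_l_hidden.shape[:2] == open_clip_g_hidden.shape[:2]
    return torch.cat([clip_l_hidden, open_clip_g_hidden], dim=-1)

# Shapes as in the table above: 77 tokens, 768 + 1280 = 2048 channels
context = combine_text_embeddings(
    torch.randn(1, 77, 768), torch.randn(1, 77, 1280)
)
print(context.shape)  # torch.Size([1, 77, 2048])
```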
Size/Crop Conditioning
During training, SDXL conditions on the image's original size and crop coordinates, allowing it to learn effectively from images of varied aspect ratios. This is implemented with Fourier feature encoding.
def get_sdxl_conditioning(original_size, crop_coords, target_size):
    """Build SDXL's size/crop conditioning"""
    # Original size (height, width)
    original_size = torch.tensor(original_size, dtype=torch.float32)
    # Crop coordinates (top, left)
    crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
    # Target size (height, width)
    target_size = torch.tensor(target_size, dtype=torch.float32)
    # Fourier feature encoding
    conditioning = torch.cat([original_size, crop_coords, target_size])
    # Sinusoidal embedding frequencies
    freqs = torch.exp(
        -torch.arange(0, 128, dtype=torch.float32) * np.log(10000) / 128
    )
    emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
    return emb.flatten()
ControlNet: Conditional Generation Control
ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to a pretrained diffusion model. Its zero-convolution technique preserves the base model's existing capabilities early in training while the new condition is learned gradually.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch
def controlnet_canny_generation(input_image_path, prompt):
    """ControlNet image generation conditioned on Canny edges"""
    # Load the ControlNet model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_canny",
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")
    # Extract Canny edges
    canny_detector = CannyDetector()
    input_image = Image.open(input_image_path)
    canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)
    # Generate with ControlNet conditioning
    output = pipe(
        prompt=prompt,
        image=canny_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,
    ).images[0]
    return output
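The zero-convolution idea mentioned above can be sketched in a few lines. This is a minimal illustration of the mechanism, not the diffusers implementation:

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Module):
    """1x1 convolution initialized to zero (ControlNet's zero convolution).

    At the start of training the output is exactly zero, so the frozen
    base model's behavior is untouched; the condition signal is blended
    in gradually as the weights move away from zero.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)
```

Because the residual added to the base U-Net starts at zero, the first training steps cannot degrade the pretrained model, which is what makes fine-tuning on relatively small condition datasets stable.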
Training Pipeline and Data Preparation
Dataset Composition
A comparison of the main datasets used to train large diffusion models.
| Dataset | Scale | Resolution | Use |
|---|---|---|---|
| LAION-5B | 5.8B image-text pairs | Varied | Stable Diffusion training |
| LAION-Aesthetics | 120M (filtered) | Varied | High-quality fine-tuning |
| ImageNet | 1.3M | 256/512 | DiT training (class-conditional) |
| COYO-700M | 700M | Varied | Multilingual training, including Korean |
Fine-tuning Strategies
# LoRA fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--dataset_name="custom_dataset" \
--resolution=512 \
--train_batch_size=4 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--lr_scheduler="cosine" \
--lr_warmup_steps=500 \
--max_train_steps=10000 \
--rank=64 \
--output_dir="./lora_output" \
--mixed_precision="fp16" \
--enable_xformers_memory_efficient_attention
# DreamBooth fine-tuning (learning a specific subject/style)
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--instance_data_dir="./my_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--max_train_steps=800
Inference Optimization Techniques
Comparison of Key Optimization Techniques
| Technique | Speedup | Quality Impact | Memory Savings |
|---|---|---|---|
| DDIM (50 steps) | 20x | Minimal | - |
| DPM-Solver++ (20 steps) | 50x | Minimal | - |
| xFormers memory-efficient attention | 1.5x | None | 30-40% |
| torch.compile | 1.2-1.5x | None | - |
| VAE tiling | - | Minimal | 70%+ |
| FP16/BF16 | 1.5-2x | Minimal | 50% |
| TensorRT | 2-4x | None | - |
Practical Optimization Code
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
def optimized_sdxl_pipeline():
    """Production-optimized SDXL pipeline"""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    # 1. Fast scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )
    # 2. VAE tiling (saves memory for high-resolution generation)
    pipe.enable_vae_tiling()
    # 3. Attention slicing (when VRAM is tight)
    pipe.enable_attention_slicing()
    # 4. torch.compile (PyTorch 2.0+)
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    return pipe

# GPU memory monitoring
def monitor_gpu_memory():
    """Report GPU memory usage"""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    max_allocated = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved: {reserved:.2f} GB")
    print(f"Peak: {max_allocated:.2f} GB")
Comprehensive Model Comparison
| Model | Year | Key Contribution | Backbone | Conditioning | Resolution |
|---|---|---|---|---|---|
| DDPM | 2020 | Made diffusion models practical | U-Net | None (unconditional) | 256 |
| DDIM | 2020 | Accelerated sampling | U-Net | None | 256 |
| LDM (SD) | 2022 | Latent-space diffusion | U-Net + VAE | Cross-attention | 512 |
| DiT | 2023 | Transformer backbone | Transformer | adaLN-Zero | 256/512 |
| SDXL | 2023 | Large U-Net + dual encoders | U-Net + VAE | Cross-attention + CFG | 1024 |
| ControlNet | 2023 | Spatial condition control | Zero conv + U-Net | Edges/depth/pose | 512 |
| SD3 | 2024 | MMDiT (multimodal DiT) | Transformer | Flow matching | 1024 |
Operational Considerations
GPU Memory Management
The most frequent problem when operating a Stable Diffusion-based service is GPU OOM (out of memory). Check the following.
- Limit batch size: a single 1024x1024 SDXL generation uses about 12GB on an A100 80GB and OOMs on a V100 16GB
- Limit concurrent requests: always apply a rate limiter to avoid exhausting GPU memory
- Enable VAE tiling: essential for high-resolution (2048x2048+) generation
- Memory profiling: monitor GPU memory periodically to detect leaks
Failure Case: Recovering from GPU OOM
# Check GPU memory state
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Check which Python processes hold GPU memory
fuser -v /dev/nvidia*
# Force-release cached GPU memory (without restarting the process)
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"
# Service recovery procedure after an OOM
# 1. Gracefully shut down the affected worker process
# 2. Confirm GPU memory has been released
# 3. Reduce the batch size / concurrent request limit
# 4. Restart the worker process
# 5. Restore traffic after health checks pass
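For a single request, the workload-reduction step of the procedure above can also be handled in-process before resorting to a worker restart. A minimal sketch (the function and halving policy are illustrative, assuming a diffusers-style `pipe` callable):

```python
import torch

def generate_with_oom_fallback(pipe, prompt, height=1024, width=1024):
    """Retry at reduced resolution when generation hits CUDA OOM.

    A simplified in-process version of the recovery procedure above:
    reduce the workload instead of immediately restarting the worker.
    """
    try:
        return pipe(prompt, height=height, width=width).images[0]
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return pipe(prompt, height=height // 2, width=width // 2).images[0]
```

If the retry also fails, fall back to the full procedure (graceful shutdown and restart), since repeated OOMs usually indicate fragmentation or a leak.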
NSFW Filtering
Commercial services must enable the safety checker. Disabling it can allow NSFW content to be generated, which may create legal liability.
# Safety checker configuration (required in production)
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    safety_checker=None,  # disable only in development environments
)
# In production, always enable it
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor
safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)
Failure Cases and Recovery Procedures
Case 1: Model Loading Failure
Loading large models can hit disk I/O timeouts or corrupted checkpoints.
import os
import time
import torch
from diffusers import StableDiffusionXLPipeline

def robust_model_loading(model_id, max_retries=3):
    """Robust model loading with retries"""
    for attempt in range(max_retries):
        try:
            pipe = StableDiffusionXLPipeline.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                use_safetensors=True,
                local_files_only=os.path.exists(
                    os.path.join(model_id, "model_index.json")
                ),
            )
            pipe = pipe.to("cuda")
            # Warm-up run
            _ = pipe("test", num_inference_steps=1)
            return pipe
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(10)
                # Clear the cache before retrying
                torch.cuda.empty_cache()
            else:
                raise RuntimeError(f"Model loading failed after {max_retries} attempts")
Case 2: Degraded Image Quality (Inappropriate CFG Scale)
# CFG scale guidelines
guidance_scale_guidelines:
  1.0: 'Condition mostly ignored - nearly random generation'
  3.0-5.0: 'Creative, diverse generations'
  7.0-8.5: 'Recommended range - balance of quality and diversity'
  10.0-15.0: 'High text fidelity - risk of oversaturation'
  20.0+: 'Excessive guidance - artifacts appear'
# Troubleshooting checklist
troubleshooting:
  blurry_output:
    - 'Increase num_inference_steps (at least 30)'
    - 'Switch the scheduler to DPM-Solver++'
  oversaturated:
    - 'Lower guidance_scale to 7.0 or below'
    - "Add 'oversaturated, vivid' to the negative_prompt"
  wrong_composition:
    - 'Improve prompt structure (clear subject-verb-object)'
    - 'Control composition with ControlNet'
Conclusion
Diffusion models have advanced rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's scale-up, and ControlNet's fine-grained control.
Today, new paradigms such as SD3's MMDiT (Multi-Modal Diffusion Transformer), flow matching, and consistency models are enabling faster, higher-quality generation. The DiT architecture in particular underpins video generation models such as OpenAI's Sora, and diffusion models are expanding beyond images into video, 3D, and audio.
From an engineering perspective, understanding the theory behind these models is the key to optimization and debugging. Only by knowing exactly what each component does, including noise schedules, CFG scale, scheduler choice, and memory management, can you operate a stable production service.
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
- Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
- Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.
- Lilian Weng. (2021). What are Diffusion Models?
Diffusion Model Paper Survey: Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL
- Introduction
- DDPM: Foundations of Diffusion Models
- DDIM: Accelerated Sampling
- Relationship with Score-based Models
- Latent Diffusion Model (Stable Diffusion)
- Classifier-free Guidance (CFG)
- DiT: Diffusion Transformer
- SDXL: Evolution of Stable Diffusion
- ControlNet: Conditional Generation Control
- Training Pipeline and Data Preparation
- Inference Optimization Techniques
- Comprehensive Model Comparison
- Operational Considerations
- Failure Cases and Recovery Procedures
- Conclusion
- References

Introduction
In the field of image generation, Diffusion Models have established themselves as a new paradigm replacing GANs (Generative Adversarial Networks). Since Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, commercial services like Stable Diffusion, DALL-E 2, and Midjourney emerged within just three years, driving the democratization of image generation.
The core idea behind Diffusion Models is remarkably simple. It involves learning a Forward Process that gradually adds noise to data and a Reverse Process that removes this noise in reverse to reconstruct the data. Through this process, the model learns "which direction to remove noise" at each noise level.
In this article, we survey the evolution of major models chronologically: from the mathematical foundations of DDPM to DDIM's accelerated sampling, the relationship with score-based models, Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. We comprehensively cover each model's key contributions, implementation code, performance comparisons, and operational considerations.
DDPM: Foundations of Diffusion Models
Forward Process (Adding Noise)
DDPM's Forward Process gradually adds Gaussian noise to the original data x_0 over T steps. The noise schedule at each step t is controlled by beta_t.
Using the reparameterization trick, we can directly compute the noised image at any arbitrary timestep t.
Here, alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product from alpha_1 to alpha_t.
import torch
import torch.nn as nn
import numpy as np
class DDPMScheduler:
"""DDPM Forward Process Scheduler"""
def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
self.num_timesteps = num_timesteps
# Linear noise schedule
self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
self.alphas = 1.0 - self.betas
self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
def add_noise(self, x_0, t, noise=None):
"""Generate noised image at arbitrary timestep t"""
if noise is None:
noise = torch.randn_like(x_0)
sqrt_alpha_bar = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
return x_t
def sample_timesteps(self, batch_size):
"""Sample random timesteps for training"""
return torch.randint(0, self.num_timesteps, (batch_size,))
Reverse Process (Denoising)
In the Reverse Process, starting from x_T ~ N(0, I), the trained model epsilon_theta is used to progressively remove noise step by step.
class DDPMSampler:
"""DDPM Reverse Process Sampler"""
def __init__(self, scheduler):
self.scheduler = scheduler
@torch.no_grad()
def sample(self, model, shape, device):
"""DDPM reverse diffusion sampling"""
# Start from pure noise
x = torch.randn(shape, device=device)
for t in reversed(range(self.scheduler.num_timesteps)):
t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
# Predict noise
predicted_noise = model(x, t_batch)
# Compute mean
alpha = self.scheduler.alphas[t]
alpha_bar = self.scheduler.alphas_cumprod[t]
beta = self.scheduler.betas[t]
mean = (1 / torch.sqrt(alpha)) * (
x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
)
# Add noise only when t > 0
if t > 0:
noise = torch.randn_like(x)
sigma = torch.sqrt(beta)
x = mean + sigma * noise
else:
x = mean
return x
Training Objective: Simple Loss
DDPM training minimizes the MSE between the model-predicted noise and the actual noise.
def ddpm_training_step(model, x_0, scheduler, optimizer):
"""DDPM training single step"""
batch_size = x_0.shape[0]
device = x_0.device
# 1. Sample random timesteps
t = scheduler.sample_timesteps(batch_size).to(device)
# 2. Generate noise and noised image
noise = torch.randn_like(x_0)
x_t = scheduler.add_noise(x_0, t, noise)
# 3. Model predicts noise
predicted_noise = model(x_t, t)
# 4. Compute Simple Loss
loss = nn.functional.mse_loss(predicted_noise, noise)
# 5. Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item()
DDIM: Accelerated Sampling
DDPM requires 1000 steps of reverse diffusion, making generation extremely slow. DDIM (Denoising Diffusion Implicit Models) proposed by Song et al. (2020) defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.
The key to DDIM is the eta parameter that controls stochastic/deterministic sampling. When eta=0, the sampling is fully deterministic; when eta=1, it becomes identical to DDPM.
class DDIMSampler:
"""DDIM Accelerated Sampler"""
def __init__(self, scheduler, ddim_steps=50, eta=0.0):
self.scheduler = scheduler
self.ddim_steps = ddim_steps
self.eta = eta
# Generate subset timesteps (e.g., 1000 -> 50)
self.timesteps = np.linspace(
0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
)[::-1]
@torch.no_grad()
def sample(self, model, shape, device):
"""DDIM accelerated sampling - high quality in 50 steps"""
x = torch.randn(shape, device=device)
for i in range(len(self.timesteps)):
t = self.timesteps[i]
t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0
t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
predicted_noise = model(x, t_batch)
alpha_bar_t = self.scheduler.alphas_cumprod[t]
alpha_bar_prev = self.scheduler.alphas_cumprod[t_prev]
# Predict x_0
x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
x_0_pred = torch.clamp(x_0_pred, -1, 1)
# Compute direction
sigma = self.eta * torch.sqrt(
(1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
)
direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise
# Compute x_{t-1}
x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction
if self.eta > 0 and t > 0:
x = x + sigma * torch.randn_like(x)
return x
Relationship with Score-based Models
Song and Ermon (2019) interpreted diffusion models from the Score Matching perspective. The score function is the gradient of the log density of the data distribution.
DDPM's noise prediction epsilon_theta and the score function have the following relationship:
This relationship was unified through the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:
Latent Diffusion Model (Stable Diffusion)
Architecture Overview
Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically reduced computational costs by performing the diffusion process in latent space rather than pixel space. This is the core architecture behind Stable Diffusion.
LDM consists of three key components:
| Component | Role | Details |
|---|---|---|
| VAE Encoder | Encode images to latent space | Compress 512x512 images to 64x64x4 latent representations |
| U-Net (Denoiser) | Predict noise in latent space | Incorporates text conditions via Cross-Attention |
| VAE Decoder | Decode latents to images | Reconstruct 64x64x4 latent representations to 512x512 images |
| Text Encoder | Encode text prompts | Generate 77-token embeddings using CLIP ViT-L/14 |
Core Code Structure
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
class LatentDiffusionInference:
"""Stable Diffusion Inference Pipeline (Simplified)"""
def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
self.pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
safety_checker=None
).to("cuda")
# Switch to DDIM scheduler (accelerate with 50 steps)
self.pipe.scheduler = DDIMScheduler.from_config(
self.pipe.scheduler.config
)
def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
"""Text-to-image generation"""
image = self.pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=num_steps,
guidance_scale=guidance_scale,
).images[0]
return image
def generate_with_latent_control(self, prompt, seed=42):
"""Direct latent space control"""
generator = torch.Generator(device="cuda").manual_seed(seed)
# Generate latent vector directly
latents = torch.randn(
(1, 4, 64, 64),
generator=generator,
device="cuda",
dtype=torch.float16
)
image = self.pipe(
prompt=prompt,
latents=latents,
num_inference_steps=50,
guidance_scale=7.5,
).images[0]
return image
Cross-Attention Mechanism
In Stable Diffusion's U-Net, Cross-Attention incorporates text conditions into image generation. Query is generated from the image latent representation, while Key and Value come from the text embeddings.
class CrossAttention(nn.Module):
"""Cross-Attention Layer in Stable Diffusion U-Net"""
def __init__(self, d_model=320, d_context=768, n_heads=8):
super().__init__()
self.n_heads = n_heads
self.d_head = d_model // n_heads
self.to_q = nn.Linear(d_model, d_model, bias=False)
self.to_k = nn.Linear(d_context, d_model, bias=False)
self.to_v = nn.Linear(d_context, d_model, bias=False)
self.to_out = nn.Linear(d_model, d_model)
def forward(self, x, context):
"""
x: Image latent representation (B, H*W, d_model)
context: Text embeddings (B, seq_len, d_context)
"""
B, N, C = x.shape
q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
# Scaled Dot-Product Attention
scale = self.d_head ** -0.5
attn = torch.matmul(q, k.transpose(-2, -1)) * scale
attn = torch.softmax(attn, dim=-1)
out = torch.matmul(attn, v)
out = out.transpose(1, 2).contiguous().view(B, N, C)
return self.to_out(out)
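The manual attention math above can be cross-checked against PyTorch 2.0+'s built-in `torch.nn.functional.scaled_dot_product_attention`; a minimal sketch with random tensors standing in for real image/text features:

```python
import torch
import torch.nn.functional as F

# Random Q/K/V in multi-head layout (B, heads, seq, d_head)
torch.manual_seed(0)
q = torch.randn(2, 8, 64, 40)   # image latent tokens as queries
k = torch.randn(2, 8, 77, 40)   # 77 text tokens as keys
v = torch.randn(2, 8, 77, 40)

# Manual scaled dot-product attention (as in CrossAttention.forward)
scale = q.shape[-1] ** -0.5
attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
manual = attn @ v

# Fused kernel (PyTorch 2.0+); default scale is also 1/sqrt(d_head)
fused = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(manual, fused, atol=1e-5)
```

In practice the fused kernel is both faster and more memory-efficient, which is exactly what optimizations like xFormers exploit.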
Classifier-free Guidance (CFG)
Classifier-free Guidance proposed by Ho and Salimans (2022) is a key technique for controlling generation quality without a separate classifier.
During training, the conditional and unconditional models are trained jointly (the text condition is replaced with an empty string with some probability, commonly around 10%). During inference, the two predictions are combined by extrapolation: epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond).
Here, w is the guidance scale. w=1 reduces to the pure conditional prediction, and larger w pushes the generation to follow the text condition more strongly (typically 7.5-15).
def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
"""Classifier-free Guidance single step"""
# Process conditional/unconditional predictions as a single batch
x_in = torch.cat([x_t, x_t], dim=0)
t_in = torch.cat([t, t], dim=0)
c_in = torch.cat([null_embedding, text_embedding], dim=0)
# Generate both predictions in a single forward pass
noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
# Apply CFG
noise_pred_guided = noise_pred_uncond + guidance_scale * (
noise_pred_cond - noise_pred_uncond
)
return noise_pred_guided
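A quick sanity check of the CFG formula, using plain tensors rather than a real U-Net: at w=1 the guided prediction reduces to the pure conditional prediction, and at w=0 to the unconditional one.

```python
import torch

def cfg_combine(noise_uncond, noise_cond, w):
    """CFG combination: epsilon_uncond + w * (epsilon_cond - epsilon_uncond)"""
    return noise_uncond + w * (noise_cond - noise_uncond)

torch.manual_seed(0)
eps_uncond = torch.randn(1, 4, 64, 64)
eps_cond = torch.randn(1, 4, 64, 64)

# w = 1.0 -> pure conditional prediction
assert torch.allclose(cfg_combine(eps_uncond, eps_cond, 1.0), eps_cond, atol=1e-5)

# w = 0.0 -> pure unconditional prediction
assert torch.allclose(cfg_combine(eps_uncond, eps_cond, 0.0), eps_uncond, atol=1e-5)
```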
DiT: Diffusion Transformer
From U-Net to Transformer
DiT (Diffusion Transformer) by Peebles and Xie (2023) replaced the U-Net backbone of diffusion models with a Transformer. The key finding is that scaling up the Transformer's compute (GFLOPs) consistently improves generation quality (lower FID).
| Model | Backbone | Parameters | FID (ImageNet 256) | GFLOPs |
|---|---|---|---|---|
| ADM | U-Net | 554M | 10.94 | 1120 |
| LDM-4 | U-Net | 400M | 10.56 | 103 |
| DiT-S/2 | Transformer | 33M | 68.40 | 6 |
| DiT-B/2 | Transformer | 130M | 43.47 | 23 |
| DiT-L/2 | Transformer | 458M | 9.62 | 80 |
| DiT-XL/2 | Transformer | 675M | 2.27 | 119 |
adaLN-Zero Block
The key innovation of DiT is the adaLN-Zero conditioning approach. Timestep and class embeddings are injected as scale/shift parameters of Adaptive Layer Normalization, with gating parameters initialized to zero so that the block acts as an identity function (residual connection) at the start of training.
class DiTBlock(nn.Module):
"""DiT adaLN-Zero Transformer Block"""
def __init__(self, d_model, n_heads):
super().__init__()
self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
self.mlp = nn.Sequential(
nn.Linear(d_model, d_model * 4),
nn.GELU(),
nn.Linear(d_model * 4, d_model),
)
        # adaLN modulation: 6 parameters (shift1, scale1, gate1, shift2, scale2, gate2)
self.adaLN_modulation = nn.Sequential(
nn.SiLU(),
nn.Linear(d_model, 6 * d_model),
)
# Zero initialization - acts as identity at start of training
nn.init.zeros_(self.adaLN_modulation[-1].weight)
nn.init.zeros_(self.adaLN_modulation[-1].bias)
def forward(self, x, c):
"""
x: Patch tokens (B, N, D)
c: Condition embedding - timestep + class (B, D)
"""
# Generate adaLN parameters
shift1, scale1, gate1, shift2, scale2, gate2 = (
self.adaLN_modulation(c).chunk(6, dim=-1)
)
# Self-Attention with adaLN
h = self.norm1(x)
h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
h, _ = self.attn(h, h, h)
x = x + gate1.unsqueeze(1) * h
# FFN with adaLN
h = self.norm2(x)
h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
h = self.mlp(h)
x = x + gate2.unsqueeze(1) * h
return x
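Why zero initialization makes the block an identity at the start of training can be seen directly: a zero-initialized Linear outputs zeros regardless of its input, so every gate is zero and each residual branch contributes nothing. A minimal sketch:

```python
import torch
import torch.nn as nn

d_model = 16
adaLN = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 6 * d_model))
nn.init.zeros_(adaLN[-1].weight)
nn.init.zeros_(adaLN[-1].bias)

c = torch.randn(2, d_model)
shift1, scale1, gate1, shift2, scale2, gate2 = adaLN(c).chunk(6, dim=-1)

# All modulation parameters are exactly zero at init...
assert float(gate1.abs().max()) == 0.0 and float(gate2.abs().max()) == 0.0

# ...so x + gate * branch(x) == x, whatever the branch computes
x = torch.randn(2, 5, d_model)
branch_out = torch.randn(2, 5, d_model)
assert torch.equal(x + gate1.unsqueeze(1) * branch_out, x)
```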
Patchify Strategy
DiT splits the latent representation into p x p patches to use as Transformer input tokens. Smaller patch sizes result in more tokens, improving performance but increasing computational cost.
class PatchEmbed(nn.Module):
"""DiT Patchify Layer"""
def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
super().__init__()
self.patch_size = patch_size
self.proj = nn.Conv2d(
in_channels, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
"""(B, C, H, W) -> (B, N, D) patch token sequence"""
x = self.proj(x) # (B, D, H/p, W/p)
x = x.flatten(2).transpose(1, 2) # (B, N, D)
return x
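For example, the 32x32x4 latent of a 256x256 image yields (32/p)^2 tokens: 256 tokens at p=2 (the "/2" in DiT-XL/2) versus 64 at p=4. A quick shape check of the patchify idea, using a smaller embedding dimension than DiT's 1152 for brevity:

```python
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 32, 32)  # VAE latent of a 256x256 image

for p, expected_tokens in [(2, 256), (4, 64), (8, 16)]:
    proj = nn.Conv2d(4, 64, kernel_size=p, stride=p)  # patchify convolution
    tokens = proj(latent).flatten(2).transpose(1, 2)  # (B, N, D)
    assert tokens.shape == (1, expected_tokens, 64)
```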
SDXL: Evolution of Stable Diffusion
Key Improvements
SDXL by Podell et al. (2023) introduced the following core improvements over Stable Diffusion v1.5:
| Feature | SD v1.5 | SDXL Base |
|---|---|---|
| U-Net Parameters | 860M | 2.6B (3x increase) |
| Text Encoder | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text Embedding Dimension | 768 | 2048 |
| Default Resolution | 512x512 | 1024x1024 |
| Attention Blocks | 16 | 70 |
| Refiner Model | None | Dedicated Refiner included |
Dual Text Encoders
One of SDXL's greatest innovations is the use of two text encoders. It combines the rich semantic representations from OpenCLIP ViT-bigG with complementary features from CLIP ViT-L, significantly improving text understanding.
from diffusers import StableDiffusionXLPipeline
import torch
class SDXLInference:
"""SDXL Inference Pipeline"""
def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )
        # Memory optimization: enable_model_cpu_offload manages device
        # placement itself, so do not call .to("cuda") beforehand
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()
def generate(self, prompt, negative_prompt="", steps=30):
"""SDXL basic generation"""
image = self.pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=steps,
guidance_scale=7.5,
height=1024,
width=1024,
).images[0]
return image
def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
"""Base + Refiner two-stage pipeline"""
# Base model: 80% of total steps
high_noise_frac = 0.8
image = base_pipe(
prompt=prompt,
num_inference_steps=40,
denoising_end=high_noise_frac,
output_type="latent",
).images
# Refiner: remaining 20% (enhance fine details)
image = refiner_pipe(
prompt=prompt,
num_inference_steps=40,
denoising_start=high_noise_frac,
image=image,
).images[0]
return image
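The 2048-dim embedding in the table above comes from concatenating the two encoders' per-token features along the channel axis (768 from CLIP ViT-L, 1280 from OpenCLIP ViT-bigG); a shape-only sketch with random tensors standing in for real encoder outputs:

```python
import torch

batch, seq_len = 1, 77
clip_l_emb = torch.randn(batch, seq_len, 768)      # CLIP ViT-L/14 features
open_clip_emb = torch.randn(batch, seq_len, 1280)  # OpenCLIP ViT-bigG features

# SDXL concatenates per-token features along the channel dimension
text_emb = torch.cat([clip_l_emb, open_clip_emb], dim=-1)
assert text_emb.shape == (1, 77, 2048)
```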
Size/Crop Conditioning
SDXL provides the original image size and crop coordinates as conditions during training, enabling effective learning of images with diverse aspect ratios. This is implemented using Fourier Feature Encoding.
import torch
import numpy as np

def get_sdxl_conditioning(original_size, crop_coords, target_size):
"""Generate SDXL size/crop conditioning"""
# Original size (height, width)
original_size = torch.tensor(original_size, dtype=torch.float32)
# Crop coordinates (top, left)
crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
# Target size (height, width)
target_size = torch.tensor(target_size, dtype=torch.float32)
# Fourier Feature Encoding
conditioning = torch.cat([original_size, crop_coords, target_size])
# Sinusoidal embedding
freqs = torch.exp(
-torch.arange(0, 128) * np.log(10000) / 128
)
emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
return emb.flatten()
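With the 128-frequency choice used here, the function above maps its 6 conditioning scalars to a flat vector of 6 x 256 sin/cos features = 1536 dims. A self-contained check that mirrors the computation:

```python
import torch
import numpy as np

# Six conditioning scalars: original (h, w), crop (top, left), target (h, w)
cond = torch.tensor([1024.0, 1024.0, 0.0, 0.0, 1024.0, 1024.0])

freqs = torch.exp(-torch.arange(0, 128) * np.log(10000) / 128)
emb = cond.unsqueeze(-1) * freqs.unsqueeze(0)              # (6, 128)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)  # (6, 256)
emb = emb.flatten()

assert emb.shape == (1536,)
```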
ControlNet: Conditional Generation Control
ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to pretrained diffusion models. The Zero Convolution technique preserves the existing capabilities of the model at the beginning of training while gradually learning new conditions.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch
def controlnet_canny_generation(input_image_path, prompt):
"""ControlNet Canny Edge based image generation"""
# Load ControlNet model
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/control_v11p_sd15_canny",
torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16,
).to("cuda")
# Extract Canny Edge
canny_detector = CannyDetector()
input_image = Image.open(input_image_path)
canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)
# ControlNet-based generation
output = pipe(
prompt=prompt,
image=canny_image,
num_inference_steps=30,
guidance_scale=7.5,
controlnet_conditioning_scale=1.0,
).images[0]
return output
Training Pipeline and Data Preparation
Dataset Composition
Here is a comparison of major datasets used for training large-scale diffusion models.
| Dataset | Scale | Resolution | Usage |
|---|---|---|---|
| LAION-5B | 5.8B image-text pairs | Various | Stable Diffusion training |
| LAION-Aesthetics | 120M (filtered) | Various | High-quality fine-tuning |
| ImageNet | 1.3M | 256/512 | DiT training (class-conditional) |
| COYO-700M | 700M | Various | Multilingual training (incl. Korean) |
Fine-tuning Strategies
# LoRA Fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--dataset_name="custom_dataset" \
--resolution=512 \
--train_batch_size=4 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--lr_scheduler="cosine" \
--lr_warmup_steps=500 \
--max_train_steps=10000 \
--rank=64 \
--output_dir="./lora_output" \
--mixed_precision="fp16" \
--enable_xformers_memory_efficient_attention
# DreamBooth Fine-tuning (specific object/style learning)
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--instance_data_dir="./my_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--max_train_steps=800
Inference Optimization Techniques
Key Optimization Techniques Comparison
| Technique | Speed Improvement | Quality Impact | Memory Savings |
|---|---|---|---|
| DDIM (50 steps) | 20x | Minimal | - |
| DPM-Solver++ (20 steps) | 50x | Minimal | - |
| xFormers Memory Efficient Attention | 1.5x | None | 30-40% |
| torch.compile | 1.2-1.5x | None | - |
| VAE Tiling | - | Minimal | 70%+ |
| FP16/BF16 | 1.5-2x | Minimal | 50% |
| TensorRT | 2-4x | None | - |
Production Optimization Code
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
def optimized_sdxl_pipeline():
"""Production-optimized SDXL Pipeline"""
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
).to("cuda")
# 1. Apply fast scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config,
algorithm_type="dpmsolver++",
use_karras_sigmas=True,
)
# 2. VAE Tiling (memory savings for high-resolution generation)
pipe.enable_vae_tiling()
# 3. Attention Slicing (when VRAM is limited)
pipe.enable_attention_slicing()
# 4. torch.compile (PyTorch 2.0+)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
return pipe
# GPU memory monitoring
def monitor_gpu_memory():
"""Monitor GPU memory usage"""
allocated = torch.cuda.memory_allocated() / 1024**3
reserved = torch.cuda.memory_reserved() / 1024**3
max_allocated = torch.cuda.max_memory_allocated() / 1024**3
print(f"Allocated: {allocated:.2f} GB")
print(f"Reserved: {reserved:.2f} GB")
print(f"Peak: {max_allocated:.2f} GB")
Comprehensive Model Comparison
| Model | Year | Key Contribution | Backbone | Conditioning | Resolution |
|---|---|---|---|---|---|
| DDPM | 2020 | Practical diffusion models | U-Net | None (unconditional) | 256 |
| DDIM | 2020 | Accelerated sampling | U-Net | None | 256 |
| LDM (SD) | 2022 | Latent space diffusion | U-Net + VAE | Cross-Attention | 512 |
| DiT | 2023 | Transformer backbone | Transformer | adaLN-Zero | 256/512 |
| SDXL | 2023 | Large-scale U-Net + dual encoders | U-Net + VAE | Cross-Attention + CFG | 1024 |
| ControlNet | 2023 | Spatial condition control | Zero Conv + U-Net | Edge/Depth/Pose | 512 |
| SD3 | 2024 | MMDiT (Multi-Modal DiT) | Transformer | Flow Matching | 1024 |
Operational Considerations
GPU Memory Management
The most common issue when operating Stable Diffusion-based services is GPU OOM (Out of Memory). The following items should be checked:
- Batch size limits: a single 1024x1024 SDXL generation needs roughly 12GB of VRAM per image, so it runs comfortably on an A100 80GB but will hit OOM on a V100 16GB
- Concurrent request limits: Rate limiters must be applied to prevent GPU memory overflow
- Enable VAE Tiling: Essential for high-resolution (2048x2048+) generation
- Memory profiling: Regular GPU memory monitoring to detect memory leaks
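One way to enforce the concurrent request limit above is a simple semaphore wrapper around generation calls (a sketch; a real service would typically rely on the request queue of its serving framework instead):

```python
import threading

class GenerationSlotLimiter:
    """Caps concurrent generations so GPU memory is never oversubscribed (sketch)."""
    def __init__(self, max_concurrent=2):
        self._sem = threading.Semaphore(max_concurrent)

    def __enter__(self):
        # Blocks until a slot frees up instead of letting the GPU OOM
        self._sem.acquire()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._sem.release()
        return False

limiter = GenerationSlotLimiter(max_concurrent=1)
with limiter:
    # While one request holds the only slot, another cannot enter
    assert limiter._sem.acquire(blocking=False) is False
# After exit, the slot is free again
assert limiter._sem.acquire(blocking=False) is True
```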
Failure Case: GPU OOM Recovery
# Check GPU memory status
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Check GPU memory leaks from Python processes
fuser -v /dev/nvidia*
# Force GPU memory release (without process restart)
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"
# Service recovery procedure after OOM
# 1. Graceful shutdown of the affected worker process
# 2. Verify GPU memory release
# 3. Adjust batch size/concurrent request count
# 4. Restart worker process
# 5. Resume traffic after health check passes
NSFW Filtering
For commercial services, the Safety Checker must always be enabled. Disabling it can result in NSFW content being generated, which may cause legal issues.
# Safety Checker configuration (required for production)
pipe = StableDiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
safety_checker=None, # Only disable in development
)
# Must be enabled in production
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor

safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)
# Attach to the pipeline so generated images are screened
pipe.safety_checker = safety_checker
pipe.feature_extractor = feature_extractor
Failure Cases and Recovery Procedures
Case 1: Model Loading Failure
Disk I/O timeouts or checkpoint corruption can occur when loading large-scale models.
import os
import torch
from diffusers import StableDiffusionXLPipeline
def robust_model_loading(model_id, max_retries=3):
"""Robust model loading (with retries)"""
for attempt in range(max_retries):
try:
pipe = StableDiffusionXLPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
use_safetensors=True,
local_files_only=os.path.exists(
os.path.join(model_id, "model_index.json")
),
)
pipe = pipe.to("cuda")
# Warmup run
_ = pipe("test", num_inference_steps=1)
return pipe
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
import time
time.sleep(10)
# Clear cache and retry
torch.cuda.empty_cache()
else:
raise RuntimeError(f"Model loading failed after {max_retries} attempts")
Case 2: Image Quality Degradation (Inappropriate CFG Scale)
# CFG Scale Guidelines
guidance_scale_guidelines:
  1.0: 'No guidance boost - pure conditional prediction, weak prompt adherence'
3.0-5.0: 'Creative and diverse generation'
7.0-8.5: 'Generally recommended range - quality/diversity balance'
10.0-15.0: 'High text fidelity - risk of oversaturation'
20.0+: 'Excessive guidance - artifacts may appear'
# Troubleshooting checklist
troubleshooting:
blurry_output:
- 'Increase num_inference_steps (minimum 30+)'
- 'Switch scheduler to DPM-Solver++'
oversaturated:
- 'Lower guidance_scale to 7.0 or below'
- "Add 'oversaturated, vivid' to negative_prompt"
wrong_composition:
- 'Improve prompt structure (clear subject-verb-object)'
- 'Use ControlNet for composition control'
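The guidelines above can also be enforced in code with a small validator (a hypothetical helper, not part of diffusers) that warns on and clamps out-of-range values before they reach the pipeline:

```python
import warnings

def sanitize_guidance_scale(w, lo=3.0, hi=12.0):
    """Clamp guidance_scale into the recommended range, warning on extremes.

    Bounds follow the guideline table above; adjust per model/use case.
    """
    if w < lo:
        warnings.warn(f"guidance_scale={w} is low; output may largely ignore the prompt")
        return lo
    if w > hi:
        warnings.warn(f"guidance_scale={w} risks oversaturation and artifacts")
        return hi
    return w

assert sanitize_guidance_scale(7.5) == 7.5   # in recommended range
assert sanitize_guidance_scale(1.0) == 3.0   # clamped up
assert sanitize_guidance_scale(20.0) == 12.0 # clamped down
```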
Conclusion
Diffusion Models have evolved rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's large-scale design, and ControlNet's fine-grained control.
Currently, new paradigms like SD3's MMDiT (Multi-Modal Diffusion Transformer) and Flow Matching, as well as Consistency Models, are emerging to enable even faster and higher-quality image generation. In particular, the DiT architecture serves as the foundation for video generation models like Sora (OpenAI), and the applications of Diffusion Models are expanding beyond images to video, 3D, and audio.
From an engineering perspective, understanding the theoretical background of models is the key to optimization and debugging. Accurately grasping the role of each component, including noise schedules, CFG Scale, scheduler selection, and memory management, is essential for operating stable services in production environments.
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
- Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
- Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.
- Lilian Weng. (2021). What are Diffusion Models?