Diffusion Model Paper Survey: Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL
- Introduction
- DDPM: Foundations of Diffusion Models
- DDIM: Accelerated Sampling
- Relationship with Score-based Models
- Latent Diffusion Model (Stable Diffusion)
- Classifier-free Guidance (CFG)
- DiT: Diffusion Transformer
- SDXL: Evolution of Stable Diffusion
- ControlNet: Conditional Generation Control
- Training Pipeline and Data Preparation
- Inference Optimization Techniques
- Comprehensive Model Comparison
- Operational Considerations
- Failure Cases and Recovery Procedures
- Conclusion
- References

Introduction
In the field of image generation, diffusion models have established themselves as a new paradigm replacing GANs (Generative Adversarial Networks). Since Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, commercial services such as Stable Diffusion, DALL-E 2, and Midjourney emerged within just three years, driving the democratization of image generation.
The core idea behind diffusion models is remarkably simple: learn a forward process that gradually adds noise to data and a reverse process that removes this noise to reconstruct the data. In doing so, the model learns, at each noise level, in which direction the noise should be removed.
This article surveys the evolution of the major models in chronological order: the mathematical foundations of DDPM, DDIM's accelerated sampling, the relationship with score-based models, the Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. For each model we cover its key contributions, implementation code, performance comparisons, and operational considerations.
DDPM: Foundations of Diffusion Models
Forward Process (Adding Noise)
DDPM's forward process gradually adds Gaussian noise to the original data x_0 over T steps. The noise schedule at each step t is controlled by beta_t:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
Using the reparameterization trick, the noised image at an arbitrary timestep t can be computed directly:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, where epsilon ~ N(0, I)
Here alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product of alpha_1 through alpha_t.
import torch
import torch.nn as nn
import numpy as np

class DDPMScheduler:
    """DDPM forward-process scheduler"""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

    def add_noise(self, x_0, t, noise=None):
        """Generate the noised image at an arbitrary timestep t"""
        if noise is None:
            noise = torch.randn_like(x_0)
        sqrt_alpha_bar = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
        return x_t

    def sample_timesteps(self, batch_size):
        """Sample random timesteps for training"""
        return torch.randint(0, self.num_timesteps, (batch_size,))
Reverse Process (Denoising)
In the reverse process, starting from x_T ~ N(0, I), the trained model epsilon_theta is used to remove noise step by step.
class DDPMSampler:
    """DDPM reverse-process sampler"""
    def __init__(self, scheduler):
        self.scheduler = scheduler

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDPM reverse-diffusion sampling"""
        # Start from pure noise
        x = torch.randn(shape, device=device)
        for t in reversed(range(self.scheduler.num_timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            # Predict the noise
            predicted_noise = model(x, t_batch)
            # Compute the posterior mean
            alpha = self.scheduler.alphas[t]
            alpha_bar = self.scheduler.alphas_cumprod[t]
            beta = self.scheduler.betas[t]
            mean = (1 / torch.sqrt(alpha)) * (
                x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
            )
            # Add noise only when t > 0
            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(beta)
                x = mean + sigma * noise
            else:
                x = mean
        return x
Training Objective: Simple Loss
DDPM training minimizes the MSE between the noise predicted by the model and the actual noise:
L_simple = E[ || epsilon - epsilon_theta(x_t, t) ||^2 ]
def ddpm_training_step(model, x_0, scheduler, optimizer):
    """Single DDPM training step"""
    batch_size = x_0.shape[0]
    device = x_0.device
    # 1. Sample random timesteps
    t = scheduler.sample_timesteps(batch_size).to(device)
    # 2. Generate noise and the noised image
    noise = torch.randn_like(x_0)
    x_t = scheduler.add_noise(x_0, t, noise)
    # 3. Model predicts the noise
    predicted_noise = model(x_t, t)
    # 4. Compute the simple loss
    loss = nn.functional.mse_loss(predicted_noise, noise)
    # 5. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
DDIM: Accelerated Sampling
DDPM requires 1000 reverse-diffusion steps, which makes generation very slow. DDIM (Denoising Diffusion Implicit Models), proposed by Song et al. (2020), defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.
The key to DDIM is the eta parameter, which controls stochastic versus deterministic sampling: eta=0 is fully deterministic, while eta=1 recovers DDPM.
class DDIMSampler:
    """DDIM accelerated sampler"""
    def __init__(self, scheduler, ddim_steps=50, eta=0.0):
        self.scheduler = scheduler
        self.ddim_steps = ddim_steps
        self.eta = eta
        # Subset of timesteps (e.g., 1000 -> 50)
        self.timesteps = np.linspace(
            0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
        )[::-1]

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDIM accelerated sampling - high quality in 50 steps"""
        x = torch.randn(shape, device=device)
        for i in range(len(self.timesteps)):
            t = self.timesteps[i]
            t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            predicted_noise = model(x, t_batch)
            alpha_bar_t = self.scheduler.alphas_cumprod[t]
            # Past the final step, treat alpha_bar as 1 (fully denoised)
            alpha_bar_prev = (
                self.scheduler.alphas_cumprod[t_prev]
                if i + 1 < len(self.timesteps) else torch.tensor(1.0)
            )
            # Predict x_0
            x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
            x_0_pred = torch.clamp(x_0_pred, -1, 1)
            # Direction term
            sigma = self.eta * torch.sqrt(
                (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
            )
            direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise
            # Compute x_{t-1}
            x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction
            if self.eta > 0 and t > 0:
                x = x + sigma * torch.randn_like(x)
        return x
Relationship with Score-based Models
Song and Ermon (2019) interpreted diffusion models from the score-matching perspective. The score function is the gradient of the log density of the data distribution:
score(x) = grad_x log p(x)
DDPM's noise prediction epsilon_theta and the score function are related by:
score(x_t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)
This relationship was later unified in the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:
dx = f(x, t) dt + g(t) dw
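To make the conversion concrete, the relationship above can be implemented in a few lines. This is a minimal sketch under the interfaces used in the DDPM code earlier (an `epsilon_model` callable taking `(x, t)` and an `alphas_cumprod` tensor); the function name is ours:

```python
import torch

def noise_pred_to_score(epsilon_model, x_t, t, alphas_cumprod):
    """Reinterpret a DDPM noise predictor as a score model.

    Uses the relation score(x_t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)
    described above.
    """
    predicted_noise = epsilon_model(x_t, t)
    alpha_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    return -predicted_noise / torch.sqrt(1.0 - alpha_bar_t)
```

This is what allows the same trained network to be plugged into score-based samplers such as the probability-flow ODE.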
Latent Diffusion Model (Stable Diffusion)
Architecture Overview
The Latent Diffusion Model (LDM) of Rombach et al. (2022) performs the diffusion process in latent space rather than pixel space, dramatically reducing computational cost. This is the core architecture behind Stable Diffusion.
LDM consists of four key components.
| Component | Role | Details |
|---|---|---|
| VAE Encoder | Encodes images into latent space | Compresses a 512x512 image into a 64x64x4 latent representation |
| U-Net (Denoiser) | Predicts noise in latent space | Incorporates the text condition via cross-attention |
| VAE Decoder | Decodes latents back into images | Reconstructs a 64x64x4 latent representation into a 512x512 image |
| Text Encoder | Encodes the text prompt | Produces 77-token embeddings with CLIP ViT-L/14 |
Core Code Structure
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
class LatentDiffusionInference:
    """Stable Diffusion inference pipeline (simplified)"""
    def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            safety_checker=None
        ).to("cuda")
        # Swap in the DDIM scheduler (50-step acceleration)
        self.pipe.scheduler = DDIMScheduler.from_config(
            self.pipe.scheduler.config
        )

    def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
        """Text-to-image generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
        ).images[0]
        return image

    def generate_with_latent_control(self, prompt, seed=42):
        """Direct control of the latent space"""
        generator = torch.Generator(device="cuda").manual_seed(seed)
        # Create the latent vector directly
        latents = torch.randn(
            (1, 4, 64, 64),
            generator=generator,
            device="cuda",
            dtype=torch.float16
        )
        image = self.pipe(
            prompt=prompt,
            latents=latents,
            num_inference_steps=50,
            guidance_scale=7.5,
        ).images[0]
        return image
Cross-Attention Mechanism
In Stable Diffusion's U-Net, cross-attention injects the text condition into image generation: the Query comes from the image latent representation, while the Key and Value come from the text embeddings.
class CrossAttention(nn.Module):
    """Cross-attention layer of the Stable Diffusion U-Net"""
    def __init__(self, d_model=320, d_context=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(d_context, d_model, bias=False)
        self.to_v = nn.Linear(d_context, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: image latent representation (B, H*W, d_model)
        context: text embeddings (B, seq_len, d_context)
        """
        B, N, C = x.shape
        q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention
        scale = self.d_head ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, N, C)
        return self.to_out(out)
Classifier-free Guidance (CFG)
Classifier-free Guidance, proposed by Ho and Salimans (2022), is a key technique for controlling generation quality without a separate classifier.
During training, the conditional and unconditional models are trained jointly (the text condition is replaced with an empty string with some probability). At inference time, a weighted combination of the two predictions is used:
epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
Here w is the guidance scale: w=1 gives pure conditional generation, and larger w follows the text condition more strongly (typically 7.5-15).
def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """Single classifier-free guidance step"""
    # Process the conditional/unconditional predictions in one batch
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([null_embedding, text_embedding], dim=0)
    # One forward pass produces both predictions
    noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
    # Apply CFG
    noise_pred_guided = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )
    return noise_pred_guided
DiT: Diffusion Transformer
From U-Net to Transformer
DiT (Diffusion Transformer) by Peebles and Xie (2023) replaced the U-Net backbone of diffusion models with a Transformer. The key finding is that scaling up the Transformer (in GFLOPs) consistently improves generation quality (FID).
| Model | Backbone | Parameters | FID (ImageNet 256) | GFLOPs |
|---|---|---|---|---|
| ADM | U-Net | 554M | 10.94 | 1120 |
| LDM-4 | U-Net | 400M | 10.56 | 103 |
| DiT-S/2 | Transformer | 33M | 68.40 | 6 |
| DiT-B/2 | Transformer | 130M | 43.47 | 23 |
| DiT-L/2 | Transformer | 458M | 9.62 | 80 |
| DiT-XL/2 | Transformer | 675M | 2.27 | 119 |
adaLN-Zero Block
DiT's key innovation is the adaLN-Zero conditioning scheme. The timestep and class embeddings are injected as the scale/shift parameters of adaptive layer normalization, and the gating parameters are initialized to zero so that each block acts as the identity function at the start of training.
class DiTBlock(nn.Module):
    """DiT's adaLN-Zero Transformer block"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        # adaLN modulation: 6 parameters (shift1, scale1, gate1, shift2, scale2, gate2)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(d_model, 6 * d_model),
        )
        # Zero init - acts as the identity early in training
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x, c):
        """
        x: patch tokens (B, N, D)
        c: conditioning embedding - timestep + class (B, D)
        """
        # Produce the adaLN parameters
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.adaLN_modulation(c).chunk(6, dim=-1)
        )
        # Self-attention with adaLN
        h = self.norm1(x)
        h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        x = x + gate1.unsqueeze(1) * h
        # FFN with adaLN
        h = self.norm2(x)
        h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        h = self.mlp(h)
        x = x + gate2.unsqueeze(1) * h
        return x
Patchify Strategy
DiT splits the latent representation into p x p patches and uses them as input tokens for the Transformer. Smaller patch sizes yield more tokens, which improves quality but increases compute.
class PatchEmbed(nn.Module):
    """DiT's patchify layer"""
    def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        """(B, C, H, W) -> (B, N, D) patch token sequence"""
        x = self.proj(x)  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D)
        return x
SDXL: Evolution of Stable Diffusion
Key Improvements
SDXL by Podell et al. (2023) introduced the following key improvements over Stable Diffusion v1.5.
| Feature | SD v1.5 | SDXL Base |
|---|---|---|
| U-Net parameters | 860M | 2.6B (3x larger) |
| Text encoder | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text embedding dim | 768 | 2048 |
| Base resolution | 512x512 | 1024x1024 |
| Attention blocks | 16 | 70 |
| Refiner model | None | Dedicated refiner included |
Dual Text Encoders
One of SDXL's biggest innovations is its use of two text encoders. Combining the rich semantic representation of OpenCLIP ViT-bigG with the complementary features of CLIP ViT-L substantially improves text understanding.
from diffusers import StableDiffusionXLPipeline
import torch
class SDXLInference:
    """SDXL inference pipeline"""
    def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        ).to("cuda")
        # Memory optimizations
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()

    def generate(self, prompt, negative_prompt="", steps=30):
        """Basic SDXL generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=7.5,
            height=1024,
            width=1024,
        ).images[0]
        return image

    def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
        """Two-stage base + refiner pipeline"""
        # Base model: first 80% of the steps
        high_noise_frac = 0.8
        image = base_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_end=high_noise_frac,
            output_type="latent",
        ).images
        # Refiner: remaining 20% (sharpens fine detail)
        image = refiner_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_start=high_noise_frac,
            image=image,
        ).images[0]
        return image
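The combination step of the two encoders can be sketched as follows. This is a simplified illustration (the function name is ours): the encoders' per-token hidden states are concatenated along the feature dimension, which is where the table's 768 vs 2048 embedding dimensions come from; the actual pipeline additionally uses the bigG pooled embedding for conditioning.

```python
import torch

def combine_text_embeddings(clip_l_hidden, open_clip_g_hidden):
    """Concatenate the two encoders' hidden states along the feature dim.

    SDXL's U-Net cross-attention receives a 2048-dim context built from
    CLIP ViT-L (768-dim) and OpenCLIP ViT-bigG (1280-dim) hidden states.
    """
    assert clip_l_hidden.shape[:2] == open_clip_g_hidden.shape[:2]
    return torch.cat([clip_l_hidden, open_clip_g_hidden], dim=-1)

# Shapes as in the table above: 77 tokens, 768 + 1280 = 2048 channels
context = combine_text_embeddings(
    torch.randn(1, 77, 768), torch.randn(1, 77, 1280)
)
print(context.shape)  # torch.Size([1, 77, 2048])
```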
Size/Crop Conditioning
During training, SDXL conditions on the image's original size and crop coordinates, allowing it to learn effectively from images of varied aspect ratios. This is implemented with Fourier feature encoding.
def get_sdxl_conditioning(original_size, crop_coords, target_size):
    """Build SDXL's size/crop conditioning"""
    # Original size (height, width)
    original_size = torch.tensor(original_size, dtype=torch.float32)
    # Crop coordinates (top, left)
    crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
    # Target size (height, width)
    target_size = torch.tensor(target_size, dtype=torch.float32)
    # Fourier feature encoding
    conditioning = torch.cat([original_size, crop_coords, target_size])
    # Sinusoidal embedding frequencies
    freqs = torch.exp(
        -torch.arange(0, 128, dtype=torch.float32) * np.log(10000) / 128
    )
    emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
    return emb.flatten()
ControlNet: Conditional Generation Control
ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to a pretrained diffusion model. Its zero-convolution technique preserves the base model's existing capabilities early in training while the new condition is learned gradually.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch
def controlnet_canny_generation(input_image_path, prompt):
    """ControlNet image generation conditioned on Canny edges"""
    # Load the ControlNet model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_canny",
        torch_dtype=torch.float16,
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")
    # Extract Canny edges
    canny_detector = CannyDetector()
    input_image = Image.open(input_image_path)
    canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)
    # Generate with ControlNet conditioning
    output = pipe(
        prompt=prompt,
        image=canny_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,
    ).images[0]
    return output
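The zero-convolution idea mentioned above can be sketched in a few lines. This is a minimal illustration of the mechanism, not the diffusers implementation:

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Module):
    """1x1 convolution initialized to zero (ControlNet's zero convolution).

    At the start of training the output is exactly zero, so the frozen
    base model's behavior is untouched; the condition signal is blended
    in gradually as the weights move away from zero.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)
```

Because the residual added to the base U-Net starts at zero, the first training steps cannot degrade the pretrained model, which is what makes fine-tuning on relatively small condition datasets stable.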
Training Pipeline and Data Preparation
Dataset Composition
A comparison of the main datasets used to train large diffusion models.
| Dataset | Scale | Resolution | Use |
|---|---|---|---|
| LAION-5B | 5.8B image-text pairs | Varied | Stable Diffusion training |
| LAION-Aesthetics | 120M (filtered) | Varied | High-quality fine-tuning |
| ImageNet | 1.3M | 256/512 | DiT training (class-conditional) |
| COYO-700M | 700M | Varied | Multilingual training, including Korean |
Fine-tuning Strategies
# LoRA fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--dataset_name="custom_dataset" \
--resolution=512 \
--train_batch_size=4 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--lr_scheduler="cosine" \
--lr_warmup_steps=500 \
--max_train_steps=10000 \
--rank=64 \
--output_dir="./lora_output" \
--mixed_precision="fp16" \
--enable_xformers_memory_efficient_attention
# DreamBooth fine-tuning (learning a specific subject/style)
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--instance_data_dir="./my_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--max_train_steps=800
Inference Optimization Techniques
Comparison of Key Optimization Techniques
| Technique | Speedup | Quality Impact | Memory Savings |
|---|---|---|---|
| DDIM (50 steps) | 20x | Minimal | - |
| DPM-Solver++ (20 steps) | 50x | Minimal | - |
| xFormers memory-efficient attention | 1.5x | None | 30-40% |
| torch.compile | 1.2-1.5x | None | - |
| VAE tiling | - | Minimal | 70%+ |
| FP16/BF16 | 1.5-2x | Minimal | 50% |
| TensorRT | 2-4x | None | - |
Practical Optimization Code
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
def optimized_sdxl_pipeline():
    """Production-optimized SDXL pipeline"""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    # 1. Fast scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )
    # 2. VAE tiling (saves memory for high-resolution generation)
    pipe.enable_vae_tiling()
    # 3. Attention slicing (when VRAM is tight)
    pipe.enable_attention_slicing()
    # 4. torch.compile (PyTorch 2.0+)
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    return pipe

# GPU memory monitoring
def monitor_gpu_memory():
    """Report GPU memory usage"""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    max_allocated = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved: {reserved:.2f} GB")
    print(f"Peak: {max_allocated:.2f} GB")
Comprehensive Model Comparison
| Model | Year | Key Contribution | Backbone | Conditioning | Resolution |
|---|---|---|---|---|---|
| DDPM | 2020 | Made diffusion models practical | U-Net | None (unconditional) | 256 |
| DDIM | 2020 | Accelerated sampling | U-Net | None | 256 |
| LDM (SD) | 2022 | Latent-space diffusion | U-Net + VAE | Cross-attention | 512 |
| DiT | 2023 | Transformer backbone | Transformer | adaLN-Zero | 256/512 |
| SDXL | 2023 | Large U-Net + dual encoders | U-Net + VAE | Cross-attention + CFG | 1024 |
| ControlNet | 2023 | Spatial condition control | Zero conv + U-Net | Edges/depth/pose | 512 |
| SD3 | 2024 | MMDiT (multimodal DiT) | Transformer | Flow matching | 1024 |
Operational Considerations
GPU Memory Management
The most frequent problem when operating a Stable Diffusion-based service is GPU OOM (out of memory). Check the following.
- Limit batch size: a single 1024x1024 SDXL generation uses about 12GB on an A100 80GB and OOMs on a V100 16GB
- Limit concurrent requests: always apply a rate limiter to avoid exhausting GPU memory
- Enable VAE tiling: essential for high-resolution (2048x2048+) generation
- Memory profiling: monitor GPU memory periodically to detect leaks
Failure Case: Recovering from GPU OOM
# Check GPU memory state
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Check which Python processes hold GPU memory
fuser -v /dev/nvidia*
# Force-release cached GPU memory (without restarting the process)
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"
# Service recovery procedure after an OOM
# 1. Gracefully shut down the affected worker process
# 2. Confirm GPU memory has been released
# 3. Reduce the batch size / concurrent request limit
# 4. Restart the worker process
# 5. Restore traffic after health checks pass
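For a single request, the workload-reduction step of the procedure above can also be handled in-process before resorting to a worker restart. A minimal sketch (the function and halving policy are illustrative, assuming a diffusers-style `pipe` callable):

```python
import torch

def generate_with_oom_fallback(pipe, prompt, height=1024, width=1024):
    """Retry at reduced resolution when generation hits CUDA OOM.

    A simplified in-process version of the recovery procedure above:
    reduce the workload instead of immediately restarting the worker.
    """
    try:
        return pipe(prompt, height=height, width=width).images[0]
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return pipe(prompt, height=height // 2, width=width // 2).images[0]
```

If the retry also fails, fall back to the full procedure (graceful shutdown and restart), since repeated OOMs usually indicate fragmentation or a leak.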
NSFW Filtering
Commercial services must enable the safety checker. Disabling it can allow NSFW content to be generated, which may create legal liability.
# Safety checker configuration (required in production)
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    safety_checker=None,  # disable only in development environments
)
# In production, always enable it
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor
safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)
Failure Cases and Recovery Procedures
Case 1: Model Loading Failure
Loading large models can hit disk I/O timeouts or corrupted checkpoints.
import os
import time
import torch
from diffusers import StableDiffusionXLPipeline

def robust_model_loading(model_id, max_retries=3):
    """Robust model loading with retries"""
    for attempt in range(max_retries):
        try:
            pipe = StableDiffusionXLPipeline.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                use_safetensors=True,
                local_files_only=os.path.exists(
                    os.path.join(model_id, "model_index.json")
                ),
            )
            pipe = pipe.to("cuda")
            # Warm-up run
            _ = pipe("test", num_inference_steps=1)
            return pipe
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(10)
                # Clear the cache before retrying
                torch.cuda.empty_cache()
            else:
                raise RuntimeError(f"Model loading failed after {max_retries} attempts")
Case 2: Degraded Image Quality (Inappropriate CFG Scale)
# CFG scale guidelines
guidance_scale_guidelines:
  1.0: 'Condition mostly ignored - nearly random generation'
  3.0-5.0: 'Creative, diverse generations'
  7.0-8.5: 'Recommended range - balance of quality and diversity'
  10.0-15.0: 'High text fidelity - risk of oversaturation'
  20.0+: 'Excessive guidance - artifacts appear'
# Troubleshooting checklist
troubleshooting:
  blurry_output:
    - 'Increase num_inference_steps (at least 30)'
    - 'Switch the scheduler to DPM-Solver++'
  oversaturated:
    - 'Lower guidance_scale to 7.0 or below'
    - "Add 'oversaturated, vivid' to the negative_prompt"
  wrong_composition:
    - 'Improve prompt structure (clear subject-verb-object)'
    - 'Control composition with ControlNet'
Conclusion
Diffusion models have advanced rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's scale-up, and ControlNet's fine-grained control.
Today, new paradigms such as SD3's MMDiT (Multi-Modal Diffusion Transformer), flow matching, and consistency models are enabling faster, higher-quality generation. The DiT architecture in particular underpins video generation models such as OpenAI's Sora, and diffusion models are expanding beyond images into video, 3D, and audio.
From an engineering perspective, understanding the theory behind these models is the key to optimization and debugging. Only by knowing exactly what each component does, including noise schedules, CFG scale, scheduler choice, and memory management, can you operate a stable production service.
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
- Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
- Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.
- Lilian Weng. (2021). What are Diffusion Models?
Diffusion Model Paper Survey: Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL
- Introduction
- DDPM: Foundations of Diffusion Models
- DDIM: Accelerated Sampling
- Relationship with Score-based Models
- Latent Diffusion Model (Stable Diffusion)
- Classifier-free Guidance (CFG)
- DiT: Diffusion Transformer
- SDXL: Evolution of Stable Diffusion
- ControlNet: Conditional Generation Control
- Training Pipeline and Data Preparation
- Inference Optimization Techniques
- Comprehensive Model Comparison
- Operational Considerations
- Failure Cases and Recovery Procedures
- Conclusion
- References

Introduction
In the field of image generation, Diffusion Models have established themselves as a new paradigm replacing GANs (Generative Adversarial Networks). Since Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, commercial services like Stable Diffusion, DALL-E 2, and Midjourney emerged within just three years, driving the democratization of image generation.
The core idea behind Diffusion Models is remarkably simple. It involves learning a Forward Process that gradually adds noise to data and a Reverse Process that removes this noise in reverse to reconstruct the data. Through this process, the model learns "which direction to remove noise" at each noise level.
In this article, we survey the evolution of major models chronologically: from the mathematical foundations of DDPM to DDIM's accelerated sampling, the relationship with score-based models, Latent Diffusion (Stable Diffusion) architecture, Classifier-free Guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. We comprehensively cover each model's key contributions, implementation code, performance comparisons, and operational considerations.
DDPM: Foundations of Diffusion Models
Forward Process (Adding Noise)
DDPM's Forward Process gradually adds Gaussian noise to the original data x_0 over T steps. The noise schedule at each step t is controlled by beta_t.
Using the reparameterization trick, we can directly compute the noised image at any arbitrary timestep t.
Here, alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product from alpha_1 to alpha_t.
import torch
import torch.nn as nn
import numpy as np
class DDPMScheduler:
"""DDPM Forward Process Scheduler"""
def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
self.num_timesteps = num_timesteps
# Linear noise schedule
self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
self.alphas = 1.0 - self.betas
self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
def add_noise(self, x_0, t, noise=None):
"""Generate noised image at arbitrary timestep t"""
if noise is None:
noise = torch.randn_like(x_0)
sqrt_alpha_bar = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
return x_t
def sample_timesteps(self, batch_size):
"""Sample random timesteps for training"""
return torch.randint(0, self.num_timesteps, (batch_size,))
Reverse Process (Denoising)
In the Reverse Process, starting from x_T ~ N(0, I), the trained model epsilon_theta is used to progressively remove noise step by step.
class DDPMSampler:
"""DDPM Reverse Process Sampler"""
def __init__(self, scheduler):
self.scheduler = scheduler
@torch.no_grad()
def sample(self, model, shape, device):
"""DDPM reverse diffusion sampling"""
# Start from pure noise
x = torch.randn(shape, device=device)
for t in reversed(range(self.scheduler.num_timesteps)):
t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
# Predict noise
predicted_noise = model(x, t_batch)
# Compute mean
alpha = self.scheduler.alphas[t]
alpha_bar = self.scheduler.alphas_cumprod[t]
beta = self.scheduler.betas[t]
mean = (1 / torch.sqrt(alpha)) * (
x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
)
# Add noise only when t > 0
if t > 0:
noise = torch.randn_like(x)
sigma = torch.sqrt(beta)
x = mean + sigma * noise
else:
x = mean
return x
Training Objective: Simple Loss
DDPM training minimizes the MSE between the model-predicted noise and the actual noise.
def ddpm_training_step(model, x_0, scheduler, optimizer):
"""DDPM training single step"""
batch_size = x_0.shape[0]
device = x_0.device
# 1. Sample random timesteps
t = scheduler.sample_timesteps(batch_size).to(device)
# 2. Generate noise and noised image
noise = torch.randn_like(x_0)
x_t = scheduler.add_noise(x_0, t, noise)
# 3. Model predicts noise
predicted_noise = model(x_t, t)
# 4. Compute Simple Loss
loss = nn.functional.mse_loss(predicted_noise, noise)
# 5. Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item()
DDIM: Accelerated Sampling
DDPM requires 1000 steps of reverse diffusion, making generation extremely slow. DDIM (Denoising Diffusion Implicit Models) proposed by Song et al. (2020) defines a non-Markovian diffusion process that enables 10-50x faster sampling with the same trained model.
The key to DDIM is the eta parameter that controls stochastic/deterministic sampling. When eta=0, the sampling is fully deterministic; when eta=1, it becomes identical to DDPM.
class DDIMSampler:
"""DDIM Accelerated Sampler"""
def __init__(self, scheduler, ddim_steps=50, eta=0.0):
self.scheduler = scheduler
self.ddim_steps = ddim_steps
self.eta = eta
# Generate subset timesteps (e.g., 1000 -> 50)
self.timesteps = np.linspace(
0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
)[::-1]
@torch.no_grad()
def sample(self, model, shape, device):
"""DDIM accelerated sampling - high quality in 50 steps"""
x = torch.randn(shape, device=device)
for i in range(len(self.timesteps)):
t = self.timesteps[i]
t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0
t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
predicted_noise = model(x, t_batch)
alpha_bar_t = self.scheduler.alphas_cumprod[t]
alpha_bar_prev = self.scheduler.alphas_cumprod[t_prev]
# Predict x_0
x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
x_0_pred = torch.clamp(x_0_pred, -1, 1)
# Compute direction
sigma = self.eta * torch.sqrt(
(1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
)
direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise
# Compute x_{t-1}
x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction
if self.eta > 0 and t > 0:
x = x + sigma * torch.randn_like(x)
return x
Relationship with Score-based Models
Song and Ermon (2019) interpreted diffusion models from the Score Matching perspective. The score function is the gradient of the log density of the data distribution.
DDPM's noise prediction epsilon_theta and the score function have the following relationship:
This relationship was unified through the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time as:
Latent Diffusion Model (Stable Diffusion)
Architecture Overview
Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically reduced computational costs by performing the diffusion process in latent space rather than pixel space. This is the core architecture behind Stable Diffusion.
LDM consists of three key components:
| Component | Role | Details |
|---|---|---|
| VAE Encoder | Encode images to latent space | Compress 512x512 images to 64x64x4 latent representations |
| U-Net (Denoiser) | Predict noise in latent space | Incorporates text conditions via Cross-Attention |
| VAE Decoder | Decode latents to images | Reconstruct 64x64x4 latent representations to 512x512 images |
| Text Encoder | Encode text prompts | Generate 77-token embeddings using CLIP ViT-L/14 |
Core Code Structure
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
class LatentDiffusionInference:
"""Stable Diffusion Inference Pipeline (Simplified)"""
def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
self.pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
safety_checker=None
).to("cuda")
# Switch to DDIM scheduler (accelerate with 50 steps)
self.pipe.scheduler = DDIMScheduler.from_config(
self.pipe.scheduler.config
)
def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
"""Text-to-image generation"""
image = self.pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=num_steps,
guidance_scale=guidance_scale,
).images[0]
return image
def generate_with_latent_control(self, prompt, seed=42):
"""Direct latent space control"""
generator = torch.Generator(device="cuda").manual_seed(seed)
# Generate latent vector directly
latents = torch.randn(
(1, 4, 64, 64),
generator=generator,
device="cuda",
dtype=torch.float16
)
image = self.pipe(
prompt=prompt,
latents=latents,
num_inference_steps=50,
guidance_scale=7.5,
).images[0]
return image
Cross-Attention Mechanism
In Stable Diffusion's U-Net, Cross-Attention incorporates text conditions into image generation. Query is generated from the image latent representation, while Key and Value come from the text embeddings.
class CrossAttention(nn.Module):
"""Cross-Attention Layer in Stable Diffusion U-Net"""
def __init__(self, d_model=320, d_context=768, n_heads=8):
super().__init__()
self.n_heads = n_heads
self.d_head = d_model // n_heads
self.to_q = nn.Linear(d_model, d_model, bias=False)
self.to_k = nn.Linear(d_context, d_model, bias=False)
self.to_v = nn.Linear(d_context, d_model, bias=False)
self.to_out = nn.Linear(d_model, d_model)
def forward(self, x, context):
"""
x: Image latent representation (B, H*W, d_model)
context: Text embeddings (B, seq_len, d_context)
"""
B, N, C = x.shape
q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
# Scaled Dot-Product Attention
scale = self.d_head ** -0.5
attn = torch.matmul(q, k.transpose(-2, -1)) * scale
attn = torch.softmax(attn, dim=-1)
out = torch.matmul(attn, v)
out = out.transpose(1, 2).contiguous().view(B, N, C)
return self.to_out(out)
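The manual attention math above can be cross-checked against PyTorch 2.0+'s built-in `torch.nn.functional.scaled_dot_product_attention`; a minimal sketch with random tensors standing in for real image/text features:

```python
import torch
import torch.nn.functional as F

# Random Q/K/V in multi-head layout (B, heads, seq, d_head)
torch.manual_seed(0)
q = torch.randn(2, 8, 64, 40)   # image latent tokens as queries
k = torch.randn(2, 8, 77, 40)   # 77 text tokens as keys
v = torch.randn(2, 8, 77, 40)

# Manual scaled dot-product attention (as in CrossAttention.forward)
scale = q.shape[-1] ** -0.5
attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
manual = attn @ v

# Fused kernel (PyTorch 2.0+); default scale is also 1/sqrt(d_head)
fused = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(manual, fused, atol=1e-5)
```

In practice the fused kernel is both faster and more memory-efficient, which is exactly what optimizations like xFormers exploit.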
Classifier-free Guidance (CFG)
Classifier-free Guidance proposed by Ho and Salimans (2022) is a key technique for controlling generation quality without a separate classifier.
During training, the conditional and unconditional models are trained jointly (the text condition is replaced with an empty string with some probability, commonly around 10%). During inference, the two predictions are combined by extrapolation: epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond).
Here, w is the guidance scale. w=1 reduces to the pure conditional prediction, and larger w pushes the generation to follow the text condition more strongly (typically 7.5-15).
def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
"""Classifier-free Guidance single step"""
# Process conditional/unconditional predictions as a single batch
x_in = torch.cat([x_t, x_t], dim=0)
t_in = torch.cat([t, t], dim=0)
c_in = torch.cat([null_embedding, text_embedding], dim=0)
# Generate both predictions in a single forward pass
noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
# Apply CFG
noise_pred_guided = noise_pred_uncond + guidance_scale * (
noise_pred_cond - noise_pred_uncond
)
return noise_pred_guided
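A quick sanity check of the CFG formula, using plain tensors rather than a real U-Net: at w=1 the guided prediction reduces to the pure conditional prediction, and at w=0 to the unconditional one.

```python
import torch

def cfg_combine(noise_uncond, noise_cond, w):
    """CFG combination: epsilon_uncond + w * (epsilon_cond - epsilon_uncond)"""
    return noise_uncond + w * (noise_cond - noise_uncond)

torch.manual_seed(0)
eps_uncond = torch.randn(1, 4, 64, 64)
eps_cond = torch.randn(1, 4, 64, 64)

# w = 1.0 -> pure conditional prediction
assert torch.allclose(cfg_combine(eps_uncond, eps_cond, 1.0), eps_cond, atol=1e-5)

# w = 0.0 -> pure unconditional prediction
assert torch.allclose(cfg_combine(eps_uncond, eps_cond, 0.0), eps_uncond, atol=1e-5)
```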
DiT: Diffusion Transformer
From U-Net to Transformer
DiT (Diffusion Transformer) by Peebles and Xie (2023) replaced the U-Net backbone of diffusion models with a Transformer. The key finding is that scaling up the Transformer's compute (GFLOPs) consistently improves generation quality (lower FID).
| Model | Backbone | Parameters | FID (ImageNet 256) | GFLOPs |
|---|---|---|---|---|
| ADM | U-Net | 554M | 10.94 | 1120 |
| LDM-4 | U-Net | 400M | 10.56 | 103 |
| DiT-S/2 | Transformer | 33M | 68.40 | 6 |
| DiT-B/2 | Transformer | 130M | 43.47 | 23 |
| DiT-L/2 | Transformer | 458M | 9.62 | 80 |
| DiT-XL/2 | Transformer | 675M | 2.27 | 119 |
adaLN-Zero Block
The key innovation of DiT is the adaLN-Zero conditioning approach. Timestep and class embeddings are injected as scale/shift parameters of Adaptive Layer Normalization, with gating parameters initialized to zero so that the block acts as an identity function (residual connection) at the start of training.
class DiTBlock(nn.Module):
"""DiT adaLN-Zero Transformer Block"""
def __init__(self, d_model, n_heads):
super().__init__()
self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
self.mlp = nn.Sequential(
nn.Linear(d_model, d_model * 4),
nn.GELU(),
nn.Linear(d_model * 4, d_model),
)
        # adaLN modulation: 6 parameters (shift1, scale1, gate1, shift2, scale2, gate2)
self.adaLN_modulation = nn.Sequential(
nn.SiLU(),
nn.Linear(d_model, 6 * d_model),
)
# Zero initialization - acts as identity at start of training
nn.init.zeros_(self.adaLN_modulation[-1].weight)
nn.init.zeros_(self.adaLN_modulation[-1].bias)
def forward(self, x, c):
"""
x: Patch tokens (B, N, D)
c: Condition embedding - timestep + class (B, D)
"""
# Generate adaLN parameters
shift1, scale1, gate1, shift2, scale2, gate2 = (
self.adaLN_modulation(c).chunk(6, dim=-1)
)
# Self-Attention with adaLN
h = self.norm1(x)
h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
h, _ = self.attn(h, h, h)
x = x + gate1.unsqueeze(1) * h
# FFN with adaLN
h = self.norm2(x)
h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
h = self.mlp(h)
x = x + gate2.unsqueeze(1) * h
return x
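Why zero initialization makes the block an identity at the start of training can be seen directly: a zero-initialized Linear outputs zeros regardless of its input, so every gate is zero and each residual branch contributes nothing. A minimal sketch:

```python
import torch
import torch.nn as nn

d_model = 16
adaLN = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 6 * d_model))
nn.init.zeros_(adaLN[-1].weight)
nn.init.zeros_(adaLN[-1].bias)

c = torch.randn(2, d_model)
shift1, scale1, gate1, shift2, scale2, gate2 = adaLN(c).chunk(6, dim=-1)

# All modulation parameters are exactly zero at init...
assert float(gate1.abs().max()) == 0.0 and float(gate2.abs().max()) == 0.0

# ...so x + gate * branch(x) == x, whatever the branch computes
x = torch.randn(2, 5, d_model)
branch_out = torch.randn(2, 5, d_model)
assert torch.equal(x + gate1.unsqueeze(1) * branch_out, x)
```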
Patchify Strategy
DiT splits the latent representation into p x p patches to use as Transformer input tokens. Smaller patch sizes result in more tokens, improving performance but increasing computational cost.
class PatchEmbed(nn.Module):
"""DiT Patchify Layer"""
def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
super().__init__()
self.patch_size = patch_size
self.proj = nn.Conv2d(
in_channels, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
"""(B, C, H, W) -> (B, N, D) patch token sequence"""
x = self.proj(x) # (B, D, H/p, W/p)
x = x.flatten(2).transpose(1, 2) # (B, N, D)
return x
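For example, the 32x32x4 latent of a 256x256 image yields (32/p)^2 tokens: 256 tokens at p=2 (the "/2" in DiT-XL/2) versus 64 at p=4. A quick shape check of the patchify idea, using a smaller embedding dimension than DiT's 1152 for brevity:

```python
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 32, 32)  # VAE latent of a 256x256 image

for p, expected_tokens in [(2, 256), (4, 64), (8, 16)]:
    proj = nn.Conv2d(4, 64, kernel_size=p, stride=p)  # patchify convolution
    tokens = proj(latent).flatten(2).transpose(1, 2)  # (B, N, D)
    assert tokens.shape == (1, expected_tokens, 64)
```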
SDXL: Evolution of Stable Diffusion
Key Improvements
SDXL by Podell et al. (2023) introduced the following core improvements over Stable Diffusion v1.5:
| Feature | SD v1.5 | SDXL Base |
|---|---|---|
| U-Net Parameters | 860M | 2.6B (3x increase) |
| Text Encoder | CLIP ViT-L/14 | OpenCLIP ViT-bigG + CLIP ViT-L |
| Text Embedding Dimension | 768 | 2048 |
| Default Resolution | 512x512 | 1024x1024 |
| Attention Blocks | 16 | 70 |
| Refiner Model | None | Dedicated Refiner included |
Dual Text Encoders
One of SDXL's greatest innovations is the use of two text encoders. It combines the rich semantic representations from OpenCLIP ViT-bigG with complementary features from CLIP ViT-L, significantly improving text understanding.
from diffusers import StableDiffusionXLPipeline
import torch
class SDXLInference:
"""SDXL Inference Pipeline"""
def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )
        # Memory optimization: enable_model_cpu_offload manages device
        # placement itself, so do not call .to("cuda") beforehand
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()
def generate(self, prompt, negative_prompt="", steps=30):
"""SDXL basic generation"""
image = self.pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=steps,
guidance_scale=7.5,
height=1024,
width=1024,
).images[0]
return image
def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
"""Base + Refiner two-stage pipeline"""
# Base model: 80% of total steps
high_noise_frac = 0.8
image = base_pipe(
prompt=prompt,
num_inference_steps=40,
denoising_end=high_noise_frac,
output_type="latent",
).images
# Refiner: remaining 20% (enhance fine details)
image = refiner_pipe(
prompt=prompt,
num_inference_steps=40,
denoising_start=high_noise_frac,
image=image,
).images[0]
return image
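The 2048-dim embedding in the table above comes from concatenating the two encoders' per-token features along the channel axis (768 from CLIP ViT-L, 1280 from OpenCLIP ViT-bigG); a shape-only sketch with random tensors standing in for real encoder outputs:

```python
import torch

batch, seq_len = 1, 77
clip_l_emb = torch.randn(batch, seq_len, 768)      # CLIP ViT-L/14 features
open_clip_emb = torch.randn(batch, seq_len, 1280)  # OpenCLIP ViT-bigG features

# SDXL concatenates per-token features along the channel dimension
text_emb = torch.cat([clip_l_emb, open_clip_emb], dim=-1)
assert text_emb.shape == (1, 77, 2048)
```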
Size/Crop Conditioning
SDXL provides the original image size and crop coordinates as conditions during training, enabling effective learning of images with diverse aspect ratios. This is implemented using Fourier Feature Encoding.
import torch
import numpy as np

def get_sdxl_conditioning(original_size, crop_coords, target_size):
"""Generate SDXL size/crop conditioning"""
# Original size (height, width)
original_size = torch.tensor(original_size, dtype=torch.float32)
# Crop coordinates (top, left)
crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
# Target size (height, width)
target_size = torch.tensor(target_size, dtype=torch.float32)
# Fourier Feature Encoding
conditioning = torch.cat([original_size, crop_coords, target_size])
# Sinusoidal embedding
freqs = torch.exp(
-torch.arange(0, 128) * np.log(10000) / 128
)
emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
return emb.flatten()
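With the 128-frequency choice used here, the function above maps its 6 conditioning scalars to a flat vector of 6 x 256 sin/cos features = 1536 dims. A self-contained check that mirrors the computation:

```python
import torch
import numpy as np

# Six conditioning scalars: original (h, w), crop (top, left), target (h, w)
cond = torch.tensor([1024.0, 1024.0, 0.0, 0.0, 1024.0, 1024.0])

freqs = torch.exp(-torch.arange(0, 128) * np.log(10000) / 128)
emb = cond.unsqueeze(-1) * freqs.unsqueeze(0)              # (6, 128)
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)  # (6, 256)
emb = emb.flatten()

assert emb.shape == (1536,)
```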
ControlNet: Conditional Generation Control
ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to pretrained diffusion models. The Zero Convolution technique preserves the existing capabilities of the model at the beginning of training while gradually learning new conditions.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch
def controlnet_canny_generation(input_image_path, prompt):
"""ControlNet Canny Edge based image generation"""
# Load ControlNet model
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/control_v11p_sd15_canny",
torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16,
).to("cuda")
# Extract Canny Edge
canny_detector = CannyDetector()
input_image = Image.open(input_image_path)
canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)
# ControlNet-based generation
output = pipe(
prompt=prompt,
image=canny_image,
num_inference_steps=30,
guidance_scale=7.5,
controlnet_conditioning_scale=1.0,
).images[0]
return output
Training Pipeline and Data Preparation
Dataset Composition
Here is a comparison of major datasets used for training large-scale diffusion models.
| Dataset | Scale | Resolution | Usage |
|---|---|---|---|
| LAION-5B | 5.8B image-text pairs | Various | Stable Diffusion training |
| LAION-Aesthetics | 120M (filtered) | Various | High-quality fine-tuning |
| ImageNet | 1.3M | 256/512 | DiT training (class-conditional) |
| COYO-700M | 700M | Various | Multilingual training (incl. Korean) |
Fine-tuning Strategies
# LoRA Fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--dataset_name="custom_dataset" \
--resolution=512 \
--train_batch_size=4 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--lr_scheduler="cosine" \
--lr_warmup_steps=500 \
--max_train_steps=10000 \
--rank=64 \
--output_dir="./lora_output" \
--mixed_precision="fp16" \
--enable_xformers_memory_efficient_attention
# DreamBooth Fine-tuning (specific object/style learning)
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
--instance_data_dir="./my_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--max_train_steps=800
Inference Optimization Techniques
Key Optimization Techniques Comparison
| Technique | Speed Improvement | Quality Impact | Memory Savings |
|---|---|---|---|
| DDIM (50 steps) | 20x | Minimal | - |
| DPM-Solver++ (20 steps) | 50x | Minimal | - |
| xFormers Memory Efficient Attention | 1.5x | None | 30-40% |
| torch.compile | 1.2-1.5x | None | - |
| VAE Tiling | - | Minimal | 70%+ |
| FP16/BF16 | 1.5-2x | Minimal | 50% |
| TensorRT | 2-4x | None | - |
Production Optimization Code
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
def optimized_sdxl_pipeline():
"""Production-optimized SDXL Pipeline"""
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
).to("cuda")
# 1. Apply fast scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config,
algorithm_type="dpmsolver++",
use_karras_sigmas=True,
)
# 2. VAE Tiling (memory savings for high-resolution generation)
pipe.enable_vae_tiling()
# 3. Attention Slicing (when VRAM is limited)
pipe.enable_attention_slicing()
# 4. torch.compile (PyTorch 2.0+)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
return pipe
# GPU memory monitoring
def monitor_gpu_memory():
"""Monitor GPU memory usage"""
allocated = torch.cuda.memory_allocated() / 1024**3
reserved = torch.cuda.memory_reserved() / 1024**3
max_allocated = torch.cuda.max_memory_allocated() / 1024**3
print(f"Allocated: {allocated:.2f} GB")
print(f"Reserved: {reserved:.2f} GB")
print(f"Peak: {max_allocated:.2f} GB")
Comprehensive Model Comparison
| Model | Year | Key Contribution | Backbone | Conditioning | Resolution |
|---|---|---|---|---|---|
| DDPM | 2020 | Practical diffusion models | U-Net | None (unconditional) | 256 |
| DDIM | 2020 | Accelerated sampling | U-Net | None | 256 |
| LDM (SD) | 2022 | Latent space diffusion | U-Net + VAE | Cross-Attention | 512 |
| DiT | 2023 | Transformer backbone | Transformer | adaLN-Zero | 256/512 |
| SDXL | 2023 | Large-scale U-Net + dual encoders | U-Net + VAE | Cross-Attention + CFG | 1024 |
| ControlNet | 2023 | Spatial condition control | Zero Conv + U-Net | Edge/Depth/Pose | 512 |
| SD3 | 2024 | MMDiT (Multi-Modal DiT) | Transformer | Flow Matching | 1024 |
Operational Considerations
GPU Memory Management
The most common issue when operating Stable Diffusion-based services is GPU OOM (Out of Memory). The following items should be checked:
- Batch size limits: a single 1024x1024 SDXL generation needs roughly 12GB of VRAM per image, so it runs comfortably on an A100 80GB but will hit OOM on a V100 16GB
- Concurrent request limits: Rate limiters must be applied to prevent GPU memory overflow
- Enable VAE Tiling: Essential for high-resolution (2048x2048+) generation
- Memory profiling: Regular GPU memory monitoring to detect memory leaks
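One way to enforce the concurrent request limit above is a simple semaphore wrapper around generation calls (a sketch; a real service would typically rely on the request queue of its serving framework instead):

```python
import threading

class GenerationSlotLimiter:
    """Caps concurrent generations so GPU memory is never oversubscribed (sketch)."""
    def __init__(self, max_concurrent=2):
        self._sem = threading.Semaphore(max_concurrent)

    def __enter__(self):
        # Blocks until a slot frees up instead of letting the GPU OOM
        self._sem.acquire()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._sem.release()
        return False

limiter = GenerationSlotLimiter(max_concurrent=1)
with limiter:
    # While one request holds the only slot, another cannot enter
    assert limiter._sem.acquire(blocking=False) is False
# After exit, the slot is free again
assert limiter._sem.acquire(blocking=False) is True
```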
Failure Case: GPU OOM Recovery
# Check GPU memory status
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Check GPU memory leaks from Python processes
fuser -v /dev/nvidia*
# Force GPU memory release (without process restart)
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"
# Service recovery procedure after OOM
# 1. Graceful shutdown of the affected worker process
# 2. Verify GPU memory release
# 3. Adjust batch size/concurrent request count
# 4. Restart worker process
# 5. Resume traffic after health check passes
NSFW Filtering
For commercial services, the Safety Checker must always be enabled. Disabling it can result in NSFW content being generated, which may cause legal issues.
# Safety Checker configuration (required for production)
pipe = StableDiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
safety_checker=None, # Only disable in development
)
# Must be enabled in production
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor

safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)
# Attach to the pipeline so generated images are screened
pipe.safety_checker = safety_checker
pipe.feature_extractor = feature_extractor
Failure Cases and Recovery Procedures
Case 1: Model Loading Failure
Disk I/O timeouts or checkpoint corruption can occur when loading large-scale models.
import os
import torch
from diffusers import StableDiffusionXLPipeline
def robust_model_loading(model_id, max_retries=3):
"""Robust model loading (with retries)"""
for attempt in range(max_retries):
try:
pipe = StableDiffusionXLPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
use_safetensors=True,
local_files_only=os.path.exists(
os.path.join(model_id, "model_index.json")
),
)
pipe = pipe.to("cuda")
# Warmup run
_ = pipe("test", num_inference_steps=1)
return pipe
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
import time
time.sleep(10)
# Clear cache and retry
torch.cuda.empty_cache()
else:
raise RuntimeError(f"Model loading failed after {max_retries} attempts")
Case 2: Image Quality Degradation (Inappropriate CFG Scale)
# CFG Scale Guidelines
guidance_scale_guidelines:
  1.0: 'No guidance boost - pure conditional prediction, weak prompt adherence'
3.0-5.0: 'Creative and diverse generation'
7.0-8.5: 'Generally recommended range - quality/diversity balance'
10.0-15.0: 'High text fidelity - risk of oversaturation'
20.0+: 'Excessive guidance - artifacts may appear'
# Troubleshooting checklist
troubleshooting:
blurry_output:
- 'Increase num_inference_steps (minimum 30+)'
- 'Switch scheduler to DPM-Solver++'
oversaturated:
- 'Lower guidance_scale to 7.0 or below'
- "Add 'oversaturated, vivid' to negative_prompt"
wrong_composition:
- 'Improve prompt structure (clear subject-verb-object)'
- 'Use ControlNet for composition control'
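The guidelines above can also be enforced in code with a small validator (a hypothetical helper, not part of diffusers) that warns on and clamps out-of-range values before they reach the pipeline:

```python
import warnings

def sanitize_guidance_scale(w, lo=3.0, hi=12.0):
    """Clamp guidance_scale into the recommended range, warning on extremes.

    Bounds follow the guideline table above; adjust per model/use case.
    """
    if w < lo:
        warnings.warn(f"guidance_scale={w} is low; output may largely ignore the prompt")
        return lo
    if w > hi:
        warnings.warn(f"guidance_scale={w} risks oversaturation and artifacts")
        return hi
    return w

assert sanitize_guidance_scale(7.5) == 7.5   # in recommended range
assert sanitize_guidance_scale(1.0) == 3.0   # clamped up
assert sanitize_guidance_scale(20.0) == 12.0 # clamped down
```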
Conclusion
Diffusion Models have evolved rapidly, building on DDPM's theoretical foundations with DDIM's accelerated sampling, Latent Diffusion's efficient architecture, Classifier-free Guidance's quality control, DiT's scalability, SDXL's large-scale design, and ControlNet's fine-grained control.
Currently, new paradigms like SD3's MMDiT (Multi-Modal Diffusion Transformer) and Flow Matching, as well as Consistency Models, are emerging to enable even faster and higher-quality image generation. In particular, the DiT architecture serves as the foundation for video generation models like Sora (OpenAI), and the applications of Diffusion Models are expanding beyond images to video, 3D, and audio.
From an engineering perspective, understanding the theoretical background of models is the key to optimization and debugging. Accurately grasping the role of each component, including noise schedules, CFG Scale, scheduler selection, and memory management, is essential for operating stable services in production environments.
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
- Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
- Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.
- Lilian Weng. (2021). What are Diffusion Models?