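Diffusion Model Paper Survey: The Evolution of Image Generation from DDPM to Stable Diffusion, DiT, and SDXL

Introduction
DDPM: The Foundation of Diffusion Models
DDIM: Accelerated Sampling
Relationship to Score-based Models
Latent Diffusion Model (Stable Diffusion)
Classifier-free Guidance (CFG)
DiT: Diffusion Transformer
SDXL: The Evolution of Stable Diffusion
ControlNet: Controlling Conditional Generation
Training Pipeline and Data Preparation
- Dataset Composition
- Fine-tuning Strategies
Inference Optimization Techniques
- Comparison of Key Optimization Techniques
- Practical Optimization Code
Overall Model Comparison
Operational Considerations
Failure Cases and Recovery Procedures
- Case 1: Model Loading Failure
- Case 2: Degraded Image Quality (Inappropriate CFG Scale)
Conclusion
References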


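Introduction

In image generation, diffusion models have established themselves as a new paradigm that has largely displaced GANs (Generative Adversarial Networks). After Ho et al. published DDPM (Denoising Diffusion Probabilistic Models) in 2020, commercial services such as Stable Diffusion, DALL-E 2, and Midjourney appeared within about two years (all in 2022) and brought image generation to a mass audience.

The core idea of a diffusion model is surprisingly simple. A forward process gradually adds noise to the data, and the model learns a reverse process that removes this noise step by step to recover the data. Along the way, the model learns, for each noise level, "in which direction the noise should be removed."

This article surveys the evolution of the major models in chronological order: the mathematical foundations of DDPM, accelerated sampling with DDIM, the relationship to score-based models, the architecture of Latent Diffusion (Stable Diffusion), classifier-free guidance, DiT (Diffusion Transformer), SDXL, and ControlNet. For each model it covers the key contribution, implementation code, performance comparison, and operational considerations.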

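DDPM: The Foundation of Diffusion Models

Forward Process (Adding Noise)

The DDPM forward process gradually adds Gaussian noise to the original data x_0 over T steps. The amount of noise added at each step t is set by the noise schedule beta_t.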

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)

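Because a composition of Gaussian steps is again Gaussian, the reparameterization trick lets us compute the noisy image at any timestep t directly from x_0, without iterating through the intermediate steps.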

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Here alpha_t = 1 - beta_t, and alpha_bar_t is the cumulative product of alpha_1 through alpha_t.

import torch
import torch.nn as nn
import numpy as np

class DDPMScheduler:
    """DDPM forward-process scheduler"""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

    def add_noise(self, x_0, t, noise=None):
        """Produce the noisy image at timesteps t (one timestep per batch element)"""
        if noise is None:
            noise = torch.randn_like(x_0)

        # The lookup tables live on the CPU; move them to x_0's device so that
        # indexing with a GPU-resident t does not raise a device-mismatch error
        sqrt_alpha_bar = self.sqrt_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_bar = self.sqrt_one_minus_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)

        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
        return x_t

    def sample_timesteps(self, batch_size):
        """Sample random timesteps for training"""
        return torch.randint(0, self.num_timesteps, (batch_size,))
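
To get a feel for how quickly the signal is destroyed, the closed-form coefficients can be tabulated for a few timesteps. The following is a minimal sketch in plain Python (no PyTorch), assuming the same linear schedule defaults as the scheduler above; `linear_alpha_bar` is a hypothetical helper name introduced only for this illustration.

```python
import math

def linear_alpha_bar(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (0-indexed)."""
    alpha_bars = []
    running = 1.0
    for i in range(num_timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (num_timesteps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

alpha_bars = linear_alpha_bar()
for t in (0, 249, 499, 999):
    signal = math.sqrt(alpha_bars[t])       # weight on x_0
    noise = math.sqrt(1.0 - alpha_bars[t])  # weight on epsilon
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

The signal weight starts close to 1 and decays toward 0 at the last timestep, which is why x_T can be treated as approximately pure Gaussian noise.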

Reverse Process (Removing Noise)

The reverse process starts from x_T ~ N(0, I) and removes noise step by step using the learned model epsilon_theta.

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)
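
For reference, the mean and variance used by the sampler code in this section can be written out explicitly. This is the standard DDPM parameterization of the mean in terms of the noise prediction; fixing the variance to beta_t is one of the choices discussed by Ho et al. and is the one the code here uses.

```latex
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right), \qquad \sigma_t^2 = \beta_t
```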

class DDPMSampler:
    """DDPM reverse-process sampler"""
    def __init__(self, scheduler):
        self.scheduler = scheduler

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDPM reverse-diffusion sampling"""
        # Start from pure noise
        x = torch.randn(shape, device=device)

        for t in reversed(range(self.scheduler.num_timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

            # Predict the noise
            predicted_noise = model(x, t_batch)

            # Compute the posterior mean
            alpha = self.scheduler.alphas[t]
            alpha_bar = self.scheduler.alphas_cumprod[t]
            beta = self.scheduler.betas[t]

            mean = (1 / torch.sqrt(alpha)) * (
                x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
            )

            # Add noise only while t > 0 (sigma_t^2 = beta_t)
            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(beta)
                x = mean + sigma * noise
            else:
                x = mean

        return x

Training Objective: Simple Loss

DDPM is trained by minimizing the MSE between the noise predicted by the model and the noise that was actually added.

L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]

def ddpm_training_step(model, x_0, scheduler, optimizer):
    """A single DDPM training step"""
    batch_size = x_0.shape[0]
    device = x_0.device

    # 1. Sample random timesteps
    t = scheduler.sample_timesteps(batch_size).to(device)

    # 2. Draw noise and build the noisy image
    noise = torch.randn_like(x_0)
    x_t = scheduler.add_noise(x_0, t, noise)

    # 3. Let the model predict the noise
    predicted_noise = model(x_t, t)

    # 4. Compute the simple loss
    loss = nn.functional.mse_loss(predicted_noise, noise)

    # 5. Backpropagate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

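DDIM: Accelerated Sampling

DDPM needs 1000 reverse-diffusion steps, so generation is very slow. DDIM (Denoising Diffusion Implicit Models), proposed by Song et al. (2020), defines a non-Markovian diffusion process that allows 10 to 50 times faster sampling with the same trained model.

The key to DDIM is the eta parameter, which interpolates between deterministic and stochastic sampling. With eta=0 sampling is fully deterministic, and with eta=1 it becomes equivalent to DDPM.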
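
The role of eta can be checked numerically. The sketch below, in plain Python, evaluates the per-step noise scale from the DDIM paper for two adjacent timesteps; the schedule values are arbitrary illustrative numbers, and `ddim_sigma` is a hypothetical helper name. With eta=1 the variance equals the DDPM posterior variance beta_t * (1 - alpha_bar_prev) / (1 - alpha_bar_t), which is the sense in which DDIM "becomes DDPM" (the DDPM sampler earlier in this article uses the other common choice, sigma_t^2 = beta_t).

```python
import math

def ddim_sigma(alpha_bar_t, alpha_bar_prev, eta):
    """Standard deviation of the noise injected in one DDIM step."""
    return (
        eta
        * math.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t))
        * math.sqrt(1 - alpha_bar_t / alpha_bar_prev)
    )

# Two adjacent timesteps (illustrative values, not taken from a trained model)
beta_t = 0.01
alpha_bar_prev = 0.5
alpha_bar_t = alpha_bar_prev * (1 - beta_t)

sigma_det = ddim_sigma(alpha_bar_t, alpha_bar_prev, eta=0.0)   # deterministic
sigma_ddpm = ddim_sigma(alpha_bar_t, alpha_bar_prev, eta=1.0)  # stochastic
posterior_var = beta_t * (1 - alpha_bar_prev) / (1 - alpha_bar_t)

print(sigma_det, sigma_ddpm ** 2, posterior_var)
```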

class DDIMSampler:
    """DDIM accelerated sampler"""
    def __init__(self, scheduler, ddim_steps=50, eta=0.0):
        self.scheduler = scheduler
        self.ddim_steps = ddim_steps
        self.eta = eta
        # Build a subset of timesteps (e.g. 1000 -> 50)
        self.timesteps = np.linspace(
            0, scheduler.num_timesteps - 1, ddim_steps, dtype=int
        )[::-1]

    @torch.no_grad()
    def sample(self, model, shape, device):
        """DDIM accelerated sampling over the reduced timestep subset (e.g. 50 steps)"""
        x = torch.randn(shape, device=device)

        for i in range(len(self.timesteps)):
            t = self.timesteps[i]
            t_prev = self.timesteps[i + 1] if i + 1 < len(self.timesteps) else 0

            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            predicted_noise = model(x, t_batch)

            alpha_bar_t = self.scheduler.alphas_cumprod[t]
            alpha_bar_prev = self.scheduler.alphas_cumprod[t_prev]

            # Predict x_0
            x_0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
            x_0_pred = torch.clamp(x_0_pred, -1, 1)

            # Compute the noise scale and the direction pointing to x_t
            sigma = self.eta * torch.sqrt(
                (1 - alpha_bar_prev) / (1 - alpha_bar_t) * (1 - alpha_bar_t / alpha_bar_prev)
            )
            direction = torch.sqrt(1 - alpha_bar_prev - sigma**2) * predicted_noise

            # Compute x_{t-1}
            x = torch.sqrt(alpha_bar_prev) * x_0_pred + direction

            if self.eta > 0 and t > 0:
                x = x + sigma * torch.randn_like(x)

        return x

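Relationship to Score-based Models

Song and Ermon (2019) proposed score-based generative modeling, which gives a score-matching view of diffusion models. The score function is the gradient of the log-density of the data distribution.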

s_\theta(x) \approx \nabla_x \log p(x)

The DDPM noise prediction epsilon_theta and the score function are related as follows.

s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}

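This relationship is unified in the Score SDE (Stochastic Differential Equation) framework, which describes the diffusion process in continuous time with the SDE below, where f(x, t) is the drift coefficient, g(t) is the diffusion coefficient, and w is a standard Wiener process.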

dx = f(x, t)dt + g(t)dw
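
The noise-to-score relation stated above can be sanity-checked in one dimension. For a fixed x_0, the forward marginal q(x_t | x_0) is Gaussian, so its score is known in closed form and should coincide with -epsilon / sqrt(1 - alpha_bar_t) when epsilon is the noise that produced x_t. The following is a minimal sketch with arbitrary illustrative numbers; both function names are hypothetical.

```python
import math

def score_from_eps(eps, alpha_bar_t):
    """Score implied by a noise value: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps / math.sqrt(1.0 - alpha_bar_t)

def gaussian_score(x_t, x_0, alpha_bar_t):
    """d/dx_t of log N(x_t; sqrt(alpha_bar_t) * x_0, 1 - alpha_bar_t)."""
    return -(x_t - math.sqrt(alpha_bar_t) * x_0) / (1.0 - alpha_bar_t)

x_0, eps, alpha_bar_t = 0.8, -1.3, 0.4
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

print(score_from_eps(eps, alpha_bar_t), gaussian_score(x_t, x_0, alpha_bar_t))
```

Both expressions agree up to floating-point error, which is why a network trained to predict noise can be read as a (rescaled) score estimator.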

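Latent Diffusion Model (Stable Diffusion)

Architecture Overview

The Latent Diffusion Model (LDM) of Rombach et al. (2022) runs the diffusion process in a latent space instead of pixel space, which drastically reduces the compute cost. This is the core architecture of Stable Diffusion.

An LDM consists of four key components (the VAE encoder and decoder are the two halves of one autoencoder).

Component	Role	Details
VAE Encoder	Encodes the image into latent space	Compresses a 512x512 image into a 64x64x4 latent
U-Net (Denoiser)	Predicts noise in latent space	Injects the text condition via cross-attention
VAE Decoder	Decodes the latent back into an image	Restores the 64x64x4 latent to a 512x512 image
Text Encoder	Encodes the text prompt	Produces embeddings for 77 tokens with CLIP ViT-L/14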
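
The saving from working in latent space follows directly from the shapes in the table. A quick arithmetic sketch, assuming a 3-channel RGB input (the channel count is an assumption; the table only gives the spatial size):

```python
def num_elements(height, width, channels):
    """Number of scalar values in an image-like tensor."""
    return height * width * channels

pixel_elems = num_elements(512, 512, 3)   # RGB image (3 channels assumed)
latent_elems = num_elements(64, 64, 4)    # VAE latent from the table
compression = pixel_elems / latent_elems

print(pixel_elems, latent_elems, compression)
```

The denoiser therefore operates on 48 times fewer values per sample than a pixel-space model at the same image resolution.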

Core Code Structure

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

class LatentDiffusionInference:
    """Stable Diffusion inference pipeline (simplified)"""

    def __init__(self, model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            safety_checker=None
        ).to("cuda")

        # Swap in the DDIM scheduler (accelerated sampling in 50 steps)
        self.pipe.scheduler = DDIMScheduler.from_config(
            self.pipe.scheduler.config
        )

    def generate(self, prompt, negative_prompt="", num_steps=50, guidance_scale=7.5):
        """Text-to-image generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=num_steps,
            guidance_scale=guidance_scale,
        ).images[0]
        return image

    def generate_with_latent_control(self, prompt, seed=42):
        """Generate from an explicitly supplied initial latent"""
        generator = torch.Generator(device="cuda").manual_seed(seed)

        # Create the initial latent tensor directly
        latents = torch.randn(
            (1, 4, 64, 64),
            generator=generator,
            device="cuda",
            dtype=torch.float16
        )

        image = self.pipe(
            prompt=prompt,
            latents=latents,
            num_inference_steps=50,
            guidance_scale=7.5,
        ).images[0]
        return image

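Cross-Attention Mechanism

In Stable Diffusion's U-Net, the text condition is injected into image generation through cross-attention. Queries are computed from the image latents, while keys and values are computed from the text embeddings.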

class CrossAttention(nn.Module):
    """Cross-attention layer of the Stable Diffusion U-Net"""
    def __init__(self, d_model=320, d_context=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(d_context, d_model, bias=False)
        self.to_v = nn.Linear(d_context, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: image latents (B, H*W, d_model)
        context: text embeddings (B, seq_len, d_context)
        """
        B, N, C = x.shape

        q = self.to_q(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.to_k(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.to_v(context).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        # Scaled Dot-Product Attention
        scale = self.d_head ** -0.5
        attn = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn = torch.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)

        out = out.transpose(1, 2).contiguous().view(B, N, C)
        return self.to_out(out)

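Classifier-free Guidance (CFG)

Classifier-free guidance, proposed by Ho and Salimans (2022), is a key technique for controlling generation quality without a separate classifier.

During training, a single network learns both the conditional and the unconditional model: with some probability, the text condition is replaced by an empty string. At inference, the two predictions are combined linearly.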

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))

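Here w is the guidance scale. w=0 gives the unconditional prediction, w=1 gives plain conditional generation, and w>1 extrapolates beyond the conditional prediction so that the output follows the text condition more strongly (typical values are 7.5 to 15).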

def classifier_free_guidance_step(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """A single classifier-free guidance step"""

    # Batch the unconditional and conditional inputs together
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([null_embedding, text_embedding], dim=0)

    # One forward pass produces both predictions
    noise_pred = model(x_in, t_in, encoder_hidden_states=c_in)
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)

    # Apply CFG
    noise_pred_guided = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )
    return noise_pred_guided
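
The effect of the guidance scale is easy to see on toy numbers. The sketch below applies the same combination rule to two-element lists standing in for the unconditional and conditional noise predictions; the values and the helper name `cfg_combine` are purely illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Elementwise eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

print(cfg_combine(eps_uncond, eps_cond, 0.0))  # unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 2.0))  # extrapolated past the conditional one
```

For w above 1 the result lies outside the segment between the two predictions, which is why very large scales can push samples toward oversaturated or artifact-prone outputs.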

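DiT: Diffusion Transformer

From U-Net to Transformer

DiT (Diffusion Transformer) by Peebles and Xie (2023) replaces the U-Net backbone of the diffusion model with a Transformer. The key finding is that increasing the Transformer's compute (GFLOPs) consistently improves generation quality (lower FID).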

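Model	Backbone	Parameters	FID (ImageNet 256x256)	GFLOPs
ADM	U-Net	554M	10.94	1120
LDM-4	U-Net	400M	10.56	103
DiT-S/2	Transformer	33M	68.40	6
DiT-B/2	Transformer	130M	43.47	23
DiT-L/2	Transformer	458M	23.33	80
DiT-XL/2	Transformer	675M	2.27	119

Note: the DiT-S/2, DiT-B/2, and DiT-L/2 values are reported at 400K training steps without guidance. DiT-XL/2 reaches 9.62 without guidance after longer training and 2.27 with classifier-free guidance, so the rows are not strictly comparable.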

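adaLN-Zero Block

The key innovation of DiT is adaLN-Zero conditioning. The timestep and class embeddings are injected as the scale/shift parameters of adaptive layer normalization, and the gating parameters are initialized to zero so that each block acts as the identity function (only the residual path is active) at the start of training.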

class DiTBlock(nn.Module):
    """adaLN-Zero Transformer block of DiT"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        # adaLN modulation: 6 parameters (shift1, scale1, gate1, shift2, scale2, gate2)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(d_model, 6 * d_model),
        )
        # Zero initialization: the block acts as the identity at the start of training
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x, c):
        """
        x: patch tokens (B, N, D)
        c: conditioning embedding, timestep + class (B, D)
        """
        # Generate the adaLN parameters
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.adaLN_modulation(c).chunk(6, dim=-1)
        )

        # Self-Attention with adaLN
        h = self.norm1(x)
        h = h * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        x = x + gate1.unsqueeze(1) * h

        # FFN with adaLN
        h = self.norm2(x)
        h = h * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        h = self.mlp(h)
        x = x + gate2.unsqueeze(1) * h

        return x

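Patchify Strategy

DiT splits the latent into p x p patches and uses them as the input tokens of the Transformer. A smaller patch size yields more tokens, which improves quality but also increases the compute cost.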

class PatchEmbed(nn.Module):
    """Patchify layer of DiT"""
    def __init__(self, patch_size=2, in_channels=4, embed_dim=1152):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        """(B, C, H, W) -> (B, N, D) sequence of patch tokens"""
        x = self.proj(x)  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D)
        return x
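
The patch-size trade-off described above comes down to the token count N = (H / p) * (W / p). The sketch below tabulates it, assuming a 32x32 latent (what an 8x-downsampling VAE would produce for a 256x256 image; this latent size is an assumption for illustration) and a hypothetical helper `num_tokens`.

```python
def num_tokens(latent_size, patch_size):
    """Number of patch tokens for a square latent of side latent_size."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2

for p in (8, 4, 2):
    n = num_tokens(32, p)
    # Self-attention cost grows with the square of the token count
    print(f"patch_size={p}: tokens={n}, attention pairs={n * n}")
```

Halving the patch size quadruples the number of tokens and multiplies the number of attention pairs by sixteen, which matches the GFLOPs growth across the /8, /4, and /2 model variants.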

SDXL: The Evolution of Stable Diffusion

Key Improvements

SDXL by Podell et al. (2023) introduces the following key improvements over Stable Diffusion v1.5.

Feature	SD v1.5	SDXL Base
U-Net parameters	860M	2.6B (about 3x)
Text encoder	CLIP ViT-L/14	OpenCLIP ViT-bigG + CLIP ViT-L
Text embedding dimension	768	2048
Default resolution	512x512	1024x1024
Number of attention blocks	16	70
Refiner model	None	Dedicated refiner included

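Dual Text Encoders

One of SDXL's biggest innovations is the use of two text encoders. It combines the rich semantic representation of OpenCLIP ViT-bigG with the complementary features of CLIP ViT-L, concatenating their outputs along the channel axis (1280 + 768 = 2048), which substantially improves text understanding.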

from diffusers import StableDiffusionXLPipeline
import torch

class SDXLInference:
    """SDXL inference pipeline"""

    def __init__(self):
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )

        # Memory optimization: model CPU offload moves each sub-model to the GPU
        # on demand, so the pipeline is not moved to "cuda" up front
        self.pipe.enable_model_cpu_offload()
        self.pipe.enable_vae_tiling()

    def generate(self, prompt, negative_prompt="", steps=30):
        """Basic SDXL generation"""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=7.5,
            height=1024,
            width=1024,
        ).images[0]
        return image

    def generate_with_refiner(self, prompt, base_pipe, refiner_pipe):
        """Two-stage base + refiner pipeline"""
        # Base model: the first 80% of the denoising steps
        high_noise_frac = 0.8
        image = base_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_end=high_noise_frac,
            output_type="latent",
        ).images

        # Refiner: the remaining 20% (improves fine detail)
        image = refiner_pipe(
            prompt=prompt,
            num_inference_steps=40,
            denoising_start=high_noise_frac,
            image=image,
        ).images[0]
        return image

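Size/Crop Conditioning

During training, SDXL conditions the model on each image's original size and crop coordinates, which lets it learn effectively from images of varying sizes and aspect ratios. This is implemented with Fourier feature encoding.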

def get_sdxl_conditioning(original_size, crop_coords, target_size):
    """Build SDXL's size/crop conditioning vector"""
    # Original size (height, width)
    original_size = torch.tensor(original_size, dtype=torch.float32)
    # Crop coordinates (top, left)
    crop_coords = torch.tensor(crop_coords, dtype=torch.float32)
    # Target size (height, width)
    target_size = torch.tensor(target_size, dtype=torch.float32)

    # Concatenate the six conditioning scalars before Fourier feature encoding
    conditioning = torch.cat([original_size, crop_coords, target_size])

    # Sinusoidal embedding
    freqs = torch.exp(
        -torch.arange(0, 128) * np.log(10000) / 128
    )
    emb = conditioning.unsqueeze(-1) * freqs.unsqueeze(0)
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)

    return emb.flatten()
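
As a shape check on the conditioning vector, the same encoding can be reproduced in plain Python. The sketch below is an illustrative reimplementation with hypothetical function names, not SDXL's actual code: each of the six scalars is expanded into 128 sine and 128 cosine features, giving 6 x 256 = 1536 values in total.

```python
import math

def sinusoidal_embedding(value, dim=256, max_period=10000.0):
    """Encode one scalar as dim/2 sine features followed by dim/2 cosine features."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def size_crop_embedding(original_size, crop_coords, target_size):
    """Concatenate the embeddings of the six conditioning scalars."""
    emb = []
    for v in (*original_size, *crop_coords, *target_size):
        emb.extend(sinusoidal_embedding(float(v)))
    return emb

emb = size_crop_embedding((1024, 1024), (0, 0), (1024, 1024))
print(len(emb))
```

A crop coordinate of 0 maps to all-zero sine features and all-one cosine features, so "no crop" has a fixed, easily recognizable encoding.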

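ControlNet: Controlling Conditional Generation

ControlNet by Zhang et al. (2023) adds spatial conditions such as edges, depth, and pose to a pretrained diffusion model. Its zero-convolution layers preserve the model's existing capabilities at the start of training while the new condition is learned gradually.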

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import CannyDetector
from PIL import Image
import torch

def controlnet_canny_generation(input_image_path, prompt):
    """Image generation guided by ControlNet with Canny edges"""
    # Load the ControlNet model
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_canny",
        torch_dtype=torch.float16,
    )

    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Extract Canny edges
    canny_detector = CannyDetector()
    input_image = Image.open(input_image_path)
    canny_image = canny_detector(input_image, low_threshold=100, high_threshold=200)

    # Generate with ControlNet conditioning
    output = pipe(
        prompt=prompt,
        image=canny_image,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,
    ).images[0]

    return output

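Training Pipeline and Data Preparation

Dataset Composition

The table below compares the main datasets used to train large-scale diffusion models.

Dataset	Size	Resolution	Use
LAION-5B	5.8 billion image-text pairs	Varies	Stable Diffusion training
LAION-Aesthetics	120 million (filtered)	Varies	High-quality fine-tuning
ImageNet	1.3 million	256/512	DiT training (class-conditional)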
COYO-700M	700 million	Varies	General image-text pretraining (alt-text filtered to English)

Fine-tuning Strategies

# LoRA fine-tuning (Stable Diffusion)
accelerate launch train_text_to_image_lora.py \
    --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
    --dataset_name="custom_dataset" \
    --resolution=512 \
    --train_batch_size=4 \
    --gradient_accumulation_steps=4 \
    --learning_rate=1e-4 \
    --lr_scheduler="cosine" \
    --lr_warmup_steps=500 \
    --max_train_steps=10000 \
    --rank=64 \
    --output_dir="./lora_output" \
    --mixed_precision="fp16" \
    --enable_xformers_memory_efficient_attention

# DreamBooth fine-tuning (learning a specific subject or style)
accelerate launch train_dreambooth.py \
    --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
    --instance_data_dir="./my_images" \
    --instance_prompt="a photo of sks dog" \
    --class_data_dir="./class_images" \
    --class_prompt="a photo of dog" \
    --with_prior_preservation \
    --prior_loss_weight=1.0 \
    --num_class_images=200 \
    --resolution=512 \
    --train_batch_size=1 \
    --learning_rate=5e-6 \
    --max_train_steps=800
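
To put the `--rank=64` setting of the LoRA command above in perspective: LoRA freezes a weight matrix of shape (d_out, d_in) and trains two low-rank factors of shapes (d_out, r) and (r, d_in) instead. The sketch below compares parameter counts for two layer widths; 320 matches the `d_model` used in the cross-attention example earlier, while 1280 is an assumed wider layer chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, parameters in the LoRA factors)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

for width in (320, 1280):
    full, lora = lora_param_counts(width, width, rank=64)
    print(f"{width}x{width}: full={full}, lora={lora}, ratio={lora / full:.2f}")
```

The relative saving grows with layer width, so a fixed rank is comparatively expensive on narrow layers and cheap on wide ones; lowering the rank is the usual lever when the adapter should be smaller.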

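Inference Optimization Techniques

Comparison of Key Optimization Techniques

Technique	Speedup	Quality impact	Memory savings
DDIM (50 steps)	20x (vs. 1000-step DDPM)	Negligible	-
DPM-Solver++ (20 steps)	50x (vs. 1000-step DDPM)	Negligible	-
xFormers Memory Efficient Attention	1.5x	None	30-40%
torch.compile	1.2-1.5x	None	-
VAE Tiling	-	Negligible	70%+
FP16/BF16	1.5-2x	Negligible	50%
TensorRT	2-4x	None	-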

Practical Optimization Code

import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

def optimized_sdxl_pipeline():
    """SDXL pipeline tuned for production"""
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")

    # 1. Use a fast scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config,
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )

    # 2. VAE tiling (saves memory for high-resolution generation)
    pipe.enable_vae_tiling()

    # 3. Attention slicing (when VRAM is tight)
    pipe.enable_attention_slicing()

    # 4. torch.compile (PyTorch 2.0+)
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

    return pipe

# GPU memory monitoring
def monitor_gpu_memory():
    """Print current, reserved, and peak GPU memory usage"""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    max_allocated = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved:  {reserved:.2f} GB")
    print(f"Peak:      {max_allocated:.2f} GB")

Overall Model Comparison

Model	Year	Key contribution	Backbone	Conditioning	Resolution
DDPM	2020	Made diffusion models practical	U-Net	None (unconditional)	256
DDIM	2020	Accelerated sampling	U-Net	None	256
LDM (SD)	2022	Diffusion in latent space	U-Net + VAE	Cross-Attention	512
DiT	2023	Transformer backbone	Transformer	adaLN-Zero	256/512
SDXL	2023	Larger U-Net + dual text encoders	U-Net + VAE	Cross-Attention + CFG	1024
ControlNet	2023	Spatial condition control	Zero Conv + U-Net	Edges/depth/pose	512
SD3	2024	MMDiT (multimodal DiT)	Transformer	Flow Matching	1024

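Operational Considerations

GPU Memory Management

The most frequent problem when operating a Stable Diffusion based service is GPU OOM (out of memory). Check the following points.

- Limit batch size: generating a single 1024x1024 SDXL image uses about 12GB on an A100 80GB, while a V100 16GB runs out of memory
- Limit concurrent requests: always apply a rate limiter so that GPU memory is not exceeded
- Enable VAE tiling: required for high-resolution (2048x2048+) generation
- Profile memory: monitor GPU memory periodically to detect memory leaks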
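
One simple way to enforce the concurrent-request limit from the checklist above is an admission gate in front of the pipeline call. The following is a minimal sketch using only the standard library; `GpuRequestGate` and `generate_fn` are hypothetical names, and this is a concurrency cap rather than a full rate limiter (it bounds in-flight requests, not requests per second).

```python
import threading

class GpuRequestGate:
    """Admit at most max_concurrent generations at once and reject the rest."""

    def __init__(self, max_concurrent=1):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, generate_fn, *args, **kwargs):
        # Fail fast instead of queueing, so callers can return HTTP 429/503
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("GPU busy: too many concurrent requests")
        try:
            return generate_fn(*args, **kwargs)
        finally:
            # Always free the slot, even if generation raised
            self._slots.release()

gate = GpuRequestGate(max_concurrent=1)
print(gate.run(lambda prompt: f"image for {prompt!r}", "a cat"))
```

Rejecting immediately keeps GPU memory bounded by the configured number of in-flight generations; a service that prefers queueing could call `acquire` with a timeout instead.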

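Failure Case: Recovering from GPU OOM

# Check GPU memory status
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Find which processes hold the GPU devices (to spot leaking Python workers)
fuser -v /dev/nvidia*

# Release cached GPU memory without restarting the process.
# Note: torch.cuda.empty_cache() only frees the cache of the process that calls it,
# so these statements must run inside the affected worker. A fresh `python -c`
# process, as shown here, cannot free memory held by another process.
python -c "
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print('GPU memory cleared')
print(f'Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB')
"

# Service recovery procedure after an OOM
# 1. Gracefully shut down the affected worker process
# 2. Confirm that its GPU memory has been released
# 3. Adjust the batch size and the number of concurrent requests
# 4. Restart the worker process
# 5. Restore traffic after the health check passes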
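
Step 3 of the procedure (adjusting the batch size) can also be automated inside the worker. The sketch below halves the batch size whenever generation raises an out-of-memory error and then continues with the remaining prompts. It is an illustration under assumptions: `generate_fn` is a hypothetical callable that takes a list of prompts and returns a list of results, and the error type defaults to the built-in `MemoryError`; with PyTorch one would presumably pass `torch.cuda.OutOfMemoryError` instead.

```python
def generate_with_backoff(generate_fn, prompts, batch_size, oom_error=MemoryError):
    """Process prompts in batches, halving batch_size whenever oom_error is raised."""
    results = []
    i = 0
    while i < len(prompts):
        batch = prompts[i:i + batch_size]
        try:
            results.extend(generate_fn(batch))
            i += len(batch)
        except oom_error:
            if batch_size == 1:
                raise  # cannot shrink further; let the worker restart path handle it
            batch_size //= 2
    # Return the batch size that finally worked so it can be persisted as config
    return results, batch_size
```

The returned batch size can be written back to the service configuration, so that the restarted worker does not repeat the same failure.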

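NSFW Filtering

Commercial services must enable the Safety Checker. With it disabled, NSFW content can be generated, which can lead to legal problems.

from diffusers import StableDiffusionPipeline
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor

# Development only: safety checker disabled
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    safety_checker=None,
)

# Production: always enable the safety checker
safety_checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
feature_extractor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)
# Pass both components to the pipeline so the checker is actually applied
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    safety_checker=safety_checker,
    feature_extractor=feature_extractor,
)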

Failure Cases and Recovery Procedures

Case 1: Model Loading Failure

Loading a large model can fail because of disk I/O timeouts or a corrupted checkpoint.

import os
import time

import torch
from diffusers import StableDiffusionXLPipeline

def robust_model_loading(model_id, max_retries=3):
    """Load the model robustly, retrying on failure"""
    for attempt in range(max_retries):
        try:
            pipe = StableDiffusionXLPipeline.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                use_safetensors=True,
                # Use local files only when model_id is a local checkpoint directory
                local_files_only=os.path.exists(
                    os.path.join(model_id, "model_index.json")
                ),
            )
            pipe = pipe.to("cuda")
            # Warm-up run to surface load-time problems early
            _ = pipe("test", num_inference_steps=1)
            return pipe
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(10)
                # Clear the CUDA cache before retrying
                torch.cuda.empty_cache()
            else:
                raise RuntimeError(
                    f"Model loading failed after {max_retries} attempts"
                ) from e

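Case 2: Degraded Image Quality (Inappropriate CFG Scale)

# CFG scale guidelines
guidance_scale_guidelines:
  1.0: 'No guidance - plain conditional sampling; weak prompt adherence, high diversity'
  3.0-5.0: 'Creative and diverse generation'
  7.0-8.5: 'Generally recommended range - balances quality and diversity'
  10.0-15.0: 'High text fidelity - risk of oversaturation'
  20.0+: 'Excessive guidance - artifacts appear'

# Troubleshooting checklist
troubleshooting:
  blurry_output:
    - 'Increase num_inference_steps (at least 30)'
    - 'Switch the scheduler to DPM-Solver++'
  oversaturated:
    - 'Lower guidance_scale to 7.0 or below'
    - "Add 'oversaturated, vivid' to negative_prompt"
  wrong_composition:
    - 'Improve the prompt structure (make subject, verb, and object explicit)'
    - 'Control the composition with ControlNet'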
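Conclusion

Diffusion models have advanced rapidly: on top of the theoretical foundation of DDPM came the accelerated sampling of DDIM, the efficient architecture of Latent Diffusion, the quality control of classifier-free guidance, the scalability of DiT, the scale-up of SDXL, and the fine-grained control of ControlNet.

New paradigms such as SD3's MMDiT (Multi-Modal Diffusion Transformer), flow matching, and consistency models are now emerging and enabling faster, higher-quality image generation. The DiT architecture in particular underlies video generation models such as Sora (OpenAI), and the reach of diffusion models is expanding beyond images to video, 3D, and audio.

From an engineering perspective, understanding the theory behind the models is the key to optimization and debugging. Only with a precise grasp of what each component does (the noise schedule, the CFG scale, the choice of scheduler, memory management) can a service run reliably in production.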