Text-to-Image Model Training: A Complete Guide from GANs to Flow Matching
- 1. Introduction: The Evolution of Text-to-Image Generative Models
- 2. Training Methodologies by Core Architecture
- 3. Text Conditioning Methods
- 4. Training Datasets
- 5. Fine-tuning & Customization Techniques
- 6. Recent Trends (2024-2026)
- 7. A Practical Training Pipeline Guide
- 8. Key Paper References
- 9. Conclusion and Outlook
- References
1. Introduction: The Evolution of Text-to-Image Generative Models
Text-to-Image (T2I) generative models synthesize high-resolution images from natural-language prompts, and the field has advanced rapidly over the past decade. Its trajectory can be divided into four major paradigms.
```
[Text-to-Image model timeline]

2014-2019      2017-2020       2020-2023             2023-present
    |              |                |                      |
   GAN         VAE/VQ-VAE     Diffusion Models       Flow Matching
    |              |                |                    + DiT
    v              v                v                      v
┌──────────┐  ┌──────────┐  ┌────────────────┐  ┌──────────────┐
│ StackGAN │  │ VQ-VAE   │  │ DDPM (2020)    │  │ SD3 (2024)   │
│ AttnGAN  │  │ VQ-VAE-2 │  │ DALL-E 2 (2022)│  │ Flux (2024)  │
│ StyleGAN │  │ dVAE     │  │ Imagen (2022)  │  │ Pixart-Sigma │
│ BigGAN   │  │          │  │ SD 1.x (2022)  │  │              │
│ GigaGAN  │  │          │  │ SDXL (2023)    │  │              │
└──────────┘  └──────────┘  └────────────────┘  └──────────────┘
```
Characteristics by paradigm:
- GAN: adversarial training; mode collapse issues; fast sampling
- VAE/VQ-VAE: discrete latent space; codebook learning; two-stage training
- Diffusion: iterative denoising; classifier-free guidance; latent space; U-Net backbone
- Flow Matching: straight paths; ODE-based; fewer sampling steps; DiT backbone; scalable
1.1 Why Training Methodology Matters
The quality of a T2I model is determined as much by its training methodology as by its architecture. Even with an identical architecture, generation quality varies dramatically with the noise schedule, the conditioning scheme, data quality, and the training strategy. A striking example is DALL-E 3, which achieved dramatic gains over its predecessor purely by improving caption quality, without changing the architecture.
This article analyzes the core training methodology of each paradigm, grounded in the original papers, and covers how to assemble a practical training pipeline.
2. Training Methodologies by Core Architecture
2.1 GAN-based: Adversarial Training
A **Generative Adversarial Network (GAN)** is a framework in which two networks, a Generator and a Discriminator, are trained in competition.
2.1.1 Basic Training Principle
The GAN objective function is defined as a minimax game:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated ones
- Training goal: G tries to fool D, while D tries to classify correctly
2.1.2 StyleGAN Training Strategy
StyleGAN (Karras et al., 2019) introduced Progressive Growing and a style-based generator, enabling high-quality image synthesis.
Key training techniques:
| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start at low resolution (4x4) and grow resolution progressively | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Regularize the generator's Jacobian | Improved sample quality |
| R1 Regularization | Gradient penalty on the discriminator | Stabilized training |
| Lazy Regularization | Apply regularization every 16 steps rather than every step | Improved training efficiency |
```python
# Core StyleGAN2 training loop (simplified)
for real_images, _ in dataloader:
    # 1. Train the discriminator
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # R1 regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss += 10.0 * r1_penalty

    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Train the generator
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_fake = discriminator(fake_images)
    g_loss = F.softplus(-d_fake).mean()

    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
```
2.1.3 BigGAN's Large-Scale Training
BigGAN (Brock et al., 2019) scaled GANs up dramatically, using the following training strategies:
- Large batches: batch sizes up to 2048 improved both stability and quality
- Class-conditional Batch Normalization: class information injected through the BatchNorm parameters
- Truncation Trick: truncating the latent distribution at inference time to trade diversity for quality
- Orthogonal Regularization: keeping weight matrices near-orthogonal to mitigate mode collapse
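The truncation trick can be sketched in a few lines. This is a minimal illustration with placeholder dimensions: latents whose components exceed a threshold are resampled, which narrows the latent distribution and trades diversity for fidelity.

```python
import torch

def truncated_noise(batch_size, latent_dim, threshold=0.5):
    """Truncation trick sketch: resample any latent component whose
    magnitude exceeds `threshold`. Lower thresholds give higher-fidelity,
    less diverse samples."""
    z = torch.randn(batch_size, latent_dim)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn_like(z[mask])
        mask = z.abs() > threshold
    return z

z = truncated_noise(4, 128, threshold=0.5)
```

The generator is then called on the truncated `z`; in BigGAN this interacts with the orthogonal regularization above, which keeps the generator well-behaved on the truncated region of latent space.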
2.1.4 Limitations of GAN-based T2I
GAN-based approaches ceded the lead to diffusion-based models because of several fundamental limitations:
- Mode collapse: limited sample diversity
- Training instability: training is unstable and highly sensitive to hyperparameters
- Difficult text conditioning: complex text prompts are hard to reflect faithfully
- Scaling limits: instability grows with large-scale training
2.2 VAE-based: Codebook Learning and Discrete Latent Spaces
2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder
VQ-VAE (van den Oord et al., 2017) learns a discrete latent space instead of a continuous one.
```
[VQ-VAE architecture]

Input Image      Encoder        Quantization       Decoder       Reconstructed
(256x256)  --> [E(x)] --> z_e --> [Codebook] --> z_q --> [D(z_q)] --> Image
                           |           |
                           |      ┌────┴────┐
                           |      │   e_1   │
                           |      │   e_2   │  K code vectors
                           └─────>│   ...   │  (the codebook)
                                  │   e_K   │
                                  └─────────┘

z_q = e_k  where  k = argmin_j ||z_e - e_j||
(quantize to the nearest code vector)
```
VQ-VAE training loss:
```
L = ||x - D(z_q)||²        # Reconstruction loss
  + ||sg[z_e] - e||²       # Codebook loss (can be replaced by EMA updates)
  + β * ||z_e - sg[e]||²   # Commitment loss
```
- sg[·]: the stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: the selected codebook vector
Since quantization is non-differentiable, the **Straight-Through Estimator (STE)** is used to pass gradients back to the encoder. The codebook itself can be updated via an Exponential Moving Average (EMA).
```python
# Core VQ-VAE codebook training code
class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        B, D, H, W = z_e.shape
        # (B, D, H, W) -> (B*H*W, D)
        flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, D)
        # Find the nearest codebook vector (squared Euclidean distance)
        distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
                    + (self.embedding.weight ** 2).sum(dim=1) \
                    - 2 * flat_z @ self.embedding.weight.t()
        indices = distances.argmin(dim=1)
        z_q = self.embedding(indices).view(B, H, W, D).permute(0, 3, 1, 2)
        # Losses: pull the codebook toward the encoder output (codebook loss),
        # and the encoder output toward the codebook (commitment loss)
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commitment_loss = F.mse_loss(z_q.detach(), z_e)
        loss = codebook_loss + self.commitment_cost * commitment_loss
        # Straight-Through Estimator: copy gradients from z_q to z_e
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, loss, indices
```
2.2.2 VQ-VAE-2: Hierarchical Codebook Learning
VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization, substantially improving image quality.
```
[VQ-VAE-2 hierarchy]

Top level (small resolution)
┌─────────────┐
│ 32x32 grid  │  Global structure
│  Codebook   │  (composition, overall shape)
└──────┬──────┘
       │
Bottom level (large resolution)
┌──────┴──────┐
│ 64x64 grid  │  Fine detail
│  Codebook   │  (texture, edges)
└─────────────┘
```
VQ-VAE-2's image generation pipeline has two stages:
- Stage 1: train VQ-VAE-2 to encode images into hierarchical discrete codes
- Stage 2: train an autoregressive model (e.g., PixelCNN) as a prior over the discrete codes
This approach directly influenced DALL-E's dVAE (discrete VAE).
2.3 Diffusion-based: The Core of Modern T2I
Diffusion models are the dominant T2I paradigm today. They pair a forward process that gradually adds noise to data with a learned reverse process that recovers data from noise.
2.3.1 DDPM: Denoising Diffusion Probabilistic Models
DDPM (Ho et al., 2020) is the paper that brought diffusion models to a practical level.
Forward Process (Diffusion):
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)
- A small amount of Gaussian noise is added at each timestep t
- β_t: the noise schedule (typically linear or cosine)
- After T steps, x_T ≈ N(0, I) (pure Gaussian noise)
A closed form allows noising to any timestep t directly:
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)
where ᾱ_t = ∏_{s=1}^{t} α_s, α_t = 1 - β_t
=> x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε, ε ~ N(0, I)
Reverse Process (Denoising):
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)
- A neural network ε_θ predicts the noise ε that was added to x_t
- The predicted noise is removed to recover x_{t-1}
Training objective (simple loss):
L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]
- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε
```python
# Core DDPM training step
def train_step(model, x_0, noise_scheduler):
    batch_size = x_0.shape[0]
    # 1. Sample random timesteps
    t = torch.randint(0, num_timesteps, (batch_size,))
    # 2. Sample noise
    noise = torch.randn_like(x_0)
    # 3. Forward process: build x_t (reshape for broadcasting over (B, C, H, W))
    alpha_bar_t = noise_scheduler.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
    # 4. Predict the noise
    noise_pred = model(x_t, t)
    # 5. MSE loss
    loss = F.mse_loss(noise_pred, noise)
    return loss
```
2.3.2 Noise Scheduling
The noise schedule determines how much noise is added at each timestep of the forward process, and it has a decisive effect on sample quality.
| Schedule | Formula | Characteristics | Used by |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) * t/T | Simple, but noise ramps up sharply near the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) * π/2) | Smooth transitions, preserves information well | Improved DDPM |
| Scaled Linear | β_t = (β_min^0.5 + t/T * (β_max^0.5 - β_min^0.5))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12*t/T) | Gentle change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Near-optimal in theory | Playground v2.5, EDM |
| Zero Terminal SNR | Guarantees SNR(T) = 0 | Sampling truly starts from pure noise | SD3, Flux |
Playground v2.5 (Li et al., 2024) adopted the EDM (Karras et al., 2022) noise schedule, markedly improving color and contrast. The key is guaranteeing zero terminal SNR: the signal-to-noise ratio at timestep T must be exactly 0 during training.
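Zero terminal SNR can be retrofitted onto an existing beta schedule. The sketch below follows the rescaling procedure proposed by Lin et al. (2024, "Common Diffusion Noise Schedules and Sample Steps are Flawed"): shift and scale √ᾱ_t so that ᾱ_T becomes exactly 0 while ᾱ_1 is preserved.

```python
import torch

def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so that SNR(T) == 0 (sketch of Lin et al., 2024)."""
    alphas = 1.0 - betas
    alphas_bar = alphas.cumprod(0)
    sqrt_ab = alphas_bar.sqrt()
    # Shift so the last value becomes exactly 0, keeping the first value fixed
    sqrt_ab_0 = sqrt_ab[0].clone()
    sqrt_ab_T = sqrt_ab[-1].clone()
    sqrt_ab = sqrt_ab - sqrt_ab_T
    sqrt_ab = sqrt_ab * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)
    # Convert back to betas
    alphas_bar = sqrt_ab ** 2
    alphas = torch.cat([alphas_bar[0:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

Note that the final beta becomes 1, so the model must be trained with v-prediction (not ε-prediction) to remain well-defined at t = T.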
```python
# Cosine schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

# EDM noise schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    step_indices = torch.arange(num_steps)
    t_steps = (sigma_max ** (1 / rho) + step_indices / (num_steps - 1)
               * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return t_steps
```
2.3.3 Latent Diffusion Model (LDM): The Core of Stable Diffusion
The **Latent Diffusion Model (LDM)** of Rombach et al. (2022) performs diffusion in a latent space instead of pixel space, dramatically improving computational efficiency. This is the core idea behind Stable Diffusion.
```
[Latent Diffusion Model architecture]

                         Text Prompt
                              │
                         ┌────┴────┐
                         │  CLIP   │
                         │ Encoder │
                         └────┬────┘
                              │ text embeddings
                              │
┌──────┐   ┌───────┐   ┌──────┴─────┐   ┌───────┐   ┌──────┐
│Image │   │  VAE  │   │   U-Net    │   │  VAE  │   │Output│
│(512  │──>│Encoder│──>│ (Denoising │──>│Decoder│──>│Image │
│x512) │   │       │   │ in Latent  │   │       │   │(512  │
│      │   │       │   │   Space)   │   │       │   │x512) │
└──────┘   └───────┘   └────────────┘   └───────┘   └──────┘
               │                            │
               │         64x64x4            │
               │    (8x downsampling)       │
               └────────────────────────────┘
                      Latent Space (z)

Training:  diffusion in latent space
Inference: random noise z_T → U-Net denoising → VAE decode → image
```
The LDM training pipeline:
Stage 1 - Autoencoder training: pretrain a KL-regularized VAE on the image dataset
- Encoder: image x (H x W x 3) → latent z (H/f x W/f x c), typically f=8
- Decoder: latent z → reconstructed image
- Loss: reconstruction + KL divergence + perceptual loss + GAN loss
Stage 2 - Diffusion model training: diffusion in the frozen autoencoder's latent space
- Add noise to the latent z_0 = Encoder(x) to form z_t
- The U-Net predicts the noise in z_t
- Text conditioning is injected via cross-attention
```python
# Core latent diffusion training
class LatentDiffusionTrainer:
    def __init__(self, vae, unet, text_encoder, noise_scheduler):
        self.vae = vae                    # frozen
        self.unet = unet                  # trained
        self.text_encoder = text_encoder  # frozen
        self.noise_scheduler = noise_scheduler

    def train_step(self, images, captions):
        # 1. Encode images to latents with the VAE (no gradients needed)
        with torch.no_grad():
            latents = self.vae.encode(images).latent_dist.sample()
            latents = latents * self.vae.config.scaling_factor  # 0.18215 in SD 1.x
        # 2. Text embeddings (no gradients needed)
        with torch.no_grad():
            text_embeddings = self.text_encoder(captions)
        # 3. Add noise
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, 1000, (latents.shape[0],))
        noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
        # 4. Predict the noise
        noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)
        # 5. MSE loss
        loss = F.mse_loss(noise_pred, noise)
        return loss
```
2.3.4 The U-Net Backbone
The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:
```
[U-Net with cross-attention]

Input z_t ─────────────────────────────────────────── Output ε_θ
    │                                                    ▲
    ▼                                                    │
┌────────┐  ┌────────┐  ┌────────┐      ┌────────┐  ┌────────┐
│  Down  │  │  Down  │  │  Down  │      │   Up   │  │   Up   │
│ Block  │──│ Block  │──│ Block  │──┐   │ Block  │──│ Block  │
│ 64x64  │  │ 32x32  │  │ 16x16  │  │   │ 32x32  │  │ 64x64  │
└────────┘  └────────┘  └────────┘  │   └────────┘  └────────┘
    │           │           │       ▼       ▲           ▲
    │           │           │   ┌────────┐  │           │
    │           │           └───│ Middle │──┘           │
    │           │               │ Block  │              │
    │           │               │ 16x16  │              │
    │           └───────────────└────────┘──────────────┘
    │                   (skip connections)
    └───────────────────────────────────────────────────┘

Inside each block:
┌──────────────────────────────────────┐
│ ResNet Block                         │
│  ├── GroupNorm → SiLU → Conv         │
│  ├── Timestep embedding injection    │
│  └── GroupNorm → SiLU → Conv         │
│                                      │
│ Self-Attention Block                 │
│  ├── LayerNorm → Self-Attention      │
│  └── Skip connection                 │
│                                      │
│ Cross-Attention Block                │
│  ├── LayerNorm                       │
│  ├── Q = Linear(latent features)     │
│  ├── K = Linear(text embeddings)     │ ← text conditioning
│  ├── V = Linear(text embeddings)     │
│  └── Attention(Q, K, V)              │
│                                      │
│ Feed-Forward Block                   │
│  ├── LayerNorm → Linear → GEGLU      │
│  └── Linear → Skip connection        │
└──────────────────────────────────────┘
```
SDXL (Podell et al., 2023) enlarged the U-Net roughly threefold (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and trains across diverse aspect ratios.
| Model | U-Net params | Text Encoder | Resolution | VAE downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (single) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (single) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (dual) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (single) | 1024x1024 | 8x |
2.3.5 Classifier-Free Guidance (CFG)
**Classifier-Free Guidance (CFG)** (Ho & Salimans, 2022) is a core training technique in modern T2I models.
Problems with the older classifier guidance:
- A separate classifier must be trained
- The classifier must work on noisy images
- Inference requires computing classifier gradients
The key CFG idea:
During training, the text conditioning is replaced with an empty string ("") with some probability (typically 10-20%), so a single model learns both conditional and unconditional generation.
Training:
- With probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅) (unconditional)
- With probability 1-p_uncond: ε_θ(x_t, t, c) (conditional)
Inference:
ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))
- w: guidance scale (typically 7.5-15)
- w=1: the conditional prediction as-is
- w>1: pushed further in the direction of the text condition
```python
# Classifier-free guidance: training
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
    noise = torch.randn_like(x_0)
    t = torch.randint(0, T, (x_0.shape[0],))
    x_t = add_noise(x_0, noise, t)
    # Randomly drop the conditioning
    mask = torch.rand(x_0.shape[0]) < p_uncond
    cond = text_cond.clone()
    cond[mask] = empty_text_embedding  # null conditioning
    noise_pred = model(x_t, t, cond)
    loss = F.mse_loss(noise_pred, noise)
    return loss

# Classifier-free guidance: inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
    x_t = x_T
    for t in reversed(range(T)):
        # Unconditional prediction
        eps_uncond = model(x_t, t, empty_text_embedding)
        # Conditional prediction
        eps_cond = model(x_t, t, text_cond)
        # Guided prediction
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x_t = denoise_step(x_t, eps, t)
    return x_t
```
CFG dramatically improves sample quality and text alignment, but if the guidance scale is too high, images become oversaturated or develop artifacts.
2.3.6 DALL-E 2: CLIP-based Diffusion
DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture built on the CLIP embedding space.
```
[DALL-E 2 training pipeline]

Text ──→ CLIP Text Encoder ──→ text embedding
                                     │
                               ┌─────┴──────┐
                               │   Prior    │  text emb → CLIP image emb
                               │ (Diffusion)│
                               └─────┬──────┘
                                     │ CLIP image embedding
                               ┌─────┴──────┐
                               │  Decoder   │  CLIP image emb → 64x64 image
                               │ (Diffusion)│
                               └─────┬──────┘
                                     │ 64x64
                               ┌─────┴──────┐
                               │ Super-Res  │  64x64 → 256x256 → 1024x1024
                               │ (Diffusion)│
                               └────────────┘
```
2.3.7 Imagen: The Power of the T5 Text Encoder
Google's Imagen (Saharia et al., 2022) maximized text understanding by using a T5-XXL (4.6B parameter) text encoder.
Key findings:
- Scaling the text encoder is more effective than scaling the U-Net
- T5-XXL > CLIP ViT-L in text understanding quality
- Dynamic thresholding: stable generation even at high CFG scales
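Dynamic thresholding can be sketched as follows. This is a minimal version: the predicted x_0 is clamped to a per-sample percentile of its absolute pixel values and rescaled, which prevents the pixel saturation that plain [-1, 1] clipping causes at high guidance scales. The function name and tensor shapes here are illustrative.

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Imagen-style dynamic thresholding sketch: clamp each sample's
    predicted x0 to the `percentile` of its absolute values, then divide
    by that threshold so the output stays in [-1, 1]."""
    B = x0_pred.shape[0]
    flat = x0_pred.reshape(B, -1).abs()
    s = torch.quantile(flat, percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(B, 1, 1, 1)  # never shrink the valid range
    return torch.minimum(torch.maximum(x0_pred, -s), s) / s
```

This is applied at every sampling step after the x_0 prediction, before renoising to the next timestep.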
```
[Imagen architecture]

Text ──→ T5-XXL (frozen) ──→ text embeddings
                                   │
                             ┌─────┴──────┐
                             │ Base Model │  generates 64x64
                             │  (U-Net)   │  cross-attention
                             └─────┬──────┘
                                   │
                             ┌─────┴──────┐
                             │ SR Model 1 │  64x64 → 256x256
                             │  (U-Net)   │
                             └─────┬──────┘
                                   │
                             ┌─────┴──────┐
                             │ SR Model 2 │  256x256 → 1024x1024
                             │  (U-Net)   │
                             └────────────┘
```
2.3.8 DiT: Diffusion Transformer
**DiT (Diffusion Transformer)** (Peebles & Xie, 2023) replaced the U-Net with a Transformer and has become the backbone of choice in recent T2I models.
```
[DiT block]

Input tokens (patchified latent)
        │
 ┌──────┴──────┐
 │  LayerNorm  │ ← adaLN-Zero (adaptive):
 │ (adaptive)  │   γ, β = MLP(timestep + class)
 └──────┬──────┘
        │
 ┌──────┴──────┐
 │    Self-    │
 │  Attention  │
 └──────┬──────┘
        │ (+ residual)
 ┌──────┴──────┐
 │  LayerNorm  │ ← adaLN-Zero
 │ (adaptive)  │
 └──────┬──────┘
        │
 ┌──────┴──────┐
 │  Pointwise  │
 │     FFN     │
 └──────┬──────┘
        │ (+ residual, scaled by α)
        ▼
  Output tokens
```
DiT's key design decisions:
- Patchify: split the latent into p x p patches and linearly project each one, producing a token sequence
- adaLN-Zero: adaptive layer normalization; timestep and class information are injected through the LN parameters, with the residual scale α initialized to zero
- Scaling: systematic scaling behavior confirmed across model depth and width
| DiT variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
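The patchify step and the adaLN modulation can be sketched as follows. Dimensions follow DiT-S/2 (patch size 2, width 384); the rest is a minimal illustration of the input stage and the modulation helper, not the full DiT block.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify sketch: split a latent (B, C, H, W) into p×p patches and
    project each to a d-dimensional token, as in DiT's input stage."""
    def __init__(self, in_channels=4, patch_size=2, dim=384):
        super().__init__()
        # A strided conv is equivalent to "split into patches + linear projection"
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, z):
        x = self.proj(z)                     # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

def modulate(x, shift, scale):
    """adaLN modulation: shift and scale are regressed by an MLP from the
    timestep (+ class) embedding, per sample."""
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

In adaLN-Zero, the MLP that produces the shift, scale, and residual gate α is zero-initialized, so each block starts as the identity function.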
2.4 Autoregressive T2I
2.4.1 DALL-E (original): Token-based Autoregressive Generation
DALL-E (Ramesh et al., 2021) converts images into discrete tokens, concatenates text tokens and image tokens into a single sequence, and learns their joint distribution with an autoregressive Transformer.
```
[DALL-E training pipeline]

Stage 1: dVAE training
Image (256x256) ──→ dVAE Encoder ──→ 32x32 grid of tokens (8192-token vocabulary)
                ──→ dVAE Decoder ──→ reconstructed image
Loss: reconstruction + KL divergence (Gumbel-Softmax relaxation)

Stage 2: autoregressive Transformer training
[BPE text tokens (256)] + [image tokens (1024)] = 1280 tokens
```
Transformer (12B params):
- 64 layers, 62 attention heads
- Objective: next-token prediction (cross-entropy)
- Text tokens use causal attention (left to right)
- Image tokens are generated autoregressively in row-major order
- Includes text→image cross-attention
2.4.2 Parti: Encoder-Decoder Approach
Google's Parti (Yu et al., 2022) frames T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an encoder-decoder Transformer.
Key characteristics:
- ViT-VQGAN: a Vision Transformer-based image tokenizer
- Encoder-decoder: the encoder embeds the text, the decoder generates image tokens
- Scaling: systematic scale-up from 350M → 3B → 20B parameters
- The 20B model matches Imagen in quality
2.4.3 CM3Leon: Efficient Multimodal Autoregression
Meta's CM3Leon (Yu et al., 2023) substantially improved the efficiency of the autoregressive approach:
- Retrieval-augmented training: relevant image-text pairs are retrieved and added to the context during training
- Decoder-only: unlike Parti, a pure decoder-only architecture
- Instruction tuning: supervised fine-tuning across diverse tasks
- 5x lower training cost: comparable performance at one-fifth the training compute
- Achieves a zero-shot MS-COCO FID of 4.88
2.5 Flow Matching: The Next Training Paradigm
2.5.1 Flow Matching Basics
Flow Matching (Lipman et al., 2023) replaces diffusion's stochastic process with a **deterministic ODE (ordinary differential equation)**, learning a **straight path** from the noise distribution to the data distribution.
```
[Diffusion vs. Flow Matching]

Diffusion (stochastic):              Flow Matching (deterministic):
  x_0 ~~~> x_T                         x_0 ──────> x_1
  (curved paths, many steps needed)    (straight paths, few steps possible)

  x₀ •                                 x₀ •
      \  ← curved                          \  ← straight
       \    path                            \    path
        \                                    \
         \                                    \
    x_T •                                 x₁ •  (= noise)

  dx = f(x,t)dt + g(t)dW               dx/dt = v_θ(x_t, t)
  (SDE-based)                          (ODE-based: learn a velocity field)
```
The Flow Matching training objective:
```
L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]

where:
  x_t = (1 - t) * x_0 + t * x_1   (linear interpolation)
  u_t = x_1 - x_0                 (target velocity: the straight path)
  t   ~ Uniform(0, 1)             (or logit-normal)
  x_0 ~ p_data                    (real data)
  x_1 ~ N(0, I)                   (Gaussian noise)
```
2.5.2 Rectified Flow
Rectified Flow (Liu et al., 2023; ICLR 2023 Spotlight) is the key Flow Matching variant, connecting noise-data pairs with straight lines from an optimal transport perspective.
Key ideas:
- 1-Rectified Flow: randomly pair data x_0 with noise x_1 and learn the straight path between them
- 2-Rectified Flow (Reflow): re-train on pairs generated by the 1-Rectified Flow model, making the learned paths even straighter
- Distillation: distill the straightened model into a 1-step model
```python
# Core Rectified Flow training step
def rectified_flow_train_step(model, x_0, x_1=None):
    """
    x_0: real data (latents)
    x_1: noise (sampled randomly if None)
    """
    if x_1 is None:
        x_1 = torch.randn_like(x_0)
    # Timestep sampling (logit-normal, as in SD3)
    t = torch.sigmoid(torch.randn(x_0.shape[0]))  # logit-normal
    t = t.view(-1, 1, 1, 1)
    # Linear interpolation
    x_t = (1 - t) * x_0 + t * x_1
    # Target velocity (the straight-line direction)
    target_v = x_1 - x_0
    # Predict the velocity
    v_pred = model(x_t, t)
    # Loss
    loss = F.mse_loss(v_pred, target_v)
    return loss
```
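Sampling from a trained velocity field is plain Euler integration of the ODE, which is why near-straight paths allow so few steps. The sketch below uses the same convention as the training code above (t = 1 is pure noise, t = 0 is data); it is a minimal illustration, not a production sampler.

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, num_steps=4):
    """Few-step Euler integration of dx/dt = v_θ(x_t, t) from t=1 (noise)
    down to t=0 (data). With a well-rectified, near-straight velocity
    field, even a handful of steps gives usable samples."""
    x = torch.randn(shape)  # start at pure noise (t = 1)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = t.expand(shape[0]).view(-1, 1, 1, 1)
        v = model(x, t_batch)        # predicted velocity at (x, t)
        x = x + (t_next - t) * v     # Euler step (dt is negative)
    return x
```

Reflow makes the field straighter, so the Euler discretization error shrinks; in the limit of perfectly straight paths, one step suffices.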
2.5.3 Flow Matching in Stable Diffusion 3
SD3 (Esser et al., 2024) was the first model to apply Rectified Flow at large scale for T2I. Key contributions of the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":
1. Logit-Normal Timestep Sampling:
Timesteps are sampled from a logit-normal distribution instead of a uniform one, weighting the middle of the trajectory (the hardest prediction region) more heavily.
```python
# SD3's logit-normal timestep sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
    """Puts more weight on intermediate timesteps."""
    u = torch.randn(batch_size) * s + m
    t = torch.sigmoid(u)  # in (0, 1)
    return t
```
2. MM-DiT (Multimodal Diffusion Transformer):
SD3 introduced a new Transformer architecture that keeps separate weights for the text and image streams while allowing bidirectional information flow between them.
```
[MM-DiT block]

 Image tokens          Text tokens
      │                     │
 ┌────┴────┐           ┌────┴────┐
 │adaLN(t) │           │adaLN(t) │
 └────┬────┘           └────┬────┘
      │                     │
      └──────┬──────────────┘
             │ (concatenate)
      ┌──────┴──────┐
      │ Joint Self- │ ← image and text tokens
      │  Attention  │   attend to each other
      └──────┬──────┘
             │ (split)
      ┌──────┴─────────────┐
      │                    │
 ┌────┴────┐          ┌────┴────┐
 │   FFN   │          │   FFN   │
 │ (image) │          │ (text)  │
 └────┬────┘          └────┬────┘
      │                    │
 Image out             Text out
```
3. Scaling behavior:
| Model | Blocks | Parameters | Validation loss |
|---|---|---|---|
| SD3-S | 15 | 450M | High |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Low (best performance) |
Validation loss was observed to decrease smoothly as model size and training steps increase.
2.5.4 Flux: Black Forest Labs' Flow Matching Model
Flux (Black Forest Labs, 2024) builds on SD3's Rectified Flow + Transformer architecture.
| Variant | Training method | Inference steps | Characteristics |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |
Guidance distillation: a student model learns to reproduce, without CFG, the output of a teacher that uses CFG, removing the 2x forward pass that CFG requires at inference.
Latent Adversarial Diffusion Distillation (LADD): combines a GAN-style adversarial loss with diffusion distillation, enabling 1-4 step generation.
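A guidance-distillation training step might look like the following sketch. The exact Flux recipe is not public; in particular, having the student take the guidance scale w as an extra input is an assumption of this illustration, chosen so a single student can cover a range of scales.

```python
import torch
import torch.nn.functional as F

def guidance_distillation_loss(student, teacher, x_t, t, cond, null_cond, w):
    """Guidance distillation sketch: the student learns to match the
    teacher's CFG-combined prediction in one forward pass, removing the
    double (conditional + unconditional) evaluation at inference."""
    with torch.no_grad():
        eps_uncond = teacher(x_t, t, null_cond)
        eps_cond = teacher(x_t, t, cond)
        # The teacher's guided output is the distillation target
        target = eps_uncond + w * (eps_cond - eps_uncond)
    pred = student(x_t, t, cond, w)  # student conditioned on w (assumption)
    return F.mse_loss(pred, target)
```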
3. Text Conditioning Methods
Text conditioning is the mechanism by which the meaning of a text prompt is injected into the image generation process. The choice of text encoder and conditioning scheme has a decisive effect on quality.
3.1 CLIP Text Encoder
OpenAI's CLIP (Contrastive Language-Image Pre-training; Radford et al., 2021) was contrastively trained on 400 million image-text pairs.
```
[CLIP training]

Image ──→ Image Encoder ──→ image embedding ─┐
                                             ├─ cosine similarity
Text  ──→ Text Encoder  ──→ text embedding  ─┘

Objective: similarity ↑ for matched pairs, ↓ for mismatched pairs
(InfoNCE loss)
```
Characteristics of the CLIP text encoder:
- Both the token-level sequence embeddings and the pooled [EOS] embedding can be used
- Maximum length of 77 tokens
- Strong image-text alignment
- Text understanding specialized for visual concepts
| CLIP variant | Parameters | Used by |
|---|---|---|
| CLIP ViT-L/14 | ~124M (text) | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M (text) | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M (text) | SDXL (primary) |
| CLIP ViT-L/14 | ~124M (text) | SDXL (secondary) |
3.2 T5 Text Encoder
Google's T5 (Text-to-Text Transfer Transformer; Raffel et al., 2020) is a large language model trained on pure text corpora.
T5's advantages (demonstrated in the Imagen paper):
- Trained on a far larger text corpus than CLIP (the C4 dataset)
- Better at complex sentence structure and relational understanding
- Handles complex prompts: spatial relations, counts, attribute combinations
- Scaling the text encoder is more effective than scaling the U-Net (Imagen's key finding)
| T5 variant | Parameters | Used by |
|---|---|---|
| T5-Small | 60M | Research/experiments |
| T5-Base | 220M | Research/experiments |
| T5-Large | 770M | Research/experiments |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |
3.3 Cross-Attention Mechanism
Cross-attention is the core mechanism that injects text information into image features inside the U-Net or DiT.
```python
# Cross-attention implementation
class CrossAttention(nn.Module):
    def __init__(self, d_model, d_context, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)    # latent → Q
        self.to_k = nn.Linear(d_context, d_model, bias=False)  # text → K
        self.to_v = nn.Linear(d_context, d_model, bias=False)  # text → V
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: (B, H*W, d_model) - image latent features
        context: (B, seq_len, d_context) - text embeddings
        """
        B = x.shape[0]
        q = self.to_q(x)        # the image provides the queries
        k = self.to_k(context)  # the text provides the keys
        v = self.to_v(context)  # the text provides the values
        # Multi-head reshape
        q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Attention
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        attn = F.softmax(attn, dim=-1)
        out = attn @ v
        out = out.transpose(1, 2).reshape(B, -1, self.n_heads * self.d_head)
        return self.to_out(out)
```
3.4 Pooled vs. Sequence Text Embeddings
Modern T2I models use two kinds of text embeddings at the same time:
```
[Types of text embeddings]

Text: "a photo of a cat"
          │
     ┌────┴────┐
     │  Text   │
     │ Encoder │
     └────┬────┘
          │
     ┌────┴──────────────────────┐
     │                           │
     ▼                           ▼
Sequence embeddings        Pooled embedding
(token-level)              (sentence-level)
[h_1, h_2, ..., h_n]       h_pool = h_[EOS]
Shape: (seq_len, d)        Shape: (d,)
     │                           │
     ▼                           ▼
Used in cross-attention    Used for global conditioning
(fine-grained per-token    (whole-sentence meaning)
 information)              - Added to the timestep embedding
                           - Modulates adaLN parameters
                           - Vector conditioning
```
How SDXL uses its dual text encoders:
```python
# SDXL text conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
    # CLIP ViT-L: sequence embeddings (77, 768)
    clip_l_output = clip_l(text)
    clip_l_seq = clip_l_output.last_hidden_state   # (77, 768)
    clip_l_pooled = clip_l_output.pooler_output    # (768,)
    # OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
    clip_g_output = clip_g(text)
    clip_g_seq = clip_g_output.last_hidden_state   # (77, 1280)
    clip_g_pooled = clip_g_output.pooler_output    # (1280,)
    # Concatenate sequence embeddings → used in cross-attention
    text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1)          # (77, 2048)
    # Concatenate pooled embeddings → used for vector conditioning
    pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # (2048,)
    return text_embeddings, pooled_embeddings
```
SD3 and Flux additionally combine T5-XXL sequence embeddings, forming a triple-text-encoder setup:
| Encoder | Role | Output shape | Use |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
4. Training Datasets
A T2I model's quality depends directly on the scale and quality of its training data. The major large-scale datasets:
4.1 Dataset Comparison
| Dataset | Scale | Source | Filtering | Notable users |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Google search | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Google search | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawl | Top 10% by CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image pairs | Research |
| SAM | 11M images | Diverse sources | Manual + model-based | Segmentation + T2I |
| Internal (proprietary) | Billions of pairs | Undisclosed | Undisclosed | DALL-E 3, Midjourney |
4.2 The LAION-5B Filtering Pipeline
LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:
```
[LAION-5B collection and filtering pipeline]

Common Crawl (web archive)
        │
        ▼
1. HTML parsing: extract src URL + alt-text from <img> tags
        │
        ▼
2. Image download (img2dataset)
   - Minimum resolution filter: width, height ≥ 64
   - Maximum aspect ratio: 3:1
        │
        ▼
3. CLIP similarity filtering
   - Compute image-text similarity with OpenAI CLIP ViT-B/32
   - English: cosine similarity ≥ 0.28
   - Other languages: cosine similarity ≥ 0.26
        │
        ▼
4. Safety filtering
   - NSFW detection score (CLIP-based)
   - Watermark detection score
   - Toxic content detection
        │
        ▼
5. Deduplication
   - Hash-based exact-duplicate removal
   - CLIP-embedding-based near-duplicate removal
        │
        ▼
Result: 5.85B image-text pairs
   - 2.32B English
   - 2.26B in 100+ other languages
   - 1.27B with undetermined language
```
4.3 Data Quality Assessment
Recent models increasingly prioritize data quality over raw quantity:
1. CLIP-score-based filtering:
```python
# Computing a CLIP score
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
clip_score = outputs.logits_per_image.item() / 100.0  # normalized
```
2. Aesthetic score filtering:
LAION-Aesthetics trains a separate aesthetic predictor (CLIP embedding → MLP → score) and keeps only images with an aesthetic score of 4.5 or higher.
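Such a predictor can be sketched as a small MLP over CLIP image embeddings. The layer sizes below are illustrative, not the released LAION checkpoint's exact architecture; `clip_dim=768` assumes CLIP ViT-L embeddings.

```python
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    """Sketch of a LAION-style aesthetic predictor: a small MLP mapping a
    CLIP image embedding to a scalar aesthetic score."""
    def __init__(self, clip_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, clip_embedding):
        # Normalize, since CLIP similarity operates on unit vectors
        emb = clip_embedding / clip_embedding.norm(dim=-1, keepdim=True)
        return self.mlp(emb).squeeze(-1)

# Dataset filtering then keeps images whose predicted score passes a threshold:
# scores = predictor(clip_embeddings); keep = scores > 4.5
```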
3. Caption quality improvement (DALL-E 3's key innovation):
DALL-E 3 (Betker et al., 2023) achieved dramatic gains purely by improving caption quality, with no architecture change:
- A dedicated image captioning model was trained to produce detailed synthetic captions
- Training used 95% synthetic captions + 5% original captions
- Three variants were compared: short synthetic, detailed synthetic, and human annotations
- Detailed synthetic captions were overwhelmingly the best
```
[DALL-E 3 caption improvement]

Before: "cat on table"
        → vague, lacking detail

After:  "A fluffy orange tabby cat sitting on a round wooden
         dining table, natural sunlight streaming through a
         window behind, casting soft shadows. The cat has
         bright green eyes and is looking directly at the camera."
        → detailed attributes, spatial relations, lighting
```
4.4 Data Preprocessing Techniques
| Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop a square from the image center | Resolution standardization |
| Random Crop | Crop at a random position | Data augmentation |
| Bucket Sampling | Group images of similar aspect ratio | Multi-aspect-ratio training (SDXL) |
| Caption Dropout | Replace the caption with an empty string at some rate | Supports CFG training |
| Multi-resolution | Staged training from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Randomly shuffle tag order | Reduces tag-order bias |
5. Fine-tuning & Customization Techniques
Fine-tuning techniques that adapt a pretrained T2I model to a specific style, subject, or control signal are central to practical use.
5.1 LoRA (Low-Rank Adaptation)
LoRA (Hu et al., 2022) fine-tunes the weights of large models efficiently and is a workhorse for T2I models as well.
```
[How LoRA works]

Original weights:  W_0 ∈ R^{d×k}  (frozen)
LoRA update:       ΔW = B × A   where A ∈ R^{r×k}, B ∈ R^{d×r}
Output:            h = W_0 x + ΔW x = W_0 x + B(Ax)
```
- r << min(d, k): low rank (typically 4, 8, 16, 32, or 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- The original weights stay frozen → memory-efficient
```python
# Example LoRA wrapper (for attention layers in the Stable Diffusion U-Net)
class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1.0):
        super().__init__()
        self.original = original_layer  # frozen
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        # LoRA layers
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank
        # Initialization
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)  # B starts at 0 → initially identical to the original

    def forward(self, x):
        original_out = self.original(x)         # frozen original output
        lora_out = self.lora_B(self.lora_A(x))  # LoRA update
        return original_out + self.scale * lora_out
```
LoRA training setup (using Diffusers):
```bash
# Example Diffusers LoRA training run
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --rank=4 \
  --mixed_precision="fp16" \
  --output_dir="./sdxl-naruto-lora"
```
| LoRA parameter | Typical range | Effect |
|---|---|---|
| Rank (r) | 4-128 | Higher → more capacity, more memory |
| Alpha (α) | equal to rank, up to 2x | Scales the update strength |
| Target modules | attn Q,K,V,O + FFN | Where LoRA is applied |
| Learning rate | 1e-4 to 1e-5 | Convergence speed |
| Training time | 5-30 min (single GPU) | Enables fast iteration |
| File size | 1-200 MB | Easy to share and distribute |
5.2 DreamBooth
DreamBooth (Ruiz et al., 2023) injects the concept of a specific subject into the model from just 3-5 images.
```
[DreamBooth training]

Input: 3-5 images of the subject + a unique identifier "[V]"
       e.g., "a [V] dog" (one specific dog)

Training strategy:
1. Fine-tune the model on the subject images
   - "a [V] dog" → images of that dog
2. Prior preservation loss (the key!)
   - Pre-generate "a dog" images with the original model
   - During fine-tuning, keep "a dog" → generic-dog generation intact
   - Prevents language drift

L = L_recon([V] images) + λ * L_prior(class images)
```
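The combined objective can be sketched as a single training step. All helper names here (`add_noise`, the precomputed latents, timesteps, and conditions) are placeholders for the noise scheduler and the text/VAE encoders; the point is the two-term structure of the loss.

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(model, inst_latents, inst_t, inst_noise, inst_cond,
                    class_latents, class_t, class_noise, class_cond,
                    add_noise, prior_weight=1.0):
    """DreamBooth objective sketch: denoising loss on the subject images
    ("a [V] dog") plus a prior-preservation loss on class images that were
    generated by the frozen original model ("a dog")."""
    # Subject (instance) term
    noisy = add_noise(inst_latents, inst_noise, inst_t)
    inst_loss = F.mse_loss(model(noisy, inst_t, inst_cond), inst_noise)
    # Prior-preservation term, weighted by λ
    noisy_c = add_noise(class_latents, class_noise, class_t)
    prior_loss = F.mse_loss(model(noisy_c, class_t, class_cond), class_noise)
    return inst_loss + prior_weight * prior_loss
```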
```bash
# DreamBooth + LoRA training (recommended combination), using the diffusers library
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./my_dog_images" \
  --instance_prompt="a photo of sks dog" \
  --class_data_dir="./class_dog_images" \
  --class_prompt="a photo of dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --max_train_steps=500 \
  --rank=4 \
  --mixed_precision="fp16"
```
5.3 Textual Inversion
Textual Inversion (Gal et al., 2023) learns only a new token embedding, without modifying the model weights at all.
```
[Textual Inversion]

Existing token space: [cat] [dog] [car] [tree] ...
                                │
New token added:              [S*]  ← the new concept to learn
                                │
Training: optimize only the embedding vector of [S*] on 3-5 images;
          the rest of the model stays entirely frozen
```
Pros: minuscule parameter count (one token = 768 or 1024 floats)
Cons: less expressive than LoRA/DreamBooth
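The optimization loop reduces to training a single vector. The sketch below substitutes a toy loss for the frozen text encoder + diffusion model (the real loss embeds "a photo of [S*]" with the learned vector at the [S*] position and backpropagates the denoising MSE); its only purpose is to show that `new_token_embedding` is the sole parameter receiving gradients.

```python
import torch

# Only one new embedding vector is trainable; 768 matches CLIP ViT-L's
# text width and is illustrative.
text_encoder_dim = 768
new_token_embedding = torch.nn.Parameter(torch.zeros(text_encoder_dim))
optimizer = torch.optim.AdamW([new_token_embedding], lr=1e-1)

def toy_denoising_loss(embedding):
    # Stand-in for: embed the prompt with `embedding` substituted at [S*],
    # run the frozen diffusion model, return the denoising MSE loss.
    target = torch.ones_like(embedding)
    return ((embedding - target) ** 2).mean()

initial_loss = toy_denoising_loss(new_token_embedding).item()
for _ in range(50):
    loss = toy_denoising_loss(new_token_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
final_loss = toy_denoising_loss(new_token_embedding).item()
```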
5.4 ControlNet
ControlNet (Zhang & Agrawala, 2023) adds **structural conditions (edges, depth, pose, etc.)** to a pretrained diffusion model.
```
[ControlNet architecture]

            Control input (e.g., a Canny edge map)
                        │
                  ┌─────┴─────┐
                  │   Zero    │
                  │   Conv    │
                  └─────┬─────┘
                        │
                  ┌─────┴─────┐
Locked U-Net      │ Trainable │ ← a copy of the U-Net encoder
(original,        │  Copy of  │   (trainable)
 frozen)          │ U-Net Enc │
   │              └─────┬─────┘
   │                    │
   │              ┌─────┴─────┐
   │              │   Zero    │ ← outputs 0 early in training
   │              │   Conv    │   (starts without affecting the original model)
   │              └─────┬─────┘
   │                    │
   └──────── + ─────────┘  ← added to the original U-Net features
             │
        Final output
```
ControlNet의 핵심 학습 기법 - Zero Convolution:
# Zero Convolution: 가중치와 바이어스를 0으로 초기화
class ZeroConv(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
self.conv = nn.Conv2d(in_channels, out_channels, 1)
nn.init.zeros_(self.conv.weight)
nn.init.zeros_(self.conv.bias)
def forward(self, x):
return self.conv(x)
# 학습 초기: zero conv 출력 = 0
# → ControlNet 추가해도 원본 모델 출력에 영향 없음
# → 학습이 진행되면서 점차 control signal 반영
| 조건 유형 | 입력 | 용도 |
|---|---|---|
| Canny Edge | 엣지 맵 | 윤곽선 기반 생성 |
| Depth | 깊이 맵 | 3D 구조 보존 |
| OpenPose | 관절 위치 | 인체 포즈 제어 |
| Semantic Segmentation | 세그멘테이션 맵 | 레이아웃 제어 |
| Scribble | 낙서 | 대략적 구도 |
| Normal Map | 표면 법선 맵 | 3D 형태 제어 |
| Tile | 저해상도/타일 | Super-resolution |
5.5 IP-Adapter
Ye et al.(2023)의 IP-Adapter(Image Prompt Adapter)는 이미지를 프롬프트로 사용하여 스타일이나 주체를 전달하는 어댑터다.
[IP-Adapter 아키텍처]
Reference Image ──→ CLIP Image Encoder ──→ image features
│
┌─────┴─────┐
│ Projection │ ← 학습 대상
│ Layer │
└─────┬─────┘
│
┌─────┴─────┐
│ Decoupled │ ← 별도의 cross-attention
│ Cross-Attn │ (텍스트 cross-attn과 분리)
└─────┬─────┘
│
Original U-Net Cross-Attention ────── + ───────┘
(텍스트 conditioning)
출력 = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
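위 수식의 decoupled cross-attention을 최소 형태로 스케치하면 다음과 같다. 차원과 모듈 이름은 설명용 가정이다:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """텍스트용 K/V와 이미지용 K/V를 분리해 두 attention 출력을 합산하는 스케치."""
    def __init__(self, dim: int = 320, ctx_dim: int = 768, scale: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k_text = nn.Linear(ctx_dim, dim)
        self.to_v_text = nn.Linear(ctx_dim, dim)
        # IP-Adapter가 새로 학습하는 부분: 이미지 feature용 K/V projection
        self.to_k_img = nn.Linear(ctx_dim, dim)
        self.to_v_img = nn.Linear(ctx_dim, dim)
        self.scale = scale  # λ: 이미지 조건 반영 강도

    def forward(self, x, text_ctx, img_ctx):
        q = self.to_q(x)
        attn_text = F.scaled_dot_product_attention(
            q, self.to_k_text(text_ctx), self.to_v_text(text_ctx))
        attn_img = F.scaled_dot_product_attention(
            q, self.to_k_img(img_ctx), self.to_v_img(img_ctx))
        return attn_text + self.scale * attn_img

attn = DecoupledCrossAttention()
out = attn(torch.randn(1, 64, 320),   # latent 토큰
           torch.randn(1, 77, 768),   # 텍스트 임베딩
           torch.randn(1, 4, 768))    # projection된 CLIP 이미지 feature
```

추론 시 scale(λ)을 조절하면 텍스트 조건과 이미지 조건 사이의 비중을 제어할 수 있다.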
5.6 Fine-tuning 기법 비교
| 기법 | 수정 대상 | 학습 이미지 수 | 학습 시간 | 파일 크기 | 주요 용도 |
|---|---|---|---|---|---|
| LoRA | Attention 가중치 (low-rank) | 수십~수천 | 5-30분 | 1-200MB | 스타일, 개념 |
| DreamBooth | 전체 모델 or + LoRA | 3-10 | 5-60분 | 2-7GB (전체) 또는 1-200MB (LoRA) | 특정 주체 |
| Textual Inversion | 토큰 임베딩만 | 3-10 | 30분-수시간 | 수 KB | 단순 개념 |
| ControlNet | U-Net Encoder 복사본 | 수만~수십만 | 수일 | ~1.5GB | 구조적 제어 |
| IP-Adapter | Projection + Cross-Attn | 대규모 | 수일 | ~100MB | 이미지 프롬프팅 |
6. 최신 트렌드 (2024-2026)
6.1 Consistency Models
Song et al.(2023)의 Consistency Models는 diffusion model의 다단계 생성을 1-step 또는 few-step으로 단축하는 방법이다.
[Consistency Models 핵심 아이디어]
Diffusion: x_T → x_{T-1} → ... → x_1 → x_0 (수백 스텝)
Consistency:
PF-ODE trajectory 위의 모든 점 x_t가
동일한 x_0로 매핑되도록 학습
f_θ(x_t, t) = x_0 ∀t ∈ [0, T]
경계 조건: f_θ(x_ε, ε) = x_ε, self-consistency: 같은 trajectory 위에서 f_θ(x_t, t) = f_θ(x_t', t')
x_T ────→ f_θ ────→ x_0
│ ↑
x_t ────→ f_θ ───────┘ (같은 x_0로 매핑!)
│ ↑
x_t' ────→ f_θ ───────┘
두 가지 학습 방법:
| 방법 | 설명 | 장점 | 단점 |
|---|---|---|---|
| Consistency Distillation (CD) | 사전학습된 diffusion model 필요, PF-ODE 시뮬레이션 | 높은 품질 | teacher 모델 필요 |
| Consistency Training (CT) | 실제 데이터에서 직접 학습 | teacher 불필요 | CD보다 품질 다소 낮음 |
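CT의 핵심(같은 trajectory 위 인접 시점의 출력 일치)을 최소 형태로 스케치하면 다음과 같다. f의 시그니처와 σ(t)=t 형태의 전진 과정은 설명을 위한 가정이며, 실제 구현은 distance metric(LPIPS 등)과 시간 스케줄이 훨씬 정교하다:

```python
import torch
import torch.nn.functional as F

def consistency_training_loss(f_online, f_ema, x0, t1, t2):
    """같은 x0에서 만든 인접 시점 샘플 두 개가 같은 출력으로 매핑되도록
    online 모델을 EMA teacher 출력에 맞추는 CT 손실 스케치."""
    noise = torch.randn_like(x0)
    x_t1 = x0 + t1 * noise          # σ(t) = t 인 EDM식 전진 과정 가정
    x_t2 = x0 + t2 * noise
    pred = f_online(x_t2, t2)
    with torch.no_grad():
        target = f_ema(x_t1, t1)    # 더 작은 t 쪽 출력이 타깃
    return F.mse_loss(pred, target)

# f(x, t) = x 인 더미 모델로 동작 확인
identity = lambda x, t: x
loss = consistency_training_loss(identity, identity,
                                 torch.zeros(2, 3, 8, 8), t1=0.5, t2=1.0)
```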
성능:
- CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
- ImageNet 64x64: FID 6.20 (1-step)
후속 연구인 **Improved Consistency Training (iCT)**와 **Latent Consistency Models (LCM)**은 이를 대규모 T2I 모델에 적용하여, SDXL 수준의 모델에서 2-4 step 생성을 가능하게 했다.
6.2 DiT (Diffusion Transformer) 아키텍처의 확산
2024년 이후 DiT는 U-Net을 대체하여 T2I의 주류 backbone이 되고 있다:
| 모델 | 연도 | Backbone | 파라미터 | 핵심 특징 |
|---|---|---|---|---|
| DiT (원본) | 2023 | Transformer | 675M | 클래스 조건부, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, 저비용 학습 |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K 해상도, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, 3중 text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation 변형 |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | 중국어+영어 bilingual |
| Lumina-T2X | 2024 | DiT | 다양 | Multi-modal generation |
6.3 PixArt-alpha 와 PixArt-sigma
**PixArt-alpha(Chen et al., 2023)**는 효율적 DiT 학습의 선구자적 모델이다:
핵심 혁신 - Training Decomposition (학습 분해):
[PixArt-alpha 3단계 학습]
Stage 1: Pixel Dependency 학습 (저비용)
- ImageNet 사전학습된 DiT에서 시작
- 클래스 조건부 → T2I 전환의 기초
Stage 2: Text-Image Alignment 학습
- Cross-attention으로 텍스트 조건 주입
- LLaVA로 생성한 고품질 synthetic caption 사용
Stage 3: High-quality Aesthetic 학습
- 고품질 미적 데이터셋으로 fine-tuning
- JourneyDB 등 활용
총 학습 비용: ~675 A100 GPU days
(SD 1.5의 ~6,250 A100 GPU days 대비 10.8%)
**PixArt-sigma(Chen et al., 2024)**의 개선점:
- Weak-to-Strong Training: 약한 모델(PixArt-alpha)에서 출발해 더 고품질 데이터로 추가 학습하며 점진적으로 강화
- KV Compression: Attention에서 Key와 Value를 압축하여 효율성 향상 → 4K 해상도 가능
- 0.6B 파라미터로 SDXL(2.6B)과 대등한 성능
6.4 SDXL, SD3, Flux 비교
[세대별 Stable Diffusion 계보]
| 항목 | SD 1.x (2022) | SDXL (2023) | SD3 (2024) | Flux (2024) |
|---|---|---|---|---|
| Backbone | U-Net 860M | U-Net 2.6B | MM-DiT 2-8B | MM-DiT ~12B |
| Text Encoder | CLIP ViT-L | CLIP-L + OpenCLIP-G | CLIP-L + OpenCLIP-G + T5-XXL | CLIP-L + T5-XXL |
| 학습 목적 | Diffusion (DDPM) | Diffusion (DDPM) | Rectified Flow | Rectified Flow |
| 기본 해상도 | 512x512 | 1024x1024 | 1024x1024 | 1024x1024+ |
| Guidance | CFG 7.5 | CFG 5-9 | CFG 3.5-7 | Guidance Distillation |
6.5 DALL-E 3의 학습 혁신
DALL-E 3(Betker et al., 2023)의 핵심 혁신은 학습 데이터 캡션 품질 개선에 있다:
- Image Captioner 학습: CoCa 기반 이미지 캡셔닝 모델을 별도 학습
- Synthetic Caption 생성: 학습 데이터 전체를 상세한 synthetic caption으로 재라벨링
- Caption Mixing: 95% synthetic + 5% original caption으로 학습
- Descriptive vs Short: 상세한 설명형 캡션이 짧은 태그형보다 우수
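위의 caption mixing(95% synthetic + 5% original)은 다음과 같이 스케치할 수 있다. 함수명은 설명용 가정이다:

```python
import random

def pick_caption(synthetic: str, original: str, p_synth: float = 0.95) -> str:
    """95%는 상세한 synthetic caption, 5%는 원본 caption을 사용하는 스케치."""
    return synthetic if random.random() < p_synth else original

random.seed(0)
picks = [pick_caption("detailed synthetic caption", "alt text") for _ in range(10_000)]
synth_ratio = picks.count("detailed synthetic caption") / len(picks)
```

원본 caption을 일부 섞는 이유는, 추론 시 사용자가 입력하는 짧고 불규칙한 프롬프트에 대한 일반화를 유지하기 위해서다.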
6.6 Playground v2.5의 3대 통찰
Playground v2.5(Li et al., 2024)는 SDXL 아키텍처 기반에서 학습 전략 개선으로 DALL-E 3, Midjourney 5.2를 능가했다:
1. EDM Noise Schedule 채택:
# EDM Framework (Karras et al., 2022)
# σ(t) 기반 noise schedule - Zero Terminal SNR 보장
# 기존 SD의 linear schedule 대비 색상/대비 크게 개선
def edm_precondition(sigma, x_noisy, F_theta):
    """EDM Preconditioning (σ_data = 1로 단순화한 형태)"""
c_skip = 1.0 / (sigma ** 2 + 1)
c_out = sigma / (sigma ** 2 + 1).sqrt()
c_in = 1.0 / (sigma ** 2 + 1).sqrt()
c_noise = sigma.log() / 4
D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
return D_x
2. Multi-Aspect Ratio Training:
- Bucketed dataset: 유사 종횡비 이미지를 그룹화하여 배치 구성
- 학습 시 다양한 종횡비 지원 (1:1, 4:3, 16:9 등)
3. Human Preference Alignment:
- 인간 선호도 데이터를 활용한 학습 전략
- Quality-tuning으로 미적 품질 극대화
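위 2번의 bucketed dataset 구성에서 핵심인 버킷 배정은 다음과 같이 스케치할 수 있다. 버킷 목록은 SDXL류 구현에서 흔히 쓰는 값을 흉내 낸 예시 가정이다:

```python
def assign_bucket(width: int, height: int, buckets):
    """이미지를 종횡비가 가장 가까운 (w, h) 버킷에 배정하는 스케치."""
    ar = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ar))

# 총 픽셀 수는 비슷하게 유지하면서 종횡비만 달리한 예시 버킷
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]

b1 = assign_bucket(4000, 3000, BUCKETS)   # 4:3 사진
b2 = assign_bucket(1920, 1080, BUCKETS)   # 16:9 스크린샷
```

같은 버킷에 속한 이미지끼리만 배치를 구성하면, crop으로 구도를 훼손하지 않고도 고정 shape 텐서로 학습할 수 있다.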
7. 학습 파이프라인 실전 가이드
7.1 학습 인프라
GPU 요구사항
| 학습 규모 | 권장 GPU | VRAM | 학습 기간 | 비용 (추정) |
|---|---|---|---|---|
| LoRA Fine-tuning | RTX 3090/4090 1대 | 24GB | 5-30분 | < $1 |
| DreamBooth | A100 40GB 1대 | 40GB | 30-60분 | $2-5 |
| ControlNet 학습 | A100 80GB 4-8대 | 320-640GB | 2-5일 | $500-2,000 |
| SD 1.5 수준 학습 | A100 80GB 256대 | ~20TB | 24일 | ~$150K |
| SDXL 수준 학습 | A100 80GB 512-1024대 | ~40-80TB | 수주 | ~$500K-1M |
| SD3/Flux 수준 학습 | H100 80GB 1024+대 | ~80TB+ | 수주-수개월 | > $1M |
분산 학습 전략
[대규모 분산 학습 구성]
┌─────────────────────────────────────────────────────┐
│ Data Parallel (DP/DDP) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Full │ │Full │ │Full │ │Full │ │
│ │Model │ │Model │ │Model │ │Model │ │
│ │Copy │ │Copy │ │Copy │ │Copy │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ Batch 1 Batch 2 Batch 3 Batch 4 │
│ │
│ → All-Reduce로 gradient 동기화 │
│ → 각 GPU에 서로 다른 데이터 배치 │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ FSDP (Fully Sharded Data Parallel) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Shard │ │Shard │ │Shard │ │Shard │ │
│ │ 1/4 │ │ 2/4 │ │ 3/4 │ │ 4/4 │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ → 모델 파라미터를 GPU별로 분할(shard) │
│ → Forward/Backward 시 필요한 shard만 All-Gather │
│ → 메모리 효율 극대화 (8B+ 모델 학습 가능) │
└─────────────────────────────────────────────────────┘
7.2 대표적인 학습 프레임워크: Diffusers
HuggingFace의 Diffusers 라이브러리는 T2I 모델 학습의 사실상 표준이다.
# Diffusers 기반 Text-to-Image 학습 전체 파이프라인
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F
# 1. 모델 로드
vae = AutoencoderKL.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)
# 2. VAE, Text Encoder 고정
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
# 3. Accelerator 설정 (분산 학습 + Mixed Precision)
accelerator = Accelerator(
mixed_precision="fp16", # or "bf16"
gradient_accumulation_steps=4,
)
# 4. Optimizer
optimizer = torch.optim.AdamW(
unet.parameters(),
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=1e-2,
eps=1e-8,
)
# 5. EMA 설정
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
unet.parameters(),
decay=0.9999,
use_ema_warmup=True,
)
# 6. 분산 학습 준비 (dataloader는 (images, captions) 배치를 내는 사용자 정의 로더로 가정)
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)
# 7. 학습 루프
for epoch in range(num_epochs):
for batch in dataloader:
with accelerator.accumulate(unet):
images = batch["images"]
captions = batch["captions"]
# Latent encoding
with torch.no_grad():
latents = vae.encode(images).latent_dist.sample()
latents = latents * vae.config.scaling_factor
# Text encoding
with torch.no_grad():
text_inputs = tokenizer(captions, padding="max_length",
max_length=77, return_tensors="pt")
text_embeds = text_encoder(text_inputs.input_ids)[0]
# Noise 추가
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            # Classifier-Free Guidance: 샘플 단위 랜덤 caption dropout (10%)
            drop_mask = torch.rand(text_embeds.shape[0], 1, 1) < 0.1
            text_embeds = torch.where(drop_mask, torch.zeros_like(text_embeds), text_embeds)
# Noise 예측
noise_pred = unet(noisy_latents, timesteps, text_embeds).sample
# 손실 계산
loss = F.mse_loss(noise_pred, noise)
# Backward
accelerator.backward(loss)
accelerator.clip_grad_norm_(unet.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
# EMA 업데이트
ema_unet.step(unet.parameters())
7.3 Mixed Precision Training
Mixed Precision은 FP32와 FP16/BF16을 혼합 사용하여 메모리와 계산 효율을 높이는 기법이다.
[Mixed Precision Training]
Forward Pass:
- 모델 가중치: FP16/BF16 (메모리 절반)
- Activation: FP16/BF16
Loss Scaling:
- Loss를 큰 스케일(예: 2^16)로 곱하여 gradient underflow 방지
- Backward 후 gradient를 다시 스케일 다운
Backward Pass:
- Gradient: FP16/BF16
Optimizer Step:
- Master Weights: FP32 (정밀도 유지!)
- FP32 master weights 업데이트 후 FP16 사본 생성
| Precision | 메모리 | 연산 속도 | 수치 안정성 | 권장 |
|---|---|---|---|---|
| FP32 | 4 bytes | 기준 | 최고 | Optimizer State |
| FP16 | 2 bytes | ~2x | 낮음 (overflow 위험) | Forward/Backward |
| BF16 | 2 bytes | ~2x | 높음 (넓은 범위) | H100/A100에서 권장 |
| TF32 | 4 bytes (저장) | ~1.5x | 높음 | A100 default |
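위의 Loss Scaling 단계를 개념적으로 스케치하면 다음과 같다. 실전에서는 torch.cuda.amp.GradScaler가 이 과정을 자동으로 처리하며, overflow가 감지되면 스케일을 동적으로 낮춘다:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_scale = 2.0 ** 16   # FP16 gradient underflow 방지를 위한 스케일 업

x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = F.mse_loss(model(x), y)
(loss * loss_scale).backward()     # 스케일된 loss로 backward
for p in model.parameters():
    p.grad /= loss_scale           # optimizer step 전에 다시 스케일 다운
optimizer.step()
```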
# BF16 Mixed Precision 설정 (accelerate 기반)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
7.4 EMA (Exponential Moving Average)
EMA는 학습 중 모델 가중치의 이동 평균을 유지하여 추론 시 더 안정적인 결과를 얻는 기법이다. 거의 모든 T2I 모델 학습에서 사용된다.
[EMA 업데이트]
θ_ema ← λ * θ_ema + (1 - λ) * θ_model
- λ: decay rate (보통 0.9999 ~ 0.99999)
- θ_model: 현재 학습 중인 모델 가중치
- θ_ema: EMA 가중치 (추론 시 사용)
- 효과: gradient noise를 평활화하여 더 안정적인 가중치
# Diffusers의 EMA 구현
from diffusers.training_utils import EMAModel
# EMA 모델 생성
ema_model = EMAModel(
unet.parameters(),
decay=0.9999, # decay rate
use_ema_warmup=True, # warmup 사용
inv_gamma=1.0, # warmup 파라미터
power=3/4, # warmup 파라미터
)
# 매 학습 스텝마다 업데이트
ema_model.step(unet.parameters())
# 추론 시 EMA 가중치 적용
ema_model.copy_to(unet.parameters())
# 또는 context manager 사용
with ema_model.average_parameters():
# 이 블록 안에서는 EMA 가중치 사용
output = unet(noisy_latents, timesteps, text_embeds)
7.5 학습 하이퍼파라미터 가이드
| 하이퍼파라미터 | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (총) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | 원본 해상도 |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
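표의 Warmup Steps 항목이 의미하는 linear warmup 후 constant 유지 스케줄을 스케치하면 다음과 같다. 함수명은 설명용 가정이다:

```python
def lr_with_warmup(step: int, base_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """초반 warmup_steps 동안 0에서 base_lr까지 선형 증가 후 유지하는 스케치."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

lrs = [lr_with_warmup(s) for s in (0, 5_000, 10_000, 500_000)]
```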
8. 주요 논문 레퍼런스 정리
8.1 핵심 논문 테이블
| # | 논문명 | 저자 | 연도 | 핵심 기여 | 링크 |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN 프레임워크 제안 | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector Quantized 이산 잠재 공간 | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based 생성기, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | 대규모 GAN 학습, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | 계층적 VQ-VAE, 고해상도 생성 | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Diffusion 모델의 실용적 학습 | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP 대조 학습, 이미지-텍스트 정렬 | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + Autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent Diffusion, Cross-Attention 조건화 | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP 기반 2단계 Diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | T5-XXL 텍스트 인코더의 효과, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG 학습 기법, unconditional-conditional 동시 학습 | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, 20B 스케일링 | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning 기법 | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | 체계적 Diffusion 설계 공간 분석, Preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | 새로운 토큰 임베딩 학습으로 개인화 | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | 소수 이미지로 주체 개인화, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | 구조적 제어(edge, depth, pose) 추가 | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step 생성, PF-ODE 일관성 학습 | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | 대형 U-Net, Dual Text Encoder, Multi-AR 학습 | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE 기반 Flow Matching 프레임워크 | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, Optimal Transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | 이미지 프롬프트 어댑터, Decoupled Cross-Attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | 효율적 자기회귀 T2I, Retrieval Augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | 저비용 DiT 학습, 학습 분해 전략 | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Synthetic 캡션으로 극적 품질 향상 | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-Strong 학습, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, Rectified Flow 대규모 적용, Logit-Normal | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM Noise Schedule, Multi-AR, Human Preference | arXiv:2402.17245 |
8.2 추가 참고 논문
| 논문명 | 연도 | 핵심 |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 58.5억 오픈 이미지-텍스트 데이터셋 |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, 속도 향상 |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | 단계적 증류로 추론 가속 |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step 생성 |
| Latent Consistency Models | 2024 | LCM, SDXL 기반 few-step 생성 |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL 생성 |
| Stable Cascade | 2024 | Wuerstchen 기반 3단계 계층적 생성 |
9. 결론 및 전망
Text-to-Image 모델 학습 방법론은 GAN의 adversarial training에서 출발하여, Diffusion의 iterative denoising을 거쳐, 현재 Flow Matching + DiT라는 새로운 패러다임으로 수렴하고 있다.
핵심 트렌드 요약
[T2I 학습 방법론 발전 방향]
효율성: Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
(수개월, $1M+) (수분, <$1) (수초)
아키텍처: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
(SD 1.x-SDXL) (DiT, PixArt) (SD3, Flux)
생성 속도: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
(DDPM) (DDIM, DPM++) (LCM, LADD, CM)
데이터 품질: 웹 크롤링 ──→ 필터링 ──→ Synthetic Caption
(LAION raw) (aesthetic) (DALL-E 3 방식)
텍스트 이해: CLIP only ──→ CLIP + T5 ──→ 3중 Encoder
(SD 1.x) (Imagen) (SD3, Flux)
향후 전망
학습 효율성 극대화: PixArt-alpha가 보여준 것처럼, 학습 비용을 1/10 이하로 줄이면서 품질을 유지하는 방향이 지속될 것이다.
데이터 중심 접근법(Data-Centric AI): DALL-E 3가 입증한 것처럼, 아키텍처보다 데이터 품질과 캡셔닝이 더 중요해지고 있다.
Few-Step / One-Step 생성: Consistency Models, LCM, LADD 등의 증류 기법이 발전하여 실시간 생성이 표준이 될 것이다.
Unified Multi-Modal Generation: 텍스트-이미지뿐 아니라 비디오, 3D, 오디오를 통합하는 모델로 확장되고 있다.
개인화(Personalization) 고도화: LoRA, DreamBooth, IP-Adapter를 넘어 더 적은 데이터로, 더 정확한 주체 재현이 가능해질 것이다.
T2I 모델의 학습 방법론은 단순히 "더 큰 모델을 더 많은 데이터로 학습"하는 것을 넘어, 어떤 데이터를, 어떤 스케줄로, 어떤 conditioning과 함께 학습하느냐가 핵심이 되는 시대로 접어들었다. 이 글에서 다룬 방법론들을 기반으로 자신만의 T2I 모델을 학습하거나, 기존 모델을 효과적으로 커스터마이징하는 데 활용할 수 있기를 바란다.
참고 자료
Complete Guide to Text-to-Image Model Training Methodologies: From GAN to Flow Matching
- 1. Introduction: The Evolution of Text-to-Image Generative Models
- 2. Training Methodologies by Core Architecture
- 2.1 GAN-Based: Adversarial Training
- 2.2 VAE-Based: Codebook Learning and Discrete Latent Space
- 2.3 Diffusion-Based: The Core of Current T2I
- 2.3.1 DDPM: Denoising Diffusion Probabilistic Models
- 2.3.2 Noise Scheduling
- 2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion
- 2.3.4 Structure of the U-Net Backbone
- 2.3.5 Classifier-Free Guidance (CFG)
- 2.3.6 DALL-E 2: CLIP-Based Diffusion
- 2.3.7 Imagen: The Power of T5 Text Encoder
- 2.3.8 DiT: Diffusion Transformer
- 2.4 Autoregressive-Based T2I
- 2.5 Flow Matching: The Next-Generation Training Paradigm
- 3. Text Conditioning Methodologies
- 4. Training Datasets
- 5. Fine-tuning & Customization Techniques
- 6. Latest Trends (2024-2026)
- 7. Practical Training Pipeline Guide
- 8. Key Paper References
- 9. Conclusion and Outlook
- References
- Quiz
1. Introduction: The Evolution of Text-to-Image Generative Models
Text-to-Image (T2I) generative models are technologies that produce high-resolution images from natural language text prompts, and have undergone rapid development over the past several years. The trajectory of this field can be broadly divided into four paradigms.
[Text-to-Image Model Evolution Timeline]
2014-2019 2017-2020 2020-2023 2023-Present
| | | |
GAN VAE/VQ-VAE Diffusion Models Flow Matching
| | | + DiT
v v v v
┌──────────┐ ┌──────────┐ ┌────────────────┐ ┌──────────────┐
│StackGAN │ │ VQ-VAE │ │ DDPM (2020) │ │ SD3 (2024) │
│AttnGAN │ │ VQ-VAE-2 │ │ DALL-E 2(2022) │ │ Flux (2024) │
│StyleGAN │ │ dVAE │ │ Imagen (2022) │ │ Pixart-Sigma │
│BigGAN │ │ │ │ SD 1.x (2022) │ │ │
│GigaGAN │ │ │ │ SDXL (2023) │ │ │
└──────────┘ └──────────┘ └────────────────┘ └──────────────┘
Features: Features: Features: Features:
- Adversarial - Discrete - Iterative - Straight paths
Training Latent Space Denoising - ODE-based
- Mode Collapse - Codebook - Classifier-Free - Fewer steps
issues Learning Guidance - DiT backbone
- Fast generation - Two-stage - Latent Space - Scalable
Training - U-Net backbone
1.1 Why Training Methodology Matters
The quality of T2I models is critically determined not only by architecture design but also by training methodology. Even with identical architectures, generation quality varies dramatically depending on noise scheduling, conditioning approaches, data quality, and training strategies. A prime example is DALL-E 3, which achieved dramatic performance improvements over its predecessor through caption quality improvement alone without any architecture changes.
This article provides an in-depth, paper-based analysis of core training methodologies for each paradigm, covering practical training pipeline configuration as well.
2. Training Methodologies by Core Architecture
2.1 GAN-Based: Adversarial Training
Generative Adversarial Network (GAN) is a framework where two networks, the Generator and the Discriminator, are trained competitively.
2.1.1 Basic Training Principles
The training objective function of GAN is defined as a minimax game:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated ones
- Training goal: G tries to fool D, while D tries to classify accurately
2.1.2 StyleGAN Training Strategy
StyleGAN (Karras et al., 2019) introduced Progressive Growing and Style-based Generator to enable high-quality image generation.
Core Training Techniques:
| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start from low resolution (4x4) and progressively increase | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Generator Jacobian regularization | Improved generation quality |
| R1 Regularization | Discriminator gradient penalty | Training stabilization |
| Lazy Regularization | Apply regularization every 16 steps instead of every step | Improved training efficiency |
# StyleGAN2 core training loop (simplified)
for real_images, _ in dataloader:
# 1. Discriminator training
z = torch.randn(batch_size, latent_dim)
fake_images = generator(z)
d_real = discriminator(real_images)
d_fake = discriminator(fake_images.detach())
d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    # R1 Regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        # create_graph=True so the penalty remains differentiable w.r.t. D's weights
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss += 10.0 * r1_penalty
d_optimizer.zero_grad()
d_loss.backward()
d_optimizer.step()
# 2. Generator training
z = torch.randn(batch_size, latent_dim)
fake_images = generator(z)
d_fake = discriminator(fake_images)
g_loss = F.softplus(-d_fake).mean()
g_optimizer.zero_grad()
g_loss.backward()
g_optimizer.step()
2.1.3 Large-Scale Training with BigGAN
BigGAN (Brock et al., 2019) is a model that scaled up GAN to large scale, employing the following training strategies:
- Large-scale batches: Increase batch size up to 2048 for improved training stability and quality
- Class-conditional Batch Normalization: Inject class information into Batch Normalization parameters
- Truncation Trick: Truncate latent distribution at inference to control quality-diversity trade-off
- Orthogonal Regularization: Maintain orthogonality of weight matrices to prevent mode collapse
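The truncation trick above can be sketched as resampling latent entries that fall outside a magnitude threshold; the function name and threshold value are illustrative assumptions:

```python
import torch

def truncated_noise(batch_size: int, dim: int, threshold: float = 0.5):
    """Truncation trick sketch: resample z entries whose magnitude exceeds
    the threshold, trading diversity for sample quality."""
    z = torch.randn(batch_size, dim)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn(int(mask.sum()))  # redraw out-of-range entries
        mask = z.abs() > threshold
    return z

z = truncated_noise(4, 128, threshold=0.5)
```

A smaller threshold yields higher-fidelity but less diverse samples; BigGAN tunes this knob at inference time only.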
2.1.4 Limitations of GAN-Based T2I
GAN-based approaches ceded dominance to Diffusion-based models due to the following fundamental limitations:
- Mode Collapse: Limited generation diversity
- Training Instability: Unstable training sensitive to hyperparameters
- Text Conditioning difficulty: Difficult to accurately reflect complex text prompts
- Scaling limitations: Increased training instability at large scale
2.2 VAE-Based: Codebook Learning and Discrete Latent Space
2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder
VQ-VAE (van den Oord et al., 2017) is an approach that learns a discrete latent space instead of a continuous one.
[VQ-VAE Architecture]
Input Image Encoder Quantization Decoder Reconstructed
(256x256) --> [E(x)] --> z_e --> [Codebook] --> z_q --> [D(z_q)] --> Image
| |
| ┌────┴────┐
| │ e_1 │
| │ e_2 │ K code vectors
└──>│ ... │ (Codebook)
│ e_K │
└─────────┘
z_q = e_k where k = argmin_j ||z_e - e_j||
(quantize to nearest code vector)
VQ-VAE Training Loss Function:
L = ||x - D(z_q)||² # Reconstruction Loss
+ ||sg[z_e] - e||² # Codebook Loss (can be replaced by EMA updates)
+ β * ||z_e - sg[e]||² # Commitment Loss
- sg[·]: stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: selected codebook vector
Since the quantization operation is non-differentiable, the Straight-Through Estimator (STE) is used to pass gradients to the encoder. The codebook itself is updated via Exponential Moving Average (EMA).
# VQ-VAE Codebook core training code
class VectorQuantizer(nn.Module):
def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
super().__init__()
self.embedding = nn.Embedding(num_embeddings, embedding_dim)
self.commitment_cost = commitment_cost
def forward(self, z_e):
# z_e: (B, D, H, W) -> (B*H*W, D)
flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, z_e.shape[1])
# Find nearest codebook vector
distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
+ (self.embedding.weight ** 2).sum(dim=1) \
- 2 * flat_z @ self.embedding.weight.t()
indices = distances.argmin(dim=1)
z_q = self.embedding(indices).view_as(z_e.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Loss computation
        codebook_loss = F.mse_loss(z_q, z_e.detach())    # ||sg[z_e] - e||²: moves codebook vectors
        commitment_loss = F.mse_loss(z_q.detach(), z_e)  # ||z_e - sg[e]||²: commits the encoder
        loss = codebook_loss + self.commitment_cost * commitment_loss
# Straight-Through Estimator
z_q_st = z_e + (z_q - z_e).detach()
return z_q_st, loss, indices
2.2.2 VQ-VAE-2: Hierarchical Codebook Learning
VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization to significantly improve image quality.
[VQ-VAE-2 Hierarchical Structure]
Top Level (low resolution)
┌─────────────┐
│ 32x32 grid │ Global structure info
│ Codebook │ (composition, overall shape)
└──────┬──────┘
│
Bottom Level (high resolution)
┌──────┴──────┐
│ 64x64 grid │ Fine detail info
│ Codebook │ (textures, edges)
└─────────────┘
The image generation pipeline of VQ-VAE-2 consists of the following two stages:
- Stage 1: Train VQ-VAE-2 to encode images into hierarchical discrete codes
- Stage 2: Learn the prior of discrete codes with autoregressive models such as PixelCNN
This approach directly influenced the dVAE (discrete VAE) used in DALL-E.
2.3 Diffusion-Based: The Core of Current T2I
Diffusion Model is the mainstream paradigm for current T2I generation. It learns a forward process that gradually adds noise to data, and a reverse process that recovers data from noise.
2.3.1 DDPM: Denoising Diffusion Probabilistic Models
DDPM by Ho et al. (2020) is the key paper that elevated Diffusion Models to a practical level.
Forward Process (Diffusion):
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)
- Add a small amount of Gaussian noise at each timestep t
- β_t: noise schedule (typically linear or cosine)
- After T steps, x_T approximately equals N(0, I) (pure Gaussian noise)
Noise can be added directly at any timestep t in closed form:
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)
where ᾱ_t = ∏_{s=1}^{t} α_s, α_t = 1 - β_t
=> x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε, ε ~ N(0, I)
Reverse Process (Denoising):
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)
- A neural network ε_θ predicts the noise ε that was added to x_t
- The predicted noise is removed to recover x_{t-1}
Training Objective (Simple Loss):
L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]
- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε
# DDPM core training loop
def train_step(model, x_0, noise_scheduler):
batch_size = x_0.shape[0]
# 1. Random timestep sampling
t = torch.randint(0, num_timesteps, (batch_size,))
# 2. Noise sampling
noise = torch.randn_like(x_0)
# 3. Forward process: generate x_t
    alpha_bar_t = noise_scheduler.alpha_bar[t].view(-1, 1, 1, 1)  # (B,) -> (B,1,1,1) for broadcasting
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
# 4. Predict noise
noise_pred = model(x_t, t)
# 5. Loss computation (MSE)
loss = F.mse_loss(noise_pred, noise)
return loss
2.3.2 Noise Scheduling
The noise schedule determines the amount of noise added at each timestep in the forward process and has a decisive impact on generation quality.
| Schedule | Formula | Features | Models Used |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) * t/T | Simple but noise increases sharply at the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) * π/2) | Smooth transition, excellent information preservation | Improved DDPM |
| Scaled Linear | β_t = (β_min^0.5 + t/T * (β_max^0.5 - β_min^0.5))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12*t/T) | Gradual change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Theoretically near optimal | Playground v2.5, EDM |
| Zero Terminal SNR | Ensures SNR(T) = 0 | Guarantees starting from pure noise | SD3, Flux |
Playground v2.5 (Li et al., 2024) adopted EDM's (Karras et al., 2022) noise schedule, greatly improving color and contrast. The key is ensuring Zero Terminal SNR, where the Signal-to-Noise Ratio (SNR) at timestep T must be exactly 0 during training.
# Cosine Schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
steps = timesteps + 1
x = torch.linspace(0, timesteps, steps)
alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return torch.clip(betas, 0.0001, 0.9999)
# EDM Noise Schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
step_indices = torch.arange(num_steps)
t_steps = (sigma_max ** (1/rho) + step_indices / (num_steps - 1)
* (sigma_min ** (1/rho) - sigma_max ** (1/rho))) ** rho
return t_steps
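The Zero Terminal SNR property mentioned above can also be retrofitted onto an existing beta schedule by rescaling ᾱ so that √ᾱ_T becomes exactly 0, as sketched below (following the rescaling idea from "Common Diffusion Noise Schedules and Sample Steps are Flawed"; the function name is an assumption):

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the terminal SNR is exactly zero:
    shift sqrt(alpha_bar) so its last value is 0, then restore the first value."""
    alphas = 1.0 - betas
    alphas_bar = alphas.cumprod(dim=0)
    sqrt_ab = alphas_bar.sqrt()

    sqrt_ab_0, sqrt_ab_T = sqrt_ab[0].clone(), sqrt_ab[-1].clone()
    sqrt_ab = sqrt_ab - sqrt_ab_T                             # last value -> 0
    sqrt_ab = sqrt_ab * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)   # first value preserved

    alphas_bar = sqrt_ab ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]                 # recover per-step alphas
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas

betas = torch.linspace(1e-4, 0.02, 1000)
new_betas = rescale_zero_terminal_snr(betas)
```

After rescaling, sampling can start from pure Gaussian noise without the train/inference mismatch caused by a nonzero terminal SNR.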
2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion
Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically improved computational efficiency by performing diffusion in latent space instead of pixel space. This is the core idea behind Stable Diffusion.
[Latent Diffusion Model Architecture]
Text Prompt
│
┌────┴────┐
│ CLIP │
│ Encoder │
└────┬────┘
│ text embeddings
│
┌──────┐ ┌──────┐ ┌─────┴──────┐ ┌──────┐ ┌──────┐
│Image │ │ VAE │ │ U-Net │ │ VAE │ │Output│
│(512 │--->│Encode│--->│ (Denoising │--->│Decode│--->│Image │
│x512) │ │ r │ │ in Latent │ │ r │ │(512 │
│ │ │ │ │ Space) │ │ │ │x512) │
└──────┘ └──────┘ └────────────┘ └──────┘ └──────┘
│ │
│ 64x64x4 │
│ (8x downsampling) │
└─────────────────────────────────┘
Latent Space (z)
Training: Diffusion in Latent Space
Inference: Random noise z_T -> U-Net Denoising -> VAE Decode -> Image
LDM Training Pipeline:
Stage 1 - Autoencoder Training: Pretrain VAE (KL-regularized) on image datasets
- Encoder: Image x (H x W x 3) -> latent z (H/f x W/f x c), f=8 is typical
- Decoder: latent z -> Reconstructed image
- Loss: Reconstruction + KL Divergence + Perceptual Loss + GAN Loss
Stage 2 - Diffusion Model Training: Diffusion in the latent space of the frozen Autoencoder
- Add noise to latent z_0 = Encoder(x) to generate z_t
- U-Net predicts noise from z_t
- Text conditioning is injected via cross-attention
# Latent Diffusion core training
class LatentDiffusionTrainer:
def __init__(self, vae, unet, text_encoder, noise_scheduler):
self.vae = vae # Frozen
self.unet = unet # Trainable
self.text_encoder = text_encoder # Frozen
self.noise_scheduler = noise_scheduler
def train_step(self, images, captions):
# 1. Latent encoding with VAE (no gradient needed)
with torch.no_grad():
latents = self.vae.encode(images).latent_dist.sample()
latents = latents * self.vae.config.scaling_factor # 0.18215
# 2. Text embedding (no gradient needed)
with torch.no_grad():
text_embeddings = self.text_encoder(captions)
# 3. Add noise
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],))
noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
# 4. Predict noise
noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)
# 5. MSE loss
loss = F.mse_loss(noise_pred, noise)
return loss
2.3.4 Structure of the U-Net Backbone
The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:
[U-Net with Cross-Attention Structure]
Input z_t ─────────────────────────────────────────── Output ε_θ
│ ▲
▼ │
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Down │ │ Down │ │ Down │ │ Up │ │ Up │
│ Block │──│ Block │──│ Block │──┐ │ Block │──│ Block │
│ 64x64 │ │ 32x32 │ │ 16x16 │ │ │ 32x32 │ │ 64x64 │
└────────┘ └────────┘ └────────┘ │ └────────┘ └────────┘
│ │ │ │ ▲ ▲
│ │ │ ▼ │ │
│ │ │ ┌────────┐ │ │
│ │ └──│ Middle │──┘ │
│ │ │ Block │ │
│ │ │ 16x16 │ │
│ └───────────────└────────┘──────────────┘
│ (skip connections)
└────────────────────────────────────────────────────┘
Inside each Block:
┌──────────────────────────────────────┐
│ ResNet Block │
│ ├── GroupNorm → SiLU → Conv │
│ ├── Timestep Embedding injection │
│ └── GroupNorm → SiLU → Conv │
│ │
│ Self-Attention Block │
│ ├── LayerNorm → Self-Attention │
│ └── Skip Connection │
│ │
│ Cross-Attention Block │
│ ├── LayerNorm │
│ ├── Q = Linear(latent features) │
│ ├── K = Linear(text embeddings) │ ← Text Conditioning
│ ├── V = Linear(text embeddings) │
│ └── Attention(Q, K, V) │
│ │
│ Feed-Forward Block │
│ ├── LayerNorm → Linear → GEGLU │
│ └── Linear → Skip Connection │
└──────────────────────────────────────┘
SDXL (Podell et al., 2023) expanded the U-Net by approximately 3x (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and applies improvements including training at various aspect ratios.
| Model | U-Net Params | Text Encoder | Resolution | VAE Downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (1) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (1) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (2) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (1) | 1024x1024 | 8x |
2.3.5 Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) by Ho & Salimans (2022) is a core training technique for modern T2I models.
Problems with Traditional Classifier Guidance:
- Requires training a separate classifier
- Needs a classifier that works on noisy images
- Requires computing classifier gradients during inference
Classifier-Free Guidance Key Idea:
During training, text conditioning is replaced with an empty string ("") with a certain probability (typically 10-20%), so that a single model simultaneously learns both conditional and unconditional generation.
During training:
- with probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅) (unconditional)
- with probability 1-p_uncond: ε_θ(x_t, t, c) (conditional)
At inference:
ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))
- w: guidance scale (typically 7.5-15)
- w=1: use the conditional prediction as-is
- w>1: push the prediction further toward the text condition
# Classifier-Free Guidance training implementation
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
noise = torch.randn_like(x_0)
t = torch.randint(0, T, (x_0.shape[0],))
x_t = add_noise(x_0, noise, t)
# Randomly drop conditioning
mask = torch.rand(x_0.shape[0]) < p_uncond
cond = text_cond.clone()
cond[mask] = empty_text_embedding # null conditioning
noise_pred = model(x_t, t, cond)
loss = F.mse_loss(noise_pred, noise)
return loss
# Classifier-Free Guidance inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
x_t = x_T
for t in reversed(range(T)):
# Unconditional prediction
eps_uncond = model(x_t, t, empty_text_embedding)
# Conditional prediction
eps_cond = model(x_t, t, text_cond)
# Guided prediction
eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
x_t = denoise_step(x_t, eps, t)
return x_t
CFG dramatically improves generation quality and text fidelity, but if the guidance scale is too high, images become oversaturated or artifacts appear.
2.3.6 DALL-E 2: CLIP-Based Diffusion
DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture leveraging the CLIP embedding space.
[DALL-E 2 Training Pipeline]
Text ──→ CLIP Text Encoder ──→ text embedding
│
┌─────┴─────┐
│ Prior │ text emb → CLIP image emb
│ (Diffusion)│
└─────┬─────┘
│ CLIP image embedding
┌─────┴─────┐
│ Decoder │ CLIP image emb → 64x64 image
│ (Diffusion)│
└─────┬─────┘
│ 64x64
┌─────┴─────┐
│ Super-Res │ 64x64 → 256x256 → 1024x1024
│ (Diffusion)│
└─────────── ┘
2.3.7 Imagen: The Power of T5 Text Encoder
Google's Imagen (Saharia et al., 2022) maximized text understanding by using the T5-XXL (4.6B parameter) text encoder.
Key findings:
- Scaling text encoder size is more effective than scaling U-Net size
- T5-XXL outperforms CLIP ViT-L in text understanding quality
- Dynamic Thresholding: Stable generation even at high CFG scales
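Dynamic thresholding can be sketched as follows. This is a minimal interpretation of the technique described in the Imagen paper (the function name and exact tensor handling are illustrative): at each sampling step, the predicted x0 is clamped to a per-sample percentile s of its absolute values and rescaled, instead of statically clamping to [-1, 1].

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    # Per-sample: find the `percentile` quantile s of |x0_pred|,
    # clamp to [-s, s], then divide by s so values stay in [-1, 1].
    # Static thresholding would simply clamp to [-1, 1] directly.
    flat = x0_pred.abs().reshape(x0_pred.shape[0], -1)
    s = torch.quantile(flat, percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
    return x0_pred.clamp(-s, s) / s
```

This keeps pixel statistics in range even when a high CFG scale pushes predictions far outside [-1, 1].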
[Imagen Architecture]
Text ──→ T5-XXL (frozen) ──→ text embeddings
│
┌─────┴─────┐
│ Base Model │ Generates 64x64
│ (U-Net) │ cross-attention
└─────┬─────┘
│
┌─────┴─────┐
│ SR Model 1 │ 64x64 → 256x256
│ (U-Net) │
└─────┬─────┘
│
┌─────┴─────┐
│ SR Model 2 │ 256x256 → 1024x1024
│ (U-Net) │
└─────────── ┘
2.3.8 DiT: Diffusion Transformer
DiT (Diffusion Transformer) by Peebles & Xie (2023) is an architecture that replaces U-Net with Transformer, and is becoming the mainstream for recent T2I models.
[DiT Block Structure]
Input Tokens (patchified latent)
│
┌──────┴──────┐
│ LayerNorm │ ← adaLN-Zero (adaptive)
│ (adaptive) │ γ, β = MLP(timestep + class)
└──────┬──────┘
│
┌──────┴──────┐
│ Self- │
│ Attention │
└──────┬──────┘
│ (+ residual)
┌──────┴──────┐
│ LayerNorm │ ← adaLN-Zero
│ (adaptive) │
└──────┬──────┘
│
┌──────┴──────┐
│ Pointwise │
│ FFN │
└──────┬──────┘
│ (+ residual, scaled by α)
▼
Output Tokens
Key Design Decisions of DiT:
- Patchify: Split latent into p x p patches then linear projection to token sequence
- adaLN-Zero: Adaptive Layer Normalization, injecting timestep and class information into LN parameters
- Scaling: Systematic scaling law verification by model size (depth, width)
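The adaLN-Zero mechanism from the list above can be sketched like this (class and parameter names are illustrative, not DiT's exact code): the conditioning vector is mapped to shift/scale/gate parameters, and zero-initializing that mapping makes every residual branch start as the identity.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    # Regresses shift/scale/gate from the conditioning vector c
    # (timestep + class embedding in DiT). Zero-initializing the MLP
    # makes scale = shift = gate = 0, so each DiT block initially
    # passes its input through unchanged.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_params = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.to_params.weight)
        nn.init.zeros_(self.to_params.bias)

    def forward(self, x, c):
        # x: (B, N, dim) tokens, c: (B, dim) conditioning vector
        shift, scale, gate = self.to_params(c).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return h, gate.unsqueeze(1)  # caller computes: x + gate * sublayer(h)
```

The zero-initialized gate is what stabilizes training of very deep DiT stacks.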
| DiT Variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
2.4 Autoregressive-Based T2I
2.4.1 DALL-E (Original): Token-Based Autoregressive Generation
DALL-E (Ramesh et al., 2021) converts images into discrete tokens, then concatenates text tokens and image tokens into a single sequence to learn the joint distribution with an autoregressive Transformer.
[DALL-E Training Pipeline]
Stage 1: dVAE Training
Image (256x256) ──→ dVAE Encoder ──→ 32x32 grid of tokens (8192 vocabulary)
──→ dVAE Decoder ──→ Reconstructed Image
Loss: Reconstruction + KL Divergence (Gumbel-Softmax relaxation)
Stage 2: Autoregressive Transformer Training
[BPE text tokens (256)] + [Image tokens (1024)] = 1280 tokens
Transformer (12B params):
- 64 layers, 62 attention heads
- Training objective: next-token prediction (cross-entropy)
- Text tokens use causal attention (left to right)
- Image tokens are generated autoregressively in row-major order
- Image tokens attend to all text tokens
2.4.2 Parti: Encoder-Decoder Based
Google's Parti (Yu et al., 2022) formulated T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an Encoder-Decoder Transformer.
Key features:
- ViT-VQGAN: Vision Transformer-based image tokenizer
- Encoder-Decoder: Uses Encoder for text encoding, Decoder for image token generation
- Scaling: Systematic scale-up from 350M to 3B to 20B parameters
- Achieves quality comparable to Imagen at the 20B model
2.4.3 CM3Leon: Efficient Multimodal Autoregressive
Meta's CM3Leon (Yu et al., 2023) significantly improved the efficiency of the autoregressive approach:
- Retrieval-Augmented Training: Retrieve related image-text pairs during training and add to context
- Decoder-Only: Pure decoder-only architecture unlike Parti
- Instruction Tuning: Supervised fine-tuning for various tasks
- 5x less training cost: Reduces training compute by 1/5 for comparable performance
- Achieves MS-COCO zero-shot FID of 4.88
2.5 Flow Matching: The Next-Generation Training Paradigm
2.5.1 Basic Principles of Flow Matching
Flow Matching (Lipman et al., 2023) learns a straight path from noise distribution to data distribution through a deterministic ODE (Ordinary Differential Equation) instead of Diffusion's stochastic process.
[Diffusion vs Flow Matching Comparison]
Diffusion (Stochastic): Flow Matching (Deterministic):
x_0 ~~~> x_T x_0 ──────> x_1
(curved path, requires many steps) (straight path, fewer steps possible)
x₀ • x₀ •
\ Curved \ Straight
\ path \ path
\ \
\ \
x_T • x₁ • (= noise)
dx = f(x,t)dt + g(t)dW dx/dt = v_θ(x_t, t)
(SDE-based) (ODE-based, learns a velocity field)
Flow Matching Training Objective:
L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]
where:
x_t = (1 - t) * x_0 + t * x_1 (linear interpolation)
u_t = x_1 - x_0 (target velocity: straight-line path)
t ~ Uniform(0, 1) (or logit-normal)
x_0 ~ p_data (real data)
x_1 ~ N(0, I) (Gaussian noise)
2.5.2 Rectified Flow
Rectified Flow (Liu et al., 2023, ICLR 2023 Spotlight) is a key variant of Flow Matching that connects noise-data pairs in straight lines from an Optimal Transport perspective.
Key idea:
- 1-Rectified Flow: Randomly pair data x_0 and noise x_1 to learn straight paths
- 2-Rectified Flow (Reflow): Re-straighten pairs generated by 1-Rectified Flow to make paths closer to straight lines
- Distillation: Distill the straightened model into a 1-step model
# Rectified Flow core training
def rectified_flow_train_step(model, x_0, x_1=None):
"""
x_0: real data (latent)
x_1: noise (randomly sampled if None)
"""
if x_1 is None:
x_1 = torch.randn_like(x_0)
# Time sampling (logit-normal for SD3)
t = torch.sigmoid(torch.randn(x_0.shape[0])) # logit-normal
t = t.view(-1, 1, 1, 1)
# Linear interpolation
x_t = (1 - t) * x_0 + t * x_1
# Target velocity (straight direction)
target_v = x_1 - x_0
# Velocity prediction
v_pred = model(x_t, t)
# Loss
loss = F.mse_loss(v_pred, target_v)
return loss
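Sampling from the trained velocity field is plain ODE integration. A minimal Euler sampler matching the convention above (x_1 is noise, x_0 is data, so we integrate from t=1 down to t=0) might look like this; the step count and the simple Euler rule are illustrative choices:

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, num_steps=20):
    # Start at pure noise (t = 1) and integrate dx/dt = v back to t = 0.
    # The model predicts v = x_1 - x_0, so stepping by -v * dt moves
    # from the noise end of the straight path toward data.
    x = torch.randn(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        t_batch = torch.full((shape[0], 1, 1, 1), t)
        v = model(x, t_batch)
        x = x - v * dt
    return x
```

The straighter the learned paths, the fewer Euler steps are needed, which is exactly the motivation for Reflow and distillation.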
2.5.3 Flow Matching in Stable Diffusion 3
SD3 (Esser et al., 2024) is the first model to apply Rectified Flow to a large-scale T2I model. Key contributions from the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":
1. Logit-Normal Timestep Sampling:
Instead of a uniform distribution, timesteps are sampled using a logit-normal distribution, giving more weight to the middle portion of the trajectory (the most challenging prediction interval).
# SD3's Logit-Normal Timestep Sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
"""Give more weight to middle timesteps"""
u = torch.randn(batch_size) * s + m
t = torch.sigmoid(u)  # range (0, 1)
return t
2. MM-DiT (Multi-Modal Diffusion Transformer):
SD3 introduced a new Transformer architecture that uses separate weights for text and images while enabling bidirectional information flow.
[MM-DiT Block]
Image Tokens Text Tokens
│ │
┌────┴────┐ ┌────┴────┐
│adaLN(t) │ │adaLN(t) │
└────┬────┘ └────┬────┘
│ │
└──────┬──────────────┘
│ (concatenate)
┌──────┴──────┐
│ Joint Self- │ Image-text tokens
│ Attention │ attend to each other
└──────┬──────┘
│ (split)
┌──────┴──────────────┐
│ │
┌────┴────┐ ┌────┴────┐
│ FFN │ │ FFN │
│ (image) │ │ (text) │
└────┬────┘ └────┬────┘
│ │
Image Out Text Out
3. Scaling Laws:
| Model | Blocks | Parameters | Validation Loss |
|---|---|---|---|
| SD3-S | 15 | 450M | Highest |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Lowest (best quality) |
The paper confirms smooth scaling: validation loss decreases steadily as model size and training compute increase.
2.5.4 Flux: Black Forest Labs' Flow Matching Model
Flux (Black Forest Labs, 2024) is a model based on SD3's Rectified Flow + Transformer architecture.
| Variant | Training Method | Inference Steps | Features |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance Distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |
Guidance Distillation: The student model is trained to reproduce, without CFG, the output of a teacher model that uses CFG, eliminating the 2x forward-pass cost of CFG at inference time.
Latent Adversarial Diffusion Distillation (LADD): Combines GAN's adversarial loss with diffusion distillation to enable 1-4 step generation.
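The guidance-distillation objective can be sketched as below. This is a generic formulation under stated assumptions (the student receives the guidance scale w as an extra input so one model covers many scales; Flux's exact training code is not public):

```python
import torch
import torch.nn.functional as F

def guidance_distill_loss(student, teacher, x_t, t, cond, uncond, w):
    # Teacher runs CFG: two forward passes combined with scale w.
    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)
        eps_uncond = teacher(x_t, t, uncond)
        target = eps_uncond + w * (eps_cond - eps_uncond)
    # Student reproduces the guided output in a single forward pass.
    pred = student(x_t, t, cond, w)
    return F.mse_loss(pred, target)
```

At inference the distilled student needs only one forward pass per step instead of two.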
3. Text Conditioning Methodologies
Text Conditioning is the mechanism that injects the meaning of text prompts into the image generation process in T2I models. The choice of text encoder and conditioning method has a decisive impact on generation quality.
3.1 CLIP Text Encoder
OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) is a model contrastively trained on 400 million image-text pairs.
[CLIP Training Process]
Image ──→ Image Encoder ──→ image embedding ─┐
├─ cosine similarity
Text ──→ Text Encoder ──→ text embedding ─┘
Training objective: Increase similarity for matching pairs, decrease for non-matching pairs
(InfoNCE Loss)
Characteristics of CLIP Text Encoder:
- Both token sequence embeddings and [EOS] token pooled embeddings can be utilized
- Maximum 77 token length limit
- Strong at image-text alignment
- Text understanding specialized for visual concepts
| CLIP Variant | Parameters | Models Used |
|---|---|---|
| CLIP ViT-L/14 | ~124M (text) | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M (text) | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M (text) | SDXL (primary) |
| CLIP ViT-L/14 | ~124M (text) | SDXL (secondary) |
3.2 T5 Text Encoder
Google's T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) is a large-scale language model trained on a pure text corpus.
Advantages of T5 (Demonstrated in the Imagen paper):
- Trained on a much larger text corpus than CLIP (C4 dataset)
- Excellent at understanding complex sentence structures and relationships
- Ability to process complex prompts including spatial relationships, quantities, and attribute combinations
- Scaling the text encoder is more effective than scaling the U-Net (a key Imagen finding)
| T5 Variant | Parameters | Models Used |
|---|---|---|
| T5-Small | 60M | Experimental |
| T5-Base | 220M | Experimental |
| T5-Large | 770M | Experimental |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |
3.3 Cross-Attention Mechanism
Cross-attention is the core mechanism that injects text information into image features within the U-Net or DiT.
# Cross-Attention implementation
class CrossAttention(nn.Module):
def __init__(self, d_model, d_context, n_heads):
super().__init__()
self.n_heads = n_heads
self.d_head = d_model // n_heads
self.to_q = nn.Linear(d_model, d_model, bias=False) # latent → Q
self.to_k = nn.Linear(d_context, d_model, bias=False) # text → K
self.to_v = nn.Linear(d_context, d_model, bias=False) # text → V
self.to_out = nn.Linear(d_model, d_model)
def forward(self, x, context):
"""
x: (B, H*W, d_model) - image latent features
context: (B, seq_len, d_context) - text embeddings
"""
B, _, d_model = x.shape
q = self.to_q(x) # image provides the Query
k = self.to_k(context) # text provides the Key
v = self.to_v(context) # text provides the Value
# Multi-head reshape
q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
# Attention
attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
attn = F.softmax(attn, dim=-1)
out = attn @ v
out = out.transpose(1, 2).reshape(B, -1, d_model)
return self.to_out(out)
3.4 Pooled Text Embeddings vs Sequence Embeddings
Modern T2I models simultaneously utilize two types of text embeddings:
[Text Embedding Types]
Text: "a photo of a cat"
│
┌────┴────┐
│ Text │
│ Encoder │
└────┬────┘
│
┌────┴──────────────────────┐
│ │
▼ ▼
Sequence Embeddings Pooled Embedding
(token-level) (sentence-level)
[h_1, h_2, ..., h_n] h_pool = h_[EOS]
Shape: (seq_len, d) Shape: (d,)
│ │
│ │
▼ ▼
Used in Cross-Attention          Used for global conditioning
(fine-grained per-token info)    (whole-sentence meaning)
                                 - Added to the timestep embedding
                                 - Modulates adaLN parameters
                                 - Vector conditioning
Dual text encoder usage in SDXL:
# SDXL Text Conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
# CLIP ViT-L: sequence embeddings (77, 768)
clip_l_output = clip_l(text)
clip_l_seq = clip_l_output.last_hidden_state # (77, 768)
clip_l_pooled = clip_l_output.pooler_output # (768,)
# OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
clip_g_output = clip_g(text)
clip_g_seq = clip_g_output.last_hidden_state # (77, 1280)
clip_g_pooled = clip_g_output.pooler_output # (1280,)
# Concatenate sequence embeddings -> used for Cross-Attention
text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1) # (77, 2048)
# Concatenate pooled embeddings -> used for Vector conditioning
pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1) # (2048,)
return text_embeddings, pooled_embeddings
SD3 and Flux additionally combine T5-XXL sequence embeddings, using a triple text encoder configuration:
| Encoder | Role | Output Shape | Use Case |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
4. Training Datasets
The quality of T2I models directly depends on the scale and quality of training data. Here is a summary of major large-scale datasets.
4.1 Comparison of Major Datasets
| Dataset | Scale | Source | Filtering Method | Main Models Used |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Web alt-text | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Web alt-text | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawling | Top 10% CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image | Research |
| SAM | 11M images | Various sources | Manual + model-based | Segmentation + T2I |
| Internal (Proprietary) | Billions of pairs | Proprietary | Proprietary | DALL-E 3, Midjourney |
4.2 LAION-5B Filtering Pipeline
LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:
[LAION-5B Data Collection and Filtering Pipeline]
Common Crawl (web archive)
│
▼
1. HTML parsing: extract src URL + alt-text from <img> tags
│
▼
2. Image download (img2dataset)
- Minimum resolution filter: width, height ≥ 64
- Maximum aspect ratio: 3:1
│
▼
3. CLIP similarity filtering
- Compute image-text similarity with OpenAI CLIP ViT-B/32
- English: cosine similarity ≥ 0.28
- Other languages: cosine similarity ≥ 0.26
│
▼
4. Safety filtering
- NSFW detection score (CLIP-based)
- Watermark detection score
- Toxic content detection
│
▼
5. Deduplication
- Hash-based exact duplicate removal
- CLIP embedding-based near-duplicate removal
│
▼
Final: 5.85B image-text pairs
- 2.32B English
- 2.26B in 100+ other languages
- 1.27B with undetected language
4.3 Data Quality Assessment
The latest models tend to focus on data quality over data quantity:
1. CLIP Score-Based Filtering:
# CLIP Score computation
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
clip_score = outputs.logits_per_image.item() / 100.0 # normalized
2. Aesthetic Score Filtering:
LAION-Aesthetics is a subset built by training a separate aesthetic predictor (CLIP embedding → MLP → score) and keeping only images whose predicted aesthetic score is 4.5 or higher.
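Such a predictor can be sketched as an MLP head over precomputed CLIP image embeddings; the layer sizes below are illustrative, not those of the released LAION checkpoint:

```python
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    # MLP regression head: CLIP image embedding -> scalar aesthetic score.
    # Trained on human aesthetic ratings, then used to filter LAION-5B
    # down to the LAION-Aesthetics subset (score >= 4.5).
    def __init__(self, embed_dim=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, clip_embed):
        # clip_embed: (B, embed_dim) -> (B,) scores
        return self.head(clip_embed).squeeze(-1)
```

Because the CLIP embeddings are precomputed once, scoring billions of images this way is cheap.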
3. Caption Quality Improvement (DALL-E 3's Core Innovation):
DALL-E 3 (Betker et al., 2023) achieved dramatic performance improvement through caption quality improvement alone without any architecture changes:
- Train a dedicated image captioning model to generate detailed synthetic captions
- Train with 95% synthetic captions + 5% original captions
- Comparison experiments of three types: short synthetic, detailed synthetic, and human annotation
- Detailed synthetic captions are overwhelmingly superior
[DALL-E 3 Caption Improvement Effect]
Before: "cat on table"
-> Vague and lacks detail
After: "A fluffy orange tabby cat sitting on a round wooden
dining table, natural sunlight streaming through a
window behind, casting soft shadows. The cat has
bright green eyes and is looking directly at the camera."
-> Includes detailed attributes, spatial relationships, and lighting information
4.4 Data Preprocessing Techniques
| Preprocessing Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop center of image to square | Resolution standardization |
| Random Crop | Random position crop | Data augmentation |
| Bucket Sampling | Group images with similar aspect ratios | Multi-aspect ratio training (SDXL) |
| Caption Dropout | Replace caption with empty string at a certain probability | CFG training support |
| Multi-resolution | Progressive learning from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Random shuffle of tag order | Reduced text order bias |
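Bucket sampling from the table above can be sketched like this (the bucket list and helper name are illustrative): each image is assigned to the bucket whose aspect ratio is nearest its own, so every batch shares one resolution and destructive square-cropping is avoided.

```python
from collections import defaultdict

def assign_buckets(image_sizes,
                   buckets=((1024, 1024), (1152, 896), (896, 1152))):
    # image_sizes: list of (width, height) pairs.
    # Returns {bucket_resolution: [image indices]}; a dataloader then
    # draws each batch from a single bucket.
    assignment = defaultdict(list)
    for idx, (w, h) in enumerate(image_sizes):
        ratio = w / h
        best = min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))
        assignment[best].append(idx)
    return dict(assignment)
```

SDXL-style training uses many more buckets spanning roughly 0.5 to 2.0 aspect ratios, but the grouping logic is the same.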
5. Fine-tuning & Customization Techniques
Fine-tuning techniques that adapt pretrained T2I models to specific styles, subjects, and control conditions are essential for practical applications.
5.1 LoRA (Low-Rank Adaptation)
LoRA by Hu et al. (2022) is an efficient method for fine-tuning large model weights, and is also extensively used in T2I models.
[LoRA Principle]
Original weights: W_0 ∈ R^{d×k} (frozen)
LoRA update: ΔW = B × A where A ∈ R^{r×k}, B ∈ R^{d×r}
Final output: h = W_0 x + ΔW x = W_0 x + B(Ax)
- r << min(d, k): low-rank (typically 4, 8, 16, 32, 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- Original weights stay frozen → memory-efficient
# LoRA application example (Stable Diffusion U-Net attention layer)
class LoRALinear(nn.Module):
def __init__(self, original_layer, rank=4, alpha=1.0):
super().__init__()
self.original = original_layer # frozen
in_features = original_layer.in_features
out_features = original_layer.out_features
# LoRA layers
self.lora_A = nn.Linear(in_features, rank, bias=False)
self.lora_B = nn.Linear(rank, out_features, bias=False)
self.scale = alpha / rank
# Initialization
nn.init.kaiming_uniform_(self.lora_A.weight)
nn.init.zeros_(self.lora_B.weight) # Initialize B to 0 -> identical to original at start
def forward(self, x):
original_out = self.original(x) # Frozen original output
lora_out = self.lora_B(self.lora_A(x)) # LoRA update
return original_out + self.scale * lora_out
LoRA Training Configuration (Diffusers-based):
# Diffusers LoRA training execution example
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--dataset_name="lambdalabs/naruto-blip-captions" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--rank=4 \
--mixed_precision="fp16" \
--output_dir="./sdxl-naruto-lora"
| LoRA Parameter | Typical Range | Impact |
|---|---|---|
| Rank (r) | 4-128 | Higher values increase expressiveness and memory |
| Alpha (α) | Same as rank, up to 2x | Learning rate scaling |
| Target Modules | attn Q,K,V,O + FFN | Application scope |
| Learning Rate | 1e-4 ~ 1e-5 | Convergence speed |
| Training Time | 5-30 min (single GPU) | Enables fast iteration |
| File Size | 1-200 MB | Easy to share and distribute |
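Because ΔW = B × A is an ordinary weight delta, a trained LoRA can be folded back into the base weights so inference needs no extra layers; a minimal sketch (the function name is illustrative):

```python
import torch

def merge_lora(W0, A, B, alpha, rank):
    # W0: (out, in) frozen base weight; A: (rank, in); B: (out, rank).
    # Returns the merged weight W0 + (alpha / rank) * B @ A, after which
    # the LoRA layers can be dropped entirely.
    return W0 + (alpha / rank) * (B @ A)
```

This is why LoRA adds zero inference latency once merged, unlike adapter methods that keep extra modules in the forward pass.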
5.2 DreamBooth
DreamBooth by Ruiz et al. (2023) is a technique for injecting the concept of a specific subject into a model using 3-5 images.
[DreamBooth Training Process]
Input: 3-5 images of a specific subject + unique identifier "[V]"
Example: "a [V] dog" (specific dog)
Training strategy:
1. Fine-tune model with subject images
- "a [V] dog" → 해당 강아지 이미지
2. Prior Preservation Loss (Key!)
- Pre-generate "a dog" images with the original model
- Preserve general dog generation capability during fine-tuning
- Prevent language drift
L = L_recon([V] images) + λ * L_prior(class images)
# DreamBooth + LoRA training (recommended combination)
# Based on diffusers library
accelerate launch train_dreambooth_lora.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--instance_data_dir="./my_dog_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_dog_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--max_train_steps=500 \
--rank=4 \
--mixed_precision="fp16"
5.3 Textual Inversion
Textual Inversion by Gal et al. (2023) is a method that learns only new token embeddings without modifying any model weights.
[Textual Inversion]
Existing token space: [cat] [dog] [car] [tree] ...
│
Add new token: [S*] New concept to learn
│
Training: Optimize only the embedding vector of [S*] with 3-5 images
Entire rest of model is frozen
Advantage: Minimal parameters (1 token = 768 or 1024 floats)
Disadvantage: Less expressive than LoRA/DreamBooth
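In code, Textual Inversion amounts to optimizing one new row of the token-embedding table while everything else stays frozen; a minimal sketch (vocabulary size, embedding dim, and helper name are illustrative):

```python
import torch
import torch.nn as nn

# Frozen token embedding table (e.g., ~49k x 768 for SD 1.x's CLIP),
# extended with one extra slot for the new token [S*].
embedding = nn.Embedding(49409, 768)
embedding.requires_grad_(False)

new_token_id = 49408  # id assigned to [S*]
star_embedding = embedding.weight[new_token_id].clone().requires_grad_(True)
optimizer = torch.optim.AdamW([star_embedding], lr=5e-3)

def embed_tokens(token_ids):
    # Substitute the single trainable vector wherever [S*] appears;
    # gradients flow only into star_embedding.
    out = embedding(token_ids).clone()
    out[token_ids == new_token_id] = star_embedding
    return out
```

The diffusion loss is then backpropagated through the frozen U-Net and text encoder into this one vector.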
5.4 ControlNet
ControlNet by Zhang et al. (2023) is a method for adding structural conditions (edges, depth, pose, etc.) to pretrained diffusion models.
[ControlNet Architecture]
Control Input (e.g., Canny edge)
│
┌─────┴─────┐
│ Zero │
│ Conv │
└─────┬─────┘
│
┌─────┴─────┐
Locked U-Net │ Trainable │ Copy of U-Net Encoder
(original, frozen) │ Copy of │ (trainable)
│ │ U-Net Enc │
│ └─────┬─────┘
│ │
│ ┌─────┴─────┐
│ │ Zero │ Output is 0 at training start
│ │ Conv │ (starts without affecting original model)
│ └─────┬─────┘
│ │
└─────── + ───────────┘ Add to original U-Net features
│
Final Output
ControlNet's Core Training Technique - Zero Convolution:
# Zero Convolution: Initialize weights and biases to 0
class ZeroConv(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
self.conv = nn.Conv2d(in_channels, out_channels, 1)
nn.init.zeros_(self.conv.weight)
nn.init.zeros_(self.conv.bias)
def forward(self, x):
return self.conv(x)
# Training start: zero conv output = 0
# -> Adding ControlNet doesn't affect original model output
# -> Gradually reflects control signal as training progresses
| Condition Type | Input | Use Case |
|---|---|---|
| Canny Edge | Edge map | Contour-based generation |
| Depth | Depth map | 3D structure preservation |
| OpenPose | Joint positions | Human pose control |
| Semantic Segmentation | Segmentation map | Layout control |
| Scribble | Scribble | Rough composition |
| Normal Map | Surface normal map | 3D shape control |
| Tile | Low-resolution/tile | Super-resolution |
5.5 IP-Adapter
IP-Adapter (Image Prompt Adapter) by Ye et al. (2023) is an adapter that uses images as prompts to transfer style or subjects.
[IP-Adapter Architecture]
Reference Image ──→ CLIP Image Encoder ──→ image features
│
┌─────┴─────┐
│ Projection │ Trainable
│ Layer │
└─────┬─────┘
│
┌─────┴─────┐
│ Decoupled │ Separate cross-attention
│ Cross-Attn │ (separated from text cross-attn)
└─────┬─────┘
│
Original U-Net Cross-Attention ────── + ───────┘
(text conditioning)
Output = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
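That formula, as a minimal sketch using PyTorch's scaled-dot-product attention (projection layers omitted; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, ip_scale=1.0):
    # q: (B, heads, N, d) image queries. Text and image K/V come from
    # separate projections; the two attention results are summed, with
    # ip_scale (λ) controlling image-prompt strength.
    text_out = F.scaled_dot_product_attention(q, k_text, v_text)
    img_out = F.scaled_dot_product_attention(q, k_img, v_img)
    return text_out + ip_scale * img_out
```

Setting `ip_scale=0` recovers the original text-only model, which is why IP-Adapter can be toggled at inference without retraining.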
5.6 Comparison of Fine-tuning Techniques
| Technique | Modified Target | Training Images | Training Time | File Size | Main Use Case |
|---|---|---|---|---|---|
| LoRA | Attention weights (low-rank) | Tens to thousands | 5-30 min | 1-200MB | Styles, concepts |
| DreamBooth | Full model or + LoRA | 3-10 | 5-60 min | 2-7GB (full) or 1-200MB (LoRA) | Specific subjects |
| Textual Inversion | Token embeddings only | 3-10 | 30 min to several hours | A few KB | Simple concepts |
| ControlNet | U-Net Encoder copy | Tens to hundreds of thousands | Several days | ~1.5GB | Structural control |
| IP-Adapter | Projection + Cross-Attn | Large-scale | Several days | ~100MB | Image prompting |
6. Latest Trends (2024-2026)
6.1 Consistency Models
Consistency Models (Song et al., 2023) are a method for reducing the multi-step generation of diffusion models to one step or a few steps.
[Consistency Models Key Idea]
Diffusion: x_T → x_{T-1} → ... → x_1 → x_0 (hundreds of steps)
Consistency:
Train so that every point x_t on the PF-ODE trajectory
maps to the same x_0
f_θ(x_t, t) = x_0 ∀t ∈ [0, T]
Boundary condition: f_θ(x_0, 0) = x_0 (anchors self-consistency to the data)
x_T ────→ f_θ ────→ x_0
│ ↑
x_t ────→ f_θ ───────┘ (maps to the same x_0!)
│ ↑
x_t' ────→ f_θ ───────┘
Two Training Methods:
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Consistency Distillation (CD) | Requires a pretrained diffusion model; simulates the PF-ODE | High quality | Needs a teacher model |
| Consistency Training (CT) | Trains directly on real data | No teacher needed | Somewhat lower quality than CD |
Performance:
- CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
- ImageNet 64x64: FID 6.20 (1-step)
Follow-up research, Improved Consistency Training (iCT) and Latent Consistency Models (LCM), applied this to large-scale T2I models, enabling 2-4 step generation at the SDXL level.
6.2 The Spread of DiT (Diffusion Transformer) Architecture
Since 2024, DiT has been replacing U-Net to become the mainstream backbone for T2I:
| Model | Year | Backbone | Parameters | Key Features |
|---|---|---|---|---|
| DiT (original) | 2023 | Transformer | 675M | Class-conditional, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, low-cost training |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K resolution, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, triple text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation variants |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | Chinese+English bilingual |
| Lumina-T2X | 2024 | DiT | Various | Multi-modal generation |
6.3 PixArt-alpha and PixArt-sigma
PixArt-alpha (Chen et al., 2023) is a pioneering model for efficient DiT training:
Core innovation - Training Decomposition:
[PixArt-alpha 3-Stage Training]
Stage 1: Pixel Dependency Learning (low cost)
- Start from an ImageNet-pretrained DiT
- Foundation for the class-conditional → T2I transition
Stage 2: Text-Image Alignment Learning
- Inject text conditions via cross-attention
- Use high-quality synthetic captions generated with LLaVA
Stage 3: High-Quality Aesthetic Learning
- Fine-tune on high-quality aesthetic datasets
- Uses JourneyDB and similar sources
Total training cost: ~675 A100 GPU days
(10.8% of SD 1.5's ~6,250 A100 GPU days)
Improvements in PixArt-sigma (Chen et al., 2024):
- Weak-to-Strong Training: Enhanced training with higher quality data based on PixArt-alpha
- KV Compression: Compress Key and Value in Attention for improved efficiency, enabling 4K resolution
- Comparable performance to SDXL (2.6B) with only 0.6B parameters
6.4 Comparison of SDXL, SD3, and Flux
[Stable Diffusion Lineage by Generation]
SD 1.x (2022) SDXL (2023) SD3 (2024) Flux (2024)
│ │ │ │
U-Net 860M U-Net 2.6B MM-DiT 2-8B MM-DiT ~12B
│ │ │ │
CLIP ViT-L CLIP-L + CLIP-L + CLIP-L +
OpenCLIP-G OpenCLIP-G + OpenCLIP-G +
T5-XXL T5-XXL
│ │ │ │
Diffusion Diffusion Rectified Rectified
(DDPM) (DDPM) Flow Flow
│ │ │ │
512x512 1024x1024 1024x1024 1024x1024+
│ │ │ │
CFG 7.5 CFG 5-9 CFG 3.5-7 Guidance
Distillation
6.5 Training Innovations of DALL-E 3
The core innovation of DALL-E 3 (Betker et al., 2023) lies in improving training data caption quality:
- Image Captioner Training: Separately train a CoCa-based image captioning model
- Synthetic Caption Generation: Re-label entire training data with detailed synthetic captions
- Caption Mixing: Train with 95% synthetic + 5% original captions
- Descriptive vs Short: detailed descriptive captions outperform short tag-style captions
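The 95/5 caption mixing step amounts to a per-sample Bernoulli draw. A minimal sketch (the function name is ours, not from the paper):

```python
import random

def pick_caption(synthetic, original, p_synthetic=0.95):
    """DALL-E 3-style caption mixing: use the detailed synthetic caption
    most of the time, but keep a small share of original alt-text so the
    model still handles short, noisy prompts at inference."""
    return synthetic if random.random() < p_synthetic else original
```

Keeping 5% original captions is what prevents the model from overfitting to the captioner's verbose style.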
6.6 Three Key Insights of Playground v2.5
Playground v2.5 (Li et al., 2024) surpassed DALL-E 3 and Midjourney 5.2 through training strategy improvements based on the SDXL architecture:
1. EDM Noise Schedule Adoption:
# EDM Framework (Karras et al., 2022)
# sigma(t)-based noise schedule: guarantees Zero Terminal SNR
# Greatly improves color/contrast over SD's original linear schedule
def edm_precondition(sigma, x_noisy, F_theta):
"""EDM Preconditioning"""
c_skip = 1.0 / (sigma ** 2 + 1)
c_out = sigma / (sigma ** 2 + 1).sqrt()
c_in = 1.0 / (sigma ** 2 + 1).sqrt()
c_noise = sigma.log() / 4
D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
return D_x
2. Multi-Aspect Ratio Training:
- Bucketed dataset: group images with similar aspect ratios into the same batch
- Supports various aspect ratios during training (1:1, 4:3, 16:9, etc.)
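Bucketing can be sketched as assigning each image to the predefined resolution bucket with the nearest aspect ratio; batches are then drawn within a single bucket so every sample in a batch shares one shape. The bucket shapes below are illustrative SDXL-style resolutions, not an official list:

```python
def assign_buckets(sizes, buckets=((1024, 1024), (1152, 896), (896, 1152))):
    """Assign each image to the bucket whose aspect ratio is closest.

    sizes: list of (width, height) tuples.
    Returns a dict mapping bucket shape -> list of image indices.
    """
    groups = {b: [] for b in buckets}
    for i, (w, h) in enumerate(sizes):
        ar = w / h
        # Pick the bucket minimizing the aspect-ratio difference
        best = min(buckets, key=lambda b: abs(b[0] / b[1] - ar))
        groups[best].append(i)
    return groups
```

A batch sampler then iterates over one bucket at a time, resizing/cropping each image to its bucket's resolution.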
3. Human Preference Alignment:
- Training strategy utilizing human preference data
- Maximize aesthetic quality through quality-tuning
7. Practical Training Pipeline Guide
7.1 Training Infrastructure
GPU Requirements
| Training Scale | Recommended GPU | VRAM (total) | Training Duration | Cost (Estimated) |
|---|---|---|---|---|
| LoRA Fine-tuning | 1x RTX 3090/4090 | 24GB | 5-30 min | < $1 |
| DreamBooth | 1x A100 40GB | 40GB | 30-60 min | $2-5 |
| ControlNet training | 4-8x A100 80GB | 320-640GB | 2-5 days | $500-2,000 |
| SD 1.5-scale training | 256x A100 80GB | ~20TB | 24 days | ~$150K |
| SDXL-scale training | 512-1024x A100 80GB | ~40-80TB | Weeks | ~$500K-1M |
| SD3/Flux-scale training | 1024+ H100 80GB | ~80TB+ | Weeks-months | > $1M |
Distributed Training Strategy
[Large-Scale Distributed Training Configuration]
┌─────────────────────────────────────────────────────┐
│ Data Parallel (DP/DDP) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Full │ │Full │ │Full │ │Full │ │
│ │Model │ │Model │ │Model │ │Model │ │
│ │Copy │ │Copy │ │Copy │ │Copy │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ Batch 1 Batch 2 Batch 3 Batch 4 │
│ │
│ -> Synchronize gradients with All-Reduce │
│ -> Different data batches on each GPU │
└─────────────────────────────────────────────────────┘
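The All-Reduce step can be illustrated without a GPU cluster: each replica contributes its local gradient, and every replica receives the same mean. A toy sketch (real training uses `torch.distributed` with NCCL, not this function):

```python
import torch

def all_reduce_mean(local_grads):
    """Toy All-Reduce: average the per-replica gradients and hand every
    replica an identical copy, so each model copy applies the same update
    and the replicas stay in sync."""
    mean = torch.stack(local_grads).mean(dim=0)
    return [mean.clone() for _ in local_grads]
```

After this step every "GPU" steps its optimizer with the identical averaged gradient, which is what keeps the full model copies synchronized in DDP.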
┌─────────────────────────────────────────────────────┐
│ FSDP (Fully Sharded Data Parallel) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Shard │ │Shard │ │Shard │ │Shard │ │
│ │ 1/4 │ │ 2/4 │ │ 3/4 │ │ 4/4 │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ -> Shard model parameters across GPUs │
│ -> All-Gather only needed shards during Forward/Backward │
│ -> Maximize memory efficiency (enables 8B+ model training) │
└─────────────────────────────────────────────────────┘
7.2 Representative Training Framework: Diffusers
HuggingFace's Diffusers library is the de facto standard for T2I model training.
# Diffusers-based Text-to-Image full training pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F
# 1. Load models
vae = AutoencoderKL.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)
# 2. Freeze VAE and Text Encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
# 3. Accelerator setup (distributed training + Mixed Precision)
accelerator = Accelerator(
mixed_precision="fp16", # or "bf16"
gradient_accumulation_steps=4,
)
# 4. Optimizer
optimizer = torch.optim.AdamW(
unet.parameters(),
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=1e-2,
eps=1e-8,
)
# 5. EMA setup
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
unet.parameters(),
decay=0.9999,
use_ema_warmup=True,
)
# 6. Prepare for distributed training
# (dataloader is assumed to be built beforehand from an image-caption dataset)
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)
# 7. Training loop
for epoch in range(num_epochs):
for batch in dataloader:
with accelerator.accumulate(unet):
images = batch["images"]
captions = batch["captions"]
# Latent encoding
with torch.no_grad():
latents = vae.encode(images).latent_dist.sample()
latents = latents * vae.config.scaling_factor
# Text encoding
with torch.no_grad():
text_inputs = tokenizer(captions, padding="max_length",
max_length=77, return_tensors="pt")
text_embeds = text_encoder(text_inputs.input_ids)[0]
# Add noise
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
# Classifier-Free Guidance: random caption dropout
if torch.rand(1) < 0.1: # unconditional with 10% probability
text_embeds = torch.zeros_like(text_embeds)
# Predict noise
noise_pred = unet(noisy_latents, timesteps, text_embeds).sample
# Loss computation
loss = F.mse_loss(noise_pred, noise)
# Backward
accelerator.backward(loss)
accelerator.clip_grad_norm_(unet.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
# EMA update
ema_unet.step(unet.parameters())
7.3 Mixed Precision Training
Mixed Precision is a technique that improves memory and computational efficiency by combining FP32 and FP16/BF16.
[Mixed Precision Training]
Forward Pass:
- Model weights: FP16/BF16 (half memory)
- Activation: FP16/BF16
Loss Scaling:
- Multiply loss by a large scale (e.g., 2^16) to prevent gradient underflow
- Scale down gradient again after backward
Backward Pass:
- Gradient: FP16/BF16
Optimizer Step:
- Master Weights: FP32 (maintain precision!)
- Update FP32 master weights then create FP16 copy
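The loss-scaling logic above maps directly onto torch's `GradScaler` API. A minimal sketch with a tiny linear model standing in for the U-Net (on CPU the scaler is disabled and acts as a pass-through, so this mainly exercises the API shape):

```python
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 16)
# Forward pass runs in half precision under autocast
with torch.autocast("cuda" if use_cuda else "cpu",
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()  # scale the loss up before backward
scaler.step(optimizer)         # unscales gradients, skips the step on inf/nan
scaler.update()                # adapts the scale factor for the next step
```

Note that BF16 usually skips loss scaling entirely: its exponent range matches FP32, which is why it is the recommended choice on A100/H100.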
| Precision | Memory | Compute Speed | Numerical Stability | Recommended Use |
|---|---|---|---|---|
| FP32 | 4 bytes | Baseline | Highest | Optimizer state |
| FP16 | 2 bytes | ~2x | Low (overflow risk) | Forward/Backward |
| BF16 | 2 bytes | ~2x | High (wide range) | Recommended on A100/H100 |
| TF32 | 4 bytes (storage) | ~1.5x | High | A100 default |
# BF16 Mixed Precision config (accelerate-based)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
7.4 EMA (Exponential Moving Average)
EMA is a technique that maintains a moving average of model weights during training to achieve more stable results during inference. It is used in nearly all T2I model training.
[EMA Update]
θ_ema ← λ * θ_ema + (1 - λ) * θ_model
- λ: decay rate (typically 0.9999 ~ 0.99999)
- θ_model: current model weights being trained
- θ_ema: EMA weights (used at inference)
- Effect: smooths out gradient noise, yielding more stable weights
# Diffusers EMA implementation
from diffusers.training_utils import EMAModel
# Create EMA model
ema_model = EMAModel(
unet.parameters(),
decay=0.9999, # decay rate
use_ema_warmup=True, # use warmup
inv_gamma=1.0, # warmup parameter
power=3/4, # warmup parameter
)
# Update at every training step
ema_model.step(unet.parameters())
# Apply EMA weights at inference
ema_model.copy_to(unet.parameters())
# Or temporarily swap in EMA weights with store/copy_to/restore
ema_model.store(unet.parameters())    # back up current training weights
ema_model.copy_to(unet.parameters())  # load EMA weights
output = unet(noisy_latents, timesteps, text_embeds)
ema_model.restore(unet.parameters())  # put the training weights back
7.5 Training Hyperparameter Guide
| Hyperparameter | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (total) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | Original resolution |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
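The warmup schedule in the table can be sketched with `LambdaLR`: linearly ramp the learning rate over the first 10,000 steps, then hold it constant (parameter values below are illustrative):

```python
import torch

warmup_steps = 10_000

def warmup_factor(step):
    # Linear ramp from ~0 to 1 over warmup_steps, then constant 1.0
    return min(1.0, (step + 1) / warmup_steps)

params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_factor)
```

In the training loop, call `scheduler.step()` once after each `optimizer.step()`; the effective LR is the table's base LR times the warmup factor.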
8. Key Paper References
8.1 Core Paper Table
| # | Paper Title | Authors | Year | Key Contribution | Link |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN framework proposal | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector Quantized discrete latent space | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based generator, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | Large-scale GAN training, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | Hierarchical VQ-VAE, high-resolution generation | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Practical training of Diffusion models | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP contrastive learning, image-text alignment | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + Autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent Diffusion, Cross-Attention conditioning | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP-based 2-stage Diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | Effect of the T5-XXL text encoder, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG training technique, joint unconditional-conditional training | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, 20B scaling | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning Technique | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | Systematic Diffusion design space analysis, Preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | Personalization via new token embedding learning | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | Subject personalization with few images, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | Adds structural control (edge, depth, pose) | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step generation, PF-ODE consistency learning | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | Large U-Net, Dual Text Encoder, Multi-AR training | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE-based Flow Matching framework | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, Optimal Transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | Image prompt adapter, Decoupled Cross-Attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | Efficient autoregressive T2I, Retrieval Augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | Low-cost DiT training, training decomposition strategy | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Dramatic quality improvement via synthetic captions | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-Strong training, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, large-scale Rectified Flow, Logit-Normal sampling | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM Noise Schedule, Multi-AR, Human Preference | arXiv:2402.17245 |
8.2 Additional Reference Papers
| Paper Title | Year | Key |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 5.85 billion open image-text dataset |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, speed improvement |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | Inference acceleration via progressive distillation |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step generation |
| Latent Consistency Models | 2024 | LCM, SDXL-based few-step generation |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL generation |
| Stable Cascade | 2024 | Wuerstchen-based 3-stage hierarchical generation |
9. Conclusion and Outlook
Text-to-Image model training methodologies started from GAN's adversarial training, passed through Diffusion's iterative denoising, and are now converging on a new paradigm of Flow Matching + DiT.
Key Trend Summary
[T2I Training Methodology Evolution]
Efficiency: Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
(months, $1M+) (minutes, less than $1) (seconds)
Architecture: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
(SD 1.x-SDXL) (DiT, PixArt) (SD3, Flux)
Generation speed: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
(DDPM) (DDIM, DPM++) (LCM, LADD, CM)
Data quality: Web crawling ──→ Filtering ──→ Synthetic Caption
(LAION raw) (aesthetic) (DALL-E 3 style)
Text understanding: CLIP only ──→ CLIP + T5 ──→ Triple Encoder
(SD 1.x) (Imagen) (SD3, Flux)
Future Outlook
Maximizing training efficiency: As demonstrated by PixArt-alpha, the trend of reducing training costs to 1/10 or less while maintaining quality will continue.
Data-Centric AI approach: As DALL-E 3 demonstrated, data quality and captioning are becoming more important than architecture.
Few-Step / One-Step generation: Distillation techniques such as Consistency Models, LCM, and LADD will continue to advance, making real-time generation the standard.
Unified Multi-Modal Generation: Expanding to models that integrate not only text-to-image but also video, 3D, and audio.
Advanced Personalization: Beyond LoRA, DreamBooth, and IP-Adapter, more accurate subject reproduction with even less data will become possible.
T2I model training methodology has entered an era where the key is not simply "training a larger model with more data," but rather what data, with what schedule, and with what conditioning to train with. We hope the methodologies covered in this article can be used as a foundation for training your own T2I models or effectively customizing existing ones.
References
- HuggingFace Diffusers Documentation
- HuggingFace Diffusers Training Examples
- Awesome Text-to-Image Studies
- Text-to-Image Diffusion Models in Generative AI: A Survey (arXiv:2303.07909)
- Text to Image Generation and Editing: A Survey (arXiv:2505.02527)
- Stability AI Research
- Black Forest Labs