Text-to-Image Model Training: A Complete Guide from GANs to Flow Matching
- 1. Introduction: The Evolution of Text-to-Image Generative Models
- 2. Training Methodologies by Core Architecture
- 3. Text Conditioning Methods
- 4. Training Datasets
- 5. Fine-tuning & Customization Techniques
- 6. Recent Trends (2024-2026)
- 7. A Practical Training Pipeline Guide
- 8. Key Paper References
- 9. Conclusion and Outlook
- References
1. Introduction: The Evolution of Text-to-Image Generative Models
Text-to-Image (T2I) generative models synthesize high-resolution images from natural-language prompts, and the field has advanced rapidly over the past decade. Its trajectory can be divided into four major paradigms.
```
[Text-to-Image model timeline]

2014-2019      2017-2020       2020-2023             2023-present
    |              |                |                      |
   GAN         VAE/VQ-VAE     Diffusion Models       Flow Matching
    |              |                |                    + DiT
    v              v                v                      v
┌──────────┐  ┌──────────┐  ┌────────────────┐  ┌──────────────┐
│ StackGAN │  │ VQ-VAE   │  │ DDPM (2020)    │  │ SD3 (2024)   │
│ AttnGAN  │  │ VQ-VAE-2 │  │ DALL-E 2 (2022)│  │ Flux (2024)  │
│ StyleGAN │  │ dVAE     │  │ Imagen (2022)  │  │ Pixart-Sigma │
│ BigGAN   │  │          │  │ SD 1.x (2022)  │  │              │
│ GigaGAN  │  │          │  │ SDXL (2023)    │  │              │
└──────────┘  └──────────┘  └────────────────┘  └──────────────┘
```
Characteristics by paradigm:
- GAN: adversarial training; mode collapse issues; fast sampling
- VAE/VQ-VAE: discrete latent space; codebook learning; two-stage training
- Diffusion: iterative denoising; classifier-free guidance; latent space; U-Net backbone
- Flow Matching: straight paths; ODE-based; fewer sampling steps; DiT backbone; scalable
1.1 Why Training Methodology Matters
The quality of a T2I model is determined as much by its training methodology as by its architecture. Even with an identical architecture, generation quality varies dramatically with the noise schedule, the conditioning scheme, data quality, and the training strategy. A striking example is DALL-E 3, which achieved dramatic gains over its predecessor purely by improving caption quality, without changing the architecture.
This article analyzes the core training methodology of each paradigm, grounded in the original papers, and covers how to assemble a practical training pipeline.
2. Training Methodologies by Core Architecture
2.1 GAN-based: Adversarial Training
A **Generative Adversarial Network (GAN)** is a framework in which two networks, a Generator and a Discriminator, are trained in competition.
2.1.1 Basic Training Principle
The GAN objective function is defined as a minimax game:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated ones
- Training goal: G tries to fool D, while D tries to classify correctly
2.1.2 StyleGAN Training Strategy
StyleGAN (Karras et al., 2019) introduced Progressive Growing and a style-based generator, enabling high-quality image synthesis.
Key training techniques:
| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start at low resolution (4x4) and grow resolution progressively | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Regularize the generator's Jacobian | Improved sample quality |
| R1 Regularization | Gradient penalty on the discriminator | Stabilized training |
| Lazy Regularization | Apply regularization every 16 steps rather than every step | Improved training efficiency |
```python
# Core StyleGAN2 training loop (simplified)
for real_images, _ in dataloader:
    # 1. Train the discriminator
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # R1 regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss += 10.0 * r1_penalty

    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Train the generator
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_fake = discriminator(fake_images)
    g_loss = F.softplus(-d_fake).mean()

    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
```
2.1.3 BigGAN's Large-Scale Training
BigGAN (Brock et al., 2019) scaled GANs up dramatically, using the following training strategies:
- Large batches: batch sizes up to 2048 improved both stability and quality
- Class-conditional Batch Normalization: class information injected through the BatchNorm parameters
- Truncation Trick: truncating the latent distribution at inference time to trade diversity for quality
- Orthogonal Regularization: keeping weight matrices near-orthogonal to mitigate mode collapse
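The truncation trick can be sketched in a few lines. This is a minimal illustration with placeholder dimensions: latents whose components exceed a threshold are resampled, which narrows the latent distribution and trades diversity for fidelity.

```python
import torch

def truncated_noise(batch_size, latent_dim, threshold=0.5):
    """Truncation trick sketch: resample any latent component whose
    magnitude exceeds `threshold`. Lower thresholds give higher-fidelity,
    less diverse samples."""
    z = torch.randn(batch_size, latent_dim)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn_like(z[mask])
        mask = z.abs() > threshold
    return z

z = truncated_noise(4, 128, threshold=0.5)
```

The generator is then called on the truncated `z`; in BigGAN this interacts with the orthogonal regularization above, which keeps the generator well-behaved on the truncated region of latent space.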
2.1.4 Limitations of GAN-based T2I
GAN-based approaches ceded the lead to diffusion-based models because of several fundamental limitations:
- Mode collapse: limited sample diversity
- Training instability: training is unstable and highly sensitive to hyperparameters
- Difficult text conditioning: complex text prompts are hard to reflect faithfully
- Scaling limits: instability grows with large-scale training
2.2 VAE-based: Codebook Learning and Discrete Latent Spaces
2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder
VQ-VAE (van den Oord et al., 2017) learns a discrete latent space instead of a continuous one.
```
[VQ-VAE architecture]

Input Image      Encoder        Quantization       Decoder       Reconstructed
(256x256)  --> [E(x)] --> z_e --> [Codebook] --> z_q --> [D(z_q)] --> Image
                           |           |
                           |      ┌────┴────┐
                           |      │   e_1   │
                           |      │   e_2   │  K code vectors
                           └─────>│   ...   │  (the codebook)
                                  │   e_K   │
                                  └─────────┘

z_q = e_k  where  k = argmin_j ||z_e - e_j||
(quantize to the nearest code vector)
```
VQ-VAE training loss:
```
L = ||x - D(z_q)||²        # Reconstruction loss
  + ||sg[z_e] - e||²       # Codebook loss (can be replaced by EMA updates)
  + β * ||z_e - sg[e]||²   # Commitment loss
```
- sg[·]: the stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: the selected codebook vector
Since quantization is non-differentiable, the **Straight-Through Estimator (STE)** is used to pass gradients back to the encoder. The codebook itself can be updated via an Exponential Moving Average (EMA).
```python
# Core VQ-VAE codebook training code
class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        B, D, H, W = z_e.shape
        # (B, D, H, W) -> (B*H*W, D)
        flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, D)
        # Find the nearest codebook vector (squared Euclidean distance)
        distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
                    + (self.embedding.weight ** 2).sum(dim=1) \
                    - 2 * flat_z @ self.embedding.weight.t()
        indices = distances.argmin(dim=1)
        z_q = self.embedding(indices).view(B, H, W, D).permute(0, 3, 1, 2)
        # Losses: pull the codebook toward the encoder output (codebook loss),
        # and the encoder output toward the codebook (commitment loss)
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commitment_loss = F.mse_loss(z_q.detach(), z_e)
        loss = codebook_loss + self.commitment_cost * commitment_loss
        # Straight-Through Estimator: copy gradients from z_q to z_e
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, loss, indices
```
2.2.2 VQ-VAE-2: Hierarchical Codebook Learning
VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization, substantially improving image quality.
```
[VQ-VAE-2 hierarchy]

Top level (small resolution)
┌─────────────┐
│ 32x32 grid  │  Global structure
│  Codebook   │  (composition, overall shape)
└──────┬──────┘
       │
Bottom level (large resolution)
┌──────┴──────┐
│ 64x64 grid  │  Fine detail
│  Codebook   │  (texture, edges)
└─────────────┘
```
VQ-VAE-2's image generation pipeline has two stages:
- Stage 1: train VQ-VAE-2 to encode images into hierarchical discrete codes
- Stage 2: train an autoregressive model (e.g., PixelCNN) as a prior over the discrete codes
This approach directly influenced DALL-E's dVAE (discrete VAE).
2.3 Diffusion-based: The Core of Modern T2I
Diffusion models are the dominant T2I paradigm today. They pair a forward process that gradually adds noise to data with a learned reverse process that recovers data from noise.
2.3.1 DDPM: Denoising Diffusion Probabilistic Models
DDPM (Ho et al., 2020) is the paper that brought diffusion models to a practical level.
Forward Process (Diffusion):
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)
- A small amount of Gaussian noise is added at each timestep t
- β_t: the noise schedule (typically linear or cosine)
- After T steps, x_T ≈ N(0, I) (pure Gaussian noise)
A closed form allows noising to any timestep t directly:
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)
where ᾱ_t = ∏_{s=1}^{t} α_s, α_t = 1 - β_t
=> x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε, ε ~ N(0, I)
Reverse Process (Denoising):
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)
- A neural network ε_θ predicts the noise ε that was added to x_t
- The predicted noise is removed to recover x_{t-1}
Training objective (simple loss):
L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]
- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε
```python
# Core DDPM training step
def train_step(model, x_0, noise_scheduler):
    batch_size = x_0.shape[0]
    # 1. Sample random timesteps
    t = torch.randint(0, num_timesteps, (batch_size,))
    # 2. Sample noise
    noise = torch.randn_like(x_0)
    # 3. Forward process: build x_t (reshape for broadcasting over (B, C, H, W))
    alpha_bar_t = noise_scheduler.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
    # 4. Predict the noise
    noise_pred = model(x_t, t)
    # 5. MSE loss
    loss = F.mse_loss(noise_pred, noise)
    return loss
```
2.3.2 Noise Scheduling
The noise schedule determines how much noise is added at each timestep of the forward process, and it has a decisive effect on sample quality.
| Schedule | Formula | Characteristics | Used by |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) * t/T | Simple, but noise ramps up sharply near the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) * π/2) | Smooth transitions, preserves information well | Improved DDPM |
| Scaled Linear | β_t = (β_min^0.5 + t/T * (β_max^0.5 - β_min^0.5))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12*t/T) | Gentle change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Near-optimal in theory | Playground v2.5, EDM |
| Zero Terminal SNR | Guarantees SNR(T) = 0 | Sampling truly starts from pure noise | SD3, Flux |
Playground v2.5 (Li et al., 2024) adopted the EDM (Karras et al., 2022) noise schedule, markedly improving color and contrast. The key is guaranteeing zero terminal SNR: the signal-to-noise ratio at timestep T must be exactly 0 during training.
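Zero terminal SNR can be retrofitted onto an existing beta schedule. The sketch below follows the rescaling procedure proposed by Lin et al. (2024, "Common Diffusion Noise Schedules and Sample Steps are Flawed"): shift and scale √ᾱ_t so that ᾱ_T becomes exactly 0 while ᾱ_1 is preserved.

```python
import torch

def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so that SNR(T) == 0 (sketch of Lin et al., 2024)."""
    alphas = 1.0 - betas
    alphas_bar = alphas.cumprod(0)
    sqrt_ab = alphas_bar.sqrt()
    # Shift so the last value becomes exactly 0, keeping the first value fixed
    sqrt_ab_0 = sqrt_ab[0].clone()
    sqrt_ab_T = sqrt_ab[-1].clone()
    sqrt_ab = sqrt_ab - sqrt_ab_T
    sqrt_ab = sqrt_ab * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)
    # Convert back to betas
    alphas_bar = sqrt_ab ** 2
    alphas = torch.cat([alphas_bar[0:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

Note that the final beta becomes 1, so the model must be trained with v-prediction (not ε-prediction) to remain well-defined at t = T.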
```python
# Cosine schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

# EDM noise schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    step_indices = torch.arange(num_steps)
    t_steps = (sigma_max ** (1 / rho) + step_indices / (num_steps - 1)
               * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return t_steps
```
2.3.3 Latent Diffusion Model (LDM): The Core of Stable Diffusion
The **Latent Diffusion Model (LDM)** of Rombach et al. (2022) performs diffusion in a latent space instead of pixel space, dramatically improving computational efficiency. This is the core idea behind Stable Diffusion.
```
[Latent Diffusion Model architecture]

                         Text Prompt
                              │
                         ┌────┴────┐
                         │  CLIP   │
                         │ Encoder │
                         └────┬────┘
                              │ text embeddings
                              │
┌──────┐   ┌───────┐   ┌──────┴─────┐   ┌───────┐   ┌──────┐
│Image │   │  VAE  │   │   U-Net    │   │  VAE  │   │Output│
│(512  │──>│Encoder│──>│ (Denoising │──>│Decoder│──>│Image │
│x512) │   │       │   │ in Latent  │   │       │   │(512  │
│      │   │       │   │   Space)   │   │       │   │x512) │
└──────┘   └───────┘   └────────────┘   └───────┘   └──────┘
               │                            │
               │         64x64x4            │
               │    (8x downsampling)       │
               └────────────────────────────┘
                      Latent Space (z)

Training:  diffusion in latent space
Inference: random noise z_T → U-Net denoising → VAE decode → image
```
The LDM training pipeline:
Stage 1 - Autoencoder training: pretrain a KL-regularized VAE on the image dataset
- Encoder: image x (H x W x 3) → latent z (H/f x W/f x c), typically f=8
- Decoder: latent z → reconstructed image
- Loss: reconstruction + KL divergence + perceptual loss + GAN loss
Stage 2 - Diffusion model training: diffusion in the frozen autoencoder's latent space
- Add noise to the latent z_0 = Encoder(x) to form z_t
- The U-Net predicts the noise in z_t
- Text conditioning is injected via cross-attention
```python
# Core latent diffusion training
class LatentDiffusionTrainer:
    def __init__(self, vae, unet, text_encoder, noise_scheduler):
        self.vae = vae                    # frozen
        self.unet = unet                  # trained
        self.text_encoder = text_encoder  # frozen
        self.noise_scheduler = noise_scheduler

    def train_step(self, images, captions):
        # 1. Encode images to latents with the VAE (no gradients needed)
        with torch.no_grad():
            latents = self.vae.encode(images).latent_dist.sample()
            latents = latents * self.vae.config.scaling_factor  # 0.18215 in SD 1.x
        # 2. Text embeddings (no gradients needed)
        with torch.no_grad():
            text_embeddings = self.text_encoder(captions)
        # 3. Add noise
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, 1000, (latents.shape[0],))
        noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
        # 4. Predict the noise
        noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)
        # 5. MSE loss
        loss = F.mse_loss(noise_pred, noise)
        return loss
```
2.3.4 The U-Net Backbone
The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:
```
[U-Net with cross-attention]

Input z_t ─────────────────────────────────────────── Output ε_θ
    │                                                    ▲
    ▼                                                    │
┌────────┐  ┌────────┐  ┌────────┐      ┌────────┐  ┌────────┐
│  Down  │  │  Down  │  │  Down  │      │   Up   │  │   Up   │
│ Block  │──│ Block  │──│ Block  │──┐   │ Block  │──│ Block  │
│ 64x64  │  │ 32x32  │  │ 16x16  │  │   │ 32x32  │  │ 64x64  │
└────────┘  └────────┘  └────────┘  │   └────────┘  └────────┘
    │           │           │       ▼       ▲           ▲
    │           │           │   ┌────────┐  │           │
    │           │           └───│ Middle │──┘           │
    │           │               │ Block  │              │
    │           │               │ 16x16  │              │
    │           └───────────────└────────┘──────────────┘
    │                   (skip connections)
    └───────────────────────────────────────────────────┘

Inside each block:
┌──────────────────────────────────────┐
│ ResNet Block                         │
│  ├── GroupNorm → SiLU → Conv         │
│  ├── Timestep embedding injection    │
│  └── GroupNorm → SiLU → Conv         │
│                                      │
│ Self-Attention Block                 │
│  ├── LayerNorm → Self-Attention      │
│  └── Skip connection                 │
│                                      │
│ Cross-Attention Block                │
│  ├── LayerNorm                       │
│  ├── Q = Linear(latent features)     │
│  ├── K = Linear(text embeddings)     │ ← text conditioning
│  ├── V = Linear(text embeddings)     │
│  └── Attention(Q, K, V)              │
│                                      │
│ Feed-Forward Block                   │
│  ├── LayerNorm → Linear → GEGLU      │
│  └── Linear → Skip connection        │
└──────────────────────────────────────┘
```
SDXL (Podell et al., 2023) enlarged the U-Net roughly threefold (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and trains across diverse aspect ratios.
| Model | U-Net params | Text Encoder | Resolution | VAE downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (single) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (single) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (dual) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (single) | 1024x1024 | 8x |
2.3.5 Classifier-Free Guidance (CFG)
**Classifier-Free Guidance (CFG)** (Ho & Salimans, 2022) is a core training technique in modern T2I models.
Problems with the older classifier guidance:
- A separate classifier must be trained
- The classifier must work on noisy images
- Inference requires computing classifier gradients
The key CFG idea:
During training, the text conditioning is replaced with an empty string ("") with some probability (typically 10-20%), so a single model learns both conditional and unconditional generation.
Training:
- With probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅) (unconditional)
- With probability 1-p_uncond: ε_θ(x_t, t, c) (conditional)
Inference:
ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))
- w: guidance scale (typically 7.5-15)
- w=1: the conditional prediction as-is
- w>1: pushed further in the direction of the text condition
```python
# Classifier-free guidance: training
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
    noise = torch.randn_like(x_0)
    t = torch.randint(0, T, (x_0.shape[0],))
    x_t = add_noise(x_0, noise, t)
    # Randomly drop the conditioning
    mask = torch.rand(x_0.shape[0]) < p_uncond
    cond = text_cond.clone()
    cond[mask] = empty_text_embedding  # null conditioning
    noise_pred = model(x_t, t, cond)
    loss = F.mse_loss(noise_pred, noise)
    return loss

# Classifier-free guidance: inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
    x_t = x_T
    for t in reversed(range(T)):
        # Unconditional prediction
        eps_uncond = model(x_t, t, empty_text_embedding)
        # Conditional prediction
        eps_cond = model(x_t, t, text_cond)
        # Guided prediction
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x_t = denoise_step(x_t, eps, t)
    return x_t
```
CFG dramatically improves sample quality and text alignment, but if the guidance scale is too high, images become oversaturated or develop artifacts.
2.3.6 DALL-E 2: CLIP-based Diffusion
DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture built on the CLIP embedding space.
```
[DALL-E 2 training pipeline]

Text ──→ CLIP Text Encoder ──→ text embedding
                                     │
                               ┌─────┴──────┐
                               │   Prior    │  text emb → CLIP image emb
                               │ (Diffusion)│
                               └─────┬──────┘
                                     │ CLIP image embedding
                               ┌─────┴──────┐
                               │  Decoder   │  CLIP image emb → 64x64 image
                               │ (Diffusion)│
                               └─────┬──────┘
                                     │ 64x64
                               ┌─────┴──────┐
                               │ Super-Res  │  64x64 → 256x256 → 1024x1024
                               │ (Diffusion)│
                               └────────────┘
```
2.3.7 Imagen: The Power of the T5 Text Encoder
Google's Imagen (Saharia et al., 2022) maximized text understanding by using a T5-XXL (4.6B parameter) text encoder.
Key findings:
- Scaling the text encoder is more effective than scaling the U-Net
- T5-XXL > CLIP ViT-L in text understanding quality
- Dynamic thresholding: stable generation even at high CFG scales
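Dynamic thresholding can be sketched as follows. This is a minimal version: the predicted x_0 is clamped to a per-sample percentile of its absolute pixel values and rescaled, which prevents the pixel saturation that plain [-1, 1] clipping causes at high guidance scales. The function name and tensor shapes here are illustrative.

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Imagen-style dynamic thresholding sketch: clamp each sample's
    predicted x0 to the `percentile` of its absolute values, then divide
    by that threshold so the output stays in [-1, 1]."""
    B = x0_pred.shape[0]
    flat = x0_pred.reshape(B, -1).abs()
    s = torch.quantile(flat, percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(B, 1, 1, 1)  # never shrink the valid range
    return torch.minimum(torch.maximum(x0_pred, -s), s) / s
```

This is applied at every sampling step after the x_0 prediction, before renoising to the next timestep.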
```
[Imagen architecture]

Text ──→ T5-XXL (frozen) ──→ text embeddings
                                   │
                             ┌─────┴──────┐
                             │ Base Model │  generates 64x64
                             │  (U-Net)   │  cross-attention
                             └─────┬──────┘
                                   │
                             ┌─────┴──────┐
                             │ SR Model 1 │  64x64 → 256x256
                             │  (U-Net)   │
                             └─────┬──────┘
                                   │
                             ┌─────┴──────┐
                             │ SR Model 2 │  256x256 → 1024x1024
                             │  (U-Net)   │
                             └────────────┘
```
2.3.8 DiT: Diffusion Transformer
**DiT (Diffusion Transformer)** (Peebles & Xie, 2023) replaced the U-Net with a Transformer and has become the backbone of choice in recent T2I models.
```
[DiT block]

Input tokens (patchified latent)
        │
 ┌──────┴──────┐
 │  LayerNorm  │ ← adaLN-Zero (adaptive):
 │ (adaptive)  │   γ, β = MLP(timestep + class)
 └──────┬──────┘
        │
 ┌──────┴──────┐
 │    Self-    │
 │  Attention  │
 └──────┬──────┘
        │ (+ residual)
 ┌──────┴──────┐
 │  LayerNorm  │ ← adaLN-Zero
 │ (adaptive)  │
 └──────┬──────┘
        │
 ┌──────┴──────┐
 │  Pointwise  │
 │     FFN     │
 └──────┬──────┘
        │ (+ residual, scaled by α)
        ▼
  Output tokens
```
DiT's key design decisions:
- Patchify: split the latent into p x p patches and linearly project each one, producing a token sequence
- adaLN-Zero: adaptive layer normalization; timestep and class information are injected through the LN parameters, with the residual scale α initialized to zero
- Scaling: systematic scaling behavior confirmed across model depth and width
| DiT variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
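The patchify step and the adaLN modulation can be sketched as follows. Dimensions follow DiT-S/2 (patch size 2, width 384); the rest is a minimal illustration of the input stage and the modulation helper, not the full DiT block.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify sketch: split a latent (B, C, H, W) into p×p patches and
    project each to a d-dimensional token, as in DiT's input stage."""
    def __init__(self, in_channels=4, patch_size=2, dim=384):
        super().__init__()
        # A strided conv is equivalent to "split into patches + linear projection"
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, z):
        x = self.proj(z)                     # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

def modulate(x, shift, scale):
    """adaLN modulation: shift and scale are regressed by an MLP from the
    timestep (+ class) embedding, per sample."""
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

In adaLN-Zero, the MLP that produces the shift, scale, and residual gate α is zero-initialized, so each block starts as the identity function.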
2.4 Autoregressive T2I
2.4.1 DALL-E (original): Token-based Autoregressive Generation
DALL-E (Ramesh et al., 2021) converts images into discrete tokens, concatenates text tokens and image tokens into a single sequence, and learns their joint distribution with an autoregressive Transformer.
```
[DALL-E training pipeline]

Stage 1: dVAE training
Image (256x256) ──→ dVAE Encoder ──→ 32x32 grid of tokens (8192-token vocabulary)
                ──→ dVAE Decoder ──→ reconstructed image
Loss: reconstruction + KL divergence (Gumbel-Softmax relaxation)

Stage 2: autoregressive Transformer training
[BPE text tokens (256)] + [image tokens (1024)] = 1280 tokens
```
Transformer (12B params):
- 64 layers, 62 attention heads
- Objective: next-token prediction (cross-entropy)
- Text tokens use causal attention (left to right)
- Image tokens are generated autoregressively in row-major order
- Includes text→image cross-attention
2.4.2 Parti: Encoder-Decoder Approach
Google's Parti (Yu et al., 2022) frames T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an encoder-decoder Transformer.
Key characteristics:
- ViT-VQGAN: a Vision Transformer-based image tokenizer
- Encoder-decoder: the encoder embeds the text, the decoder generates image tokens
- Scaling: systematic scale-up from 350M → 3B → 20B parameters
- The 20B model matches Imagen in quality
2.4.3 CM3Leon: Efficient Multimodal Autoregression
Meta's CM3Leon (Yu et al., 2023) substantially improved the efficiency of the autoregressive approach:
- Retrieval-augmented training: relevant image-text pairs are retrieved and added to the context during training
- Decoder-only: unlike Parti, a pure decoder-only architecture
- Instruction tuning: supervised fine-tuning across diverse tasks
- 5x lower training cost: comparable performance at one-fifth the training compute
- Achieves a zero-shot MS-COCO FID of 4.88
2.5 Flow Matching: The Next Training Paradigm
2.5.1 Flow Matching Basics
Flow Matching (Lipman et al., 2023) replaces diffusion's stochastic process with a **deterministic ODE (ordinary differential equation)**, learning a **straight path** from the noise distribution to the data distribution.
```
[Diffusion vs. Flow Matching]

Diffusion (stochastic):              Flow Matching (deterministic):
  x_0 ~~~> x_T                         x_0 ──────> x_1
  (curved paths, many steps needed)    (straight paths, few steps possible)

  x₀ •                                 x₀ •
      \  ← curved                          \  ← straight
       \    path                            \    path
        \                                    \
         \                                    \
    x_T •                                 x₁ •  (= noise)

  dx = f(x,t)dt + g(t)dW               dx/dt = v_θ(x_t, t)
  (SDE-based)                          (ODE-based: learn a velocity field)
```
The Flow Matching training objective:
```
L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]

where:
  x_t = (1 - t) * x_0 + t * x_1   (linear interpolation)
  u_t = x_1 - x_0                 (target velocity: the straight path)
  t   ~ Uniform(0, 1)             (or logit-normal)
  x_0 ~ p_data                    (real data)
  x_1 ~ N(0, I)                   (Gaussian noise)
```
2.5.2 Rectified Flow
Rectified Flow (Liu et al., 2023; ICLR 2023 Spotlight) is the key Flow Matching variant, connecting noise-data pairs with straight lines from an optimal transport perspective.
Key ideas:
- 1-Rectified Flow: randomly pair data x_0 with noise x_1 and learn the straight path between them
- 2-Rectified Flow (Reflow): re-train on pairs generated by the 1-Rectified Flow model, making the learned paths even straighter
- Distillation: distill the straightened model into a 1-step model
```python
# Core Rectified Flow training step
def rectified_flow_train_step(model, x_0, x_1=None):
    """
    x_0: real data (latents)
    x_1: noise (sampled randomly if None)
    """
    if x_1 is None:
        x_1 = torch.randn_like(x_0)
    # Timestep sampling (logit-normal, as in SD3)
    t = torch.sigmoid(torch.randn(x_0.shape[0]))  # logit-normal
    t = t.view(-1, 1, 1, 1)
    # Linear interpolation
    x_t = (1 - t) * x_0 + t * x_1
    # Target velocity (the straight-line direction)
    target_v = x_1 - x_0
    # Predict the velocity
    v_pred = model(x_t, t)
    # Loss
    loss = F.mse_loss(v_pred, target_v)
    return loss
```
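Sampling from a trained velocity field is plain Euler integration of the ODE, which is why near-straight paths allow so few steps. The sketch below uses the same convention as the training code above (t = 1 is pure noise, t = 0 is data); it is a minimal illustration, not a production sampler.

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, num_steps=4):
    """Few-step Euler integration of dx/dt = v_θ(x_t, t) from t=1 (noise)
    down to t=0 (data). With a well-rectified, near-straight velocity
    field, even a handful of steps gives usable samples."""
    x = torch.randn(shape)  # start at pure noise (t = 1)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = t.expand(shape[0]).view(-1, 1, 1, 1)
        v = model(x, t_batch)        # predicted velocity at (x, t)
        x = x + (t_next - t) * v     # Euler step (dt is negative)
    return x
```

Reflow makes the field straighter, so the Euler discretization error shrinks; in the limit of perfectly straight paths, one step suffices.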
2.5.3 Flow Matching in Stable Diffusion 3
SD3 (Esser et al., 2024) was the first model to apply Rectified Flow at large scale for T2I. Key contributions of the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":
1. Logit-Normal Timestep Sampling:
Timesteps are sampled from a logit-normal distribution instead of a uniform one, weighting the middle of the trajectory (the hardest prediction region) more heavily.
```python
# SD3's logit-normal timestep sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
    """Puts more weight on intermediate timesteps."""
    u = torch.randn(batch_size) * s + m
    t = torch.sigmoid(u)  # in (0, 1)
    return t
```
2. MM-DiT (Multimodal Diffusion Transformer):
SD3 introduced a new Transformer architecture that keeps separate weights for the text and image streams while allowing bidirectional information flow between them.
```
[MM-DiT block]

 Image tokens          Text tokens
      │                     │
 ┌────┴────┐           ┌────┴────┐
 │adaLN(t) │           │adaLN(t) │
 └────┬────┘           └────┬────┘
      │                     │
      └──────┬──────────────┘
             │ (concatenate)
      ┌──────┴──────┐
      │ Joint Self- │ ← image and text tokens
      │  Attention  │   attend to each other
      └──────┬──────┘
             │ (split)
      ┌──────┴─────────────┐
      │                    │
 ┌────┴────┐          ┌────┴────┐
 │   FFN   │          │   FFN   │
 │ (image) │          │ (text)  │
 └────┬────┘          └────┬────┘
      │                    │
 Image out             Text out
```
3. Scaling behavior:
| Model | Blocks | Parameters | Validation loss |
|---|---|---|---|
| SD3-S | 15 | 450M | High |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Low (best performance) |
Validation loss was observed to decrease smoothly as model size and training steps increase.
2.5.4 Flux: Black Forest Labs' Flow Matching Model
Flux (Black Forest Labs, 2024) builds on SD3's Rectified Flow + Transformer architecture.
| Variant | Training method | Inference steps | Characteristics |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |
Guidance distillation: a student model learns to reproduce, without CFG, the output of a teacher that uses CFG, removing the 2x forward pass that CFG requires at inference.
Latent Adversarial Diffusion Distillation (LADD): combines a GAN-style adversarial loss with diffusion distillation, enabling 1-4 step generation.
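A guidance-distillation training step might look like the following sketch. The exact Flux recipe is not public; in particular, having the student take the guidance scale w as an extra input is an assumption of this illustration, chosen so a single student can cover a range of scales.

```python
import torch
import torch.nn.functional as F

def guidance_distillation_loss(student, teacher, x_t, t, cond, null_cond, w):
    """Guidance distillation sketch: the student learns to match the
    teacher's CFG-combined prediction in one forward pass, removing the
    double (conditional + unconditional) evaluation at inference."""
    with torch.no_grad():
        eps_uncond = teacher(x_t, t, null_cond)
        eps_cond = teacher(x_t, t, cond)
        # The teacher's guided output is the distillation target
        target = eps_uncond + w * (eps_cond - eps_uncond)
    pred = student(x_t, t, cond, w)  # student conditioned on w (assumption)
    return F.mse_loss(pred, target)
```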
3. Text Conditioning Methods
Text conditioning is the mechanism by which the meaning of a text prompt is injected into the image generation process. The choice of text encoder and conditioning scheme has a decisive effect on quality.
3.1 CLIP Text Encoder
OpenAI's CLIP (Contrastive Language-Image Pre-training; Radford et al., 2021) was contrastively trained on 400 million image-text pairs.
```
[CLIP training]

Image ──→ Image Encoder ──→ image embedding ─┐
                                             ├─ cosine similarity
Text  ──→ Text Encoder  ──→ text embedding  ─┘

Objective: similarity ↑ for matched pairs, ↓ for mismatched pairs
(InfoNCE loss)
```
Characteristics of the CLIP text encoder:
- Both the token-level sequence embeddings and the pooled [EOS] embedding can be used
- Maximum length of 77 tokens
- Strong image-text alignment
- Text understanding specialized for visual concepts
| CLIP variant | Parameters | Used by |
|---|---|---|
| CLIP ViT-L/14 | ~124M (text) | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M (text) | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M (text) | SDXL (primary) |
| CLIP ViT-L/14 | ~124M (text) | SDXL (secondary) |
3.2 T5 Text Encoder
Google's T5 (Text-to-Text Transfer Transformer; Raffel et al., 2020) is a large language model trained on pure text corpora.
T5's advantages (demonstrated in the Imagen paper):
- Trained on a far larger text corpus than CLIP (the C4 dataset)
- Better at complex sentence structure and relational understanding
- Handles complex prompts: spatial relations, counts, attribute combinations
- Scaling the text encoder is more effective than scaling the U-Net (Imagen's key finding)
| T5 variant | Parameters | Used by |
|---|---|---|
| T5-Small | 60M | Research/experiments |
| T5-Base | 220M | Research/experiments |
| T5-Large | 770M | Research/experiments |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |
3.3 Cross-Attention Mechanism
Cross-attention is the core mechanism that injects text information into image features inside the U-Net or DiT.
```python
# Cross-attention implementation
class CrossAttention(nn.Module):
    def __init__(self, d_model, d_context, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)    # latent → Q
        self.to_k = nn.Linear(d_context, d_model, bias=False)  # text → K
        self.to_v = nn.Linear(d_context, d_model, bias=False)  # text → V
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: (B, H*W, d_model) - image latent features
        context: (B, seq_len, d_context) - text embeddings
        """
        B = x.shape[0]
        q = self.to_q(x)        # the image provides the queries
        k = self.to_k(context)  # the text provides the keys
        v = self.to_v(context)  # the text provides the values
        # Multi-head reshape
        q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Attention
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        attn = F.softmax(attn, dim=-1)
        out = attn @ v
        out = out.transpose(1, 2).reshape(B, -1, self.n_heads * self.d_head)
        return self.to_out(out)
```
3.4 Pooled vs. Sequence Text Embeddings
Modern T2I models use two kinds of text embeddings at the same time:
```
[Types of text embeddings]

Text: "a photo of a cat"
          │
     ┌────┴────┐
     │  Text   │
     │ Encoder │
     └────┬────┘
          │
     ┌────┴──────────────────────┐
     │                           │
     ▼                           ▼
Sequence embeddings        Pooled embedding
(token-level)              (sentence-level)
[h_1, h_2, ..., h_n]       h_pool = h_[EOS]
Shape: (seq_len, d)        Shape: (d,)
     │                           │
     ▼                           ▼
Used in cross-attention    Used for global conditioning
(fine-grained per-token    (whole-sentence meaning)
 information)              - Added to the timestep embedding
                           - Modulates adaLN parameters
                           - Vector conditioning
```
How SDXL uses its dual text encoders:
```python
# SDXL text conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
    # CLIP ViT-L: sequence embeddings (77, 768)
    clip_l_output = clip_l(text)
    clip_l_seq = clip_l_output.last_hidden_state   # (77, 768)
    clip_l_pooled = clip_l_output.pooler_output    # (768,)
    # OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
    clip_g_output = clip_g(text)
    clip_g_seq = clip_g_output.last_hidden_state   # (77, 1280)
    clip_g_pooled = clip_g_output.pooler_output    # (1280,)
    # Concatenate sequence embeddings → used in cross-attention
    text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1)          # (77, 2048)
    # Concatenate pooled embeddings → used for vector conditioning
    pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # (2048,)
    return text_embeddings, pooled_embeddings
```
SD3 and Flux additionally combine T5-XXL sequence embeddings, forming a triple-text-encoder setup:
| Encoder | Role | Output shape | Use |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
4. Training Datasets
A T2I model's quality depends directly on the scale and quality of its training data. The major large-scale datasets:
4.1 Dataset Comparison
| Dataset | Scale | Source | Filtering | Notable users |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Google search | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Google search | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawl | Top 10% by CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image pairs | Research |
| SAM | 11M images | Diverse sources | Manual + model-based | Segmentation + T2I |
| Internal (proprietary) | Billions of pairs | Undisclosed | Undisclosed | DALL-E 3, Midjourney |
4.2 The LAION-5B Filtering Pipeline
LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:
```
[LAION-5B collection and filtering pipeline]

Common Crawl (web archive)
        │
        ▼
1. HTML parsing: extract src URL + alt-text from <img> tags
        │
        ▼
2. Image download (img2dataset)
   - Minimum resolution filter: width, height ≥ 64
   - Maximum aspect ratio: 3:1
        │
        ▼
3. CLIP similarity filtering
   - Compute image-text similarity with OpenAI CLIP ViT-B/32
   - English: cosine similarity ≥ 0.28
   - Other languages: cosine similarity ≥ 0.26
        │
        ▼
4. Safety filtering
   - NSFW detection score (CLIP-based)
   - Watermark detection score
   - Toxic content detection
        │
        ▼
5. Deduplication
   - Hash-based exact-duplicate removal
   - CLIP-embedding-based near-duplicate removal
        │
        ▼
Result: 5.85B image-text pairs
   - 2.32B English
   - 2.26B in 100+ other languages
   - 1.27B with undetermined language
```
4.3 Data Quality Assessment
Recent models increasingly prioritize data quality over raw quantity:
1. CLIP-score-based filtering:
```python
# Computing a CLIP score
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
clip_score = outputs.logits_per_image.item() / 100.0  # normalized
```
2. Aesthetic score filtering:
LAION-Aesthetics trains a separate aesthetic predictor (CLIP embedding → MLP → score) and keeps only images with an aesthetic score of 4.5 or higher.
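Such a predictor can be sketched as a small MLP over CLIP image embeddings. The layer sizes below are illustrative, not the released LAION checkpoint's exact architecture; `clip_dim=768` assumes CLIP ViT-L embeddings.

```python
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    """Sketch of a LAION-style aesthetic predictor: a small MLP mapping a
    CLIP image embedding to a scalar aesthetic score."""
    def __init__(self, clip_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, clip_embedding):
        # Normalize, since CLIP similarity operates on unit vectors
        emb = clip_embedding / clip_embedding.norm(dim=-1, keepdim=True)
        return self.mlp(emb).squeeze(-1)

# Dataset filtering then keeps images whose predicted score passes a threshold:
# scores = predictor(clip_embeddings); keep = scores > 4.5
```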
3. Caption quality improvement (DALL-E 3's key innovation):
DALL-E 3 (Betker et al., 2023) achieved dramatic gains purely by improving caption quality, with no architecture change:
- A dedicated image captioning model was trained to produce detailed synthetic captions
- Training used 95% synthetic captions + 5% original captions
- Three variants were compared: short synthetic, detailed synthetic, and human annotations
- Detailed synthetic captions were overwhelmingly the best
```
[DALL-E 3 caption improvement]

Before: "cat on table"
        → vague, lacking detail

After:  "A fluffy orange tabby cat sitting on a round wooden
         dining table, natural sunlight streaming through a
         window behind, casting soft shadows. The cat has
         bright green eyes and is looking directly at the camera."
        → detailed attributes, spatial relations, lighting
```
4.4 Data Preprocessing Techniques
| Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop a square from the image center | Resolution standardization |
| Random Crop | Crop at a random position | Data augmentation |
| Bucket Sampling | Group images of similar aspect ratio | Multi-aspect-ratio training (SDXL) |
| Caption Dropout | Replace the caption with an empty string at some rate | Supports CFG training |
| Multi-resolution | Staged training from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Randomly shuffle tag order | Reduces tag-order bias |
5. Fine-tuning & Customization Techniques
Fine-tuning techniques that adapt a pretrained T2I model to a specific style, subject, or control signal are central to practical use.
5.1 LoRA (Low-Rank Adaptation)
LoRA (Hu et al., 2022) fine-tunes the weights of large models efficiently and is a workhorse for T2I models as well.
```
[How LoRA works]

Original weights:  W_0 ∈ R^{d×k}  (frozen)
LoRA update:       ΔW = B × A   where A ∈ R^{r×k}, B ∈ R^{d×r}
Output:            h = W_0 x + ΔW x = W_0 x + B(Ax)
```
- r << min(d, k): low rank (typically 4, 8, 16, 32, or 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- The original weights stay frozen → memory-efficient
```python
# Example LoRA wrapper (for attention layers in the Stable Diffusion U-Net)
class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1.0):
        super().__init__()
        self.original = original_layer  # frozen
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        # LoRA layers
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank
        # Initialization
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)  # B starts at 0 → initially identical to the original

    def forward(self, x):
        original_out = self.original(x)         # frozen original output
        lora_out = self.lora_B(self.lora_A(x))  # LoRA update
        return original_out + self.scale * lora_out
```
LoRA training setup (using Diffusers):
```bash
# Example Diffusers LoRA training run
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --rank=4 \
  --mixed_precision="fp16" \
  --output_dir="./sdxl-naruto-lora"
```
| LoRA parameter | Typical range | Effect |
|---|---|---|
| Rank (r) | 4-128 | Higher → more capacity, more memory |
| Alpha (α) | equal to rank, up to 2x | Scales the update strength |
| Target modules | attn Q,K,V,O + FFN | Where LoRA is applied |
| Learning rate | 1e-4 to 1e-5 | Convergence speed |
| Training time | 5-30 min (single GPU) | Enables fast iteration |
| File size | 1-200 MB | Easy to share and distribute |
5.2 DreamBooth
DreamBooth (Ruiz et al., 2023) injects the concept of a specific subject into the model from just 3-5 images.
```
[DreamBooth training]

Input: 3-5 images of the subject + a unique identifier "[V]"
       e.g., "a [V] dog" (one specific dog)

Training strategy:
1. Fine-tune the model on the subject images
   - "a [V] dog" → images of that dog
2. Prior preservation loss (the key!)
   - Pre-generate "a dog" images with the original model
   - During fine-tuning, keep "a dog" → generic-dog generation intact
   - Prevents language drift

L = L_recon([V] images) + λ * L_prior(class images)
```
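The combined objective can be sketched as a single training step. All helper names here (`add_noise`, the precomputed latents, timesteps, and conditions) are placeholders for the noise scheduler and the text/VAE encoders; the point is the two-term structure of the loss.

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(model, inst_latents, inst_t, inst_noise, inst_cond,
                    class_latents, class_t, class_noise, class_cond,
                    add_noise, prior_weight=1.0):
    """DreamBooth objective sketch: denoising loss on the subject images
    ("a [V] dog") plus a prior-preservation loss on class images that were
    generated by the frozen original model ("a dog")."""
    # Subject (instance) term
    noisy = add_noise(inst_latents, inst_noise, inst_t)
    inst_loss = F.mse_loss(model(noisy, inst_t, inst_cond), inst_noise)
    # Prior-preservation term, weighted by λ
    noisy_c = add_noise(class_latents, class_noise, class_t)
    prior_loss = F.mse_loss(model(noisy_c, class_t, class_cond), class_noise)
    return inst_loss + prior_weight * prior_loss
```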
```bash
# DreamBooth + LoRA training (recommended combination), using the diffusers library
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./my_dog_images" \
  --instance_prompt="a photo of sks dog" \
  --class_data_dir="./class_dog_images" \
  --class_prompt="a photo of dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --max_train_steps=500 \
  --rank=4 \
  --mixed_precision="fp16"
```
5.3 Textual Inversion
Textual Inversion (Gal et al., 2023) learns only a new token embedding, without modifying the model weights at all.
```
[Textual Inversion]

Existing token space: [cat] [dog] [car] [tree] ...
                                │
New token added:              [S*]  ← the new concept to learn
                                │
Training: optimize only the embedding vector of [S*] on 3-5 images;
          the rest of the model stays entirely frozen
```
Pros: minuscule parameter count (one token = 768 or 1024 floats)
Cons: less expressive than LoRA/DreamBooth
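The optimization loop reduces to training a single vector. The sketch below substitutes a toy loss for the frozen text encoder + diffusion model (the real loss embeds "a photo of [S*]" with the learned vector at the [S*] position and backpropagates the denoising MSE); its only purpose is to show that `new_token_embedding` is the sole parameter receiving gradients.

```python
import torch

# Only one new embedding vector is trainable; 768 matches CLIP ViT-L's
# text width and is illustrative.
text_encoder_dim = 768
new_token_embedding = torch.nn.Parameter(torch.zeros(text_encoder_dim))
optimizer = torch.optim.AdamW([new_token_embedding], lr=1e-1)

def toy_denoising_loss(embedding):
    # Stand-in for: embed the prompt with `embedding` substituted at [S*],
    # run the frozen diffusion model, return the denoising MSE loss.
    target = torch.ones_like(embedding)
    return ((embedding - target) ** 2).mean()

initial_loss = toy_denoising_loss(new_token_embedding).item()
for _ in range(50):
    loss = toy_denoising_loss(new_token_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
final_loss = toy_denoising_loss(new_token_embedding).item()
```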
5.4 ControlNet
ControlNet (Zhang & Agrawala, 2023) adds **structural conditions (edges, depth, pose, etc.)** to a pretrained diffusion model.
```
[ControlNet architecture]

            Control input (e.g., a Canny edge map)
                        │
                  ┌─────┴─────┐
                  │   Zero    │
                  │   Conv    │
                  └─────┬─────┘
                        │
                  ┌─────┴─────┐
Locked U-Net      │ Trainable │ ← a copy of the U-Net encoder
(original,        │  Copy of  │   (trainable)
 frozen)          │ U-Net Enc │
   │              └─────┬─────┘
   │                    │
   │              ┌─────┴─────┐
   │              │   Zero    │ ← outputs 0 early in training
   │              │   Conv    │   (starts without affecting the original model)
   │              └─────┬─────┘
   │                    │
   └──────── + ─────────┘  ← added to the original U-Net features
             │
        Final output
```
ControlNet의 핵심 학습 기법 - Zero Convolution:
# Zero Convolution: 가중치와 바이어스를 0으로 초기화
class ZeroConv(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
self.conv = nn.Conv2d(in_channels, out_channels, 1)
nn.init.zeros_(self.conv.weight)
nn.init.zeros_(self.conv.bias)
def forward(self, x):
return self.conv(x)
# 학습 초기: zero conv 출력 = 0
# → ControlNet 추가해도 원본 모델 출력에 영향 없음
# → 학습이 진행되면서 점차 control signal 반영
| 조건 유형 | 입력 | 용도 |
|---|---|---|
| Canny Edge | 엣지 맵 | 윤곽선 기반 생성 |
| Depth | 깊이 맵 | 3D 구조 보존 |
| OpenPose | 관절 위치 | 인체 포즈 제어 |
| Semantic Segmentation | 세그멘테이션 맵 | 레이아웃 제어 |
| Scribble | 낙서 | 대략적 구도 |
| Normal Map | 표면 법선 맵 | 3D 형태 제어 |
| Tile | 저해상도/타일 | Super-resolution |
5.5 IP-Adapter
Ye et al.(2023)의 IP-Adapter(Image Prompt Adapter)는 이미지를 프롬프트로 사용하여 스타일이나 주체를 전달하는 어댑터다.
[IP-Adapter 아키텍처]
Reference Image ──→ CLIP Image Encoder ──→ image features
│
┌─────┴─────┐
│ Projection │ ← 학습 대상
│ Layer │
└─────┬─────┘
│
┌─────┴─────┐
│ Decoupled │ ← 별도의 cross-attention
│ Cross-Attn │ (텍스트 cross-attn과 분리)
└─────┬─────┘
│
Original U-Net Cross-Attention ────── + ───────┘
(텍스트 conditioning)
출력 = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
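위 수식의 decoupled cross-attention을 최소 형태로 스케치하면 다음과 같다. 차원과 모듈 이름은 설명용 가정이다:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """텍스트용 K/V와 이미지용 K/V를 분리해 두 attention 출력을 합산하는 스케치."""
    def __init__(self, dim: int = 320, ctx_dim: int = 768, scale: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k_text = nn.Linear(ctx_dim, dim)
        self.to_v_text = nn.Linear(ctx_dim, dim)
        # IP-Adapter가 새로 학습하는 부분: 이미지 feature용 K/V projection
        self.to_k_img = nn.Linear(ctx_dim, dim)
        self.to_v_img = nn.Linear(ctx_dim, dim)
        self.scale = scale  # λ: 이미지 조건 반영 강도

    def forward(self, x, text_ctx, img_ctx):
        q = self.to_q(x)
        attn_text = F.scaled_dot_product_attention(
            q, self.to_k_text(text_ctx), self.to_v_text(text_ctx))
        attn_img = F.scaled_dot_product_attention(
            q, self.to_k_img(img_ctx), self.to_v_img(img_ctx))
        return attn_text + self.scale * attn_img

attn = DecoupledCrossAttention()
out = attn(torch.randn(1, 64, 320),   # latent 토큰
           torch.randn(1, 77, 768),   # 텍스트 임베딩
           torch.randn(1, 4, 768))    # projection된 CLIP 이미지 feature
```

추론 시 scale(λ)을 조절하면 텍스트 조건과 이미지 조건 사이의 비중을 제어할 수 있다.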
5.6 Fine-tuning 기법 비교
| 기법 | 수정 대상 | 학습 이미지 수 | 학습 시간 | 파일 크기 | 주요 용도 |
|---|---|---|---|---|---|
| LoRA | Attention 가중치 (low-rank) | 수십~수천 | 5-30분 | 1-200MB | 스타일, 개념 |
| DreamBooth | 전체 모델 or + LoRA | 3-10 | 5-60분 | 2-7GB (전체) 또는 1-200MB (LoRA) | 특정 주체 |
| Textual Inversion | 토큰 임베딩만 | 3-10 | 30분-수시간 | 수 KB | 단순 개념 |
| ControlNet | U-Net Encoder 복사본 | 수만~수십만 | 수일 | ~1.5GB | 구조적 제어 |
| IP-Adapter | Projection + Cross-Attn | 대규모 | 수일 | ~100MB | 이미지 프롬프팅 |
6. 최신 트렌드 (2024-2026)
6.1 Consistency Models
Song et al.(2023)의 Consistency Models는 diffusion model의 다단계 생성을 1-step 또는 few-step으로 단축하는 방법이다.
[Consistency Models 핵심 아이디어]
Diffusion: x_T → x_{T-1} → ... → x_1 → x_0 (수백 스텝)
Consistency:
PF-ODE trajectory 위의 모든 점 x_t가
동일한 x_0로 매핑되도록 학습
f_θ(x_t, t) = x_0 ∀t ∈ [0, T]
경계 조건: f_θ(x_ε, ε) = x_ε, self-consistency: 같은 trajectory 위에서 f_θ(x_t, t) = f_θ(x_t', t')
x_T ────→ f_θ ────→ x_0
│ ↑
x_t ────→ f_θ ───────┘ (같은 x_0로 매핑!)
│ ↑
x_t' ────→ f_θ ───────┘
두 가지 학습 방법:
| 방법 | 설명 | 장점 | 단점 |
|---|---|---|---|
| Consistency Distillation (CD) | 사전학습된 diffusion model 필요, PF-ODE 시뮬레이션 | 높은 품질 | teacher 모델 필요 |
| Consistency Training (CT) | 실제 데이터에서 직접 학습 | teacher 불필요 | CD보다 품질 다소 낮음 |
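CT의 핵심(같은 trajectory 위 인접 시점의 출력 일치)을 최소 형태로 스케치하면 다음과 같다. f의 시그니처와 σ(t)=t 형태의 전진 과정은 설명을 위한 가정이며, 실제 구현은 distance metric(LPIPS 등)과 시간 스케줄이 훨씬 정교하다:

```python
import torch
import torch.nn.functional as F

def consistency_training_loss(f_online, f_ema, x0, t1, t2):
    """같은 x0에서 만든 인접 시점 샘플 두 개가 같은 출력으로 매핑되도록
    online 모델을 EMA teacher 출력에 맞추는 CT 손실 스케치."""
    noise = torch.randn_like(x0)
    x_t1 = x0 + t1 * noise          # σ(t) = t 인 EDM식 전진 과정 가정
    x_t2 = x0 + t2 * noise
    pred = f_online(x_t2, t2)
    with torch.no_grad():
        target = f_ema(x_t1, t1)    # 더 작은 t 쪽 출력이 타깃
    return F.mse_loss(pred, target)

# f(x, t) = x 인 더미 모델로 동작 확인
identity = lambda x, t: x
loss = consistency_training_loss(identity, identity,
                                 torch.zeros(2, 3, 8, 8), t1=0.5, t2=1.0)
```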
성능:
- CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
- ImageNet 64x64: FID 6.20 (1-step)
후속 연구인 **Improved Consistency Training (iCT)**와 **Latent Consistency Models (LCM)**은 이를 대규모 T2I 모델에 적용하여, SDXL 수준의 모델에서 2-4 step 생성을 가능하게 했다.
6.2 DiT (Diffusion Transformer) 아키텍처의 확산
2024년 이후 DiT는 U-Net을 대체하여 T2I의 주류 backbone이 되고 있다:
| 모델 | 연도 | Backbone | 파라미터 | 핵심 특징 |
|---|---|---|---|---|
| DiT (원본) | 2023 | Transformer | 675M | 클래스 조건부, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, 저비용 학습 |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K 해상도, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, 3중 text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation 변형 |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | 중국어+영어 bilingual |
| Lumina-T2X | 2024 | DiT | 다양 | Multi-modal generation |
6.3 PixArt-alpha 와 PixArt-sigma
**PixArt-alpha(Chen et al., 2023)**는 효율적 DiT 학습의 선구자적 모델이다:
핵심 혁신 - Training Decomposition (학습 분해):
[PixArt-alpha 3단계 학습]
Stage 1: Pixel Dependency 학습 (저비용)
- ImageNet 사전학습된 DiT에서 시작
- 클래스 조건부 → T2I 전환의 기초
Stage 2: Text-Image Alignment 학습
- Cross-attention으로 텍스트 조건 주입
- LLaVA로 생성한 고품질 synthetic caption 사용
Stage 3: High-quality Aesthetic 학습
- 고품질 미적 데이터셋으로 fine-tuning
- JourneyDB 등 활용
총 학습 비용: ~675 A100 GPU days
(SD 1.5의 ~6,250 A100 GPU days 대비 10.8%)
**PixArt-sigma(Chen et al., 2024)**의 개선점:
- Weak-to-Strong Training: 약한 모델(PixArt-alpha)에서 출발해 더 고품질 데이터로 추가 학습하며 점진적으로 강화
- KV Compression: Attention에서 Key와 Value를 압축하여 효율성 향상 → 4K 해상도 가능
- 0.6B 파라미터로 SDXL(2.6B)과 대등한 성능
6.4 SDXL, SD3, Flux 비교
[세대별 Stable Diffusion 계보]
| 항목 | SD 1.x (2022) | SDXL (2023) | SD3 (2024) | Flux (2024) |
|---|---|---|---|---|
| Backbone | U-Net 860M | U-Net 2.6B | MM-DiT 2-8B | MM-DiT ~12B |
| Text Encoder | CLIP ViT-L | CLIP-L + OpenCLIP-G | CLIP-L + OpenCLIP-G + T5-XXL | CLIP-L + T5-XXL |
| 학습 목적 | Diffusion (DDPM) | Diffusion (DDPM) | Rectified Flow | Rectified Flow |
| 기본 해상도 | 512x512 | 1024x1024 | 1024x1024 | 1024x1024+ |
| Guidance | CFG 7.5 | CFG 5-9 | CFG 3.5-7 | Guidance Distillation |
6.5 DALL-E 3의 학습 혁신
DALL-E 3(Betker et al., 2023)의 핵심 혁신은 학습 데이터 캡션 품질 개선에 있다:
- Image Captioner 학습: CoCa 기반 이미지 캡셔닝 모델을 별도 학습
- Synthetic Caption 생성: 학습 데이터 전체를 상세한 synthetic caption으로 재라벨링
- Caption Mixing: 95% synthetic + 5% original caption으로 학습
- Descriptive vs Short: 상세한 설명형 캡션이 짧은 태그형보다 우수
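위의 caption mixing(95% synthetic + 5% original)은 다음과 같이 스케치할 수 있다. 함수명은 설명용 가정이다:

```python
import random

def pick_caption(synthetic: str, original: str, p_synth: float = 0.95) -> str:
    """95%는 상세한 synthetic caption, 5%는 원본 caption을 사용하는 스케치."""
    return synthetic if random.random() < p_synth else original

random.seed(0)
picks = [pick_caption("detailed synthetic caption", "alt text") for _ in range(10_000)]
synth_ratio = picks.count("detailed synthetic caption") / len(picks)
```

원본 caption을 일부 섞는 이유는, 추론 시 사용자가 입력하는 짧고 불규칙한 프롬프트에 대한 일반화를 유지하기 위해서다.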
6.6 Playground v2.5의 3대 통찰
Playground v2.5(Li et al., 2024)는 SDXL 아키텍처 기반에서 학습 전략 개선으로 DALL-E 3, Midjourney 5.2를 능가했다:
1. EDM Noise Schedule 채택:
# EDM Framework (Karras et al., 2022)
# σ(t) 기반 noise schedule - Zero Terminal SNR 보장
# 기존 SD의 linear schedule 대비 색상/대비 크게 개선
def edm_precondition(sigma, x_noisy, F_theta):
    """EDM Preconditioning (σ_data = 1로 단순화한 형태)"""
c_skip = 1.0 / (sigma ** 2 + 1)
c_out = sigma / (sigma ** 2 + 1).sqrt()
c_in = 1.0 / (sigma ** 2 + 1).sqrt()
c_noise = sigma.log() / 4
D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
return D_x
2. Multi-Aspect Ratio Training:
- Bucketed dataset: 유사 종횡비 이미지를 그룹화하여 배치 구성
- 학습 시 다양한 종횡비 지원 (1:1, 4:3, 16:9 등)
3. Human Preference Alignment:
- 인간 선호도 데이터를 활용한 학습 전략
- Quality-tuning으로 미적 품질 극대화
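위 2번의 bucketed dataset 구성에서 핵심인 버킷 배정은 다음과 같이 스케치할 수 있다. 버킷 목록은 SDXL류 구현에서 흔히 쓰는 값을 흉내 낸 예시 가정이다:

```python
def assign_bucket(width: int, height: int, buckets):
    """이미지를 종횡비가 가장 가까운 (w, h) 버킷에 배정하는 스케치."""
    ar = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ar))

# 총 픽셀 수는 비슷하게 유지하면서 종횡비만 달리한 예시 버킷
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]

b1 = assign_bucket(4000, 3000, BUCKETS)   # 4:3 사진
b2 = assign_bucket(1920, 1080, BUCKETS)   # 16:9 스크린샷
```

같은 버킷에 속한 이미지끼리만 배치를 구성하면, crop으로 구도를 훼손하지 않고도 고정 shape 텐서로 학습할 수 있다.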
7. 학습 파이프라인 실전 가이드
7.1 학습 인프라
GPU 요구사항
| 학습 규모 | 권장 GPU | VRAM | 학습 기간 | 비용 (추정) |
|---|---|---|---|---|
| LoRA Fine-tuning | RTX 3090/4090 1대 | 24GB | 5-30분 | < $1 |
| DreamBooth | A100 40GB 1대 | 40GB | 30-60분 | $2-5 |
| ControlNet 학습 | A100 80GB 4-8대 | 320-640GB | 2-5일 | $500-2,000 |
| SD 1.5 수준 학습 | A100 80GB 256대 | ~20TB | 24일 | ~$150K |
| SDXL 수준 학습 | A100 80GB 512-1024대 | ~40-80TB | 수주 | ~$500K-1M |
| SD3/Flux 수준 학습 | H100 80GB 1024+대 | ~80TB+ | 수주-수개월 | > $1M |
분산 학습 전략
[대규모 분산 학습 구성]
┌─────────────────────────────────────────────────────┐
│ Data Parallel (DP/DDP) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Full │ │Full │ │Full │ │Full │ │
│ │Model │ │Model │ │Model │ │Model │ │
│ │Copy │ │Copy │ │Copy │ │Copy │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ Batch 1 Batch 2 Batch 3 Batch 4 │
│ │
│ → All-Reduce로 gradient 동기화 │
│ → 각 GPU에 서로 다른 데이터 배치 │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ FSDP (Fully Sharded Data Parallel) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Shard │ │Shard │ │Shard │ │Shard │ │
│ │ 1/4 │ │ 2/4 │ │ 3/4 │ │ 4/4 │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ → 모델 파라미터를 GPU별로 분할(shard) │
│ → Forward/Backward 시 필요한 shard만 All-Gather │
│ → 메모리 효율 극대화 (8B+ 모델 학습 가능) │
└─────────────────────────────────────────────────────┘
7.2 대표적인 학습 프레임워크: Diffusers
HuggingFace의 Diffusers 라이브러리는 T2I 모델 학습의 사실상 표준이다.
# Diffusers 기반 Text-to-Image 학습 전체 파이프라인
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F
# 1. 모델 로드
vae = AutoencoderKL.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)
# 2. VAE, Text Encoder 고정
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
# 3. Accelerator 설정 (분산 학습 + Mixed Precision)
accelerator = Accelerator(
mixed_precision="fp16", # or "bf16"
gradient_accumulation_steps=4,
)
# 4. Optimizer
optimizer = torch.optim.AdamW(
unet.parameters(),
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=1e-2,
eps=1e-8,
)
# 5. EMA 설정
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
unet.parameters(),
decay=0.9999,
use_ema_warmup=True,
)
# 6. 분산 학습 준비 (dataloader는 (images, captions) 배치를 내는 사용자 정의 로더로 가정)
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)
# 7. 학습 루프
for epoch in range(num_epochs):
for batch in dataloader:
with accelerator.accumulate(unet):
images = batch["images"]
captions = batch["captions"]
# Latent encoding
with torch.no_grad():
latents = vae.encode(images).latent_dist.sample()
latents = latents * vae.config.scaling_factor
# Text encoding
with torch.no_grad():
text_inputs = tokenizer(captions, padding="max_length",
max_length=77, return_tensors="pt")
text_embeds = text_encoder(text_inputs.input_ids)[0]
# Noise 추가
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            # Classifier-Free Guidance: 샘플 단위 랜덤 caption dropout (10%)
            drop_mask = torch.rand(text_embeds.shape[0], 1, 1) < 0.1
            text_embeds = torch.where(drop_mask, torch.zeros_like(text_embeds), text_embeds)
# Noise 예측
noise_pred = unet(noisy_latents, timesteps, text_embeds).sample
# 손실 계산
loss = F.mse_loss(noise_pred, noise)
# Backward
accelerator.backward(loss)
accelerator.clip_grad_norm_(unet.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
# EMA 업데이트
ema_unet.step(unet.parameters())
7.3 Mixed Precision Training
Mixed Precision은 FP32와 FP16/BF16을 혼합 사용하여 메모리와 계산 효율을 높이는 기법이다.
[Mixed Precision Training]
Forward Pass:
- 모델 가중치: FP16/BF16 (메모리 절반)
- Activation: FP16/BF16
Loss Scaling:
- Loss를 큰 스케일(예: 2^16)로 곱하여 gradient underflow 방지
- Backward 후 gradient를 다시 스케일 다운
Backward Pass:
- Gradient: FP16/BF16
Optimizer Step:
- Master Weights: FP32 (정밀도 유지!)
- FP32 master weights 업데이트 후 FP16 사본 생성
| Precision | 메모리 | 연산 속도 | 수치 안정성 | 권장 |
|---|---|---|---|---|
| FP32 | 4 bytes | 기준 | 최고 | Optimizer State |
| FP16 | 2 bytes | ~2x | 낮음 (overflow 위험) | Forward/Backward |
| BF16 | 2 bytes | ~2x | 높음 (넓은 범위) | H100/A100에서 권장 |
| TF32 | 4 bytes (저장) | ~1.5x | 높음 | A100 default |
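위의 Loss Scaling 단계를 개념적으로 스케치하면 다음과 같다. 실전에서는 torch.cuda.amp.GradScaler가 이 과정을 자동으로 처리하며, overflow가 감지되면 스케일을 동적으로 낮춘다:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_scale = 2.0 ** 16   # FP16 gradient underflow 방지를 위한 스케일 업

x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = F.mse_loss(model(x), y)
(loss * loss_scale).backward()     # 스케일된 loss로 backward
for p in model.parameters():
    p.grad /= loss_scale           # optimizer step 전에 다시 스케일 다운
optimizer.step()
```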
# BF16 Mixed Precision 설정 (accelerate 기반)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
7.4 EMA (Exponential Moving Average)
EMA는 학습 중 모델 가중치의 이동 평균을 유지하여 추론 시 더 안정적인 결과를 얻는 기법이다. 거의 모든 T2I 모델 학습에서 사용된다.
[EMA 업데이트]
θ_ema ← λ * θ_ema + (1 - λ) * θ_model
- λ: decay rate (보통 0.9999 ~ 0.99999)
- θ_model: 현재 학습 중인 모델 가중치
- θ_ema: EMA 가중치 (추론 시 사용)
- 효과: gradient noise를 평활화하여 더 안정적인 가중치
# Diffusers의 EMA 구현
from diffusers.training_utils import EMAModel
# EMA 모델 생성
ema_model = EMAModel(
unet.parameters(),
decay=0.9999, # decay rate
use_ema_warmup=True, # warmup 사용
inv_gamma=1.0, # warmup 파라미터
power=3/4, # warmup 파라미터
)
# 매 학습 스텝마다 업데이트
ema_model.step(unet.parameters())
# 추론 시 EMA 가중치 적용
ema_model.copy_to(unet.parameters())
# 또는 context manager 사용
with ema_model.average_parameters():
# 이 블록 안에서는 EMA 가중치 사용
output = unet(noisy_latents, timesteps, text_embeds)
7.5 학습 하이퍼파라미터 가이드
| 하이퍼파라미터 | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (총) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | 원본 해상도 |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
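표의 Warmup Steps 항목이 의미하는 linear warmup 후 constant 유지 스케줄을 스케치하면 다음과 같다. 함수명은 설명용 가정이다:

```python
def lr_with_warmup(step: int, base_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """초반 warmup_steps 동안 0에서 base_lr까지 선형 증가 후 유지하는 스케치."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

lrs = [lr_with_warmup(s) for s in (0, 5_000, 10_000, 500_000)]
```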
8. 주요 논문 레퍼런스 정리
8.1 핵심 논문 테이블
| # | 논문명 | 저자 | 연도 | 핵심 기여 | 링크 |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN 프레임워크 제안 | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector Quantized 이산 잠재 공간 | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based 생성기, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | 대규모 GAN 학습, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | 계층적 VQ-VAE, 고해상도 생성 | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Diffusion 모델의 실용적 학습 | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP 대조 학습, 이미지-텍스트 정렬 | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + Autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent Diffusion, Cross-Attention 조건화 | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP 기반 2단계 Diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | T5-XXL 텍스트 인코더의 효과, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG 학습 기법, unconditional-conditional 동시 학습 | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, 20B 스케일링 | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning 기법 | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | 체계적 Diffusion 설계 공간 분석, Preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | 새로운 토큰 임베딩 학습으로 개인화 | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | 소수 이미지로 주체 개인화, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | 구조적 제어(edge, depth, pose) 추가 | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step 생성, PF-ODE 일관성 학습 | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | 대형 U-Net, Dual Text Encoder, Multi-AR 학습 | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE 기반 Flow Matching 프레임워크 | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, Optimal Transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | 이미지 프롬프트 어댑터, Decoupled Cross-Attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | 효율적 자기회귀 T2I, Retrieval Augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | 저비용 DiT 학습, 학습 분해 전략 | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Synthetic 캡션으로 극적 품질 향상 | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-Strong 학습, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, Rectified Flow 대규모 적용, Logit-Normal | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM Noise Schedule, Multi-AR, Human Preference | arXiv:2402.17245 |
8.2 추가 참고 논문
| 논문명 | 연도 | 핵심 |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 58.5억 오픈 이미지-텍스트 데이터셋 |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, 속도 향상 |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | 단계적 증류로 추론 가속 |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step 생성 |
| Latent Consistency Models | 2024 | LCM, SDXL 기반 few-step 생성 |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL 생성 |
| Stable Cascade | 2024 | Wuerstchen 기반 3단계 계층적 생성 |
9. 결론 및 전망
Text-to-Image 모델 학습 방법론은 GAN의 adversarial training에서 출발하여, Diffusion의 iterative denoising을 거쳐, 현재 Flow Matching + DiT라는 새로운 패러다임으로 수렴하고 있다.
핵심 트렌드 요약
[T2I 학습 방법론 발전 방향]
효율성: Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
(수개월, $1M+) (수분, <$1) (수초)
아키텍처: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
(SD 1.x-SDXL) (DiT, PixArt) (SD3, Flux)
생성 속도: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
(DDPM) (DDIM, DPM++) (LCM, LADD, CM)
데이터 품질: 웹 크롤링 ──→ 필터링 ──→ Synthetic Caption
(LAION raw) (aesthetic) (DALL-E 3 방식)
텍스트 이해: CLIP only ──→ CLIP + T5 ──→ 3중 Encoder
(SD 1.x) (Imagen) (SD3, Flux)
향후 전망
학습 효율성 극대화: PixArt-alpha가 보여준 것처럼, 학습 비용을 1/10 이하로 줄이면서 품질을 유지하는 방향이 지속될 것이다.
데이터 중심 접근법(Data-Centric AI): DALL-E 3가 입증한 것처럼, 아키텍처보다 데이터 품질과 캡셔닝이 더 중요해지고 있다.
Few-Step / One-Step 생성: Consistency Models, LCM, LADD 등의 증류 기법이 발전하여 실시간 생성이 표준이 될 것이다.
Unified Multi-Modal Generation: 텍스트-이미지뿐 아니라 비디오, 3D, 오디오를 통합하는 모델로 확장되고 있다.
개인화(Personalization) 고도화: LoRA, DreamBooth, IP-Adapter를 넘어 더 적은 데이터로, 더 정확한 주체 재현이 가능해질 것이다.
T2I 모델의 학습 방법론은 단순히 "더 큰 모델을 더 많은 데이터로 학습"하는 것을 넘어, 어떤 데이터를, 어떤 스케줄로, 어떤 conditioning과 함께 학습하느냐가 핵심이 되는 시대로 접어들었다. 이 글에서 다룬 방법론들을 기반으로 자신만의 T2I 모델을 학습하거나, 기존 모델을 효과적으로 커스터마이징하는 데 활용할 수 있기를 바란다.
참고 자료
Complete Guide to Text-to-Image Model Training Methodologies: From GAN to Flow Matching
- 1. Introduction: The Evolution of Text-to-Image Generative Models
- 2. Training Methodologies by Core Architecture
- 2.1 GAN-Based: Adversarial Training
- 2.2 VAE-Based: Codebook Learning and Discrete Latent Space
- 2.3 Diffusion-Based: The Core of Current T2I
- 2.3.1 DDPM: Denoising Diffusion Probabilistic Models
- 2.3.2 Noise Scheduling
- 2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion
- 2.3.4 Structure of the U-Net Backbone
- 2.3.5 Classifier-Free Guidance (CFG)
- 2.3.6 DALL-E 2: CLIP-Based Diffusion
- 2.3.7 Imagen: The Power of T5 Text Encoder
- 2.3.8 DiT: Diffusion Transformer
- 2.4 Autoregressive-Based T2I
- 2.5 Flow Matching: The Next-Generation Training Paradigm
- 3. Text Conditioning Methodologies
- 4. Training Datasets
- 5. Fine-tuning & Customization Techniques
- 6. Latest Trends (2024-2026)
- 7. Practical Training Pipeline Guide
- 8. Key Paper References
- 9. Conclusion and Outlook
- References
- Quiz
1. Introduction: The Evolution of Text-to-Image Generative Models
Text-to-Image (T2I) generative models are technologies that produce high-resolution images from natural language text prompts, and have undergone rapid development over the past several years. The trajectory of this field can be broadly divided into four paradigms.
[Text-to-Image Model Evolution Timeline]
2014-2019 2017-2020 2020-2023 2023-Present
| | | |
GAN VAE/VQ-VAE Diffusion Models Flow Matching
| | | + DiT
v v v v
┌──────────┐ ┌──────────┐ ┌────────────────┐ ┌──────────────┐
│StackGAN │ │ VQ-VAE │ │ DDPM (2020) │ │ SD3 (2024) │
│AttnGAN │ │ VQ-VAE-2 │ │ DALL-E 2(2022) │ │ Flux (2024) │
│StyleGAN │ │ dVAE │ │ Imagen (2022) │ │ Pixart-Sigma │
│BigGAN │ │ │ │ SD 1.x (2022) │ │ │
│GigaGAN │ │ │ │ SDXL (2023) │ │ │
└──────────┘ └──────────┘ └────────────────┘ └──────────────┘
Features: Features: Features: Features:
- Adversarial - Discrete - Iterative - Straight paths
Training Latent Space Denoising - ODE-based
- Mode Collapse - Codebook - Classifier-Free - Fewer steps
issues Learning Guidance - DiT backbone
- Fast generation - Two-stage - Latent Space - Scalable
Training - U-Net backbone
1.1 Why Training Methodology Matters
The quality of T2I models is critically determined not only by architecture design but also by training methodology. Even with identical architectures, generation quality varies dramatically depending on noise scheduling, conditioning approaches, data quality, and training strategies. A prime example is DALL-E 3, which achieved dramatic performance improvements over its predecessor through caption quality improvement alone without any architecture changes.
This article provides an in-depth, paper-based analysis of core training methodologies for each paradigm, covering practical training pipeline configuration as well.
2. Training Methodologies by Core Architecture
2.1 GAN-Based: Adversarial Training
Generative Adversarial Network (GAN) is a framework where two networks, the Generator and the Discriminator, are trained competitively.
2.1.1 Basic Training Principles
The training objective function of GAN is defined as a minimax game:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated ones
- Training goal: G tries to fool D, while D tries to classify accurately
2.1.2 StyleGAN Training Strategy
StyleGAN (Karras et al., 2019) introduced Progressive Growing and Style-based Generator to enable high-quality image generation.
Core Training Techniques:
| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start from low resolution (4x4) and progressively increase | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Generator Jacobian regularization | Improved generation quality |
| R1 Regularization | Discriminator gradient penalty | Training stabilization |
| Lazy Regularization | Apply regularization every 16 steps instead of every step | Improved training efficiency |
# StyleGAN2 core training loop (simplified)
for real_images, _ in dataloader:
# 1. Discriminator training
z = torch.randn(batch_size, latent_dim)
fake_images = generator(z)
d_real = discriminator(real_images)
d_fake = discriminator(fake_images.detach())
d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    # R1 Regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        # create_graph=True so the penalty remains differentiable w.r.t. D's weights
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss += 10.0 * r1_penalty
d_optimizer.zero_grad()
d_loss.backward()
d_optimizer.step()
# 2. Generator training
z = torch.randn(batch_size, latent_dim)
fake_images = generator(z)
d_fake = discriminator(fake_images)
g_loss = F.softplus(-d_fake).mean()
g_optimizer.zero_grad()
g_loss.backward()
g_optimizer.step()
2.1.3 Large-Scale Training with BigGAN
BigGAN (Brock et al., 2019) is a model that scaled up GAN to large scale, employing the following training strategies:
- Large-scale batches: Increase batch size up to 2048 for improved training stability and quality
- Class-conditional Batch Normalization: Inject class information into Batch Normalization parameters
- Truncation Trick: Truncate latent distribution at inference to control quality-diversity trade-off
- Orthogonal Regularization: Maintain orthogonality of weight matrices to prevent mode collapse
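The truncation trick above can be sketched as resampling latent entries that fall outside a magnitude threshold; the function name and threshold value are illustrative assumptions:

```python
import torch

def truncated_noise(batch_size: int, dim: int, threshold: float = 0.5):
    """Truncation trick sketch: resample z entries whose magnitude exceeds
    the threshold, trading diversity for sample quality."""
    z = torch.randn(batch_size, dim)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn(int(mask.sum()))  # redraw out-of-range entries
        mask = z.abs() > threshold
    return z

z = truncated_noise(4, 128, threshold=0.5)
```

A smaller threshold yields higher-fidelity but less diverse samples; BigGAN tunes this knob at inference time only.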
2.1.4 Limitations of GAN-Based T2I
GAN-based approaches ceded dominance to Diffusion-based models due to the following fundamental limitations:
- Mode Collapse: Limited generation diversity
- Training Instability: Unstable training sensitive to hyperparameters
- Text Conditioning difficulty: Difficult to accurately reflect complex text prompts
- Scaling limitations: Increased training instability at large scale
2.2 VAE-Based: Codebook Learning and Discrete Latent Space
2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder
VQ-VAE (van den Oord et al., 2017) is an approach that learns a discrete latent space instead of a continuous one.
[VQ-VAE Architecture]
Input Image Encoder Quantization Decoder Reconstructed
(256x256) --> [E(x)] --> z_e --> [Codebook] --> z_q --> [D(z_q)] --> Image
| |
| ┌────┴────┐
| │ e_1 │
| │ e_2 │ K code vectors
└──>│ ... │ (Codebook)
│ e_K │
└─────────┘
z_q = e_k where k = argmin_j ||z_e - e_j||
(quantize to nearest code vector)
VQ-VAE Training Loss Function:
L = ||x - D(z_q)||² # Reconstruction Loss
+ ||sg[z_e] - e||² # Codebook Loss (can be replaced by EMA updates)
+ β * ||z_e - sg[e]||² # Commitment Loss
- sg[·]: stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: selected codebook vector
Since the quantization operation is non-differentiable, the Straight-Through Estimator (STE) is used to pass gradients to the encoder. The codebook itself is updated via Exponential Moving Average (EMA).
# VQ-VAE Codebook core training code
class VectorQuantizer(nn.Module):
def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
super().__init__()
self.embedding = nn.Embedding(num_embeddings, embedding_dim)
self.commitment_cost = commitment_cost
def forward(self, z_e):
# z_e: (B, D, H, W) -> (B*H*W, D)
flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, z_e.shape[1])
# Find nearest codebook vector
distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
+ (self.embedding.weight ** 2).sum(dim=1) \
- 2 * flat_z @ self.embedding.weight.t()
indices = distances.argmin(dim=1)
z_q = self.embedding(indices).view_as(z_e.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Loss computation
        codebook_loss = F.mse_loss(z_q, z_e.detach())    # ||sg[z_e] - e||²: moves codebook vectors
        commitment_loss = F.mse_loss(z_q.detach(), z_e)  # ||z_e - sg[e]||²: commits the encoder
        loss = codebook_loss + self.commitment_cost * commitment_loss
# Straight-Through Estimator
z_q_st = z_e + (z_q - z_e).detach()
return z_q_st, loss, indices
2.2.2 VQ-VAE-2: Hierarchical Codebook Learning
VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization to significantly improve image quality.
[VQ-VAE-2 Hierarchical Structure]
Top Level (low resolution)
┌─────────────┐
│ 32x32 grid │ Global structure info
│ Codebook │ (composition, overall shape)
└──────┬──────┘
│
Bottom Level (high resolution)
┌──────┴──────┐
│ 64x64 grid │ Fine detail info
│ Codebook │ (textures, edges)
└─────────────┘
The image generation pipeline of VQ-VAE-2 consists of the following two stages:
- Stage 1: Train VQ-VAE-2 to encode images into hierarchical discrete codes
- Stage 2: Learn the prior of discrete codes with autoregressive models such as PixelCNN
This approach directly influenced the dVAE (discrete VAE) used in DALL-E.
2.3 Diffusion-Based: The Core of Current T2I
Diffusion Model is the mainstream paradigm for current T2I generation. It learns a forward process that gradually adds noise to data, and a reverse process that recovers data from noise.
2.3.1 DDPM: Denoising Diffusion Probabilistic Models
DDPM by Ho et al. (2020) is the key paper that elevated Diffusion Models to a practical level.
Forward Process (Diffusion):
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)
- Add a small amount of Gaussian noise at each timestep t
- β_t: noise schedule (typically linear or cosine)
- After T steps, x_T approximately equals N(0, I) (pure Gaussian noise)
Noise can be added directly at any timestep t in closed form:
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)
where ᾱ_t = ∏_{s=1}^{t} α_s, α_t = 1 - β_t
=> x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε, ε ~ N(0, I)
Reverse Process (Denoising):
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)
- A neural network ε_θ predicts the noise ε that was added to x_t
- The predicted noise is removed to recover x_{t-1}
Training Objective (Simple Loss):
L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]
- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε
# DDPM core training loop
def train_step(model, x_0, noise_scheduler):
batch_size = x_0.shape[0]
# 1. Random timestep sampling
t = torch.randint(0, num_timesteps, (batch_size,))
# 2. Noise sampling
noise = torch.randn_like(x_0)
# 3. Forward process: generate x_t
    alpha_bar_t = noise_scheduler.alpha_bar[t].view(-1, 1, 1, 1)  # (B,) -> (B,1,1,1) for broadcasting
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
# 4. Predict noise
noise_pred = model(x_t, t)
# 5. Loss computation (MSE)
loss = F.mse_loss(noise_pred, noise)
return loss
2.3.2 Noise Scheduling
The noise schedule determines the amount of noise added at each timestep in the forward process and has a decisive impact on generation quality.
| Schedule | Formula | Features | Models Used |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) * t/T | Simple but noise increases sharply at the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) * π/2) | Smooth transition, excellent information preservation | Improved DDPM |
| Scaled Linear | β_t = (β_min^0.5 + t/T * (β_max^0.5 - β_min^0.5))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12*t/T) | Gradual change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Theoretically near optimal | Playground v2.5, EDM |
| Zero Terminal SNR | Ensures SNR(T) = 0 | Guarantees starting from pure noise | SD3, Flux |
Playground v2.5 (Li et al., 2024) adopted EDM's (Karras et al., 2022) noise schedule, greatly improving color and contrast. The key is ensuring Zero Terminal SNR, where the Signal-to-Noise Ratio (SNR) at timestep T must be exactly 0 during training.
# Cosine Schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
steps = timesteps + 1
x = torch.linspace(0, timesteps, steps)
alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return torch.clip(betas, 0.0001, 0.9999)
# EDM Noise Schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
step_indices = torch.arange(num_steps)
t_steps = (sigma_max ** (1/rho) + step_indices / (num_steps - 1)
* (sigma_min ** (1/rho) - sigma_max ** (1/rho))) ** rho
return t_steps
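The Zero Terminal SNR property mentioned above can also be retrofitted onto an existing beta schedule by rescaling ᾱ so that √ᾱ_T becomes exactly 0, as sketched below (following the rescaling idea from "Common Diffusion Noise Schedules and Sample Steps are Flawed"; the function name is an assumption):

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the terminal SNR is exactly zero:
    shift sqrt(alpha_bar) so its last value is 0, then restore the first value."""
    alphas = 1.0 - betas
    alphas_bar = alphas.cumprod(dim=0)
    sqrt_ab = alphas_bar.sqrt()

    sqrt_ab_0, sqrt_ab_T = sqrt_ab[0].clone(), sqrt_ab[-1].clone()
    sqrt_ab = sqrt_ab - sqrt_ab_T                             # last value -> 0
    sqrt_ab = sqrt_ab * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)   # first value preserved

    alphas_bar = sqrt_ab ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]                 # recover per-step alphas
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas

betas = torch.linspace(1e-4, 0.02, 1000)
new_betas = rescale_zero_terminal_snr(betas)
```

After rescaling, sampling can start from pure Gaussian noise without the train/inference mismatch caused by a nonzero terminal SNR.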
2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion
Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically improved computational efficiency by performing diffusion in latent space instead of pixel space. This is the core idea behind Stable Diffusion.
[Latent Diffusion Model Architecture]
Text Prompt
│
┌────┴────┐
│ CLIP │
│ Encoder │
└────┬────┘
│ text embeddings
│
┌──────┐ ┌──────┐ ┌─────┴──────┐ ┌──────┐ ┌──────┐
│Image │ │ VAE │ │ U-Net │ │ VAE │ │Output│
│(512 │--->│Encode│--->│ (Denoising │--->│Decode│--->│Image │
│x512) │ │ r │ │ in Latent │ │ r │ │(512 │
│ │ │ │ │ Space) │ │ │ │x512) │
└──────┘ └──────┘ └────────────┘ └──────┘ └──────┘
│ │
│ 64x64x4 │
│ (8x downsampling) │
└─────────────────────────────────┘
Latent Space (z)
Training: Diffusion in Latent Space
Inference: Random noise z_T -> U-Net Denoising -> VAE Decode -> Image
LDM Training Pipeline:
Stage 1 - Autoencoder Training: Pretrain VAE (KL-regularized) on image datasets
- Encoder: Image x (H x W x 3) -> latent z (H/f x W/f x c), f=8 is typical
- Decoder: latent z -> Reconstructed image
- Loss: Reconstruction + KL Divergence + Perceptual Loss + GAN Loss
Stage 2 - Diffusion Model Training: Diffusion in the latent space of the frozen Autoencoder
- Add noise to latent z_0 = Encoder(x) to generate z_t
- U-Net predicts noise from z_t
- Text conditioning is injected via cross-attention
# Latent Diffusion core training
class LatentDiffusionTrainer:
def __init__(self, vae, unet, text_encoder, noise_scheduler):
self.vae = vae # Frozen
self.unet = unet # Trainable
self.text_encoder = text_encoder # Frozen
self.noise_scheduler = noise_scheduler
def train_step(self, images, captions):
# 1. Latent encoding with VAE (no gradient needed)
with torch.no_grad():
latents = self.vae.encode(images).latent_dist.sample()
latents = latents * self.vae.config.scaling_factor # 0.18215
# 2. Text embedding (no gradient needed)
with torch.no_grad():
text_embeddings = self.text_encoder(captions)
# 3. Add noise
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],))
noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
# 4. Predict noise
noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)
# 5. MSE loss
loss = F.mse_loss(noise_pred, noise)
return loss
2.3.4 Structure of the U-Net Backbone
The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:
[U-Net with Cross-Attention Structure]
Input z_t ─────────────────────────────────────────── Output ε_θ
│ ▲
▼ │
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Down │ │ Down │ │ Down │ │ Up │ │ Up │
│ Block │──│ Block │──│ Block │──┐ │ Block │──│ Block │
│ 64x64 │ │ 32x32 │ │ 16x16 │ │ │ 32x32 │ │ 64x64 │
└────────┘ └────────┘ └────────┘ │ └────────┘ └────────┘
│ │ │ │ ▲ ▲
│ │ │ ▼ │ │
│ │ │ ┌────────┐ │ │
│ │ └──│ Middle │──┘ │
│ │ │ Block │ │
│ │ │ 16x16 │ │
│ └───────────────└────────┘──────────────┘
│ (skip connections)
└────────────────────────────────────────────────────┘
Inside each Block:
┌──────────────────────────────────────┐
│ ResNet Block │
│ ├── GroupNorm → SiLU → Conv │
│ ├── Timestep Embedding injection │
│ └── GroupNorm → SiLU → Conv │
│ │
│ Self-Attention Block │
│ ├── LayerNorm → Self-Attention │
│ └── Skip Connection │
│ │
│ Cross-Attention Block │
│ ├── LayerNorm │
│ ├── Q = Linear(latent features) │
│ ├── K = Linear(text embeddings) │ ← Text Conditioning
│ ├── V = Linear(text embeddings) │
│ └── Attention(Q, K, V) │
│ │
│ Feed-Forward Block │
│ ├── LayerNorm → Linear → GEGLU │
│ └── Linear → Skip Connection │
└──────────────────────────────────────┘
SDXL (Podell et al., 2023) expanded the U-Net by approximately 3x (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and applies improvements including training at various aspect ratios.
| Model | U-Net Params | Text Encoder | Resolution | VAE Downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (1) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (1) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (2) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (1) | 1024x1024 | 8x |
2.3.5 Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) by Ho & Salimans (2022) is a core training technique for modern T2I models.
Problems with Traditional Classifier Guidance:
- Requires training a separate classifier
- Needs a classifier that works on noisy images
- Requires computing classifier gradients during inference
Classifier-Free Guidance Key Idea:
During training, text conditioning is replaced with an empty string ("") with a certain probability (typically 10-20%), so that a single model simultaneously learns both conditional and unconditional generation.
During training:
- with probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅) (unconditional)
- with probability 1-p_uncond: ε_θ(x_t, t, c) (conditional)
At inference:
ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))
- w: guidance scale (typically 7.5-15)
- w=1: use the conditional prediction as-is
- w>1: push the prediction further toward the text condition
# Classifier-Free Guidance training implementation
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
noise = torch.randn_like(x_0)
t = torch.randint(0, T, (x_0.shape[0],))
x_t = add_noise(x_0, noise, t)
# Randomly drop conditioning
mask = torch.rand(x_0.shape[0]) < p_uncond
cond = text_cond.clone()
cond[mask] = empty_text_embedding # null conditioning
noise_pred = model(x_t, t, cond)
loss = F.mse_loss(noise_pred, noise)
return loss
# Classifier-Free Guidance inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
x_t = x_T
for t in reversed(range(T)):
# Unconditional prediction
eps_uncond = model(x_t, t, empty_text_embedding)
# Conditional prediction
eps_cond = model(x_t, t, text_cond)
# Guided prediction
eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
x_t = denoise_step(x_t, eps, t)
return x_t
CFG dramatically improves generation quality and text fidelity, but if the guidance scale is too high, images become oversaturated or artifacts appear.
2.3.6 DALL-E 2: CLIP-Based Diffusion
DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture leveraging the CLIP embedding space.
[DALL-E 2 Training Pipeline]
Text ──→ CLIP Text Encoder ──→ text embedding
│
┌─────┴─────┐
│ Prior │ text emb → CLIP image emb
│ (Diffusion)│
└─────┬─────┘
│ CLIP image embedding
┌─────┴─────┐
│ Decoder │ CLIP image emb → 64x64 image
│ (Diffusion)│
└─────┬─────┘
│ 64x64
┌─────┴─────┐
│ Super-Res │ 64x64 → 256x256 → 1024x1024
│ (Diffusion)│
└─────────── ┘
2.3.7 Imagen: The Power of T5 Text Encoder
Google's Imagen (Saharia et al., 2022) maximized text understanding by using the T5-XXL (4.6B parameter) text encoder.
Key findings:
- Scaling text encoder size is more effective than scaling U-Net size
- T5-XXL outperforms CLIP ViT-L in text understanding quality
- Dynamic Thresholding: Stable generation even at high CFG scales
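Dynamic thresholding can be sketched as follows. This is a minimal interpretation of the technique described in the Imagen paper (the function name and exact tensor handling are illustrative): at each sampling step, the predicted x0 is clamped to a per-sample percentile s of its absolute values and rescaled, instead of statically clamping to [-1, 1].

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    # Per-sample: find the `percentile` quantile s of |x0_pred|,
    # clamp to [-s, s], then divide by s so values stay in [-1, 1].
    # Static thresholding would simply clamp to [-1, 1] directly.
    flat = x0_pred.abs().reshape(x0_pred.shape[0], -1)
    s = torch.quantile(flat, percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
    return x0_pred.clamp(-s, s) / s
```

This keeps pixel statistics in range even when a high CFG scale pushes predictions far outside [-1, 1].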
[Imagen Architecture]
Text ──→ T5-XXL (frozen) ──→ text embeddings
│
┌─────┴─────┐
│ Base Model │ Generates 64x64
│ (U-Net) │ cross-attention
└─────┬─────┘
│
┌─────┴─────┐
│ SR Model 1 │ 64x64 → 256x256
│ (U-Net) │
└─────┬─────┘
│
┌─────┴─────┐
│ SR Model 2 │ 256x256 → 1024x1024
│ (U-Net) │
└─────────── ┘
2.3.8 DiT: Diffusion Transformer
DiT (Diffusion Transformer) by Peebles & Xie (2023) is an architecture that replaces U-Net with Transformer, and is becoming the mainstream for recent T2I models.
[DiT Block Structure]
Input Tokens (patchified latent)
│
┌──────┴──────┐
│ LayerNorm │ ← adaLN-Zero (adaptive)
│ (adaptive) │ γ, β = MLP(timestep + class)
└──────┬──────┘
│
┌──────┴──────┐
│ Self- │
│ Attention │
└──────┬──────┘
│ (+ residual)
┌──────┴──────┐
│ LayerNorm │ ← adaLN-Zero
│ (adaptive) │
└──────┬──────┘
│
┌──────┴──────┐
│ Pointwise │
│ FFN │
└──────┬──────┘
│ (+ residual, scaled by α)
▼
Output Tokens
Key Design Decisions of DiT:
- Patchify: Split latent into p x p patches then linear projection to token sequence
- adaLN-Zero: Adaptive Layer Normalization, injecting timestep and class information into LN parameters
- Scaling: Systematic scaling law verification by model size (depth, width)
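The adaLN-Zero mechanism from the list above can be sketched like this (class and parameter names are illustrative, not DiT's exact code): the conditioning vector is mapped to shift/scale/gate parameters, and zero-initializing that mapping makes every residual branch start as the identity.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    # Regresses shift/scale/gate from the conditioning vector c
    # (timestep + class embedding in DiT). Zero-initializing the MLP
    # makes scale = shift = gate = 0, so each DiT block initially
    # passes its input through unchanged.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_params = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.to_params.weight)
        nn.init.zeros_(self.to_params.bias)

    def forward(self, x, c):
        # x: (B, N, dim) tokens, c: (B, dim) conditioning vector
        shift, scale, gate = self.to_params(c).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return h, gate.unsqueeze(1)  # caller computes: x + gate * sublayer(h)
```

The zero-initialized gate is what stabilizes training of very deep DiT stacks.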
| DiT Variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
2.4 Autoregressive-Based T2I
2.4.1 DALL-E (Original): Token-Based Autoregressive Generation
DALL-E (Ramesh et al., 2021) converts images into discrete tokens, then concatenates text tokens and image tokens into a single sequence to learn the joint distribution with an autoregressive Transformer.
[DALL-E Training Pipeline]
Stage 1: dVAE Training
Image (256x256) ──→ dVAE Encoder ──→ 32x32 grid of tokens (8192 vocabulary)
──→ dVAE Decoder ──→ Reconstructed Image
Loss: Reconstruction + KL Divergence (Gumbel-Softmax relaxation)
Stage 2: Autoregressive Transformer Training
[BPE text tokens (256)] + [Image tokens (1024)] = 1280 tokens
Transformer (12B params):
- 64 layers, 62 attention heads
- Training objective: next-token prediction (cross-entropy)
- Text tokens use causal attention (left to right)
- Image tokens are generated autoregressively in row-major order
- Image tokens attend to all text tokens
2.4.2 Parti: Encoder-Decoder Based
Google's Parti (Yu et al., 2022) formulated T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an Encoder-Decoder Transformer.
Key features:
- ViT-VQGAN: Vision Transformer-based image tokenizer
- Encoder-Decoder: Uses Encoder for text encoding, Decoder for image token generation
- Scaling: Systematic scale-up from 350M to 3B to 20B parameters
- Achieves quality comparable to Imagen at the 20B model
2.4.3 CM3Leon: Efficient Multimodal Autoregressive
Meta's CM3Leon (Yu et al., 2023) significantly improved the efficiency of the autoregressive approach:
- Retrieval-Augmented Training: Retrieve related image-text pairs during training and add to context
- Decoder-Only: Pure decoder-only architecture unlike Parti
- Instruction Tuning: Supervised fine-tuning for various tasks
- 5x less training cost: Reduces training compute by 1/5 for comparable performance
- Achieves MS-COCO zero-shot FID of 4.88
2.5 Flow Matching: The Next-Generation Training Paradigm
2.5.1 Basic Principles of Flow Matching
Flow Matching (Lipman et al., 2023) learns a straight path from noise distribution to data distribution through a deterministic ODE (Ordinary Differential Equation) instead of Diffusion's stochastic process.
[Diffusion vs Flow Matching Comparison]
Diffusion (Stochastic): Flow Matching (Deterministic):
x_0 ~~~> x_T x_0 ──────> x_1
(curved path, requires many steps) (straight path, fewer steps possible)
x₀ • x₀ •
\ Curved \ Straight
\ path \ path
\ \
\ \
x_T • x₁ • (= noise)
dx = f(x,t)dt + g(t)dW dx/dt = v_θ(x_t, t)
(SDE-based) (ODE-based, learns a velocity field)
Flow Matching Training Objective:
L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]
where:
x_t = (1 - t) * x_0 + t * x_1 (linear interpolation)
u_t = x_1 - x_0 (target velocity: straight-line path)
t ~ Uniform(0, 1) (or logit-normal)
x_0 ~ p_data (real data)
x_1 ~ N(0, I) (Gaussian noise)
2.5.2 Rectified Flow
Rectified Flow (Liu et al., 2023, ICLR 2023 Spotlight) is a key variant of Flow Matching that connects noise-data pairs in straight lines from an Optimal Transport perspective.
Key idea:
- 1-Rectified Flow: Randomly pair data x_0 and noise x_1 to learn straight paths
- 2-Rectified Flow (Reflow): Re-straighten pairs generated by 1-Rectified Flow to make paths closer to straight lines
- Distillation: Distill the straightened model into a 1-step model
# Rectified Flow core training
def rectified_flow_train_step(model, x_0, x_1=None):
"""
x_0: real data (latent)
x_1: noise (randomly sampled if None)
"""
if x_1 is None:
x_1 = torch.randn_like(x_0)
# Time sampling (logit-normal for SD3)
t = torch.sigmoid(torch.randn(x_0.shape[0])) # logit-normal
t = t.view(-1, 1, 1, 1)
# Linear interpolation
x_t = (1 - t) * x_0 + t * x_1
# Target velocity (straight direction)
target_v = x_1 - x_0
# Velocity prediction
v_pred = model(x_t, t)
# Loss
loss = F.mse_loss(v_pred, target_v)
return loss
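Sampling from the trained velocity field is plain ODE integration. A minimal Euler sampler matching the convention above (x_1 is noise, x_0 is data, so we integrate from t=1 down to t=0) might look like this; the step count and the simple Euler rule are illustrative choices:

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, num_steps=20):
    # Start at pure noise (t = 1) and integrate dx/dt = v back to t = 0.
    # The model predicts v = x_1 - x_0, so stepping by -v * dt moves
    # from the noise end of the straight path toward data.
    x = torch.randn(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        t_batch = torch.full((shape[0], 1, 1, 1), t)
        v = model(x, t_batch)
        x = x - v * dt
    return x
```

The straighter the learned paths, the fewer Euler steps are needed, which is exactly the motivation for Reflow and distillation.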
2.5.3 Flow Matching in Stable Diffusion 3
SD3 (Esser et al., 2024) is the first model to apply Rectified Flow to a large-scale T2I model. Key contributions from the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":
1. Logit-Normal Timestep Sampling:
Instead of a uniform distribution, timesteps are sampled using a logit-normal distribution, giving more weight to the middle portion of the trajectory (the most challenging prediction interval).
# SD3's Logit-Normal Timestep Sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
"""Give more weight to middle timesteps"""
u = torch.randn(batch_size) * s + m
t = torch.sigmoid(u)  # range (0, 1)
return t
2. MM-DiT (Multi-Modal Diffusion Transformer):
SD3 introduced a new Transformer architecture that uses separate weights for text and images while enabling bidirectional information flow.
[MM-DiT Block]
Image Tokens Text Tokens
│ │
┌────┴────┐ ┌────┴────┐
│adaLN(t) │ │adaLN(t) │
└────┬────┘ └────┬────┘
│ │
└──────┬──────────────┘
│ (concatenate)
┌──────┴──────┐
│ Joint Self- │ Image-text tokens
│ Attention │ attend to each other
└──────┬──────┘
│ (split)
┌──────┴──────────────┐
│ │
┌────┴────┐ ┌────┴────┐
│ FFN │ │ FFN │
│ (image) │ │ (text) │
└────┬────┘ └────┬────┘
│ │
Image Out Text Out
3. Scaling Laws:
| Model | Blocks | Parameters | Validation Loss |
|---|---|---|---|
| SD3-S | 15 | 450M | Highest |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Lowest (best quality) |
The paper confirms smooth scaling: validation loss decreases steadily as model size and training compute increase.
2.5.4 Flux: Black Forest Labs' Flow Matching Model
Flux (Black Forest Labs, 2024) is a model based on SD3's Rectified Flow + Transformer architecture.
| Variant | Training Method | Inference Steps | Features |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance Distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |
Guidance Distillation: The student model is trained to reproduce, without CFG, the output of a teacher model that uses CFG, eliminating the 2x forward-pass cost of CFG at inference time.
Latent Adversarial Diffusion Distillation (LADD): Combines GAN's adversarial loss with diffusion distillation to enable 1-4 step generation.
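The guidance-distillation objective can be sketched as below. This is a generic formulation under stated assumptions (the student receives the guidance scale w as an extra input so one model covers many scales; Flux's exact training code is not public):

```python
import torch
import torch.nn.functional as F

def guidance_distill_loss(student, teacher, x_t, t, cond, uncond, w):
    # Teacher runs CFG: two forward passes combined with scale w.
    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)
        eps_uncond = teacher(x_t, t, uncond)
        target = eps_uncond + w * (eps_cond - eps_uncond)
    # Student reproduces the guided output in a single forward pass.
    pred = student(x_t, t, cond, w)
    return F.mse_loss(pred, target)
```

At inference the distilled student needs only one forward pass per step instead of two.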
3. Text Conditioning Methodologies
Text Conditioning is the mechanism that injects the meaning of text prompts into the image generation process in T2I models. The choice of text encoder and conditioning method has a decisive impact on generation quality.
3.1 CLIP Text Encoder
OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) is a model contrastively trained on 400 million image-text pairs.
[CLIP Training Process]
Image ──→ Image Encoder ──→ image embedding ─┐
├─ cosine similarity
Text ──→ Text Encoder ──→ text embedding ─┘
Training objective: Increase similarity for matching pairs, decrease for non-matching pairs
(InfoNCE Loss)
Characteristics of CLIP Text Encoder:
- Both token sequence embeddings and [EOS] token pooled embeddings can be utilized
- Maximum 77 token length limit
- Strong at image-text alignment
- Text understanding specialized for visual concepts
| CLIP Variant | Parameters | Models Used |
|---|---|---|
| CLIP ViT-L/14 | ~124M (text) | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M (text) | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M (text) | SDXL (primary) |
| CLIP ViT-L/14 | ~124M (text) | SDXL (secondary) |
3.2 T5 Text Encoder
Google's T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) is a large-scale language model trained on a pure text corpus.
Advantages of T5 (Demonstrated in the Imagen paper):
- Trained on a much larger text corpus than CLIP (C4 dataset)
- Excellent at understanding complex sentence structures and relationships
- Ability to process complex prompts including spatial relationships, quantities, and attribute combinations
- Scaling the text encoder is more effective than scaling the U-Net (a key Imagen finding)
| T5 Variant | Parameters | Models Used |
|---|---|---|
| T5-Small | 60M | Experimental |
| T5-Base | 220M | Experimental |
| T5-Large | 770M | Experimental |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |
3.3 Cross-Attention Mechanism
Cross-attention is the core mechanism that injects text information into image features within the U-Net or DiT.
# Cross-Attention implementation
class CrossAttention(nn.Module):
def __init__(self, d_model, d_context, n_heads):
super().__init__()
self.n_heads = n_heads
self.d_head = d_model // n_heads
self.to_q = nn.Linear(d_model, d_model, bias=False) # latent → Q
self.to_k = nn.Linear(d_context, d_model, bias=False) # text → K
self.to_v = nn.Linear(d_context, d_model, bias=False) # text → V
self.to_out = nn.Linear(d_model, d_model)
def forward(self, x, context):
"""
x: (B, H*W, d_model) - image latent features
context: (B, seq_len, d_context) - text embeddings
"""
B, _, d_model = x.shape
q = self.to_q(x) # image provides the Query
k = self.to_k(context) # text provides the Key
v = self.to_v(context) # text provides the Value
# Multi-head reshape
q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
# Attention
attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
attn = F.softmax(attn, dim=-1)
out = attn @ v
out = out.transpose(1, 2).reshape(B, -1, d_model)
return self.to_out(out)
3.4 Pooled Text Embeddings vs Sequence Embeddings
Modern T2I models simultaneously utilize two types of text embeddings:
[Text Embedding Types]
Text: "a photo of a cat"
│
┌────┴────┐
│ Text │
│ Encoder │
└────┬────┘
│
┌────┴──────────────────────┐
│ │
▼ ▼
Sequence Embeddings Pooled Embedding
(token-level) (sentence-level)
[h_1, h_2, ..., h_n] h_pool = h_[EOS]
Shape: (seq_len, d) Shape: (d,)
│ │
│ │
▼ ▼
Used in Cross-Attention          Used for global conditioning
(fine-grained per-token info)    (whole-sentence meaning)
                                 - Added to the timestep embedding
                                 - Modulates adaLN parameters
                                 - Vector conditioning
Dual text encoder usage in SDXL:
# SDXL Text Conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
# CLIP ViT-L: sequence embeddings (77, 768)
clip_l_output = clip_l(text)
clip_l_seq = clip_l_output.last_hidden_state # (77, 768)
clip_l_pooled = clip_l_output.pooler_output # (768,)
# OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
clip_g_output = clip_g(text)
clip_g_seq = clip_g_output.last_hidden_state # (77, 1280)
clip_g_pooled = clip_g_output.pooler_output # (1280,)
# Concatenate sequence embeddings -> used for Cross-Attention
text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1) # (77, 2048)
# Concatenate pooled embeddings -> used for Vector conditioning
pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1) # (2048,)
return text_embeddings, pooled_embeddings
SD3 and Flux additionally combine T5-XXL sequence embeddings, using a triple text encoder configuration:
| Encoder | Role | Output Shape | Use Case |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
4. Training Datasets
The quality of T2I models directly depends on the scale and quality of training data. Here is a summary of major large-scale datasets.
4.1 Comparison of Major Datasets
| Dataset | Scale | Source | Filtering Method | Main Models Used |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Web alt-text | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Web alt-text | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawling | Top 10% CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image | Research |
| SAM | 11M images | Various sources | Manual + model-based | Segmentation + T2I |
| Internal (Proprietary) | Billions of pairs | Proprietary | Proprietary | DALL-E 3, Midjourney |
4.2 LAION-5B Filtering Pipeline
LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:
[LAION-5B Data Collection and Filtering Pipeline]
Common Crawl (web archive)
│
▼
1. HTML parsing: extract src URL + alt-text from <img> tags
│
▼
2. Image download (img2dataset)
- Minimum resolution filter: width, height ≥ 64
- Maximum aspect ratio: 3:1
│
▼
3. CLIP similarity filtering
- Compute image-text similarity with OpenAI CLIP ViT-B/32
- English: cosine similarity ≥ 0.28
- Other languages: cosine similarity ≥ 0.26
│
▼
4. Safety filtering
- NSFW detection score (CLIP-based)
- Watermark detection score
- Toxic content detection
│
▼
5. Deduplication
- Hash-based exact duplicate removal
- CLIP embedding-based near-duplicate removal
│
▼
Final: 5.85B image-text pairs
- 2.32B English
- 2.26B in 100+ other languages
- 1.27B with undetected language
4.3 Data Quality Assessment
The latest models tend to focus on data quality over data quantity:
1. CLIP Score-Based Filtering:
# CLIP Score computation
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
clip_score = outputs.logits_per_image.item() / 100.0 # normalized
2. Aesthetic Score Filtering:
LAION-Aesthetics is a subset built by training a separate aesthetic predictor (CLIP embedding → MLP → score) and keeping only images whose predicted aesthetic score is 4.5 or higher.
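Such a predictor can be sketched as an MLP head over precomputed CLIP image embeddings; the layer sizes below are illustrative, not those of the released LAION checkpoint:

```python
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    # MLP regression head: CLIP image embedding -> scalar aesthetic score.
    # Trained on human aesthetic ratings, then used to filter LAION-5B
    # down to the LAION-Aesthetics subset (score >= 4.5).
    def __init__(self, embed_dim=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, clip_embed):
        # clip_embed: (B, embed_dim) -> (B,) scores
        return self.head(clip_embed).squeeze(-1)
```

Because the CLIP embeddings are precomputed once, scoring billions of images this way is cheap.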
3. Caption Quality Improvement (DALL-E 3's Core Innovation):
DALL-E 3 (Betker et al., 2023) achieved dramatic performance improvement through caption quality improvement alone without any architecture changes:
- Train a dedicated image captioning model to generate detailed synthetic captions
- Train with 95% synthetic captions + 5% original captions
- Comparison experiments of three types: short synthetic, detailed synthetic, and human annotation
- Detailed synthetic captions are overwhelmingly superior
[DALL-E 3 Caption Improvement Effect]
Before: "cat on table"
-> Vague and lacks detail
After: "A fluffy orange tabby cat sitting on a round wooden
dining table, natural sunlight streaming through a
window behind, casting soft shadows. The cat has
bright green eyes and is looking directly at the camera."
-> Includes detailed attributes, spatial relationships, and lighting information
4.4 Data Preprocessing Techniques
| Preprocessing Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop center of image to square | Resolution standardization |
| Random Crop | Random position crop | Data augmentation |
| Bucket Sampling | Group images with similar aspect ratios | Multi-aspect ratio training (SDXL) |
| Caption Dropout | Replace caption with empty string at a certain probability | CFG training support |
| Multi-resolution | Progressive learning from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Random shuffle of tag order | Reduced text order bias |
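Bucket sampling from the table above can be sketched like this (the bucket list and helper name are illustrative): each image is assigned to the bucket whose aspect ratio is nearest its own, so every batch shares one resolution and destructive square-cropping is avoided.

```python
from collections import defaultdict

def assign_buckets(image_sizes,
                   buckets=((1024, 1024), (1152, 896), (896, 1152))):
    # image_sizes: list of (width, height) pairs.
    # Returns {bucket_resolution: [image indices]}; a dataloader then
    # draws each batch from a single bucket.
    assignment = defaultdict(list)
    for idx, (w, h) in enumerate(image_sizes):
        ratio = w / h
        best = min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))
        assignment[best].append(idx)
    return dict(assignment)
```

SDXL-style training uses many more buckets spanning roughly 0.5 to 2.0 aspect ratios, but the grouping logic is the same.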
5. Fine-tuning & Customization Techniques
Fine-tuning techniques that adapt pretrained T2I models to specific styles, subjects, and control conditions are essential for practical applications.
5.1 LoRA (Low-Rank Adaptation)
LoRA by Hu et al. (2022) is an efficient method for fine-tuning large model weights, and is also extensively used in T2I models.
[LoRA Principle]
Original weights: W_0 ∈ R^{d×k} (frozen)
LoRA update: ΔW = B × A where A ∈ R^{r×k}, B ∈ R^{d×r}
Final output: h = W_0 x + ΔW x = W_0 x + B(Ax)
- r << min(d, k): low-rank (typically 4, 8, 16, 32, 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- Original weights stay frozen → memory-efficient
# LoRA application example (Stable Diffusion U-Net attention layer)
class LoRALinear(nn.Module):
def __init__(self, original_layer, rank=4, alpha=1.0):
super().__init__()
self.original = original_layer # frozen
in_features = original_layer.in_features
out_features = original_layer.out_features
# LoRA layers
self.lora_A = nn.Linear(in_features, rank, bias=False)
self.lora_B = nn.Linear(rank, out_features, bias=False)
self.scale = alpha / rank
# Initialization
nn.init.kaiming_uniform_(self.lora_A.weight)
nn.init.zeros_(self.lora_B.weight) # Initialize B to 0 -> identical to original at start
def forward(self, x):
original_out = self.original(x) # Frozen original output
lora_out = self.lora_B(self.lora_A(x)) # LoRA update
return original_out + self.scale * lora_out
LoRA Training Configuration (Diffusers-based):
# Diffusers LoRA training execution example
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--dataset_name="lambdalabs/naruto-blip-captions" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--rank=4 \
--mixed_precision="fp16" \
--output_dir="./sdxl-naruto-lora"
| LoRA Parameter | Typical Range | Impact |
|---|---|---|
| Rank (r) | 4-128 | Higher values increase expressiveness and memory |
| Alpha (α) | Same as rank, up to 2x | Learning rate scaling |
| Target Modules | attn Q,K,V,O + FFN | Application scope |
| Learning Rate | 1e-4 ~ 1e-5 | Convergence speed |
| Training Time | 5-30 min (single GPU) | Enables fast iteration |
| File Size | 1-200 MB | Easy to share and distribute |
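Because ΔW = B × A is an ordinary weight delta, a trained LoRA can be folded back into the base weights so inference needs no extra layers; a minimal sketch (the function name is illustrative):

```python
import torch

def merge_lora(W0, A, B, alpha, rank):
    # W0: (out, in) frozen base weight; A: (rank, in); B: (out, rank).
    # Returns the merged weight W0 + (alpha / rank) * B @ A, after which
    # the LoRA layers can be dropped entirely.
    return W0 + (alpha / rank) * (B @ A)
```

This is why LoRA adds zero inference latency once merged, unlike adapter methods that keep extra modules in the forward pass.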
5.2 DreamBooth
DreamBooth by Ruiz et al. (2023) is a technique for injecting the concept of a specific subject into a model using 3-5 images.
[DreamBooth Training Process]
Input: 3-5 images of a specific subject + unique identifier "[V]"
Example: "a [V] dog" (specific dog)
Training strategy:
1. Fine-tune model with subject images
- "a [V] dog" → 해당 강아지 이미지
2. Prior Preservation Loss (Key!)
- Pre-generate "a dog" images with the original model
- Preserve general dog generation capability during fine-tuning
- Prevent language drift
L = L_recon([V] images) + λ * L_prior(class images)
# DreamBooth + LoRA training (recommended combination)
# Based on diffusers library
accelerate launch train_dreambooth_lora.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--instance_data_dir="./my_dog_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_dog_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--max_train_steps=500 \
--rank=4 \
--mixed_precision="fp16"
5.3 Textual Inversion
Textual Inversion by Gal et al. (2023) is a method that learns only new token embeddings without modifying any model weights.
[Textual Inversion]
Existing token space: [cat] [dog] [car] [tree] ...
│
Add new token: [S*] New concept to learn
│
Training: Optimize only the embedding vector of [S*] with 3-5 images
Entire rest of model is frozen
Advantage: Minimal parameters (1 token = 768 or 1024 floats)
Disadvantage: Less expressive than LoRA/DreamBooth
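In code, Textual Inversion amounts to optimizing one new row of the token-embedding table while everything else stays frozen; a minimal sketch (vocabulary size, embedding dim, and helper name are illustrative):

```python
import torch
import torch.nn as nn

# Frozen token embedding table (e.g., ~49k x 768 for SD 1.x's CLIP),
# extended with one extra slot for the new token [S*].
embedding = nn.Embedding(49409, 768)
embedding.requires_grad_(False)

new_token_id = 49408  # id assigned to [S*]
star_embedding = embedding.weight[new_token_id].clone().requires_grad_(True)
optimizer = torch.optim.AdamW([star_embedding], lr=5e-3)

def embed_tokens(token_ids):
    # Substitute the single trainable vector wherever [S*] appears;
    # gradients flow only into star_embedding.
    out = embedding(token_ids).clone()
    out[token_ids == new_token_id] = star_embedding
    return out
```

The diffusion loss is then backpropagated through the frozen U-Net and text encoder into this one vector.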
5.4 ControlNet
ControlNet by Zhang et al. (2023) is a method for adding structural conditions (edges, depth, pose, etc.) to pretrained diffusion models.
[ControlNet Architecture]
Control Input (e.g., Canny edge)
│
┌─────┴─────┐
│ Zero │
│ Conv │
└─────┬─────┘
│
┌─────┴─────┐
Locked U-Net │ Trainable │ Copy of U-Net Encoder
(original, frozen) │ Copy of │ (trainable)
│ │ U-Net Enc │
│ └─────┬─────┘
│ │
│ ┌─────┴─────┐
│ │ Zero │ Output is 0 at training start
│ │ Conv │ (starts without affecting original model)
│ └─────┬─────┘
│ │
└─────── + ───────────┘ Add to original U-Net features
│
Final Output
ControlNet's Core Training Technique - Zero Convolution:
# Zero Convolution: Initialize weights and biases to 0
class ZeroConv(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
self.conv = nn.Conv2d(in_channels, out_channels, 1)
nn.init.zeros_(self.conv.weight)
nn.init.zeros_(self.conv.bias)
def forward(self, x):
return self.conv(x)
# Training start: zero conv output = 0
# -> Adding ControlNet doesn't affect original model output
# -> Gradually reflects control signal as training progresses
| Condition Type | Input | Use Case |
|---|---|---|
| Canny Edge | Edge map | Contour-based generation |
| Depth | Depth map | 3D structure preservation |
| OpenPose | Joint positions | Human pose control |
| Semantic Segmentation | Segmentation map | Layout control |
| Scribble | Scribble | Rough composition |
| Normal Map | Surface normal map | 3D shape control |
| Tile | Low-resolution/tile | Super-resolution |
5.5 IP-Adapter
IP-Adapter (Image Prompt Adapter) by Ye et al. (2023) is an adapter that uses images as prompts to transfer style or subjects.
[IP-Adapter Architecture]
Reference Image ──→ CLIP Image Encoder ──→ image features
│
┌─────┴─────┐
│ Projection │ Trainable
│ Layer │
└─────┬─────┘
│
┌─────┴─────┐
│ Decoupled │ Separate cross-attention
│ Cross-Attn │ (separated from text cross-attn)
└─────┬─────┘
│
Original U-Net Cross-Attention ────── + ───────┘
(text conditioning)
Output = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
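That formula, as a minimal sketch using PyTorch's scaled-dot-product attention (projection layers omitted; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, ip_scale=1.0):
    # q: (B, heads, N, d) image queries. Text and image K/V come from
    # separate projections; the two attention results are summed, with
    # ip_scale (λ) controlling image-prompt strength.
    text_out = F.scaled_dot_product_attention(q, k_text, v_text)
    img_out = F.scaled_dot_product_attention(q, k_img, v_img)
    return text_out + ip_scale * img_out
```

Setting `ip_scale=0` recovers the original text-only model, which is why IP-Adapter can be toggled at inference without retraining.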
5.6 Comparison of Fine-tuning Techniques
| Technique | Modified Target | Training Images | Training Time | File Size | Main Use Case |
|---|---|---|---|---|---|
| LoRA | Attention weights (low-rank) | Tens to thousands | 5-30 min | 1-200MB | Styles, concepts |
| DreamBooth | Full model or + LoRA | 3-10 | 5-60 min | 2-7GB (full) or 1-200MB (LoRA) | Specific subjects |
| Textual Inversion | Token embeddings only | 3-10 | 30 min to several hours | A few KB | Simple concepts |
| ControlNet | U-Net Encoder copy | Tens to hundreds of thousands | Several days | ~1.5GB | Structural control |
| IP-Adapter | Projection + Cross-Attn | Large-scale | Several days | ~100MB | Image prompting |
6. Latest Trends (2024-2026)
6.1 Consistency Models
Consistency Models (Song et al., 2023) are a method for reducing the multi-step generation of diffusion models to one step or a few steps.
[Consistency Models Key Idea]
Diffusion: x_T → x_{T-1} → ... → x_1 → x_0 (hundreds of steps)
Consistency:
Train so that every point x_t on the PF-ODE trajectory
maps to the same x_0
f_θ(x_t, t) = x_0 ∀t ∈ [0, T]
Boundary condition: f_θ(x_0, 0) = x_0 (anchors self-consistency to the data)
x_T ────→ f_θ ────→ x_0
│ ↑
x_t ────→ f_θ ───────┘ (maps to the same x_0!)
│ ↑
x_t' ────→ f_θ ───────┘
Two Training Methods:
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Consistency Distillation (CD) | Requires a pretrained diffusion model; simulates the PF-ODE | High quality | Needs a teacher model |
| Consistency Training (CT) | Trains directly on real data | No teacher needed | Somewhat lower quality than CD |
Performance:
- CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
- ImageNet 64x64: FID 6.20 (1-step)
Follow-up research, Improved Consistency Training (iCT) and Latent Consistency Models (LCM), applied this to large-scale T2I models, enabling 2-4 step generation at the SDXL level.
6.2 The Spread of DiT (Diffusion Transformer) Architecture
Since 2024, DiT has been replacing U-Net to become the mainstream backbone for T2I:
| Model | Year | Backbone | Parameters | Key Features |
|---|---|---|---|---|
| DiT (original) | 2023 | Transformer | 675M | Class-conditional, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, low-cost training |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K resolution, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, triple text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation variants |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | Chinese+English bilingual |
| Lumina-T2X | 2024 | DiT | Various | Multi-modal generation |
6.3 PixArt-alpha and PixArt-sigma
PixArt-alpha (Chen et al., 2023) is a pioneering model for efficient DiT training:
Core innovation - Training Decomposition:
[PixArt-alpha 3-Stage Training]
Stage 1: Pixel Dependency Learning (low cost)
- Start from an ImageNet-pretrained DiT
- Foundation for the class-conditional → T2I transition
Stage 2: Text-Image Alignment Learning
- Inject text conditions via cross-attention
- Use high-quality synthetic captions generated with LLaVA
Stage 3: High-Quality Aesthetic Learning
- Fine-tune on high-quality aesthetic datasets
- Uses JourneyDB and similar sources
Total training cost: ~675 A100 GPU days
(10.8% of SD 1.5's ~6,250 A100 GPU days)
Improvements in PixArt-sigma (Chen et al., 2024):
- Weak-to-Strong Training: Enhanced training with higher quality data based on PixArt-alpha
- KV Compression: Compress Key and Value in Attention for improved efficiency, enabling 4K resolution
- Comparable performance to SDXL (2.6B) with only 0.6B parameters
6.4 Comparison of SDXL, SD3, and Flux
[Stable Diffusion Lineage by Generation]
SD 1.x (2022) SDXL (2023) SD3 (2024) Flux (2024)
│ │ │ │
U-Net 860M U-Net 2.6B MM-DiT 2-8B MM-DiT ~12B
│ │ │ │
CLIP ViT-L CLIP-L + CLIP-L + CLIP-L +
OpenCLIP-G OpenCLIP-G + OpenCLIP-G +
T5-XXL T5-XXL
│ │ │ │
Diffusion Diffusion Rectified Rectified
(DDPM) (DDPM) Flow Flow
│ │ │ │
512x512 1024x1024 1024x1024 1024x1024+
│ │ │ │
CFG 7.5 CFG 5-9 CFG 3.5-7 Guidance
Distillation
6.5 Training Innovations of DALL-E 3
The core innovation of DALL-E 3 (Betker et al., 2023) lies in improving training data caption quality:
- Image Captioner Training: Separately train a CoCa-based image captioning model
- Synthetic Caption Generation: Re-label entire training data with detailed synthetic captions
- Caption Mixing: Train with 95% synthetic + 5% original captions
- Descriptive vs Short: detailed descriptive captions outperform short tag-style captions
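The 95/5 caption mixing step amounts to a per-sample Bernoulli draw. A minimal sketch (the function name is ours, not from the paper):

```python
import random

def pick_caption(synthetic, original, p_synthetic=0.95):
    """DALL-E 3-style caption mixing: use the detailed synthetic caption
    most of the time, but keep a small share of original alt-text so the
    model still handles short, noisy prompts at inference."""
    return synthetic if random.random() < p_synthetic else original
```

Keeping 5% original captions is what prevents the model from overfitting to the captioner's verbose style.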
6.6 Three Key Insights of Playground v2.5
Playground v2.5 (Li et al., 2024) surpassed DALL-E 3 and Midjourney 5.2 through training strategy improvements based on the SDXL architecture:
1. EDM Noise Schedule Adoption:
# EDM Framework (Karras et al., 2022)
# sigma(t)-based noise schedule: guarantees Zero Terminal SNR
# Greatly improves color/contrast over SD's original linear schedule
def edm_precondition(sigma, x_noisy, F_theta):
"""EDM Preconditioning"""
c_skip = 1.0 / (sigma ** 2 + 1)
c_out = sigma / (sigma ** 2 + 1).sqrt()
c_in = 1.0 / (sigma ** 2 + 1).sqrt()
c_noise = sigma.log() / 4
D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
return D_x
2. Multi-Aspect Ratio Training:
- Bucketed dataset: group images with similar aspect ratios into the same batch
- Supports various aspect ratios during training (1:1, 4:3, 16:9, etc.)
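Bucketing can be sketched as assigning each image to the predefined resolution bucket with the nearest aspect ratio; batches are then drawn within a single bucket so every sample in a batch shares one shape. The bucket shapes below are illustrative SDXL-style resolutions, not an official list:

```python
def assign_buckets(sizes, buckets=((1024, 1024), (1152, 896), (896, 1152))):
    """Assign each image to the bucket whose aspect ratio is closest.

    sizes: list of (width, height) tuples.
    Returns a dict mapping bucket shape -> list of image indices.
    """
    groups = {b: [] for b in buckets}
    for i, (w, h) in enumerate(sizes):
        ar = w / h
        # Pick the bucket minimizing the aspect-ratio difference
        best = min(buckets, key=lambda b: abs(b[0] / b[1] - ar))
        groups[best].append(i)
    return groups
```

A batch sampler then iterates over one bucket at a time, resizing/cropping each image to its bucket's resolution.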
3. Human Preference Alignment:
- Training strategy utilizing human preference data
- Maximize aesthetic quality through quality-tuning
7. Practical Training Pipeline Guide
7.1 Training Infrastructure
GPU Requirements
| Training Scale | Recommended GPU | VRAM (total) | Training Duration | Cost (Estimated) |
|---|---|---|---|---|
| LoRA Fine-tuning | 1x RTX 3090/4090 | 24GB | 5-30 min | < $1 |
| DreamBooth | 1x A100 40GB | 40GB | 30-60 min | $2-5 |
| ControlNet training | 4-8x A100 80GB | 320-640GB | 2-5 days | $500-2,000 |
| SD 1.5-scale training | 256x A100 80GB | ~20TB | 24 days | ~$150K |
| SDXL-scale training | 512-1024x A100 80GB | ~40-80TB | Weeks | ~$500K-1M |
| SD3/Flux-scale training | 1024+ H100 80GB | ~80TB+ | Weeks-months | > $1M |
Distributed Training Strategy
[Large-Scale Distributed Training Configuration]
┌─────────────────────────────────────────────────────┐
│ Data Parallel (DP/DDP) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Full │ │Full │ │Full │ │Full │ │
│ │Model │ │Model │ │Model │ │Model │ │
│ │Copy │ │Copy │ │Copy │ │Copy │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ Batch 1 Batch 2 Batch 3 Batch 4 │
│ │
│ -> Synchronize gradients with All-Reduce │
│ -> Different data batches on each GPU │
└─────────────────────────────────────────────────────┘
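The All-Reduce step can be illustrated without a GPU cluster: each replica contributes its local gradient, and every replica receives the same mean. A toy sketch (real training uses `torch.distributed` with NCCL, not this function):

```python
import torch

def all_reduce_mean(local_grads):
    """Toy All-Reduce: average the per-replica gradients and hand every
    replica an identical copy, so each model copy applies the same update
    and the replicas stay in sync."""
    mean = torch.stack(local_grads).mean(dim=0)
    return [mean.clone() for _ in local_grads]
```

After this step every "GPU" steps its optimizer with the identical averaged gradient, which is what keeps the full model copies synchronized in DDP.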
┌─────────────────────────────────────────────────────┐
│ FSDP (Fully Sharded Data Parallel) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Shard │ │Shard │ │Shard │ │Shard │ │
│ │ 1/4 │ │ 2/4 │ │ 3/4 │ │ 4/4 │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ -> Shard model parameters across GPUs │
│ -> All-Gather only needed shards during Forward/Backward │
│ -> Maximize memory efficiency (enables 8B+ model training) │
└─────────────────────────────────────────────────────┘
7.2 Representative Training Framework: Diffusers
HuggingFace's Diffusers library is the de facto standard for T2I model training.
# Diffusers-based Text-to-Image full training pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F
# 1. Load models
vae = AutoencoderKL.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)
# 2. Freeze VAE and Text Encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
# 3. Accelerator setup (distributed training + Mixed Precision)
accelerator = Accelerator(
mixed_precision="fp16", # or "bf16"
gradient_accumulation_steps=4,
)
# 4. Optimizer
optimizer = torch.optim.AdamW(
unet.parameters(),
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=1e-2,
eps=1e-8,
)
# 5. EMA setup
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
unet.parameters(),
decay=0.9999,
use_ema_warmup=True,
)
# 6. Prepare for distributed training
# (dataloader is assumed to be built beforehand from an image-caption dataset)
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)
# 7. Training loop
for epoch in range(num_epochs):
for batch in dataloader:
with accelerator.accumulate(unet):
images = batch["images"]
captions = batch["captions"]
# Latent encoding
with torch.no_grad():
latents = vae.encode(images).latent_dist.sample()
latents = latents * vae.config.scaling_factor
# Text encoding
with torch.no_grad():
text_inputs = tokenizer(captions, padding="max_length",
max_length=77, return_tensors="pt")
text_embeds = text_encoder(text_inputs.input_ids)[0]
# Add noise
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
# Classifier-Free Guidance: random caption dropout
if torch.rand(1) < 0.1: # unconditional with 10% probability
text_embeds = torch.zeros_like(text_embeds)
# Predict noise
noise_pred = unet(noisy_latents, timesteps, text_embeds).sample
# Loss computation
loss = F.mse_loss(noise_pred, noise)
# Backward
accelerator.backward(loss)
accelerator.clip_grad_norm_(unet.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
# EMA update
ema_unet.step(unet.parameters())
7.3 Mixed Precision Training
Mixed Precision is a technique that improves memory and computational efficiency by combining FP32 and FP16/BF16.
[Mixed Precision Training]
Forward Pass:
- Model weights: FP16/BF16 (half memory)
- Activation: FP16/BF16
Loss Scaling:
- Multiply loss by a large scale (e.g., 2^16) to prevent gradient underflow
- Scale down gradient again after backward
Backward Pass:
- Gradient: FP16/BF16
Optimizer Step:
- Master Weights: FP32 (maintain precision!)
- Update FP32 master weights then create FP16 copy
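The loss-scaling logic above maps directly onto torch's `GradScaler` API. A minimal sketch with a tiny linear model standing in for the U-Net (on CPU the scaler is disabled and acts as a pass-through, so this mainly exercises the API shape):

```python
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 16)
# Forward pass runs in half precision under autocast
with torch.autocast("cuda" if use_cuda else "cpu",
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()  # scale the loss up before backward
scaler.step(optimizer)         # unscales gradients, skips the step on inf/nan
scaler.update()                # adapts the scale factor for the next step
```

Note that BF16 usually skips loss scaling entirely: its exponent range matches FP32, which is why it is the recommended choice on A100/H100.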
| Precision | Memory | Compute Speed | Numerical Stability | Recommended Use |
|---|---|---|---|---|
| FP32 | 4 bytes | Baseline | Highest | Optimizer state |
| FP16 | 2 bytes | ~2x | Low (overflow risk) | Forward/Backward |
| BF16 | 2 bytes | ~2x | High (wide range) | Recommended on A100/H100 |
| TF32 | 4 bytes (storage) | ~1.5x | High | A100 default |
# BF16 Mixed Precision config (accelerate-based)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
7.4 EMA (Exponential Moving Average)
EMA is a technique that maintains a moving average of model weights during training to achieve more stable results during inference. It is used in nearly all T2I model training.
[EMA Update]
θ_ema ← λ * θ_ema + (1 - λ) * θ_model
- λ: decay rate (typically 0.9999 ~ 0.99999)
- θ_model: current model weights being trained
- θ_ema: EMA weights (used at inference)
- Effect: smooths out gradient noise, yielding more stable weights
# Diffusers EMA implementation
from diffusers.training_utils import EMAModel
# Create EMA model
ema_model = EMAModel(
unet.parameters(),
decay=0.9999, # decay rate
use_ema_warmup=True, # use warmup
inv_gamma=1.0, # warmup parameter
power=3/4, # warmup parameter
)
# Update at every training step
ema_model.step(unet.parameters())
# Apply EMA weights at inference
ema_model.copy_to(unet.parameters())
# Or temporarily swap in EMA weights with store/copy_to/restore
ema_model.store(unet.parameters())    # back up current training weights
ema_model.copy_to(unet.parameters())  # load EMA weights
output = unet(noisy_latents, timesteps, text_embeds)
ema_model.restore(unet.parameters())  # put the training weights back
7.5 Training Hyperparameter Guide
| Hyperparameter | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (total) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | Original resolution |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
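The warmup schedule in the table can be sketched with `LambdaLR`: linearly ramp the learning rate over the first 10,000 steps, then hold it constant (parameter values below are illustrative):

```python
import torch

warmup_steps = 10_000

def warmup_factor(step):
    # Linear ramp from ~0 to 1 over warmup_steps, then constant 1.0
    return min(1.0, (step + 1) / warmup_steps)

params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_factor)
```

In the training loop, call `scheduler.step()` once after each `optimizer.step()`; the effective LR is the table's base LR times the warmup factor.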
8. Key Paper References
8.1 Core Paper Table
| # | Paper Title | Authors | Year | Key Contribution | Link |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN framework proposal | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector Quantized discrete latent space | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based generator, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | Large-scale GAN training, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | Hierarchical VQ-VAE, high-resolution generation | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Practical training of Diffusion models | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP contrastive learning, image-text alignment | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + Autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent Diffusion, Cross-Attention conditioning | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP-based 2-stage Diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | Effect of the T5-XXL text encoder, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG training technique, joint unconditional-conditional training | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, 20B scaling | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning Technique | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | Systematic Diffusion design space analysis, Preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | Personalization via new token embedding learning | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | Subject personalization with few images, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | Adds structural control (edge, depth, pose) | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step generation, PF-ODE consistency learning | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | Large U-Net, Dual Text Encoder, Multi-AR training | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE-based Flow Matching framework | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, Optimal Transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | Image prompt adapter, Decoupled Cross-Attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | Efficient autoregressive T2I, Retrieval Augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | Low-cost DiT training, training decomposition strategy | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Dramatic quality improvement via synthetic captions | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-Strong training, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, large-scale Rectified Flow, Logit-Normal sampling | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM Noise Schedule, Multi-AR, Human Preference | arXiv:2402.17245 |
8.2 Additional Reference Papers
| Paper Title | Year | Key |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 5.85 billion open image-text dataset |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, speed improvement |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | Inference acceleration via progressive distillation |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step generation |
| Latent Consistency Models | 2024 | LCM, SDXL-based few-step generation |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL generation |
| Stable Cascade | 2024 | Wuerstchen-based 3-stage hierarchical generation |
9. Conclusion and Outlook
Text-to-Image model training methodologies started from GAN's adversarial training, passed through Diffusion's iterative denoising, and are now converging on a new paradigm of Flow Matching + DiT.
Key Trend Summary
[T2I Training Methodology Evolution]
Efficiency: Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
(months, $1M+) (minutes, less than $1) (seconds)
Architecture: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
(SD 1.x-SDXL) (DiT, PixArt) (SD3, Flux)
Generation speed: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
(DDPM) (DDIM, DPM++) (LCM, LADD, CM)
Data quality: Web crawling ──→ Filtering ──→ Synthetic Caption
(LAION raw) (aesthetic) (DALL-E 3 style)
Text understanding: CLIP only ──→ CLIP + T5 ──→ Triple Encoder
(SD 1.x) (Imagen) (SD3, Flux)
Future Outlook
Maximizing training efficiency: As demonstrated by PixArt-alpha, the trend of reducing training costs to 1/10 or less while maintaining quality will continue.
Data-Centric AI approach: As DALL-E 3 demonstrated, data quality and captioning are becoming more important than architecture.
Few-Step / One-Step generation: Distillation techniques such as Consistency Models, LCM, and LADD will continue to advance, making real-time generation the standard.
Unified Multi-Modal Generation: Expanding to models that integrate not only text-to-image but also video, 3D, and audio.
Advanced Personalization: Beyond LoRA, DreamBooth, and IP-Adapter, more accurate subject reproduction with even less data will become possible.
T2I model training methodology has entered an era where the key is not simply "training a larger model with more data," but rather what data, with what schedule, and with what conditioning to train with. We hope the methodologies covered in this article can be used as a foundation for training your own T2I models or effectively customizing existing ones.
References
- HuggingFace Diffusers Documentation
- HuggingFace Diffusers Training Examples
- Awesome Text-to-Image Studies
- Text-to-Image Diffusion Models in Generative AI: A Survey (arXiv:2303.07909)
- Text to Image Generation and Editing: A Survey (arXiv:2505.02527)
- Stability AI Research
- Black Forest Labs