- 1. Introduction: The Evolution of Text-to-Image Generative Models
- 2. Training Methodologies by Core Architecture
- 2.1 GAN-Based: Adversarial Training
- 2.2 VAE-Based: Codebook Learning and Discrete Latent Space
- 2.3 Diffusion-Based: The Core of Current T2I
- 2.3.1 DDPM: Denoising Diffusion Probabilistic Models
- 2.3.2 Noise Scheduling
- 2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion
- 2.3.4 Structure of the U-Net Backbone
- 2.3.5 Classifier-Free Guidance (CFG)
- 2.3.6 DALL-E 2: CLIP-Based Diffusion
- 2.3.7 Imagen: The Power of T5 Text Encoder
- 2.3.8 DiT: Diffusion Transformer
- 2.4 Autoregressive-Based T2I
- 2.5 Flow Matching: The Next-Generation Training Paradigm
- 3. Text Conditioning Methodologies
- 4. Training Datasets
- 5. Fine-tuning & Customization Techniques
- 6. Latest Trends (2024-2026)
- 7. Practical Training Pipeline Guide
- 8. Key Paper References
- 9. Conclusion and Outlook
- References
1. Introduction: The Evolution of Text-to-Image Generative Models
Text-to-Image (T2I) generative models are technologies that produce high-resolution images from natural language text prompts, and have undergone rapid development over the past several years. The trajectory of this field can be broadly divided into four paradigms.
[Text-to-Image Model Evolution Timeline]
| Era | Paradigm | Representative Models | Key Features |
|---|---|---|---|
| 2014-2019 | GAN | StackGAN, AttnGAN, StyleGAN, BigGAN, GigaGAN | Adversarial training; fast generation; mode collapse issues |
| 2017-2020 | VAE/VQ-VAE | VQ-VAE, VQ-VAE-2, dVAE | Discrete latent space; codebook learning; two-stage training |
| 2020-2023 | Diffusion Models | DDPM (2020), DALL-E 2 (2022), Imagen (2022), SD 1.x (2022), SDXL (2023) | Iterative denoising; classifier-free guidance; latent space; U-Net backbone |
| 2023-Present | Flow Matching + DiT | SD3 (2024), Flux (2024), PixArt-Sigma | Straight ODE paths; fewer steps; DiT backbone; scalable |
1.1 Why Training Methodology Matters
The quality of T2I models is critically determined not only by architecture design but also by training methodology. Even with identical architectures, generation quality varies dramatically depending on noise scheduling, conditioning approaches, data quality, and training strategies. A prime example is DALL-E 3, which achieved dramatic performance improvements over its predecessor through caption quality improvement alone without any architecture changes.
This article provides an in-depth, paper-based analysis of core training methodologies for each paradigm, covering practical training pipeline configuration as well.
2. Training Methodologies by Core Architecture
2.1 GAN-Based: Adversarial Training
Generative Adversarial Network (GAN) is a framework where two networks, the Generator and the Discriminator, are trained competitively.
2.1.1 Basic Training Principles
The training objective function of GAN is defined as a minimax game:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated ones
- Training objective: G tries to fool D, while D tries to classify accurately
2.1.2 StyleGAN Training Strategy
StyleGAN (Karras et al., 2019) introduced Progressive Growing and Style-based Generator to enable high-quality image generation.
Core Training Techniques:
| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start from low resolution (4x4) and progressively increase | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Generator Jacobian regularization | Improved generation quality |
| R1 Regularization | Discriminator gradient penalty | Training stabilization |
| Lazy Regularization | Apply regularization every 16 steps instead of every step | Improved training efficiency |
# StyleGAN2 core training loop (simplified)
for real_images, _ in dataloader:
    # 1. Discriminator training
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # R1 Regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        # create_graph=True so the penalty itself is differentiable in backward()
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss = d_loss + 10.0 * r1_penalty

    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Generator training
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_fake = discriminator(fake_images)
    g_loss = F.softplus(-d_fake).mean()
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
2.1.3 Large-Scale Training with BigGAN
BigGAN (Brock et al., 2019) is a model that scaled up GAN to large scale, employing the following training strategies:
- Large-scale batches: Increase batch size up to 2048 for improved training stability and quality
- Class-conditional Batch Normalization: Inject class information into Batch Normalization parameters
- Truncation Trick: Truncate latent distribution at inference to control quality-diversity trade-off
- Orthogonal Regularization: Maintain orthogonality of weight matrices to prevent mode collapse
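The Truncation Trick above can be sketched in a few lines: sample z from a standard normal and resample any component whose magnitude exceeds the truncation threshold. This is a minimal sketch; the threshold and dimensions are illustrative.

```python
import torch

def truncated_noise(batch_size, dim, truncation=0.5):
    """Truncation trick (BigGAN-style sketch): resample any z component whose
    magnitude exceeds `truncation`. A smaller threshold concentrates samples
    near the mode -> higher fidelity, lower diversity."""
    z = torch.randn(batch_size, dim)
    mask = z.abs() > truncation
    while mask.any():
        z[mask] = torch.randn(int(mask.sum()))  # resample only rejected entries
        mask = z.abs() > truncation
    return z

z = truncated_noise(4, 128, truncation=0.5)  # every |z_i| <= 0.5
```

At inference time this z would be fed to the generator in place of an untruncated sample; the threshold directly controls the quality-diversity trade-off.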
2.1.4 Limitations of GAN-Based T2I
GAN-based approaches ceded dominance to Diffusion-based models due to the following fundamental limitations:
- Mode Collapse: Limited generation diversity
- Training Instability: Unstable training sensitive to hyperparameters
- Text Conditioning difficulty: Difficult to accurately reflect complex text prompts
- Scaling limitations: Increased training instability at large scale
2.2 VAE-Based: Codebook Learning and Discrete Latent Space
2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder
VQ-VAE (van den Oord et al., 2017) is an approach that learns a discrete latent space instead of a continuous one.
[VQ-VAE Architecture]
Input Image Encoder Quantization Decoder Reconstructed
(256x256) --> [E(x)] --> z_e --> [Codebook] --> z_q --> [D(z_q)] --> Image
| |
| ┌────┴────┐
| │ e_1 │
| │ e_2 │ K code vectors
└──>│ ... │ (Codebook)
│ e_K │
└─────────┘
z_q = e_k where k = argmin_j ||z_e - e_j||
(quantize to nearest code vector)
VQ-VAE Training Loss Function:
L = ||x - D(z_q)||²          # Reconstruction Loss
  + ||sg[z_e] - e||²         # Codebook Loss (can be replaced by EMA updates)
  + β * ||z_e - sg[e]||²     # Commitment Loss
- sg[·]: stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: selected codebook vector
Since the quantization operation is non-differentiable, the Straight-Through Estimator (STE) is used to pass gradients to the encoder. The codebook itself is updated via Exponential Moving Average (EMA).
# VQ-VAE Codebook core training code
class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        # z_e: (B, D, H, W) -> (B*H*W, D)
        B, D, H, W = z_e.shape
        flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, D)
        # Find nearest codebook vector (squared Euclidean distance)
        distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
                  + (self.embedding.weight ** 2).sum(dim=1) \
                  - 2 * flat_z @ self.embedding.weight.t()
        indices = distances.argmin(dim=1)
        z_q = self.embedding(indices).view(B, H, W, D).permute(0, 3, 1, 2)
        # Loss computation
        codebook_loss = F.mse_loss(z_q, z_e.detach())    # pulls codes toward encoder outputs
        commitment_loss = F.mse_loss(z_q.detach(), z_e)  # pulls encoder outputs toward codes
        loss = codebook_loss + self.commitment_cost * commitment_loss
        # Straight-Through Estimator: copy gradients from z_q back to z_e
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, loss, indices
2.2.2 VQ-VAE-2: Hierarchical Codebook Learning
VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization to significantly improve image quality.
[VQ-VAE-2 Hierarchical Structure]
Top Level (small resolution)
┌─────────────┐
│ 32x32 grid │ Global structure info
│ Codebook │ (composition, overall shape)
└──────┬──────┘
│
Bottom Level (large resolution)
┌──────┴──────┐
│ 64x64 grid │ Fine detail info
│ Codebook │ (textures, edges)
└─────────────┘
The image generation pipeline of VQ-VAE-2 consists of the following two stages:
- Stage 1: Train VQ-VAE-2 to encode images into hierarchical discrete codes
- Stage 2: Learn the prior of discrete codes with autoregressive models such as PixelCNN
This approach directly influenced the dVAE (discrete VAE) used in DALL-E.
2.3 Diffusion-Based: The Core of Current T2I
Diffusion Model is the mainstream paradigm for current T2I generation. It learns a forward process that gradually adds noise to data, and a reverse process that recovers data from noise.
2.3.1 DDPM: Denoising Diffusion Probabilistic Models
DDPM by Ho et al. (2020) is the key paper that elevated Diffusion Models to a practical level.
Forward Process (Diffusion):
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)
- Add a small amount of Gaussian noise at each timestep t
- β_t: noise schedule (typically linear or cosine)
- After T steps, x_T approximately equals N(0, I) (pure Gaussian noise)
Noise can be added directly at any timestep t in closed form:
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)
where ᾱ_t = ∏_{s=1}^{t} α_s, α_t = 1 - β_t
=> x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε, ε ~ N(0, I)
Reverse Process (Denoising):
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)
- Neural network epsilon_theta predicts noise epsilon added to x_t
- Remove predicted noise to recover x_{t-1}
Training Objective (Simple Loss):
L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]
- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε
# DDPM core training loop
def train_step(model, x_0, noise_scheduler):
    batch_size = x_0.shape[0]
    # 1. Random timestep sampling
    t = torch.randint(0, num_timesteps, (batch_size,))
    # 2. Noise sampling
    noise = torch.randn_like(x_0)
    # 3. Forward process: generate x_t
    alpha_bar_t = noise_scheduler.alpha_bar[t].view(-1, 1, 1, 1)  # broadcast over (C, H, W)
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
    # 4. Predict noise
    noise_pred = model(x_t, t)
    # 5. Loss computation (MSE)
    loss = F.mse_loss(noise_pred, noise)
    return loss
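For completeness, the reverse process described above can be sketched as a single denoising step, using the DDPM posterior mean with the simple variance choice σ_t² = β_t. `betas` and `alphas_cumprod` are assumed to be precomputed 1-D tensors from the noise schedule.

```python
import torch

def ddpm_denoise_step(x_t, eps_pred, t, betas, alphas_cumprod):
    """One DDPM reverse step (sketch): form the posterior mean from the
    predicted noise, then add sigma_t * z, except at the final step t = 0."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alphas_cumprod[t]
    # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t)
    mean = (x_t - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
    if t == 0:
        return mean  # no noise added at the last step
    sigma_t = torch.sqrt(beta_t)  # simple choice: sigma_t^2 = beta_t
    return mean + sigma_t * torch.randn_like(x_t)
```

Sampling runs this step from t = T-1 down to 0, starting from pure Gaussian noise.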
2.3.2 Noise Scheduling
The noise schedule determines the amount of noise added at each timestep in the forward process and has a decisive impact on generation quality.
| Schedule | Formula | Features | Models Used |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) * t/T | Simple but noise increases sharply at the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) * π/2) | Smooth transition, excellent information preservation | Improved DDPM |
| Scaled Linear | β_t = (β_min^0.5 + t/T * (β_max^0.5 - β_min^0.5))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12*t/T) | Gradual change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Theoretically near optimal | Playground v2.5, EDM |
| Zero Terminal SNR | Ensures SNR(T) = 0 | Guarantees starting from pure noise | SD3, Flux |
Playground v2.5 (Li et al., 2024) adopted the EDM (Karras et al., 2022) noise schedule, greatly improving color and contrast. A related requirement is Zero Terminal SNR: the Signal-to-Noise Ratio at the final timestep T must be exactly 0 during training, so that what the model sees at T truly matches the pure Gaussian noise it starts from at inference.
# Cosine Schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

# EDM Noise Schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    step_indices = torch.arange(num_steps)
    t_steps = (sigma_max ** (1 / rho) + step_indices / (num_steps - 1)
               * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return t_steps
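The Zero Terminal SNR row in the table above corresponds to rescaling an existing beta schedule rather than defining a new one. A sketch of the commonly used recipe, which shifts and scales √ᾱ so its final value is exactly 0 while preserving the first value:

```python
import torch

def rescale_to_zero_terminal_snr(betas):
    """Rescale a beta schedule so that SNR(T) = 0 (sketch).
    Works on sqrt(alpha_bar): shift so the last value is 0, then scale so
    the first value is unchanged, and convert back to betas."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    sqrt_ab = alphas_bar.sqrt()
    sqrt_ab_0, sqrt_ab_T = sqrt_ab[0].clone(), sqrt_ab[-1].clone()
    sqrt_ab = sqrt_ab - sqrt_ab_T                              # last value -> 0
    sqrt_ab = sqrt_ab * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)    # first value preserved
    alphas_bar = sqrt_ab ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

After rescaling, ᾱ_T = 0, so x_T contains no signal at all and training matches inference exactly.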
2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion
Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically improved computational efficiency by performing diffusion in latent space instead of pixel space. This is the core idea behind Stable Diffusion.
[Latent Diffusion Model Architecture]
Text Prompt
│
┌────┴────┐
│ CLIP │
│ Encoder │
└────┬────┘
│ text embeddings
│
┌──────┐ ┌──────┐ ┌─────┴──────┐ ┌──────┐ ┌──────┐
│Image │ │ VAE │ │ U-Net │ │ VAE │ │Output│
│(512 │--->│Encode│--->│ (Denoising │--->│Decode│--->│Image │
│x512) │ │ r │ │ in Latent │ │ r │ │(512 │
│ │ │ │ │ Space) │ │ │ │x512) │
└──────┘ └──────┘ └────────────┘ └──────┘ └──────┘
│ │
│ 64x64x4 │
│ (8x downsampling) │
└─────────────────────────────────┘
Latent Space (z)
Training: Diffusion in Latent Space
Inference: Random noise z_T -> U-Net Denoising -> VAE Decode -> Image
LDM Training Pipeline:
Stage 1 - Autoencoder Training: Pretrain VAE (KL-regularized) on image datasets
- Encoder: Image x (H x W x 3) -> latent z (H/f x W/f x c), f=8 is typical
- Decoder: latent z -> Reconstructed image
- Loss: Reconstruction + KL Divergence + Perceptual Loss + GAN Loss
Stage 2 - Diffusion Model Training: Diffusion in the latent space of the frozen Autoencoder
- Add noise to latent z_0 = Encoder(x) to generate z_t
- U-Net predicts noise from z_t
- Text conditioning is injected via cross-attention
# Latent Diffusion core training
class LatentDiffusionTrainer:
    def __init__(self, vae, unet, text_encoder, noise_scheduler):
        self.vae = vae                    # Frozen
        self.unet = unet                  # Trainable
        self.text_encoder = text_encoder  # Frozen
        self.noise_scheduler = noise_scheduler

    def train_step(self, images, captions):
        # 1. Latent encoding with VAE (no gradient needed)
        with torch.no_grad():
            latents = self.vae.encode(images).latent_dist.sample()
            latents = latents * self.vae.config.scaling_factor  # 0.18215
        # 2. Text embedding (no gradient needed)
        with torch.no_grad():
            text_embeddings = self.text_encoder(captions)
        # 3. Add noise
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, 1000, (latents.shape[0],))
        noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
        # 4. Predict noise
        noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)
        # 5. MSE loss
        loss = F.mse_loss(noise_pred, noise)
        return loss
2.3.4 Structure of the U-Net Backbone
The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:
[U-Net with Cross-Attention Structure]
Input z_t ─────────────────────────────────────────── Output ε_θ
│ ▲
▼ │
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Down │ │ Down │ │ Down │ │ Up │ │ Up │
│ Block │──│ Block │──│ Block │──┐ │ Block │──│ Block │
│ 64x64 │ │ 32x32 │ │ 16x16 │ │ │ 32x32 │ │ 64x64 │
└────────┘ └────────┘ └────────┘ │ └────────┘ └────────┘
│ │ │ │ ▲ ▲
│ │ │ ▼ │ │
│ │ │ ┌────────┐ │ │
│ │ └──│ Middle │──┘ │
│ │ │ Block │ │
│ │ │ 16x16 │ │
│ └───────────────└────────┘──────────────┘
│ (skip connections)
└────────────────────────────────────────────────────┘
Inside each Block:
┌──────────────────────────────────────┐
│ ResNet Block │
│ ├── GroupNorm → SiLU → Conv │
│ ├── Timestep Embedding injection │
│ └── GroupNorm → SiLU → Conv │
│ │
│ Self-Attention Block │
│ ├── LayerNorm → Self-Attention │
│ └── Skip Connection │
│ │
│ Cross-Attention Block │
│ ├── LayerNorm │
│ ├── Q = Linear(latent features) │
│ ├── K = Linear(text embeddings) │ ← Text Conditioning
│ ├── V = Linear(text embeddings) │
│ └── Attention(Q, K, V) │
│ │
│ Feed-Forward Block │
│ ├── LayerNorm → Linear → GEGLU │
│ └── Linear → Skip Connection │
└──────────────────────────────────────┘
SDXL (Podell et al., 2023) expanded the U-Net by approximately 3x (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and applies improvements including training at various aspect ratios.
| Model | U-Net Params | Text Encoder | Resolution | VAE Downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (1) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (1) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (2) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (1) | 1024x1024 | 8x |
2.3.5 Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) by Ho & Salimans (2022) is a core training technique for modern T2I models.
Problems with Traditional Classifier Guidance:
- Requires training a separate classifier
- Needs a classifier that works on noisy images
- Requires computing classifier gradients during inference
Classifier-Free Guidance Key Idea:
During training, text conditioning is replaced with an empty string ("") with a certain probability (typically 10-20%), so that a single model simultaneously learns both conditional and unconditional generation.
During training:
- with probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅) (unconditional)
- with probability 1 - p_uncond: ε_θ(x_t, t, c) (conditional)
At inference:
ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))
- w: guidance scale (typically 7.5-15)
- w = 1: use the conditional prediction as-is
- w > 1: push further in the direction of the text condition
# Classifier-Free Guidance training implementation
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
    noise = torch.randn_like(x_0)
    t = torch.randint(0, T, (x_0.shape[0],))
    x_t = add_noise(x_0, noise, t)
    # Randomly drop conditioning
    mask = torch.rand(x_0.shape[0]) < p_uncond
    cond = text_cond.clone()
    cond[mask] = empty_text_embedding  # null conditioning
    noise_pred = model(x_t, t, cond)
    loss = F.mse_loss(noise_pred, noise)
    return loss

# Classifier-Free Guidance inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
    x_t = x_T
    for t in reversed(range(T)):
        # Unconditional prediction
        eps_uncond = model(x_t, t, empty_text_embedding)
        # Conditional prediction
        eps_cond = model(x_t, t, text_cond)
        # Guided prediction
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x_t = denoise_step(x_t, eps, t)
    return x_t
CFG dramatically improves generation quality and text fidelity, but if the guidance scale is too high, images become oversaturated or artifacts appear.
2.3.6 DALL-E 2: CLIP-Based Diffusion
DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture leveraging the CLIP embedding space.
[DALL-E 2 Training Pipeline]
Text ──→ CLIP Text Encoder ──→ text embedding
│
┌─────┴─────┐
│ Prior │ text emb → CLIP image emb
│ (Diffusion)│
└─────┬─────┘
│ CLIP image embedding
┌─────┴─────┐
│ Decoder │ CLIP image emb → 64x64 image
│ (Diffusion)│
└─────┬─────┘
│ 64x64
┌─────┴─────┐
│ Super-Res │ 64x64 → 256x256 → 1024x1024
│ (Diffusion)│
└─────────── ┘
2.3.7 Imagen: The Power of T5 Text Encoder
Google's Imagen (Saharia et al., 2022) maximized text understanding by using the T5-XXL (4.6B parameter) text encoder.
Key findings:
- Scaling the text encoder is more effective than scaling the U-Net
- T5-XXL outperforms CLIP ViT-L in text understanding and resulting image quality
- Dynamic Thresholding: stable generation even at high CFG scales
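Dynamic thresholding can be sketched in a few lines. This is a minimal version of the idea from the Imagen paper: pick a high percentile s of the absolute predicted pixel values per sample, and if s > 1, clamp to [-s, s] and divide by s instead of statically clipping to [-1, 1]. The percentile value is the paper's default; the rest is illustrative.

```python
import torch

def dynamic_threshold(x_0_pred, percentile=0.995):
    """Imagen-style dynamic thresholding (sketch). Applied to the predicted
    x_0 at each sampling step; prevents the saturation that static clipping
    causes at high CFG scales."""
    B = x_0_pred.shape[0]
    flat = x_0_pred.reshape(B, -1).abs()
    s = torch.quantile(flat, percentile, dim=1)       # per-sample threshold
    s = torch.clamp(s, min=1.0).view(B, 1, 1, 1)      # never shrink below 1
    return torch.maximum(torch.minimum(x_0_pred, s), -s) / s
```

The result is always in [-1, 1], but the scaling is adaptive: well-behaved samples pass through unchanged (s = 1), while oversaturated ones are rescaled.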
[Imagen Architecture]
Text ──→ T5-XXL (frozen) ──→ text embeddings
│
┌─────┴─────┐
│ Base Model │ generates 64x64
│ (U-Net) │ cross-attention
└─────┬─────┘
│
┌─────┴─────┐
│ SR Model 1 │ 64x64 → 256x256
│ (U-Net) │
└─────┬─────┘
│
┌─────┴─────┐
│ SR Model 2 │ 256x256 → 1024x1024
│ (U-Net) │
└─────────── ┘
2.3.8 DiT: Diffusion Transformer
DiT (Diffusion Transformer) by Peebles & Xie (2023) is an architecture that replaces U-Net with Transformer, and is becoming the mainstream for recent T2I models.
[DiT Block Structure]
Input Tokens (patchified latent)
│
┌──────┴──────┐
│ LayerNorm │ ← adaLN-Zero (adaptive)
│ (adaptive) │ γ, β = MLP(timestep + class)
└──────┬──────┘
│
┌──────┴──────┐
│ Self- │
│ Attention │
└──────┬──────┘
│ (+ residual)
┌──────┴──────┐
│ LayerNorm │ ← adaLN-Zero
│ (adaptive) │
└──────┬──────┘
│
┌──────┴──────┐
│ Pointwise │
│ FFN │
└──────┬──────┘
│ (+ residual, scaled by α)
▼
Output Tokens
Key Design Decisions of DiT:
- Patchify: Split latent into p x p patches then linear projection to token sequence
- adaLN-Zero: Adaptive Layer Normalization, injecting timestep and class information into LN parameters
- Scaling: Systematic scaling law verification by model size (depth, width)
| DiT Variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
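The Patchify step in the list above can be sketched as follows; patch_size=2 matches the "/2" in the variant names (e.g., a 32x32x4 latent becomes 256 tokens). DiT then applies a learned linear projection from p·p·C to the model width, omitted here.

```python
import torch

def patchify(latent, patch_size=2):
    """DiT-style patchify (sketch): split a latent (B, C, H, W) into
    non-overlapping p x p patches and flatten each into one token,
    giving a (B, (H/p)*(W/p), p*p*C) sequence for the Transformer."""
    B, C, H, W = latent.shape
    p = patch_size
    tokens = latent.reshape(B, C, H // p, p, W // p, p)
    tokens = tokens.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
    return tokens

tokens = patchify(torch.randn(1, 4, 32, 32), patch_size=2)  # shape (1, 256, 16)
```

Halving the patch size quadruples the token count, which is why the GFLOPs column above is so sensitive to the "/p" suffix.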
2.4 Autoregressive-Based T2I
2.4.1 DALL-E (Original): Token-Based Autoregressive Generation
DALL-E (Ramesh et al., 2021) converts images into discrete tokens, then concatenates text tokens and image tokens into a single sequence to learn the joint distribution with an autoregressive Transformer.
[DALL-E Training Pipeline]
Stage 1: dVAE training
Image (256x256) ──→ dVAE Encoder ──→ 32x32 grid of tokens (8192 vocabulary)
──→ dVAE Decoder ──→ Reconstructed Image
Loss: Reconstruction + KL Divergence (Gumbel-Softmax relaxation)
Stage 2: Autoregressive Transformer training
[BPE text tokens (256)] + [Image tokens (1024)] = 1280 tokens
Transformer (12B params):
- 64 layers, 62 attention heads
- Training objective: next-token prediction (cross-entropy)
- Text tokens use causal attention (left → right)
- Image tokens are generated autoregressively in row-major order
- Image tokens attend to all text tokens (text-to-image conditioning within the attention stack)
2.4.2 Parti: Encoder-Decoder Based
Google's Parti (Yu et al., 2022) formulated T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an Encoder-Decoder Transformer.
Key features:
- ViT-VQGAN: Vision Transformer-based image tokenizer
- Encoder-Decoder: Uses Encoder for text encoding, Decoder for image token generation
- Scaling: Systematic scale-up from 350M to 3B to 20B parameters
- Achieves quality comparable to Imagen at the 20B model
2.4.3 CM3Leon: Efficient Multimodal Autoregressive
Meta's CM3Leon (Yu et al., 2023) significantly improved the efficiency of the autoregressive approach:
- Retrieval-Augmented Training: Retrieve related image-text pairs during training and add to context
- Decoder-Only: Pure decoder-only architecture unlike Parti
- Instruction Tuning: Supervised fine-tuning for various tasks
- 5x less training cost: Reduces training compute by 1/5 for comparable performance
- Achieves MS-COCO zero-shot FID of 4.88
2.5 Flow Matching: The Next-Generation Training Paradigm
2.5.1 Basic Principles of Flow Matching
Flow Matching (Lipman et al., 2023) learns a straight path from noise distribution to data distribution through a deterministic ODE (Ordinary Differential Equation) instead of Diffusion's stochastic process.
[Diffusion vs Flow Matching Comparison]
Diffusion (stochastic):                   Flow Matching (deterministic):
  x_0 ~~~> x_T (curved path,                x_0 ──────> x_1 (straight path,
  requires many steps)                      fewer steps possible)
  dx = f(x,t)dt + g(t)dW                    dx/dt = v_θ(x_t, t)
  (SDE-based)                               (ODE-based, learns a velocity field)
Flow Matching Training Objective:
L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]
where:
x_t = (1 - t) * x_0 + t * x_1   (linear interpolation)
u_t = x_1 - x_0                 (target velocity: straight-line path)
t ~ Uniform(0, 1)               (or logit-normal)
x_0 ~ p_data                    (real data)
x_1 ~ N(0, I)                   (Gaussian noise)
2.5.2 Rectified Flow
Rectified Flow (Liu et al., 2023, ICLR 2023 Spotlight) is a key variant of Flow Matching that connects noise-data pairs in straight lines from an Optimal Transport perspective.
Key idea:
- 1-Rectified Flow: Randomly pair data x_0 and noise x_1 to learn straight paths
- 2-Rectified Flow (Reflow): Re-straighten pairs generated by 1-Rectified Flow to make paths closer to straight lines
- Distillation: Distill the straightened model into a 1-step model
# Rectified Flow core training
def rectified_flow_train_step(model, x_0, x_1=None):
    """
    x_0: real data (latent)
    x_1: noise (randomly sampled if None)
    """
    if x_1 is None:
        x_1 = torch.randn_like(x_0)
    # Time sampling (logit-normal for SD3)
    t = torch.sigmoid(torch.randn(x_0.shape[0]))  # logit-normal
    t = t.view(-1, 1, 1, 1)
    # Linear interpolation
    x_t = (1 - t) * x_0 + t * x_1
    # Target velocity (straight-line direction)
    target_v = x_1 - x_0
    # Velocity prediction
    v_pred = model(x_t, t)
    # Loss
    loss = F.mse_loss(v_pred, target_v)
    return loss
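The Reflow step of 2-Rectified Flow described earlier regenerates training pairs by integrating the learned ODE from noise back to data, so that each noise sample is coupled with the data point its trajectory actually reaches. A minimal Euler-integration sketch, where `model(x_t, t)` is assumed to return the predicted velocity:

```python
import torch

@torch.no_grad()
def reflow_pairs(model, x_1, num_steps=50):
    """Reflow (2-Rectified Flow) pair generation, sketched with Euler steps:
    start from noise x_1 at t=1 and follow the learned velocity field down
    to t=0, yielding a coupled (x_0_hat, x_1) pair for re-training."""
    x_t = x_1
    dt = 1.0 / num_steps
    for i in range(num_steps, 0, -1):
        t = torch.full((x_1.shape[0], 1, 1, 1), i / num_steps)
        x_t = x_t - dt * model(x_t, t)  # move from t toward t - dt
    return x_t, x_1  # new data-noise pair, connected by a straighter path
```

Training on these regenerated pairs straightens the trajectories further, which is what makes subsequent few-step or one-step distillation possible.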
2.5.3 Flow Matching in Stable Diffusion 3
SD3 (Esser et al., 2024) is the first model to apply Rectified Flow to a large-scale T2I model. Key contributions from the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":
1. Logit-Normal Timestep Sampling:
Instead of a uniform distribution, timesteps are sampled using a logit-normal distribution, giving more weight to the middle portion of the trajectory (the most challenging prediction interval).
# SD3's Logit-Normal Timestep Sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
    """Give more weight to middle timesteps"""
    u = torch.randn(batch_size) * s + m
    t = torch.sigmoid(u)  # values in (0, 1)
    return t
2. MM-DiT (Multi-Modal Diffusion Transformer):
SD3 introduced a new Transformer architecture that uses separate weights for text and images while enabling bidirectional information flow.
[MM-DiT Block]
Image Tokens Text Tokens
│ │
┌────┴────┐ ┌────┴────┐
│adaLN(t) │ │adaLN(t) │
└────┬────┘ └────┬────┘
│ │
└──────┬──────────────┘
│ (concatenate)
┌──────┴──────┐
│ Joint Self- │ Image-text tokens
│ Attention │ attend to each other
└──────┬──────┘
│ (split)
┌──────┴──────────────┐
│ │
┌────┴────┐ ┌────┴────┐
│ FFN │ │ FFN │
│ (image) │ │ (text) │
└────┬────┘ └────┬────┘
│ │
Image Out Text Out
3. Scaling Laws:
| Model | Blocks | Parameters | Validation Loss |
|---|---|---|---|
| SD3-S | 15 | 450M | Highest |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Lowest (best performance) |
Smooth scaling was confirmed where validation loss steadily decreases as model size and training steps increase.
2.5.4 Flux: Black Forest Labs' Flow Matching Model
Flux (Black Forest Labs, 2024) is a model based on SD3's Rectified Flow + Transformer architecture.
| Variant | Training Method | Inference Steps | Features |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance Distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |
Guidance Distillation: The Student model is trained to reproduce the output of the Teacher model (using CFG) without CFG, eliminating CFG computation (2x forward pass) at inference time.
Latent Adversarial Diffusion Distillation (LADD): Combines GAN's adversarial loss with diffusion distillation to enable 1-4 step generation.
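The guidance distillation objective described above can be sketched as follows. This is a generic formulation, not Flux's (unpublished) exact recipe: the function signatures are illustrative, and the student is assumed to take the guidance scale w as an extra input so a single forward pass reproduces the teacher's two-pass CFG output.

```python
import torch
import torch.nn.functional as F

def guidance_distillation_loss(student, teacher, x_t, t, cond, uncond, w):
    """Guidance distillation (sketch): the student is trained to match the
    teacher's CFG-combined prediction in one forward pass, removing the
    2x-forward-pass cost of CFG at inference time."""
    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)
        eps_uncond = teacher(x_t, t, uncond)
        target = eps_uncond + w * (eps_cond - eps_uncond)  # teacher's guided output
    eps_student = student(x_t, t, cond, w)  # single pass, w as conditioning
    return F.mse_loss(eps_student, target)
```

During training, w is typically sampled from the range of guidance scales expected at inference, so the distilled model generalizes across them.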
3. Text Conditioning Methodologies
Text Conditioning is the mechanism that injects the meaning of text prompts into the image generation process in T2I models. The choice of text encoder and conditioning method has a decisive impact on generation quality.
3.1 CLIP Text Encoder
OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) is a model contrastively trained on 400 million image-text pairs.
[CLIP Training Process]
Image ──→ Image Encoder ──→ image embedding ─┐
├─ cosine similarity
Text ──→ Text Encoder ──→ text embedding ─┘
Training objective: Increase similarity for matching pairs, decrease for non-matching pairs
(InfoNCE Loss)
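The symmetric InfoNCE objective above can be sketched in a few lines. This is a minimal version: the real CLIP uses a learnable logit scale, which is fixed to a constant temperature here for simplicity.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (sketch of CLIP's objective): normalize both
    embeddings, build the cosine-similarity matrix, and apply cross-entropy
    against the diagonal (matching pairs) in both directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(image_emb.shape[0])        # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return (loss_i2t + loss_t2i) / 2
```

Each in-batch non-matching pair acts as a negative, which is why CLIP training benefits from very large batch sizes.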
Characteristics of CLIP Text Encoder:
- Both token sequence embeddings and [EOS] token pooled embeddings can be utilized
- Maximum 77 token length limit
- Strong at image-text alignment
- Text understanding specialized for visual concepts
| CLIP Variant | Parameters | Models Used |
|---|---|---|
| CLIP ViT-L/14 | ~124M (text) | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M (text) | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M (text) | SDXL (primary) |
| CLIP ViT-L/14 | ~124M (text) | SDXL (secondary) |
3.2 T5 Text Encoder
Google's T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) is a large-scale language model trained on a pure text corpus.
Advantages of T5 (Demonstrated in the Imagen paper):
- Trained on a much larger text corpus than CLIP (C4 dataset)
- Excellent at understanding complex sentence structures and relationships
- Ability to process complex prompts including spatial relationships, quantities, and attribute combinations
- Scaling the text encoder is more effective than scaling the U-Net (a key finding of Imagen)
| T5 Variant | Parameters | Models Used |
|---|---|---|
| T5-Small | 60M | Experimental |
| T5-Base | 220M | Experimental |
| T5-Large | 770M | Experimental |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |
3.3 Cross-Attention Mechanism
Cross-attention is the core mechanism that injects text information into image features within the U-Net or DiT.
# Cross-Attention implementation
class CrossAttention(nn.Module):
    def __init__(self, d_model, d_context, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)    # latent → Q
        self.to_k = nn.Linear(d_context, d_model, bias=False)  # text → K
        self.to_v = nn.Linear(d_context, d_model, bias=False)  # text → V
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: (B, H*W, d_model) - image latent features
        context: (B, seq_len, d_context) - text embeddings
        """
        B, _, d_model = x.shape
        q = self.to_q(x)        # image provides the Query
        k = self.to_k(context)  # text provides the Key
        v = self.to_v(context)  # text provides the Value
        # Multi-head reshape
        q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        attn = F.softmax(attn, dim=-1)
        out = attn @ v
        out = out.transpose(1, 2).reshape(B, -1, d_model)
        return self.to_out(out)
3.4 Pooled Text Embeddings vs Sequence Embeddings
Modern T2I models simultaneously utilize two types of text embeddings:
[Text Embedding Types]
Text: "a photo of a cat"
│
┌────┴────┐
│ Text │
│ Encoder │
└────┬────┘
│
┌────┴──────────────────────┐
│ │
▼ ▼
Sequence Embeddings Pooled Embedding
(token-level) (sentence-level)
[h_1, h_2, ..., h_n] h_pool = h_[EOS]
Shape: (seq_len, d) Shape: (d,)
│ │
│ │
▼ ▼
Used in Cross-Attention            Used for global conditioning
(fine-grained per-token info)      (whole-sentence semantics)
                                   - Added to the timestep embedding
                                   - Modulates adaLN parameters
                                   - Vector conditioning
Dual text encoder usage in SDXL:
# SDXL Text Conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
    # CLIP ViT-L: sequence embeddings (77, 768)
    clip_l_output = clip_l(text)
    clip_l_seq = clip_l_output.last_hidden_state  # (77, 768)
    clip_l_pooled = clip_l_output.pooler_output   # (768,)
    # OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
    clip_g_output = clip_g(text)
    clip_g_seq = clip_g_output.last_hidden_state  # (77, 1280)
    clip_g_pooled = clip_g_output.pooler_output   # (1280,)
    # Concatenate sequence embeddings -> used for Cross-Attention
    text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1)  # (77, 2048)
    # Concatenate pooled embeddings -> used for Vector conditioning
    pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # (2048,)
    return text_embeddings, pooled_embeddings
SD3 and Flux additionally combine T5-XXL sequence embeddings, using a triple text encoder configuration:
| Encoder | Role | Output Shape | Use Case |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
4. Training Datasets
The quality of T2I models directly depends on the scale and quality of training data. Here is a summary of major large-scale datasets.
4.1 Comparison of Major Datasets
| Dataset | Scale | Source | Filtering Method | Used By |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Google search | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Google search | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawling | Top 10% CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image | Research |
| SAM | 11M images | Various sources | Manual + model-based | Segmentation + T2I |
| Internal (Proprietary) | Billions of pairs | Proprietary | Proprietary | DALL-E 3, Midjourney |
4.2 LAION-5B Filtering Pipeline
LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:
[LAION-5B Data Collection and Filtering Pipeline]
Common Crawl (web archive)
│
▼
1. HTML parsing: extract src URL + alt-text from <img> tags
│
▼
2. Image download (img2dataset)
- Minimum resolution filter: width, height ≥ 64
- Maximum aspect ratio: 3:1
│
▼
3. CLIP similarity filtering
- Compute image-text similarity with OpenAI CLIP ViT-B/32
- English: cosine similarity ≥ 0.28
- Other languages: cosine similarity ≥ 0.26
│
▼
4. Safety filtering
- NSFW detection score (CLIP-based)
- Watermark detection score
- Toxic content detection
│
▼
5. Deduplication
- Hash-based exact-duplicate removal
- CLIP-embedding-based near-duplicate removal
│
▼
Final: 5.85 billion image-text pairs
- 2.32B English
- 2.26B in 100+ other languages
- 1.27B language not identified
4.3 Data Quality Assessment
The latest models tend to focus on data quality over data quantity:
1. CLIP Score-Based Filtering:
# CLIP Score computation (assumes `caption` is a str and `image` a PIL.Image)
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
# Divide out CLIP's learned logit scale (~100) to approximate cosine similarity
clip_score = outputs.logits_per_image.item() / 100.0
2. Aesthetic Score Filtering:
LAION-Aesthetics is a subset built by training a separate aesthetic predictor (an MLP that maps CLIP embeddings to a score) and keeping only images with an aesthetic score of 4.5 or higher.
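The predictor-plus-threshold pipeline can be sketched as follows. This is a minimal illustration, not the released LAION checkpoint: the layer sizes and normalization are assumptions.

```python
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    """Illustrative MLP scorer on CLIP image embeddings (sizes are assumptions)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),  # single scalar aesthetic score
        )

    def forward(self, clip_embed):
        # L2-normalize the CLIP embedding before scoring
        clip_embed = clip_embed / clip_embed.norm(dim=-1, keepdim=True)
        return self.mlp(clip_embed).squeeze(-1)

def filter_by_aesthetics(embeds, predictor, threshold=4.5):
    """Keep only samples scoring above the threshold (LAION-Aesthetics uses 4.5)."""
    with torch.no_grad():
        scores = predictor(embeds)
    return scores > threshold
```

In practice the predictor is trained on human aesthetic ratings and then applied offline to the full dataset's precomputed CLIP embeddings.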
3. Caption Quality Improvement (DALL-E 3's Core Innovation):
DALL-E 3 (Betker et al., 2023) achieved dramatic performance improvement through caption quality improvement alone without any architecture changes:
- Train a dedicated image captioning model to generate detailed synthetic captions
- Train with 95% synthetic captions + 5% original captions
- Comparison experiments of three types: short synthetic, detailed synthetic, and human annotation
- Detailed synthetic captions are overwhelmingly superior
[DALL-E 3 Caption Improvement Effect]
Before: "cat on table"
-> Vague and lacks detail
After: "A fluffy orange tabby cat sitting on a round wooden
dining table, natural sunlight streaming through a
window behind, casting soft shadows. The cat has
bright green eyes and is looking directly at the camera."
-> Includes detailed attributes, spatial relationships, and lighting information
4.4 Data Preprocessing Techniques
| Preprocessing Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop center of image to square | Resolution standardization |
| Random Crop | Random position crop | Data augmentation |
| Bucket Sampling | Group images with similar aspect ratios | Multi-aspect ratio training (SDXL) |
| Caption Dropout | Replace caption with empty string at a certain probability | CFG training support |
| Multi-resolution | Progressive learning from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Random shuffle of tag order | Reduced text order bias |
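The bucket-sampling row above can be made concrete with a small sketch. The bucket list is illustrative (roughly 1024^2-pixel resolutions, in the spirit of SDXL-style multi-aspect training); the matching rule compares log aspect ratios.

```python
import math
from collections import defaultdict

# Illustrative bucket resolutions (~1024^2 pixels each)
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832),
           (832, 1216), (1344, 768), (768, 1344)]

def nearest_bucket(width, height):
    """Assign an image to the bucket whose aspect ratio is closest to its own."""
    ar = width / height
    return min(BUCKETS, key=lambda b: abs(math.log(ar) - math.log(b[0] / b[1])))

def build_buckets(image_sizes):
    """Group image indices by bucket so each batch has a uniform resolution."""
    groups = defaultdict(list)
    for idx, (w, h) in enumerate(image_sizes):
        groups[nearest_bucket(w, h)].append(idx)
    return groups
```

During training, each batch is then drawn from a single bucket, so images can be resized and cropped to that bucket's resolution without extreme distortion.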
5. Fine-tuning & Customization Techniques
Fine-tuning techniques that adapt pretrained T2I models to specific styles, subjects, and control conditions are essential for practical applications.
5.1 LoRA (Low-Rank Adaptation)
LoRA (Hu et al., 2022) is a parameter-efficient method for fine-tuning large models, and is also used extensively with T2I models.
[LoRA Principle]
Original weights: W_0 ∈ R^{d×k} (frozen)
LoRA update: ΔW = B × A where A ∈ R^{r×k}, B ∈ R^{d×r}
Final output: h = W_0 x + ΔW x = W_0 x + B(Ax)
- r << min(d, k): low rank (typically 4, 8, 16, 32, 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- Original weights stay frozen → memory-efficient
# LoRA application example (Stable Diffusion U-Net attention layer)
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1.0):
        super().__init__()
        self.original = original_layer  # frozen
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        # LoRA layers
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank
        # Initialization
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)  # Initialize B to 0 -> identical to original at start

    def forward(self, x):
        original_out = self.original(x)          # Frozen original output
        lora_out = self.lora_B(self.lora_A(x))   # LoRA update
        return original_out + self.scale * lora_out
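For deployment, a trained low-rank update is usually merged back into the frozen base weight so inference carries zero extra cost; a minimal sketch:

```python
import torch

def merge_lora(W0, A, B, alpha, rank):
    """Fold a LoRA update into the frozen base weight.

    W0: (d, k) frozen weight, A: (rank, k), B: (d, rank).
    After merging, h = W x reproduces W0 x + (alpha/rank) * B(Ax) exactly.
    """
    return W0 + (alpha / rank) * (B @ A)
```

This also explains why LoRA files are tiny to share: only A and B are stored, and the merge is performed at load time.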
LoRA Training Configuration (Diffusers-based):
# Diffusers LoRA training execution example
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--dataset_name="lambdalabs/naruto-blip-captions" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--rank=4 \
--mixed_precision="fp16" \
--output_dir="./sdxl-naruto-lora"
| LoRA Parameter | Typical Range | Impact |
|---|---|---|
| Rank (r) | 4-128 | Higher values increase expressiveness and memory |
| Alpha (α) | equal to rank ~ 2x rank | Learning-rate scaling |
| Target Modules | attn Q,K,V,O + FFN | Application scope |
| Learning Rate | 1e-4 ~ 1e-5 | Convergence speed |
| Training Time | 5-30 min (single GPU) | Enables fast iteration |
| File Size | 1-200 MB | Easy to share and distribute |
5.2 DreamBooth
DreamBooth by Ruiz et al. (2023) is a technique for injecting the concept of a specific subject into a model using 3-5 images.
[DreamBooth Training Process]
Input: 3-5 images of a specific subject + unique identifier "[V]"
Example: "a [V] dog" (specific dog)
Training strategy:
1. Fine-tune model with subject images
- "a [V] dog" → images of that specific dog
2. Prior Preservation Loss (Key!)
- Pre-generate "a dog" images with the original model
- Preserve general dog generation capability during fine-tuning
- Prevent language drift
L = L_recon([V] images) + λ * L_prior(class images)
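The combined objective above can be sketched directly: both terms are the standard diffusion noise-prediction MSE, one on the subject ("instance") batch and one on the pre-generated class batch (the function name is illustrative).

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(noise_pred_inst, noise_inst,
                    noise_pred_prior, noise_prior, prior_weight=1.0):
    """Sketch of the DreamBooth objective with prior preservation.

    noise_pred_inst / noise_inst: model prediction and target noise for
    subject images; the *_prior pair is for the class-prior images.
    """
    instance_loss = F.mse_loss(noise_pred_inst, noise_inst)   # learn the subject
    prior_loss = F.mse_loss(noise_pred_prior, noise_prior)    # keep class knowledge
    return instance_loss + prior_weight * prior_loss
```

Setting `prior_weight` to 0 recovers naive fine-tuning, which is exactly the regime where language drift and class forgetting occur.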
# DreamBooth + LoRA training (recommended combination)
# Based on diffusers library
accelerate launch train_dreambooth_lora.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--instance_data_dir="./my_dog_images" \
--instance_prompt="a photo of sks dog" \
--class_data_dir="./class_dog_images" \
--class_prompt="a photo of dog" \
--with_prior_preservation \
--prior_loss_weight=1.0 \
--num_class_images=200 \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--max_train_steps=500 \
--rank=4 \
--mixed_precision="fp16"
5.3 Textual Inversion
Textual Inversion by Gal et al. (2023) is a method that learns only new token embeddings without modifying any model weights.
[Textual Inversion]
Existing token space: [cat] [dog] [car] [tree] ...
│
Add new token: [S*] New concept to learn
│
Training: Optimize only the embedding vector of [S*] with 3-5 images
Entire rest of model is frozen
Advantage: Minimal parameters (1 token = 768 or 1024 floats)
Disadvantage: Less expressive than LoRA/DreamBooth
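The "optimize only one embedding vector" idea can be sketched as follows; the table size, embedding width, and injection mechanism are illustrative, not any particular implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch: a frozen embedding table plus one trainable concept vector
vocab_size, embed_dim = 1000, 768        # illustrative sizes
embeddings = nn.Embedding(vocab_size + 1, embed_dim)
embeddings.weight.requires_grad_(False)  # freeze the entire table ...
new_token_id = vocab_size                # ... the id assigned to [S*]

# The only trainable parameters: one vector for the new token
concept_vector = nn.Parameter(torch.randn(embed_dim) * 0.01)

def embed_tokens(token_ids):
    """Frozen lookup, with the trainable [S*] vector injected where it occurs."""
    base = embeddings(token_ids)                      # (T, 768), no grad
    mask = (token_ids == new_token_id).unsqueeze(-1)  # (T, 1)
    return torch.where(mask, concept_vector, base)    # grad flows only to [S*]
```

Backpropagating the usual diffusion loss through this lookup updates only `concept_vector`; the frozen table and the rest of the model receive no gradients.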
5.4 ControlNet
ControlNet by Zhang & Agrawala (2023) is a method for adding structural conditions (edge, depth, pose, etc.) to pretrained diffusion models.
[ControlNet Architecture]
Control Input (e.g., Canny edge)
│
┌─────┴─────┐
│ Zero │
│ Conv │
└─────┬─────┘
│
┌─────┴─────┐
Locked U-Net │ Trainable │ Copy of U-Net Encoder
(original, frozen) │ Copy of │ (trainable)
│ │ U-Net Enc │
│ └─────┬─────┘
│ │
│ ┌─────┴─────┐
│ │ Zero │ Output is 0 at training start
│ │ Conv │ (starts without affecting original model)
│ └─────┬─────┘
│ │
└─────── + ───────────┘ Add to original U-Net features
│
Final Output
ControlNet's Core Training Technique - Zero Convolution:
# Zero Convolution: Initialize weights and biases to 0
import torch.nn as nn

class ZeroConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 1)  # 1x1 conv
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

# Training start: zero conv output = 0
# -> Adding ControlNet doesn't affect original model output
# -> Gradually reflects control signal as training progresses
| Condition Type | Input | Use Case |
|---|---|---|
| Canny Edge | Edge map | Contour-based generation |
| Depth | Depth map | 3D structure preservation |
| OpenPose | Joint positions | Human pose control |
| Semantic Segmentation | Segmentation map | Layout control |
| Scribble | Scribble | Rough composition |
| Normal Map | Surface normal map | 3D shape control |
| Tile | Low-resolution/tile | Super-resolution |
5.5 IP-Adapter
IP-Adapter (Image Prompt Adapter) by Ye et al. (2023) is an adapter that uses images as prompts to transfer style or subjects.
[IP-Adapter Architecture]
Reference Image ──→ CLIP Image Encoder ──→ image features
│
┌─────┴─────┐
│ Projection │ Trainable
│ Layer │
└─────┬─────┘
│
┌─────┴─────┐
│ Decoupled │ Separate cross-attention
│ Cross-Attn │ (separated from text cross-attn)
└─────┬─────┘
│
Original U-Net Cross-Attention ────── + ───────┘
(text conditioning)
Output = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
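The decoupled cross-attention formula above can be sketched with PyTorch's built-in attention kernel (torch >= 2.0); head splitting and the trainable image K/V projections are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, ip_scale=1.0):
    """Sketch of IP-Adapter's decoupled cross-attention.

    q: (B, T_q, D). Text K/V come from the original (frozen) projections;
    image K/V come from the new trainable projections of CLIP image features.
    """
    text_out = F.scaled_dot_product_attention(q, k_text, v_text)   # original branch
    image_out = F.scaled_dot_product_attention(q, k_img, v_img)    # adapter branch
    return text_out + ip_scale * image_out
```

Because the two branches are summed, `ip_scale = 0` exactly recovers the original text-only model, which is what makes the adapter safe to bolt onto a frozen U-Net.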
5.6 Comparison of Fine-tuning Techniques
| Technique | Modified Target | Training Images | Training Time | File Size | Use Case |
|---|---|---|---|---|---|
| LoRA | Attention weights (low-rank) | Tens to thousands | 5-30 min | 1-200MB | Style, concepts |
| DreamBooth | Full model or + LoRA | 3-10 | 5-60 min | 2-7GB (full) or 1-200MB (LoRA) | Specific subject |
| Textual Inversion | Token embeddings only | 3-10 | 30 min to several hours | Few KB | Simple concepts |
| ControlNet | U-Net Encoder copy | Tens to hundreds of thousands | Several days | ~1.5GB | Structural control |
| IP-Adapter | Projection + Cross-Attn | Large-scale | Several days | ~100MB | Image prompting |
6. Latest Trends (2024-2026)
6.1 Consistency Models
Consistency Models (Song et al., 2023) reduce the multi-step generation of diffusion models to one step or a few steps.
[Consistency Models Key Idea]
Diffusion: x_T → x_{T-1} → ... → x_1 → x_0 (hundreds of steps)
Consistency:
Every point x_t on a PF-ODE trajectory is trained
to map to the same x_0
f_θ(x_t, t) = x_0 ∀t ∈ [0, T]
Key constraint: f_θ(x_0, 0) = x_0 (self-consistency)
x_T ────→ f_θ ────→ x_0
│ ↑
x_t ────→ f_θ ───────┘ (maps to the same x_0!)
│ ↑
x_t' ────→ f_θ ───────┘
Two Training Methods:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Consistency Distillation (CD) | Requires a pretrained diffusion model; simulates the PF-ODE | Higher quality | Needs a teacher model |
| Consistency Training (CT) | Trains directly on real data | No teacher required | Somewhat lower quality than CD |
Performance:
- CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
- ImageNet 64x64: FID 6.20 (1-step)
Follow-up work such as Improved Consistency Training (iCT) and Latent Consistency Models (LCM) applied this to large-scale T2I models, enabling 2-4 step generation at SDXL quality.
6.2 The Spread of DiT (Diffusion Transformer) Architecture
Since 2024, DiT has been replacing U-Net to become the mainstream backbone for T2I:
| Model | Year | Backbone | Parameters | Key Features |
|---|---|---|---|---|
| DiT (original) | 2023 | Transformer | 675M | Class-conditional, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, low-cost training |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K resolution, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, triple text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation variant |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | Chinese+English bilingual |
| Lumina-T2X | 2024 | DiT | Various | Multi-modal generation |
6.3 PixArt-alpha and PixArt-sigma
PixArt-alpha (Chen et al., 2023) is a pioneering model for efficient DiT training:
Core innovation - Training Decomposition:
[PixArt-alpha 3-Stage Training]
Stage 1: Pixel dependency learning (low cost)
- Start from an ImageNet-pretrained DiT
- Foundation for the class-conditional → T2I transition
Stage 2: Text-image alignment learning
- Inject text conditioning via cross-attention
- Use high-quality synthetic captions generated with LLaVA
Stage 3: High-quality aesthetic learning
- Fine-tune on high-quality aesthetic datasets
- Uses JourneyDB and similar sources
Total training cost: ~675 A100 GPU days
(10.8% of SD 1.5's ~6,250 A100 GPU days)
Improvements in PixArt-sigma (Chen et al., 2024):
- Weak-to-Strong Training: Enhanced training with higher quality data based on PixArt-alpha
- KV Compression: Compress Key and Value in Attention for improved efficiency, enabling 4K resolution
- Comparable performance to SDXL (2.6B) with only 0.6B parameters
6.4 Comparison of SDXL, SD3, and Flux
[Stable Diffusion Lineage by Generation]
SD 1.x (2022) SDXL (2023) SD3 (2024) Flux (2024)
│ │ │ │
U-Net 860M U-Net 2.6B MM-DiT 2-8B MM-DiT ~12B
│ │ │ │
CLIP ViT-L CLIP-L + CLIP-L + CLIP-L +
OpenCLIP-G OpenCLIP-G + OpenCLIP-G +
T5-XXL T5-XXL
│ │ │ │
Diffusion Diffusion Rectified Rectified
(DDPM) (DDPM) Flow Flow
│ │ │ │
512x512 1024x1024 1024x1024 1024x1024+
│ │ │ │
CFG 7.5 CFG 5-9 CFG 3.5-7 Guidance
Distillation
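The Rectified Flow objective that SD3 and Flux train with (in place of DDPM noise prediction) can be sketched as follows; the model call signature is an assumption, and the timestep distribution here is uniform where SD3 actually uses logit-normal sampling.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, text_embeds):
    """Sketch of the Rectified Flow training objective.

    Points are sampled on the straight path x_t = (1 - t) * x0 + t * noise,
    and the model regresses the constant velocity v = noise - x0.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)  # uniform; SD3 uses logit-normal
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over non-batch dims
    x_t = (1 - t_) * x0 + t_ * noise
    v_target = noise - x0
    v_pred = model(x_t, t, text_embeds)            # hypothetical signature
    return F.mse_loss(v_pred, v_target)
```

Because the target velocity is constant along each straight path, a well-trained model can integrate the ODE accurately in very few steps, which is what enables the low step counts in the diagram above.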
6.5 Training Innovations of DALL-E 3
The core innovation of DALL-E 3 (Betker et al., 2023) lies in improving training data caption quality:
- Image Captioner Training: Separately train a CoCa-based image captioning model
- Synthetic Caption Generation: Re-label entire training data with detailed synthetic captions
- Caption Mixing: Train with 95% synthetic + 5% original captions
- Descriptive vs Short: detailed descriptive captions outperform short tag-style captions
6.6 Three Key Insights of Playground v2.5
Playground v2.5 (Li et al., 2024) outperformed DALL-E 3 and Midjourney 5.2 in human preference evaluations through training-strategy improvements on the SDXL architecture:
1. EDM Noise Schedule Adoption:
# EDM Framework (Karras et al., 2022)
# σ(t)-based noise schedule - guarantees zero terminal SNR
# Greatly improves color/contrast over SD's original linear schedule
def edm_precondition(sigma, x_noisy, F_theta):
    """EDM preconditioning (with sigma_data = 1); sigma is a tensor."""
    c_skip = 1.0 / (sigma ** 2 + 1)
    c_out = sigma / (sigma ** 2 + 1).sqrt()
    c_in = 1.0 / (sigma ** 2 + 1).sqrt()
    c_noise = sigma.log() / 4
    D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
    return D_x
2. Multi-Aspect Ratio Training:
- Bucketed dataset: group images with similar aspect ratios into the same batch
- Supports various aspect ratios during training (1:1, 4:3, 16:9, etc.)
3. Human Preference Alignment:
- Training strategy utilizing human preference data
- Maximize aesthetic quality through quality-tuning
7. Practical Training Pipeline Guide
7.1 Training Infrastructure
GPU Requirements
| Training Scale | Recommended GPU | VRAM (total) | Training Duration | Cost (Estimated) |
|---|---|---|---|---|
| LoRA Fine-tuning | 1x RTX 3090/4090 | 24GB | 5-30 min | < $1 |
| DreamBooth | 1x A100 40GB | 40GB | 30-60 min | $2-5 |
| ControlNet training | 4-8x A100 80GB | 320-640GB | 2-5 days | $500-2,000 |
| SD 1.5-scale training | 256x A100 80GB | ~20TB | 24 days | ~$150K |
| SDXL-scale training | 512-1024x A100 80GB | ~40-80TB | Weeks | ~$500K-1M |
| SD3/Flux-scale training | 1024+ H100 80GB | ~80TB+ | Weeks to months | > $1M |
Distributed Training Strategy
[Large-Scale Distributed Training Configuration]
┌─────────────────────────────────────────────────────┐
│ Data Parallel (DP/DDP) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Full │ │Full │ │Full │ │Full │ │
│ │Model │ │Model │ │Model │ │Model │ │
│ │Copy │ │Copy │ │Copy │ │Copy │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ Batch 1 Batch 2 Batch 3 Batch 4 │
│ │
│ -> Synchronize gradients with All-Reduce │
│ -> Different data batches on each GPU │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ FSDP (Fully Sharded Data Parallel) │
│ │
│ GPU 0 GPU 1 GPU 2 GPU 3 │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Shard │ │Shard │ │Shard │ │Shard │ │
│ │ 1/4 │ │ 2/4 │ │ 3/4 │ │ 4/4 │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ -> Shard model parameters across GPUs │
│ -> All-Gather only needed shards during Forward/Backward │
│ -> Maximize memory efficiency (enables 8B+ model training) │
└─────────────────────────────────────────────────────┘
7.2 Representative Training Framework: Diffusers
HuggingFace's Diffusers library is the de facto standard for T2I model training.
# Diffusers-based Text-to-Image full training pipeline
# (uses an SD 1.x checkpoint so a single text encoder suffices;
#  SDXL additionally needs a second encoder plus size/crop conditioning)
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F

# 1. Load models
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

# 2. Freeze VAE and Text Encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

# 3. Accelerator setup (distributed training + Mixed Precision)
accelerator = Accelerator(
    mixed_precision="fp16",  # or "bf16"
    gradient_accumulation_steps=4,
)

# 4. Optimizer
optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)

# 5. EMA setup
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
    unet.parameters(),
    decay=0.9999,
    use_ema_warmup=True,
)

# 6. Prepare for distributed training (dataloader: your image-caption DataLoader)
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)

# 7. Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        with accelerator.accumulate(unet):
            images = batch["images"]
            captions = batch["captions"]
            # Latent encoding
            with torch.no_grad():
                latents = vae.encode(images).latent_dist.sample()
                latents = latents * vae.config.scaling_factor
            # Text encoding
            with torch.no_grad():
                text_inputs = tokenizer(captions, padding="max_length",
                                        max_length=77, return_tensors="pt")
                text_embeds = text_encoder(text_inputs.input_ids)[0]
            # Add noise
            noise = torch.randn_like(latents)
            timesteps = torch.randint(
                0, noise_scheduler.config.num_train_timesteps,
                (latents.shape[0],), device=latents.device,
            )
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            # Classifier-Free Guidance: random caption dropout
            if torch.rand(1).item() < 0.1:  # unconditional with 10% probability
                text_embeds = torch.zeros_like(text_embeds)
            # Predict noise
            noise_pred = unet(noisy_latents, timesteps, text_embeds).sample
            # Loss computation
            loss = F.mse_loss(noise_pred, noise)
            # Backward
            accelerator.backward(loss)
            accelerator.clip_grad_norm_(unet.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
        # EMA update
        ema_unet.step(unet.parameters())
7.3 Mixed Precision Training
Mixed Precision is a technique that improves memory and computational efficiency by combining FP32 and FP16/BF16.
[Mixed Precision Training]
Forward Pass:
- Model weights: FP16/BF16 (half memory)
- Activation: FP16/BF16
Loss Scaling:
- Multiply loss by a large scale (e.g., 2^16) to prevent gradient underflow
- Scale down gradient again after backward
Backward Pass:
- Gradient: FP16/BF16
Optimizer Step:
- Master Weights: FP32 (maintain precision!)
- Update FP32 master weights then create FP16 copy
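In PyTorch, the loss-scaling recipe above is typically implemented with `torch.autocast` plus a `GradScaler`; a minimal single-step sketch (the CPU/BF16 fallback is only for illustration, since FP16 loss scaling is a CUDA feature):

```python
import torch
import torch.nn as nn

def amp_train_step(model, optimizer, scaler, x, y):
    """One Mixed-Precision step: autocast forward + scaled backward.

    On CUDA this runs FP16 with dynamic loss scaling; on CPU it falls back
    to BF16 autocast and the scaler becomes a no-op passthrough.
    """
    device_type = "cuda" if x.is_cuda else "cpu"
    dtype = torch.float16 if x.is_cuda else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype):
        pred = model(x)                               # half-precision forward
    loss = nn.functional.mse_loss(pred.float(), y)    # compute loss in FP32
    scaler.scale(loss).backward()                     # scale to avoid underflow
    scaler.unscale_(optimizer)                        # back to true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                            # skips step on inf/nan grads
    scaler.update()                                   # adapt the scale factor
    optimizer.zero_grad()
    return loss.item()
```

`scaler.update()` is what implements the "dynamic" part: the scale grows while gradients stay finite and is cut back whenever an overflow is detected.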
| Precision | Memory | Compute Speed | Numerical Stability | Recommended Use |
|---|---|---|---|---|
| FP32 | 4 bytes | baseline | Highest | Optimizer state |
| FP16 | 2 bytes | ~2x | Low (overflow risk) | Forward/Backward |
| BF16 | 2 bytes | ~2x | High (wide range) | Recommended on A100/H100 |
| TF32 | 4 bytes (storage) | ~1.5x | High | A100 default |
# BF16 Mixed Precision config (accelerate-based)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
7.4 EMA (Exponential Moving Average)
EMA is a technique that maintains a moving average of model weights during training to achieve more stable results during inference. It is used in nearly all T2I model training.
[EMA Update]
θ_ema ← λ * θ_ema + (1 - λ) * θ_model
- λ: decay rate (typically 0.9999 ~ 0.99999)
- θ_model: weights of the model currently being trained
- θ_ema: EMA weights (used at inference)
- Effect: smooths gradient noise, yielding more stable weights
# Diffusers EMA implementation
from diffusers.training_utils import EMAModel

# Create EMA model
ema_model = EMAModel(
    unet.parameters(),
    decay=0.9999,         # decay rate
    use_ema_warmup=True,  # enable warmup
    inv_gamma=1.0,        # warmup parameter
    power=3/4,            # warmup parameter
)

# Update at every training step
ema_model.step(unet.parameters())

# Apply EMA weights at inference
ema_model.copy_to(unet.parameters())

# Or temporarily swap in EMA weights, then restore the training weights
ema_model.store(unet.parameters())
ema_model.copy_to(unet.parameters())
output = unet(noisy_latents, timesteps, text_embeds)
ema_model.restore(unet.parameters())
7.5 Training Hyperparameter Guide
| Hyperparameter | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (total) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | Original resolution |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
8. Key Paper References
8.1 Core Paper Table
| # | Paper Title | Authors | Year | Key Contribution | Link |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN framework proposal | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector Quantized discrete latent space | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based generator, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | Large-scale GAN training, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | Hierarchical VQ-VAE, high-resolution generation | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Practical training of Diffusion models | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP contrastive learning, image-text alignment | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + Autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent Diffusion, Cross-Attention conditioning | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP-based 2-stage Diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | Effect of the T5-XXL text encoder, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG training technique; joint conditional-unconditional training | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, 20B scaling | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning Technique | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | Systematic Diffusion design space analysis, Preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | Personalization via new token embedding learning | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | Subject personalization with few images, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | Adds structural control (edge, depth, pose) | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step generation, PF-ODE consistency learning | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | Large U-Net, Dual Text Encoder, Multi-AR training | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE-based Flow Matching framework | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, Optimal Transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | Image prompt adapter, Decoupled Cross-Attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | Efficient autoregressive T2I, Retrieval Augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | Low-cost DiT training, training decomposition strategy | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Dramatic quality improvement via synthetic captions | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-Strong training, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, large-scale Rectified Flow, Logit-Normal sampling | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM Noise Schedule, Multi-AR, Human Preference | arXiv:2402.17245 |
8.2 Additional Reference Papers
| Paper Title | Year | Key Contribution |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 5.85 billion open image-text dataset |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, speed improvement |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | Inference acceleration via progressive distillation |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step generation |
| Latent Consistency Models | 2024 | LCM, SDXL-based few-step generation |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL generation |
| Stable Cascade | 2024 | Wuerstchen-based 3-stage hierarchical generation |
9. Conclusion and Outlook
Text-to-Image model training methodologies started from GAN's adversarial training, passed through Diffusion's iterative denoising, and are now converging on a new paradigm of Flow Matching + DiT.
Key Trend Summary
[T2I Training Methodology Evolution]
Efficiency: Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
(months, $1M+) (minutes, less than $1) (seconds)
Architecture: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
(SD 1.x-SDXL) (DiT, PixArt) (SD3, Flux)
Generation speed: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
(DDPM) (DDIM, DPM++) (LCM, LADD, CM)
Data quality: Web crawling ──→ Filtering ──→ Synthetic Captions
(LAION raw) (aesthetic) (DALL-E 3 style)
Text understanding: CLIP only ──→ CLIP + T5 ──→ Triple encoder
(SD 1.x) (Imagen) (SD3, Flux)
Future Outlook
Maximizing training efficiency: As demonstrated by PixArt-alpha, the trend of reducing training costs to 1/10 or less while maintaining quality will continue.
Data-Centric AI approach: As DALL-E 3 demonstrated, data quality and captioning are becoming more important than architecture.
Few-step / one-step generation: distillation techniques such as Consistency Models, LCM, and LADD will continue to advance, making real-time generation the standard.
Unified Multi-Modal Generation: Expanding to models that integrate not only text-to-image but also video, 3D, and audio.
Advanced Personalization: Beyond LoRA, DreamBooth, and IP-Adapter, more accurate subject reproduction with even less data will become possible.
T2I model training methodology has entered an era where the key question is no longer simply "train a larger model on more data," but which data, which noise schedule, and which conditioning to train with. We hope the methodologies covered in this article serve as a foundation for training your own T2I models or effectively customizing existing ones.