
Complete Guide to Text-to-Image Model Training Methodologies: From GAN to Flow Matching


1. Introduction: The Evolution of Text-to-Image Generative Models

Text-to-Image (T2I) generative models are technologies that produce high-resolution images from natural language text prompts, and have undergone rapid development over the past several years. The trajectory of this field can be broadly divided into four paradigms.

[Text-to-Image Model Evolution Timeline]

2014-2019 (GAN): StackGAN, AttnGAN, StyleGAN, BigGAN, GigaGAN
  Features: adversarial training, mode collapse issues, fast generation

2017-2020 (VAE/VQ-VAE): VQ-VAE, VQ-VAE-2, dVAE
  Features: discrete latent space, codebook learning, two-stage training

2020-2023 (Diffusion Models): DDPM (2020), DALL-E 2 (2022), Imagen (2022), SD 1.x (2022), SDXL (2023)
  Features: iterative denoising, classifier-free guidance, latent space, U-Net backbone

2023-Present (Flow Matching + DiT): SD3 (2024), Flux (2024), PixArt-Sigma (2024)
  Features: straight paths, ODE-based, fewer steps, DiT backbone, scalable

1.1 Why Training Methodology Matters

The quality of T2I models is critically determined not only by architecture design but also by training methodology. Even with identical architectures, generation quality varies dramatically depending on noise scheduling, conditioning approaches, data quality, and training strategies. A prime example is DALL-E 3, which achieved dramatic performance improvements over its predecessor through caption quality improvement alone without any architecture changes.

This article provides an in-depth, paper-based analysis of core training methodologies for each paradigm, covering practical training pipeline configuration as well.


2. Training Methodologies by Core Architecture

2.1 GAN-Based: Adversarial Training

A Generative Adversarial Network (GAN) is a framework in which two networks, the Generator and the Discriminator, are trained against each other.

2.1.1 Basic Training Principles

The training objective function of GAN is defined as a minimax game:

min_G max_D  V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated images
- Training goal: G tries to fool D, while D tries to classify correctly

2.1.2 StyleGAN Training Strategy

StyleGAN (Karras et al., 2019) introduced Progressive Growing and Style-based Generator to enable high-quality image generation.

Core Training Techniques:

| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start from low resolution (4x4) and progressively increase | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Generator Jacobian regularization | Improved generation quality |
| R1 Regularization | Discriminator gradient penalty | Training stabilization |
| Lazy Regularization | Apply regularization every 16 steps instead of every step | Improved training efficiency |
# StyleGAN2 core training loop (simplified)
for real_images, _ in dataloader:
    # 1. Discriminator training
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)

    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())

    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # R1 Regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        # create_graph=True so the penalty is differentiable w.r.t. D's weights
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss += 10.0 * r1_penalty

    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Generator training
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_fake = discriminator(fake_images)
    g_loss = F.softplus(-d_fake).mean()

    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()

2.1.3 Large-Scale Training with BigGAN

BigGAN (Brock et al., 2019) scaled GANs to unprecedented batch sizes and model capacity, employing the following training strategies:

  • Large-scale batches: Increase batch size up to 2048 for improved training stability and quality
  • Class-conditional Batch Normalization: Inject class information into Batch Normalization parameters
  • Truncation Trick: Truncate latent distribution at inference to control quality-diversity trade-off
  • Orthogonal Regularization: Maintain orthogonality of weight matrices to prevent mode collapse
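The Truncation Trick from the list above is simple to sketch: sample z from a standard normal and resample any entry whose magnitude exceeds a threshold, trading diversity for fidelity. A minimal illustration (the function name and default threshold are illustrative; BigGAN applies this only at inference):

```python
import torch

def truncated_z(batch_size, latent_dim, threshold=0.5, generator=None):
    """Sample z ~ N(0, I), resampling entries until |z_i| <= threshold.

    Lower thresholds concentrate samples near the mode: higher fidelity,
    lower diversity (the quality-diversity trade-off described above).
    """
    z = torch.randn(batch_size, latent_dim, generator=generator)
    mask = z.abs() > threshold
    while mask.any():
        # Resample only the out-of-range entries
        z[mask] = torch.randn(int(mask.sum()), generator=generator)
        mask = z.abs() > threshold
    return z
```

In practice the threshold is a user-facing knob: threshold near 0 yields prototypical but repetitive samples, while a large threshold recovers the untruncated prior.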

2.1.4 Limitations of GAN-Based T2I

GAN-based approaches ceded dominance to Diffusion-based models due to the following fundamental limitations:

  • Mode Collapse: Limited generation diversity
  • Training Instability: Unstable training sensitive to hyperparameters
  • Text Conditioning difficulty: Difficult to accurately reflect complex text prompts
  • Scaling limitations: Increased training instability at large scale

2.2 VAE-Based: Codebook Learning and Discrete Latent Space

2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder

VQ-VAE (van den Oord et al., 2017) is an approach that learns a discrete latent space instead of a continuous one.

[VQ-VAE Architecture]

Input Image (256x256) --> Encoder E(x) --> z_e --> [Codebook lookup] --> z_q --> Decoder D(z_q) --> Reconstructed Image

Codebook: K code vectors {e_1, e_2, ..., e_K}

  z_q = e_k  where k = argmin_j ||z_e - e_j||
  (quantize to the nearest code vector)

VQ-VAE Training Loss Function:

L = ||x - D(z_q)||²                    # Reconstruction Loss
  + ||sg[z_e] - e||²                   # Codebook Loss (can be replaced by EMA updates)
  + β * ||z_e - sg[e]||²               # Commitment Loss

- sg[·]: stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: selected codebook vector

Since the quantization operation is non-differentiable, the Straight-Through Estimator (STE) is used to pass gradients to the encoder. The codebook itself is updated via Exponential Moving Average (EMA).

# VQ-VAE Codebook core training code
class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        # z_e: (B, D, H, W) -> (B*H*W, D)
        flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, z_e.shape[1])

        # Find nearest codebook vector
        distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
                  + (self.embedding.weight ** 2).sum(dim=1) \
                  - 2 * flat_z @ self.embedding.weight.t()
        indices = distances.argmin(dim=1)
        z_q = self.embedding(indices).view_as(z_e.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        # Loss computation
        codebook_loss = F.mse_loss(z_q, z_e.detach())       # moves codes toward encoder outputs
        commitment_loss = F.mse_loss(z_q.detach(), z_e)     # commits encoder to chosen codes
        loss = codebook_loss + self.commitment_cost * commitment_loss

        # Straight-Through Estimator
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, loss, indices
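The EMA codebook update mentioned above replaces the codebook loss term entirely: each code vector tracks a running average of the encoder outputs assigned to it. A minimal sketch of the EMA variant described in van den Oord et al. (2017); the `decay` and `eps` defaults are common choices, not values fixed by the paper:

```python
import torch
import torch.nn.functional as F

class EMACodebook(torch.nn.Module):
    """Codebook updated by EMA cluster statistics instead of a codebook loss."""

    def __init__(self, num_embeddings, embedding_dim, decay=0.99, eps=1e-5):
        super().__init__()
        self.decay, self.eps = decay, eps
        embed = torch.randn(num_embeddings, embedding_dim)
        self.register_buffer("embed", embed)
        self.register_buffer("cluster_size", torch.zeros(num_embeddings))
        self.register_buffer("embed_avg", embed.clone())

    @torch.no_grad()
    def update(self, flat_z, indices):
        # One-hot assignments of encoder outputs to codes: (N, K)
        onehot = F.one_hot(indices, self.embed.shape[0]).type_as(flat_z)
        # EMA of per-code counts and of summed encoder outputs
        self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
        self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat_z, alpha=1 - self.decay)
        # Laplace smoothing avoids division by zero for unused codes
        n = self.cluster_size.sum()
        size = (self.cluster_size + self.eps) / (n + self.embed.shape[0] * self.eps) * n
        self.embed.copy_(self.embed_avg / size.unsqueeze(1))
```

`update` would be called once per batch with the flattened encoder outputs and the chosen code indices from the quantizer above; only the commitment loss then remains in the training objective.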

2.2.2 VQ-VAE-2: Hierarchical Codebook Learning

VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization to significantly improve image quality.

[VQ-VAE-2 Hierarchical Structure]

Top Level (low resolution)
  32x32 codebook grid: global structure (composition, overall shape)
        |
Bottom Level (high resolution)
  64x64 codebook grid: fine detail (textures, edges)

The image generation pipeline of VQ-VAE-2 consists of the following two stages:

  1. Stage 1: Train VQ-VAE-2 to encode images into hierarchical discrete codes
  2. Stage 2: Learn the prior of discrete codes with autoregressive models such as PixelCNN

This approach directly influenced the dVAE (discrete VAE) used in DALL-E.


2.3 Diffusion-Based: The Core of Current T2I

Diffusion models are the mainstream paradigm for current T2I generation. They learn a forward process that gradually adds noise to data, and a reverse process that recovers the data from noise.

2.3.1 DDPM: Denoising Diffusion Probabilistic Models

DDPM by Ho et al. (2020) is the key paper that elevated Diffusion Models to a practical level.

Forward Process (Diffusion):

q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) * x_{t-1}, β_t * I)

- Add a small amount of Gaussian noise at each timestep t
- β_t: noise schedule (typically linear or cosine)
- After T steps, x_T is approximately N(0, I) (pure Gaussian noise)

Noise can be added directly at any timestep t in closed form:

q(x_t | x_0) = N(x_t; sqrt(ᾱ_t) * x_0, (1-ᾱ_t) * I)

where ᾱ_t = ∏_{s=1}^{t} α_s,  α_t = 1 - β_t

=> x_t = sqrt(ᾱ_t) * x_0 + sqrt(1-ᾱ_t) * ε,  ε ~ N(0, I)

Reverse Process (Denoising):

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)

- A neural network ε_θ predicts the noise ε that was added to x_t
- The predicted noise is removed to recover x_{t-1}

Training Objective (Simple Loss):

L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = sqrt(ᾱ_t) * x_0 + sqrt(1-ᾱ_t) * ε

# DDPM core training loop
def train_step(model, x_0, noise_scheduler):
    batch_size = x_0.shape[0]

    # 1. Random timestep sampling
    t = torch.randint(0, num_timesteps, (batch_size,))

    # 2. Noise sampling
    noise = torch.randn_like(x_0)

    # 3. Forward process: generate x_t
    alpha_bar_t = noise_scheduler.alpha_bar[t]
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise

    # 4. Predict noise
    noise_pred = model(x_t, t)

    # 5. Loss computation (MSE)
    loss = F.mse_loss(noise_pred, noise)

    return loss

2.3.2 Noise Scheduling

The noise schedule determines the amount of noise added at each timestep in the forward process and has a decisive impact on generation quality.

| Schedule | Formula | Features | Models Used |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) · t/T | Simple, but noise increases sharply at the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) · π/2) | Smooth transition, excellent information preservation | Improved DDPM |
| Scaled Linear | β_t = (√β_min + t/T · (√β_max - √β_min))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12·t/T) | Gradual change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Theoretically near-optimal | Playground v2.5, EDM |
| Zero Terminal SNR | Ensures SNR(T) = 0 | Guarantees starting from pure noise | SD3, Flux |

Playground v2.5 (Li et al., 2024) adopted the EDM (Karras et al., 2022) noise schedule, greatly improving color and contrast. A related fix is Zero Terminal SNR (Lin et al., 2024): the Signal-to-Noise Ratio (SNR) at the final timestep T must be exactly 0 during training, so that sampling genuinely starts from pure noise.

# Cosine Schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

# EDM Noise Schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    step_indices = torch.arange(num_steps)
    t_steps = (sigma_max ** (1/rho) + step_indices / (num_steps - 1)
               * (sigma_min ** (1/rho) - sigma_max ** (1/rho))) ** rho
    return t_steps
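The Zero Terminal SNR row in the table above corresponds to a rescaling trick (Lin et al., 2024) that can be applied to any existing beta schedule: shift and scale sqrt(ᾱ_t) so that its final value is exactly zero, then recover the betas. A sketch:

```python
import torch

def rescale_betas_zero_snr(betas):
    """Rescale a beta schedule so SNR(T) = 0 (Lin et al., 2024).

    sqrt(alpha_bar) is shifted so its last value is exactly 0 and rescaled
    to preserve its first value, then betas are recovered from the ratios.
    """
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    sqrt_ab = alphas_bar.sqrt()

    sqrt_ab_0, sqrt_ab_T = sqrt_ab[0].clone(), sqrt_ab[-1].clone()
    # Shift so the terminal value hits zero; rescale to keep the initial value
    sqrt_ab = (sqrt_ab - sqrt_ab_T) * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)

    alphas_bar = sqrt_ab ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]   # recover per-step alphas
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas
```

After rescaling, the final beta equals 1, so x_T carries no signal; Lin et al. pair this with v-prediction, since ε-prediction is degenerate at SNR = 0.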

2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion

Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically improved computational efficiency by performing diffusion in latent space instead of pixel space. This is the core idea behind Stable Diffusion.

[Latent Diffusion Model Architecture]

Text Prompt --> CLIP Text Encoder --> text embeddings
                                          |
                                          v  (cross-attention)
Image (512x512) --> VAE Encoder --> latent z (64x64x4, 8x downsampling)
                --> U-Net (denoising in latent space)
                --> VAE Decoder --> Output Image (512x512)

Training: diffusion in latent space
Inference: random noise z_T -> U-Net denoising -> VAE decode -> image

LDM Training Pipeline:

  1. Stage 1 - Autoencoder Training: Pretrain VAE (KL-regularized) on image datasets

    • Encoder: Image x (H x W x 3) -> latent z (H/f x W/f x c), f=8 is typical
    • Decoder: latent z -> Reconstructed image
    • Loss: Reconstruction + KL Divergence + Perceptual Loss + GAN Loss
  2. Stage 2 - Diffusion Model Training: Diffusion in the latent space of the frozen Autoencoder

    • Add noise to latent z_0 = Encoder(x) to generate z_t
    • U-Net predicts noise from z_t
    • Text conditioning is injected via cross-attention
# Latent Diffusion core training
class LatentDiffusionTrainer:
    def __init__(self, vae, unet, text_encoder, noise_scheduler):
        self.vae = vae              # Frozen
        self.unet = unet            # Trainable
        self.text_encoder = text_encoder  # Frozen
        self.noise_scheduler = noise_scheduler

    def train_step(self, images, captions):
        # 1. Latent encoding with VAE (no gradient needed)
        with torch.no_grad():
            latents = self.vae.encode(images).latent_dist.sample()
            latents = latents * self.vae.config.scaling_factor  # 0.18215

        # 2. Text embedding (no gradient needed)
        with torch.no_grad():
            text_embeddings = self.text_encoder(captions)

        # 3. Add noise
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, 1000, (latents.shape[0],))
        noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)

        # 4. Predict noise
        noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)

        # 5. MSE loss
        loss = F.mse_loss(noise_pred, noise)
        return loss

2.3.4 Structure of the U-Net Backbone

The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:

[U-Net with Cross-Attention Structure]

Input z_t --> Down Blocks (64x64 -> 32x32 -> 16x16)
          --> Middle Block (16x16)
          --> Up Blocks (16x16 -> 32x32 -> 64x64), with skip connections
              from the matching Down Blocks
          --> Output ε_θ

Inside each block:
- ResNet Block: GroupNorm -> SiLU -> Conv (x2), with the timestep embedding injected
- Self-Attention Block: LayerNorm -> Self-Attention -> skip connection
- Cross-Attention Block: LayerNorm -> Attention(Q, K, V), where
    Q = Linear(latent features), K = Linear(text embeddings),
    V = Linear(text embeddings)   <- text conditioning
- Feed-Forward Block: LayerNorm -> Linear -> GEGLU -> Linear -> skip connection

SDXL (Podell et al., 2023) expanded the U-Net by approximately 3x (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and applies improvements including training at various aspect ratios.

| Model | U-Net Params | Text Encoder(s) | Resolution | VAE Downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (1) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (1) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (2) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (1) | 1024x1024 | 8x |
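Part of SDXL's multi-aspect-ratio training is micro-conditioning: the original image size, crop coordinates, and target size are Fourier-embedded like timesteps and fed to the model alongside the pooled text embedding. A rough sketch of the idea; the function name and `embed_dim` are illustrative, not SDXL's exact implementation:

```python
import math
import torch

def sdxl_add_time_ids(orig_size, crop_coords, target_size, embed_dim=256):
    """Sinusoidally embed SDXL-style size/crop conditioning values.

    The six scalars (orig H/W, crop top/left, target H/W) are embedded with
    the same sin/cos scheme used for timesteps, then concatenated; the model
    adds the result to its time embedding.
    """
    ids = torch.tensor([*orig_size, *crop_coords, *target_size],
                       dtype=torch.float32)
    half = embed_dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = ids[:, None] * freqs[None, :]               # (6, half)
    emb = torch.cat([args.sin(), args.cos()], dim=-1)  # (6, embed_dim)
    return emb.flatten()                               # (6 * embed_dim,)
```

At inference, setting crop coordinates to (0, 0) and the original size to the target size steers the model toward centered, full-frame compositions.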

2.3.5 Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) by Ho & Salimans (2022) is a core training technique for modern T2I models.

Problems with Traditional Classifier Guidance:

  • Requires training a separate classifier
  • Needs a classifier that works on noisy images
  • Requires computing classifier gradients during inference

Classifier-Free Guidance Key Idea:

During training, text conditioning is replaced with an empty string ("") with a certain probability (typically 10-20%), so that a single model simultaneously learns both conditional and unconditional generation.

During training:
  - with probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅)   (unconditional)
  - with probability 1 - p_uncond:         ε_θ(x_t, t, c)   (conditional)

At inference:
  ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))

  - w: guidance scale (typically 7.5 ~ 15)
  - w = 1: the conditional prediction as-is
  - w > 1: pushed further in the direction of the text condition

# Classifier-Free Guidance training implementation
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
    noise = torch.randn_like(x_0)
    t = torch.randint(0, T, (x_0.shape[0],))
    x_t = add_noise(x_0, noise, t)

    # Randomly drop conditioning
    mask = torch.rand(x_0.shape[0]) < p_uncond
    cond = text_cond.clone()
    cond[mask] = empty_text_embedding  # null conditioning

    noise_pred = model(x_t, t, cond)
    loss = F.mse_loss(noise_pred, noise)
    return loss

# Classifier-Free Guidance inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
    x_t = x_T
    for t in reversed(range(T)):
        # Unconditional prediction
        eps_uncond = model(x_t, t, empty_text_embedding)
        # Conditional prediction
        eps_cond = model(x_t, t, text_cond)
        # Guided prediction
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x_t = denoise_step(x_t, eps, t)
    return x_t

CFG dramatically improves generation quality and text fidelity, but if the guidance scale is too high, images become oversaturated or artifacts appear.

2.3.6 DALL-E 2: CLIP-Based Diffusion

DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture leveraging the CLIP embedding space.

[DALL-E 2 Training Pipeline]

Text --> CLIP Text Encoder --> text embedding
                |
                v
Prior (Diffusion):      text embedding --> CLIP image embedding
                |
                v
Decoder (Diffusion):    CLIP image embedding --> 64x64 image
                |
                v
Super-Res (Diffusion):  64x64 --> 256x256 --> 1024x1024

2.3.7 Imagen: The Power of T5 Text Encoder

Google's Imagen (Saharia et al., 2022) maximized text understanding by using the frozen T5-XXL text encoder (4.6B parameters).

Key findings:

  • Scaling the text encoder is more effective than scaling the U-Net
  • T5-XXL outperforms CLIP ViT-L in both text understanding and image quality
  • Dynamic Thresholding: stable generation even at high CFG scales
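Dynamic thresholding can be sketched in a few lines: at each sampling step, clamp the predicted clean image to a per-sample percentile of its absolute pixel values (at least 1), then rescale back into range. A minimal illustration of the idea from the Imagen paper:

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Imagen-style dynamic thresholding of the predicted clean image x0.

    High CFG scales push predicted pixels far outside [-1, 1]; clamping to a
    per-sample percentile s (but never below 1) and dividing by s prevents
    the saturated, washed-out look of static clipping.
    """
    b = x0_pred.shape[0]
    s = torch.quantile(x0_pred.abs().reshape(b, -1), percentile, dim=1)
    s = s.clamp(min=1.0).view(b, *([1] * (x0_pred.ndim - 1)))
    return x0_pred.clamp(-s, s) / s
```

When predictions already lie inside [-1, 1], s is clamped to 1 and the operation is a no-op, so it only activates when guidance actually overshoots.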
[Imagen Architecture]

Text --> T5-XXL (frozen) --> text embeddings (cross-attention into each U-Net)
                |
                v
Base Model (U-Net):   generates 64x64
                |
                v
SR Model 1 (U-Net):   64x64 --> 256x256
                |
                v
SR Model 2 (U-Net):   256x256 --> 1024x1024

2.3.8 DiT: Diffusion Transformer

DiT (Diffusion Transformer) by Peebles & Xie (2023) is an architecture that replaces U-Net with Transformer, and is becoming the mainstream for recent T2I models.

[DiT Block Structure]

Input Tokens (patchified latent)
    |
Adaptive LayerNorm (adaLN-Zero)   <- γ, β = MLP(timestep + class embedding)
    |
Self-Attention (+ residual, scaled by α)
    |
Adaptive LayerNorm (adaLN-Zero)
    |
Pointwise FFN (+ residual, scaled by α)
    |
Output Tokens

Key Design Decisions of DiT:

  • Patchify: Split latent into p x p patches then linear projection to token sequence
  • adaLN-Zero: Adaptive Layer Normalization, injecting timestep and class information into LN parameters
  • Scaling: Systematic scaling law verification by model size (depth, width)
| DiT Variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
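The patchify step can be made concrete. The sketch below splits a latent into non-overlapping p x p patches and flattens each into a token; in DiT proper, a linear layer then projects each token to the model width (e.g. 1152 for DiT-XL/2):

```python
import torch

def patchify(latent, patch_size=2):
    """Turn a latent (B, C, H, W) into a token sequence (B, N, p*p*C).

    N = (H/p) * (W/p); each token is one flattened p x p patch across all
    channels, ready for a linear projection to the Transformer width.
    """
    B, C, H, W = latent.shape
    p = patch_size
    x = latent.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)   # (B, H/p, W/p, p, p, C)
    return x.reshape(B, (H // p) * (W // p), p * p * C)
```

For SD-style latents (4 x 32 x 32) with p = 2 this yields 256 tokens of dimension 16, which is why the "/2" in the variant names matters: halving p quadruples the token count and roughly quadruples attention cost.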

2.4 Autoregressive-Based T2I

2.4.1 DALL-E (Original): Token-Based Autoregressive Generation

DALL-E (Ramesh et al., 2021) converts images into discrete tokens, then concatenates text tokens and image tokens into a single sequence to learn the joint distribution with an autoregressive Transformer.

[DALL-E Training Pipeline]

Stage 1: dVAE training
  Image (256x256) --> dVAE Encoder --> 32x32 grid of tokens (8192 vocabulary)
                                   --> dVAE Decoder --> Reconstructed Image

  Loss: Reconstruction + KL Divergence (Gumbel-Softmax relaxation)

Stage 2: Autoregressive Transformer training
  [BPE text tokens (256)] + [Image tokens (1024)] = 1280 tokens

  Transformer (12B params):
  - 64 layers, 62 attention heads
  - Training objective: next-token prediction (cross-entropy)
  - Text tokens attend causally (left to right)
  - Image tokens are generated autoregressively in row-major order
  - Image tokens attend to the text tokens (text-to-image conditioning)

2.4.2 Parti: Encoder-Decoder Based

Google's Parti (Yu et al., 2022) formulated T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an Encoder-Decoder Transformer.

Key features:

  • ViT-VQGAN: Vision Transformer-based image tokenizer
  • Encoder-Decoder: Uses Encoder for text encoding, Decoder for image token generation
  • Scaling: Systematic scale-up from 350M to 3B to 20B parameters
  • Achieves quality comparable to Imagen at the 20B model

2.4.3 CM3Leon: Efficient Multimodal Autoregressive

Meta's CM3Leon (Yu et al., 2023) significantly improved the efficiency of the autoregressive approach:

  • Retrieval-Augmented Training: Retrieve related image-text pairs during training and add to context
  • Decoder-Only: Pure decoder-only architecture unlike Parti
  • Instruction Tuning: Supervised fine-tuning for various tasks
  • 5x less training cost: Reduces training compute by 1/5 for comparable performance
  • Achieves MS-COCO zero-shot FID of 4.88

2.5 Flow Matching: The Next-Generation Training Paradigm

2.5.1 Basic Principles of Flow Matching

Flow Matching (Lipman et al., 2023) learns a straight path from noise distribution to data distribution through a deterministic ODE (Ordinary Differential Equation) instead of Diffusion's stochastic process.

[Diffusion vs Flow Matching Comparison]

Diffusion (stochastic):                  Flow Matching (deterministic):
  x_0 ~~~> x_T                             x_0 ------> x_1
  (curved path, many steps required)       (straight path, fewer steps possible)

  dx = f(x,t)dt + g(t)dW                   dx/dt = v_θ(x_t, t)
  (SDE-based)                              (ODE-based, learns a velocity field)

Flow Matching Training Objective:

L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]

where:
  x_t = (1 - t) * x_0 + t * x_1        (linear interpolation)
  u_t = x_1 - x_0                      (target velocity: straight path)
  t ~ Uniform(0, 1)                    (or logit-normal)
  x_0 ~ p_data (real data)
  x_1 ~ N(0, I) (Gaussian noise)

2.5.2 Rectified Flow

Rectified Flow (Liu et al., 2023, ICLR 2023 Spotlight) is a key variant of Flow Matching that connects noise-data pairs in straight lines from an Optimal Transport perspective.

Key idea:

  1. 1-Rectified Flow: Randomly pair data x_0 and noise x_1 to learn straight paths
  2. 2-Rectified Flow (Reflow): Re-straighten pairs generated by 1-Rectified Flow to make paths closer to straight lines
  3. Distillation: Distill the straightened model into a 1-step model
# Rectified Flow core training
def rectified_flow_train_step(model, x_0, x_1=None):
    """
    x_0: 실제 데이터 (latent)
    x_1: 노이즈 (None이면 랜덤 샘플링)
    """
    if x_1 is None:
        x_1 = torch.randn_like(x_0)

    # Time sampling (logit-normal for SD3)
    t = torch.sigmoid(torch.randn(x_0.shape[0]))  # logit-normal
    t = t.view(-1, 1, 1, 1)

    # Linear interpolation
    x_t = (1 - t) * x_0 + t * x_1

    # Target velocity (straight direction)
    target_v = x_1 - x_0

    # Velocity prediction
    v_pred = model(x_t, t)

    # Loss
    loss = F.mse_loss(v_pred, target_v)
    return loss
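The Reflow step (2-Rectified Flow) needs pairs generated by the first model: integrate the learned ODE from noise back to data and keep the endpoints as new, better-aligned training pairs. A minimal Euler-integration sketch, assuming `model(x, t)` predicts the velocity x_1 - x_0 as in the training loop above:

```python
import torch

@torch.no_grad()
def generate_reflow_pairs(model, x_1, num_steps=50):
    """Generate (x_0, x_1) pairs for 2-Rectified Flow (Reflow).

    Euler-integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data);
    the recovered endpoint is re-paired with the original noise and used
    to retrain, straightening the learned trajectories.
    """
    x = x_1.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        t_batch = torch.full((x.shape[0], 1, 1, 1), t)
        x = x - model(x, t_batch) * dt   # step against the velocity, toward data
    return x, x_1                         # (x_0 candidate, original noise)
```

Because each noise sample stays paired with the data point its own ODE reaches, the retrained velocity field has far less trajectory crossing, which is what makes the subsequent few-step or 1-step distillation work.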

2.5.3 Flow Matching in Stable Diffusion 3

SD3 (Esser et al., 2024) is the first model to apply Rectified Flow to a large-scale T2I model. Key contributions from the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":

1. Logit-Normal Timestep Sampling:

Instead of a uniform distribution, timesteps are sampled using a logit-normal distribution, giving more weight to the middle portion of the trajectory (the most challenging prediction interval).

# SD3's Logit-Normal Timestep Sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
    """Give more weight to middle timesteps"""
    u = torch.randn(batch_size) * s + m
    t = torch.sigmoid(u)  # in (0, 1)
    return t

2. MM-DiT (Multi-Modal Diffusion Transformer):

SD3 introduced a new Transformer architecture that uses separate weights for text and images while enabling bidirectional information flow.

[MM-DiT Block]

Image Tokens                 Text Tokens
     |                            |
 adaLN(t)                     adaLN(t)        (separate weights per modality)
     |                            |
     +-------------+--------------+
                   |  (concatenate)
         Joint Self-Attention                 (image and text tokens
                   |                           attend to each other)
                   |  (split)
     +-------------+--------------+
     |                            |
 FFN (image)                  FFN (text)      (separate weights per modality)
     |                            |
 Image Out                    Text Out

3. Scaling Laws:

| Model | Blocks | Parameters | Validation Loss |
|---|---|---|---|
| SD3-S | 15 | 450M | High |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Low (best performance) |

Smooth scaling was confirmed where validation loss steadily decreases as model size and training steps increase.

2.5.4 Flux: Black Forest Labs' Flow Matching Model

Flux (Black Forest Labs, 2024) is a model based on SD3's Rectified Flow + Transformer architecture.

| Variant | Training Method | Inference Steps | Features |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance Distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |

Guidance Distillation: the student model is trained to reproduce, without CFG, the output of a teacher model that uses CFG, eliminating the doubled forward pass of CFG at inference time.

Latent Adversarial Diffusion Distillation (LADD): Combines GAN's adversarial loss with diffusion distillation to enable 1-4 step generation.
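The guidance-distillation objective can be sketched as follows. This is an illustration of the general idea, not Black Forest Labs' actual recipe; the teacher's CFG-combined prediction becomes a regression target that the student matches in a single forward pass:

```python
import torch
import torch.nn.functional as F

def guidance_distill_step(student, teacher, x_t, t, cond, uncond, w=3.5):
    """One guidance-distillation training step (conceptual sketch).

    The teacher runs twice (conditional + unconditional) to form the guided
    prediction; the student learns to produce it directly. In practice the
    guidance scale w is also fed to the student so one model covers a range
    of scales.
    """
    with torch.no_grad():
        eps_u = teacher(x_t, t, uncond)
        eps_c = teacher(x_t, t, cond)
        target = eps_u + w * (eps_c - eps_u)   # teacher's CFG-guided output
    pred = student(x_t, t, cond)               # single pass, no CFG at inference
    return F.mse_loss(pred, target)
```

The student inherits the teacher's text fidelity while halving per-step compute, which is consistent with FLUX.1 [dev] matching [pro] quality at similar step counts.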


3. Text Conditioning Methodologies

Text Conditioning is the mechanism that injects the meaning of text prompts into the image generation process in T2I models. The choice of text encoder and conditioning method has a decisive impact on generation quality.

3.1 CLIP Text Encoder

OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) is a model contrastively trained on 400 million image-text pairs.

[CLIP Training Process]

  Image ──→ Image Encoder ──→ image embedding ─┐
                                                 ├─ cosine similarity
  Text  ──→ Text Encoder  ──→ text embedding  ─┘

  Training objective: Increase similarity for matching pairs, decrease for non-matching pairs
  (InfoNCE Loss)
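The symmetric InfoNCE objective in the diagram can be written in a few lines; matching pairs sit on the diagonal of the similarity matrix, and both an image-to-text and a text-to-image cross-entropy are averaged:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Embeddings are L2-normalized so the logits are scaled cosine
    similarities; label i says row i's match is column i (the diagonal).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    labels = torch.arange(logits.shape[0])
    loss_i = F.cross_entropy(logits, labels)          # image -> text
    loss_t = F.cross_entropy(logits.t(), labels)      # text -> image
    return (loss_i + loss_t) / 2
```

CLIP additionally learns the temperature as a parameter (a fixed 0.07 here is a simplification), and the effective batch size matters: larger batches supply more negatives per pair.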

Characteristics of CLIP Text Encoder:

  • Both token sequence embeddings and [EOS] token pooled embeddings can be utilized
  • Maximum 77 token length limit
  • Strong at image-text alignment
  • Text understanding specialized for visual concepts

| CLIP Variant | Text Encoder Params | Models Used |
|---|---|---|
| CLIP ViT-L/14 | ~124M | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M | SDXL (primary) |
| CLIP ViT-L/14 | ~124M | SDXL (secondary) |

3.2 T5 Text Encoder

Google's T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) is a large-scale language model trained on a pure text corpus.

Advantages of T5 (Demonstrated in the Imagen paper):

  • Trained on a much larger text corpus than CLIP (C4 dataset)
  • Excellent at understanding complex sentence structures and relationships
  • Ability to process complex prompts including spatial relationships, quantities, and attribute combinations
  • Text encoder scaling is more effective than U-Net scaling (a key finding of Imagen)

| T5 Variant | Parameters | Models Used |
|---|---|---|
| T5-Small | 60M | Experimental |
| T5-Base | 220M | Experimental |
| T5-Large | 770M | Experimental |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |

3.3 Cross-Attention Mechanism

Cross-attention is the core mechanism that injects text information into image features within the U-Net or DiT.

# Cross-Attention implementation
class CrossAttention(nn.Module):
    def __init__(self, d_model, d_context, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.d_model = d_model

        self.to_q = nn.Linear(d_model, d_model, bias=False)       # latent → Q
        self.to_k = nn.Linear(d_context, d_model, bias=False)     # text → K
        self.to_v = nn.Linear(d_context, d_model, bias=False)     # text → V
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: (B, H*W, d_model) - image latent features
        context: (B, seq_len, d_context) - text embeddings
        """
        B = x.shape[0]
        q = self.to_q(x)          # the image provides the Query
        k = self.to_k(context)    # the text provides the Key
        v = self.to_v(context)    # the text provides the Value

        # Multi-head reshape
        q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        # Attention
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        attn = F.softmax(attn, dim=-1)
        out = attn @ v

        out = out.transpose(1, 2).reshape(B, -1, self.d_model)
        return self.to_out(out)

3.4 Pooled Text Embeddings vs Sequence Embeddings

Modern T2I models simultaneously utilize two types of text embeddings:

[Text Embedding Types]

Text: "a photo of a cat"
              │
              ▼
        Text Encoder
        ┌─────┴─────┐
        ▼           ▼
 Sequence Embeddings          Pooled Embedding
 (token-level)                (sentence-level)
 [h_1, h_2, ..., h_n]         h_pool = h_[EOS]
 Shape: (seq_len, d)          Shape: (d,)
        │                         │
        ▼                         ▼
 Used in Cross-Attention      Used for global conditioning
 (fine-grained per-token      (whole-sentence semantics)
  information)                - Added to the timestep embedding
                              - Modulates adaLN parameters
                              - Vector conditioning

Dual text encoder usage in SDXL:

# SDXL Text Conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
    # CLIP ViT-L: sequence embeddings (77, 768)
    clip_l_output = clip_l(text)
    clip_l_seq = clip_l_output.last_hidden_state      # (77, 768)
    clip_l_pooled = clip_l_output.pooler_output        # (768,)

    # OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
    clip_g_output = clip_g(text)
    clip_g_seq = clip_g_output.last_hidden_state       # (77, 1280)
    clip_g_pooled = clip_g_output.pooler_output        # (1280,)

    # Concatenate sequence embeddings -> used for Cross-Attention
    text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1)  # (77, 2048)

    # Concatenate pooled embeddings -> used for Vector conditioning
    pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # (2048,)

    return text_embeddings, pooled_embeddings

SD3 and Flux additionally combine T5-XXL sequence embeddings, using a triple text encoder configuration:

| Encoder | Role | Output Shape | Use Case |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
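The triple-encoder conditioning described above can be sketched as simple tensor plumbing. This is a minimal illustration with random stand-in tensors; the shapes follow the table, and zero-padding the concatenated CLIP features to the T5 width mirrors SD3's published approach, though exact projection details vary by model:

```python
# Sketch: assembling SD3-style conditioning from three text encoders.
# All tensors here are random stand-ins for real encoder outputs.
import torch
import torch.nn.functional as F

clip_l_pooled = torch.randn(1, 768)       # CLIP ViT-L pooled
clip_g_pooled = torch.randn(1, 1280)      # OpenCLIP bigG pooled
clip_l_seq = torch.randn(1, 77, 768)      # CLIP ViT-L sequence
clip_g_seq = torch.randn(1, 77, 1280)     # OpenCLIP bigG sequence
t5_seq = torch.randn(1, 256, 4096)        # T5-XXL sequence

# Vector condition: concatenate the pooled CLIP embeddings -> (1, 2048)
vector_cond = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)

# Token condition: concatenate CLIP sequences channel-wise (77, 2048),
# zero-pad to the T5 width, then stack with T5 tokens along the sequence axis
clip_seq = torch.cat([clip_l_seq, clip_g_seq], dim=-1)   # (1, 77, 2048)
clip_seq = F.pad(clip_seq, (0, 4096 - 2048))             # (1, 77, 4096)
token_cond = torch.cat([clip_seq, t5_seq], dim=1)        # (1, 333, 4096)
```

The vector condition modulates adaLN parameters, while the token condition feeds the joint attention of the MM-DiT.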

4. Training Datasets

The quality of T2I models directly depends on the scale and quality of training data. Here is a summary of major large-scale datasets.

4.1 Comparison of Major Datasets

| Dataset | Scale | Source | Filtering Method | Main Models Used |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Google search | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Google search | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawling | Top 10% CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image | Research |
| SAM | 11M images | Various sources | Manual + model-based | Segmentation + T2I |
| Internal (Proprietary) | Billions of pairs | Proprietary | Proprietary | DALL-E 3, Midjourney |

4.2 LAION-5B Filtering Pipeline

LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:

[LAION-5B Data Collection and Filtering Pipeline]

Common Crawl (web archive)
1. HTML parsing: extract src URL + alt-text from <img> tags
2. Image download (img2dataset)
   - Minimum resolution filter: width, height ≥ 64
   - Maximum aspect ratio: 3:1
3. CLIP similarity filtering
   - Compute image-text similarity with OpenAI CLIP ViT-B/32
   - English: cosine similarity ≥ 0.28
   - Other languages: cosine similarity ≥ 0.26
4. Safety filtering
   - NSFW detection score (CLIP-based)
   - Watermark detection score
   - Toxic content detection
5. Deduplication
   - Hash-based exact duplicate removal
   - CLIP embedding-based near-duplicate removal
Final: 5.85 billion image-text pairs
 - 2.32B English
 - 2.26B in 100+ other languages
 - 1.27B language unidentified

4.3 Data Quality Assessment

The latest models tend to focus on data quality over data quantity:

1. CLIP Score-Based Filtering:

# CLIP Score computation
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# `caption` is a string and `image` a PIL.Image for one candidate pair
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
clip_score = outputs.logits_per_image.item() / 100.0  # divide out the logit scale (~100)

2. Aesthetic Score Filtering:

LAION-Aesthetics is a subset built by training a separate aesthetic predictor (CLIP embedding → MLP → score) and keeping only images with an aesthetic score of 4.5 or higher.
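The CLIP-embedding-to-MLP-to-score pipeline can be sketched in a few lines. This is an illustrative stand-in, not the actual LAION predictor: the layer sizes are assumptions, and the random tensors below stand in for real CLIP ViT-L/14 image embeddings.

```python
# Sketch of a LAION-style aesthetic predictor: a small MLP mapping a
# CLIP image embedding to one scalar score, then threshold filtering.
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 128),
            nn.ReLU(),
            nn.Linear(128, 1),   # scalar aesthetic score
        )

    def forward(self, clip_embedding):
        # L2-normalize, as CLIP embeddings are used in similarity space
        clip_embedding = clip_embedding / clip_embedding.norm(dim=-1, keepdim=True)
        return self.mlp(clip_embedding)

predictor = AestheticPredictor()
embeddings = torch.randn(4, 768)            # stand-in CLIP embeddings
scores = predictor(embeddings).squeeze(-1)  # one score per image
keep = scores > 4.5                          # LAION-Aesthetics threshold
```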

3. Caption Quality Improvement (DALL-E 3's Core Innovation):

DALL-E 3 (Betker et al., 2023) achieved dramatic performance improvement through caption quality improvement alone without any architecture changes:

  • Train a dedicated image captioning model to generate detailed synthetic captions
  • Train with 95% synthetic captions + 5% original captions
  • Comparison experiments of three types: short synthetic, detailed synthetic, and human annotation
  • Detailed synthetic captions are overwhelmingly superior
[DALL-E 3 Caption Improvement Effect]

Before: "cat on table"
      -> Vague and lacks detail

After: "A fluffy orange tabby cat sitting on a round wooden
       dining table, natural sunlight streaming through a
       window behind, casting soft shadows. The cat has
       bright green eyes and is looking directly at the camera."
      -> Includes detailed attributes, spatial relationships, and lighting information

4.4 Data Preprocessing Techniques

| Preprocessing Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop center of image to square | Resolution standardization |
| Random Crop | Random position crop | Data augmentation |
| Bucket Sampling | Group images with similar aspect ratios | Multi-aspect ratio training (SDXL) |
| Caption Dropout | Replace caption with empty string at a certain probability | CFG training support |
| Multi-resolution | Progressive learning from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Random shuffle of tag order | Reduced text order bias |
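Two of these techniques, bucket sampling and caption dropout, are simple enough to sketch directly. The bucket resolutions below are illustrative, not the exact SDXL bucket list:

```python
# Sketch: aspect-ratio bucketing (SDXL-style) and caption dropout (for CFG).
import random

BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832)]  # (H, W), illustrative

def assign_bucket(height, width):
    """Pick the bucket whose aspect ratio is closest to the image's,
    so batches can be formed from same-shaped images."""
    ar = height / width
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

def maybe_drop_caption(caption, p=0.1):
    """Replace the caption with an empty string with probability p,
    so the model also learns the unconditional distribution for CFG."""
    return "" if random.random() < p else caption

bucket = assign_bucket(768, 1024)   # a landscape image lands in a wide bucket
```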

5. Fine-tuning & Customization Techniques

Fine-tuning techniques that adapt pretrained T2I models to specific styles, subjects, and control conditions are essential for practical applications.

5.1 LoRA (Low-Rank Adaptation)

LoRA by Hu et al. (2022) is an efficient method for fine-tuning large model weights, and is also extensively used in T2I models.

[LoRA Principle]

Original weights:  W_0 ∈ R^{d×k}  (frozen)
LoRA update:       ΔW = B × A      where A ∈ R^{r×k}, B ∈ R^{d×r}

Final output: h = W_0 x + ΔW x = W_0 x + B(Ax)

- r << min(d, k): low-rank (typically 4, 8, 16, 32, 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- Original weights stay frozen → memory efficient
# LoRA application example (Stable Diffusion U-Net attention layer)
class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1.0):
        super().__init__()
        self.original = original_layer  # frozen
        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # LoRA layers
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank

        # Initialization
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)  # Initialize B to 0 -> identical to original at start

    def forward(self, x):
        original_out = self.original(x)        # Frozen original output
        lora_out = self.lora_B(self.lora_A(x)) # LoRA update
        return original_out + self.scale * lora_out

LoRA Training Configuration (Diffusers-based):

# Diffusers LoRA training execution example
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --rank=4 \
  --mixed_precision="fp16" \
  --output_dir="./sdxl-naruto-lora"
| LoRA Parameter | Typical Range | Impact |
|---|---|---|
| Rank (r) | 4-128 | Higher values increase expressiveness and memory |
| Alpha (α) | equal to rank ~ 2x | Learning rate scaling |
| Target Modules | attn Q,K,V,O + FFN | Application scope |
| Learning Rate | 1e-4 ~ 1e-5 | Convergence speed |
| Training Time | 5-30 min (single GPU) | Enables fast iteration |
| File Size | 1-200 MB | Easy to share and distribute |
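One practical consequence of the formulation h = W_0 x + B(Ax) is that a trained LoRA can be merged into the base weight for deployment, removing the extra layers at inference time. A minimal sketch with toy dimensions:

```python
# Sketch: merging LoRA weights into the base weight.
# W_merged = W_0 + (alpha / rank) * B @ A, matching the formulation above.
import torch

d, k, rank, alpha = 64, 32, 4, 4.0
W0 = torch.randn(d, k)           # frozen base weight
A = torch.randn(rank, k) * 0.01  # LoRA down-projection
B = torch.zeros(d, rank)         # LoRA up-projection (zero-initialized)

W_merged = W0 + (alpha / rank) * (B @ A)

# With B still zero-initialized, merging is a no-op (identical to the base model):
assert torch.allclose(W_merged, W0)
```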

5.2 DreamBooth

DreamBooth by Ruiz et al. (2023) is a technique for injecting the concept of a specific subject into a model using 3-5 images.

[DreamBooth Training Process]

Input: 3-5 images of a specific subject + unique identifier "[V]"
      Example: "a [V] dog" (specific dog)

Training strategy:
1. Fine-tune model with subject images
   - "a [V] dog" → 해당 강아지 이미지

2. Prior Preservation Loss (Key!)
   - Pre-generate "a dog" images with the original model
   - Preserve general dog generation capability during fine-tuning
   - Prevent language drift

L = L_recon([V] images) + λ * L_prior(class images)
# DreamBooth + LoRA training (recommended combination)
# Based on diffusers library
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./my_dog_images" \
  --instance_prompt="a photo of sks dog" \
  --class_data_dir="./class_dog_images" \
  --class_prompt="a photo of dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --max_train_steps=500 \
  --rank=4 \
  --mixed_precision="fp16"

5.3 Textual Inversion

Textual Inversion by Gal et al. (2023) is a method that learns only new token embeddings without modifying any model weights.

[Textual Inversion]

Existing token space:  [cat] [dog] [car] [tree] ...
Add new token:         [S*]  ← new concept to learn
Training: Optimize only the embedding vector of [S*] with 3-5 images
          The entire rest of the model is frozen

Advantage: Minimal parameters (1 token = 768 or 1024 floats)
Disadvantage: Less expressive than LoRA/DreamBooth
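The "only one embedding vector is trainable" setup can be sketched as follows. The loss here is a stand-in for the actual diffusion denoising loss, and the small vocabulary size is illustrative:

```python
# Sketch: Textual Inversion optimizes ONE embedding while the model is frozen.
import torch

embed_dim = 768
s_star = torch.randn(embed_dim, requires_grad=True)      # the only trainable tensor
frozen_embedding_table = torch.randn(1000, embed_dim)    # frozen vocabulary (stand-in)

optimizer = torch.optim.AdamW([s_star], lr=5e-3)

for step in range(10):
    # Stand-in for the denoising loss computed with [S*] in the prompt
    loss = ((s_star - frozen_embedding_table[0]) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Only s_star receives gradients; everything else stays fixed
assert s_star.requires_grad and not frozen_embedding_table.requires_grad
```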

5.4 ControlNet

ControlNet by Zhang & Agrawala (2023) is a method for adding structural conditions (edge, depth, pose, etc.) to pretrained diffusion models.

[ControlNet Architecture]

                    Control Input (e.g., Canny edge)
                              │
                         ┌────┴────┐
                         │ZeroConv │
                         └────┬────┘
    ┌───────────────┐         │
    │ Locked U-Net  │  ┌──────┴───────┐
    │ (original,    │  │ Trainable    │
    │  frozen)      │  │ copy of the  │
    │               │  │ U-Net encoder│
    └───────┬───────┘  └──────┬───────┘
            │                 │
            │            ┌────┴────┐
            │            │ZeroConv │  Output is 0 at training start
            │            └────┬────┘  (starts without affecting the original model)
            │                 │
            └────── + ────────┘  Added to the original U-Net features
                    │
               Final Output

ControlNet's Core Training Technique - Zero Convolution:

# Zero Convolution: Initialize weights and biases to 0
class ZeroConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

# Training start: zero conv output = 0
# -> Adding ControlNet doesn't affect original model output
# -> Gradually reflects control signal as training progresses
| Condition Type | Input | Use Case |
|---|---|---|
| Canny Edge | Edge map | Contour-based generation |
| Depth | Depth map | 3D structure preservation |
| OpenPose | Joint positions | Human pose control |
| Semantic Segmentation | Segmentation map | Layout control |
| Scribble | Scribble | Rough composition |
| Normal Map | Surface normal map | 3D shape control |
| Tile | Low-resolution/tile | Super-resolution |

5.5 IP-Adapter

IP-Adapter (Image Prompt Adapter) by Ye et al. (2023) is an adapter that uses images as prompts to transfer style or subjects.

[IP-Adapter Architecture]

Reference Image ──→ CLIP Image Encoder ──→ image features
                                                │
                                         ┌──────┴──────┐
                                         │ Projection  │ (trainable)
                                         │   Layer     │
                                         └──────┬──────┘
                                         ┌──────┴──────┐
                                         │  Decoupled  │ Separate cross-attention
                                         │ Cross-Attn  │ (split off from the text cross-attn)
                                         └──────┬──────┘
Original U-Net Cross-Attention ────── + ────────┘
(text conditioning)

Output = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
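The decoupled formula above can be sketched with single-head attention. Dimensions and the λ value are illustrative; only the image-side K/V projections are new (and trainable) in the real adapter:

```python
# Sketch of IP-Adapter's decoupled cross-attention: separate K/V projections
# for text and image tokens, with outputs summed under an image weight λ.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attn(q, k, v):
    w = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
    return w @ v

d = 64
to_q = nn.Linear(d, d, bias=False)
to_k_text, to_v_text = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
to_k_img, to_v_img = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)  # new, trainable

x = torch.randn(1, 16, d)         # latent tokens
text_ctx = torch.randn(1, 77, d)  # text embeddings
img_ctx = torch.randn(1, 4, d)    # projected CLIP image tokens
lam = 0.8

q = to_q(x)
out = attn(q, to_k_text(text_ctx), to_v_text(text_ctx)) \
    + lam * attn(q, to_k_img(img_ctx), to_v_img(img_ctx))
```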

5.6 Comparison of Fine-tuning Techniques

| Technique | Modified Target | Training Images | Training Time | File Size | Main Use Case |
|---|---|---|---|---|---|
| LoRA | Attention weights (low-rank) | Tens to thousands | 5-30 min | 1-200MB | Style, concepts |
| DreamBooth | Full model or + LoRA | 3-10 | 5-60 min | 2-7GB (full) or 1-200MB (LoRA) | Specific subject |
| Textual Inversion | Token embeddings only | 3-10 | 30 min to several hours | Few KB | Simple concepts |
| ControlNet | U-Net encoder copy | Tens to hundreds of thousands | Several days | ~1.5GB | Structural control |
| IP-Adapter | Projection + Cross-Attn | Large-scale | Several days | ~100MB | Image prompting |

6.1 Consistency Models

Consistency Models (Song et al., 2023) reduce the multi-step generation of diffusion models to one or a few steps.

[Consistency Models Key Idea]

Diffusion: x_T → x_{T-1}... → x_1 → x_0  (hundreds of steps)

Consistency:
  Train so that every point x_t on the PF-ODE trajectory
  maps to the same x_0

  f_θ(x_t, t) = x_0  ∀t ∈ [0, T]

  Key constraint: f_θ(x_0, 0) = x_0 (self-consistency)

     x_T ────→ f_θ ────→ x_0
      │                    ↑
     x_t ────→ f_θ ───────┘  (maps to the same x_0!)
      │                    ↑
    x_t' ────→ f_θ ───────┘

Two Training Methods:

| Method | Description | Advantage | Disadvantage |
|---|---|---|---|
| Consistency Distillation (CD) | Requires a pretrained diffusion model; simulates the PF-ODE | Higher quality | Needs a teacher model |
| Consistency Training (CT) | Trains directly on real data | No teacher needed | Somewhat lower quality than CD |

Performance:

  • CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
  • ImageNet 64x64: FID 6.20 (1-step)

Follow-up research, Improved Consistency Training (iCT) and Latent Consistency Models (LCM), applied this to large-scale T2I models, enabling 2-4 step generation at the SDXL level.
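The self-consistency boundary condition is enforced architecturally rather than by a penalty: skip/output scalings make the network's contribution vanish at the smallest noise level. A minimal sketch of this parameterization, following the c_skip/c_out form from the Consistency Models paper (σ_data = 0.5), with a stand-in network:

```python
# Sketch: consistency-function parameterization f_θ(x, σ) = c_skip(σ)·x + c_out(σ)·F_θ(x, σ),
# chosen so that f_θ(x, ε) = x exactly (the boundary condition).
import torch

SIGMA_DATA = 0.5
EPS = 0.002   # smallest noise level

def consistency_fn(F_theta, x, sigma):
    c_skip = SIGMA_DATA**2 / ((sigma - EPS)**2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (sigma - EPS) / (sigma**2 + SIGMA_DATA**2) ** 0.5
    return c_skip * x + c_out * F_theta(x, sigma)

net = lambda x, s: torch.tanh(x)   # stand-in for the trained network
x = torch.randn(2, 3)

# At sigma = EPS, c_skip = 1 and c_out = 0, so the output equals the input
assert torch.allclose(consistency_fn(net, x, EPS), x)
```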

6.2 The Spread of DiT (Diffusion Transformer) Architecture

Since 2024, DiT has been replacing U-Net to become the mainstream backbone for T2I:

| Model | Year | Backbone | Parameters | Key Features |
|---|---|---|---|---|
| DiT (original) | 2023 | Transformer | 675M | Class-conditional, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, low-cost training |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K resolution, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, triple text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation variants |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | Chinese-English bilingual |
| Lumina-T2X | 2024 | DiT | Various | Multi-modal generation |

6.3 PixArt-alpha and PixArt-sigma

PixArt-alpha (Chen et al., 2023) is a pioneering model for efficient DiT training:

Core innovation - Training Decomposition:

[PixArt-alpha 3-Stage Training]

Stage 1: Pixel dependency learning (low cost)
  - Start from an ImageNet-pretrained DiT
  - Foundation for the class-conditional → T2I transition

Stage 2: Text-image alignment learning
  - Inject text conditions via cross-attention
  - Use high-quality synthetic captions generated with LLaVA

Stage 3: High-quality aesthetic learning
  - Fine-tune on a high-quality aesthetic dataset
  - Leverages JourneyDB and similar sources

Total training cost: ~675 A100 GPU days
(10.8% of SD 1.5's ~6,250 A100 GPU days)

Improvements in PixArt-sigma (Chen et al., 2024):

  • Weak-to-Strong Training: Enhanced training with higher quality data based on PixArt-alpha
  • KV Compression: Compress Key and Value in Attention for improved efficiency, enabling 4K resolution
  • Comparable performance to SDXL (2.6B) with only 0.6B parameters

6.4 Comparison of SDXL, SD3, and Flux

[Stable Diffusion Lineage by Generation]

SD 1.x (2022)     SDXL (2023)       SD3 (2024)         Flux (2024)
    │                  │                │                   │
  U-Net 860M       U-Net 2.6B      MM-DiT 2-8B        MM-DiT ~12B
    │                  │                │                   │
  CLIP ViT-L       CLIP-L +          CLIP-L +            CLIP-L +
                   OpenCLIP-G        OpenCLIP-G +        OpenCLIP-G +
                                     T5-XXL               T5-XXL
    │                  │                │                   │
  Diffusion        Diffusion        Rectified            Rectified
  (DDPM)           (DDPM)           Flow                 Flow
    │                  │                │                   │
  512x512          1024x1024        1024x1024            1024x1024+
    │                  │                │                   │
  CFG 7.5          CFG 5-9          CFG 3.5-7            Guidance
                                                         Distillation
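The shift from DDPM to Rectified Flow in the diagram above changes the training target: instead of predicting noise along a curved diffusion path, the model regresses a constant velocity along a straight interpolation. A minimal sketch with a stand-in model (uniform timestep sampling here; SD3 uses a logit-normal distribution):

```python
# Sketch of the Rectified Flow objective used by SD3/Flux:
# x_t = (1 - t)·x_0 + t·ε, target velocity v = ε - x_0.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0):
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1, 1, 1)   # uniform here; SD3 uses logit-normal
    x_t = (1 - t) * x0 + t * noise          # straight-line interpolation
    target = noise - x0                     # constant velocity along the path
    pred = model(x_t, t)
    return F.mse_loss(pred, target)

model = lambda x, t: torch.zeros_like(x)    # stand-in velocity predictor
loss = rectified_flow_loss(model, torch.randn(2, 4, 8, 8))
```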

6.5 Training Innovations of DALL-E 3

The core innovation of DALL-E 3 (Betker et al., 2023) lies in improving training data caption quality:

  1. Image Captioner Training: Separately train a CoCa-based image captioning model
  2. Synthetic Caption Generation: Re-label entire training data with detailed synthetic captions
  3. Caption Mixing: Train with 95% synthetic + 5% original captions
  4. Descriptive vs Short: Detailed descriptive captions outperform short tag-style captions
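The 95/5 caption mixing in step 3 is a one-line sampling rule, sketched here for clarity:

```python
# Sketch: DALL-E 3-style caption mixing during training.
import random

def pick_caption(original, synthetic, p_synth=0.95):
    """Use the detailed synthetic caption with probability p_synth,
    otherwise fall back to the original alt-text caption."""
    return synthetic if random.random() < p_synth else original
```

Keeping a small fraction of original captions prevents the model from overfitting to the captioner's writing style.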

6.6 Three Key Insights of Playground v2.5

Playground v2.5 (Li et al., 2024) surpassed DALL-E 3 and Midjourney 5.2 through training strategy improvements based on the SDXL architecture:

1. EDM Noise Schedule Adoption:

# EDM Framework (Karras et al., 2022)
# σ(t)-based noise schedule - guarantees Zero Terminal SNR
# Greatly improves color/contrast over SD's original linear schedule

def edm_precondition(sigma, x_noisy, F_theta):
    """EDM Preconditioning"""
    c_skip = 1.0 / (sigma ** 2 + 1)
    c_out = sigma / (sigma ** 2 + 1).sqrt()
    c_in = 1.0 / (sigma ** 2 + 1).sqrt()
    c_noise = sigma.log() / 4

    D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
    return D_x

2. Multi-Aspect Ratio Training:

  • Bucketed dataset: Group images with similar aspect ratios into buckets to form batches
  • Supports various aspect ratios during training (1:1, 4:3, 16:9, etc.)

3. Human Preference Alignment:

  • Training strategy utilizing human preference data
  • Maximize aesthetic quality through quality-tuning

7. Practical Training Pipeline Guide

7.1 Training Infrastructure

GPU Requirements

| Training Scale | Recommended GPU | VRAM | Training Duration | Cost (Estimated) |
|---|---|---|---|---|
| LoRA fine-tuning | 1× RTX 3090/4090 | 24GB | 5-30 min | < $1 |
| DreamBooth | 1× A100 40GB | 40GB | 30-60 min | $2-5 |
| ControlNet training | 4-8× A100 80GB | 320-640GB | 2-5 days | $500-2,000 |
| SD 1.5-scale training | 256× A100 80GB | ~20TB | 24 days | ~$150K |
| SDXL-scale training | 512-1024× A100 80GB | ~40-80TB | Weeks | ~$500K-1M |
| SD3/Flux-scale training | 1024+× H100 80GB | ~80TB+ | Weeks to months | > $1M |

Distributed Training Strategy

[Large-Scale Distributed Training Configuration]

Data Parallel (DP/DDP)

  GPU 0        GPU 1        GPU 2        GPU 3
 ┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐
 │Full  │    │Full  │    │Full  │    │Full  │
 │Model │    │Model │    │Model │    │Model │
 │Copy  │    │Copy  │    │Copy  │    │Copy  │
 └──────┘    └──────┘    └──────┘    └──────┘
 Batch 1     Batch 2     Batch 3     Batch 4

 -> Each GPU processes a different data batch
 -> Gradients are synchronized with All-Reduce

FSDP (Fully Sharded Data Parallel)

  GPU 0        GPU 1        GPU 2        GPU 3
 ┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐
 │Shard │    │Shard │    │Shard │    │Shard │
 │ 1/4  │    │ 2/4  │    │ 3/4  │    │ 4/4  │
 └──────┘    └──────┘    └──────┘    └──────┘

 -> Model parameters are sharded across GPUs
 -> Only the needed shards are All-Gathered during forward/backward
 -> Maximizes memory efficiency (enables 8B+ model training)

7.2 Representative Training Framework: Diffusers

HuggingFace's Diffusers library is the de facto standard for T2I model training.

# Diffusers-based Text-to-Image full training pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F

# 1. Load models
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)

# 2. Freeze VAE and Text Encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

# 3. Accelerator setup (distributed training + Mixed Precision)
accelerator = Accelerator(
    mixed_precision="fp16",          # or "bf16"
    gradient_accumulation_steps=4,
)

# 4. Optimizer
optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)

# 5. EMA setup
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
    unet.parameters(),
    decay=0.9999,
    use_ema_warmup=True,
)

# 6. Prepare for distributed training
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)

# 7. Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        with accelerator.accumulate(unet):
            images = batch["images"]
            captions = batch["captions"]

            # Latent encoding
            with torch.no_grad():
                latents = vae.encode(images).latent_dist.sample()
                latents = latents * vae.config.scaling_factor

            # Text encoding
            with torch.no_grad():
                text_inputs = tokenizer(captions, padding="max_length",
                                       max_length=77, return_tensors="pt")
                text_embeds = text_encoder(text_inputs.input_ids)[0]

            # Add noise
            noise = torch.randn_like(latents)
            timesteps = torch.randint(0, 1000, (latents.shape[0],))
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

            # Classifier-Free Guidance: random caption dropout
            if torch.rand(1) < 0.1:  # unconditional with 10% probability
                text_embeds = torch.zeros_like(text_embeds)

            # Predict noise
            noise_pred = unet(noisy_latents, timesteps, text_embeds).sample

            # Loss computation
            loss = F.mse_loss(noise_pred, noise)

            # Backward
            accelerator.backward(loss)
            accelerator.clip_grad_norm_(unet.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()

            # EMA update
            ema_unet.step(unet.parameters())

7.3 Mixed Precision Training

Mixed Precision is a technique that improves memory and computational efficiency by combining FP32 and FP16/BF16.

[Mixed Precision Training]

Forward Pass:
  - Model weights: FP16/BF16 (half memory)
  - Activation: FP16/BF16

Loss Scaling:
  - Multiply loss by a large scale (e.g., 2^16) to prevent gradient underflow
  - Scale down gradient again after backward

Backward Pass:
  - Gradient: FP16/BF16

Optimizer Step:
  - Master Weights: FP32 (maintain precision!)
  - Update FP32 master weights then create FP16 copy
| Precision | Memory | Compute Speed | Numerical Stability | Recommended For |
|---|---|---|---|---|
| FP32 | 4 bytes | baseline | Highest | Optimizer state |
| FP16 | 2 bytes | ~2x | Low (overflow risk) | Forward/Backward |
| BF16 | 2 bytes | ~2x | High (wide range) | Recommended on H100/A100 |
| TF32 | 4 bytes (storage) | ~1.5x | High | A100 default |
# BF16 Mixed Precision config (accelerate-based)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
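The loss-scaling mechanism described above maps directly onto PyTorch's gradient scaler. A minimal sketch (the tiny model is a stand-in; BF16's wider exponent range normally makes the scaler unnecessary, so it is disabled on the CPU/BF16 path here):

```python
# Sketch: FP16 loss scaling with torch AMP - scale the loss before backward,
# unscale gradients before the optimizer step, adjust the scale afterward.
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(8, 8).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # pass-through when disabled

x = torch.randn(4, 8, device=device)
amp_dtype = torch.float16 if use_cuda else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = model(x).pow(2).mean()   # forward pass in half precision

scaler.scale(loss).backward()   # loss multiplied by the scale factor
scaler.step(optimizer)          # gradients unscaled; step skipped on inf/NaN
scaler.update()                 # scale factor adjusted for the next step
```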

7.4 EMA (Exponential Moving Average)

EMA is a technique that maintains a moving average of model weights during training to achieve more stable results during inference. It is used in nearly all T2I model training.

[EMA Update]

θ_ema ← λ * θ_ema + (1 - λ) * θ_model

- λ: decay rate (typically 0.9999 ~ 0.99999)
- θ_model: current model weights being trained
- θ_ema: EMA weights (used at inference)
- Effect: smooths out gradient noise for more stable weights
# Diffusers EMA implementation
from diffusers.training_utils import EMAModel

# Create EMA model
ema_model = EMAModel(
    unet.parameters(),
    decay=0.9999,              # decay rate
    use_ema_warmup=True,       # use warmup
    inv_gamma=1.0,             # warmup parameter
    power=3/4,                 # warmup parameter
)

# Update at every training step
ema_model.step(unet.parameters())

# Apply EMA weights at inference
ema_model.copy_to(unet.parameters())

# Or use context manager
with ema_model.average_parameters():
    # EMA weights are used inside this block
    output = unet(noisy_latents, timesteps, text_embeds)

7.5 Training Hyperparameter Guide

| Hyperparameter | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (total) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | Original resolution |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
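The warmup-steps and cosine-scheduler settings from the table combine into one learning-rate curve. A minimal sketch using `LambdaLR` (step counts taken from the table; the combination itself is a common choice, not mandated by any one model):

```python
# Sketch: linear warmup followed by cosine decay, as a LambdaLR multiplier.
import math
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

warmup_steps, total_steps = 10_000, 500_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup 0 → 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay 1 → 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```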

8. Key Paper References

8.1 Core Paper Table

| # | Paper Title | Authors | Year | Key Contribution | Link |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN framework proposal | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector-quantized discrete latent space | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based generator, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | Large-scale GAN training, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | Hierarchical VQ-VAE, high-resolution generation | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Practical training of diffusion models | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP contrastive learning, image-text alignment | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent diffusion, cross-attention conditioning | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP-based 2-stage diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | Effect of the T5-XXL text encoder, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG training technique, joint unconditional-conditional training | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, scaling to 20B | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning technique | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | Systematic diffusion design-space analysis, preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | Personalization via new token embedding learning | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | Subject personalization with few images, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | Adds structural control (edge, depth, pose) | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step generation, PF-ODE consistency learning | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | Larger U-Net, dual text encoder, multi-AR training | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE-based Flow Matching framework | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, optimal transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | Image prompt adapter, decoupled cross-attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | Efficient autoregressive T2I, retrieval-augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | Low-cost DiT training, training decomposition strategy | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Dramatic quality improvement via synthetic captions | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-strong training, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, large-scale Rectified Flow, logit-normal sampling | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM noise schedule, multi-AR, human preference | arXiv:2402.17245 |

8.2 Additional Reference Papers

| Paper Title | Year | Key Point |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 5.85 billion open image-text dataset |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, speed improvement |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | Inference acceleration via progressive distillation |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step generation |
| Latent Consistency Models | 2024 | LCM, SDXL-based few-step generation |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL generation |
| Stable Cascade | 2024 | Wuerstchen-based 3-stage hierarchical generation |

9. Conclusion and Outlook

Text-to-Image model training methodologies started from GAN's adversarial training, passed through Diffusion's iterative denoising, and are now converging on a new paradigm of Flow Matching + DiT.

Key Trend Summary

[T2I Training Methodology Evolution]

Efficiency:   Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
              (months, $1M+)    (minutes, <$1)   (seconds)

Architecture: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
              (SD 1.x-SDXL)   (DiT, PixArt)  (SD3, Flux)

Generation speed: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
                  (DDPM)            (DDIM, DPM++)   (LCM, LADD, CM)

Data quality: Web crawling ──→ Filtering ──→ Synthetic Captions
              (LAION raw)      (aesthetic)   (DALL-E 3 style)

Text understanding: CLIP only ──→ CLIP + T5 ──→ Triple encoder
                    (SD 1.x)      (Imagen)      (SD3, Flux)

Future Outlook

  1. Maximizing training efficiency: As demonstrated by PixArt-alpha, the trend of reducing training costs to 1/10 or less while maintaining quality will continue.

  2. Data-Centric AI approach: As DALL-E 3 demonstrated, data quality and captioning are becoming more important than architecture.

  3. Few-Step / One-Step generation: Distillation techniques such as Consistency Models, LCM, and LADD will continue to advance, making real-time generation the standard.

  4. Unified Multi-Modal Generation: Expanding to models that integrate not only text-to-image but also video, 3D, and audio.

  5. Advanced Personalization: Beyond LoRA, DreamBooth, and IP-Adapter, more accurate subject reproduction with even less data will become possible.

T2I model training methodology has entered an era where the key is not simply "training a larger model with more data," but rather what data, with what schedule, and with what conditioning to train with. We hope the methodologies covered in this article can be used as a foundation for training your own T2I models or effectively customizing existing ones.


References