
Complete Guide to Text-to-Image Model Training Methodologies: From GAN to Flow Matching


1. Introduction: The Evolution of Text-to-Image Generative Models

Text-to-Image (T2I) generative models are technologies that produce high-resolution images from natural language text prompts, and have undergone rapid development over the past several years. The trajectory of this field can be broadly divided into four paradigms.

[Text-to-Image Model Evolution Timeline]

2014-2019 (GAN): StackGAN, AttnGAN, StyleGAN, BigGAN, GigaGAN
  Features: adversarial training, mode collapse issues, fast generation

2017-2020 (VAE/VQ-VAE): VQ-VAE, VQ-VAE-2, dVAE
  Features: discrete latent space, codebook learning, two-stage training

2020-2023 (Diffusion Models): DDPM (2020), DALL-E 2 (2022), Imagen (2022), SD 1.x (2022), SDXL (2023)
  Features: iterative denoising, classifier-free guidance, latent space, U-Net backbone

2023-Present (Flow Matching + DiT): SD3 (2024), Flux (2024), PixArt-Sigma (2024)
  Features: straight paths, ODE-based, fewer steps, DiT backbone, scalable

1.1 Why Training Methodology Matters

The quality of T2I models is critically determined not only by architecture design but also by training methodology. Even with identical architectures, generation quality varies dramatically depending on noise scheduling, conditioning approaches, data quality, and training strategies. A prime example is DALL-E 3, which achieved dramatic performance improvements over its predecessor through caption quality improvement alone without any architecture changes.

This article provides an in-depth, paper-based analysis of core training methodologies for each paradigm, covering practical training pipeline configuration as well.


2. Training Methodologies by Core Architecture

2.1 GAN-Based: Adversarial Training

A Generative Adversarial Network (GAN) is a framework in which two networks, the Generator and the Discriminator, are trained against each other.

2.1.1 Basic Training Principles

The training objective function of GAN is defined as a minimax game:

min_G max_D  V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

- G (Generator): generates images from random noise z
- D (Discriminator): distinguishes real images from generated images
- Training goal: G tries to fool D, while D tries to classify correctly

2.1.2 StyleGAN Training Strategy

StyleGAN (Karras et al., 2019) introduced Progressive Growing and Style-based Generator to enable high-quality image generation.

Core Training Techniques:

| Technique | Description | Effect |
|---|---|---|
| Progressive Growing | Start from low resolution (4x4) and progressively increase | Improved training stability |
| Style Mixing | Inject different latent codes into different layers | Increased diversity |
| Path Length Regularization | Generator Jacobian regularization | Improved generation quality |
| R1 Regularization | Discriminator gradient penalty | Training stabilization |
| Lazy Regularization | Apply regularization every 16 steps instead of every step | Improved training efficiency |
# StyleGAN2 core training loop (simplified)
for real_images, _ in dataloader:
    # 1. Discriminator training
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)

    d_real = discriminator(real_images)
    d_fake = discriminator(fake_images.detach())

    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

    # R1 Regularization (lazy: every 16 steps)
    if step % 16 == 0:
        real_images.requires_grad_(True)
        d_real = discriminator(real_images)
        # create_graph=True so the penalty is differentiable w.r.t. D's weights
        r1_grads = torch.autograd.grad(d_real.sum(), real_images, create_graph=True)[0]
        r1_penalty = r1_grads.square().sum(dim=[1, 2, 3]).mean()
        d_loss += 10.0 * r1_penalty

    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Generator training
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z)
    d_fake = discriminator(fake_images)
    g_loss = F.softplus(-d_fake).mean()

    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()

2.1.3 Large-Scale Training with BigGAN

BigGAN (Brock et al., 2019) scaled GANs to unprecedented batch sizes and model capacity, employing the following training strategies:

  • Large-scale batches: Increase batch size up to 2048 for improved training stability and quality
  • Class-conditional Batch Normalization: Inject class information into Batch Normalization parameters
  • Truncation Trick: Truncate latent distribution at inference to control quality-diversity trade-off
  • Orthogonal Regularization: Maintain orthogonality of weight matrices to prevent mode collapse
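The Truncation Trick from the list above is simple to sketch: sample z from a standard normal and resample any entry whose magnitude exceeds a threshold, trading diversity for fidelity. A minimal illustration (the function name and default threshold are illustrative; BigGAN applies this only at inference):

```python
import torch

def truncated_z(batch_size, latent_dim, threshold=0.5, generator=None):
    """Sample z ~ N(0, I), resampling entries until |z_i| <= threshold.

    Lower thresholds concentrate samples near the mode: higher fidelity,
    lower diversity (the quality-diversity trade-off described above).
    """
    z = torch.randn(batch_size, latent_dim, generator=generator)
    mask = z.abs() > threshold
    while mask.any():
        # Resample only the out-of-range entries
        z[mask] = torch.randn(int(mask.sum()), generator=generator)
        mask = z.abs() > threshold
    return z
```

In practice the threshold is a user-facing knob: threshold near 0 yields prototypical but repetitive samples, while a large threshold recovers the untruncated prior.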

2.1.4 Limitations of GAN-Based T2I

GAN-based approaches ceded dominance to Diffusion-based models due to the following fundamental limitations:

  • Mode Collapse: Limited generation diversity
  • Training Instability: Unstable training sensitive to hyperparameters
  • Text Conditioning difficulty: Difficult to accurately reflect complex text prompts
  • Scaling limitations: Increased training instability at large scale

2.2 VAE-Based: Codebook Learning and Discrete Latent Space

2.2.1 VQ-VAE: Vector Quantized Variational Autoencoder

VQ-VAE (van den Oord et al., 2017) is an approach that learns a discrete latent space instead of a continuous one.

[VQ-VAE Architecture]

Input Image (256x256) --> Encoder E(x) --> z_e --> [Codebook lookup] --> z_q --> Decoder D(z_q) --> Reconstructed Image

Codebook: K code vectors {e_1, e_2, ..., e_K}

  z_q = e_k  where k = argmin_j ||z_e - e_j||
  (quantize to the nearest code vector)

VQ-VAE Training Loss Function:

L = ||x - D(z_q)||²                    # Reconstruction Loss
  + ||sg[z_e] - e||²                   # Codebook Loss (can be replaced by EMA updates)
  + β * ||z_e - sg[e]||²               # Commitment Loss

- sg[·]: stop-gradient operator
- β: commitment loss weight (typically 0.25)
- z_e: encoder output
- e: selected codebook vector

Since the quantization operation is non-differentiable, the Straight-Through Estimator (STE) is used to pass gradients to the encoder. The codebook itself is updated via Exponential Moving Average (EMA).

# VQ-VAE Codebook core training code
class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        # z_e: (B, D, H, W) -> (B*H*W, D)
        flat_z = z_e.permute(0, 2, 3, 1).reshape(-1, z_e.shape[1])

        # Find nearest codebook vector
        distances = (flat_z ** 2).sum(dim=1, keepdim=True) \
                  + (self.embedding.weight ** 2).sum(dim=1) \
                  - 2 * flat_z @ self.embedding.weight.t()
        indices = distances.argmin(dim=1)
        z_q = self.embedding(indices).view_as(z_e.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        # Loss computation
        codebook_loss = F.mse_loss(z_q, z_e.detach())       # moves codes toward encoder outputs
        commitment_loss = F.mse_loss(z_q.detach(), z_e)     # commits encoder to chosen codes
        loss = codebook_loss + self.commitment_cost * commitment_loss

        # Straight-Through Estimator
        z_q_st = z_e + (z_q - z_e).detach()
        return z_q_st, loss, indices
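The EMA codebook update mentioned above replaces the codebook loss term entirely: each code vector tracks a running average of the encoder outputs assigned to it. A minimal sketch of the EMA variant described in van den Oord et al. (2017); the `decay` and `eps` defaults are common choices, not values fixed by the paper:

```python
import torch
import torch.nn.functional as F

class EMACodebook(torch.nn.Module):
    """Codebook updated by EMA cluster statistics instead of a codebook loss."""

    def __init__(self, num_embeddings, embedding_dim, decay=0.99, eps=1e-5):
        super().__init__()
        self.decay, self.eps = decay, eps
        embed = torch.randn(num_embeddings, embedding_dim)
        self.register_buffer("embed", embed)
        self.register_buffer("cluster_size", torch.zeros(num_embeddings))
        self.register_buffer("embed_avg", embed.clone())

    @torch.no_grad()
    def update(self, flat_z, indices):
        # One-hot assignments of encoder outputs to codes: (N, K)
        onehot = F.one_hot(indices, self.embed.shape[0]).type_as(flat_z)
        # EMA of per-code counts and of summed encoder outputs
        self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
        self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat_z, alpha=1 - self.decay)
        # Laplace smoothing avoids division by zero for unused codes
        n = self.cluster_size.sum()
        size = (self.cluster_size + self.eps) / (n + self.embed.shape[0] * self.eps) * n
        self.embed.copy_(self.embed_avg / size.unsqueeze(1))
```

`update` would be called once per batch with the flattened encoder outputs and the chosen code indices from the quantizer above; only the commitment loss then remains in the training objective.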

2.2.2 VQ-VAE-2: Hierarchical Codebook Learning

VQ-VAE-2 (Razavi et al., 2019) introduced multi-level hierarchical quantization to significantly improve image quality.

[VQ-VAE-2 Hierarchical Structure]

Top Level (low resolution)
  32x32 codebook grid: global structure (composition, overall shape)
        |
Bottom Level (high resolution)
  64x64 codebook grid: fine detail (textures, edges)

The image generation pipeline of VQ-VAE-2 consists of the following two stages:

  1. Stage 1: Train VQ-VAE-2 to encode images into hierarchical discrete codes
  2. Stage 2: Learn the prior of discrete codes with autoregressive models such as PixelCNN

This approach directly influenced the dVAE (discrete VAE) used in DALL-E.


2.3 Diffusion-Based: The Core of Current T2I

Diffusion models are the mainstream paradigm for current T2I generation. They learn a forward process that gradually adds noise to data, and a reverse process that recovers the data from noise.

2.3.1 DDPM: Denoising Diffusion Probabilistic Models

DDPM by Ho et al. (2020) is the key paper that elevated Diffusion Models to a practical level.

Forward Process (Diffusion):

q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) * x_{t-1}, β_t * I)

- Add a small amount of Gaussian noise at each timestep t
- β_t: noise schedule (typically linear or cosine)
- After T steps, x_T is approximately N(0, I) (pure Gaussian noise)

Noise can be added directly at any timestep t in closed form:

q(x_t | x_0) = N(x_t; sqrt(ᾱ_t) * x_0, (1-ᾱ_t) * I)

where ᾱ_t = ∏_{s=1}^{t} α_s,  α_t = 1 - β_t

=> x_t = sqrt(ᾱ_t) * x_0 + sqrt(1-ᾱ_t) * ε,  ε ~ N(0, I)

Reverse Process (Denoising):

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² * I)

- A neural network ε_θ predicts the noise ε that was added to x_t
- The predicted noise is removed to recover x_{t-1}

Training Objective (Simple Loss):

L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

- t ~ Uniform(1, T)
- ε ~ N(0, I)
- x_t = sqrt(ᾱ_t) * x_0 + sqrt(1-ᾱ_t) * ε

# DDPM core training loop
def train_step(model, x_0, noise_scheduler):
    batch_size = x_0.shape[0]

    # 1. Random timestep sampling
    t = torch.randint(0, num_timesteps, (batch_size,))

    # 2. Noise sampling
    noise = torch.randn_like(x_0)

    # 3. Forward process: generate x_t
    alpha_bar_t = noise_scheduler.alpha_bar[t]
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise

    # 4. Predict noise
    noise_pred = model(x_t, t)

    # 5. Loss computation (MSE)
    loss = F.mse_loss(noise_pred, noise)

    return loss

2.3.2 Noise Scheduling

The noise schedule determines the amount of noise added at each timestep in the forward process and has a decisive impact on generation quality.

| Schedule | Formula | Features | Models Used |
|---|---|---|---|
| Linear | β_t = β_min + (β_max - β_min) · t/T | Simple, but noise increases sharply at the end | DDPM |
| Cosine | ᾱ_t = cos²((t/T + s)/(1+s) · π/2) | Smooth transition, excellent information preservation | Improved DDPM |
| Scaled Linear | β_t = (√β_min + t/T · (√β_max - √β_min))² | Used in SD 1.x | Stable Diffusion |
| Sigmoid | β_t = σ(-6 + 12·t/T) | Gradual change at both ends | Some research |
| EDM | σ(t) = t, log-normal sampling | Theoretically near-optimal | Playground v2.5, EDM |
| Zero Terminal SNR | Ensures SNR(T) = 0 | Guarantees starting from pure noise | SD3, Flux |

Playground v2.5 (Li et al., 2024) adopted the EDM (Karras et al., 2022) noise schedule, greatly improving color and contrast. A related fix is Zero Terminal SNR (Lin et al., 2024): the Signal-to-Noise Ratio (SNR) at the final timestep T must be exactly 0 during training, so that sampling genuinely starts from pure noise.

# Cosine Schedule implementation
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

# EDM Noise Schedule (Karras et al., 2022)
def edm_sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    step_indices = torch.arange(num_steps)
    t_steps = (sigma_max ** (1/rho) + step_indices / (num_steps - 1)
               * (sigma_min ** (1/rho) - sigma_max ** (1/rho))) ** rho
    return t_steps
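The Zero Terminal SNR row in the table above corresponds to a rescaling trick (Lin et al., 2024) that can be applied to any existing beta schedule: shift and scale sqrt(ᾱ_t) so that its final value is exactly zero, then recover the betas. A sketch:

```python
import torch

def rescale_betas_zero_snr(betas):
    """Rescale a beta schedule so SNR(T) = 0 (Lin et al., 2024).

    sqrt(alpha_bar) is shifted so its last value is exactly 0 and rescaled
    to preserve its first value, then betas are recovered from the ratios.
    """
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    sqrt_ab = alphas_bar.sqrt()

    sqrt_ab_0, sqrt_ab_T = sqrt_ab[0].clone(), sqrt_ab[-1].clone()
    # Shift so the terminal value hits zero; rescale to keep the initial value
    sqrt_ab = (sqrt_ab - sqrt_ab_T) * sqrt_ab_0 / (sqrt_ab_0 - sqrt_ab_T)

    alphas_bar = sqrt_ab ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]   # recover per-step alphas
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas
```

After rescaling, the final beta equals 1, so x_T carries no signal; Lin et al. pair this with v-prediction, since ε-prediction is degenerate at SNR = 0.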

2.3.3 Latent Diffusion Model (LDM) - The Core of Stable Diffusion

Latent Diffusion Model (LDM) by Rombach et al. (2022) dramatically improved computational efficiency by performing diffusion in latent space instead of pixel space. This is the core idea behind Stable Diffusion.

[Latent Diffusion Model Architecture]

Text Prompt --> CLIP Text Encoder --> text embeddings
                                          |
                                          v  (cross-attention)
Image (512x512) --> VAE Encoder --> latent z (64x64x4, 8x downsampling)
                --> U-Net (denoising in latent space)
                --> VAE Decoder --> Output Image (512x512)

Training: diffusion in latent space
Inference: random noise z_T -> U-Net denoising -> VAE decode -> image

LDM Training Pipeline:

  1. Stage 1 - Autoencoder Training: Pretrain VAE (KL-regularized) on image datasets

    • Encoder: Image x (H x W x 3) -> latent z (H/f x W/f x c), f=8 is typical
    • Decoder: latent z -> Reconstructed image
    • Loss: Reconstruction + KL Divergence + Perceptual Loss + GAN Loss
  2. Stage 2 - Diffusion Model Training: Diffusion in the latent space of the frozen Autoencoder

    • Add noise to latent z_0 = Encoder(x) to generate z_t
    • U-Net predicts noise from z_t
    • Text conditioning is injected via cross-attention
# Latent Diffusion core training
class LatentDiffusionTrainer:
    def __init__(self, vae, unet, text_encoder, noise_scheduler):
        self.vae = vae              # Frozen
        self.unet = unet            # Trainable
        self.text_encoder = text_encoder  # Frozen
        self.noise_scheduler = noise_scheduler

    def train_step(self, images, captions):
        # 1. Latent encoding with VAE (no gradient needed)
        with torch.no_grad():
            latents = self.vae.encode(images).latent_dist.sample()
            latents = latents * self.vae.config.scaling_factor  # 0.18215

        # 2. Text embedding (no gradient needed)
        with torch.no_grad():
            text_embeddings = self.text_encoder(captions)

        # 3. Add noise
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, 1000, (latents.shape[0],))
        noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)

        # 4. Predict noise
        noise_pred = self.unet(noisy_latents, timesteps, text_embeddings)

        # 5. MSE loss
        loss = F.mse_loss(noise_pred, noise)
        return loss

2.3.4 Structure of the U-Net Backbone

The U-Net used in Stable Diffusion 1.x/2.x and SDXL has the following structure:

[U-Net with Cross-Attention Structure]

Input z_t --> Down Blocks (64x64 -> 32x32 -> 16x16)
          --> Middle Block (16x16)
          --> Up Blocks (16x16 -> 32x32 -> 64x64), with skip connections
              from the matching Down Blocks
          --> Output ε_θ

Inside each block:
- ResNet Block: GroupNorm -> SiLU -> Conv (x2), with the timestep embedding injected
- Self-Attention Block: LayerNorm -> Self-Attention -> skip connection
- Cross-Attention Block: LayerNorm -> Attention(Q, K, V), where
    Q = Linear(latent features), K = Linear(text embeddings),
    V = Linear(text embeddings)   <- text conditioning
- Feed-Forward Block: LayerNorm -> Linear -> GEGLU -> Linear -> skip connection

SDXL (Podell et al., 2023) expanded the U-Net by approximately 3x (~2.6B parameters), uses two text encoders (OpenCLIP ViT-bigG and CLIP ViT-L), and applies improvements including training at various aspect ratios.

| Model | U-Net Params | Text Encoder(s) | Resolution | VAE Downsampling |
|---|---|---|---|---|
| SD 1.5 | ~860M | CLIP ViT-L/14 (1) | 512x512 | 8x |
| SD 2.1 | ~865M | OpenCLIP ViT-H/14 (1) | 768x768 | 8x |
| SDXL | ~2.6B | OpenCLIP ViT-bigG + CLIP ViT-L (2) | 1024x1024 | 8x |
| SDXL Refiner | ~2.3B | OpenCLIP ViT-bigG (1) | 1024x1024 | 8x |
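Part of SDXL's multi-aspect-ratio training is micro-conditioning: the original image size, crop coordinates, and target size are Fourier-embedded like timesteps and fed to the model alongside the pooled text embedding. A rough sketch of the idea; the function name and `embed_dim` are illustrative, not SDXL's exact implementation:

```python
import math
import torch

def sdxl_add_time_ids(orig_size, crop_coords, target_size, embed_dim=256):
    """Sinusoidally embed SDXL-style size/crop conditioning values.

    The six scalars (orig H/W, crop top/left, target H/W) are embedded with
    the same sin/cos scheme used for timesteps, then concatenated; the model
    adds the result to its time embedding.
    """
    ids = torch.tensor([*orig_size, *crop_coords, *target_size],
                       dtype=torch.float32)
    half = embed_dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = ids[:, None] * freqs[None, :]               # (6, half)
    emb = torch.cat([args.sin(), args.cos()], dim=-1)  # (6, embed_dim)
    return emb.flatten()                               # (6 * embed_dim,)
```

At inference, setting crop coordinates to (0, 0) and the original size to the target size steers the model toward centered, full-frame compositions.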

2.3.5 Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) by Ho & Salimans (2022) is a core training technique for modern T2I models.

Problems with Traditional Classifier Guidance:

  • Requires training a separate classifier
  • Needs a classifier that works on noisy images
  • Requires computing classifier gradients during inference

Classifier-Free Guidance Key Idea:

During training, text conditioning is replaced with an empty string ("") with a certain probability (typically 10-20%), so that a single model simultaneously learns both conditional and unconditional generation.

During training:
  - with probability p_uncond (e.g., 10%): ε_θ(x_t, t, ∅)   (unconditional)
  - with probability 1 - p_uncond:         ε_θ(x_t, t, c)   (conditional)

At inference:
  ε_guided = ε_θ(x_t, t, ∅) + w * (ε_θ(x_t, t, c) - ε_θ(x_t, t, ∅))

  - w: guidance scale (typically 7.5 ~ 15)
  - w = 1: the conditional prediction as-is
  - w > 1: pushed further in the direction of the text condition

# Classifier-Free Guidance training implementation
def train_step_cfg(model, x_0, text_cond, p_uncond=0.1):
    noise = torch.randn_like(x_0)
    t = torch.randint(0, T, (x_0.shape[0],))
    x_t = add_noise(x_0, noise, t)

    # Randomly drop conditioning
    mask = torch.rand(x_0.shape[0]) < p_uncond
    cond = text_cond.clone()
    cond[mask] = empty_text_embedding  # null conditioning

    noise_pred = model(x_t, t, cond)
    loss = F.mse_loss(noise_pred, noise)
    return loss

# Classifier-Free Guidance inference
def sample_cfg(model, x_T, text_cond, guidance_scale=7.5):
    x_t = x_T
    for t in reversed(range(T)):
        # Unconditional prediction
        eps_uncond = model(x_t, t, empty_text_embedding)
        # Conditional prediction
        eps_cond = model(x_t, t, text_cond)
        # Guided prediction
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x_t = denoise_step(x_t, eps, t)
    return x_t

CFG dramatically improves generation quality and text fidelity, but if the guidance scale is too high, images become oversaturated or artifacts appear.

2.3.6 DALL-E 2: CLIP-Based Diffusion

DALL-E 2 (Ramesh et al., 2022) introduced a two-stage diffusion architecture leveraging the CLIP embedding space.

[DALL-E 2 Training Pipeline]

Text --> CLIP Text Encoder --> text embedding
                |
                v
Prior (Diffusion):      text embedding --> CLIP image embedding
                |
                v
Decoder (Diffusion):    CLIP image embedding --> 64x64 image
                |
                v
Super-Res (Diffusion):  64x64 --> 256x256 --> 1024x1024

2.3.7 Imagen: The Power of T5 Text Encoder

Google's Imagen (Saharia et al., 2022) maximized text understanding by using the frozen T5-XXL text encoder (4.6B parameters).

Key findings:

  • Scaling the text encoder is more effective than scaling the U-Net
  • T5-XXL outperforms CLIP ViT-L in both text understanding and image quality
  • Dynamic Thresholding: stable generation even at high CFG scales
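Dynamic thresholding can be sketched in a few lines: at each sampling step, clamp the predicted clean image to a per-sample percentile of its absolute pixel values (at least 1), then rescale back into range. A minimal illustration of the idea from the Imagen paper:

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Imagen-style dynamic thresholding of the predicted clean image x0.

    High CFG scales push predicted pixels far outside [-1, 1]; clamping to a
    per-sample percentile s (but never below 1) and dividing by s prevents
    the saturated, washed-out look of static clipping.
    """
    b = x0_pred.shape[0]
    s = torch.quantile(x0_pred.abs().reshape(b, -1), percentile, dim=1)
    s = s.clamp(min=1.0).view(b, *([1] * (x0_pred.ndim - 1)))
    return x0_pred.clamp(-s, s) / s
```

When predictions already lie inside [-1, 1], s is clamped to 1 and the operation is a no-op, so it only activates when guidance actually overshoots.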
[Imagen Architecture]

Text --> T5-XXL (frozen) --> text embeddings (cross-attention into each U-Net)
                |
                v
Base Model (U-Net):   generates 64x64
                |
                v
SR Model 1 (U-Net):   64x64 --> 256x256
                |
                v
SR Model 2 (U-Net):   256x256 --> 1024x1024

2.3.8 DiT: Diffusion Transformer

DiT (Diffusion Transformer) by Peebles & Xie (2023) is an architecture that replaces U-Net with Transformer, and is becoming the mainstream for recent T2I models.

[DiT Block Structure]

Input Tokens (patchified latent)
    |
Adaptive LayerNorm (adaLN-Zero)   <- γ, β = MLP(timestep + class embedding)
    |
Self-Attention (+ residual, scaled by α)
    |
Adaptive LayerNorm (adaLN-Zero)
    |
Pointwise FFN (+ residual, scaled by α)
    |
Output Tokens

Key Design Decisions of DiT:

  • Patchify: Split latent into p x p patches then linear projection to token sequence
  • adaLN-Zero: Adaptive Layer Normalization, injecting timestep and class information into LN parameters
  • Scaling: Systematic scaling law verification by model size (depth, width)
| DiT Variant | Depth | Width | Parameters | GFLOPs |
|---|---|---|---|---|
| DiT-S/2 | 12 | 384 | 33M | 6 |
| DiT-B/2 | 12 | 768 | 130M | 23 |
| DiT-L/2 | 24 | 1024 | 458M | 80 |
| DiT-XL/2 | 28 | 1152 | 675M | 119 |
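The patchify step can be made concrete. The sketch below splits a latent into non-overlapping p x p patches and flattens each into a token; in DiT proper, a linear layer then projects each token to the model width (e.g. 1152 for DiT-XL/2):

```python
import torch

def patchify(latent, patch_size=2):
    """Turn a latent (B, C, H, W) into a token sequence (B, N, p*p*C).

    N = (H/p) * (W/p); each token is one flattened p x p patch across all
    channels, ready for a linear projection to the Transformer width.
    """
    B, C, H, W = latent.shape
    p = patch_size
    x = latent.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)   # (B, H/p, W/p, p, p, C)
    return x.reshape(B, (H // p) * (W // p), p * p * C)
```

For SD-style latents (4 x 32 x 32) with p = 2 this yields 256 tokens of dimension 16, which is why the "/2" in the variant names matters: halving p quadruples the token count and roughly quadruples attention cost.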

2.4 Autoregressive-Based T2I

2.4.1 DALL-E (Original): Token-Based Autoregressive Generation

DALL-E (Ramesh et al., 2021) converts images into discrete tokens, then concatenates text tokens and image tokens into a single sequence to learn the joint distribution with an autoregressive Transformer.

[DALL-E Training Pipeline]

Stage 1: dVAE training
  Image (256x256) --> dVAE Encoder --> 32x32 grid of tokens (8192 vocabulary)
                                   --> dVAE Decoder --> Reconstructed Image

  Loss: Reconstruction + KL Divergence (Gumbel-Softmax relaxation)

Stage 2: Autoregressive Transformer training
  [BPE text tokens (256)] + [Image tokens (1024)] = 1280 tokens

  Transformer (12B params):
  - 64 layers, 62 attention heads
  - Training objective: next-token prediction (cross-entropy)
  - Text tokens attend causally (left to right)
  - Image tokens are generated autoregressively in row-major order
  - Image tokens attend to the text tokens (text-to-image conditioning)

2.4.2 Parti: Encoder-Decoder Based

Google's Parti (Yu et al., 2022) formulated T2I as a sequence-to-sequence problem, combining a ViT-VQGAN tokenizer with an Encoder-Decoder Transformer.

Key features:

  • ViT-VQGAN: Vision Transformer-based image tokenizer
  • Encoder-Decoder: Uses Encoder for text encoding, Decoder for image token generation
  • Scaling: Systematic scale-up from 350M to 3B to 20B parameters
  • Achieves quality comparable to Imagen at the 20B model

2.4.3 CM3Leon: Efficient Multimodal Autoregressive

Meta's CM3Leon (Yu et al., 2023) significantly improved the efficiency of the autoregressive approach:

  • Retrieval-Augmented Training: Retrieve related image-text pairs during training and add to context
  • Decoder-Only: Pure decoder-only architecture unlike Parti
  • Instruction Tuning: Supervised fine-tuning for various tasks
  • 5x less training cost: Reduces training compute by 1/5 for comparable performance
  • Achieves MS-COCO zero-shot FID of 4.88

2.5 Flow Matching: The Next-Generation Training Paradigm

2.5.1 Basic Principles of Flow Matching

Flow Matching (Lipman et al., 2023) learns a straight path from noise distribution to data distribution through a deterministic ODE (Ordinary Differential Equation) instead of Diffusion's stochastic process.

[Diffusion vs Flow Matching Comparison]

Diffusion (stochastic):                  Flow Matching (deterministic):
  x_0 ~~~> x_T                             x_0 ------> x_1
  (curved path, many steps required)       (straight path, fewer steps possible)

  dx = f(x,t)dt + g(t)dW                   dx/dt = v_θ(x_t, t)
  (SDE-based)                              (ODE-based, learns a velocity field)

Flow Matching Training Objective:

L_FM = E_{t, x_0, x_1} [ ||v_θ(x_t, t) - u_t(x_t | x_0, x_1)||² ]

where:
  x_t = (1 - t) * x_0 + t * x_1        (linear interpolation)
  u_t = x_1 - x_0                      (target velocity: straight path)
  t ~ Uniform(0, 1)                    (or logit-normal)
  x_0 ~ p_data (real data)
  x_1 ~ N(0, I) (Gaussian noise)

2.5.2 Rectified Flow

Rectified Flow (Liu et al., 2023, ICLR 2023 Spotlight) is a key variant of Flow Matching that connects noise-data pairs in straight lines from an Optimal Transport perspective.

Key idea:

  1. 1-Rectified Flow: Randomly pair data x_0 and noise x_1 to learn straight paths
  2. 2-Rectified Flow (Reflow): Re-straighten pairs generated by 1-Rectified Flow to make paths closer to straight lines
  3. Distillation: Distill the straightened model into a 1-step model
# Rectified Flow core training
def rectified_flow_train_step(model, x_0, x_1=None):
    """
    x_0: 실제 데이터 (latent)
    x_1: 노이즈 (None이면 랜덤 샘플링)
    """
    if x_1 is None:
        x_1 = torch.randn_like(x_0)

    # Time sampling (logit-normal for SD3)
    t = torch.sigmoid(torch.randn(x_0.shape[0]))  # logit-normal
    t = t.view(-1, 1, 1, 1)

    # Linear interpolation
    x_t = (1 - t) * x_0 + t * x_1

    # Target velocity (straight direction)
    target_v = x_1 - x_0

    # Velocity prediction
    v_pred = model(x_t, t)

    # Loss
    loss = F.mse_loss(v_pred, target_v)
    return loss
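The Reflow step (2-Rectified Flow) needs pairs generated by the first model: integrate the learned ODE from noise back to data and keep the endpoints as new, better-aligned training pairs. A minimal Euler-integration sketch, assuming `model(x, t)` predicts the velocity x_1 - x_0 as in the training loop above:

```python
import torch

@torch.no_grad()
def generate_reflow_pairs(model, x_1, num_steps=50):
    """Generate (x_0, x_1) pairs for 2-Rectified Flow (Reflow).

    Euler-integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data);
    the recovered endpoint is re-paired with the original noise and used
    to retrain, straightening the learned trajectories.
    """
    x = x_1.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        t_batch = torch.full((x.shape[0], 1, 1, 1), t)
        x = x - model(x, t_batch) * dt   # step against the velocity, toward data
    return x, x_1                         # (x_0 candidate, original noise)
```

Because each noise sample stays paired with the data point its own ODE reaches, the retrained velocity field has far less trajectory crossing, which is what makes the subsequent few-step or 1-step distillation work.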

2.5.3 Flow Matching in Stable Diffusion 3

SD3 (Esser et al., 2024) is the first model to apply Rectified Flow to a large-scale T2I model. Key contributions from the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis":

1. Logit-Normal Timestep Sampling:

Instead of a uniform distribution, timesteps are sampled using a logit-normal distribution, giving more weight to the middle portion of the trajectory (the most challenging prediction interval).

# SD3's Logit-Normal Timestep Sampling
def logit_normal_sampling(batch_size, m=0.0, s=1.0):
    """Give more weight to middle timesteps"""
    u = torch.randn(batch_size) * s + m
    t = torch.sigmoid(u)  # in (0, 1)
    return t

2. MM-DiT (Multi-Modal Diffusion Transformer):

SD3 introduced a new Transformer architecture that uses separate weights for text and images while enabling bidirectional information flow.

[MM-DiT Block]

Image Tokens                 Text Tokens
     |                            |
 adaLN(t)                     adaLN(t)        (separate weights per modality)
     |                            |
     +-------------+--------------+
                   |  (concatenate)
         Joint Self-Attention                 (image and text tokens
                   |                           attend to each other)
                   |  (split)
     +-------------+--------------+
     |                            |
 FFN (image)                  FFN (text)      (separate weights per modality)
     |                            |
 Image Out                    Text Out

3. Scaling Laws:

| Model | Blocks | Parameters | Validation Loss |
|---|---|---|---|
| SD3-S | 15 | 450M | High |
| SD3-M | 24 | 2B | Medium |
| SD3-L | 38 | 8B | Low (best performance) |

Smooth scaling was confirmed where validation loss steadily decreases as model size and training steps increase.

2.5.4 Flux: Black Forest Labs' Flow Matching Model

Flux (Black Forest Labs, 2024) is a model based on SD3's Rectified Flow + Transformer architecture.

| Variant | Training Method | Inference Steps | Features |
|---|---|---|---|
| FLUX.1 [pro] | Full training | 25-50 | Highest quality, API only |
| FLUX.1 [dev] | Guidance Distillation | 25-50 | Efficient inference, open weights |
| FLUX.1 [schnell] | Latent Adversarial Diffusion Distillation | 1-4 | Ultra-fast generation |

Guidance Distillation: the student model is trained to reproduce, without CFG, the output of a teacher model that uses CFG, eliminating the doubled forward pass of CFG at inference time.

Latent Adversarial Diffusion Distillation (LADD): Combines GAN's adversarial loss with diffusion distillation to enable 1-4 step generation.
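The guidance-distillation objective can be sketched as follows. This is an illustration of the general idea, not Black Forest Labs' actual recipe; the teacher's CFG-combined prediction becomes a regression target that the student matches in a single forward pass:

```python
import torch
import torch.nn.functional as F

def guidance_distill_step(student, teacher, x_t, t, cond, uncond, w=3.5):
    """One guidance-distillation training step (conceptual sketch).

    The teacher runs twice (conditional + unconditional) to form the guided
    prediction; the student learns to produce it directly. In practice the
    guidance scale w is also fed to the student so one model covers a range
    of scales.
    """
    with torch.no_grad():
        eps_u = teacher(x_t, t, uncond)
        eps_c = teacher(x_t, t, cond)
        target = eps_u + w * (eps_c - eps_u)   # teacher's CFG-guided output
    pred = student(x_t, t, cond)               # single pass, no CFG at inference
    return F.mse_loss(pred, target)
```

The student inherits the teacher's text fidelity while halving per-step compute, which is consistent with FLUX.1 [dev] matching [pro] quality at similar step counts.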


3. Text Conditioning Methodologies

Text Conditioning is the mechanism that injects the meaning of text prompts into the image generation process in T2I models. The choice of text encoder and conditioning method has a decisive impact on generation quality.

3.1 CLIP Text Encoder

OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al., 2021) is a model contrastively trained on 400 million image-text pairs.

[CLIP Training Process]

  Image ──→ Image Encoder ──→ image embedding ─┐
                                                 ├─ cosine similarity
  Text  ──→ Text Encoder  ──→ text embedding  ─┘

  Training objective: Increase similarity for matching pairs, decrease for non-matching pairs
  (InfoNCE Loss)
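The symmetric InfoNCE objective in the diagram can be written in a few lines; matching pairs sit on the diagonal of the similarity matrix, and both an image-to-text and a text-to-image cross-entropy are averaged:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Embeddings are L2-normalized so the logits are scaled cosine
    similarities; label i says row i's match is column i (the diagonal).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    labels = torch.arange(logits.shape[0])
    loss_i = F.cross_entropy(logits, labels)          # image -> text
    loss_t = F.cross_entropy(logits.t(), labels)      # text -> image
    return (loss_i + loss_t) / 2
```

CLIP additionally learns the temperature as a parameter (a fixed 0.07 here is a simplification), and the effective batch size matters: larger batches supply more negatives per pair.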

Characteristics of CLIP Text Encoder:

  • Both token sequence embeddings and [EOS] token pooled embeddings can be utilized
  • Maximum 77 token length limit
  • Strong at image-text alignment
  • Text understanding specialized for visual concepts

| CLIP Variant | Text Encoder Params | Models Used |
|---|---|---|
| CLIP ViT-L/14 | ~124M | SD 1.x |
| OpenCLIP ViT-H/14 | ~354M | SD 2.x |
| OpenCLIP ViT-bigG/14 | ~694M | SDXL (primary) |
| CLIP ViT-L/14 | ~124M | SDXL (secondary) |

3.2 T5 Text Encoder

Google's T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) is a large-scale language model trained on a pure text corpus.

Advantages of T5 (Demonstrated in the Imagen paper):

  • Trained on a much larger text corpus than CLIP (C4 dataset)
  • Excellent at understanding complex sentence structures and relationships
  • Ability to process complex prompts including spatial relationships, quantities, and attribute combinations
  • Text encoder scaling is more effective than U-Net scaling (a key finding of Imagen)

| T5 Variant | Parameters | Models Used |
|---|---|---|
| T5-Small | 60M | Experimental |
| T5-Base | 220M | Experimental |
| T5-Large | 770M | Experimental |
| T5-XL | 3B | PixArt-alpha |
| T5-XXL | 4.6B | Imagen, SD3, Flux |
| Flan-T5-XL | 3B | PixArt-sigma |

3.3 Cross-Attention Mechanism

Cross-attention is the core mechanism that injects text information into image features within the U-Net or DiT.

# Cross-Attention implementation
class CrossAttention(nn.Module):
    def __init__(self, d_model, d_context, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.d_model = d_model

        self.to_q = nn.Linear(d_model, d_model, bias=False)       # latent → Q
        self.to_k = nn.Linear(d_context, d_model, bias=False)     # text → K
        self.to_v = nn.Linear(d_context, d_model, bias=False)     # text → V
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, x, context):
        """
        x: (B, H*W, d_model) - image latent features
        context: (B, seq_len, d_context) - text embeddings
        """
        B = x.shape[0]
        q = self.to_q(x)          # the image provides the Query
        k = self.to_k(context)    # the text provides the Key
        v = self.to_v(context)    # the text provides the Value

        # Multi-head reshape
        q = q.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        # Attention
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        attn = F.softmax(attn, dim=-1)
        out = attn @ v

        out = out.transpose(1, 2).reshape(B, -1, self.d_model)
        return self.to_out(out)

3.4 Pooled Text Embeddings vs Sequence Embeddings

Modern T2I models simultaneously utilize two types of text embeddings:

[Text Embedding Types]

Text: "a photo of a cat"
              │
              ▼
        Text Encoder
        ┌─────┴─────┐
        ▼           ▼
 Sequence Embeddings          Pooled Embedding
 (token-level)                (sentence-level)
 [h_1, h_2, ..., h_n]         h_pool = h_[EOS]
 Shape: (seq_len, d)          Shape: (d,)
        │                         │
        ▼                         ▼
 Used in Cross-Attention      Used for global conditioning
 (fine-grained per-token      (whole-sentence semantics)
  information)                - Added to the timestep embedding
                              - Modulates adaLN parameters
                              - Vector conditioning

Dual text encoder usage in SDXL:

# SDXL Text Conditioning
def get_sdxl_text_embeddings(text, clip_l, clip_g):
    # CLIP ViT-L: sequence embeddings (77, 768)
    clip_l_output = clip_l(text)
    clip_l_seq = clip_l_output.last_hidden_state      # (77, 768)
    clip_l_pooled = clip_l_output.pooler_output        # (768,)

    # OpenCLIP ViT-bigG: sequence embeddings (77, 1280)
    clip_g_output = clip_g(text)
    clip_g_seq = clip_g_output.last_hidden_state       # (77, 1280)
    clip_g_pooled = clip_g_output.pooler_output        # (1280,)

    # Concatenate sequence embeddings -> used for Cross-Attention
    text_embeddings = torch.cat([clip_l_seq, clip_g_seq], dim=-1)  # (77, 2048)

    # Concatenate pooled embeddings -> used for Vector conditioning
    pooled_embeddings = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # (2048,)

    return text_embeddings, pooled_embeddings

SD3 and Flux additionally combine T5-XXL sequence embeddings, using a triple text encoder configuration:

| Encoder | Role | Output Shape | Use Case |
|---|---|---|---|
| CLIP ViT-L | Visual alignment | pooled (768) + seq (77, 768) | pooled → vector cond |
| OpenCLIP ViT-bigG | Visual alignment | pooled (1280) + seq (77, 1280) | pooled → vector cond |
| T5-XXL | Text understanding | seq (max 256/512, 4096) | cross-attn / joint-attn |
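The triple-encoder conditioning described above can be sketched as simple tensor plumbing. This is a minimal illustration with random stand-in tensors; the shapes follow the table, and zero-padding the concatenated CLIP features to the T5 width mirrors SD3's published approach, though exact projection details vary by model:

```python
# Sketch: assembling SD3-style conditioning from three text encoders.
# All tensors here are random stand-ins for real encoder outputs.
import torch
import torch.nn.functional as F

clip_l_pooled = torch.randn(1, 768)       # CLIP ViT-L pooled
clip_g_pooled = torch.randn(1, 1280)      # OpenCLIP bigG pooled
clip_l_seq = torch.randn(1, 77, 768)      # CLIP ViT-L sequence
clip_g_seq = torch.randn(1, 77, 1280)     # OpenCLIP bigG sequence
t5_seq = torch.randn(1, 256, 4096)        # T5-XXL sequence

# Vector condition: concatenate the pooled CLIP embeddings -> (1, 2048)
vector_cond = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)

# Token condition: concatenate CLIP sequences channel-wise (77, 2048),
# zero-pad to the T5 width, then stack with T5 tokens along the sequence axis
clip_seq = torch.cat([clip_l_seq, clip_g_seq], dim=-1)   # (1, 77, 2048)
clip_seq = F.pad(clip_seq, (0, 4096 - 2048))             # (1, 77, 4096)
token_cond = torch.cat([clip_seq, t5_seq], dim=1)        # (1, 333, 4096)
```

The vector condition modulates adaLN parameters, while the token condition feeds the joint attention of the MM-DiT.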

4. Training Datasets

The quality of T2I models directly depends on the scale and quality of training data. Here is a summary of major large-scale datasets.

4.1 Comparison of Major Datasets

| Dataset | Scale | Source | Filtering Method | Main Models Used |
|---|---|---|---|---|
| LAION-5B | 5.85B pairs | Common Crawl | CLIP similarity > 0.28 (English) | SD 1.x, SD 2.x |
| LAION-400M | 400M pairs | Common Crawl | CLIP similarity filter | Early research |
| LAION-Aesthetics | ~120M pairs | LAION-5B subset | Aesthetic score > 4.5/5.0 | SD fine-tuning |
| CC3M | 3.3M pairs | Google search | Automated filtering pipeline | Research |
| CC12M | 12M pairs | Google search | Relaxed filtering | Research |
| COYO-700M | 747M pairs | Common Crawl | Image + text filtering | Research |
| WebLI | 10B images | Web crawling | Top 10% CLIP similarity | PaLI, Imagen |
| JourneyDB | ~4.6M pairs | Midjourney | High-quality prompt-image | Research |
| SAM | 11M images | Various sources | Manual + model-based | Segmentation + T2I |
| Internal (Proprietary) | Billions of pairs | Proprietary | Proprietary | DALL-E 3, Midjourney |

4.2 LAION-5B Filtering Pipeline

LAION-5B (Schuhmann et al., 2022) is the most widely used open T2I training dataset:

[LAION-5B Data Collection and Filtering Pipeline]

Common Crawl (web archive)
1. HTML parsing: extract src URL + alt-text from <img> tags
2. Image download (img2dataset)
   - Minimum resolution filter: width, height ≥ 64
   - Maximum aspect ratio: 3:1
3. CLIP similarity filtering
   - Compute image-text similarity with OpenAI CLIP ViT-B/32
   - English: cosine similarity ≥ 0.28
   - Other languages: cosine similarity ≥ 0.26
4. Safety filtering
   - NSFW detection score (CLIP-based)
   - Watermark detection score
   - Toxic content detection
5. Deduplication
   - Hash-based exact duplicate removal
   - CLIP embedding-based near-duplicate removal
Final: 5.85 billion image-text pairs
 - 2.32B English
 - 2.26B in 100+ other languages
 - 1.27B language unidentified

4.3 Data Quality Assessment

The latest models tend to focus on data quality over data quantity:

1. CLIP Score-Based Filtering:

# CLIP Score computation
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# `caption` is a string and `image` a PIL.Image for one candidate pair
inputs = processor(text=[caption], images=[image], return_tensors="pt")
outputs = model(**inputs)
clip_score = outputs.logits_per_image.item() / 100.0  # divide out the logit scale (~100)

2. Aesthetic Score Filtering:

LAION-Aesthetics is a subset built by training a separate aesthetic predictor (CLIP embedding → MLP → score) and keeping only images with an aesthetic score of 4.5 or higher.
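The CLIP-embedding-to-MLP-to-score pipeline can be sketched in a few lines. This is an illustrative stand-in, not the actual LAION predictor: the layer sizes are assumptions, and the random tensors below stand in for real CLIP ViT-L/14 image embeddings.

```python
# Sketch of a LAION-style aesthetic predictor: a small MLP mapping a
# CLIP image embedding to one scalar score, then threshold filtering.
import torch
import torch.nn as nn

class AestheticPredictor(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 128),
            nn.ReLU(),
            nn.Linear(128, 1),   # scalar aesthetic score
        )

    def forward(self, clip_embedding):
        # L2-normalize, as CLIP embeddings are used in similarity space
        clip_embedding = clip_embedding / clip_embedding.norm(dim=-1, keepdim=True)
        return self.mlp(clip_embedding)

predictor = AestheticPredictor()
embeddings = torch.randn(4, 768)            # stand-in CLIP embeddings
scores = predictor(embeddings).squeeze(-1)  # one score per image
keep = scores > 4.5                          # LAION-Aesthetics threshold
```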

3. Caption Quality Improvement (DALL-E 3's Core Innovation):

DALL-E 3 (Betker et al., 2023) achieved dramatic performance improvement through caption quality improvement alone without any architecture changes:

  • Train a dedicated image captioning model to generate detailed synthetic captions
  • Train with 95% synthetic captions + 5% original captions
  • Comparison experiments of three types: short synthetic, detailed synthetic, and human annotation
  • Detailed synthetic captions are overwhelmingly superior
[DALL-E 3 Caption Improvement Effect]

Before: "cat on table"
      -> Vague and lacks detail

After: "A fluffy orange tabby cat sitting on a round wooden
       dining table, natural sunlight streaming through a
       window behind, casting soft shadows. The cat has
       bright green eyes and is looking directly at the camera."
      -> Includes detailed attributes, spatial relationships, and lighting information

4.4 Data Preprocessing Techniques

| Preprocessing Technique | Description | Effect |
|---|---|---|
| Center Crop | Crop center of image to square | Resolution standardization |
| Random Crop | Random position crop | Data augmentation |
| Bucket Sampling | Group images with similar aspect ratios | Multi-aspect ratio training (SDXL) |
| Caption Dropout | Replace caption with empty string at a certain probability | CFG training support |
| Multi-resolution | Progressive learning from low to high resolution | Training efficiency + quality |
| Tag Shuffling | Random shuffle of tag order | Reduced text order bias |
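Two of these techniques, bucket sampling and caption dropout, are simple enough to sketch directly. The bucket resolutions below are illustrative, not the exact SDXL bucket list:

```python
# Sketch: aspect-ratio bucketing (SDXL-style) and caption dropout (for CFG).
import random

BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832)]  # (H, W), illustrative

def assign_bucket(height, width):
    """Pick the bucket whose aspect ratio is closest to the image's,
    so batches can be formed from same-shaped images."""
    ar = height / width
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

def maybe_drop_caption(caption, p=0.1):
    """Replace the caption with an empty string with probability p,
    so the model also learns the unconditional distribution for CFG."""
    return "" if random.random() < p else caption

bucket = assign_bucket(768, 1024)   # a landscape image lands in a wide bucket
```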

5. Fine-tuning & Customization Techniques

Fine-tuning techniques that adapt pretrained T2I models to specific styles, subjects, and control conditions are essential for practical applications.

5.1 LoRA (Low-Rank Adaptation)

LoRA by Hu et al. (2022) is an efficient method for fine-tuning large model weights, and is also extensively used in T2I models.

[LoRA Principle]

Original weights:  W_0 ∈ R^{d×k}  (frozen)
LoRA update:       ΔW = B × A      where A ∈ R^{r×k}, B ∈ R^{d×r}

Final output: h = W_0 x + ΔW x = W_0 x + B(Ax)

- r << min(d, k): low-rank (typically 4, 8, 16, 32, 64)
- Trainable parameters: only A and B (a tiny fraction of the total)
- Original weights stay frozen → memory efficient
# LoRA application example (Stable Diffusion U-Net attention layer)
class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1.0):
        super().__init__()
        self.original = original_layer  # frozen
        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # LoRA layers
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank

        # Initialization
        nn.init.kaiming_uniform_(self.lora_A.weight)
        nn.init.zeros_(self.lora_B.weight)  # Initialize B to 0 -> identical to original at start

    def forward(self, x):
        original_out = self.original(x)        # Frozen original output
        lora_out = self.lora_B(self.lora_A(x)) # LoRA update
        return original_out + self.scale * lora_out

LoRA Training Configuration (Diffusers-based):

# Diffusers LoRA training execution example
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --rank=4 \
  --mixed_precision="fp16" \
  --output_dir="./sdxl-naruto-lora"
| LoRA Parameter | Typical Range | Impact |
|---|---|---|
| Rank (r) | 4-128 | Higher values increase expressiveness and memory |
| Alpha (α) | equal to rank ~ 2x | Learning rate scaling |
| Target Modules | attn Q,K,V,O + FFN | Application scope |
| Learning Rate | 1e-4 ~ 1e-5 | Convergence speed |
| Training Time | 5-30 min (single GPU) | Enables fast iteration |
| File Size | 1-200 MB | Easy to share and distribute |
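One practical consequence of the formulation h = W_0 x + B(Ax) is that a trained LoRA can be merged into the base weight for deployment, removing the extra layers at inference time. A minimal sketch with toy dimensions:

```python
# Sketch: merging LoRA weights into the base weight.
# W_merged = W_0 + (alpha / rank) * B @ A, matching the formulation above.
import torch

d, k, rank, alpha = 64, 32, 4, 4.0
W0 = torch.randn(d, k)           # frozen base weight
A = torch.randn(rank, k) * 0.01  # LoRA down-projection
B = torch.zeros(d, rank)         # LoRA up-projection (zero-initialized)

W_merged = W0 + (alpha / rank) * (B @ A)

# With B still zero-initialized, merging is a no-op (identical to the base model):
assert torch.allclose(W_merged, W0)
```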

5.2 DreamBooth

DreamBooth by Ruiz et al. (2023) is a technique for injecting the concept of a specific subject into a model using 3-5 images.

[DreamBooth Training Process]

Input: 3-5 images of a specific subject + unique identifier "[V]"
      Example: "a [V] dog" (specific dog)

Training strategy:
1. Fine-tune model with subject images
   - "a [V] dog" → 해당 강아지 이미지

2. Prior Preservation Loss (Key!)
   - Pre-generate "a dog" images with the original model
   - Preserve general dog generation capability during fine-tuning
   - Prevent language drift

L = L_recon([V] images) + λ * L_prior(class images)
# DreamBooth + LoRA training (recommended combination)
# Based on diffusers library
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./my_dog_images" \
  --instance_prompt="a photo of sks dog" \
  --class_data_dir="./class_dog_images" \
  --class_prompt="a photo of dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --max_train_steps=500 \
  --rank=4 \
  --mixed_precision="fp16"

5.3 Textual Inversion

Textual Inversion by Gal et al. (2023) is a method that learns only new token embeddings without modifying any model weights.

[Textual Inversion]

Existing token space:  [cat] [dog] [car] [tree] ...
Add new token:         [S*]  ← new concept to learn
Training: Optimize only the embedding vector of [S*] with 3-5 images
          The entire rest of the model is frozen

Advantage: Minimal parameters (1 token = 768 or 1024 floats)
Disadvantage: Less expressive than LoRA/DreamBooth
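The "only one embedding vector is trainable" setup can be sketched as follows. The loss here is a stand-in for the actual diffusion denoising loss, and the small vocabulary size is illustrative:

```python
# Sketch: Textual Inversion optimizes ONE embedding while the model is frozen.
import torch

embed_dim = 768
s_star = torch.randn(embed_dim, requires_grad=True)      # the only trainable tensor
frozen_embedding_table = torch.randn(1000, embed_dim)    # frozen vocabulary (stand-in)

optimizer = torch.optim.AdamW([s_star], lr=5e-3)

for step in range(10):
    # Stand-in for the denoising loss computed with [S*] in the prompt
    loss = ((s_star - frozen_embedding_table[0]) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Only s_star receives gradients; everything else stays fixed
assert s_star.requires_grad and not frozen_embedding_table.requires_grad
```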

5.4 ControlNet

ControlNet by Zhang & Agrawala (2023) is a method for adding structural conditions (edge, depth, pose, etc.) to pretrained diffusion models.

[ControlNet Architecture]

                    Control Input (e.g., Canny edge)
                              │
                         ┌────┴────┐
                         │ZeroConv │
                         └────┬────┘
    ┌───────────────┐         │
    │ Locked U-Net  │  ┌──────┴───────┐
    │ (original,    │  │ Trainable    │
    │  frozen)      │  │ copy of the  │
    │               │  │ U-Net encoder│
    └───────┬───────┘  └──────┬───────┘
            │                 │
            │            ┌────┴────┐
            │            │ZeroConv │  Output is 0 at training start
            │            └────┬────┘  (starts without affecting the original model)
            │                 │
            └────── + ────────┘  Added to the original U-Net features
                    │
               Final Output

ControlNet's Core Training Technique - Zero Convolution:

# Zero Convolution: Initialize weights and biases to 0
class ZeroConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

# Training start: zero conv output = 0
# -> Adding ControlNet doesn't affect original model output
# -> Gradually reflects control signal as training progresses
| Condition Type | Input | Use Case |
|---|---|---|
| Canny Edge | Edge map | Contour-based generation |
| Depth | Depth map | 3D structure preservation |
| OpenPose | Joint positions | Human pose control |
| Semantic Segmentation | Segmentation map | Layout control |
| Scribble | Scribble | Rough composition |
| Normal Map | Surface normal map | 3D shape control |
| Tile | Low-resolution/tile | Super-resolution |

5.5 IP-Adapter

IP-Adapter (Image Prompt Adapter) by Ye et al. (2023) is an adapter that uses images as prompts to transfer style or subjects.

[IP-Adapter Architecture]

Reference Image ──→ CLIP Image Encoder ──→ image features
                                                │
                                         ┌──────┴──────┐
                                         │ Projection  │ (trainable)
                                         │   Layer     │
                                         └──────┬──────┘
                                         ┌──────┴──────┐
                                         │  Decoupled  │ Separate cross-attention
                                         │ Cross-Attn  │ (split off from the text cross-attn)
                                         └──────┬──────┘
Original U-Net Cross-Attention ────── + ────────┘
(text conditioning)

Output = Text_CrossAttn(Q, K_text, V_text) + λ * Image_CrossAttn(Q, K_img, V_img)
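The decoupled formula above can be sketched with single-head attention. Dimensions and the λ value are illustrative; only the image-side K/V projections are new (and trainable) in the real adapter:

```python
# Sketch of IP-Adapter's decoupled cross-attention: separate K/V projections
# for text and image tokens, with outputs summed under an image weight λ.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attn(q, k, v):
    w = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
    return w @ v

d = 64
to_q = nn.Linear(d, d, bias=False)
to_k_text, to_v_text = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
to_k_img, to_v_img = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)  # new, trainable

x = torch.randn(1, 16, d)         # latent tokens
text_ctx = torch.randn(1, 77, d)  # text embeddings
img_ctx = torch.randn(1, 4, d)    # projected CLIP image tokens
lam = 0.8

q = to_q(x)
out = attn(q, to_k_text(text_ctx), to_v_text(text_ctx)) \
    + lam * attn(q, to_k_img(img_ctx), to_v_img(img_ctx))
```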

5.6 Comparison of Fine-tuning Techniques

| Technique | Modified Target | Training Images | Training Time | File Size | Main Use Case |
|---|---|---|---|---|---|
| LoRA | Attention weights (low-rank) | Tens to thousands | 5-30 min | 1-200MB | Style, concepts |
| DreamBooth | Full model or + LoRA | 3-10 | 5-60 min | 2-7GB (full) or 1-200MB (LoRA) | Specific subject |
| Textual Inversion | Token embeddings only | 3-10 | 30 min to several hours | Few KB | Simple concepts |
| ControlNet | U-Net encoder copy | Tens to hundreds of thousands | Several days | ~1.5GB | Structural control |
| IP-Adapter | Projection + Cross-Attn | Large-scale | Several days | ~100MB | Image prompting |

6.1 Consistency Models

Consistency Models (Song et al., 2023) reduce the multi-step generation of diffusion models to one or a few steps.

[Consistency Models Key Idea]

Diffusion: x_T → x_{T-1}... → x_1 → x_0  (hundreds of steps)

Consistency:
  Train so that every point x_t on the PF-ODE trajectory
  maps to the same x_0

  f_θ(x_t, t) = x_0  ∀t ∈ [0, T]

  Key constraint: f_θ(x_0, 0) = x_0 (self-consistency)

     x_T ────→ f_θ ────→ x_0
      │                    ↑
     x_t ────→ f_θ ───────┘  (maps to the same x_0!)
      │                    ↑
    x_t' ────→ f_θ ───────┘

Two Training Methods:

| Method | Description | Advantage | Disadvantage |
|---|---|---|---|
| Consistency Distillation (CD) | Requires a pretrained diffusion model; simulates the PF-ODE | Higher quality | Needs a teacher model |
| Consistency Training (CT) | Trains directly on real data | No teacher needed | Somewhat lower quality than CD |

Performance:

  • CIFAR-10: FID 3.55 (1-step), 2.93 (2-step)
  • ImageNet 64x64: FID 6.20 (1-step)

Follow-up research, Improved Consistency Training (iCT) and Latent Consistency Models (LCM), applied this to large-scale T2I models, enabling 2-4 step generation at the SDXL level.
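The self-consistency boundary condition is enforced architecturally rather than by a penalty: skip/output scalings make the network's contribution vanish at the smallest noise level. A minimal sketch of this parameterization, following the c_skip/c_out form from the Consistency Models paper (σ_data = 0.5), with a stand-in network:

```python
# Sketch: consistency-function parameterization f_θ(x, σ) = c_skip(σ)·x + c_out(σ)·F_θ(x, σ),
# chosen so that f_θ(x, ε) = x exactly (the boundary condition).
import torch

SIGMA_DATA = 0.5
EPS = 0.002   # smallest noise level

def consistency_fn(F_theta, x, sigma):
    c_skip = SIGMA_DATA**2 / ((sigma - EPS)**2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (sigma - EPS) / (sigma**2 + SIGMA_DATA**2) ** 0.5
    return c_skip * x + c_out * F_theta(x, sigma)

net = lambda x, s: torch.tanh(x)   # stand-in for the trained network
x = torch.randn(2, 3)

# At sigma = EPS, c_skip = 1 and c_out = 0, so the output equals the input
assert torch.allclose(consistency_fn(net, x, EPS), x)
```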

6.2 The Spread of DiT (Diffusion Transformer) Architecture

Since 2024, DiT has been replacing U-Net to become the mainstream backbone for T2I:

| Model | Year | Backbone | Parameters | Key Features |
|---|---|---|---|---|
| DiT (original) | 2023 | Transformer | 675M | Class-conditional, adaLN-Zero |
| PixArt-alpha | 2023 | DiT + Cross-Attn | 600M | T2I, low-cost training |
| PixArt-sigma | 2024 | DiT + KV Compression | 600M | 4K resolution, weak-to-strong |
| SD3 | 2024 | MM-DiT | 2B-8B | Flow Matching, triple text encoder |
| Flux | 2024 | MM-DiT variant | ~12B | Distillation variants |
| Playground v2.5 | 2024 | SDXL U-Net | ~2.6B | EDM noise schedule |
| Hunyuan-DiT | 2024 | DiT | ~1.5B | Chinese-English bilingual |
| Lumina-T2X | 2024 | DiT | Various | Multi-modal generation |

6.3 PixArt-alpha and PixArt-sigma

PixArt-alpha (Chen et al., 2023) is a pioneering model for efficient DiT training:

Core innovation - Training Decomposition:

[PixArt-alpha 3-Stage Training]

Stage 1: Pixel dependency learning (low cost)
  - Start from an ImageNet-pretrained DiT
  - Foundation for the class-conditional → T2I transition

Stage 2: Text-image alignment learning
  - Inject text conditions via cross-attention
  - Use high-quality synthetic captions generated with LLaVA

Stage 3: High-quality aesthetic learning
  - Fine-tune on a high-quality aesthetic dataset
  - Leverages JourneyDB and similar sources

Total training cost: ~675 A100 GPU days
(10.8% of SD 1.5's ~6,250 A100 GPU days)

Improvements in PixArt-sigma (Chen et al., 2024):

  • Weak-to-Strong Training: Enhanced training with higher quality data based on PixArt-alpha
  • KV Compression: Compress Key and Value in Attention for improved efficiency, enabling 4K resolution
  • Comparable performance to SDXL (2.6B) with only 0.6B parameters

6.4 Comparison of SDXL, SD3, and Flux

[Stable Diffusion Lineage by Generation]

SD 1.x (2022)     SDXL (2023)       SD3 (2024)         Flux (2024)
    │                  │                │                   │
  U-Net 860M       U-Net 2.6B      MM-DiT 2-8B        MM-DiT ~12B
    │                  │                │                   │
  CLIP ViT-L       CLIP-L +          CLIP-L +            CLIP-L +
                   OpenCLIP-G        OpenCLIP-G +        OpenCLIP-G +
                                     T5-XXL               T5-XXL
    │                  │                │                   │
  Diffusion        Diffusion        Rectified            Rectified
  (DDPM)           (DDPM)           Flow                 Flow
    │                  │                │                   │
  512x512          1024x1024        1024x1024            1024x1024+
    │                  │                │                   │
  CFG 7.5          CFG 5-9          CFG 3.5-7            Guidance
                                                         Distillation
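The shift from DDPM to Rectified Flow in the diagram above changes the training target: instead of predicting noise along a curved diffusion path, the model regresses a constant velocity along a straight interpolation. A minimal sketch with a stand-in model (uniform timestep sampling here; SD3 uses a logit-normal distribution):

```python
# Sketch of the Rectified Flow objective used by SD3/Flux:
# x_t = (1 - t)·x_0 + t·ε, target velocity v = ε - x_0.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0):
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1, 1, 1)   # uniform here; SD3 uses logit-normal
    x_t = (1 - t) * x0 + t * noise          # straight-line interpolation
    target = noise - x0                     # constant velocity along the path
    pred = model(x_t, t)
    return F.mse_loss(pred, target)

model = lambda x, t: torch.zeros_like(x)    # stand-in velocity predictor
loss = rectified_flow_loss(model, torch.randn(2, 4, 8, 8))
```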

6.5 Training Innovations of DALL-E 3

The core innovation of DALL-E 3 (Betker et al., 2023) lies in improving training data caption quality:

  1. Image Captioner Training: Separately train a CoCa-based image captioning model
  2. Synthetic Caption Generation: Re-label entire training data with detailed synthetic captions
  3. Caption Mixing: Train with 95% synthetic + 5% original captions
  4. Descriptive vs Short: Detailed descriptive captions outperform short tag-style captions
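The 95/5 caption mixing in step 3 is a one-line sampling rule, sketched here for clarity:

```python
# Sketch: DALL-E 3-style caption mixing during training.
import random

def pick_caption(original, synthetic, p_synth=0.95):
    """Use the detailed synthetic caption with probability p_synth,
    otherwise fall back to the original alt-text caption."""
    return synthetic if random.random() < p_synth else original
```

Keeping a small fraction of original captions prevents the model from overfitting to the captioner's writing style.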

6.6 Three Key Insights of Playground v2.5

Playground v2.5 (Li et al., 2024) surpassed DALL-E 3 and Midjourney 5.2 through training strategy improvements based on the SDXL architecture:

1. EDM Noise Schedule Adoption:

# EDM Framework (Karras et al., 2022)
# σ(t)-based noise schedule - guarantees Zero Terminal SNR
# Greatly improves color/contrast over SD's original linear schedule

def edm_precondition(sigma, x_noisy, F_theta):
    """EDM Preconditioning"""
    c_skip = 1.0 / (sigma ** 2 + 1)
    c_out = sigma / (sigma ** 2 + 1).sqrt()
    c_in = 1.0 / (sigma ** 2 + 1).sqrt()
    c_noise = sigma.log() / 4

    D_x = c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise)
    return D_x

2. Multi-Aspect Ratio Training:

  • Bucketed dataset: Group images with similar aspect ratios into buckets to form batches
  • Supports various aspect ratios during training (1:1, 4:3, 16:9, etc.)

3. Human Preference Alignment:

  • Training strategy utilizing human preference data
  • Maximize aesthetic quality through quality-tuning

7. Practical Training Pipeline Guide

7.1 Training Infrastructure

GPU Requirements

| Training Scale | Recommended GPU | VRAM | Training Duration | Cost (Estimated) |
|---|---|---|---|---|
| LoRA fine-tuning | 1× RTX 3090/4090 | 24GB | 5-30 min | < $1 |
| DreamBooth | 1× A100 40GB | 40GB | 30-60 min | $2-5 |
| ControlNet training | 4-8× A100 80GB | 320-640GB | 2-5 days | $500-2,000 |
| SD 1.5-scale training | 256× A100 80GB | ~20TB | 24 days | ~$150K |
| SDXL-scale training | 512-1024× A100 80GB | ~40-80TB | Weeks | ~$500K-1M |
| SD3/Flux-scale training | 1024+× H100 80GB | ~80TB+ | Weeks to months | > $1M |

Distributed Training Strategy

[Large-Scale Distributed Training Configuration]

Data Parallel (DP/DDP)

  GPU 0        GPU 1        GPU 2        GPU 3
 ┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐
 │Full  │    │Full  │    │Full  │    │Full  │
 │Model │    │Model │    │Model │    │Model │
 │Copy  │    │Copy  │    │Copy  │    │Copy  │
 └──────┘    └──────┘    └──────┘    └──────┘
 Batch 1     Batch 2     Batch 3     Batch 4

 -> Each GPU processes a different data batch
 -> Gradients are synchronized with All-Reduce

FSDP (Fully Sharded Data Parallel)

  GPU 0        GPU 1        GPU 2        GPU 3
 ┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐
 │Shard │    │Shard │    │Shard │    │Shard │
 │ 1/4  │    │ 2/4  │    │ 3/4  │    │ 4/4  │
 └──────┘    └──────┘    └──────┘    └──────┘

 -> Model parameters are sharded across GPUs
 -> Only the needed shards are All-Gathered during forward/backward
 -> Maximizes memory efficiency (enables 8B+ model training)

7.2 Representative Training Framework: Diffusers

HuggingFace's Diffusers library is the de facto standard for T2I model training.

# Diffusers-based Text-to-Image full training pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from accelerate import Accelerator
import torch
import torch.nn.functional as F

# 1. Load models
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer"
)
noise_scheduler = DDPMScheduler.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)

# 2. Freeze VAE and Text Encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

# 3. Accelerator setup (distributed training + Mixed Precision)
accelerator = Accelerator(
    mixed_precision="fp16",          # or "bf16"
    gradient_accumulation_steps=4,
)

# 4. Optimizer
optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)

# 5. EMA setup
from diffusers.training_utils import EMAModel
ema_unet = EMAModel(
    unet.parameters(),
    decay=0.9999,
    use_ema_warmup=True,
)

# 6. Prepare for distributed training
unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)

# 7. Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        with accelerator.accumulate(unet):
            images = batch["images"]
            captions = batch["captions"]

            # Latent encoding
            with torch.no_grad():
                latents = vae.encode(images).latent_dist.sample()
                latents = latents * vae.config.scaling_factor

            # Text encoding
            with torch.no_grad():
                text_inputs = tokenizer(captions, padding="max_length",
                                       max_length=77, return_tensors="pt")
                text_embeds = text_encoder(text_inputs.input_ids)[0]

            # Add noise
            noise = torch.randn_like(latents)
            timesteps = torch.randint(0, 1000, (latents.shape[0],))
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

            # Classifier-Free Guidance: random caption dropout
            if torch.rand(1) < 0.1:  # unconditional with 10% probability
                text_embeds = torch.zeros_like(text_embeds)

            # Predict noise
            noise_pred = unet(noisy_latents, timesteps, text_embeds).sample

            # Loss computation
            loss = F.mse_loss(noise_pred, noise)

            # Backward
            accelerator.backward(loss)
            accelerator.clip_grad_norm_(unet.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()

            # EMA update
            ema_unet.step(unet.parameters())

7.3 Mixed Precision Training

Mixed Precision is a technique that improves memory and computational efficiency by combining FP32 and FP16/BF16.

[Mixed Precision Training]

Forward Pass:
  - Model weights: FP16/BF16 (half memory)
  - Activation: FP16/BF16

Loss Scaling:
  - Multiply loss by a large scale (e.g., 2^16) to prevent gradient underflow
  - Scale down gradient again after backward

Backward Pass:
  - Gradient: FP16/BF16

Optimizer Step:
  - Master Weights: FP32 (maintain precision!)
  - Update FP32 master weights then create FP16 copy
| Precision | Memory | Compute Speed | Numerical Stability | Recommended For |
|---|---|---|---|---|
| FP32 | 4 bytes | baseline | Highest | Optimizer state |
| FP16 | 2 bytes | ~2x | Low (overflow risk) | Forward/Backward |
| BF16 | 2 bytes | ~2x | High (wide range) | Recommended on H100/A100 |
| TF32 | 4 bytes (storage) | ~1.5x | High | A100 default |
# BF16 Mixed Precision config (accelerate-based)
# accelerate config (YAML)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
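The loss-scaling mechanism described above maps directly onto PyTorch's gradient scaler. A minimal sketch (the tiny model is a stand-in; BF16's wider exponent range normally makes the scaler unnecessary, so it is disabled on the CPU/BF16 path here):

```python
# Sketch: FP16 loss scaling with torch AMP - scale the loss before backward,
# unscale gradients before the optimizer step, adjust the scale afterward.
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(8, 8).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # pass-through when disabled

x = torch.randn(4, 8, device=device)
amp_dtype = torch.float16 if use_cuda else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = model(x).pow(2).mean()   # forward pass in half precision

scaler.scale(loss).backward()   # loss multiplied by the scale factor
scaler.step(optimizer)          # gradients unscaled; step skipped on inf/NaN
scaler.update()                 # scale factor adjusted for the next step
```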

7.4 EMA (Exponential Moving Average)

EMA is a technique that maintains a moving average of model weights during training to achieve more stable results during inference. It is used in nearly all T2I model training.

[EMA Update]

θ_ema ← λ * θ_ema + (1 - λ) * θ_model

- λ: decay rate (typically 0.9999 ~ 0.99999)
- θ_model: current model weights being trained
- θ_ema: EMA weights (used at inference)
- Effect: smooths out gradient noise for more stable weights
# Diffusers EMA implementation
from diffusers.training_utils import EMAModel

# Create EMA model
ema_model = EMAModel(
    unet.parameters(),
    decay=0.9999,              # decay rate
    use_ema_warmup=True,       # use warmup
    inv_gamma=1.0,             # warmup parameter
    power=3/4,                 # warmup parameter
)

# Update at every training step
ema_model.step(unet.parameters())

# Apply EMA weights at inference
ema_model.copy_to(unet.parameters())

# Or use context manager
with ema_model.average_parameters():
    # EMA weights are used inside this block
    output = unet(noisy_latents, timesteps, text_embeds)

7.5 Training Hyperparameter Guide

| Hyperparameter | SD 1.5 | SDXL | SD3/Flux | LoRA |
|---|---|---|---|---|
| Learning Rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 ~ 5e-5 |
| Batch Size (total) | 2048 | 2048 | 2048+ | 1-8 |
| Optimizer | AdamW | AdamW | AdamW | AdamW / Prodigy |
| Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 |
| Grad Clip | 1.0 | 1.0 | 1.0 | 1.0 |
| EMA Decay | 0.9999 | 0.9999 | 0.9999 | N/A |
| Warmup Steps | 10,000 | 10,000 | 10,000 | 0-500 |
| Precision | FP32/FP16 | BF16 | BF16 | FP16/BF16 |
| CFG Dropout | 10% | 10% | 10% | 10% |
| Resolution | 512 | 1024 | 1024 | Original resolution |
| Total Steps | ~500K | ~500K+ | ~1M+ | 500-15,000 |
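The warmup-steps and cosine-scheduler settings from the table combine into one learning-rate curve. A minimal sketch using `LambdaLR` (step counts taken from the table; the combination itself is a common choice, not mandated by any one model):

```python
# Sketch: linear warmup followed by cosine decay, as a LambdaLR multiplier.
import math
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

warmup_steps, total_steps = 10_000, 500_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup 0 → 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay 1 → 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```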

8. Key Paper References

8.1 Core Paper Table

| # | Paper Title | Authors | Year | Key Contribution | Link |
|---|---|---|---|---|---|
| 1 | Generative Adversarial Networks | Goodfellow et al. | 2014 | GAN framework proposal | arXiv:1406.2661 |
| 2 | Neural Discrete Representation Learning (VQ-VAE) | van den Oord et al. | 2017 | Vector-quantized discrete latent space | arXiv:1711.00937 |
| 3 | A Style-Based Generator Architecture for GANs (StyleGAN) | Karras et al. | 2019 | Style-based generator, Progressive Growing | arXiv:1812.04948 |
| 4 | Large Scale GAN Training (BigGAN) | Brock et al. | 2019 | Large-scale GAN training, Truncation Trick | arXiv:1809.11096 |
| 5 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Razavi et al. | 2019 | Hierarchical VQ-VAE, high-resolution generation | arXiv:1906.00446 |
| 6 | Denoising Diffusion Probabilistic Models (DDPM) | Ho et al. | 2020 | Practical training of diffusion models | arXiv:2006.11239 |
| 7 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) | Radford et al. | 2021 | CLIP contrastive learning, image-text alignment | arXiv:2103.00020 |
| 8 | Zero-Shot Text-to-Image Generation (DALL-E) | Ramesh et al. | 2021 | dVAE + autoregressive Transformer T2I | arXiv:2102.12092 |
| 9 | High-Resolution Image Synthesis with Latent Diffusion Models (LDM) | Rombach et al. | 2022 | Latent diffusion, cross-attention conditioning | arXiv:2112.10752 |
| 10 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) | Ramesh et al. | 2022 | CLIP-based 2-stage diffusion, Prior + Decoder | arXiv:2204.06125 |
| 11 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) | Saharia et al. | 2022 | Effect of the T5-XXL text encoder, Dynamic Thresholding | arXiv:2205.11487 |
| 12 | Classifier-Free Diffusion Guidance | Ho & Salimans | 2022 | CFG training technique, joint unconditional-conditional training | arXiv:2207.12598 |
| 13 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti) | Yu et al. | 2022 | Autoregressive T2I, scaling to 20B | arXiv:2206.10789 |
| 14 | LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | 2022 | Low-rank fine-tuning technique | arXiv:2106.09685 |
| 15 | Elucidating the Design Space of Diffusion-Based Generative Models (EDM) | Karras et al. | 2022 | Systematic diffusion design-space analysis, preconditioning | arXiv:2206.00364 |
| 16 | An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | Gal et al. | 2023 | Personalization via new token embedding learning | arXiv:2208.01618 |
| 17 | DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | Ruiz et al. | 2023 | Subject personalization with few images, Prior Preservation | arXiv:2208.12242 |
| 18 | Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) | Zhang & Agrawala | 2023 | Adds structural control (edge, depth, pose) | arXiv:2302.05543 |
| 19 | Consistency Models | Song et al. | 2023 | 1-step generation, PF-ODE consistency learning | arXiv:2303.01469 |
| 20 | SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis | Podell et al. | 2023 | Larger U-Net, dual text encoder, multi-AR training | arXiv:2307.01952 |
| 21 | Scalable Diffusion Models with Transformers (DiT) | Peebles & Xie | 2023 | Diffusion + Transformer, adaLN-Zero | arXiv:2212.09748 |
| 22 | Flow Matching for Generative Modeling | Lipman et al. | 2023 | ODE-based Flow Matching framework | arXiv:2210.02747 |
| 23 | Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow | Liu et al. | 2023 | Rectified Flow, optimal transport | arXiv:2209.03003 |
| 24 | IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models | Ye et al. | 2023 | Image prompt adapter, decoupled cross-attn | arXiv:2308.06721 |
| 25 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Yu et al. | 2023 | Efficient autoregressive T2I, retrieval-augmented | arXiv:2309.02591 |
| 26 | PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Chen et al. | 2023 | Low-cost DiT training, training decomposition strategy | arXiv:2310.00426 |
| 27 | Improving Image Generation with Better Captions (DALL-E 3) | Betker et al. | 2023 | Dramatic quality improvement via synthetic captions | cdn.openai.com/papers/dall-e-3.pdf |
| 28 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Chen et al. | 2024 | Weak-to-strong training, KV Compression, 4K | arXiv:2403.04692 |
| 29 | Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) | Esser et al. | 2024 | MM-DiT, large-scale Rectified Flow, logit-normal sampling | arXiv:2403.03206 |
| 30 | Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | Li et al. | 2024 | EDM noise schedule, multi-AR, human preference | arXiv:2402.17245 |

8.2 Additional Reference Papers

| Paper Title | Year | Key Point |
|---|---|---|
| LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | 2022 | 5.85 billion open image-text dataset |
| Improved Denoising Diffusion Probabilistic Models | 2021 | Cosine schedule, learned variance |
| Denoising Diffusion Implicit Models (DDIM) | 2021 | Deterministic sampling, speed improvement |
| Progressive Distillation for Fast Sampling of Diffusion Models | 2022 | Inference acceleration via progressive distillation |
| InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | 2024 | Rectified Flow 1-step generation |
| Latent Consistency Models | 2024 | LCM, SDXL-based few-step generation |
| SDXL-Turbo: Adversarial Diffusion Distillation | 2024 | 1-4 step SDXL generation |
| Stable Cascade | 2024 | Wuerstchen-based 3-stage hierarchical generation |

9. Conclusion and Outlook

Text-to-Image model training methodologies started from GAN's adversarial training, passed through Diffusion's iterative denoising, and are now converging on a new paradigm of Flow Matching + DiT.

Key Trend Summary

[T2I Training Methodology Evolution]

Efficiency:   Full Training ──→ LoRA/Adapter ──→ Prompt Tuning
              (months, $1M+)    (minutes, <$1)   (seconds)

Architecture: U-Net ────────→ DiT ─────────→ MM-DiT + Flow Matching
              (SD 1.x-SDXL)   (DiT, PixArt)  (SD3, Flux)

Generation speed: 50-1000 steps ──→ 20-50 steps ──→ 1-4 steps
                  (DDPM)            (DDIM, DPM++)   (LCM, LADD, CM)

Data quality: Web crawling ──→ Filtering ──→ Synthetic Captions
              (LAION raw)      (aesthetic)   (DALL-E 3 style)

Text understanding: CLIP only ──→ CLIP + T5 ──→ Triple encoder
                    (SD 1.x)      (Imagen)      (SD3, Flux)

Future Outlook

  1. Maximizing training efficiency: As demonstrated by PixArt-alpha, the trend of reducing training costs to 1/10 or less while maintaining quality will continue.

  2. Data-Centric AI approach: As DALL-E 3 demonstrated, data quality and captioning are becoming more important than architecture.

  3. Few-Step / One-Step generation: Distillation techniques such as Consistency Models, LCM, and LADD will continue to advance, making real-time generation the standard.

  4. Unified Multi-Modal Generation: Expanding to models that integrate not only text-to-image but also video, 3D, and audio.

  5. Advanced Personalization: Beyond LoRA, DreamBooth, and IP-Adapter, more accurate subject reproduction with even less data will become possible.

T2I model training methodology has entered an era where the key is not simply "training a larger model with more data," but rather what data, with what schedule, and with what conditioning to train with. We hope the methodologies covered in this article can be used as a foundation for training your own T2I models or effectively customizing existing ones.


References