Diffusion Models Deep Dive — DDPM, Latent Diffusion, Classifier-Free Guidance, DDIM, Stable Diffusion Complete Guide (2025)

TL;DR

  • A Diffusion Model is a generative model that "learns gradual noise removal." A forward process progressively adds noise to images, and a reverse process (neural network) undoes it.
  • DDPM (2020, Ho et al.): the starting point of modern diffusion. Simplified the variational lower bound into a single noise prediction loss.
  • Score Matching: the same idea in different language. Reverse diffusion = score function estimation.
  • U-Net: the standard backbone for diffusion. Encoder-decoder with skip connections. Outputs noise at the same size as input.
  • Latent Diffusion Model (LDM, 2021): compress images into a latent space with a VAE, then diffuse inside it. 48x compute savings → feasible on consumer GPUs.
  • Stable Diffusion = LDM + CLIP text encoder + open-source release (2022, Stability AI).
  • Classifier-Free Guidance: combine conditional and unconditional models to tune prompt adherence. The secret behind text-to-image quality.
  • DDIM: deterministic sampling. Accelerates 1000 steps to 20 steps.
  • Consistency Model: generation in 1-4 steps. Real-time level.
  • Conditional Generation: controllable via ControlNet, LoRA, InstructPix2Pix, and more.
  • Video and 3D: Sora, Stable Video Diffusion, DreamFusion. Diffusion extends beyond images.

1. Evolution of Generative Models

1.1 The GAN Era (2014-2020)

In 2014, Ian Goodfellow introduced the GAN (Generative Adversarial Network). A Generator produces images, a Discriminator judges "real vs fake," and the two networks train by competing.

Advantages:

  • One-shot sample generation (fast).
  • High-quality images.

Disadvantages:

  • Mode collapse: generates only a subset of modes.
  • Training instability: highly hyperparameter-sensitive.
  • Hard to scale: unstable when extended with more data.

Development continued through StyleGAN (2019), but complex scenes (including legible text) remained out of reach.

1.2 Limits of VAE

VAE (Variational Autoencoder) is stable but produces blurry results. The encoder learns a distribution over latent codes and the decoder reconstructs. Generation is possible but quality lagged behind GANs.

1.3 Limits of Autoregressive Models

PixelRNN, PixelCNN, iGPT: generate pixels sequentially. Good quality but extremely slow (predicting one pixel at a time). Generating a 1024x1024 image = 1 million forward passes.

1.4 Diffusion Arrives (2020)

In 2020, DDPM (Ho, Jain, Abbeel). A "strangely simple" method:

  1. Gradually add noise to images (forward).
  2. Train a neural network to predict the noise.
  3. For generation, start from random noise and iteratively denoise.

Result: samples that are higher quality, more stable, and more diverse than GANs. The community got excited.

1.5 Scaling Success

2021: Dhariwal & Nichol published "Diffusion Models Beat GANs on Image Synthesis." SOTA image quality.

2021 GLIDE: OpenAI's text-to-image diffusion. Too slow and heavy though.

2021 LDM: Rombach et al.'s Latent Diffusion. A game changer. Diffusion in latent space rather than pixel space → roughly 48x cheaper.

2022 DALL-E 2: OpenAI's powerful text-to-image model.

2022 Stable Diffusion: Stability AI open-sourced LDM. Runs on a local GPU. Democratization of generative AI.

2022 Imagen: Google. Proved the importance of the text encoder.

2024 Sora: OpenAI's video diffusion. Minute-long high-quality video.

1.6 Why Did Diffusion Succeed?

Core reason: training is simple and stable. Not a GAN-style min-max game — just plain MSE loss. Improves consistently with scale. More data + bigger model = better results. "Simple things scale."


2. Forward Process — Adding Noise

2.1 The Idea

Gradually add noise to an image x_0:

x_0 (original) → x_1 → x_2 → ... → x_T (nearly pure noise)

A small amount of Gaussian noise per step. T = 1000 (DDPM).

2.2 Equations

Transition at each step:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)

Meaning:

  • x_t is sampled from a Gaussian with mean sqrt(1 - beta_t) * x_{t-1} and variance beta_t.
  • beta_t is the noise schedule — how much noise to add at each step.
  • Small beta early (preserve image), larger beta later (noise dominates).

2.3 Noise Schedule

Linear (original DDPM):

beta_1 = 0.0001
beta_T = 0.02
beta_t = linear interpolation

Cosine (Nichol & Dhariwal, 2021):

f(t) = cos^2( (t/T + s) / (1 + s) * pi/2 ),  alpha_bar_t = f(t) / f(0)
beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}

Cosine adds noise more slowly early on, which is beneficial at lower resolutions. Most modern models use cosine or variants.
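
A minimal sketch of both schedules in PyTorch (the s = 0.008 offset and the 0.999 clamp follow common practice; the function names are ours):

import math
import torch

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    # original DDPM schedule: beta rises linearly from beta_1 to beta_T
    return torch.linspace(beta_1, beta_T, T)

def cosine_betas(T=1000, s=0.008):
    # define alpha_bar via a squared cosine, then recover per-step betas
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()  # clamp avoids a singular final step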

2.4 Closed Form

A convenient property: because Gaussian steps compose, you can compute x_t directly for any t without iterating.

Define:

\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

Then:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)

Or in samplable form:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

The forward process is computed mathematically without any learning. Given the original x_0 and random noise epsilon, you can instantly produce x_t at any timestep.
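
As a sketch, this one-shot jump in PyTorch for a scalar timestep t (q_sample is our own name, not a library function):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # precomputed alpha_bar_t for every t

def q_sample(x0, t, eps=None):
    # jump straight from x_0 to x_t: no iteration over intermediate steps
    if eps is None:
        eps = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps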

2.5 Terminal State

If T is large enough, alpha_bar_T approaches 0 and x_T approaches N(0, I). Pure Gaussian noise independent of the original.

This is key: whatever image you start from, the terminal state is the same. The reverse process can start from pure noise and reach an image.


3. Reverse Process — What Must Be Learned

3.1 Goal

We want to "reverse" the forward process:

p_\theta(x_{t-1} | x_t) = \; ?

The true reverse can be written via Bayes but is intractable. A neural network approximates it.

3.2 Parameterization

Two approaches:

1. Predict the mean mu_theta(x_t, t):

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

2. Predict the noise epsilon_theta(x_t, t):

Ho et al. showed that directly predicting noise works much better. The mean is then derived via:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)

3.3 Simplifying the Loss

The full derivation goes through the variational lower bound, but Ho et al.'s insight was that a plain MSE loss on the noise works nearly as well:

L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, t) \|^2 \right]

In other words:

  1. Random original image x_0.
  2. Random timestep t.
  3. Random noise epsilon.
  4. Compute x_t via forward process.
  5. Give x_t to the network and let it predict epsilon.
  6. Train with MSE.

Astonishingly simple. No GAN min-max. Pure supervised learning.
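
Those six steps as a single training iteration, as a sketch (assumes a model taking (x_t, t) and the precomputed alpha_bars from §2.4):

import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bars, T=1000):
    alpha_bars = alpha_bars.to(x0.device)

    # steps 1-3: random timestep and noise per image in the batch
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)

    # step 4: closed-form forward process (broadcast alpha_bar over the batch)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

    # steps 5-6: predict the noise, regress with MSE
    loss = F.mse_loss(model(x_t, t), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()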

3.4 Sampling (Inference)

After training, to generate:

import torch

@torch.no_grad()
def sample(model, shape, betas, device="cuda"):
    T = len(betas)
    betas = betas.to(device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # start: pure noise
    x_t = torch.randn(shape, device=device)

    for t in reversed(range(T)):
        # predict the noise at this timestep
        eps = model(x_t, t)

        # compute the mean of p(x_{t-1} | x_t)
        alpha_t, alpha_bar_t, beta_t = alphas[t], alpha_bars[t], betas[t]
        mu = (x_t - beta_t / (1 - alpha_bar_t).sqrt() * eps) / alpha_t.sqrt()

        # sample the next step (no noise added at the final step)
        if t > 0:
            z = torch.randn_like(x_t)
            x_t = mu + beta_t.sqrt() * z
        else:
            x_t = mu

    return x_t  # generated image

Each step calls the model once → next step. 1000 steps = 1000 forward passes. Very slow.

3.5 Why It Works

Intuition: transforming a clean image into a noisy one is stochastic; the reverse is too. With sufficiently small steps, each transition can be approximated as Gaussian. The model only needs to learn "which direction makes things slightly cleaner at the current noise level."

Mathematically this is the discretization of a diffusion SDE. The continuous SDE form:

dx = f(x, t)\,dt + g(t)\,dW

Reverse SDE (Anderson 1982):

dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{W}

grad_x log p_t(x) is the score function — covered in the next section.


4. Score-Based Perspective

4.1 Score Function

The score of a distribution p(x) is:

\nabla_x \log p(x)

"The direction in which probability increases at this point." The gradient pointing toward high-probability regions.

4.2 Score Matching = Diffusion

A surprising equivalence: DDPM noise prediction = score estimation.

Sketch: when x_t ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I),

\nabla_{x_t} \log p(x_t | x_0) = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}

So score is proportional to noise. Noise prediction and score estimation are two expressions of the same problem.

4.3 Song and Ermon's Perspective

Yang Song's research (2019-2020) proposed the score-based generative model first. Independent of DDPM but mathematically equivalent.

Advantage: the continuous SDE framework is cleaner. It can also be expressed as an ODE (probability flow ODE).

4.4 Unified Framework

Song et al. 2021 "Score-Based Generative Modeling through SDEs": DDPM, NCSN, and so on are all special cases of the same SDE framework.

Modern diffusion research moves fluidly between SDE and DDPM viewpoints. Practically, DDPM notation is intuitive; theoretically, SDE is more elegant.


5. U-Net Architecture

The backbone of diffusion models is mostly the U-Net.

5.1 Why U-Net?

Requirements of diffusion models:

  • Input and output at the same spatial size (image → noise).
  • Multi-scale features (globally coherent and locally detailed).
  • Preserve spatial information.

U-Net (originally for medical image segmentation, 2015):

  • Encoder: resolution down, channels up.
  • Decoder: resolution up, channels down.
  • Skip connections: directly connect the same level of encoder and decoder.

A perfect match.

5.2 Structure

Input (e.g., 64x64x3)
   |
   v
Conv -> ResBlock -> Attention
   |      |
   |   Downsample (-> 32x32)
   |      |
   |   ResBlock -> Attention
   |      |
   |   Downsample (-> 16x16)
   |      |
   |   ResBlock -> Attention    <- Bottleneck
   |      |
   |   Upsample (-> 32x32)
   |      ^ (skip from encoder)
   |   ResBlock -> Attention
   |      ^
   |   Upsample (-> 64x64)
   |      ^ (skip)
   v
Output (64x64x3) — predicted noise

5.3 Timestep Embedding

The U-Net must know the current timestep t. Even for the same image, denoising in "early diffusion (slightly noisy)" vs "late (very noisy)" should differ.

Sinusoidal positional embedding:

\text{PE}(t, 2i) = \sin(t / 10000^{2i/d})

This embedding passes through a small MLP and is added as a bias inside each ResBlock.
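
A minimal sketch of the sinusoidal embedding (function name ours; dim is the embedding width):

import math
import torch

def timestep_embedding(t, dim=256):
    # same recipe as Transformer positional encoding, applied to t
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

The result then goes through the MLP mentioned above before being injected into each ResBlock.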

5.4 Attention Layers

Diffusion U-Nets include self-attention. Captures long-range dependencies (correlations between distant pixels). Especially near the bottleneck.

5.5 Conditioning

For text-conditioned generation, cross-attention is added:

Image features -> Query
Text features (CLIP) -> Key, Value

The U-Net learns to bridge text and image. This is the core of "following the prompt."
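
As a sketch, a single-head version of that cross-attention (real U-Nets use multi-head attention; 768 matches CLIP ViT-L/14's embedding width):

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    # image tokens (queries) attend to text tokens (keys/values)
    def __init__(self, img_dim, txt_dim=768):
        super().__init__()
        self.q = nn.Linear(img_dim, img_dim)
        self.k = nn.Linear(txt_dim, img_dim)
        self.v = nn.Linear(txt_dim, img_dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, H*W, img_dim), txt_feats: (B, L, txt_dim) from CLIP
        q, k, v = self.q(img_feats), self.k(txt_feats), self.v(txt_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # text-informed image features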

5.6 2025: Diffusion Transformer (DiT)

In 2023, Peebles & Xie proposed the Diffusion Transformer (DiT). A pure Transformer instead of a U-Net.

  • Patch-wise input (like ViT).
  • Excellent scaling properties.
  • The foundation of Sora.

DiT surpasses U-Net at large scale → the trend in 2024-2025 is shifting toward DiT.


6. DDIM — Faster Sampling

6.1 DDPM Is Slow

DDPM samples 1000 steps. Top quality, but too slow. Over a minute.

6.2 The DDIM Insight

Song et al. 2020 (ICLR 2021): the reverse process of DDPM can be made deterministic, letting you skip steps.

Core equation (simplified):

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)

Where:

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}

x_hat_0 is the "current best estimate" — "if I had to produce a clean image right now, this is it."

6.3 Fast Sampling

DDPM needs 1000 steps. DDIM can use any sub-sequence:

timesteps = [0, 50, 100, 150, 200, ..., 950]  # 20 steps

50x faster, with quality nearly identical to full 1000-step sampling.
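
A sketch of one deterministic DDIM update (eta = 0; names ours), which is why arbitrary timestep jumps are valid:

import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha_bars):
    eps = model(x_t, t)
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]

    # current best estimate of the clean image
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()

    # jump directly to the (possibly much earlier) timestep t_prev
    return ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps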

6.4 Various Samplers

Many variants after DDIM:

  • Euler: Euler integrator from the ODE viewpoint.
  • Heun: 2nd-order Runge-Kutta.
  • DPM-Solver: specialized ODE solver for diffusion. 10-25 steps.
  • DPM-Solver++: further improved. Current default for Stable Diffusion.
  • UniPC: unified predictor-corrector.

All share the same idea: treat sampling as an ODE problem and apply efficient solvers.


7. Classifier-Free Guidance

The most important trick in text-to-image.

7.1 The Problem

Naive conditional generation:

x_t -> model(x_t, text) -> noise

Problem: the model can ignore the text. Text is a weak signal during training.

7.2 Classifier Guidance (old)

Originally proposed by Dhariwal & Nichol. Train a separate classifier:

\text{score}_{\text{cond}} = \nabla_x \log p(x | y) = \nabla_x \log p(x) + \nabla_x \log p(y | x)

grad_x log p(y|x) is the classifier gradient. Problem: requires a separate classifier, and it's complex.

7.3 Classifier-Free Guidance (new)

Ho & Salimans 2022: achieve the same effect without a classifier.

At training:

  1. With 10-20% probability, replace the text with null (empty condition).
  2. The same model learns both conditional and unconditional generation.

At inference:

noise_cond = model(x_t, text)         # conditional prediction
noise_uncond = model(x_t, null)       # unconditional prediction

# Extrapolation!
guided_noise = noise_uncond + scale * (noise_cond - noise_uncond)

scale = 1 is plain conditional. scale > 1 amplifies the conditional direction. Typical values: 5-12.

7.4 Why It Works

Intuition: "subtracting unconditional from conditional leaves only the effect of the condition." Amplify that → emphasize the condition. As if saying "be more faithful to the text."

In practice, this determines prompt adherence in text-to-image. Without it, prompts get ignored.

7.5 Trade-off

Raising scale:

  • Pros: more accurate prompt following.
  • Cons: reduced image diversity.
  • Cons: more artifacts (at excessive scale).

Typically 7-12 is the sweet spot.
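
In diffusers, the scale is the guidance_scale argument; a quick way to compare values with a fixed seed (prompt and filenames illustrative):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor fox in a snowy forest"
for scale in (1.0, 7.5, 15.0):
    gen = torch.Generator("cuda").manual_seed(0)   # same seed, only the scale varies
    pipe(prompt, guidance_scale=scale, generator=gen).images[0].save(f"cfg_{scale}.png")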


8. Latent Diffusion Model

8.1 The Problem with Pixel Space

The original DDPM diffuses in pixel space. 512x512x3 = 786,432 dimensions. Each step processes all these dimensions → very expensive.

8.2 The LDM Idea (Rombach et al. 2021)

"There's a lot of waste in pixel space. Most pixels look like their neighbors, and meaningful information lives in far fewer dimensions. Compress with a VAE and diffuse in that space."

Structure:

Image (512x512x3)
     |
   VAE Encoder
     |
   Latent (64x64x4)  <- diffusion happens here
     |
   VAE Decoder
     |
   Image (512x512x3)

The latent space is 64x64x4 = 16,384 dimensions. 1/48 the original. 48x compute savings.

8.3 Training the VAE

LDM uses a pre-trained VAE. The VAE is trained separately:

  • Encoder: image → latent.
  • Decoder: latent → image.
  • Loss: reconstruction + KL + perceptual (LPIPS) + adversarial (discriminator).

Once trained, it's reused across diffusion models.

8.4 Diffusion in Latent Space

VAE is fixed. Standard DDPM runs on top:

latent = vae.encode(image)  # image -> latent
noisy_latent = forward_diffusion(latent, t)
predicted_noise = unet(noisy_latent, t, text)
# Loss: MSE between predicted_noise and actual noise

At generation:

latents = torch.randn(1, 4, 64, 64)  # latent size
for t in reversed(timesteps):
    latents = denoise_step(unet, latents, t, text)
image = vae.decode(latents)

8.5 Benefits Summary

  • Compute: 48x faster.
  • Memory: also 48x less.
  • High resolution possible: 1024x1024 is reasonable.
  • Semantic abstraction: latent space learns meaningful representations → better generalization.

8.6 Limitations

  • VAE lossiness: fine details (small text, faces, etc.) can be lost in compression.
  • VAE artifacts: slight blurring or subtle distortion.
  • Still overwhelmingly worth the trade-off.

9. Stable Diffusion

9.1 Composition

Stable Diffusion (2022 Stability AI) = LDM + CLIP + open source.

Components:

  1. VAE: image ↔ 8x compressed latent.
  2. CLIP text encoder: text → 768-dim vector.
  3. U-Net: diffusion in latent space. Text conditioning via cross-attention.
  4. Scheduler: DDIM, DPM-Solver, etc.

9.2 CLIP Text Encoder

OpenAI CLIP (Contrastive Language-Image Pre-training, 2021). A text/image encoder trained on massive image-text pairs.

Role:

  • Prompt → high-dimensional embedding.
  • The U-Net's cross-attention references this embedding.

Better CLIP = better prompt understanding. Stable Diffusion uses CLIP ViT-L/14.

9.3 Training Data

LAION-5B: 5 billion image-text pairs crawled from the web. Open, hence reproducible. Also the source of copyright controversy.

9.4 Version Evolution

SD 1.x (2022): first release. 512x512. OpenAI CLIP.

SD 2.x (2022): improved aesthetics. Uses OpenCLIP.

SDXL (2023): larger model, refiner. 1024x1024 native. Significantly better quality.

SD 3 (2024): adopts rectified flow. Built on the Diffusion Transformer.

Flux (2024 Black Forest Labs): a strong successor to SD3. Open-source SOTA.

9.5 Open vs Closed

Stable Diffusion: open-source. Local execution, fine-tuning, and modifications possible.

DALL-E 3, Midjourney: API only. Possibly higher quality, but black boxes.

For research and customization, SD is mainstream. For commercial quality, closed options.


10. Conditional Generation Techniques

10.1 Text-to-Image (basic)

Covered above. Text prompt → image.

10.2 Image-to-Image (img2img)

Start the initial latent not from random noise but from the latent of an existing image.

input_latent = vae.encode(input_image)
noisy = forward_diffuse(input_latent, strength)  # partial noise
output = denoise(noisy, text)

At strength = 0.3 it changes slightly; at 0.9 it nearly regenerates. The "edit this image this way" pattern.
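
With diffusers this pattern is the img2img pipeline; a sketch (input filename illustrative):

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("photo.png").convert("RGB").resize((512, 512))
# strength sets how much noise is applied before denoising begins
out = pipe("same scene, oil painting style", image=init, strength=0.6).images[0]
out.save("img2img.png")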

10.3 Inpainting

Regenerate only inside the mask. Noise the masked region, keep the rest original.

mask = user_drawn_mask              # 1 = regenerate, 0 = keep original
latent = vae.encode(image)
noisy = torch.randn_like(latent)    # start from pure noise
for t in reversed(timesteps):
    # denoise one step (the predicted noise drives the update)
    noise_pred = unet(noisy, t, text)
    noisy = denoise_step(noisy, noise_pred, t)
    # outside the mask, re-inject the original at this noise level (every step)
    noisy = noisy * mask + forward_diffuse(latent, t) * (1 - mask)

Edit specific regions. Photoshop's "Generative Fill" works this way.

10.4 ControlNet

Lvmin Zhang et al. 2023. Control structure via additional conditioning.

Idea: train a "control" network that copies the U-Net. The control network accepts a hint image (edges, poses, depth) and influences the U-Net.

Input image (pose skeleton)
   |
ControlNet copy of U-Net
   |
Influence on main U-Net
   |
Output: matches the prompt but follows the pose

ControlNet variants:

  • Canny: edge-based.
  • Depth: depth-map-based.
  • OpenPose: human-pose-based.
  • Scribble: hand sketch.
  • Segmentation: semantic mask.

The "specific composition + free style" pattern. Very useful.

10.5 LoRA (Low-Rank Adaptation)

Specialize a pretrained model with a small number of parameters.

Idea: add a low-rank update to each layer's weight W.

W' = W + \Delta W = W + BA

B in R^{d x r}, A in R^{r x k}, r << min(d, k).

Train only A and B. Keep original W frozen. Result:

  • Far fewer parameters: hundreds of MB → a few MB.
  • Fast training: learn a specific style/character from thousands of images.
  • Swappable: enable or disable multiple LoRAs.

Sites like Civitai have tens of thousands of LoRAs. "This anime character style," "this artist style," and so on.
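
A minimal sketch of the idea as a wrapped PyTorch layer (class name ours; the alpha/r scaling follows the LoRA paper):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # frozen base Linear plus a trainable low-rank update BA
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # W stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA starts as zero: a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)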

10.6 DreamBooth

Train on a specific subject (my dog, a specific person). With just 5-10 photos.

Input: a few photos of the dog
Training: bind a unique token like "sks dog" to this dog
Output: "sks dog on the beach" -> your dog at the beach

Combined with LoRA, lightweight DreamBooth is now mainstream.

10.7 Textual Inversion

Learn a new word. Train only the word's embedding while the model stays frozen.

Train a token "<new_concept>"
Its embedding captures a specific style/subject

Lighter but more limited.

10.8 InstructPix2Pix

Edit via natural-language instructions.

Input: image + "remove the person"
Output: image with the person removed

A ChatGPT-style editing interface.
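
diffusers ships a dedicated pipeline for this; a sketch using the public timbrooks/instruct-pix2pix checkpoint (filenames illustrative):

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

img = Image.open("street.png").convert("RGB")
# image_guidance_scale controls how closely the edit sticks to the input image
out = pipe("remove the person", image=img, image_guidance_scale=1.5).images[0]
out.save("edited.png")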


11. Consistency Models — Real-Time Generation

11.1 The Problem

Even DDIM needs 20-50 steps. For real-time (under 100ms), that's not enough.

11.2 Consistency Model (Song et al. 2023)

"Diffusion follows an ODE trajectory. Let's train the model to jump straight to x_0 from any point on that trajectory."

After training, high-quality generation in 1-4 steps.

11.3 Latent Consistency Model

LCM (2023): LDM + Consistency Distillation. Accelerates Stable Diffusion to 2-4 steps.

Existing SDXL: 25 steps x 150ms = 3.75s
LCM SDXL: 4 steps x 150ms = 600ms

Real-time interactive generation becomes possible. As a user types a prompt, results update live.

11.4 SDXL Turbo

Stability AI's 2023 release. Generates SDXL in 1 step. Quality drops slightly but is nearly instant.

11.5 Progressive Distillation

A technique that learns to compress many steps of the original (teacher) model into a single step. Repeatedly halve the step count.

1000 → 500 → 250 → ... → 1 step.

At each stage a student model approximates n teacher steps with 1 step.


12. Video Diffusion

12.1 Naive Approach

Diffuse each frame independently → no temporal consistency. Flicker and character drift.

12.2 Temporal Attention

Add temporal attention layers to the U-Net. Apply attention along the time axis to maintain consistency.

Spatial Attention (positions within an image)
Temporal Attention (across frames)

12.3 Video Diffusion Models

Stable Video Diffusion (2023): image → short video (2-4 seconds).

Sora (2024 OpenAI): minute-long high quality. Built on the Diffusion Transformer. Unified space-time processing in patches.

Sora's secrets:

  • DiT architecture (not U-Net).
  • Spacetime patches: 3D volumetric processing.
  • Massive scale (model + data).
  • High-quality captions (GPT-4 generates video descriptions).

13. 3D Diffusion

13.1 DreamFusion

Generate 3D objects using a 2D diffusion model.

Idea: SDS (Score Distillation Sampling). Optimize 3D object parameters (NeRF). The 2D diffusion model evaluates "is this view faithful to the prompt?"

Result: after hours of training, "text → 3D mesh."

13.2 Gaussian Splatting + Diffusion

From 2023+, Gaussian Splatting became the standard 3D representation. Diffusion generates Gaussian parameters.

13.3 Limitations

Much harder than 2D. Data scarcity, quality limits. Currently an active research area.


14. Implementation in Practice

14.1 Hugging Face Diffusers

A Python library. Stable Diffusion in a few lines:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")

14.2 Swapping the Scheduler

from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# now high quality in 20 steps
image = pipe(prompt, num_inference_steps=20).images[0]

14.3 Loading a LoRA

pipe.load_lora_weights("path/to/lora", weight_name="my_style.safetensors")
pipe.fuse_lora(lora_scale=0.7)

image = pipe("a cat in my_style style").images[0]

14.4 ControlNet

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny"
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet
)

from PIL import Image
edge_image = Image.open("canny_edges.png")
image = pipe(prompt="a modern house", image=edge_image).images[0]

14.5 Inference Optimization

  • FP16: torch_dtype=torch.float16. Half the memory, speed up.
  • Memory-efficient attention: pipe.enable_xformers_memory_efficient_attention() or PyTorch 2+'s scaled_dot_product_attention.
  • VAE tiling: prevents OOM during large-image generation.
  • TensorRT / OpenVINO: compiler optimizations. 2-3x speedup.

15. Ethics and Legal Issues

15.1 Copyright

LAION-5B contains copyrighted images. The legality of training on it is contested.

  • Getty Images: lawsuit against Stability AI (2023).
  • Andersen v. Stability AI: an artists' class action.

As of 2025, rulings are in progress. Outcomes will affect the entire AI industry.

15.2 Deepfakes

Face swaps, non-consensual imagery. As diffusion quality rises, abuse grows.

Countermeasures:

  • Content credentials (C2PA): metadata marking images as AI-generated.
  • Watermarking: invisible watermarks like Stable Signature.
  • Detection: AI-generated detection models.

15.3 Bias

Training-data bias is reflected in outputs: certain professions default to male, certain appearances are presented as the ideal. Research on measurement and mitigation is active.

15.4 Jobs

Concerns from illustrators and designers. "If AI draws, what do I do?" Debate between "adapt with new tools" and "regulate."


16. Future Directions

16.1 High Quality + Fast Generation

Further progress for consistency models. 1-step high-quality generation.

16.2 Control

Beyond simple text — precise control over pose, composition, color, style.

16.3 Consistency

Consistency across multiple images/frames. Maintaining character identity.

16.4 Multimodal

Unified generation across text + image + audio + video.

16.5 Smaller Models

Acceleration like LCM and Turbo, plus shrinking models. Running on mobile and edge.

16.6 Physical Simulation

A possibility shown by Sora: diffusion "implicitly" learns physical laws. Extension into a "world model."


17. Learning Resources

Papers:

  • DDPM (Ho et al. 2020) — essential.
  • Latent Diffusion (Rombach et al. 2021).
  • Classifier-Free Guidance (Ho & Salimans 2022).
  • DDIM (Song et al. 2020).
  • Stable Diffusion (2022 Stability AI).
  • DiT (Peebles & Xie 2023).
  • Consistency Model (Song et al. 2023).

Books/Guides:

  • "Hands-On Machine Learning" — Aurelien Geron (extended edition adds diffusion).
  • Lilian Weng's blog: "What are Diffusion Models?"
  • Jay Alammar's visualization blog.

Code:

  • Hugging Face Diffusers library (Python).
  • Stability AI's official repo (stability-ai/generative-models).
  • AUTOMATIC1111 Stable Diffusion Web UI.

Lectures:

  • Yang Song's MIT lecture (score-based models).
  • Andrew Ng's generative AI course.

18. Summary — One-Page Cheat Sheet

+-----------------------------------------------------+
|          Diffusion Models Cheat Sheet                |
+-----------------------------------------------------+
| Core idea:                                           |
|   add noise -> network predicts noise -> reverse     |
|                                                       |
| Forward Process:                                      |
|   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1-abar) * eps |
|   computable without learning                         |
|   Noise schedule: linear, cosine                      |
|                                                       |
| Reverse Process:                                      |
|   learn network eps_theta(x_t, t)                     |
|   Loss: MSE(eps, eps_theta)                           |
|   Simple, stable, scalable                            |
|                                                       |
| Mathematical view:                                    |
|   DDPM: variational bound                             |
|   Score matching: grad log p(x)                       |
|   SDE: dx = f dt + g dW                               |
|   ODE: probability flow                               |
|                                                       |
| U-Net architecture:                                   |
|   Encoder-decoder + skip                              |
|   Timestep embedding (sinusoidal)                     |
|   Self-attention + Cross-attention (text)             |
|   2024+: Diffusion Transformer (DiT)                 |
|                                                       |
| Sampling:                                             |
|   DDPM: 1000 steps, slow                              |
|   DDIM: deterministic, 20-50 steps                    |
|   DPM-Solver++: 10-25 steps                           |
|   Consistency Model: 1-4 steps                        |
|                                                       |
| Latent Diffusion (LDM):                               |
|   VAE encoder: image -> latent                        |
|   diffuse in latent (48x savings)                     |
|   VAE decoder: latent -> image                        |
|   foundation of Stable Diffusion                      |
|                                                       |
| Classifier-Free Guidance:                             |
|   Train: 10% null condition                           |
|   Infer: uncond + scale*(cond - uncond)               |
|   scale 7-12 = prompt faithful                        |
|                                                       |
| Conditional control:                                  |
|   Text: CLIP + cross-attention                        |
|   Image: img2img                                      |
|   Structure: ControlNet (canny, pose, depth)          |
|   Style: LoRA                                         |
|   Subject: DreamBooth                                 |
|   Edit: InstructPix2Pix, inpainting                   |
|                                                       |
| Production:                                           |
|   Stable Diffusion (open)                             |
|   DALL-E 3, Midjourney (closed)                       |
|   Flux (2024+, SOTA open)                             |
|                                                       |
| Extensions:                                           |
|   Video: Sora, Stable Video Diffusion                 |
|   3D: DreamFusion                                     |
|   Audio: AudioLDM                                     |
+-----------------------------------------------------+

19. Quiz

Q1. Why is the DDPM training loss just MSE?

A. Ho et al. (2020)'s key insight — after simplifying the variational lower bound mathematically, each step boils down to the "MSE between real noise and predicted noise". Originally there are complex KL-divergence terms, but after reparameterization and clever derivation, most terms vanish and only the noise-prediction error remains. This is why diffusion training is plain supervised learning rather than a GAN-style unstable min-max game. Because the loss is simple, it improves consistently with scale and is less sensitive to hyperparameters. A classic example of "simple things scale."

Q2. How does Latent Diffusion cut compute by 48x?

A. Compress the image with a VAE and diffuse in the latent space. Instead of the original 512x512x3 = 786,432-dim pixel space, denoising runs in the 64x64x4 = 16,384-dim VAE latent space. Each step processes a tensor 48x smaller → both compute and memory drop by 48x. The VAE is pre-trained and frozen during diffusion training. Loss is minor (because VAE reconstruction is high quality). This simple change shifted things from "needs a datacenter" to "runs on a consumer GPU," which is the backdrop that let Stable Diffusion democratize via open source. A case where a small structural change produces social impact.

Q3. How does Classifier-Free Guidance work?

A. At training, with 10-20% probability the text condition is replaced with null (empty condition), so the same model learns both conditional and unconditional generation. At inference, compute two predictions: noise_cond = model(x, text) and noise_uncond = model(x, null). Then extrapolate: guided = uncond + scale * (cond - uncond). scale > 1 means "exaggerate in the conditional direction" → more prompt-faithful. Intuitively, it amplifies "subtracting unconditional from conditional leaves the pure effect of the condition." Scale 7-12 is the sweet spot. This simple trick determines the practical quality of text-to-image — without it, prompts get ignored.

Q4. What's the relationship between the score-based view and the DDPM view?

A. Mathematically equivalent. Yang Song's score-based generative models (2019) and Ho's DDPM (2020) were developed independently but are two expressions of the same thing. Estimating the score function grad_x log p(x) and "predicting noise epsilon" are related by a constant factor: grad_{x_t} log p(x_t | x_0) = -epsilon / sqrt(1 - alpha_bar_t). Song et al. 2021 unified both in the SDE framework. In practice, DDPM notation is intuitive and code is simpler; theoretically, SDE/score is elegant and handy for continuous limits. Moving fluently between the two makes papers much easier to read.

Q5. Why is DDIM 50x faster than DDPM?

A. It makes sampling deterministic so you can skip steps. DDPM's reverse process is stochastic (adds random noise every step), so you must step through 1000 steps sequentially. DDIM reinterprets the same trained model as a deterministic ODE trajectory — the "direction from current x_t to x_{t-1}" is mathematically fixed. Because it's deterministic, jumping on a sub-sequence (e.g., [0, 50, 100, ..., 950] 20 steps) is mathematically sensible. Quality drops slightly but is nearly identical. Later DPM-Solver and DPM-Solver++ cut it further to 10-25 steps with more efficient ODE solvers. All build on the "DDIM deterministic path" idea.

Q6. How is ControlNet different from existing conditional generation?

A. Trains an additional copy of the U-Net for structural control. Existing cross-attention text conditioning controls "style/content" but is poor at "exact composition/pose." ControlNet makes a copy of the U-Net encoder, takes new inputs (canny edges, pose skeleton, depth map, etc.), and influences each layer of the original U-Net. The original U-Net stays frozen, and only the ControlNet portion trains → possible with small data. Result: "composition from this sketch + content from the prompt" combined precisely. Turn a photo's pose into manga style, or render from a building layout, etc. Essential when fine control is needed in practice. Standard SD feature since 2023.

Q7. Why can Consistency Models generate in 1-4 steps?

A. The model is trained to jump directly to x_0 from any point on the trajectory. Vanilla diffusion learns "a step cleaner at the current step," whereas the Consistency Model learns "from the current step, go all the way clean." This is possible because diffusion's probability-flow ODE trajectory is deterministic — a consistency condition is added to the loss requiring all points on the same trajectory to converge to the same x_0. After training, 1 step gives a decent result and 2-4 steps give high quality. Latent Consistency Model (LCM) applies this to LDM and runs SDXL in 4 steps. The technology that enables real-time interactive generation (results update live as the user types).


If you found this helpful, check out these related posts:

  • "Transformer Architecture Deep Dive" — the foundation for DiT.
  • "LLVM Compiler Infrastructure" — background on how MLIR became the foundation for AI compilers.
  • "RDMA and NCCL" — the networking foundation for training large diffusion models.
  • "CUDA and GPU Kernels" — accelerating diffusion inference.
