
Generative AI & Diffusion Models: Complete Guide from Stable Diffusion to Video Generation


Introduction

When Stable Diffusion was released in 2022, AI image generation entered the era of mass adoption. Yet few people can truly answer "why does an image emerge from noise?" with any depth.

This guide covers the complete lineage of generative models, from GAN to Consistency Models, then the mathematics of DDPM, the internal architecture of Stable Diffusion, ControlNet, LoRA fine-tuning, and finally video generation systems like Sora.


1. Generative Model Lineage: GAN → VAE → Flow → Diffusion → Consistency

1.1 GAN (Generative Adversarial Network, 2014)

Proposed by Ian Goodfellow, GANs learn through an adversarial game between a Generator and a Discriminator.

  • Strengths: High-quality image generation, fast sampling
  • Weaknesses: Unstable training (mode collapse), lack of diversity
# Basic GAN architecture
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, img_size * img_size * 3),
            nn.Tanh()
        )

    def forward(self, z):
        # Reshape flat output back to an image batch
        return self.net(z).view(-1, 3, self.img_size, self.img_size)

1.2 VAE (Variational Autoencoder, 2013)

VAEs learn a distribution in latent space, then decode samples from that distribution to reconstruct images.

Loss function: \mathcal{L} = \mathbb{E}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))

  • Strengths: Interpretable latent space, stable training
  • Weaknesses: Generated samples are blurry compared to GANs
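The loss above combines a reconstruction term with a KL regularizer, trained through the reparameterization trick. A minimal sketch (the layer sizes and MLP architecture here are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE over flattened images (illustrative only)."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term (negative log-likelihood as MSE)
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```

Minimizing this loss is equivalent to maximizing the ELBO shown above; the closed-form KL is what makes the Gaussian posterior choice convenient.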

1.3 Normalizing Flow (2015~)

Flow models stack invertible transformations to map a simple distribution to a complex one.

p(x) = p(z) \left|\det \frac{\partial f^{-1}}{\partial x}\right|

  • Strengths: Exact likelihood computation
  • Weaknesses: Architecture constraints (invertibility), memory inefficiency
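Exact likelihood follows directly from this change-of-variables formula. A toy example with an elementwise affine flow x = exp(s) * z + b over a standard-normal base distribution (the parameters s and b are arbitrary illustrative values):

```python
import math
import torch

# Invertible elementwise affine flow: x = f(z) = exp(s) * z + b, z ~ N(0, I)
s = torch.tensor([0.5, -0.3])  # log-scales (toy parameters)
b = torch.tensor([1.0, 2.0])   # shifts (toy parameters)

def log_prob_x(x):
    # Invert the flow: z = f^{-1}(x) = (x - b) * exp(-s)
    z = (x - b) * torch.exp(-s)
    # log p(z) under the standard normal base distribution
    log_pz = -0.5 * (z ** 2).sum() - 0.5 * len(s) * math.log(2 * math.pi)
    # Change of variables: log|det(df^{-1}/dx)| = -sum(s)
    return log_pz - s.sum()
```

Real flow models (RealNVP, Glow) stack many such invertible layers with learned, input-dependent scales, but the likelihood computation is exactly this pattern.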

1.4 Diffusion Models (2020~)

Diffusion models gradually add noise to data and learn to reverse that process. Combining score matching with SDE theory, they represent the current state of the art in generative modeling.

1.5 Consistency Models (2023)

Consistency Models solve Diffusion's slow sampling problem. They learn a consistency function that maps directly from any noise level to the original data in a single step.
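A minimal multistep sampling sketch, assuming a trained consistency function `f_theta(x, sigma)` that maps a noisy input at noise level sigma directly to an estimate of the clean data (the function, the sigma range, and the re-noising loop here are all illustrative, not the paper's exact schedule):

```python
import torch

def consistency_sample(f_theta, shape, sigma_max=80.0, sigma_min=0.002, steps=2):
    """Consistency Model sampling sketch: one direct jump to x0,
    plus optional re-noise/denoise refinement steps."""
    # Start from pure noise at the maximum noise level
    x = torch.randn(shape) * sigma_max
    x0 = f_theta(x, sigma_max)  # single-step estimate of the clean sample
    # Optional refinement: re-noise at a lower sigma and denoise again
    for sigma in torch.linspace(sigma_max, sigma_min, steps)[1:]:
        x = x0 + torch.randn_like(x0) * sigma
        x0 = f_theta(x, sigma.item())
    return x0
```

With steps=1 this is pure one-step generation; each extra step trades a little speed for quality.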


2. Diffusion Math: DDPM, Score Matching, SDE

2.1 DDPM Forward Process

DDPM (Denoising Diffusion Probabilistic Models) adds Gaussian noise to original data x0x_0 over T steps.

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)

Accumulating this, we can sample directly at any timestep t:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)

where \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).

import torch
import torch.nn.functional as F

class DDPMScheduler:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.T = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, x0, noise, t):
        """Add t steps of noise to x0 via reparameterization trick"""
        sqrt_alpha_bar = self.alpha_bar[t] ** 0.5
        sqrt_one_minus = (1 - self.alpha_bar[t]) ** 0.5
        # Reshape for broadcasting
        sqrt_alpha_bar = sqrt_alpha_bar.view(-1, 1, 1, 1)
        sqrt_one_minus = sqrt_one_minus.view(-1, 1, 1, 1)
        return sqrt_alpha_bar * x0 + sqrt_one_minus * noise

2.2 DDPM Reverse Process

The reverse process uses a neural network to predict the noise at each step:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)

The training objective is MSE between the added noise and predicted noise:

\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, t)\|^2\right]

def ddpm_training_step(model, scheduler, x0, optimizer):
    batch_size = x0.shape[0]
    # Sample random timesteps
    t = torch.randint(0, scheduler.T, (batch_size,))
    # Sample Gaussian noise
    noise = torch.randn_like(x0)
    # Add noise (forward process)
    xt = scheduler.add_noise(x0, noise, t)
    # Predict noise
    predicted_noise = model(xt, t)
    # MSE loss
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

2.3 Score Matching Perspective

The score function is the gradient of the log probability of the data distribution:

s_\theta(x) = \nabla_x \log p_\theta(x)

The noise prediction in Diffusion models is equivalent to learning the score function:

\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\; \nabla_{x_t} \log q(x_t)

2.4 SDE Perspective (Stochastic Differential Equation)

Yang Song's SDE framework generalizes diffusion models to continuous time.

Forward SDE: dx = f(x,t)\,dt + g(t)\,dW

Reverse SDE: dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{W}

This framework unifies DDPM, SMLD (NCSN), and ODE-based samplers under a single theoretical lens.
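The reverse SDE can be simulated with Euler-Maruyama discretization. A toy sketch for a VP (DDPM-like) SDE, where the score function is assumed given and the continuous linear beta(t) schedule mirrors the discrete one used earlier (constants are illustrative):

```python
import torch

def beta(t):
    # Continuous-time linear noise schedule (illustrative constants)
    return 0.1 + (20.0 - 0.1) * t

def reverse_sde_sample(score_fn, shape, n_steps=500):
    """Euler-Maruyama integration of the VP reverse SDE, from t=1 to t=0.
    score_fn(x, t) approximating grad_x log p_t(x) is assumed given."""
    dt = 1.0 / n_steps
    x = torch.randn(shape)  # sample from the prior at t=1
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        g2 = beta(t)  # g(t)^2 for the VP SDE
        # Reverse drift: f(x,t) - g(t)^2 * score, with f(x,t) = -0.5*beta(t)*x
        drift = -0.5 * g2 * x - g2 * score_fn(x, t)
        # Step backward in time, adding g(t)*sqrt(dt) noise
        x = x - drift * dt + (g2 * dt) ** 0.5 * torch.randn_like(x)
    return x
```

Dropping the noise term and halving the score coefficient yields the probability-flow ODE, which is what deterministic samplers like DDIM approximate.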


3. Stable Diffusion Internal Architecture

3.1 Overall Architecture

Stable Diffusion consists of three core components:

  1. VAE (Variational Autoencoder): Pixel space to/from latent space
  2. U-Net: Noise prediction in latent space
  3. CLIP Text Encoder: Converts text prompts to embeddings
from diffusers import StableDiffusionPipeline
import torch

# Basic pipeline usage
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Generate image
image = pipe(
    prompt="a serene mountain landscape at sunset, photorealistic",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=512,
    height=512
).images[0]

image.save("output.png")

3.2 Why Latent Space?

Running Diffusion directly in pixel space requires processing 512x512x3 = 786,432 dimensions. SD's VAE compresses this to 64x64x4 = 16,384 dimensions.

  • Computation cost: ~48x reduction
  • Quality loss: Minimized through VAE's perceptual loss
# Visualizing the VAE latent space
from diffusers import AutoencoderKL
from PIL import Image
import torchvision.transforms as T

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae = vae.to("cuda").eval()

transform = T.Compose([T.Resize((512, 512)), T.ToTensor(),
                        T.Normalize([0.5], [0.5])])

img = transform(Image.open("input.png")).unsqueeze(0).to("cuda")
with torch.no_grad():
    # Pixel → Latent (encoding)
    latent = vae.encode(img).latent_dist.sample()
    latent = latent * vae.config.scaling_factor
    print(f"Latent space shape: {latent.shape}")  # [1, 4, 64, 64]

3.3 CLIP Text Encoder

CLIP is trained on image-text pairs. In SD, only the text encoder is used, converting prompts into 77 tokens × 768-dimensional embeddings.

from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a fantasy castle in the clouds"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]
print(f"Text embedding shape: {text_emb.shape}")  # [1, 77, 768]

3.4 CFG (Classifier-Free Guidance)

CFG controls the strength of conditional generation. A higher guidance_scale follows the prompt more strictly; a lower value yields more diversity.

\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})
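In code, CFG is typically implemented by batching the conditional and unconditional branches through a single U-Net call, then combining the two noise predictions. A sketch (the `unet(x, t, emb)` signature is a placeholder for illustration, not the exact diffusers API):

```python
import torch

def guided_noise(unet, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    # Duplicate the latent so both branches share one forward pass
    x_in = torch.cat([x_t, x_t])
    emb = torch.cat([uncond_emb, cond_emb])
    eps_uncond, eps_cond = unet(x_in, t, emb).chunk(2)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Note that w = 1 recovers plain conditional prediction, while w = 0 ignores the prompt entirely; values around 7-8 are common defaults for SD.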


4. LoRA & DreamBooth Fine-tuning

4.1 How LoRA Works

Instead of updating the full weight matrix W \in \mathbb{R}^{d \times k}, LoRA represents the change as a product of two low-rank matrices:

W' = W + \Delta W = W + BA

where B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, and r \ll \min(d, k).

Typically r = 4-16, training only about **0.1-1%** of the total parameters.

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model
import torch

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling parameter
    target_modules=["to_q", "to_v", "to_k", "to_out.0"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA to model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
)
unet_lora = get_peft_model(pipe.unet, lora_config)
unet_lora.print_trainable_parameters()
# Trainable params: ~3M / Total: ~860M (about 0.3%)

4.2 DreamBooth Fine-tuning

DreamBooth learns a specific subject from just 3-10 images, using a rare token (e.g., "sks") as the subject identifier.

from diffusers import DiffusionPipeline
import torch

# Load DreamBooth-trained model
pipe = DiffusionPipeline.from_pretrained(
    "./dreambooth-sks-dog",  # Trained checkpoint
    torch_dtype=torch.float16
).to("cuda")

# Generate the specific dog using "sks dog"
images = pipe(
    "a photo of sks dog in front of the Eiffel Tower",
    num_inference_steps=50,
    guidance_scale=7.5
).images

5. ControlNet & IP-Adapter

5.1 ControlNet Architecture

ControlNet copies the U-Net's encoder as a separate control network and uses zero convolutions to protect the original SD weights.

Supported conditioning types:

  • Depth map: Spatial depth information
  • Canny edge: Outline/edge preservation
  • OpenPose: Human pose control
  • Scribble: Rough sketch to detailed image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np

# Load ControlNet model (Canny edge)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Extract Canny edges
image = load_image("input.png")
image_np = np.array(image)
low_threshold, high_threshold = 100, 200
canny_image = cv2.Canny(image_np, low_threshold, high_threshold)
canny_image = canny_image[:, :, None]
canny_image = np.concatenate([canny_image] * 3, axis=2)

# ControlNet inference
result = pipe(
    prompt="a beautiful landscape, detailed, 8k",
    image=canny_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=1.0,
).images[0]

5.2 IP-Adapter & InstantID

IP-Adapter uses a reference image's style and content as conditions alongside the text prompt.

InstantID maintains consistent identity from a single portrait photo while generating diverse styles. It combines ControlNet (pose control) with IP-Adapter (facial features).
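With recent versions of diffusers, IP-Adapter can be attached to an existing SD pipeline. A usage sketch (the repo id and weight file follow the commonly published `h94/IP-Adapter` release; verify names against your installed diffusers version):

```python
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load IP-Adapter weights into the pipeline's cross-attention layers
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # 0 = text only, 1 = follow reference closely

style_ref = load_image("reference.png")
image = pipe(
    prompt="a portrait in the style of the reference",
    ip_adapter_image=style_ref,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
```

The scale parameter balances the text prompt against the reference image, which is the practical knob for style transfer versus prompt fidelity.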


6. Advanced Image Editing: InstructPix2Pix

InstructPix2Pix edits images using natural language instructions — commands like "change the horse to a zebra".

from diffusers import StableDiffusionInstructPix2PixPipeline
import torch
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16,
    safety_checker=None
).to("cuda")

image = load_image("horse.png")
result = pipe(
    "turn the horse into a zebra",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # Fidelity to original image
    guidance_scale=7.5          # Text instruction strength
).images[0]

7. Video Generation: Sora, CogVideoX

7.1 Sora's Technical Innovations

OpenAI's Sora uses a Video Diffusion Transformer architecture, treating video as a sequence of "spacetime patches". Key innovations:

  1. Spatial-temporal attention: Simultaneous attention over space and time
  2. Variable resolution training: Learns across multiple resolutions and frame rates
  3. Recaptioning: Enhanced quality of video captions
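Sora's exact patchification is not public, but the idea of flattening a video into spacetime patches can be sketched with plain tensor reshapes (the patch sizes below are illustrative, not Sora's actual values):

```python
import torch

def to_spacetime_patches(video, pt=2, ph=8, pw=8):
    """Split a video tensor [T, C, H, W] into flattened spacetime patches
    of shape [N, pt*ph*pw*C] -- a simplified sketch of patchification."""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Split each axis into (blocks, patch_size)
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Reorder to (T/pt, H/ph, W/pw, pt, ph, pw, C): one patch per leading index
    x = x.permute(0, 3, 5, 1, 4, 6, 2)
    return x.reshape(-1, pt * ph * pw * C)
```

Each resulting row is one token for the Diffusion Transformer, which is what lets a single model handle variable resolutions and durations: only the number of tokens changes.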

7.2 Maintaining Temporal Consistency

The biggest challenge in video generation is temporal consistency.

  • Motion prior: Learning the distribution of natural motion
  • Cross-frame attention: Sharing features across frames
  • Optical flow guidance: Controlling motion with optical flow
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A serene lake with rippling water, birds flying overhead",
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]

8. Music and Audio Generation

8.1 MusicGen (Meta)

MusicGen is a language-model-based system for generating music from text descriptions.

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(duration=30)  # Generate 30 seconds

descriptions = ["happy jazz piano with upbeat rhythm"]
wav = model.generate(descriptions)
torchaudio.save("music.wav", wav[0].cpu(), sample_rate=32000)

8.2 AudioLM Architecture

Google's AudioLM uses hierarchical tokenization:

  • Semantic tokens (w2v-BERT): Semantic content
  • Coarse acoustic tokens (SoundStream): Coarse acoustics
  • Fine acoustic tokens (SoundStream): Fine-grained acoustics

8.3 VALL-E Speech Synthesis

Microsoft's VALL-E clones a speaker's voice from just a 3-second audio sample. It autoregressively generates speech codec tokens, much like a language model generating text.


9. Production Deployment

9.1 Optimizing with diffusers

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

# Memory optimizations
pipe.enable_attention_slicing()           # Attention slicing
pipe.enable_vae_slicing()                 # VAE slicing
pipe.enable_model_cpu_offload()           # CPU offload

# xformers acceleration (if installed)
try:
    pipe.enable_xformers_memory_efficient_attention()
    print("xformers enabled")
except Exception:
    print("xformers not found, using default attention")

9.2 ComfyUI API Integration

import json
import urllib.request

def queue_prompt(prompt_workflow, server_address="127.0.0.1:8188"):
    """Execute workflow via ComfyUI API"""
    p = {"prompt": prompt_workflow}
    data = json.dumps(p).encode("utf-8")
    req = urllib.request.Request(
        f"http://{server_address}/prompt",
        data=data,
        headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read())

# ComfyUI workflow (JSON format)
workflow = {
    "1": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "v1-5-pruned-emaonly.ckpt"}
    },
    "2": {
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "a beautiful sunset over mountains",
            "clip": ["1", 1]
        }
    },
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["4", 0],
            "latent_image": ["5", 0],
            "seed": 42,
            "steps": 30,
            "cfg": 7.5,
            "sampler_name": "euler",
            "scheduler": "karras",
            "denoise": 1.0
        }
    },
    # Negative prompt encoder, referenced by the KSampler as node "4"
    "4": {
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}
    },
    # Empty latent to denoise, referenced by the KSampler as node "5"
    "5": {
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 512, "height": 512, "batch_size": 1}
    }
}

result = queue_prompt(workflow)
print(f"Prompt ID: {result['prompt_id']}")

9.3 ONNX/TensorRT Optimization

from diffusers import OnnxStableDiffusionPipeline

# ONNX Runtime inference (works on CPU and GPU)
pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="onnx",
    provider="CUDAExecutionProvider",
)

image = pipe("a mountain lake at dawn").images[0]

10. Quiz: Test Your Diffusion Model Knowledge

Q1. Why does DDPM use Gaussian noise in the forward process? What is the mathematical reason?

Answer: Gaussians are closed under addition, and the forward process converges to a tractable standard-normal prior.

Explanation: There are three reasons. First, the Gaussian distribution is closed under addition: the sum of independent Gaussians is again Gaussian, so many small noising steps compose into a single Gaussian step. Second, the reparameterization trick enables direct sampling at any timestep t: x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon. Third, by construction of the noise schedule, x_T approaches a standard Gaussian as T grows, making the endpoint of the forward process a well-defined prior.

Q2. Why does Stable Diffusion's U-Net operate in latent space rather than pixel space?

Answer: Computational efficiency combined with semantic compression.

Explanation: Running diffusion in pixel space (512x512x3 = 786,432 dimensions) is computationally explosive. Compressing to a 64x64x4 latent space with the VAE cuts the total dimensionality by roughly 48x. Additionally, the VAE's latent space encodes semantic features rather than pixel-level detail, allowing high-quality image generation in fewer steps than pixel-space diffusion.

Q3. Why is LoRA more efficient than full weight fine-tuning?

Answer: Minimal parameter updates through low-rank decomposition.

Explanation: Updating the full weight matrix W \in \mathbb{R}^{d \times k} requires training d \times k parameters. LoRA decomposes the update as \Delta W = BA (B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, r \ll \min(d, k)), training only (d + k) \times r parameters. With r = 16 and d = k = 768, that is 24,576 instead of 589,824 parameters, roughly a 96% reduction. Because the original weights remain frozen, multiple LoRA adapters can be swapped in for different styles.

Q4. How does ControlNet's architecture accept additional conditioning signals like depth maps and edge maps?

Answer: Trainable copy of the encoder plus zero convolutions.

Explanation: ControlNet copies the encoder blocks of SD's U-Net into a separate control network. The key innovation is zero convolution (1x1 convolution initialized to zero weights): at training start the control signal has zero influence, protecting the original SD quality. As training progresses, zero convolution weights grow and the control effect strengthens. Conditioning images (depth maps, edge maps) are processed through a separate small encoder before entering the control network.

Q5. How can Consistency Models reduce sampling steps compared to DDPM?

Answer: Learning a consistency function that maps any noise level directly to the original data.

Explanation: DDPM requires traversing all T = 1000 reverse steps (even DDIM needs 20-50 steps). Consistency Models learn a consistency function f_\theta(x_t, t) \approx x_0 that must output the same x_0 for every point x_t on the same trajectory (the consistency condition). This enables high-quality sampling in just 1-2 steps, a 500-1000x reduction in denoiser evaluations relative to DDPM's 1000 steps.


Conclusion

Diffusion models combine mathematical elegance with practical performance, forming the core of today's generative AI. From DDPM's Gaussian mathematics to Stable Diffusion's latent space, ControlNet's control mechanism, LoRA's efficient training, and Sora's video generation — all of these technologies stand on one beautiful mathematical framework.

Recommended learning path:

  1. Read the DDPM paper (Ho et al., 2020) in full
  2. Work through HuggingFace diffusers tutorials hands-on
  3. Run ControlNet and LoRA fine-tuning yourself
  4. Build custom workflows with ComfyUI