
Generative AI & Diffusion Models Complete Guide: Stable Diffusion, ControlNet, and Video Generation

Introduction

With the public release of Stable Diffusion in 2022, image-generation AI entered the era of mass adoption. Yet few people can properly answer the question, "why does a picture emerge from noise?"

This guide traces the lineage of generative models from GANs to Consistency Models, then dissects DDPM's mathematics, Stable Diffusion's internal architecture, ControlNet, LoRA fine-tuning, and video generation models like Sora.


1. Generative Model Lineage: GAN → VAE → Flow → Diffusion → Consistency

1.1 GAN (Generative Adversarial Network, 2014)

Proposed by Ian Goodfellow, GANs learn through an adversarial game between a Generator and a Discriminator.

  • Strengths: high-quality image generation, fast sampling
  • Weaknesses: unstable training (mode collapse), lack of diversity
# Basic GAN architecture
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, img_size * img_size * 3),
            nn.Tanh()
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

1.2 VAE (Variational Autoencoder, 2013)

In a VAE, the encoder learns a distribution over the latent space and the decoder reconstructs images from samples drawn from that distribution.

Loss function (the ELBO): \mathcal{L} = \mathbb{E}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))

  • Strengths: interpretable latent space, stable training
  • Weaknesses: samples are blurrier than GAN outputs
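The encode, sample, decode loop and the ELBO loss above can be sketched in a few lines. This is a minimal illustration of my own; the layer sizes and the MSE reconstruction term are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps keeps sampling differentiable
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I))
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

x = torch.rand(8, 784)
model = TinyVAE()
recon, mu, logvar = model(x)
loss = vae_loss(recon, x, mu, logvar)
```

The closed-form KL term is exactly why the encoder outputs mu and logvar instead of a sample directly.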

1.3 Normalizing Flow (2015~)

Flow models stack invertible transformations to map a simple distribution into a complex one.

p(x) = p(z) \left| \det \frac{\partial f^{-1}}{\partial x} \right|

  • Strengths: exact likelihood computation
  • Weaknesses: architectural constraints (invertibility), memory inefficiency
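The change-of-variables formula above can be verified with a single affine transform, where the exact log-likelihood is tractable in closed form. This toy example is mine, not from any flow library.

```python
import torch
import math

# One invertible affine layer: x = f(z) = a * z + b, so z = f^{-1}(x) = (x - b) / a
a, b = torch.tensor(2.0), torch.tensor(1.0)

def log_prob_x(x):
    # Base density p(z) = N(0, 1), evaluated at z = f^{-1}(x)
    z = (x - b) / a
    log_pz = -0.5 * (z ** 2) - 0.5 * math.log(2 * math.pi)
    # |det df^{-1}/dx| = 1/|a| for a scalar affine map
    log_det = -torch.log(torch.abs(a))
    return log_pz + log_det

x = torch.tensor(1.0)  # then z = 0, the mode of the base distribution
print(float(log_prob_x(x)))  # ≈ -1.6121, i.e. log N(x; mean=1, std=2) at x=1
```

Stacking such layers (with learned parameters) gives the exact likelihoods that GANs and VAEs cannot provide, at the cost of the invertibility constraint.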

1.4 Diffusion Models (2020~)

Diffusion models gradually add noise to data, then learn the reverse process. Combining score matching with SDE theory, they are the current state of the art in generative modeling.

1.5 Consistency Models (2023)

Consistency Models address Diffusion's slow sampling. They learn a consistency function that maps directly from any noise level back to the original data.
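The consistency function must satisfy the boundary condition f(x, ε) = x at the smallest noise level. In the Consistency Models paper this is enforced with a skip/output scaling (c_skip, c_out); the sketch below uses that parameterization with a placeholder network standing in for a trained denoiser, so treat the constants as illustrative.

```python
import torch

sigma_data, eps = 0.5, 0.002  # data std and minimum noise level (paper defaults)

def c_skip(t):
    return sigma_data ** 2 / ((t - eps) ** 2 + sigma_data ** 2)

def c_out(t):
    return sigma_data * (t - eps) / (t ** 2 + sigma_data ** 2) ** 0.5

def consistency_fn(net, x, t):
    # f(x, t) = c_skip(t) * x + c_out(t) * net(x, t)
    # At t = eps: c_skip = 1, c_out = 0, so f(x, eps) = x exactly.
    return c_skip(t) * x + c_out(t) * net(x, t)

net = lambda x, t: torch.zeros_like(x)  # placeholder for a trained network
x = torch.randn(4)
out = consistency_fn(net, x, torch.tensor(eps))
assert torch.allclose(out, x)  # boundary condition holds by construction
```

Because the boundary condition is built into the parameterization rather than learned, a single forward pass from pure noise already lands on the data manifold once training converges.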


2. Diffusion Math: DDPM, Score Matching, SDE

2.1 DDPM Forward Process

The forward process of DDPM (Denoising Diffusion Probabilistic Models) adds Gaussian noise to the original data x_0 over T steps.

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)

Accumulating these steps lets us sample directly at any timestep t:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)

where \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).

import torch
import torch.nn.functional as F

class DDPMScheduler:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.T = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, x0, noise, t):
        """Add noise to x0 at timestep t (reparameterization trick)"""
        sqrt_alpha_bar = self.alpha_bar[t] ** 0.5
        sqrt_one_minus = (1 - self.alpha_bar[t]) ** 0.5
        # Reshape for broadcasting
        sqrt_alpha_bar = sqrt_alpha_bar.view(-1, 1, 1, 1)
        sqrt_one_minus = sqrt_one_minus.view(-1, 1, 1, 1)
        return sqrt_alpha_bar * x0 + sqrt_one_minus * noise

2.2 DDPM Reverse Process

In the reverse process, a neural network predicts the noise at each step:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)

The training objective is the MSE between the added noise and the predicted noise:

\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, t)\|^2\right]

def ddpm_training_step(model, scheduler, x0, optimizer):
    batch_size = x0.shape[0]
    # Sample random timesteps
    t = torch.randint(0, scheduler.T, (batch_size,))
    # Sample Gaussian noise
    noise = torch.randn_like(x0)
    # Add noise (forward process)
    xt = scheduler.add_noise(x0, noise, t)
    # Predict the noise
    predicted_noise = model(xt, t)
    # MSE loss
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

2.3 The Score Matching Perspective

The score function is the gradient of the log-probability of the data distribution:

s_\theta(x) = \nabla_x \log p_\theta(x)

The noise prediction in Diffusion models is in fact equivalent to learning the score function:

\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log q(x_t)
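For a 1-D standard-normal toy case this equivalence can be checked in closed form: if x_0 ~ N(0,1) then x_t is again N(0,1), the true score is -x_t, and the optimal noise prediction E[ε | x_t] = sqrt(1-ᾱ_t)·x_t. The check below is my own illustration, with an arbitrary ᾱ_t.

```python
import torch

alpha_bar = torch.tensor(0.3)  # example cumulative alpha at some timestep

# With x0 ~ N(0,1): x_t = sqrt(a)*x0 + sqrt(1-a)*eps is again N(0,1),
# so the true score is grad_x log q(x_t) = -x_t.
x_t = torch.linspace(-2, 2, 5)
score = -x_t

# Optimal noise predictor: posterior mean of eps given x_t (jointly Gaussian)
eps_opt = torch.sqrt(1 - alpha_bar) * x_t

# The claimed identity: eps_theta = -sqrt(1 - alpha_bar) * score
assert torch.allclose(eps_opt, -torch.sqrt(1 - alpha_bar) * score)
```

So a network trained to predict the noise is, up to a known scale factor, a score estimator; this is what connects DDPM to the SDE view in the next section.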

2.4 The SDE Perspective (Stochastic Differential Equation)

Yang Song's SDE framework generalizes Diffusion to continuous time.

Forward SDE: dx = f(x,t)\,dt + g(t)\,dW

Reverse SDE: dx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]\,dt + g(t)\,d\bar{W}

This framework unifies DDPM, SMLD (NCSN), and ODE samplers under a single view.
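The reverse SDE above can be integrated numerically with the Euler-Maruyama scheme. In this toy sketch of mine the data distribution is N(0,1), which the VP-style forward SDE (f = -0.5·β·x, g = sqrt(β)) leaves invariant, so the true score -x is known and no network is needed; the step counts are arbitrary.

```python
import torch

def reverse_sde_sample(score, n=10000, steps=500, beta=1.0):
    """Euler-Maruyama for the reverse SDE
    dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) dWbar,
    with f(x,t) = -0.5*beta*x and g(t) = sqrt(beta)."""
    dt = 1.0 / steps
    x = torch.randn(n)  # start from the prior at t = 1
    for _ in range(steps):
        drift = -0.5 * beta * x - beta * score(x)
        # Integrate backwards in time: subtract drift*dt, add fresh noise
        x = x - drift * dt + (beta * dt) ** 0.5 * torch.randn(n)
    return x

# Toy data N(0,1): the forward SDE leaves it invariant, so score(x, t) = -x
samples = reverse_sde_sample(lambda x: -x)
print(samples.mean().item(), samples.std().item())  # close to 0 and 1
```

Swapping the learned score network in for the analytic score, and dropping the noise term, turns this same loop into a deterministic probability-flow ODE sampler.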


3. Inside Stable Diffusion

3.1 Overall Architecture

Stable Diffusion consists of three core components:

  1. VAE (Variational Autoencoder): converts pixel space ↔ latent space
  2. U-Net: predicts noise in latent space
  3. CLIP Text Encoder: converts text prompts into embeddings
from diffusers import StableDiffusionPipeline
import torch

# Basic pipeline usage
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Generate an image
image = pipe(
    prompt="a serene mountain landscape at sunset, photorealistic",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=512,
    height=512
).images[0]

image.save("output.png")

3.2 Why Latent Space?

Running Diffusion directly in pixel space means handling 512x512x3 = 786,432 dimensions. SD's VAE instead compresses this to a 64x64x4 = 16,384-dimensional latent space.

  • Compute cost: roughly 48x fewer dimensions
  • Quality loss: minimized thanks to the VAE's perceptual loss
# Visualize the VAE latent space
from diffusers import AutoencoderKL
from PIL import Image
import torchvision.transforms as T

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae = vae.to("cuda").eval()

transform = T.Compose([T.Resize((512, 512)), T.ToTensor(),
                        T.Normalize([0.5], [0.5])])

img = transform(Image.open("input.png")).unsqueeze(0).to("cuda")
with torch.no_grad():
    # Pixel → latent (encoding)
    latent = vae.encode(img).latent_dist.sample()
    latent = latent * vae.config.scaling_factor
    print(f"Latent shape: {latent.shape}")  # [1, 4, 64, 64]

3.3 CLIP Text Encoder

CLIP is a model trained on image-text pairs. SD uses only its text encoder, converting a prompt into a 77-token × 768-dimensional embedding.

from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a fantasy castle in the clouds"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]
print(f"Text embedding shape: {text_emb.shape}")  # [1, 77, 768]

3.4 CFG (Classifier-Free Guidance)

CFG controls how strongly generation follows the condition. A higher guidance_scale follows the prompt more closely; a lower value yields more diversity.

\epsilon_{guided} = \epsilon_{uncond} + w \cdot (\epsilon_{cond} - \epsilon_{uncond})
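Inside the denoising loop this amounts to two noise predictions per step, one conditional and one unconditional, combined per the formula above. The `unet` and the embeddings in this sketch are placeholders of mine, not the diffusers internals.

```python
import torch

def cfg_noise(unet, latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    # Two predictions: with the prompt embedding and with the empty prompt
    eps_cond = unet(latents, t, cond_emb)
    eps_uncond = unet(latents, t, uncond_emb)
    # eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy check with a fake "unet" that just echoes the conditioning mean
fake_unet = lambda x, t, emb: emb.mean() * torch.ones_like(x)
latents = torch.zeros(1, 4, 8, 8)
cond, uncond = torch.tensor([1.0]), torch.tensor([0.0])
out = cfg_noise(fake_unet, latents, 0, cond, uncond, guidance_scale=7.5)
assert torch.allclose(out, 7.5 * torch.ones_like(latents))
```

In practice the two predictions are usually computed in one batched forward pass by concatenating the conditional and unconditional embeddings.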


4. LoRA & DreamBooth Fine-tuning

4.1 How LoRA Works

Instead of updating the full weight matrix W \in \mathbb{R}^{d \times k} directly, LoRA expresses the change as a product of two low-rank matrices:

W' = W + \Delta W = W + BA

where B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, and r \ll \min(d, k).

Typically r = 4~16, so only about 0.1~1% of the total parameters are trained.
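The decomposition above can be written as a drop-in wrapper around a frozen nn.Linear. This is a from-scratch sketch of my own; peft's actual implementation differs in details such as dropout and merging.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # W (and its bias) stay frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A: r x k
        self.B = nn.Parameter(torch.zeros(d, r))         # B: d x r, zero init
        self.scale = alpha / r

    def forward(self, x):
        # W'x = Wx + scale * (BA)x; zero-init B makes delta-W zero at start
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # (768 + 768) * 16 = 24576
```

Initializing B to zero is what lets training start from the unmodified pretrained model, mirroring ControlNet's zero-convolution trick later in this guide.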

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model
import torch

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling parameter
    target_modules=["to_q", "to_v", "to_k", "to_out.0"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA to the model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
)
unet_lora = get_peft_model(pipe.unet, lora_config)
unet_lora.print_trainable_parameters()
# Trainable params: ~3M / total: ~860M (about 0.3%)

4.2 DreamBooth Fine-tuning

DreamBooth learns a specific subject from just 3-10 images, using a rare token (e.g., "sks") as the subject identifier.

from diffusers import DiffusionPipeline
import torch

# Load a DreamBooth-trained model
pipe = DiffusionPipeline.from_pretrained(
    "./dreambooth-sks-dog",  # trained checkpoint
    torch_dtype=torch.float16
).to("cuda")

# Generate the specific dog via "sks dog"
images = pipe(
    "a photo of sks dog in front of the Eiffel Tower",
    num_inference_steps=50,
    guidance_scale=7.5
).images

5. ControlNet & IP-Adapter

5.1 ControlNet Architecture

ControlNet copies the encoder half of the U-Net into a separate control network and uses zero convolutions to protect SD's original weights.

Supported conditioning:

  • Depth map: spatial depth information
  • Canny edge: preserves outlines
  • OpenPose: human pose control
  • Scribble: rough sketch → detailed image
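The zero convolution mentioned above is simply a 1x1 convolution whose weights and bias start at zero, so the control branch contributes nothing at initialization. A minimal sketch (channel count is illustrative):

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution initialized to zero weights and zero bias
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# At step 0 the control features are multiplied away entirely, so the
# frozen SD backbone's output is untouched until training moves the weights.
zc = zero_conv(4)
control_features = torch.randn(1, 4, 64, 64)
out = zc(control_features)
assert torch.equal(out, torch.zeros_like(out))
```

This is why ControlNet training cannot degrade the base model early on: the gradient gradually "opens" the control pathway instead of perturbing it from the start.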
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np

# Load the ControlNet model (Canny edge)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Extract Canny edges
image = load_image("input.png")
image_np = np.array(image)
low_threshold, high_threshold = 100, 200
canny_image = cv2.Canny(image_np, low_threshold, high_threshold)
canny_image = canny_image[:, :, None]  # add a channel axis
canny_image = np.concatenate([canny_image] * 3, axis=2)

# ControlNet inference
result = pipe(
    prompt="a beautiful landscape, detailed, 8k",
    image=canny_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=1.0,
).images[0]

5.2 IP-Adapter & InstantID

IP-Adapter conditions generation on a reference image's style and content alongside the text prompt.

InstantID generates diverse styles while keeping a consistent identity from a single face photo. It combines ControlNet (pose control) with IP-Adapter (facial features).


6. Advanced Image Editing: InstructPix2Pix

InstructPix2Pix edits images from text instructions: it understands commands like "turn the horse into a zebra".

from diffusers import StableDiffusionInstructPix2PixPipeline
import torch
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16,
    safety_checker=None
).to("cuda")

image = load_image("horse.png")
result = pipe(
    "turn the horse into a zebra",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # fidelity to the original image
    guidance_scale=7.5          # strength of the text instruction
).images[0]

7. Video Generation: Sora, CogVideoX

7.1 Sora's Technical Innovations

OpenAI's Sora uses a Video Diffusion Transformer architecture, processing video as a sequence of "spacetime patches". Key innovations:

  1. Spatial-temporal attention: joint attention over space and time
  2. Variable resolution training: trains across resolutions and frame rates
  3. Recaptioning: improves video caption quality
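The "spacetime patches" idea can be sketched with plain tensor reshapes: a video tensor is cut into t x h x w blocks and each block is flattened into one token. The patch sizes below are illustrative, not Sora's actual configuration.

```python
import torch

def spacetime_patchify(video, pt=2, ph=16, pw=16):
    """video: [T, C, H, W] -> tokens: [num_patches, pt*ph*pw*C]"""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Split each axis into (num_blocks, block_size)
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Group the block-index axes together, then flatten each block
    x = x.permute(0, 3, 5, 1, 4, 6, 2)  # [T/pt, H/ph, W/pw, pt, ph, pw, C]
    return x.reshape(-1, pt * ph * pw * C)

video = torch.randn(16, 3, 256, 256)
tokens = spacetime_patchify(video)
print(tokens.shape)  # [8 * 16 * 16, 2 * 16 * 16 * 3] = [2048, 1536]
```

Because the token count is just a product of block counts, videos of different lengths and resolutions map to different-length sequences of the same token dimension, which is what makes variable-resolution training natural for a Transformer.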

7.2 Maintaining Temporal Consistency

The biggest challenge in video generation is temporal consistency.

  • Motion prior: learn the distribution of natural motion
  • Cross-frame attention: share features across frames
  • Optical flow guidance: control motion with optical flow
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A serene lake with rippling water, birds flying overhead",
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]

8. Music & Audio Generation

8.1 MusicGen (Meta)

MusicGen is a language-model-based system that generates music from text.

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(duration=30)  # generate 30 seconds

descriptions = ["happy jazz piano with upbeat rhythm"]
wav = model.generate(descriptions)
torchaudio.save("music.wav", wav[0].cpu(), sample_rate=32000)

8.2 AudioLM Architecture

Google's AudioLM uses hierarchical tokenization:

  • Semantic tokens (w2v-BERT): semantic content
  • Coarse acoustic tokens (SoundStream): coarse acoustics
  • Fine acoustic tokens (SoundStream): fine-grained acoustics

8.3 VALL-E Speech Synthesis

Microsoft's VALL-E clones a speaker's voice from just a 3-second sample. Like a language model, it autoregressively generates speech codec tokens.


9. Production Deployment

9.1 Optimizing with the diffusers Library

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

# Memory optimizations
pipe.enable_attention_slicing()           # Attention slicing
pipe.enable_vae_slicing()                 # VAE slicing
pipe.enable_model_cpu_offload()           # CPU offload

# xformers acceleration (if installed)
try:
    pipe.enable_xformers_memory_efficient_attention()
    print("xformers enabled")
except Exception:
    print("xformers not available, using default attention")

9.2 Calling the ComfyUI API

import json
import urllib.request

def queue_prompt(prompt_workflow, server_address="127.0.0.1:8188"):
    """ComfyUI API로 워크플로우 실행"""
    p = {"prompt": prompt_workflow}
    data = json.dumps(p).encode("utf-8")
    req = urllib.request.Request(
        f"http://{server_address}/prompt",
        data=data,
        headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read())

# ComfyUI workflow (JSON format); nodes 4 and 5 supply the negative
# prompt and the empty latent that the KSampler references
workflow = {
    "1": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "v1-5-pruned-emaonly.ckpt"}
    },
    "2": {
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "a beautiful sunset over mountains",
            "clip": ["1", 1]
        }
    },
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["4", 0],
            "latent_image": ["5", 0],
            "seed": 42,
            "steps": 30,
            "cfg": 7.5,
            "sampler_name": "euler",
            "scheduler": "karras",
            "denoise": 1.0
        }
    },
    "4": {
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}
    },
    "5": {
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 512, "height": 512, "batch_size": 1}
    }
}

result = queue_prompt(workflow)
print(f"Prompt ID: {result['prompt_id']}")

9.3 ONNX/TensorRT Optimization

from diffusers import OnnxStableDiffusionPipeline

# ONNX Runtime inference (works on CPU or GPU)
pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="onnx",
    provider="CUDAExecutionProvider",
)

image = pipe("a mountain lake at dawn").images[0]

10. Quiz: Check Your Understanding of Generative AI & Diffusion

Q1. What is the mathematical reason DDPM uses Gaussian noise in the forward process?

Answer: The Central Limit Theorem (CLT) and the reproductive property of Gaussians

Explanation: There are three reasons. First, the Gaussian family is closed under addition (the sum of two Gaussians is Gaussian). Second, the reparameterization trick allows direct sampling at any timestep t: x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon. Third, by the Central Limit Theorem, as T → ∞ the forward process converges to a standard Gaussian regardless of the data distribution.

Q2. Why does Stable Diffusion's U-Net operate in latent space?

Answer: Computational efficiency plus semantic compression

Explanation: Running Diffusion in pixel space (512x512x3) makes the computation explode. Compressing to a 64x64x4 latent space through the VAE cuts the dimensionality by roughly 48x. Moreover, the VAE's latent space encodes semantic features rather than pixel-level noise, so high-quality images can be generated in fewer steps.

Q3. Why is LoRA more efficient than full-weight fine-tuning?

Answer: Low-rank decomposition minimizes the number of updated parameters

Explanation: Updating the full weight matrix W \in \mathbb{R}^{d \times k} requires d \times k parameters. LoRA decomposes the update as \Delta W = BA (B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, r \ll d, k), training only (d+k) \times r parameters. With r=16 and d=k=768, that is roughly a 96% parameter reduction. Because the original weights stay frozen, multiple LoRA adapters can be swapped in and out to switch styles.
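The parameter counts in the answer can be checked with quick arithmetic; at these sizes the reduction works out to about 96%:

```python
d = k = 768
r = 16
full = d * k                    # full update: 589,824 parameters
lora = (d + k) * r              # B (d x r) plus A (r x k): 24,576 parameters
reduction = 1 - lora / full
print(full, lora, round(reduction * 100, 1))  # 589824 24576 95.8
```

Note the scaling: the saving ratio is 2r/d for square matrices, so the benefit grows as the model's hidden dimension grows.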

Q4. ControlNet's architectural design: how does it accept additional conditioning?

Answer: Trainable copy + zero convolution

Explanation: ControlNet copies the encoder blocks of the SD U-Net to build a separate control network. The key is the zero convolution (a 1x1 convolution initialized to zero weights): early in training the control signal contributes nothing, protecting the original SD's quality. As training progresses, the zero convolution's weights grow and the control effect strengthens. Conditioning images such as depth maps and edge maps are processed through a separate encoder.

Q5. How can Consistency Models use fewer sampling steps than DDPM?

Answer: They learn a consistency function mapping any timestep directly back to the original data

Explanation: DDPM must traverse all T=1000 reverse steps (even DDIM needs 20-50). Consistency Models learn a function f_\theta(x_t, t) \approx x_0 that must output the same x_0 from any point x_t on the same trajectory (the consistency condition). This enables high-quality sampling in just 1-2 steps, roughly 100-500x faster than DDPM's 1000 steps.


Closing Thoughts

Diffusion models combine mathematical elegance with practical performance, and they sit at the heart of today's generative AI. From DDPM's Gaussian math to Stable Diffusion's latent space, ControlNet's control mechanism, LoRA's efficient training, and Sora's video generation, all of these techniques rest on one unified mathematical framework.

Recommended next steps:

  1. Read the DDPM paper (Ho et al., 2020) in full
  2. Work through the HuggingFace diffusers tutorials
  3. Run ControlNet and LoRA fine-tuning yourself
  4. Build a custom workflow in ComfyUI

Generative AI & Diffusion Models: Complete Guide from Stable Diffusion to Video Generation

Introduction

When Stable Diffusion was released in 2022, AI image generation entered the era of mass adoption. Yet few people can truly answer "why does an image emerge from noise?" with any depth.

This guide takes you from the complete lineage of generative models — GAN through Consistency Models — through the mathematics of DDPM, the internal architecture of Stable Diffusion, ControlNet, LoRA fine-tuning, and finally video generation systems like Sora.


1. Generative Model Lineage: GAN → VAE → Flow → Diffusion → Consistency

1.1 GAN (Generative Adversarial Network, 2014)

Proposed by Ian Goodfellow, GANs learn through an adversarial game between a Generator and a Discriminator.

  • Strengths: High-quality image generation, fast sampling
  • Weaknesses: Unstable training (mode collapse), lack of diversity
# Basic GAN architecture
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, img_size * img_size * 3),
            nn.Tanh()
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

1.2 VAE (Variational Autoencoder, 2013)

VAEs learn a distribution in latent space, then decode samples from that distribution to reconstruct images.

Loss function: L=E[logp(xz)]DKL(q(zx)p(z))\mathcal{L} = \mathbb{E}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))

  • Strengths: Interpretable latent space, stable training
  • Weaknesses: Generated samples are blurry compared to GANs

1.3 Normalizing Flow (2015~)

Flow models stack invertible transformations to map a simple distribution to a complex one.

p(x)=p(z)detf1xp(x) = p(z) \left|\det \frac{\partial f^{-1}}{\partial x}\right|

  • Strengths: Exact likelihood computation
  • Weaknesses: Architecture constraints (invertibility), memory inefficiency

1.4 Diffusion Models (2020~)

Diffusion models gradually add noise to data and learn to reverse that process. Combining score matching with SDE theory, they represent the current state of the art in generative modeling.

1.5 Consistency Models (2023)

Consistency Models solve Diffusion's slow sampling problem. They learn a consistency function that maps directly from any noise level to the original data in a single step.


2. Diffusion Math: DDPM, Score Matching, SDE

2.1 DDPM Forward Process

DDPM (Denoising Diffusion Probabilistic Models) adds Gaussian noise to original data x0x_0 over T steps.

q(xtxt1)=N(xt;1βtxt1,βtI)q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)

Accumulating this, we can sample directly at any timestep t:

q(xtx0)=N(xt;αˉtx0,(1αˉt)I)q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)

where αˉt=s=1t(1βs)\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).

import torch
import torch.nn.functional as F

class DDPMScheduler:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.T = num_timesteps
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, x0, noise, t):
        """Add t steps of noise to x0 via reparameterization trick"""
        sqrt_alpha_bar = self.alpha_bar[t] ** 0.5
        sqrt_one_minus = (1 - self.alpha_bar[t]) ** 0.5
        # Reshape for broadcasting
        sqrt_alpha_bar = sqrt_alpha_bar.view(-1, 1, 1, 1)
        sqrt_one_minus = sqrt_one_minus.view(-1, 1, 1, 1)
        return sqrt_alpha_bar * x0 + sqrt_one_minus * noise

2.2 DDPM Reverse Process

The reverse process uses a neural network to predict the noise at each step:

pθ(xt1xt)=N(xt1;μθ(xt,t),σt2I)p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)

The training objective is MSE between the added noise and predicted noise:

L=Et,x0,ϵ[ϵϵθ(αˉtx0+1αˉtϵ,t)2]\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, t)\|^2\right]

def ddpm_training_step(model, scheduler, x0, optimizer):
    batch_size = x0.shape[0]
    # Sample random timesteps
    t = torch.randint(0, scheduler.T, (batch_size,))
    # Sample Gaussian noise
    noise = torch.randn_like(x0)
    # Add noise (forward process)
    xt = scheduler.add_noise(x0, noise, t)
    # Predict noise
    predicted_noise = model(xt, t)
    # MSE loss
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

2.3 Score Matching Perspective

The score function is the gradient of the log probability of the data distribution:

sθ(x)=xlogpθ(x)s_\theta(x) = \nabla_x \log p_\theta(x)

The noise prediction in Diffusion models is equivalent to learning the score function:

ϵθ(xt,t)1αˉtxtlogq(xt)\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log q(x_t)

2.4 SDE Perspective (Stochastic Differential Equation)

Song Yang's SDE framework generalizes Diffusion to continuous time.

Forward SDE: dx=f(x,t)dt+g(t)dWdx = f(x,t)dt + g(t)dW

Reverse SDE: dx=[f(x,t)g(t)2xlogpt(x)]dt+g(t)dWˉdx = [f(x,t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{W}

This framework unifies DDPM, SMLD (NCSN), and ODE-based samplers under a single theoretical lens.


3. Stable Diffusion Internal Architecture

3.1 Overall Architecture

Stable Diffusion consists of three core components:

  1. VAE (Variational Autoencoder): Pixel space to/from latent space
  2. U-Net: Noise prediction in latent space
  3. CLIP Text Encoder: Converts text prompts to embeddings
from diffusers import StableDiffusionPipeline
import torch

# Basic pipeline usage
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Generate image
image = pipe(
    prompt="a serene mountain landscape at sunset, photorealistic",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=512,
    height=512
).images[0]

image.save("output.png")

3.2 Why Latent Space?

Running Diffusion directly in pixel space requires processing 512x512x3 = 786,432 dimensions. SD's VAE compresses this to 64x64x4 = 16,384 dimensions.

  • Computation cost: ~48x reduction
  • Quality loss: Minimized through VAE's perceptual loss
# Visualizing the VAE latent space
from diffusers import AutoencoderKL
from PIL import Image
import torchvision.transforms as T

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae = vae.to("cuda").eval()

transform = T.Compose([T.Resize((512, 512)), T.ToTensor(),
                        T.Normalize([0.5], [0.5])])

img = transform(Image.open("input.png")).unsqueeze(0).to("cuda")
with torch.no_grad():
    # Pixel → Latent (encoding)
    latent = vae.encode(img).latent_dist.sample()
    latent = latent * vae.config.scaling_factor
    print(f"Latent space shape: {latent.shape}")  # [1, 4, 64, 64]

3.3 CLIP Text Encoder

CLIP is trained on image-text pairs. In SD, only the text encoder is used, converting prompts into 77 tokens × 768-dimensional embeddings.

from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a fantasy castle in the clouds"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]
print(f"Text embedding shape: {text_emb.shape}")  # [1, 77, 768]

3.4 CFG (Classifier-Free Guidance)

CFG controls the strength of conditional generation. A higher guidance_scale follows the prompt more strictly; a lower value yields more diversity.

ϵguided=ϵuncond+w(ϵcondϵuncond)\epsilon_{guided} = \epsilon_{uncond} + w \cdot (\epsilon_{cond} - \epsilon_{uncond})


4. LoRA & DreamBooth Fine-tuning

4.1 How LoRA Works

Instead of updating the full weight matrix WRd×kW \in \mathbb{R}^{d \times k}, LoRA represents the change as a product of two low-rank matrices:

W=W+ΔW=W+BAW' = W + \Delta W = W + BA

where BRd×rB \in \mathbb{R}^{d \times r}, ARr×kA \in \mathbb{R}^{r \times k}, and rmin(d,k)r \ll \min(d, k).

Typically r=416, training only **0.11%** of total parameters.

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model
import torch

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling parameter
    target_modules=["to_q", "to_v", "to_k", "to_out.0"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA to model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
)
unet_lora = get_peft_model(pipe.unet, lora_config)
unet_lora.print_trainable_parameters()
# Trainable params: ~3M / Total: ~860M (about 0.3%)

4.2 DreamBooth Fine-tuning

DreamBooth learns a specific subject from just 3-10 images, using a rare token (e.g., "sks") as the subject identifier.

from diffusers import DiffusionPipeline
import torch

# Load DreamBooth-trained model
pipe = DiffusionPipeline.from_pretrained(
    "./dreambooth-sks-dog",  # Trained checkpoint
    torch_dtype=torch.float16
).to("cuda")

# Generate the specific dog using "sks dog"
images = pipe(
    "a photo of sks dog in front of the Eiffel Tower",
    num_inference_steps=50,
    guidance_scale=7.5
).images

5. ControlNet & IP-Adapter

5.1 ControlNet Architecture

ControlNet copies the U-Net's encoder as a separate control network and uses zero convolutions to protect the original SD weights.

Supported conditioning types:

  • Depth map: Spatial depth information
  • Canny edge: Outline/edge preservation
  • OpenPose: Human pose control
  • Scribble: Rough sketch to detailed image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np

# Load ControlNet model (Canny edge)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Extract Canny edges
image = load_image("input.png")
image_np = np.array(image)
low_threshold, high_threshold = 100, 200
canny_image = cv2.Canny(image_np, low_threshold, high_threshold)
canny_image = canny_image[:, :, None]
canny_image = np.concatenate([canny_image] * 3, axis=2)

# ControlNet inference
result = pipe(
    prompt="a beautiful landscape, detailed, 8k",
    image=canny_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=1.0,
).images[0]

5.2 IP-Adapter & InstantID

IP-Adapter uses a reference image's style and content as conditions alongside the text prompt.

InstantID maintains consistent identity from a single portrait photo while generating diverse styles. It combines ControlNet (pose control) with IP-Adapter (facial features).


6. Advanced Image Editing: InstructPix2Pix

InstructPix2Pix edits images using natural language instructions — commands like "change the horse to a zebra".

from diffusers import StableDiffusionInstructPix2PixPipeline
import torch
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16,
    safety_checker=None
).to("cuda")

image = load_image("horse.png")
result = pipe(
    "turn the horse into a zebra",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # Fidelity to original image
    guidance_scale=7.5          # Text instruction strength
).images[0]

7. Video Generation: Sora, CogVideoX

7.1 Sora's Technical Innovations

OpenAI's Sora uses a Video Diffusion Transformer architecture, treating video as a sequence of "spacetime patches". Key innovations:

  1. Spatial-temporal attention: Simultaneous attention over space and time
  2. Variable resolution training: Learns across multiple resolutions and frame rates
  3. Recaptioning: Enhanced quality of video captions

7.2 Maintaining Temporal Consistency

The biggest challenge in video generation is temporal consistency.

  • Motion prior: Learning the distribution of natural motion
  • Cross-frame attention: Sharing features across frames
  • Optical flow guidance: Controlling motion with optical flow
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A serene lake with rippling water, birds flying overhead",
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]

8. Music and Audio Generation

8.1 MusicGen (Meta)

MusicGen is a language-model-based system for generating music from text descriptions.

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(duration=30)  # Generate 30 seconds

descriptions = ["happy jazz piano with upbeat rhythm"]
wav = model.generate(descriptions)
torchaudio.save("music.wav", wav[0].cpu(), sample_rate=32000)

8.2 AudioLM Architecture

Google's AudioLM uses hierarchical tokenization:

  • Semantic tokens (w2v-BERT): Semantic content
  • Coarse acoustic tokens (SoundStream): Coarse acoustics
  • Fine acoustic tokens (SoundStream): Fine-grained acoustics

8.3 VALL-E Speech Synthesis

Microsoft's VALL-E clones a speaker's voice from just a 3-second audio sample. It autoregressively generates speech codec tokens, much like a language model generating text.


9. Production Deployment

9.1 Optimizing with diffusers

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

# Memory optimizations
pipe.enable_attention_slicing()           # Attention slicing
pipe.enable_vae_slicing()                 # VAE slicing
pipe.enable_model_cpu_offload()           # CPU offload

# xformers acceleration (if installed)
try:
    pipe.enable_xformers_memory_efficient_attention()
    print("xformers enabled")
except:
    print("xformers not found, using default attention")

9.2 ComfyUI API Integration

```python
import json
import urllib.request

def queue_prompt(prompt_workflow, server_address="127.0.0.1:8188"):
    """Execute a workflow via the ComfyUI API."""
    data = json.dumps({"prompt": prompt_workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{server_address}/prompt",
        data=data,
        headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read())

# ComfyUI workflow (JSON format). Node "3" wires in nodes "4" (negative prompt)
# and "5" (empty latent), so both must be defined; a complete pipeline would
# also decode and save the result (VAEDecode, SaveImage nodes).
workflow = {
    "1": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "v1-5-pruned-emaonly.ckpt"}
    },
    "2": {
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "a beautiful sunset over mountains",
            "clip": ["1", 1]
        }
    },
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["4", 0],
            "latent_image": ["5", 0],
            "seed": 42,
            "steps": 30,
            "cfg": 7.5,
            "sampler_name": "euler",
            "scheduler": "karras",
            "denoise": 1.0
        }
    },
    "4": {
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}
    },
    "5": {
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 512, "height": 512, "batch_size": 1}
    }
}

result = queue_prompt(workflow)
print(f"Prompt ID: {result['prompt_id']}")
```
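Note that `/prompt` only queues the job and returns immediately; generation runs asynchronously. A minimal polling sketch, assuming the server exposes the standard ComfyUI `/history/{prompt_id}` endpoint:

```python
import json
import time
import urllib.request

def history_url(prompt_id: str, server_address: str = "127.0.0.1:8188") -> str:
    """URL of ComfyUI's per-prompt history endpoint."""
    return f"http://{server_address}/history/{prompt_id}"

def wait_for_result(prompt_id, server_address="127.0.0.1:8188", timeout=120):
    """Poll /history until the prompt appears there, i.e. execution finished."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(history_url(prompt_id, server_address)) as r:
            history = json.loads(r.read())
        if prompt_id in history:       # present once execution is complete
            return history[prompt_id]  # contains per-node outputs (images, etc.)
        time.sleep(1)
    raise TimeoutError(f"prompt {prompt_id} did not finish in {timeout}s")
```

For production use, ComfyUI's websocket interface avoids polling entirely, but the HTTP loop above is the simplest starting point.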

9.3 ONNX/TensorRT Optimization

```python
from diffusers import OnnxStableDiffusionPipeline

# ONNX Runtime inference (works on CPU and GPU). With onnxruntime-gpu built
# against TensorRT, "TensorrtExecutionProvider" can be used instead of CUDA.
pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="onnx",
    provider="CUDAExecutionProvider",
)

image = pipe("a mountain lake at dawn").images[0]
```

10. Quiz: Test Your Diffusion Model Knowledge

Q1. Why does DDPM use Gaussian noise in the forward process? What is the mathematical reason?

Answer: The reproductive property of Gaussian distributions and the tractable closed-form marginals it yields.

Explanation: There are three reasons. First, the Gaussian family is closed under addition: the sum of two independent Gaussians is also Gaussian, so composing many small noise steps keeps every intermediate distribution Gaussian. Second, the reparameterization trick enables direct sampling at any arbitrary timestep t: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$. Third, as T grows the cumulative noise dominates the signal ($\bar{\alpha}_T \to 0$), so $x_T$ converges to a standard Gaussian regardless of the data distribution, making the endpoint of the forward process a well-defined prior.
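The closed-form marginal can be verified numerically: running the forward process step by step produces the same distribution as sampling directly at timestep T. A minimal Monte Carlo check with an illustrative beta schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)       # illustrative noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_t)

x0 = 1.5          # a scalar "data point"
n = 200_000       # Monte Carlo samples

# Step-by-step forward process: x_t = sqrt(1-beta_t) x_{t-1} + sqrt(beta_t) eps
x = np.full(n, x0)
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# Closed form at t = T: x_T ~ N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T)
mean_closed = np.sqrt(alphas_bar[-1]) * x0
std_closed = np.sqrt(1 - alphas_bar[-1])

print(x.mean(), mean_closed)   # empirical mean matches closed form
print(x.std(), std_closed)     # empirical std matches closed form
```

This equivalence is exactly what lets DDPM training sample a random t per batch instead of simulating the whole chain.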

Q2. Why does Stable Diffusion's U-Net operate in latent space rather than pixel space?

Answer: Computational efficiency combined with semantic compression.

Explanation: Running Diffusion in pixel space (512x512x3) leads to explosive computation. Using a VAE to compress to a 64x64x4 latent space reduces the number of values per image by roughly 48x (786,432 vs 16,384). Additionally, the VAE's latent space encodes semantic features rather than pixel-level noise, allowing high-quality image generation in fewer steps compared to pixel-space diffusion.
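The 48x figure is plain arithmetic on the tensor shapes:

```python
# Values per image in pixel space vs SD v1's latent space
pixel_dims = 512 * 512 * 3   # RGB pixel space
latent_dims = 64 * 64 * 4    # 64x64 latent with 4 channels (8x downsampling per side)

print(pixel_dims, latent_dims, pixel_dims / latent_dims)  # 786432 16384 48.0
```

Since attention cost scales quadratically with the number of spatial positions, the savings inside the U-Net's attention layers are even larger than 48x.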

Q3. Why is LoRA more efficient than full weight fine-tuning?

Answer: Minimal parameter updates through low-rank decomposition.

Explanation: Updating the full weight matrix $W \in \mathbb{R}^{d \times k}$ requires training $d \times k$ parameters. LoRA decomposes the update as $\Delta W = BA$ ($B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll d, k$), training only $(d + k) \times r$ parameters. With r=16 and d=k=768, this trains 24,576 instead of 589,824 parameters, roughly a 96% reduction. Because original weights remain frozen, multiple LoRA adapters can be swapped for different styles.
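The parameter counts above can be checked directly:

```python
def lora_savings(d: int, k: int, r: int):
    """Full vs LoRA trainable parameter counts, plus the fraction saved."""
    full = d * k            # full-rank update Delta W
    lora = (d + k) * r      # B is d x r, A is r x k
    return full, lora, 1 - lora / full

full, lora, saved = lora_savings(d=768, k=768, r=16)
print(full, lora, f"{saved:.1%}")  # 589824 24576 95.8%
```

Note the savings grow with matrix size: for the same r, larger d and k make the ratio $(d+k)r / dk$ even smaller.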

Q4. How does ControlNet's architecture accept additional conditioning signals like depth maps and edge maps?

Answer: Trainable copy of the encoder plus zero convolutions.

Explanation: ControlNet copies the encoder blocks of SD's U-Net into a separate control network. The key innovation is zero convolution (1x1 convolution initialized to zero weights): at training start the control signal has zero influence, protecting the original SD quality. As training progresses, zero convolution weights grow and the control effect strengthens. Conditioning images (depth maps, edge maps) are processed through a separate small encoder before entering the control network.
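The zero-convolution trick is easy to see in miniature. A 1x1 convolution is a per-pixel matrix multiply over channels, so zero weights mean zero injected signal; a numpy sketch (shapes and names are illustrative, not ControlNet's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, hw = 8, 8, 16    # channels and flattened spatial size

features = rng.standard_normal((c_in, hw))  # frozen U-Net features at some block
control = rng.standard_normal((c_in, hw))   # output of the trainable control copy

# A 1x1 convolution over channels is a matrix multiply applied at each pixel.
zero_conv = np.zeros((c_out, c_in))         # zero-initialized weights

injected = zero_conv @ control              # all zeros at initialization
out = features + injected                   # base path is completely untouched

assert np.array_equal(out, features)        # control has zero influence at init
```

Gradients still flow into `zero_conv` during training (the gradient w.r.t. its weights depends on `control`, not on the zero output), so the control signal fades in smoothly instead of disrupting the pretrained model.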

Q5. How can Consistency Models reduce sampling steps compared to DDPM?

Answer: Learning a consistency function that maps any noise level directly to the original data.

Explanation: DDPM requires traversing all T=1000 reverse steps (even DDIM needs 20-50 steps). Consistency Models learn a consistency function $f_\theta(x_t, t) \approx x_0$ that must output the same $x_0$ for any point $x_t$ on the same trajectory (the consistency condition). This enables high-quality sampling in just 1-2 steps, a 500-1000x reduction compared to DDPM's 1000 steps.
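A key detail is that the boundary condition $f_\theta(x, \epsilon) = x$ at the smallest noise level is enforced by construction, not learned. A sketch of the skip-scaling parameterization from the Consistency Models paper (Song et al., 2023); the constants follow its EDM-style setup, so verify against the paper before reuse:

```python
import math

SIGMA_DATA = 0.5   # data standard deviation assumed in the EDM-style setup
EPS = 0.002        # smallest noise level sigma_min

def c_skip(sigma: float) -> float:
    return SIGMA_DATA**2 / ((sigma - EPS)**2 + SIGMA_DATA**2)

def c_out(sigma: float) -> float:
    return SIGMA_DATA * (sigma - EPS) / math.sqrt(sigma**2 + SIGMA_DATA**2)

# The model is f(x, sigma) = c_skip(sigma) * x + c_out(sigma) * F(x, sigma),
# where F is the neural network. At sigma = EPS the network term vanishes
# and f is exactly the identity, satisfying the boundary condition for free.
print(c_skip(EPS), c_out(EPS))  # 1.0 0.0
```

At large noise levels `c_skip` shrinks toward zero, so the network output dominates, which is exactly the regime where denoising requires learning.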


Conclusion

Diffusion models combine mathematical elegance with practical performance, forming the core of today's generative AI. From DDPM's Gaussian mathematics to Stable Diffusion's latent space, ControlNet's control mechanism, LoRA's efficient training, and Sora's video generation — all of these technologies stand on one beautiful mathematical framework.

Recommended learning path:

  1. Read the DDPM paper (Ho et al., 2020) in full
  2. Work through HuggingFace diffusers tutorials hands-on
  3. Run ControlNet and LoRA fine-tuning yourself
  4. Build custom workflows with ComfyUI