Introduction

Generative AI is one of the most active areas in technology today. DALL-E 3 creates detailed images from a single line of text, Stable Diffusion generates artwork, and Sora produces videos. Behind all these systems lie generative models developed over decades.

This guide covers VAEs (Variational Autoencoders), GANs (Generative Adversarial Networks), and modern diffusion models, with mathematical intuition and PyTorch code for each.

1. Overview of Generative Models

1.1 Generative vs Discriminative Models

Deep learning models are broadly divided into two categories.

**Discriminative Model**: Given input data x, learns the conditional probability P(y|x). Used for image classification, object detection, etc.

**Generative Model**: Learns the data distribution P(x), or the joint distribution P(x, y) when labels are available. After training, it can generate new data samples.

The core question of generative models: "How was this data generated? How can we generate new samples from the same distribution?"
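The distinction between P(x), P(x, y), and P(y|x) can be made concrete with a toy discrete example. The sketch below uses a made-up joint table (the feature names and probabilities are purely illustrative) and derives the marginal and the conditional from it.

```python
# Toy joint distribution P(x, y): x is a coarse image feature, y is a class label.
# All names and probabilities here are invented for illustration.
joint = {
    ("round", 0): 0.40,
    ("round", 1): 0.05,
    ("straight", 0): 0.05,
    ("straight", 1): 0.50,
}


def marginal_x(joint):
    """P(x) = sum over y of P(x, y): the quantity a generative model of x captures."""
    p = {}
    for (x, _y), prob in joint.items():
        p[x] = p.get(x, 0.0) + prob
    return p


def conditional_y_given_x(joint, x):
    """P(y | x) = P(x, y) / P(x): the quantity a discriminative model captures."""
    px = marginal_x(joint)[x]
    return {y: prob / px for (xi, y), prob in joint.items() if xi == x}


p_x = marginal_x(joint)
p_y_given_round = conditional_y_given_x(joint, "round")
print(p_x)
print(p_y_given_round)
```

A model that knows the full joint can recover the conditional, but not the other way around, which is why a generative model can also be used to sample new x.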

1.2 The Concept of Latent Space

Most generative models leverage a **latent space** — a lower-dimensional representation space.

For example, a 28x28 MNIST image (784 dimensions) can be compressed into a 2-to-100 dimensional latent vector. In this latent space:

- Visually similar digits, such as '3' and '8', tend to be positioned close to each other

- Decoding a midpoint vector produces something between '3' and '8'

- Walking through (interpolating) the latent space enables continuous transformations

1.3 Applications of Generative Models

- **Image generation**: Realistic faces, landscapes, artwork

- **Image-to-image translation**: Day photo to night photo, sketch to photo

- **Data augmentation**: Supplement scarce training data

- **Drug discovery**: Generate novel molecular structures

- **Text generation**: GPT family language models

- **Music/speech synthesis**: Compose music, TTS systems

2. Autoencoder

We begin with the autoencoder, the foundation needed to understand VAEs.

2.1 Encoder-Decoder Architecture

An autoencoder has two parts.

- **Encoder**: High-dimensional input x → low-dimensional latent vector z

- **Decoder**: Low-dimensional latent vector z → reconstructed output x'

Goal: Train so that x' is as similar as possible to x (minimize reconstruction loss).

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


class Autoencoder(nn.Module):
    """Basic Autoencoder for MNIST"""

    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
            nn.ReLU()
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid()  # Constrain pixel values to [0, 1]
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten (B, 1, 28, 28) -> (B, 784)
        z = self.encoder(x)
        x_reconstructed = self.decoder(z)
        return x_reconstructed, z

    def encode(self, x):
        x = x.view(x.size(0), -1)
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z).view(-1, 1, 28, 28)


def train_autoencoder(epochs=20):
    """Train Autoencoder"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    transform = transforms.Compose([transforms.ToTensor()])
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

    model = Autoencoder().to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.BCELoss()

    for epoch in range(epochs):
        total_loss = 0
        for data, _ in train_loader:
            data = data.to(device)
            optimizer.zero_grad()
            reconstructed, z = model(data)
            target = data.view(data.size(0), -1)
            loss = criterion(reconstructed, target)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    return model

2.2 Limitations of Basic Autoencoders

Basic autoencoders compress data well, but **generating new samples is difficult**. The latent space is not regularized, so decoding a random latent vector often produces meaningless images.

This is exactly the problem that VAEs solve.

3. Variational Autoencoder (VAE)

The VAE was proposed in the groundbreaking 2013 paper by Kingma and Welling (arXiv:1312.6114).

3.1 Core Idea of VAE

The core of VAE is learning a **probability distribution** in the latent space.

- Basic autoencoder: z = encoder(x) (deterministic point)

- VAE: z ~ N(μ, σ²) (sampled from a Gaussian distribution)

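The encoder now outputs the **distribution parameters (mean μ, variance σ²)** instead of a latent vector z directly. In practice the network predicts the log-variance log σ² rather than σ² itself, which keeps the variance positive and is numerically more stable.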

When generating new images, we sample z from a standard normal distribution N(0, I) and feed it to the decoder.

3.2 ELBO (Evidence Lower Bound)

The training objective of VAE is to maximize the data log-likelihood log P(x). Since this is hard to optimize directly, we maximize the ELBO instead.

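log P(x) >= E_q[log P(x|z)] - KL[q(z|x) || P(z)]

The first term, E_q[log P(x|z)], is the reconstruction term; the second term is the KL divergence between the approximate posterior and the prior.

**ELBO = Reconstruction term - KL Divergence.** Maximizing the ELBO is the same as minimizing the training loss, **Loss = -ELBO = Reconstruction Loss + KL Divergence**.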

- **Reconstruction loss**: How similar is the decoded output to the original?

- **KL Divergence**: How close is the learned latent distribution q(z|x) to the prior P(z) = N(0, I)?
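For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form, KL = -0.5 * Σ(1 + log σ² - μ² - σ²). The following is a minimal pure-Python sketch of that formula; the function name `kl_to_standard_normal` is a hypothetical helper chosen for this example.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)] for lists of per-dimension values."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


# Posterior identical to the prior: no penalty.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Unit variance, mean shifted by 1: penalty of mu^2 / 2.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
# Zero mean, variance 0.25: a posterior narrower than the prior is also penalized.
kl_narrow = kl_to_standard_normal([0.0], [math.log(0.25)])

print(kl_same, kl_shifted, kl_narrow)
```

The term is zero only when q(z|x) matches the prior exactly, so it pulls every encoded distribution toward N(0, I) while the reconstruction term pulls them apart.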

3.3 Reparameterization Trick

Directly sampling z ~ N(μ, σ²) blocks backpropagation, because the sampling operation is not differentiable with respect to μ and σ. The reparameterization trick solves this.

z = μ + σ * ε, ε ~ N(0, I)

This separates the randomness (ε) from the network parameters, enabling gradient computation with respect to μ and σ.
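To see why the trick makes gradients available, here is a small scalar sketch that holds ε fixed and estimates the derivatives of z numerically. The helper names (`reparameterize`, `finite_difference`) and the values of μ and σ are assumptions made for this illustration only.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, with eps drawn from N(0, 1) outside the function."""
    return mu + sigma * eps


def finite_difference(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # the only source of randomness, independent of mu and sigma
mu, sigma = 0.3, 1.5

# With eps held fixed, z is an ordinary differentiable function of mu and sigma.
dz_dmu = finite_difference(lambda m: reparameterize(m, sigma, eps), mu)
dz_dsigma = finite_difference(lambda s: reparameterize(mu, s, eps), sigma)

print(dz_dmu, dz_dsigma, eps)
```

The estimates match the analytic values dz/dμ = 1 and dz/dσ = ε, which is what automatic differentiation computes in the PyTorch implementation.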

3.4 Complete VAE Implementation (MNIST)

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


class VAE(nn.Module):
    """Variational Autoencoder"""

    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super(VAE, self).__init__()
        self.latent_dim = latent_dim

        # Encoder: input -> μ, log(σ²)
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # Mean
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # Log variance

        # Decoder: latent vector -> reconstruction
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        """Encoder: x -> (μ, log_var)"""
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Reparameterization trick: z = μ + ε * σ"""
        if self.training:
            std = torch.exp(0.5 * logvar)  # σ = exp(0.5 * log σ²)
            eps = torch.randn_like(std)    # ε ~ N(0, I)
            return mu + eps * std
        else:
            return mu  # During inference, use mean only

    def decode(self, z):
        """Decoder: z -> x'"""
        h = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(self, x):
        x_flat = x.view(-1, 784)
        mu, logvar = self.encode(x_flat)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar

    def generate(self, num_samples, device):
        """Generate images by sampling from standard normal distribution"""
        with torch.no_grad():
            z = torch.randn(num_samples, self.latent_dim, device=device)
            samples = self.decode(z)
        return samples.view(num_samples, 1, 28, 28)


def vae_loss(x_recon, x, mu, logvar, beta=1.0):
    """
    VAE loss = Reconstruction loss + β * KL Divergence
    beta=1: Standard VAE
    beta>1: β-VAE (more disentangled representations)
    """
    # Reconstruction loss (BCE), summed over pixels and over the batch
    recon_loss = F.binary_cross_entropy(
        x_recon, x.view(-1, 784),
        reduction='sum'
    )

    # KL Divergence: KL[N(μ, σ²) || N(0, 1)]
    # = -0.5 * Σ(1 + log σ² - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon_loss + beta * kl_loss


def train_vae(epochs=50, latent_dim=20, beta=1.0):
    """Train VAE"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    transform = transforms.Compose([transforms.ToTensor()])
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

    model = VAE(latent_dim=latent_dim).to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    train_losses = []
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for data, _ in train_loader:
            data = data.to(device)
            optimizer.zero_grad()
            x_recon, mu, logvar = model(data)
            loss = vae_loss(x_recon, data, mu, logvar, beta)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # The loss is summed over the batch, so divide by the number of samples
        avg_loss = total_loss / len(train_dataset)
        train_losses.append(avg_loss)
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    return model, train_losses


def interpolate_latent_space(model, img1, img2, steps=10, device='cpu'):
    """Interpolate between two images in latent space"""
    model.eval()
    with torch.no_grad():
        x1_flat = img1.view(-1, 784).to(device)
        x2_flat = img2.view(-1, 784).to(device)
        mu1, _ = model.encode(x1_flat)
        mu2, _ = model.encode(x2_flat)

        # Linear interpolation between the two posterior means
        interpolated_images = []
        for alpha in torch.linspace(0, 1, steps).tolist():
            z_interp = (1 - alpha) * mu1 + alpha * mu2
            img_recon = model.decode(z_interp)
            interpolated_images.append(img_recon.view(1, 28, 28))

    return interpolated_images

3.5 Convolutional VAE (for CIFAR-10)

class ConvVAE(nn.Module):
    """Convolutional VAE for color images"""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.latent_dim = latent_dim

        # Convolutional encoder (input: 3 x 32 x 32)
        self.encoder_conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),    # 16x16
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 8x8
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 4x4
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(128 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(128 * 4 * 4, latent_dim)

        # Transposed convolutional decoder
        self.decoder_fc = nn.Linear(latent_dim, 128 * 4 * 4)
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32x32
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder_conv(x).view(x.size(0), -1)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.decoder_fc(z)).view(-1, 128, 4, 4)
        return self.decoder_conv(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

4. Generative Adversarial Networks (GAN)

GANs were proposed by Ian Goodfellow and colleagues in a 2014 paper (arXiv:1406.2661) that introduced adversarial training of a generator against a discriminator.

4.1 The Generator-Discriminator Game

GAN involves two neural networks competing and learning from each other.

**Generator (G)**: Creates fake data from random noise z.

- Goal: Generate data realistic enough to fool the Discriminator

**Discriminator (D)**: Distinguishes between real and generated fake data.

- Goal: Classify real data as 1, fake data as 0

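In theory, this game converges to a Nash equilibrium in which generated samples are indistinguishable from real data and D outputs 0.5 for every input. In practice, the equilibrium is hard to reach, and training can oscillate or collapse.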

4.2 Minimax Loss

min_G max_D [E_x[log D(x)] + E_z[log(1 - D(G(z)))]]

- **D maximizes**: Maximize log D(x) on real data, maximize log(1 - D(G(z))) on fake data

- **G minimizes**: Minimize log(1 - D(G(z))) so that G(z) fools D

In practice, we use `-log(D(G(z)))` as the generator loss (non-saturating loss).
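The reason for preferring the non-saturating loss shows up in the gradients early in training, when D confidently rejects fakes. The sketch below assumes D(G(z)) = sigmoid(a) for a logit a and compares the gradient of each generator loss with respect to a; the function names are hypothetical and chosen for this example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator objective."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))


# a = -5 means D assigns the fake sample a probability of about 0.7% of being real.
a = -5.0
print("D(G(z))            :", sigmoid(a))
print("saturating grad    :", grad_saturating(a))
print("non-saturating grad:", grad_non_saturating(a))
```

When D easily rejects a sample, the minimax objective gives the generator almost no gradient, while the non-saturating loss gives a gradient close to its maximum magnitude of 1.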

4.3 Basic GAN Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


class Generator(nn.Module):
    """GAN Generator"""

    def __init__(self, noise_dim=100, output_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(noise_dim, 256),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(512),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(1024),
            nn.Linear(1024, output_dim),
            nn.Tanh()  # Range [-1, 1]
        )

    def forward(self, z):
        return self.model(z).view(-1, 1, 28, 28)


class Discriminator(nn.Module):
    """GAN Discriminator"""

    def __init__(self, input_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))


def train_gan(epochs=200, noise_dim=100, lr=2e-4):
    """Train GAN"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5])  # Normalize to [-1, 1]
    ])
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

    G = Generator(noise_dim).to(device)
    D = Discriminator().to(device)

    optimizer_G = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    optimizer_D = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    criterion = nn.BCELoss()

    for epoch in range(epochs):
        for i, (real_imgs, _) in enumerate(dataloader):
            batch_size = real_imgs.size(0)
            real_imgs = real_imgs.to(device)
            real_labels = torch.ones(batch_size, 1, device=device)
            fake_labels = torch.zeros(batch_size, 1, device=device)

            # === Update Discriminator ===
            optimizer_D.zero_grad()
            d_real = D(real_imgs)
            d_loss_real = criterion(d_real, real_labels)

            z = torch.randn(batch_size, noise_dim, device=device)
            fake_imgs = G(z).detach()  # Detach so no gradients flow into G
            d_fake = D(fake_imgs)
            d_loss_fake = criterion(d_fake, fake_labels)

            d_loss = d_loss_real + d_loss_fake
            d_loss.backward()
            optimizer_D.step()

            # === Update Generator ===
            optimizer_G.zero_grad()
            z = torch.randn(batch_size, noise_dim, device=device)
            fake_imgs = G(z)
            g_pred = D(fake_imgs)
            # G wants D to classify fakes as real (non-saturating loss)
            g_loss = criterion(g_pred, real_labels)
            g_loss.backward()
            optimizer_G.step()

        # Report the losses of the last batch in the epoch
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}/{epochs} | D Loss: {d_loss.item():.4f} | G Loss: {g_loss.item():.4f}")

    return G, D

4.4 Mode Collapse

Mode collapse is one of the most common failure modes of GAN training. The Generator stops producing diverse images and repeatedly generates only a few patterns that successfully fool the Discriminator.

5. Evolution of GANs

5.1 DCGAN (Deep Convolutional GAN)

Proposed in 2015, DCGAN successfully applied convolutional neural networks to GANs.

class DCGANGenerator(nn.Module):
    """DCGAN Generator (for 64x64 images)"""

    def __init__(self, noise_dim=100, ngf=64, nc=3):
        super().__init__()
        self.main = nn.Sequential(
            # Input: noise_dim x 1 x 1
            nn.ConvTranspose2d(noise_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # State: (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # State: (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # State: (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # State: (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # Output: nc x 64 x 64
        )

    def forward(self, z):
        z = z.view(z.size(0), -1, 1, 1)
        return self.main(z)


class DCGANDiscriminator(nn.Module):
    """DCGAN Discriminator (for 64x64 images)"""

    def __init__(self, ndf=64, nc=3):
        super().__init__()
        self.main = nn.Sequential(
            # Input: nc x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.main(x).view(-1, 1)

5.2 WGAN (Wasserstein GAN)

WGAN (arXiv:1701.07875) uses Wasserstein distance instead of Jensen-Shannon Divergence, greatly improving training stability.
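A one-dimensional toy case illustrates why the choice of distance matters. In the sketch below, the real data is a point mass at 0 and the generator produces a point mass at θ; the helper names and the values of θ are assumptions made for this example. The Wasserstein distance shrinks as θ approaches 0, while the Jensen-Shannon divergence stays at log 2 for every θ ≠ 0 and so gives no signal about how far apart the distributions are.

```python
import math


def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1D samples: mean |x_(i) - y_(i)| over sorted values."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a, b):
        return sum(a[k] * math.log(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


for theta in (4.0, 1.0, 0.5):
    real = [0.0] * 4
    fake = [theta] * 4
    w1 = wasserstein_1d(real, fake)
    js = js_divergence({0.0: 1.0}, {theta: 1.0})
    print(f"theta={theta}: W1={w1}, JS={js:.4f}")
```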

class WGANDiscriminator(nn.Module):
    """WGAN Critic: no Sigmoid output"""

    def __init__(self, input_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1)
            # No Sigmoid! The critic outputs an unbounded score, not a probability
        )

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))


def train_wgan(G, D, dataloader, device, epochs=100,
               n_critic=5, clip_value=0.01, lr=5e-5):
    """Train WGAN"""
    optimizer_G = optim.RMSprop(G.parameters(), lr=lr)
    optimizer_D = optim.RMSprop(D.parameters(), lr=lr)

    for epoch in range(epochs):
        for i, (real_imgs, _) in enumerate(dataloader):
            real_imgs = real_imgs.to(device)
            batch_size = real_imgs.size(0)

            # Update Critic n_critic times
            for _ in range(n_critic):
                optimizer_D.zero_grad()
                z = torch.randn(batch_size, 100).to(device)
                fake_imgs = G(z).detach()

                # Critic maximizes E[D(real)] - E[D(fake)], so minimize its negative
                d_loss = -torch.mean(D(real_imgs)) + torch.mean(D(fake_imgs))
                d_loss.backward()
                optimizer_D.step()

                # Weight clipping (Lipschitz constraint)
                for p in D.parameters():
                    p.data.clamp_(-clip_value, clip_value)

            # Update Generator
            optimizer_G.zero_grad()
            z = torch.randn(batch_size, 100).to(device)
            fake_imgs = G(z)
            g_loss = -torch.mean(D(fake_imgs))
            g_loss.backward()
            optimizer_G.step()

5.3 WGAN-GP (Gradient Penalty)

WGAN-GP replaces WGAN's weight clipping with a gradient penalty, which enforces the Lipschitz constraint more effectively.
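The penalty pushes the norm of the critic's input gradient toward 1. For a linear critic f(x) = w · x + b, the input gradient is w at every point, so the penalty can be computed by hand. The sketch below uses this special case as a sanity check; `gradient_penalty_linear` is a hypothetical helper, not part of any library.

```python
import math


def gradient_penalty_linear(w):
    """(||grad_x f(x)||_2 - 1)^2 for a linear critic f(x) = w . x + b, where grad_x f = w."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (norm - 1.0) ** 2


# Gradient norm 5: heavily penalized.
print(gradient_penalty_linear([3.0, 4.0]))
# Gradient norm 1: no penalty.
print(gradient_penalty_linear([0.6, 0.8]))
# Gradient norm 0: also penalized, since the target is norm 1 rather than norm <= 1.
print(gradient_penalty_linear([0.0, 0.0]))
```

For a neural-network critic the gradient depends on x, which is why the PyTorch version evaluates it with autograd at points interpolated between real and fake samples.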

def gradient_penalty(D, real_imgs, fake_imgs, device):
    """Compute gradient penalty"""
    batch_size = real_imgs.size(0)

    # Random interpolation between real and fake images (one alpha per sample)
    alpha = torch.rand(batch_size, 1, 1, 1).to(device)
    interpolated = alpha * real_imgs + (1 - alpha) * fake_imgs
    interpolated.requires_grad_(True)

    d_interpolated = D(interpolated)

    # Gradient of the critic output with respect to the interpolated inputs
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,
        retain_graph=True
    )[0]

    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)
    penalty = ((gradient_norm - 1) ** 2).mean()
    return penalty


def wgan_gp_d_loss(D, real_imgs, fake_imgs, device, lambda_gp=10):
    """WGAN-GP Discriminator loss (fake_imgs should be detached from G)"""
    d_real = D(real_imgs).mean()
    d_fake = D(fake_imgs).mean()
    gp = gradient_penalty(D, real_imgs, fake_imgs, device)
    return -d_real + d_fake + lambda_gp * gp

5.4 Conditional GAN (cGAN)

Generate images of specific classes by adding a label condition.

class ConditionalGenerator(nn.Module):
    """Conditional GAN Generator"""

    def __init__(self, noise_dim=100, num_classes=10, embed_dim=50):
        super().__init__()
        self.label_embedding = nn.Embedding(num_classes, embed_dim)
        self.model = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 784),
            nn.Tanh()
        )

    def forward(self, z, labels):
        # Concatenate the noise vector with the learned label embedding
        label_embed = self.label_embedding(labels)
        x = torch.cat([z, label_embed], dim=1)
        return self.model(x).view(-1, 1, 28, 28)

6. Diffusion Models

Diffusion models are now the dominant approach to image generation. DDPM (Denoising Diffusion Probabilistic Models, Ho et al., 2020, arXiv:2006.11239) showed that they can produce high-quality samples and set off the current wave of work.

6.1 Intuition for Diffusion Models

The core idea has two processes.

**Forward Process (Adding Noise)**: Gradually add Gaussian noise to a real image over T steps. With a suitable noise schedule and a large T, the result is approximately distributed as a standard normal N(0, I).

**Reverse Process (Removing Noise)**: Starting from pure noise, progressively remove noise to recover the original image. A neural network learns this reverse process.

6.2 Forward Process Math

At each step, noise is added.

q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) * x_{t-1}, β_t * I)

Where β_t is the noise schedule.

**Jump directly to step t (important property)**:

q(x_t | x_0) = N(x_t; sqrt(ā_t) * x_0, (1-ā_t) * I)

Where ā_t = product of (1 - β_s) for s from 1 to t.

This allows computing the noisy image at any arbitrary timestep t directly.
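The closed form can be checked numerically without any tensors. The sketch below assumes a linear schedule from 1e-4 to 0.02 over 1000 steps and tracks the mean coefficient and the variance of x_t given x_0 one step at a time, then compares them with sqrt(ā_T) and 1 - ā_T.

```python
import math

T = 1000
beta_start, beta_end = 1e-4, 0.02  # assumed linear schedule endpoints
betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

# Closed form: alpha_bar_t is the running product of (1 - beta_s).
alpha_bar = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

# Step by step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
# Track the coefficient on x_0 and the accumulated noise variance.
mean_coef, var = 1.0, 0.0
for b in betas:
    mean_coef *= math.sqrt(1.0 - b)
    var = (1.0 - b) * var + b

print("sqrt(alpha_bar_T):", math.sqrt(alpha_bar[-1]), "iterated:", mean_coef)
print("1 - alpha_bar_T  :", 1.0 - alpha_bar[-1], "iterated:", var)
```

Both routes agree, and with this schedule ā_T is below 1e-4, so x_T retains almost nothing of x_0 and is very close to pure N(0, I) noise.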

class NoiseSchedule:
    """Manage the noise schedule"""

    def __init__(self, timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.timesteps = timesteps

        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)

        # Forward process coefficients
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)

        # Reverse process coefficients
        self.posterior_variance = (
            self.betas * (1.0 - self.alphas_cumprod_prev) /
            (1.0 - self.alphas_cumprod)
        )

    def q_sample(self, x_start, t, noise=None):
        """Forward process: add t steps of noise to x_0"""
        if noise is None:
            noise = torch.randn_like(x_start)

        # The schedule tensors live on the CPU, so index them with a CPU copy of t
        # and move the gathered coefficients to the device of the data.
        t_cpu = t.cpu()
        sqrt_alphas_cumprod_t = (
            self.sqrt_alphas_cumprod[t_cpu].view(-1, 1, 1, 1).to(x_start.device)
        )
        sqrt_one_minus_alphas_cumprod_t = (
            self.sqrt_one_minus_alphas_cumprod[t_cpu].view(-1, 1, 1, 1).to(x_start.device)
        )

        return sqrt_alphas_cumprod_t * x_start + sqrt_one_minus_alphas_cumprod_t * noise

6.3 U-Net Architecture

The noise prediction network of diffusion models uses a U-Net architecture, conditioned on the timestep t via sinusoidal embeddings.

import math


class SinusoidalPositionEmbeddings(nn.Module):
    """Sinusoidal position embeddings for timesteps"""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        # Geometric progression of frequencies from 1 down to 1/10000
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

class ResidualBlock(nn.Module):
    """Residual block with timestep conditioning"""

    def __init__(self, in_channels, out_channels, time_emb_dim):
        super().__init__()
        # GroupNorm requires the channel count to be divisible by the group count,
        # so fall back to a single group for inputs such as 1-channel MNIST images.
        in_groups = 8 if in_channels % 8 == 0 else 1
        out_groups = 8 if out_channels % 8 == 0 else 1

        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_channels)
        )
        self.block1 = nn.Sequential(
            nn.GroupNorm(in_groups, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1)
        )
        self.block2 = nn.Sequential(
            nn.GroupNorm(out_groups, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
        )
        self.residual_conv = (
            nn.Conv2d(in_channels, out_channels, 1)
            if in_channels != out_channels else nn.Identity()
        )

    def forward(self, x, time_emb):
        h = self.block1(x)
        # Project the time embedding to out_channels and broadcast over H and W
        time_emb = self.time_mlp(time_emb)[:, :, None, None]
        h = h + time_emb
        h = self.block2(h)
        return h + self.residual_conv(x)

class SimpleUNet(nn.Module):
    """Simplified U-Net for DDPM"""

    def __init__(self, in_channels=1, model_channels=64, time_emb_dim=256):
        super().__init__()

        # Timestep embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(model_channels),
            nn.Linear(model_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim)
        )

        # Encoder
        self.down1 = ResidualBlock(in_channels, model_channels, time_emb_dim)
        self.down2 = ResidualBlock(model_channels, model_channels * 2, time_emb_dim)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = ResidualBlock(model_channels * 2, model_channels * 2, time_emb_dim)

        # Decoder
        self.up1 = ResidualBlock(model_channels * 4, model_channels, time_emb_dim)
        self.up2 = ResidualBlock(model_channels * 2, model_channels, time_emb_dim)
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

        # Output
        self.final = nn.Conv2d(model_channels, in_channels, 1)

    def forward(self, x, t):
        time_emb = self.time_mlp(t)

        # Encoder path (28x28 -> 14x14 for MNIST)
        x1 = self.down1(x, time_emb)
        x2 = self.down2(self.pool(x1), time_emb)

        # Bottleneck: pool once more (14x14 -> 7x7) so that each upsample below
        # brings the features back to the resolution of its skip connection
        x_mid = self.bottleneck(self.pool(x2), time_emb)

        # Decoder path (with skip connections)
        x = self.up1(torch.cat([self.upsample(x_mid), x2], dim=1), time_emb)
        x = self.up2(torch.cat([self.upsample(x), x1], dim=1), time_emb)

        return self.final(x)

6.4 DDPM Training and Sampling

class DDPM:
    """DDPM training and sampling"""

    def __init__(self, model, noise_schedule, device):
        self.model = model
        self.schedule = noise_schedule
        self.device = device

    def get_loss(self, x_start, t):
        """Training loss: noise prediction error"""
        noise = torch.randn_like(x_start)
        x_noisy = self.schedule.q_sample(x_start, t, noise)
        # Neural network predicts the added noise
        predicted_noise = self.model(x_noisy, t)
        return F.mse_loss(noise, predicted_noise)

    @torch.no_grad()
    def p_sample(self, x_t, t):
        """Reverse process: denoise one step from x_t to x_{t-1}"""
        # The schedule tensors live on the CPU, so index them with a CPU copy of t
        t_cpu = t.cpu()
        betas_t = self.schedule.betas[t_cpu].view(-1, 1, 1, 1).to(self.device)
        sqrt_one_minus_alphas_cumprod_t = (
            self.schedule.sqrt_one_minus_alphas_cumprod[t_cpu].view(-1, 1, 1, 1).to(self.device)
        )
        sqrt_recip_alphas_t = torch.sqrt(
            1.0 / self.schedule.alphas[t_cpu]
        ).view(-1, 1, 1, 1).to(self.device)

        predicted_noise = self.model(x_t, t)
        model_mean = sqrt_recip_alphas_t * (
            x_t - betas_t * predicted_noise / sqrt_one_minus_alphas_cumprod_t
        )

        # All samples in the batch share the same timestep, so checking t[0] suffices
        if t[0] == 0:
            return model_mean  # No noise is added at the final step
        else:
            posterior_variance_t = (
                self.schedule.posterior_variance[t_cpu].view(-1, 1, 1, 1).to(self.device)
            )
            noise = torch.randn_like(x_t)
            return model_mean + torch.sqrt(posterior_variance_t) * noise

    @torch.no_grad()
    def sample(self, batch_size, image_shape):
        """Generate images starting from pure noise"""
        img = torch.randn(batch_size, *image_shape).to(self.device)
        for i in reversed(range(self.schedule.timesteps)):
            t = torch.full((batch_size,), i, dtype=torch.long, device=self.device)
            img = self.p_sample(img, t)
        return img

def train_ddpm(epochs=100):
    """Train DDPM"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5])  # Normalize to [-1, 1]
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

    model = SimpleUNet(in_channels=1).to(device)
    schedule = NoiseSchedule(timesteps=1000)
    ddpm = DDPM(model, schedule, device)
    optimizer = optim.Adam(model.parameters(), lr=2e-4)

    for epoch in range(epochs):
        total_loss = 0
        for batch, (x, _) in enumerate(dataloader):
            x = x.to(device)

            # Random timestep sampling (one timestep per image)
            t = torch.randint(0, schedule.timesteps, (x.size(0),), device=device)

            loss = ddpm.get_loss(x, t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        if (epoch + 1) % 10 == 0:
            avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    return model, ddpm

7. DDIM - Fast Sampling

DDPM typically uses 1000 sampling steps, each requiring a full forward pass of the network, which makes generation slow. DDIM (Denoising Diffusion Implicit Models, Song et al., 2020) provides faster, deterministic sampling.

7.1 DDIM's Idea

DDIM redefines the sampling process as a non-Markovian process. The key is using the same trained model but drastically reducing the number of sampling steps (1000 → 50 steps).
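A scalar sketch of a single deterministic DDIM update (eta = 0) may help before reading the full sampler. The helper names and the numeric values below are assumptions for illustration, and the noise prediction is taken to be exact, which a trained network only approximates.

```python
import math


def ddim_timesteps(total_steps, num_steps):
    """Evenly spaced subsequence of timesteps, from noisiest to cleanest."""
    stride = total_steps // num_steps
    return list(range(0, total_steps, stride))[::-1]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a scalar 'image'."""
    # Estimate x_0 from the current sample and the predicted noise
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise the estimate to the (lower) noise level of the previous timestep
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


steps = ddim_timesteps(1000, 50)

# Build x_t from a known x_0 and noise, then jump straight back to the clean sample.
x0, eps = 0.7, -1.2
alpha_bar_t = 0.3
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
x_recovered = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)

print(len(steps), steps[:3], steps[-1])
print(x_recovered)
```

Because the update only depends on the ā values of the two timesteps involved, it can skip over intermediate timesteps, which is what allows the step count to drop from 1000 to 50.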

@torch.no_grad()
def ddim_sample(model, schedule, batch_size, image_shape,
                ddim_timesteps=50, eta=0.0, device='cpu'):
    """
    DDIM sampling
    eta=0.0: fully deterministic
    eta=1.0: stochastic, DDPM-like sampling
    """
    # Select evenly spaced timesteps
    c = schedule.timesteps // ddim_timesteps
    timestep_seq = list(range(0, schedule.timesteps, c))[::-1]

    img = torch.randn(batch_size, *image_shape).to(device)

    for i, t in enumerate(timestep_seq):
        t_tensor = torch.full((batch_size,), t, dtype=torch.long, device=device)
        t_prev = timestep_seq[i + 1] if i + 1 < len(timestep_seq) else -1

        alpha_bar = schedule.alphas_cumprod[t].to(device)
        alpha_bar_prev = (
            schedule.alphas_cumprod[t_prev].to(device) if t_prev >= 0
            else torch.tensor(1.0, device=device)
        )

        # Predict noise
        pred_noise = model(img, t_tensor)

        # Predict x_0
        pred_x0 = (img - torch.sqrt(1 - alpha_bar) * pred_noise) / torch.sqrt(alpha_bar)
        pred_x0 = torch.clamp(pred_x0, -1, 1)

        # Compute sigma
        sigma = eta * torch.sqrt(
            (1 - alpha_bar_prev) / (1 - alpha_bar) * (1 - alpha_bar / alpha_bar_prev)
        )

        # Direction towards x_t
        pred_dir = torch.sqrt(1 - alpha_bar_prev - sigma**2) * pred_noise

        # Next step image (no noise is added at the final step)
        noise = torch.randn_like(img) if t_prev >= 0 else 0
        img = torch.sqrt(alpha_bar_prev) * pred_x0 + pred_dir + sigma * noise

    return img

8. Stable Diffusion Analysis

Stable Diffusion (arXiv:2112.10752) is a groundbreaking approach that performs diffusion in **latent space** rather than pixel space.

8.1 Latent Diffusion Models (LDM)

Applying diffusion directly to high-resolution images (512x512) is computationally expensive. LDM:

1. Compresses images to latent space using a VAE encoder (512x512x3 → 64x64x4)

2. Performs diffusion in latent space

3. Recovers original resolution with the VAE decoder

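The latent is 8x smaller along each spatial side and holds about 48x fewer values than the image (16,384 versus 786,432), which substantially reduces the cost of every denoising step.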
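The size of the compression follows directly from the shapes given in step 1. The short calculation below counts values only; actual speedups also depend on the network architecture, so the numbers are a rough guide rather than a benchmark.

```python
# Shapes taken from the text: a 512x512 RGB image and a 64x64 latent with 4 channels.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4

spatial_factor = 512 // 64          # downsampling factor per spatial side
ratio = pixel_values / latent_values  # how many times fewer values the U-Net processes

print(pixel_values, latent_values, spatial_factor, ratio)
```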

8.2 Stable Diffusion Components

Text prompt → CLIP text encoder → Text embeddings

Pure noise z_T + Text embeddings → [U-Net + Cross-Attention] → Latent z_0

Latent z_0 → VAE decoder → Final image

**CLIP Text Encoder**: Transforms text into meaningful vector representations.

**U-Net with Cross-Attention**: Uses text embeddings as conditions via cross-attention.

**Classifier-Free Guidance (CFG)**: Combines conditional and unconditional predictions to improve text fidelity.

guided_noise = uncond_noise + guidance_scale * (cond_noise - uncond_noise)
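The guidance formula is a linear extrapolation from the unconditional prediction through the conditional one. The sketch below applies it to plain lists of floats standing in for noise tensors; the function name and the example values are assumptions for illustration.

```python
def classifier_free_guidance(uncond_noise, cond_noise, guidance_scale):
    """guided = uncond + scale * (cond - uncond), applied element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond_noise, cond_noise)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]

no_guidance = classifier_free_guidance(uncond, cond, 0.0)    # purely unconditional
plain_cond = classifier_free_guidance(uncond, cond, 1.0)     # purely conditional
strong = classifier_free_guidance(uncond, cond, 7.5)         # pushed past the conditional prediction

print(no_guidance, plain_cond, strong)
```

A scale of 1 reproduces the conditional prediction, and larger scales amplify the difference that the text prompt makes, usually at some cost in sample diversity.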

8.3 Using Stable Diffusion with diffusers

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

model_id = "stabilityai/stable-diffusion-2-1"

pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
)

# Use a fast scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Generate image
prompt = "a photorealistic landscape of mountains at sunset, 8k, highly detailed"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,  # 1000 DDPM steps → 25 steps
    guidance_scale=7.5,      # CFG scale
    height=512,
    width=512,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]

image.save("generated_image.png")

# Image-to-image translation
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

result = img2img_pipe(
    prompt="a painting in Van Gogh style",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    guidance_scale=7.5,
    num_inference_steps=50
).images[0]

8.4 Inpainting

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg").resize((512, 512))
mask = Image.open("mask.jpg").resize((512, 512))  # White = area to inpaint

result = inpaint_pipe(
    prompt="a beautiful garden with flowers",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

9. ControlNet - Fine-grained Image Control

9.1 ControlNet Architecture

ControlNet (Zhang et al., 2023) allows Stable Diffusion to be conditioned on additional control signals (Canny edges, depth maps, pose, etc.).

The original U-Net weights are frozen, and a separate control network is added. It trains approximately 360M additional parameters as a copy of the SD encoder.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Load Canny ControlNet
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Extract Canny edges
image = load_image("input.jpg")
image_array = np.array(image)
canny = cv2.Canny(image_array, 100, 200)
# Repeat the single-channel edge map across 3 channels to form an RGB control image
canny = np.stack([canny] * 3, axis=-1)
canny_image = Image.fromarray(canny)

# Generate with ControlNet
result = pipe(
    prompt="a beautiful oil painting",
    image=canny_image,  # Canny edges as control signal
    controlnet_conditioning_scale=1.0,
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

10. Current Trends in Generative Models

10.1 DiT (Diffusion Transformers)

Proposed by Peebles and Xie (2022), DiT uses a Transformer as the backbone for diffusion models instead of a U-Net. Recent models such as Sora, Flux, and SD3 are reported to build on Transformer-based diffusion backbones of this kind.

class DiTBlock(nn.Module):
    """Diffusion Transformer Block"""

    def __init__(self, hidden_dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)

        mlp_dim = int(hidden_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, hidden_dim)
        )

        # AdaLN (Adaptive Layer Normalization): timestep conditioning
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_dim, 6 * hidden_dim)
        )

    def forward(self, x, c):
        # x: (batch, tokens, hidden_dim), c: (batch, hidden_dim)
        # Compute modulation parameters from timestep/condition embedding
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
            self.adaLN_modulation(c).chunk(6, dim=-1)
        )

        # Modulated normalization, then gated self-attention
        x_norm = (1 + scale_msa.unsqueeze(1)) * self.norm1(x) + shift_msa.unsqueeze(1)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + gate_msa.unsqueeze(1) * attn_out

        # Modulated normalization, then gated MLP
        x_norm = (1 + scale_mlp.unsqueeze(1)) * self.norm2(x) + shift_mlp.unsqueeze(1)
        x = x + gate_mlp.unsqueeze(1) * self.mlp(x_norm)

        return x

10.2 SDXL (Stable Diffusion XL)

An upgraded version of Stable Diffusion generating higher-resolution (1024x1024) images.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
).to("cuda")

image = pipe(
    prompt="a majestic lion in photorealistic style, 4k",
    negative_prompt="cartoon, blurry",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0
).images[0]

10.3 Generative Model Comparison

| ----- | ---- | ------------------------- | ---------------------------------- | -------------------------------- |

Conclusion

We have traced the journey of generative AI: from the latent space theory of VAEs, through the adversarial game of GANs, to the iterative denoising of DDPM, and finally to the latent space diffusion of Stable Diffusion.

This field evolves extremely rapidly. Today's cutting-edge becomes tomorrow's baseline. Understanding the fundamentals deeply, while continuously following new research papers, is essential.

Recommended resources for continued learning:

- Hugging Face Diffusers: https://huggingface.co/docs/diffusers/

- VAE paper: arXiv:1312.6114

- GAN paper: arXiv:1406.2661

- DDPM paper: arXiv:2006.11239

- Latent Diffusion / Stable Diffusion: arXiv:2112.10752

References

- Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114

- Goodfellow, I., et al. (2014). Generative Adversarial Networks. arXiv:1406.2661

- Ho, J., et al. (2020). Denoising Diffusion Probabilistic Models. arXiv:2006.11239

- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752

- Arjovsky, M., et al. (2017). Wasserstein GAN. arXiv:1701.07875

- Peebles, W., & Xie, S. (2022). Scalable Diffusion Models with Transformers. arXiv:2212.09748

- Zhang, L., et al. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543

- Hugging Face Diffusers: https://huggingface.co/docs/diffusers/