A Complete Analysis of the DDPM Paper: The Math and Principles of Diffusion Models That Turn Noise into Images

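1. Paper Overview
2. Background: From Thermodynamics to Generative Models
3. Forward Process: Adding Noise Systematically
- 3.1 The Forward Process as a Markov Chain
- 3.2 The Full Forward Process
4. Core Math: The Reparameterization Trick
5. Reverse Process: Recovering Images from Noise
6. Deriving the Training Objective: From the ELBO to the Simplified Loss
7. Noise Scheduling: Designing $\beta_t$
8. Sampling Algorithm
9. Architecture: Time-conditioned U-Net
10. Experimental Results
11. Follow-up Work: The Evolution of Diffusion
12. PyTorch Code Example: A Simple DDPM Implementation
13. Diffusion Model vs GAN vs VAE: A Comparison
- 13.1 Summary Comparison Table
- 13.2 When to Choose Which Model
14. The Present and Future of Diffusion Models
- 14.1 Major Trends in 2024-2025
- 14.2 The Historical Significance of DDPM
15. References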

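1. Paper Overview

"Denoising Diffusion Probabilistic Models" (DDPM) was presented at NeurIPS 2020 and co-authored by Jonathan Ho, Ajay Jain, and Pieter Abbeel of UC Berkeley. It is a landmark paper that showed empirically that diffusion probabilistic models can synthesize high-quality images.

The core idea is remarkably simple. A fixed Forward Process gradually adds Gaussian noise to the data, and the model learns a Reverse Process that removes this noise step by step to recover the original data. The final training objective reduces to a plain MSE loss between the noise the model predicts and the noise that was actually added.

DDPM achieved an FID of 3.17 and an Inception Score of 9.46 on CIFAR-10, matching or surpassing the GAN-based models of the time. More importantly, the paper became the foundation for modern image generation systems such as DALL-E 2, Imagen, Stable Diffusion, and Midjourney.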

Paper information

Title: Denoising Diffusion Probabilistic Models

Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel

Venue: NeurIPS 2020

arXiv: 2006.11239

Official code: hojonathanho/diffusion

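2. Background: From Thermodynamics to Generative Models

2.1 Inspiration from Non-equilibrium Thermodynamics

The intellectual origin of diffusion models lies in non-equilibrium statistical mechanics (non-equilibrium thermodynamics). In physics, diffusion is the process by which particles move randomly from regions of high concentration to regions of low concentration until they reach thermal equilibrium (maximum entropy). The key insight is the following.

Forward: a state with complex structure $\rightarrow$ a disordered equilibrium state (information is destroyed)
Reverse: the equilibrium state $\rightarrow$ recovery of a structured state (information is created)

Sohl-Dickstein et al. (2015) first applied this idea to machine learning in "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". Define a diffusion process that turns a complex data distribution into a simple, known distribution (a Gaussian), learn its reverse, and the result is a generative model.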

2.2 Connection to Score Matching

The other theoretical pillar of diffusion models is score matching. The score function is defined as the gradient of the log probability density.

\nabla_x \log p(x)

If this score function can be estimated, samples can be generated with Langevin dynamics.

x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon} \, z, \quad z \sim \mathcal{N}(0, I)
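The update above can be tried on a toy case. The following is a minimal sketch, assuming a one-dimensional standard normal target whose score is known analytically ($\nabla_x \log p(x) = -x$); the function names, step size, and chain count are illustrative choices, not values from any paper.

```python
import math
import random

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Run one chain of (unadjusted) Langevin dynamics in 1-D."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        # x_{t+1} = x_t + (eps / 2) * score(x_t) + sqrt(eps) * z
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x

def standard_normal_score(x):
    # For p(x) = N(0, 1): log p(x) = -x^2 / 2 + const, so the score is -x.
    return -x

rng = random.Random(0)
# Start every chain far from the mode (x = 3) to show that the score pulls it back.
samples = [langevin_sample(standard_normal_score, 3.0, 0.05, 400, rng)
           for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, var={var:.3f}")
```

With a finite step size the chain is only approximately distributed as the target, so the empirical mean and variance should land near 0 and 1 rather than match them exactly.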

Yang Song and Stefano Ermon (2019) proposed Noise Conditional Score Networks (NCSN) in "Generative Modeling by Estimating Gradients of the Data Distribution", a method for estimating the score function at multiple noise levels. Ho et al.'s DDPM is closely tied to this score matching view, and the paper itself lists "a novel connection with denoising score matching with Langevin dynamics" among its key contributions.

2.3 The SDE View: A Unifying Framework

Song et al. (2021), in "Score-Based Generative Modeling through Stochastic Differential Equations", unified DDPM and score matching under the framework of stochastic differential equations (SDEs). Written as a continuous-time SDE, the Forward Process is:

dx = f(x, t) \, dt + g(t) \, dw

Here $f$ is the drift coefficient, $g$ is the diffusion coefficient, and $w$ is a standard Wiener process. This SDE has a corresponding reverse-time SDE.

dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{w}

The key point is that solving the reverse SDE requires only the time-dependent score function $\nabla_x \log p_t(x)$. DDPM's noise prediction network $\epsilon_\theta$ is, up to a scaling factor, equivalent to an estimator of this score function.

\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t} \, \nabla_{x_t} \log p(x_t)

This relation is the link that ties DDPM and score matching together theoretically.
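The scaling between noise and score can be checked by hand in the simplest setting. The sketch below assumes a single known scalar $x_0$, so it uses the conditional score of $q(x_t | x_0)$ rather than the marginal score $\nabla_{x_t} \log p(x_t)$ that the network approximates; the helper names and numbers are illustrative.

```python
import math

def forward_sample(x0, eps, alpha_bar):
    """x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

def conditional_score(x_t, x0, alpha_bar):
    """Gradient w.r.t. x_t of log N(x_t; sqrt(alpha_bar) * x0, 1 - alpha_bar)."""
    return -(x_t - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)

x0, eps, alpha_bar = 0.7, -1.3, 0.4
x_t = forward_sample(x0, eps, alpha_bar)
score = conditional_score(x_t, x0, alpha_bar)

# Rearranging the relation: eps = -sqrt(1 - alpha_bar) * score
eps_from_score = -math.sqrt(1.0 - alpha_bar) * score
print(eps, eps_from_score)
```

For a fixed $x_0$ the noise is recovered exactly; for the marginal distribution the same relation holds for the expected noise given $x_t$, which is what a trained $\epsilon_\theta$ targets.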

3. Forward Process: Adding Noise Systematically

3.1 The Forward Process as a Markov Chain

The Forward Process (or Diffusion Process) is a fixed Markov chain that gradually adds Gaussian noise to the original data $x_0$. It has no learnable parameters and is fully determined by a predefined variance schedule $\{\beta_1, \beta_2, ..., \beta_T\}$.

The transition probability at each timestep $t$ is defined as follows.

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t I)

In other words, each step shrinks the previous sample by a factor of $\sqrt{1 - \beta_t}$ and adds Gaussian noise with variance $\beta_t$.

x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)

Why scale by $\sqrt{1 - \beta_t}$? To preserve the total variance at every step. If $x_{t-1}$ has variance 1, then $\sqrt{1-\beta_t} \cdot x_{t-1}$ has variance $1-\beta_t$, and adding noise with variance $\beta_t$ gives a total variance of $(1-\beta_t) + \beta_t = 1$.

When $T$ is large enough and $\beta_t$ is set appropriately, the distribution of $x_T$ converges to nearly pure isotropic Gaussian noise $\mathcal{N}(0, I)$.

3.2 The Full Forward Process

The joint distribution of the full Forward Process over $T$ steps is:

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})

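This factorization follows from the Markov property: each step depends only on the one immediately before it. DDPM uses $T = 1000$ and increases $\beta_t$ linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.

4. Core Math: The Reparameterization Trick

4.1 Jumping Directly to an Arbitrary Time $t$

The most powerful mathematical property of the Forward Process is that $x_t$ at any time $t$ can be computed directly from $x_0$ without going through the intermediate steps. This is what makes training efficient.

First, define the notation.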

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

$\bar{\alpha}_t$ is the cumulative product of the $\alpha_s$ and indicates how much of the original signal survives up to time $t$.

4.2 Derivation

Start from $x_1$ and proceed by induction.

x_1 = \sqrt{\alpha_1} \, x_0 + \sqrt{1 - \alpha_1} \, \epsilon_0

x_2 = \sqrt{\alpha_2} \, x_1 + \sqrt{1 - \alpha_2} \, \epsilon_1

Substituting $x_1$ into the expression for $x_2$ gives

x_2 = \sqrt{\alpha_2} \left( \sqrt{\alpha_1} \, x_0 + \sqrt{1 - \alpha_1} \, \epsilon_0 \right) + \sqrt{1 - \alpha_2} \, \epsilon_1

= \sqrt{\alpha_1 \alpha_2} \, x_0 + \sqrt{\alpha_2(1-\alpha_1)} \, \epsilon_0 + \sqrt{1-\alpha_2} \, \epsilon_1

Now apply the rule for sums of independent Gaussians: the sum of two independent Gaussians $\mathcal{N}(0, \sigma_1^2 I)$ and $\mathcal{N}(0, \sigma_2^2 I)$ follows $\mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)I)$.

Adding the variances of the two noise terms gives

\alpha_2(1-\alpha_1) + (1-\alpha_2) = \alpha_2 - \alpha_1\alpha_2 + 1 - \alpha_2 = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2

Therefore,

x_2 = \sqrt{\bar{\alpha}_2} \, x_0 + \sqrt{1 - \bar{\alpha}_2} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Repeating the same argument for every $t$ yields the general result.

4.3 Final Result: Closed-form Expression

\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, \, (1 - \bar{\alpha}_t) I)}

That is, $x_t$ at any time $t$ can be sampled in a single step:

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
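The closed form can be checked numerically against the step-by-step chain. The sketch below assumes the linear schedule quoted earlier ($T = 1000$, $\beta_1 = 10^{-4}$, $\beta_T = 0.02$) and tracks only two scalars per step: the coefficient multiplying $x_0$ and the accumulated noise variance. Variable names are illustrative.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# Closed form: alpha_bar_t is the cumulative product of (1 - beta_s).
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

# Step-by-step chain: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
# Track the coefficient on x_0 and the total variance of the noise part.
signal, noise_var = 1.0, 0.0
for beta in betas:
    signal = math.sqrt(1.0 - beta) * signal
    noise_var = (1.0 - beta) * noise_var + beta

print(f"alpha_bar_T      = {alpha_bar[-1]:.3e}")
print(f"signal^2 (chain) = {signal ** 2:.3e}")
print(f"noise variance   = {noise_var:.6f}")
```

Both routes agree: the squared signal coefficient equals $\bar{\alpha}_T$ and the noise variance equals $1 - \bar{\alpha}_T$, with $\bar{\alpha}_T$ close to zero for this schedule.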

The terms of this expression can be read intuitively as follows.

Term	Meaning	Change over time
$\sqrt{\bar{\alpha}_t} \, x_0$	Original signal	As $t \uparrow$, $\bar{\alpha}_t \downarrow$: the signal shrinks
$\sqrt{1 - \bar{\alpha}_t} \, \epsilon$	Added noise	As $t \uparrow$, $1-\bar{\alpha}_t \uparrow$: the noise grows

At $t = 0$, $\bar{\alpha}_0 = 1$, so the sample is just $x_0$; at $t = T$, $\bar{\alpha}_T \approx 0$, so the sample is almost pure noise. This gradual decline of the **Signal-to-Noise Ratio (SNR)** is the essence of the Forward Process.

\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}

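5. Reverse Process: Recovering Images from Noise

5.1 Definition of the Reverse Process

The Reverse Process starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and gradually removes noise to generate data $x_0$. The key assumption is that if each step of the Forward Process is a small Gaussian perturbation, the reverse step can also be approximated by a Gaussian (this holds when $\beta_t$ is sufficiently small).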

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Here $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance to be produced by the neural network. DDPM does not learn the variance $\Sigma_\theta$; it fixes it to $\sigma_t^2 I$, with either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$.

5.2 Deriving the Posterior $q(x_{t-1}|x_t, x_0)$

The crux of training is that the reverse conditional distribution (posterior) given $x_0$ can be computed in closed form. Applying Bayes' theorem,

q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) \, q(x_{t-1} | x_0)}{q(x_t | x_0)}

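By the Markov property, $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1})$, so all three terms are Gaussian. Products and ratios of Gaussian densities in $x_{t-1}$ keep the exponent quadratic, so expanding the exponent and completing the square in $x_{t-1}$ gives the following.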

q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)

where the posterior mean is

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t

and the posterior variance is

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t

5.3 Replacing $x_0$ with $\epsilon$

The model cannot observe $x_0$ directly at sampling time, so we invert the reparameterization formula $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ to express $x_0$.

x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon \right)

Substituting this into the posterior mean $\tilde{\mu}_t$ gives

\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \epsilon \right)

If the model trains a network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$, the mean of the Reverse Process is computed as follows.

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)

This is why, in DDPM's Reverse Process, predicting the noise is the same as predicting the mean.
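The substitution in Section 5.3 can be verified numerically. The sketch below evaluates the posterior mean in both parameterizations for scalar inputs; the specific values of $\alpha_t$, $\bar{\alpha}_{t-1}$, $x_0$, and $\epsilon$ are arbitrary, and the function names are illustrative.

```python
import math

def posterior_mean_from_x0(x_t, x0, alpha_t, abar_t, abar_prev):
    """Posterior mean written in terms of x_0 and x_t (Section 5.2)."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, alpha_t, abar_t):
    """Posterior mean written in terms of x_t and the noise eps (Section 5.3)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

alpha_t, abar_prev = 0.98, 0.5
abar_t = alpha_t * abar_prev          # alpha_bar_t = alpha_t * alpha_bar_{t-1}
x0, eps = 0.3, 1.1
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

m_x0 = posterior_mean_from_x0(x_t, x0, alpha_t, abar_t, abar_prev)
m_eps = posterior_mean_from_eps(x_t, eps, alpha_t, abar_t)
print(m_x0, m_eps)
```

The two values coincide up to floating-point error, which is the algebraic identity the derivation relies on.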

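6. Deriving the Training Objective: From the ELBO to the Simplified Loss

6.1 Maximum Likelihood and the ELBO

The ultimate goal of a generative model is to maximize the log-likelihood of the data, $\log p_\theta(x_0)$. Because this is hard to compute directly, we optimize a variational lower bound (Evidence Lower Bound, ELBO) instead.

Applying Jensen's inequality,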

\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right] = \text{ELBO}

6.2 Decomposing the ELBO

The ELBO decomposes into KL divergence terms as follows.

\text{ELBO} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0: \text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q(x_T | x_0) \| p(x_T))}_{L_T: \text{Prior matching term}} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_q \left[ D_{\text{KL}}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) \right]}_{L_{t-1}: \text{Denoising matching term}}

Each term has the following meaning.

$L_T$ (Prior Matching): measures how well $q(x_T|x_0)$ matches the prior $p(x_T) = \mathcal{N}(0, I)$. For sufficiently large $T$ this term approaches 0, and because it has no learnable parameters it is treated as a constant and ignored.

$L_0$ (Reconstruction): measures the ability to recover $x_0$ from $x_1$. Since $x_0$ and $x_1$ are very similar, its effect on overall training is small.

$L_{t-1}$ (Denoising Matching): measures how well the model's reverse transition $p_\theta(x_{t-1}|x_t)$ matches the true posterior $q(x_{t-1}|x_t, x_0)$. This is the main training signal.

6.3 Computing the KL Divergence

The KL divergence between two Gaussians has a closed form. Since $q(x_{t-1}|x_t, x_0) = \mathcal{N}(\tilde{\mu}_t, \tilde{\beta}_t I)$ and $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I)$,

D_{\text{KL}}(q \| p_\theta) = \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2 + C

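Here $C$ is a constant that depends only on the variances. With the variance fixed, only the difference between the means remains as a training objective.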
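To see where the constant $C$ comes from, the scalar case can be written out. The sketch below uses the standard closed-form KL divergence between two one-dimensional Gaussians and checks that, with both variances held fixed, changing the means alters the KL only through the $\frac{1}{2\sigma_t^2}(\cdot)^2$ term. The numeric values are arbitrary.

```python
import math

def kl_gaussian_1d(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)

var_q, var_p = 0.008, 0.01      # stand-ins for beta_tilde_t and sigma_t^2
mu_p = 0.2                      # stand-in for the model mean mu_theta

kl_far = kl_gaussian_1d(1.0, var_q, mu_p, var_p)    # means differ by 0.8
kl_same = kl_gaussian_1d(0.2, var_q, mu_p, var_p)   # means coincide: this is C

mean_term = (1.0 - mu_p) ** 2 / (2.0 * var_p)
print(kl_far - kl_same, mean_term)
```

The value `kl_same` is the variance-only constant $C$, and the remainder matches the squared mean difference scaled by $1/(2\sigma_t^2)$.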

6.4 Reparameterizing as Noise Prediction

Substituting the expressions for $\tilde{\mu}_t$ and $\mu_\theta$ derived above gives

\|\tilde{\mu}_t - \mu_\theta\|^2 = \frac{\beta_t^2}{(1-\bar{\alpha}_t)\alpha_t} \|\epsilon - \epsilon_\theta(x_t, t)\|^2

Dropping the weighting coefficient yields the simplified loss:

\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]}

where $t \sim \text{Uniform}(\{1, ..., T\})$, $x_0 \sim q(x_0)$, and $\epsilon \sim \mathcal{N}(0, I)$.

This is DDPM's most important contribution. Starting from a complicated ELBO, it arrives at one of the simplest loss functions in machine learning: **the MSE between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$**. In the paper's experiments, this simplified loss also yields better sample quality than the original weighted variational bound.

6.5 Training Algorithm Summary

Algorithm 1: Training
─────────────────────────────────
repeat
    x_0 ~ q(x_0)                    # sample from the dataset
    t ~ Uniform({1, ..., T})         # pick a random timestep
    ε ~ N(0, I)                      # sample standard Gaussian noise
    x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε   # build the noisy image
    take a gradient step on ∇_θ ||ε - ε_θ(x_t, t)||²
until converged

7. Noise Scheduling: Designing $\beta_t$

7.1 Linear Schedule (Original DDPM)

Ho et al. used a schedule that increases $\beta_t$ linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.

\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)

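The intuition behind this schedule is to add small amounts of noise early on, destroying the data structure slowly, and larger amounts later, driving the sample quickly toward a Gaussian.

7.2 Problems with the Linear Schedule

Nichol & Dhariwal (2021, "Improved Denoising Diffusion Probabilistic Models") pointed out two problems with the linear schedule.

First, information is destroyed too quickly. $\bar{\alpha}_t$ falls rapidly in the early part of the process, so a substantial amount of noise has already been added at fairly low values of $t$. They report that the linear schedule worked well for high-resolution images but was suboptimal at lower resolutions such as $64 \times 64$ and $32 \times 32$.

Second, the late timesteps are wasted. For large $t$, $\bar{\alpha}_t \approx 0$ and $x_t$ is already close to pure noise, so these steps contribute little to training.

7.3 Cosine Schedule

The cosine schedule proposed by Nichol & Dhariwal defines $\bar{\alpha}_t$ directly.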

\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2

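Here $s = 0.008$ is a small offset that keeps $\beta_t$ from becoming too small near $t=0$.

The main properties of the cosine schedule are:

$\bar{\alpha}_t$ decreases almost linearly in the middle of the process, so all timesteps provide a comparably useful training signal
It avoids adding too much noise early on, which preserves fine detail
The transition to pure noise in the late steps is also smooth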

import torch
import math

def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine schedule as proposed in Nichol & Dhariwal (2021)."""
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule as proposed in Ho et al. (2020)."""
    return torch.linspace(beta_start, beta_end, timesteps)

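7.4 Schedule Comparison

Property	Linear Schedule	Cosine Schedule
Decay pattern of $\bar{\alpha}_t$	Fast early, then flat near zero	Nearly linear in the middle
Information preserved early on	Low	High
Use of late timesteps	Inefficient (already near pure noise)	Efficient
Suitability for low-resolution images ($32 \times 32$, $64 \times 64$)	Lower	Higher
Used in the original DDPM	Yes	No
Used in Improved DDPM	No	Yes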
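The difference between the two decay patterns is easy to see by printing $\bar{\alpha}_t$ at a few timesteps. This is a minimal standard-library sketch, assuming $T = 1000$, the linear endpoints from Section 7.1, and the cosine formula from Section 7.3; it evaluates the cosine $\bar{\alpha}_t$ directly and skips the clipping of $\beta_t$ that practical implementations apply.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for t = 1..T under the linear beta schedule."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) for t = 1..T under the cosine schedule."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
linear = linear_alpha_bar(T)
cosine = cosine_alpha_bar(T)

print(" t    linear   cosine")
for t in (100, 250, 500, 750, 900):
    print(f"{t:4d}  {linear[t - 1]:.4f}   {cosine[t - 1]:.4f}")
```

At the midpoint the cosine schedule still retains roughly half of the signal, while the linear schedule has already dropped below a tenth.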

8. Sampling Algorithm

8.1 DDPM Sampling

Once training is complete, new images are generated with the following DDPM sampling algorithm.

Algorithm 2: Sampling
─────────────────────────────────
x_T ~ N(0, I)                          # start from pure noise
for t = T, T-1, ..., 1:
    z ~ N(0, I)  if t > 1, else z = 0  # no noise is added at the final step
    x_{t-1} = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
return x_0

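8.2 Step-by-step Walkthrough

Step 1: Initialization. Sample pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$. This is the starting point of generation.

Step 2: Noise prediction. Feed the current noisy image $x_t$ and the timestep $t$ into the network $\epsilon_\theta$ to predict the noise contained in $x_t$.

Step 3: Mean computation. Use the predicted noise to compute the mean of the reverse transition.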

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)

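Step 4: Stochastic transition. Add scaled Gaussian noise $\sigma_t z$ to the computed mean to obtain $x_{t-1}$. No noise is added at the final step ( $t=1$ ).

Step 5: Repeat. Run the steps above from $t = T$ down to $t = 1$.

8.3 Limitations of Sampling

The biggest drawback of DDPM sampling is speed. It requires $T = 1000$ sequential denoising steps, so generating even a single image takes 1000 forward passes through the network. This is far slower than the single forward pass of a GAN, and it motivated later work on accelerated samplers such as DDIM and DPM-Solver.

9. Architecture: Time-conditioned U-Net

9.1 U-Net-based Design

DDPM's noise prediction network $\epsilon_\theta(x_t, t)$ is based on the U-Net architecture. U-Net was originally proposed by Ronneberger et al. (2015) for medical image segmentation; it adds skip connections to an encoder-decoder structure to combine features at multiple resolutions.

DDPM's U-Net follows the backbone of PixelCNN++ with the modifications described below.

9.2 Key Components

Time Embedding: the timestep $t$ is injected into the network using the Transformer's sinusoidal positional encoding.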

\text{TE}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{TE}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)

This embedding passes through an MLP and is injected into each ResNet block. Concretely, the time embedding is linearly projected and then either added to the block's intermediate feature map (additive) or used to scale it (FiLM conditioning).

ResNet Block: each block consists of the following sequence.

Group Normalization
SiLU (Swish) Activation
Convolution
Time Embedding injection
Group Normalization
SiLU Activation
Dropout
Convolution
Residual Connection

Self-Attention: self-attention is applied to the feature map at $16 \times 16$ resolution. The spatial dimensions $(h, w)$ are flattened into a sequence of length $h \times w$ and standard scaled dot-product attention is applied.

Group Normalization: Group Normalization is used instead of Batch Normalization. It is independent of batch size and gives more stable training for generative models.

9.3 Architecture Specification

Input: x_t ∈ R^(C×H×W), t ∈ {1,...,T}

Encoder:
  [128] → [128] → ↓2 →
  [256] → [256] → ↓2 →
  [256] → [256] → ↓2 →      (+ Self-Attention at 16×16)
  [512] → [512] → ↓2

Bottleneck:
  [512] → Self-Attention → [512]

Decoder (with skip connections):
  [512] → [512] → ↑2 →
  [256] → [256] → ↑2 →      (+ Self-Attention at 16×16)
  [256] → [256] → ↑2 →
  [128] → [128] → ↑2

Output: ε_θ ∈ R^(C×H×W)       (predicted noise, same shape as the input)

DDPM used about 114M parameters at $256 \times 256$ resolution.

10. Experimental Results

10.1 Quantitative Evaluation

DDPM was evaluated on the following benchmarks.

CIFAR-10 (Unconditional, $32 \times 32$ ):

Model	FID ( $\downarrow$ )	IS ( $\uparrow$ )
DDPM	3.17	9.46
StyleGAN2 + ADA	2.92	9.83
NCSN	25.32	8.87
ProgressiveGAN	15.52	8.80
NVAE	23.5	-

DDPM reached one of the best FID scores among unconditional generative models at the time, with quality comparable to the GAN-based StyleGAN2 + ADA.

LSUN ( $256 \times 256$ ):

Dataset	FID
LSUN Bedroom	4.90
LSUN Cat	-
LSUN Church	7.89

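10.2 Qualitative Analysis

Compared with GANs, DDPM samples show several distinct characteristics.

High diversity: GANs can suffer limited diversity because of mode collapse, whereas DDPM covers the modes of the data distribution in a balanced way.

Progressive generation: the gradual transformation from noise to image can be visualized, revealing a coarse-to-fine pattern in which the model forms global structure first and adds fine detail later.

Stable training: DDPM avoids the training instabilities that plague GANs (mode collapse, training oscillation) and converges stably with a plain MSE loss.

10.3 Progressive Lossy Compression Interpretation

Ho et al. interpret DDPM as naturally implementing a progressive lossy decompression scheme. Information is added gradually at each Reverse step, which can be viewed as a generalization of autoregressive decoding. In their rate-distortion analysis, distortion drops steeply in the low-rate region, which means that most of the bits are spent on imperceptible details rather than on the overall structure.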

11. Follow-up Work: The Evolution of Diffusion

11.1 DDIM (Denoising Diffusion Implicit Models)

Song et al., 2021 | arXiv: 2010.02502

This work addresses DDPM's biggest limitation, slow sampling. The core idea is to generalize the Forward Process to a non-Markovian one.

DDIM uses the same trained model $\epsilon_\theta$ and changes only the sampling procedure.

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t

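Setting $\sigma_t = 0$ makes sampling **fully deterministic**, which brings the following benefits.

Accelerated sampling: images of similar quality in $50 \sim 100$ steps instead of $T=1000$ (a 10-20x speedup)
Semantic interpolation: thanks to the deterministic mapping, interpolation in latent space leads to meaningful image transitions
Consistency: the same initial noise always produces the same image, so results are reproducible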
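A single DDIM update is short enough to write out for scalars. The sketch below follows the update equation above and, as a sanity check, feeds it the true noise in place of a trained $\epsilon_\theta$ (an assumption made purely for illustration); under that assumption one deterministic step lands exactly on the forward-process sample at the earlier timestep, even when many timesteps are skipped.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma_t=0.0, z=0.0):
    """One DDIM update from alpha_bar_t to alpha_bar_prev.

    sigma_t = 0 gives the deterministic sampler; z is the fresh noise
    sample used only when sigma_t > 0.
    """
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    direction = math.sqrt(1.0 - abar_prev - sigma_t ** 2) * eps_pred
    return math.sqrt(abar_prev) * x0_pred + direction + sigma_t * z

x0, eps = 0.5, -0.8
abar_t, abar_prev = 0.2, 0.6   # a large jump, as in a 50-step sub-sequence

x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)

# Forward-process sample at the earlier timestep with the same x0 and eps.
expected = math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * eps
print(x_prev, expected)
```

With a real network the prediction is imperfect, so the match is approximate, but the update depends only on $\bar{\alpha}$ values, which is why arbitrary sub-sequences of timesteps can be used.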

11.2 Improved DDPM

Nichol & Dhariwal, 2021 | arXiv: 2102.09672

This work improves several aspects of the original DDPM.

Learned variance: DDPM fixed $\sigma_t^2$ to $\beta_t$ or $\tilde{\beta}_t$, whereas Improved DDPM makes it learnable. Specifically, $\sigma_t^2$ is parameterized as an interpolation between $\beta_t$ and $\tilde{\beta}_t$ in the log domain.

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)

Here $v$ is a value output by the network.
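The interpolation can be illustrated with scalars. In this sketch $v$ is passed in by hand, whereas in the actual method it comes from the network; the values of $\alpha_t$ and $\bar{\alpha}_{t-1}$ are arbitrary.

```python
import math

def interpolated_variance(v, beta_t, beta_tilde_t):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))

alpha_t, abar_prev = 0.99, 0.6
beta_t = 1.0 - alpha_t
abar_t = alpha_t * abar_prev
beta_tilde_t = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t

upper = interpolated_variance(1.0, beta_t, beta_tilde_t)   # equals beta_t
lower = interpolated_variance(0.0, beta_t, beta_tilde_t)   # equals beta_tilde_t
middle = interpolated_variance(0.5, beta_t, beta_tilde_t)  # geometric mean
print(lower, middle, upper)
```

For $v \in [0, 1]$ the variance stays between the two fixed choices that DDPM considered, with $v = 0.5$ giving their geometric mean.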

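Cosine Schedule: introduces the cosine variance schedule described above, which improves training efficiency particularly at lower resolutions such as $64 \times 64$ and $32 \times 32$.

Hybrid Loss: adds a small amount of the variational lower bound $L_\text{vlb}$ to $L_\text{simple}$, which also improves log-likelihood.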

L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}

11.3 Classifier Guidance

Dhariwal & Nichol, 2021 | arXiv: 2105.05233

Proposed in "Diffusion Models Beat GANs on Image Synthesis", this technique performs conditional generation by injecting the gradient of a separately trained classifier into the Reverse Process.

\hat{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - s \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log p_\phi(y|x_t)

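Here $s$ is the guidance scale and $p_\phi$ is a classifier trained on noisy images. Increasing $s$ reduces diversity but raises fidelity to the target class. With this paper, diffusion models surpassed the then state-of-the-art GANs in FID on ImageNet (ImageNet 128x128 FID 2.97, ImageNet 256x256 FID 4.59).

Limitation: a separate classifier must be trained on noisy data, which complicates the training pipeline.

11.4 Classifier-Free Guidance (CFG)

Ho & Salimans, 2022 | arXiv: 2207.12598

An innovative technique that obtains the effect of guidance without a separate classifier. It has become the de facto standard for modern diffusion models.

The core idea is for a single network to learn both conditional and unconditional generation. During training, the conditioning information $c$ is replaced by a null token $\varnothing$ with some probability (typically 10-20%).

At inference time, the conditional and unconditional predictions are combined linearly.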

\hat{\epsilon}_\theta(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)

Here $w$ is the guidance weight. $w = 0$ gives standard conditional generation, and $w > 0$ increases fidelity to the condition.

Rearranging gives the following interpretation.

\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing) + (1 + w) \cdot \underbrace{(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))}_{\text{shift toward the condition}}

This can be read as pushing the unconditional prediction in the direction of the conditional one, with larger $w$ pushing harder. Nearly all recent text-to-image models, including DALL-E 2, Stable Diffusion, and Imagen, use CFG.
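The combination step itself is a one-liner. The sketch below applies both forms of the CFG formula to small lists of floats standing in for the two network outputs (the numbers are made up for illustration) and confirms that they agree and that $w = 0$ returns the conditional prediction unchanged.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """(1 + w) * eps_cond - w * eps_uncond, element-wise."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

def cfg_combine_rearranged(eps_cond, eps_uncond, w):
    """eps_uncond + (1 + w) * (eps_cond - eps_uncond), element-wise."""
    return [u + (1.0 + w) * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical network outputs for one tiny "image" of three values.
eps_cond = [0.20, -0.50, 1.00]
eps_uncond = [0.10, -0.30, 0.60]

no_guidance = cfg_combine(eps_cond, eps_uncond, w=0.0)
guided = cfg_combine(eps_cond, eps_uncond, w=2.0)
guided_alt = cfg_combine_rearranged(eps_cond, eps_uncond, w=2.0)
print(no_guidance)
print(guided)
```

In a real sampler each call to the network is made twice per step (once with the condition and once with the null token), so guidance roughly doubles the cost of each denoising step unless the two passes are batched together.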

11.5 Latent Diffusion Models (LDM) / Stable Diffusion

Rombach et al., 2022 | arXiv: 2112.10752

LDM dramatically improves computational efficiency by running the diffusion process **in a latent space rather than in pixel space**.

Core structure:

Perceptual Compression: the encoder $\mathcal{E}$ of a pretrained autoencoder (VQ-VAE or KL-regularized VAE) compresses an image $x$ into a low-dimensional latent $z = \mathcal{E}(x)$. Typically a $256 \times 256 \times 3$ image is compressed to a $32 \times 32 \times 4$ latent (about a 48x reduction in dimensionality).
Latent Diffusion: DDPM's Forward/Reverse Process runs in this latent space, which greatly reduces computation compared with pixel space.
Cross-Attention Conditioning: conditioning information such as text or segmentation maps is injected into the U-Net through cross-attention. For text, embeddings from CLIP or BERT are used.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V

Here $Q = W_Q \cdot \varphi(z_t)$, $K = W_K \cdot \tau_\theta(y)$, $V = W_V \cdot \tau_\theta(y)$, and $\tau_\theta(y)$ is the encoding of the conditioning information.

Stable Diffusion combines this LDM architecture with a CLIP text encoder and a large-scale dataset (LAION-5B), and has become the de facto standard open-source text-to-image model.

11.6 Score SDE

Song et al., 2021 | arXiv: 2011.13456

Presented as an ICLR 2021 Oral, this work connects DDPM and score matching through the unifying framework of stochastic differential equations (SDEs).

Its main contributions are the following.

Variance Exploding (VE) SDE: corresponds to the NCSN/SMLD family
Variance Preserving (VP) SDE: corresponds to DDPM
Sub-VP SDE: a variant that gives better likelihoods

\text{VP-SDE}: \quad dx = -\frac{1}{2}\beta(t) x \, dt + \sqrt{\beta(t)} \, dw

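The extension to continuous time enables exact log-likelihood computation (via an ODE), more flexible sampler design, and conditional generation tasks such as inpainting and colorization.

11.7 Consistency Models

Song et al., 2023 | arXiv: 2303.01469

Consistency Models, proposed by Yang Song at OpenAI, are an attempt to solve the multi-step sampling problem of diffusion models at its root.

The core idea is to learn a function $f_\theta$ that maps every point on an ODE trajectory to the same starting point (the original data).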

f_\theta(x_t, t) = x_0, \quad \forall t \in [0, T]

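Because of this self-consistency property, a noisy sample at any time $t$ can be mapped back to data with a single network evaluation. In other words, 1-step generation becomes possible.

There are two training modes.

Consistency Distillation (CD): distills from a pretrained diffusion model
Consistency Training (CT): trains independently, without a pretrained model

In 2024, **Easy Consistency Models (ECM)** appeared, achieving better 2-step generation than iCT at 33% of the training cost.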

11.8 Flow Matching / Rectified Flow

Lipman et al., 2023; Liu et al., 2023 | arXiv: 2210.02747, arXiv: 2209.03003

Flow Matching is an alternative to diffusion models that directly learns a **probability flow** connecting the data distribution and the noise distribution.

Core idea: define a straight path between data $x_0$ and noise $x_1 = \epsilon \sim \mathcal{N}(0, I)$.

x_t = (1-t) x_0 + t \, \epsilon, \quad t \in [0, 1]

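A velocity field $v_\theta(x_t, t)$ that follows this path is then learned. Differentiating the path with respect to $t$ gives the regression target $dx_t/dt = \epsilon - x_0$.

L_{\text{FM}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| v_\theta(x_t, t) - (\epsilon - x_0) \|^2 \right]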

Rectified Flow repeatedly "straightens" the learned trajectories (reflow) so that high-quality samples can be produced in few steps.
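Why straight paths allow few-step sampling can be shown with scalars. The sketch below differentiates the path $x_t = (1-t) x_0 + t \, \epsilon$ to get the velocity $\epsilon - x_0$, which is the sign convention used in this sketch, and integrates backward from $t = 1$ to $t = 0$ with Euler steps. The `oracle` function is a hypothetical stand-in for a perfectly trained $v_\theta$ on a single data point.

```python
def interpolate(x0, eps, t):
    """Straight path between data (t = 0) and noise (t = 1)."""
    return (1.0 - t) * x0 + t * eps

def velocity_target(x0, eps):
    # d/dt [(1 - t) * x0 + t * eps] = eps - x0, constant along the path.
    return eps - x0

def euler_integrate_to_data(x1, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, t) backward from t = 1 (noise) to t = 0 (data)."""
    x, dt = x1, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

x0, eps = 2.0, -0.5

def oracle(x, t):
    # Hypothetical perfect velocity model for this single (x0, eps) pair.
    return velocity_target(x0, eps)

x1 = interpolate(x0, eps, 1.0)              # pure noise endpoint
recovered_1 = euler_integrate_to_data(x1, oracle, n_steps=1)
recovered_10 = euler_integrate_to_data(x1, oracle, n_steps=10)
print(recovered_1, recovered_10)
```

Because the velocity is constant along a perfectly straight path, one Euler step already recovers $x_0$; a learned velocity field over a whole dataset is only approximately straight, which is the gap that reflow tries to close.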

Stable Diffusion 3 adopted Rectified Flow and, together with the move from U-Net to Transformer, set a new paradigm for diffusion models.

11.9 DiT (Diffusion Transformer)

Peebles & Xie, 2023 | arXiv: 2212.09748

DiT replaces the U-Net backbone of diffusion models with a **Vision Transformer (ViT)**.

Key design choices:

Images are split into patches and processed as tokens
The timestep $t$ and class label $y$ are injected through **Adaptive Layer Normalization (adaLN-Zero)**
The model is a stack of $L$ Transformer blocks

Combined with latent diffusion, DiT reached an FID of 2.27 on ImageNet $256 \times 256$ class-conditional generation, outperforming all earlier diffusion models.

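Significance of DiT: it showed empirically that Transformer scaling laws carry over to diffusion models. Increasing model size and training compute improves performance consistently. This finding directly influenced the architecture of recent large-scale generative models such as Sora (OpenAI, video generation) and Stable Diffusion 3.

12. PyTorch Code Example: A Simple DDPM Implementation

Below is a simplified PyTorch implementation of DDPM's core components. Real training requires a more elaborate U-Net and hyperparameter tuning.

12.1 Noise Schedule and Forward Process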

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class DDPMScheduler:
    """Scheduler that manages DDPM's Forward Process."""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
        self.num_timesteps = num_timesteps

        if schedule == 'linear':
            self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        elif schedule == 'cosine':
            self.betas = self._cosine_schedule(num_timesteps)
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        # Precompute the core quantities
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)          # ᾱ_t
        self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)

        # Forward process coefficients
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)        # √ᾱ_t
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)  # √(1-ᾱ_t)

        # Reverse process coefficients
        self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)           # 1/√α_t
        self.posterior_variance = (
            self.betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
        )  # β̃_t

    def _cosine_schedule(self, timesteps, s=0.008):
        steps = timesteps + 1
        t = torch.linspace(0, timesteps, steps) / timesteps
        alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        return torch.clip(betas, 0.0001, 0.9999)

    def add_noise(self, x_0, t, noise=None):
        """Forward process: sample from q(x_t | x_0) in one step."""
        if noise is None:
            noise = torch.randn_like(x_0)

        # The schedule tensors live on the CPU; move them to the input's device
        # before indexing, because t may be a CUDA tensor.
        sqrt_alpha_cumprod = self.sqrt_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_cumprod = self.sqrt_one_minus_alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)

        # x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
        x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise
        return x_t

12.2 A Simplified U-Net

class SinusoidalPositionEmbedding(nn.Module):
    """Transformer-style sinusoidal time embedding."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb


class ResBlock(nn.Module):
    """Time-conditioned Residual Block."""

    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
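        # GroupNorm requires num_channels % num_groups == 0; fall back to one group
        # when in_ch is not divisible by 8 (e.g. the 3-channel RGB input).
        self.norm1 = nn.GroupNorm(8 if in_ch % 8 == 0 else 1, in_ch)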
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_ch),
        )
        self.norm2 = nn.GroupNorm(8, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # inject the time embedding
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)                           # Residual connection


class SimpleUNet(nn.Module):
    """Simplified U-Net for DDPM training."""

    def __init__(self, in_channels=3, base_channels=64, time_emb_dim=256):
        super().__init__()

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbedding(base_channels),
            nn.Linear(base_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )

        # Encoder
        self.enc1 = ResBlock(in_channels, base_channels, time_emb_dim)
        self.enc2 = ResBlock(base_channels, base_channels * 2, time_emb_dim)
        self.enc3 = ResBlock(base_channels * 2, base_channels * 4, time_emb_dim)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bot = ResBlock(base_channels * 4, base_channels * 4, time_emb_dim)

        # Decoder (with skip connections)
        self.dec3 = ResBlock(base_channels * 8, base_channels * 2, time_emb_dim)
        self.dec2 = ResBlock(base_channels * 4, base_channels, time_emb_dim)
        self.dec1 = ResBlock(base_channels * 2, base_channels, time_emb_dim)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

        # Output
        self.out = nn.Conv2d(base_channels, in_channels, 1)

    def forward(self, x, t):
        t_emb = self.time_mlp(t)

        # Encoder
        e1 = self.enc1(x, t_emb)
        e2 = self.enc2(self.pool(e1), t_emb)
        e3 = self.enc3(self.pool(e2), t_emb)

        # Bottleneck
        b = self.bot(self.pool(e3), t_emb)

        # Decoder with skip connections
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1), t_emb)
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1), t_emb)
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1), t_emb)

        return self.out(d1)  # predicted noise ε_θ

12.3 Training Loop

def train_ddpm(model, dataloader, scheduler, epochs=100, lr=2e-4, device='cuda'):
    """DDPM training loop (implements Algorithm 1)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        total_loss = 0
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)

            # 1. Pick random timesteps: t ~ Uniform({1, ..., T})
            #    (0-indexed here, so index t corresponds to timestep t + 1)
            t = torch.randint(0, scheduler.num_timesteps, (x_0.shape[0],), device=device)

            # 2. Sample noise: ε ~ N(0, I)
            noise = torch.randn_like(x_0)

            # 3. Forward process: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
            x_t = scheduler.add_noise(x_0, t, noise)

            # 4. Predict the noise: ε_θ(x_t, t)
            noise_pred = model(x_t, t)

            # 5. Simplified loss: L = ||ε - ε_θ(x_t, t)||²
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

12.4 Sampling

@torch.no_grad()
def sample_ddpm(model, scheduler, image_shape, device='cuda'):
    """DDPM sampling (implements Algorithm 2)."""
    model.eval()

    # x_T ~ N(0, I)
    x = torch.randn(image_shape, device=device)

    for t in reversed(range(scheduler.num_timesteps)):
        t_batch = torch.full((image_shape[0],), t, device=device, dtype=torch.long)

        # Predict the noise
        predicted_noise = model(x, t_batch)

        # Reverse process coefficients
        alpha_t = scheduler.alphas[t]
        alpha_cumprod_t = scheduler.alphas_cumprod[t]
        beta_t = scheduler.betas[t]

        # Mean: μ_θ = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ)
        mean = (1.0 / torch.sqrt(alpha_t)) * (
            x - (beta_t / torch.sqrt(1.0 - alpha_cumprod_t)) * predicted_noise
        )

        if t > 0:
            # Add stochastic noise (skipped at the final step)
            noise = torch.randn_like(x)
            sigma_t = torch.sqrt(scheduler.posterior_variance[t])
            x = mean + sigma_t * noise
        else:
            x = mean

    return x

12.5 Usage Example

# Hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 32
batch_size = 128
num_timesteps = 1000

# Initialize the scheduler and the model
scheduler = DDPMScheduler(num_timesteps=num_timesteps, schedule='cosine')
model = SimpleUNet(in_channels=3, base_channels=64).to(device)

# Dataset (e.g. CIFAR-10)
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # normalize to [-1, 1]
])
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training
train_ddpm(model, dataloader, scheduler, epochs=100, device=device)

# Sampling
samples = sample_ddpm(model, scheduler, (16, 3, image_size, image_size), device=device)
# samples: 16 generated images, nominally in the [-1, 1] range

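13. Diffusion Model vs GAN vs VAE: A Comparison

13.1 Summary Comparison Table

Property	Diffusion Model (DDPM)	GAN	VAE
Training method	Noise prediction (MSE)	Adversarial training (min-max)	Variational inference (ELBO)
Training stability	Very stable	Unstable (mode collapse, oscillation)	Stable
Sample quality	Very high	Very high	Moderate (blurry)
Diversity	High (covers the full distribution)	Low (risk of mode collapse)	High
Generation speed	Slow (1000 steps)	Very fast (1 step)	Fast (1 step)
Log-likelihood	Tractable bound (ELBO)	Not available	Tractable bound (ELBO)
Latent space	Implicit	Present, but no built-in encoder	Explicit, continuous
Mode coverage	High	Low	High
Conditional generation	Very effective with CFG	Possible with cGAN	Possible with conditional VAE
Scaling to high resolution	Efficient with LDM	Requires progressive training	Requires hierarchical VAE
Theoretical basis	Thermodynamics, score matching	Game theory	Variational Bayes
Representative models	Stable Diffusion, DALL-E 2	StyleGAN, BigGAN	VQ-VAE-2, NVAE
CIFAR-10 FID	~2.0 (recent)	~2.9 (StyleGAN2)	~23.5 (NVAE)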

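13.2 When to Choose Which Model?

Choose a Diffusion Model when:

Both sample quality and diversity matter
Complex conditional generation such as text-to-image is required
Training stability matters
Generation speed is not the top priority

Choose a GAN when:

Real-time generation is required
High-quality images in a specific domain are needed (faces, landscapes, etc.)
The dataset is relatively small and homogeneous

Choose a VAE when:

Meaningful latent-space manipulation is required
Likelihood-based anomaly detection is required
Fast encoding/decoding is required
Semi-supervised learning or representation learning is the main goal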

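14. Present and Future of Diffusion Models

14.1 Major Trends in 2024-2025

Architecture shift: from U-Net to Transformer. Recent models such as Stable Diffusion 3, FLUX, and Sora adopt DiT-based architectures. Transformer scaling laws have been shown to carry over to Diffusion Models, and scaling to 8B+ parameters is actively under way.

Sampling efficiency. Advances such as Consistency Models, Flow Matching, and DPM-Solver have made 1-4 step generation possible. Rectified Flow learns straight paths and reaches high quality even with few steps.

Multimodal expansion. Diffusion Models are extending beyond images to video (Sora, Runway Gen-3), audio (AudioLDM), 3D (DreamFusion, Zero-1-to-3), robotics (Diffusion Policy), and other domains.

Acceleration and optimization. Techniques such as distillation, quantization, and caching have greatly improved inference speed, bringing image generation close to real time.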

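14.2 Historical Significance of DDPM

DDPM marks a turning point in the history of generative models in the following respects.

It demonstrated that likelihood-based models can compete in image generation, a field then dominated by GANs
It showed that high-quality generation is possible with an extremely simple training objective ( $L_\text{simple}$ )
It established a theoretical framework connecting thermodynamics and Score Matching
It became the direct foundation for Stable Diffusion, DALL-E 2, Midjourney, and the modern generative AI wave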

15. References

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585
Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models (DDIM). ICLR 2021. arXiv:2010.02502
Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672
Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021. arXiv:2105.05233
Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop 2021. arXiv:2207.12598
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. arXiv:2303.01469
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). ICCV 2023. arXiv:2212.09748
Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600
Weng, L. (2021). What are Diffusion Models? lilianweng.github.io
Hugging Face. The Annotated Diffusion Model. huggingface.co/blog/annotated-diffusion

Complete Analysis of the DDPM Paper: The Mathematics and Principles of Diffusion Models that Create Images from Noise

1. Paper Overview
2. Background: From Thermodynamics to Generative Models
3. Forward Process: Systematically Adding Noise
- 3.1 Forward Process as a Markov Chain
- 3.2 Complete Forward Process
4. Core Mathematics: Reparameterization Trick
5. Reverse Process: Recovering Images from Noise
6. Deriving the Training Objective: From ELBO to Simplified Loss
7. Noise Scheduling: Design of $\beta_t$
8. Sampling Algorithm
9. Architecture: Time-conditioned U-Net
10. Experimental Results
11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion
12. PyTorch Code Examples: Simple DDPM Implementation
13. Diffusion Model vs GAN vs VAE: Comparative Analysis
- 13.1 Comprehensive Comparison Table
- 13.2 When to Choose Which Model?
14. Present and Future of Diffusion Models
- 14.1 Major Trends in 2024-2025
- 14.2 Historical Significance of DDPM
15. References

1. Paper Overview

"Denoising Diffusion Probabilistic Models" (DDPM) was published at NeurIPS 2020, co-authored by Jonathan Ho, Ajay Jain, and Pieter Abbeel from UC Berkeley. This paper is a landmark study that empirically demonstrated that high-quality image synthesis is achievable through diffusion probabilistic models.

The core idea is surprisingly simple. Define a Forward Process that gradually adds Gaussian noise to data, and learn a Reverse Process that step-by-step removes this noise to recover the original data. The final training objective reduces to a simple MSE loss between "model-predicted noise" and "actually added noise."

DDPM achieved FID 3.17 and Inception Score 9.46 on CIFAR-10, showing performance comparable to or surpassing GAN-based models of the time. More importantly, this paper became the foundation of modern image generation AI including DALL-E 2, Imagen, Stable Diffusion, and Midjourney.

Paper Information

Title: Denoising Diffusion Probabilistic Models

Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel

Venue: NeurIPS 2020

arXiv: 2006.11239

Official Code: hojonathanho/diffusion

2. Background: From Thermodynamics to Generative Models

2.1 Inspiration from Non-equilibrium Thermodynamics

The intellectual origin of Diffusion Models lies in non-equilibrium statistical mechanics. In physics, diffusion refers to the process where particles randomly move from high-concentration regions to low-concentration regions, eventually reaching a state of thermal equilibrium (maximum entropy). The key insight of this process is:

Forward: A state with complex structure $\rightarrow$ disordered equilibrium state (information destruction)
Reverse: Equilibrium state $\rightarrow$ restoration to a structured state (information creation)

Sohl-Dickstein et al. (2015) first applied this idea to machine learning, publishing "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." By defining a diffusion process that transforms a complex data distribution into a simple known distribution (Gaussian), and learning its reverse process, one obtains a generative model.

2.2 Connection with Score Matching

Another theoretical pillar of Diffusion Models is Score Matching. The score function is defined as the gradient of the log probability density.

\nabla_x \log p(x)

If this score function can be estimated, samples can be generated through Langevin Dynamics.

x_{t+1} = x_t + \frac{\epsilon}{2} \nabla_x \log p(x_t) + \sqrt{\epsilon} \, z, \quad z \sim \mathcal{N}(0, I)
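The update rule above can be tried on a toy target whose score is known exactly. The following is a minimal sketch, assuming a 1D Gaussian target; `MU`, `SIGMA`, the step size, and the number of steps are illustrative choices, not values from the paper.

```python
import math
import random

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Run unadjusted Langevin dynamics and return the final state."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x

# Toy target (assumption): p(x) = N(MU, SIGMA^2), score known in closed form.
MU, SIGMA = 2.0, 0.5

def gaussian_score(x):
    return -(x - MU) / SIGMA ** 2

rng = random.Random(0)
samples = [langevin_sample(gaussian_score, 0.0, 0.01, 300, rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"mean={sample_mean:.3f}  std={sample_std:.3f}")
```

With a small step size the chains should settle near the target mean and standard deviation; in a real model the closed-form score is replaced by a learned estimate.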

Yang Song and Stefano Ermon (2019) proposed Noise Conditional Score Networks (NCSN) in "Generative Modeling by Estimating Gradients of the Data Distribution," presenting a method for estimating the score function at various noise levels. Ho et al.'s DDPM is deeply connected to this Score Matching perspective: the paper's abstract describes its training objective as designed according to "a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics."

2.3 SDE Perspective: A Unified Framework

Song et al. (2021) unified DDPM and Score Matching under the framework of Stochastic Differential Equations (SDE) in "Score-Based Generative Modeling through Stochastic Differential Equations." The Forward Process described as a continuous-time SDE takes the form:

dx = f(x, t) \, dt + g(t) \, dw

where $f$ is the drift coefficient, $g$ is the diffusion coefficient, and $w$ is a standard Wiener process. A corresponding Reverse-time SDE exists:

dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{w}

The key insight is that solving the reverse SDE requires only the time-dependent score function $\nabla_x \log p_t(x)$ . DDPM's noise prediction network $\epsilon_\theta$ is essentially equivalent to estimating this score function.

\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t} \, \nabla_{x_t} \log p(x_t)

This relationship is the key link that theoretically unifies DDPM and Score Matching.
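For a single fixed $x_0$ the relationship holds exactly, because the score of the Gaussian $q(x_t | x_0)$ is available in closed form. The sketch below checks that special case numerically; the scalar values are arbitrary, and the marginal score $\nabla \log p(x_t)$ used in the text is what a trained network approximates.

```python
import math

def conditional_score(x_t, x_0, alpha_bar):
    """Score of q(x_t | x_0) = N(sqrt(alpha_bar) * x_0, 1 - alpha_bar)."""
    return -(x_t - math.sqrt(alpha_bar) * x_0) / (1.0 - alpha_bar)

# Arbitrary illustrative values (assumptions).
x_0, eps, alpha_bar = 0.7, -1.3, 0.4
x_t = math.sqrt(alpha_bar) * x_0 + math.sqrt(1.0 - alpha_bar) * eps

# Recover the noise from the score: eps = -sqrt(1 - alpha_bar) * score.
eps_from_score = -math.sqrt(1.0 - alpha_bar) * conditional_score(x_t, x_0, alpha_bar)
print(eps, eps_from_score)
```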

3. Forward Process: Systematically Adding Noise

3.1 Forward Process as a Markov Chain

The Forward Process (or Diffusion Process) is a fixed Markov Chain that gradually adds Gaussian noise to original data $x_0$ . It has no learnable parameters and is entirely determined by a predefined Variance Schedule $\{\beta_1, \beta_2, ..., \beta_T\}$ .

The transition probability at each time step $t$ is defined as:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t I)

In plain terms, at each step the data from the previous time step is scaled down by $\sqrt{1 - \beta_t}$ and Gaussian noise with variance $\beta_t$ is added.

x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)

Why scale by $\sqrt{1 - \beta_t}$ ? To preserve the total variance at each step. If the variance of $x_{t-1}$ is 1, then the variance of $\sqrt{1-\beta_t} \cdot x_{t-1}$ is $1-\beta_t$ , and adding noise with variance $\beta_t$ gives a total variance of $(1-\beta_t) + \beta_t = 1$ .

When $T$ is sufficiently large and $\beta_t$ is appropriately set, $x_T$ converges to nearly pure isotropic Gaussian noise $\mathcal{N}(0, I)$ .
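The variance argument can be checked by simulation. This is a rough sketch under simplifying assumptions: scalar "data" taking the values -1 and +1 (unit variance but clearly non-Gaussian) and a constant $\beta$ chosen only to keep the run short.

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """One forward transition: scale by sqrt(1 - beta), add N(0, beta) noise."""
    return math.sqrt(1.0 - beta) * x_prev + math.sqrt(beta) * rng.gauss(0.0, 1.0)

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
xs = [rng.choice([-1.0, 1.0]) for _ in range(2000)]  # toy data (assumption)
var_start = variance(xs)

T, beta = 200, 0.02  # constant beta, illustrative only
for _ in range(T):
    xs = [forward_step(x, beta, rng) for x in xs]
var_end = variance(xs)

# Fraction of the original signal remaining in the mean of x_T.
signal_left = (1.0 - beta) ** (T / 2)
print(f"var_start={var_start:.3f}  var_end={var_end:.3f}  signal_left={signal_left:.3f}")
```

The variance stays close to 1 throughout while the contribution of $x_0$ shrinks, which is the sense in which the process is variance preserving.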

3.2 Complete Forward Process

The joint distribution of the complete Forward Process over $T$ steps is:

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})

This follows from the Markov property, where each step depends only on the immediately preceding step. In DDPM, $T = 1000$ is used, with $\beta_t$ increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ .

4. Core Mathematics: Reparameterization Trick

4.1 Jumping to Arbitrary Time $t$ in One Step

The most powerful mathematical property of the Forward Process is that $x_t$ at any arbitrary time $t$ can be computed directly from $x_0$ without going through intermediate steps. This is what makes training efficient.

First, define the notation:

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

$\bar{\alpha}_t$ is the cumulative product of $\alpha_s$ , representing how much of the original signal is preserved up to time $t$ .

4.2 Derivation

Starting from $x_1$ and deriving inductively:

x_1 = \sqrt{\alpha_1} \, x_0 + \sqrt{1 - \alpha_1} \, \epsilon_0

x_2 = \sqrt{\alpha_2} \, x_1 + \sqrt{1 - \alpha_2} \, \epsilon_1

Substituting $x_1$ into $x_2$ :

x_2 = \sqrt{\alpha_2} \left( \sqrt{\alpha_1} \, x_0 + \sqrt{1 - \alpha_1} \, \epsilon_0 \right) + \sqrt{1 - \alpha_2} \, \epsilon_1

= \sqrt{\alpha_1 \alpha_2} \, x_0 + \sqrt{\alpha_2(1-\alpha_1)} \, \epsilon_0 + \sqrt{1-\alpha_2} \, \epsilon_1

Applying the sum of independent Gaussians rule: the sum of two independent Gaussians $\mathcal{N}(0, \sigma_1^2 I)$ and $\mathcal{N}(0, \sigma_2^2 I)$ follows $\mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)I)$ .

Summing the noise variances:

\alpha_2(1-\alpha_1) + (1-\alpha_2) = \alpha_2 - \alpha_1\alpha_2 + 1 - \alpha_2 = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2

Therefore:

x_2 = \sqrt{\bar{\alpha}_2} \, x_0 + \sqrt{1 - \bar{\alpha}_2} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Generalizing this yields the following.

4.3 Final Result: Closed-form Expression

\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, \, (1 - \bar{\alpha}_t) I)}

That is, $x_t$ at any time $t$ can be sampled in one step:

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

The intuitive interpretation of this formula is:

Term	Meaning	Change Over Time
$\sqrt{\bar{\alpha}_t} \, x_0$	Original signal	As $t \uparrow$ , $\bar{\alpha}_t \downarrow$ , signal decreases
$\sqrt{1 - \bar{\alpha}_t} \, \epsilon$	Added noise	As $t \uparrow$ , $1-\bar{\alpha}_t \uparrow$ , noise increases

At $t = 0$ , $\bar{\alpha}_0 = 1$ so we get $x_0$ as-is, and at $t = T$ , $\bar{\alpha}_T \approx 0$ so it becomes nearly pure noise. This gradual decrease in Signal-to-Noise Ratio (SNR) is the essence of the Forward Process.

\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}
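As a sanity check on the closed form, the sketch below tracks the coefficient of $x_0$ and the accumulated noise variance step by step, then compares them with $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$. It assumes the linear schedule quoted earlier ( $T = 1000$ , $\beta_1 = 10^{-4}$ , $\beta_T = 0.02$ ); the helper `snr` is an illustrative name.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # linear schedule

alpha_bars = []                 # closed form: cumulative product of alpha_s
signal, noise_var = 1.0, 0.0    # step-by-step recursion on the coefficients
running, max_gap = 1.0, 0.0
for beta in betas:
    alpha = 1.0 - beta
    running *= alpha
    alpha_bars.append(running)
    signal *= math.sqrt(alpha)             # coefficient of x_0 after this step
    noise_var = alpha * noise_var + beta   # variance of the accumulated noise
    max_gap = max(max_gap,
                  abs(signal - math.sqrt(running)),
                  abs(noise_var - (1.0 - running)))

def snr(t):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for t in 1..T."""
    ab = alpha_bars[t - 1]
    return ab / (1.0 - ab)

print(f"max gap between recursion and closed form: {max_gap:.2e}")
print(f"alpha_bar_T = {alpha_bars[-1]:.2e}")
print(f"SNR(1) = {snr(1):.1f}, SNR(500) = {snr(500):.4f}, SNR(1000) = {snr(1000):.2e}")
```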

5. Reverse Process: Recovering Images from Noise

5.1 Definition of the Reverse Process

The Reverse Process starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and progressively removes noise to generate data $x_0$ . If each step of the Forward Process is a small Gaussian perturbation, the key assumption is that its reverse can also be approximated as Gaussian (when $\beta_t$ is sufficiently small).

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t)

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Here, $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance that the neural network must learn. In DDPM, the variance $\Sigma_\theta$ is not learned but fixed as $\sigma_t^2 I$ , where either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ is used.

5.2 Derivation of the Posterior $q(x_{t-1}|x_t, x_0)$

The key to training is that the reverse conditional distribution (posterior) given $x_0$ is computable in closed form. Applying Bayes' theorem:

q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) \, q(x_{t-1} | x_0)}{q(x_t | x_0)}

By the Markov property, $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1})$ , so all three terms are Gaussian. Since the product of Gaussians is also Gaussian, expanding the exponents and rearranging as a quadratic in $x_{t-1}$ yields:

q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)

where the posterior mean is:

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t

and the posterior variance is:

\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t

5.3 Replacing $x_0$ with $\epsilon$

Since the model cannot directly know $x_0$ , we solve the Reparameterization formula $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ in reverse to express $x_0$ :

x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon \right)

Substituting this into the posterior mean $\tilde{\mu}_t$ :

\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \epsilon \right)

If the model learns a network $\epsilon_\theta(x_t, t)$ that predicts the noise $\epsilon$ , the Reverse Process mean is computed as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)

This is why noise prediction is equivalent to mean prediction in DDPM's Reverse Process.
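The substitution can be verified numerically. The sketch below evaluates the posterior mean once in its $(x_t, x_0)$ form and once in its $(x_t, \epsilon)$ form for arbitrary scalar values; the function names and numbers are illustrative assumptions.

```python
import math

def posterior_mean_from_x0(x_t, x_0, alpha_t, ab_t, ab_prev):
    """Posterior mean written in terms of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x_0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, alpha_t, ab_t):
    """The same mean after replacing x_0 with the noise eps."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)

# Arbitrary illustrative values; ab_t = alpha_t * ab_prev by definition.
alpha_t, ab_prev = 0.98, 0.5
ab_t = alpha_t * ab_prev
x_0, eps = 0.3, 1.1
x_t = math.sqrt(ab_t) * x_0 + math.sqrt(1.0 - ab_t) * eps

m1 = posterior_mean_from_x0(x_t, x_0, alpha_t, ab_t, ab_prev)
m2 = posterior_mean_from_eps(x_t, eps, alpha_t, ab_t)
print(m1, m2)
```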

6. Deriving the Training Objective: From ELBO to Simplified Loss

6.1 Maximum Likelihood and ELBO

The ultimate goal of a generative model is to maximize the data log-likelihood $\log p_\theta(x_0)$ . However, since this is intractable to compute directly, we optimize the Evidence Lower Bound (ELBO).

Applying Jensen's inequality:

\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right] = \text{ELBO}

6.2 Decomposition of the ELBO

Decomposing the ELBO into KL divergence terms:

\text{ELBO} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0: \text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q(x_T | x_0) \| p(x_T))}_{L_T: \text{Prior matching term}} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_q \left[ D_{\text{KL}}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) \right]}_{L_{t-1}: \text{Denoising matching term}}

Analyzing the meaning of each term:

$L_T$ (Prior Matching): Measures how well $q(x_T|x_0)$ matches the prior distribution $p(x_T) = \mathcal{N}(0, I)$ . When $T$ is sufficiently large, this term converges to 0, and since it has no learnable parameters, it is ignored as a constant.

$L_0$ (Reconstruction): Measures the ability to reconstruct $x_0$ from $x_1$ . Since $x_0$ and $x_1$ are very similar, its impact on overall training is small.

$L_{t-1}$ (Denoising Matching): The core training signal that measures how well the model's Reverse transition $p_\theta(x_{t-1}|x_t)$ matches the true posterior $q(x_{t-1}|x_t, x_0)$ .

6.3 KL Divergence Computation

The KL divergence between two Gaussians is computable in closed form. Since $q(x_{t-1}|x_t, x_0) = \mathcal{N}(\tilde{\mu}_t, \tilde{\beta}_t I)$ and $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I)$ :

D_{\text{KL}}(q \| p_\theta) = \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2 + C

where $C$ is a constant related to the variances. With fixed variance, only the difference in means becomes the training objective.
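A scalar example makes the role of $C$ concrete. The sketch below uses the standard closed-form KL divergence between two 1D Gaussians and shows that, with both variances fixed, changing the model mean only changes the squared-difference term; all numeric values are arbitrary.

```python
import math

def kl_gauss_1d(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Fixed posterior (mu_tilde, beta_tilde) and fixed model variance sigma2 (assumptions).
mu_tilde, beta_tilde, sigma2 = 0.4, 0.015, 0.02

kl_a = kl_gauss_1d(mu_tilde, beta_tilde, 0.10, sigma2)  # model mean 0.10
kl_b = kl_gauss_1d(mu_tilde, beta_tilde, 0.35, sigma2)  # model mean 0.35

mean_term_a = (mu_tilde - 0.10) ** 2 / (2 * sigma2)
mean_term_b = (mu_tilde - 0.35) ** 2 / (2 * sigma2)

# The variance-dependent constant C cancels in the difference.
print(kl_a - kl_b, mean_term_a - mean_term_b)
```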

6.4 Reparameterization to Noise Prediction

Substituting the expressions for $\tilde{\mu}_t$ and $\mu_\theta$ derived earlier:

\|\tilde{\mu}_t - \mu_\theta\|^2 = \frac{\beta_t^2}{(1-\bar{\alpha}_t)\alpha_t} \|\epsilon - \epsilon_\theta(x_t, t)\|^2

The Simplified Loss with the weighting coefficient removed is:

\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]}

where $t \sim \text{Uniform}(\{1, ..., T\})$ , $x_0 \sim q(x_0)$ , and $\epsilon \sim \mathcal{N}(0, I)$ .

This is arguably DDPM's most influential practical contribution. Starting from the full ELBO, the objective reduces to a plain MSE between the actual noise $\epsilon$ and the predicted noise $\epsilon_\theta$ . Ho et al. report that this simplified loss also produces better sample quality than the weighted variational bound.
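To see what dropping the weighting coefficient changes, the sketch below evaluates that coefficient, $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}$ , at a few time steps. It assumes the linear schedule and the choice $\sigma_t^2 = \beta_t$ ; `vlb_weight` is an illustrative name.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # linear schedule

alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def vlb_weight(t):
    """Coefficient of ||eps - eps_theta||^2 in L_{t-1}, assuming sigma_t^2 = beta_t."""
    beta, ab = betas[t - 1], alpha_bars[t - 1]
    alpha = 1.0 - beta
    sigma2 = beta
    return beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - ab))

weights = {t: vlb_weight(t) for t in (2, 10, 1000)}
for t, w in weights.items():
    print(f"t={t:4d}  weight={w:.4f}")
```

The coefficient is largest at small $t$ , so replacing it with a constant gives the low-noise terms less relative weight than they have in the variational bound.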

6.5 Training Algorithm Summary

Algorithm 1: Training
─────────────────────────────────
repeat
    x_0 ~ q(x_0)                    # Sample from dataset
    t ~ Uniform({1, ..., T})         # Select random time step
    ε ~ N(0, I)                      # Sample standard Gaussian noise
    x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε   # Generate noisy image
    ∇_θ ||ε - ε_θ(x_t, t)||²        # Compute gradient and update
until converged

7. Noise Scheduling: Design of $\beta_t$

7.1 Linear Schedule (Original DDPM)

Ho et al. used a schedule where $\beta_t$ increases linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ .

\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)

The intuition behind this schedule is to add small noise initially to gradually destroy data structure, and larger noise in later stages to rapidly converge to a Gaussian.

7.2 Problems with the Linear Schedule

Nichol & Dhariwal (2021, "Improved Denoising Diffusion Probabilistic Models") identified two issues with the Linear Schedule.

First, information is destroyed too quickly in the early stages. $\bar{\alpha}_t$ decreases rapidly in the beginning, so significant noise is added even at low values of $t$ . Nichol & Dhariwal found this most harmful for low-resolution images ( $64 \times 64$ and $32 \times 32$ ), while the linear schedule worked well at high resolution.

Second, late time steps are wasted. At large values of $t$ , $\bar{\alpha}_t \approx 0$ , meaning $x_t$ is already close to pure noise and contributes little to meaningful training.

7.3 Cosine Schedule

The Cosine Schedule proposed by Nichol & Dhariwal defines $\bar{\alpha}_t$ directly.

\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2

where $s = 0.008$ is a small offset to prevent $\beta_t$ from becoming too small near $t=0$ .

The key characteristics of the Cosine Schedule are:

$\bar{\alpha}_t$ decreases nearly linearly in the middle range, providing uniformly useful training signals across all time steps
Prevents excessive noise addition in the early stages, preserving fine details
Ensures smooth transition to complete noise in the later stages

import torch
import math

def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine schedule as proposed in Nichol & Dhariwal (2021)."""
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule as proposed in Ho et al. (2020)."""
    return torch.linspace(beta_start, beta_end, timesteps)
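To compare the two schedules without any dependencies, the following sketch recomputes $\bar{\alpha}_t$ in plain Python and prints it at a few time steps. It mirrors the formulas above but omits the clipping step; the function names are illustrative.

```python
import math

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for t = 1..T under the linear beta schedule."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) for t = 1..T under the cosine schedule."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
lin, cos_ = linear_alpha_bars(T), cosine_alpha_bars(T)
for t in (100, 250, 500, 750, 1000):
    print(f"t={t:4d}  linear={lin[t - 1]:.4f}  cosine={cos_[t - 1]:.4f}")
```

At the midpoint the cosine schedule still retains roughly half of the signal, whereas the linear schedule has already removed most of it.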

7.4 Schedule Comparison

Property	Linear Schedule	Cosine Schedule
$\bar{\alpha}_t$ decay pattern	Rapid early, gradual late	Nearly linear in middle
Early information preservation	Low	High
Late time step utilization	Inefficient (already pure noise)	Efficient
Low-resolution ( $32 \times 32$ , $64 \times 64$ ) suitability	Low	High
Used in original DDPM	Yes	No
Used in Improved DDPM	No	Yes

8. Sampling Algorithm

8.1 DDPM Sampling

After training is complete, the DDPM sampling algorithm for generating new images is:

Algorithm 2: Sampling
─────────────────────────────────
x_T ~ N(0, I)                          # Start from pure noise
for t = T, T-1, ..., 1:
    z ~ N(0, I)  if t > 1, else z = 0  # No noise added at the last step
    x_{t-1} = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t, t)) + σ_t · z
return x_0

8.2 Step-by-Step Interpretation

Step 1: Initialization. Sample pure Gaussian noise from $x_T \sim \mathcal{N}(0, I)$ . This is the starting point of the generation process.

Step 2: Noise Prediction. Feed the current noisy image $x_t$ and time step $t$ into the network $\epsilon_\theta$ to predict the noise contained in $x_t$ .

Step 3: Mean Computation. Compute the mean of the Reverse transition using the predicted noise.

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)

Step 4: Stochastic Transition. Generate $x_{t-1}$ by adding scaled Gaussian noise $\sigma_t z$ to the computed mean. No noise is added at the final step ( $t=1$ ).

Step 5: Repeat. Repeat the above process from $t = T$ to $t = 1$ .
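The loop in Algorithm 2 can be exercised without a trained network on a toy problem. In the sketch below the "dataset" is a single scalar point `MU`, for which the exact noise is known in closed form, so `oracle_eps` stands in for $\epsilon_\theta$ . Both are assumptions for illustration, as is the choice $\sigma_t^2 = \beta_t$ .

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

MU = 0.8  # the single data point of the toy distribution (assumption)

def oracle_eps(x_t, i):
    """Exact noise for point-mass data; stands in for a trained eps_theta."""
    return (x_t - math.sqrt(alpha_bars[i]) * MU) / math.sqrt(1.0 - alpha_bars[i])

def sample(rng):
    x = rng.gauss(0.0, 1.0)                        # x_T ~ N(0, I)
    for i in reversed(range(T)):                   # i = t - 1 (0-indexed)
        eps = oracle_eps(x, i)
        mean = (x - betas[i] / math.sqrt(1.0 - alpha_bars[i]) * eps) / math.sqrt(alphas[i])
        z = rng.gauss(0.0, 1.0) if i > 0 else 0.0  # no noise at the last step
        x = mean + math.sqrt(betas[i]) * z         # sigma_t^2 = beta_t
    return x

x0_hat = sample(random.Random(0))
print(x0_hat)
```

With a perfect noise predictor the chain lands on the data point; a real model replaces `oracle_eps` with a network call and works on image tensors.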

8.3 Limitations of Sampling

The biggest drawback of DDPM sampling is speed. Sequential denoising over $T = 1000$ steps requires 1000 neural network forward passes for a single image. This is extremely slow compared to GAN's single forward pass, spurring subsequent research on accelerated samplers such as DDIM and DPM-Solver.

9. Architecture: Time-conditioned U-Net

9.1 U-Net Based Design

DDPM's noise prediction network $\epsilon_\theta(x_t, t)$ is based on the U-Net architecture. U-Net was originally proposed by Ronneberger et al. (2015) for medical image segmentation, featuring an Encoder-Decoder structure with Skip Connections that combine features at various resolutions.

DDPM's U-Net is based on the PixelCNN++ structure with the following modifications.

9.2 Key Components

Time Embedding: To inject the time step $t$ into the network, Transformer-style Sinusoidal Positional Encoding is used.

\text{TE}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{TE}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)

This embedding passes through an MLP and is injected into each ResNet Block. Specifically, the time embedding is linearly transformed and then either added (additive) or scaled (FiLM conditioning) onto the intermediate feature maps of the ResNet Block.
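A direct transcription of the embedding formula, as a minimal sketch: it interleaves sine and cosine exactly as written above, whereas many implementations concatenate the two halves instead. The function name and the dimension 8 are illustrative.

```python
import math

def sinusoidal_embedding(t, dim):
    """Interleaved sin/cos embedding of a scalar time step; dim must be even."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(t * freq))  # index 2i
        emb.append(math.cos(t * freq))  # index 2i + 1
    return emb

e0 = sinusoidal_embedding(0, 8)
e500 = sinusoidal_embedding(500, 8)
print(e0)
print([round(v, 4) for v in e500])
```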

ResNet Block: Each block consists of the following sequence:

Group Normalization
SiLU (Swish) Activation
Convolution
Time Embedding injection
Group Normalization
SiLU Activation
Dropout
Convolution
Residual Connection

Self-Attention: Multi-Head Self-Attention is applied at feature maps of $16 \times 16$ resolution. The spatial dimensions $(h, w)$ are flattened to sequence length $h \times w$ to perform standard Scaled Dot-Product Attention.

Group Normalization: Group Normalization replaces the weight normalization of the PixelCNN++ backbone, which Ho et al. describe as a simplification of the implementation. Unlike Batch Normalization, it is also independent of batch size.

9.3 Specific Architecture Specifications

Input: x_t ∈ R^(C×H×W), t ∈ {1,...,T}

Encoder:
  [128] → [128] → ↓2 →
  [256] → [256] → ↓2 →
  [256] → [256] → ↓2 →      (+ Self-Attention at 16×16)
  [512] → [512] → ↓2

Bottleneck:
  [512] → Self-Attention → [512]

Decoder (with skip connections):
  [512] → [512] → ↑2 →
  [256] → [256] → ↑2 →      (+ Self-Attention at 16×16)
  [256] → [256] → ↑2 →
  [128] → [128] → ↑2

Output: ε_θ ∈ R^(C×H×W)       (predicted noise with same dimensions as input)

The $256 \times 256$ models (LSUN, CelebA-HQ) have approximately 114M parameters; the CIFAR-10 model is smaller, at 35.7M parameters.

10. Experimental Results

10.1 Quantitative Evaluation

DDPM was evaluated on the following benchmarks.

CIFAR-10 (Unconditional, $32 \times 32$ ):

Model	FID ( $\downarrow$ )	IS ( $\uparrow$ )
DDPM	3.17	9.46
StyleGAN2 + ADA	2.92	9.83
NCSN	25.32	8.87
ProgressiveGAN	15.52	8.80
NVAE	23.5	-

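In the paper's own comparison, DDPM's FID of 3.17 was the best among unconditional models: the table there lists the preprint (v1) result for StyleGAN2 + ADA, FID 3.26, whereas the 2.92 shown above comes from the final StyleGAN2 + ADA paper. In either case, DDPM reached quality comparable to the strongest GANs of the time.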

LSUN ( $256 \times 256$ ):

Dataset	FID
LSUN Bedroom	4.90
LSUN Cat	-
LSUN Church	7.89

10.2 Qualitative Analysis

DDPM samples exhibited several distinct characteristics compared to GANs.

High diversity: While GANs suffer from limited generation diversity due to mode collapse, DDPM covers diverse modes of the data distribution in a balanced manner.

Gradual generation: The progressive transformation from noise to image can be visualized, confirming a coarse-to-fine generation pattern where the model first forms global structure and then adds fine details.

Stable training: Free from GAN's chronic problems of training instability (mode collapse, training oscillation), converging stably with a simple MSE loss.

10.3 Progressive Lossy Compression Interpretation

Ho et al. interpreted DDPM as naturally implementing a Progressive Lossy Decompression scheme. Information is progressively added at each Reverse step, which can be viewed as a generalization of Autoregressive Decoding. In the rate-distortion analysis, distortion falls steeply in the low-rate region: a small number of bits already captures the perceptually important structure, and the majority of the bits are spent on imperceptible details.

11. Comprehensive Overview of Subsequent Research: The Evolution of Diffusion

11.1 DDIM (Denoising Diffusion Implicit Models)

Song et al., 2021 | arXiv: 2010.02502

Research that addressed DDPM's biggest limitation: slow sampling speed. The core idea is to generalize the Forward Process to be Non-Markovian.

DDIM uses the same trained model $\epsilon_\theta$ while modifying only the sampling process.

x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t

Setting $\sigma_t = 0$ makes sampling completely deterministic, which provides:

Accelerated sampling: Similar quality images with only 50-100 steps instead of $T=1000$ (10-20x speedup)
Semantic interpolation: Thanks to the deterministic mapping, interpolation in latent space leads to meaningful image transformations
Consistency: Always generates the same image from the same initial noise, ensuring reproducible results
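The deterministic update ( $\sigma_t = 0$ ) and the use of a shortened time-step sequence can be sketched on the same kind of toy problem: a single scalar data point `MU` with a closed-form noise oracle in place of $\epsilon_\theta$ . These, and the evenly spaced sub-sequence, are assumptions for illustration.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # linear schedule
alpha_bars, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

MU = 0.8  # single data point of the toy distribution (assumption)

def oracle_eps(x_t, ab):
    """Exact noise for point-mass data; stands in for a trained eps_theta."""
    return (x_t - math.sqrt(ab) * MU) / math.sqrt(1.0 - ab)

def ddim_step(x_t, ab_t, ab_prev, eps):
    """Deterministic DDIM update (sigma_t = 0)."""
    x0_pred = (x_t - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps

def ddim_sample(x_T, num_steps):
    # Evenly spaced sub-sequence of time steps, e.g. 1000, 900, ..., 100.
    ts = [T - k * (T // num_steps) for k in range(num_steps)]
    x = x_T
    for j, t in enumerate(ts):
        ab_t = alpha_bars[t - 1]
        ab_prev = alpha_bars[ts[j + 1] - 1] if j + 1 < len(ts) else 1.0
        x = ddim_step(x, ab_t, ab_prev, oracle_eps(x, ab_t))
    return x

print(ddim_sample(1.5, 10), ddim_sample(1.5, 10))
```

The same starting noise always yields the same output, and only 10 network evaluations are used instead of 1000.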

11.2 Improved DDPM

Nichol & Dhariwal, 2021 | arXiv: 2102.09672

Research that improved two aspects of the original DDPM.

Learnable variance: While DDPM fixed $\sigma_t^2$ as either $\beta_t$ or $\tilde{\beta}_t$ , Improved DDPM makes it learnable. Specifically, $\sigma_t^2$ is parameterized as an interpolation between $\beta_t$ and $\tilde{\beta}_t$ .

\Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)

where $v$ is a value output by the network.
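The interpolation happens in the log domain, which the following sketch illustrates for one time step. The schedule values are arbitrary, and a step $t \ge 2$ is assumed because $\tilde{\beta}_1 = 0$ has no logarithm.

```python
import math

def interpolated_variance(v, beta_t, beta_tilde_t):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)), with v in [0, 1]."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))

# Arbitrary illustrative values for some t >= 2.
alpha_t, ab_prev = 0.99, 0.6
beta_t = 1.0 - alpha_t
ab_t = alpha_t * ab_prev
beta_tilde_t = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t

for v in (0.0, 0.5, 1.0):
    print(v, interpolated_variance(v, beta_t, beta_tilde_t))
```

Here $v = 1$ recovers $\beta_t$ , $v = 0$ recovers $\tilde{\beta}_t$ , and $v = 0.5$ gives their geometric mean.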

Cosine Schedule: Introduced the Cosine Variance Schedule described earlier, which destroys information more gradually and helped most on low-resolution images ( $64 \times 64$ and $32 \times 32$ ).

Hybrid Loss: Adding a small amount of the variational lower bound $L_\text{vlb}$ to $L_\text{simple}$ also improved log-likelihood.

L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}

11.3 Classifier Guidance

Dhariwal & Nichol, 2021 | arXiv: 2105.05233

A technique proposed in "Diffusion Models Beat GANs on Image Synthesis" that injects the gradient of a pre-trained classifier into the Reverse Process for conditional generation.

\hat{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t) - s \cdot \sqrt{1-\bar{\alpha}_t} \cdot \nabla_{x_t} \log p_\phi(y|x_t)

where $s$ is the guidance scale and $p_\phi$ is a classifier trained on noisy images. Increasing $s$ reduces diversity but increases fidelity to a specific class. In this paper, Diffusion Models first surpassed GANs in FID on ImageNet (FID 2.97 at 128x128 and 4.59 at 256x256).

Limitation: A separate classifier must be trained on noisy data, complicating the training pipeline.

11.4 Classifier-Free Guidance (CFG)

Ho & Salimans, 2022 | arXiv: 2207.12598

An innovative technique that achieves guidance effects without a separate classifier, and has become the de facto standard in modern Diffusion Models.

The core idea is for a single network to learn both conditional and unconditional generation. During training, condition information $c$ is replaced with a null token $\varnothing$ with a certain probability (typically 10-20%).

At inference, conditional and unconditional predictions are linearly combined.

\hat{\epsilon}_\theta(x_t, t, c) = (1 + w) \cdot \epsilon_\theta(x_t, t, c) - w \cdot \epsilon_\theta(x_t, t, \varnothing)

where $w$ is the guidance weight. When $w = 0$ , standard conditional generation occurs; when $w > 0$ , fidelity to the condition increases.

Rearranging gives the following interpretation:

\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing) + (1 + w) \cdot \underbrace{(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))}_{\text{shift toward condition}}

This can be interpreted as pushing away from the unconditional prediction toward the conditional direction, with larger $w$ increasing the pushing force. Nearly all state-of-the-art Text-to-Image models including DALL-E 2, Stable Diffusion, and Imagen use CFG.
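The two forms of the guided prediction can be compared directly. The sketch below applies both to small lists standing in for the network outputs; the values and function names are illustrative assumptions.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """(1 + w) * eps_cond - w * eps_uncond, element-wise on plain lists."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

def cfg_rearranged(eps_cond, eps_uncond, w):
    """The same quantity written as uncond + (1 + w) * (cond - uncond)."""
    return [u + (1.0 + w) * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Stand-ins for the conditional and unconditional network outputs (assumptions).
eps_c = [0.2, -0.5, 1.0]
eps_u = [0.1, -0.1, 0.4]

guided = cfg_combine(eps_c, eps_u, w=3.0)
print(guided)
print(cfg_rearranged(eps_c, eps_u, w=3.0))
print(cfg_combine(eps_c, eps_u, w=0.0))  # reduces to the conditional prediction
```

In practice the two predictions usually come from one batched forward pass, with the condition replaced by the null token for half of the batch.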

11.5 Latent Diffusion Models (LDM) / Stable Diffusion

Rombach et al., 2022 | arXiv: 2112.10752

LDM dramatically improved computational efficiency by performing the Diffusion Process in latent space rather than pixel space.

Key Architecture:

Perceptual Compression: A pre-trained Autoencoder (VQ-VAE or KL-regularized VAE) Encoder $\mathcal{E}$ compresses image $x$ into low-dimensional latent $z = \mathcal{E}(x)$ . Typically, a $256 \times 256 \times 3$ image is compressed to $32 \times 32 \times 4$ latent (approximately 48x dimensionality reduction).
Latent Diffusion: DDPM's Forward/Reverse Process is performed in this latent space. Computation is significantly reduced compared to pixel space.
Cross-Attention Conditioning: Condition information such as text and segmentation maps is injected into the U-Net via Cross-Attention. For text, CLIP or BERT embeddings are used.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V

where $Q = W_Q \cdot \varphi(z_t)$ , $K = W_K \cdot \tau_\theta(y)$ , $V = W_V \cdot \tau_\theta(y)$ , and $\tau_\theta(y)$ is the encoding of the condition information.

Stable Diffusion is trained by combining this LDM architecture with a CLIP text encoder and a large-scale dataset (LAION-5B), becoming the de facto standard for open-source Text-to-Image models.

11.6 Score SDE

Song et al., 2021 | arXiv: 2011.13456

This ICLR 2021 Oral presentation connected DDPM and Score Matching under the unified framework of Stochastic Differential Equations (SDE).

Key contributions:

Variance Exploding (VE) SDE: Corresponds to the NCSN/SMLD family
Variance Preserving (VP) SDE: Corresponds to DDPM
Sub-VP SDE: A variant providing better likelihood

\text{VP-SDE}: \quad dx = -\frac{1}{2}\beta(t) x \, dt + \sqrt{\beta(t)} \, dw

The extension to continuous time enables exact log-likelihood computation (via ODE), more flexible sampler design, and conditional generation tasks such as Inpainting and Colorization.
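One way to see why the VP-SDE corresponds to DDPM is to discretize it. A single Euler-Maruyama step with $\beta(t)\,dt = \beta$ scales the state by $1 - \beta/2$ and adds noise with standard deviation $\sqrt{\beta}$ , while the discrete DDPM step scales by $\sqrt{1 - \beta}$ . The sketch below compares the two; the function names are illustrative.

```python
import math

def ddpm_coeffs(beta):
    """(signal scale, noise scale) of one discrete DDPM forward step."""
    return math.sqrt(1.0 - beta), math.sqrt(beta)

def vp_sde_euler_coeffs(beta):
    """The same coefficients from one Euler-Maruyama step of the VP-SDE,
    taking beta(t) * dt = beta for that step."""
    return 1.0 - 0.5 * beta, math.sqrt(beta)

gaps = []
for beta in (1e-4, 1e-3, 0.02):
    (s_ddpm, n_ddpm), (s_sde, n_sde) = ddpm_coeffs(beta), vp_sde_euler_coeffs(beta)
    gaps.append((beta, abs(s_ddpm - s_sde), abs(n_ddpm - n_sde)))
    print(f"beta={beta:.0e}  signal gap={abs(s_ddpm - s_sde):.2e}")
```

The gap in the signal coefficient is second order in $\beta$ , so the two descriptions agree as the steps become small.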

11.7 Consistency Models

Song et al., 2023 | arXiv: 2303.01469

Consistency Models, proposed by Yang Song at OpenAI, represent an attempt to fundamentally solve the multi-step sampling problem of Diffusion Models.

The core idea is to learn a function $f_\theta$ that maps all points on an ODE trajectory to the same starting point (original data).

$$f_\theta(x_t, t) = x_0, \quad \forall t \in [0, T]$$

By this self-consistency property, data can be recovered from a noisy sample at any time $t$ with a single network evaluation. That is, 1-step generation is possible.
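
The self-consistency property can be checked on a toy case where the ODE trajectory is known exactly. The sketch below assumes 1-D Gaussian data $\mathcal{N}(0, \sigma_d^2)$ and the noising $x_t = x_0 + t z$, for which the probability flow ODE trajectories satisfy $x_t \propto \sqrt{\sigma_d^2 + t^2}$. The constants and function names are illustrative, and the trajectory start is taken at a small time $\varepsilon$ rather than exactly 0 (an assumption of this toy). A real Consistency Model learns $f_\theta$ with a network instead of using a formula.

```python
import math

SIGMA_DATA = 0.5   # assumed std of the toy 1-D Gaussian data
EPS = 0.002        # assumed small start time of the trajectory
T_MAX = 80.0       # assumed largest noise level


def ode_point(x_T, t):
    """Point at time t on the ODE trajectory that passes through x_T at time T_MAX."""
    return x_T * math.sqrt((SIGMA_DATA**2 + t**2) / (SIGMA_DATA**2 + T_MAX**2))


def consistency_fn(x_t, t):
    """Exact consistency function for this toy case: maps (x_t, t) to the trajectory start."""
    return x_t * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))


x_T = 37.0  # an arbitrary noisy sample at t = T_MAX
times = [EPS, 0.1, 1.0, 10.0, T_MAX]
outputs = [consistency_fn(ode_point(x_T, t), t) for t in times]
for t, out in zip(times, outputs):
    print(f"t = {t:>6}: f(x_t, t) = {out:.6f}")
```

Every point on the trajectory maps to the same output, and at $t = \varepsilon$ the function reduces to the identity.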

Two training approaches exist:

Consistency Distillation (CD): Distilling from a pre-trained Diffusion Model
Consistency Training (CT): Training independently without pre-training

In 2024, Easy Consistency Models (ECM) emerged, achieving better 2-step generation performance at 33% of the training cost compared to iCT.

11.8 Flow Matching / Rectified Flow

Lipman et al., 2023; Liu et al., 2023 | arXiv: 2210.02747, arXiv: 2209.03003

Flow Matching is an alternative approach to Diffusion Models that directly learns the probability flow connecting data and noise distributions.

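Core Idea: Define straight paths between data $x_0$ (at $t=0$) and noise $x_1 = \epsilon \sim \mathcal{N}(0, I)$ (at $t=1$).

$$x_t = (1-t) x_0 + t \, \epsilon, \quad t \in [0, 1]$$

Learn a velocity field $v_\theta(x_t, t)$ that matches the time derivative of this path, $\frac{dx_t}{dt} = \epsilon - x_0$.

$$L_{\text{FM}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| v_\theta(x_t, t) - (\epsilon - x_0) \|^2 \right]$$

Sampling then integrates the learned ODE from $t=1$ (noise) back to $t=0$ (data).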
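
A minimal PyTorch sketch of this objective and of few-step Euler sampling follows. It takes the regression target to be the time derivative of the straight path, $dx_t/dt = \epsilon - x_0$, and integrates from $t=1$ (noise) down to $t=0$ (data). The function names, the `model(x, t)` signature, and the 2-D toy data are assumptions for illustration.

```python
import torch


def flow_matching_loss(model, x0):
    """Monte Carlo estimate of the flow matching loss for a batch x0 of shape (B, D)."""
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)    # t ~ Uniform[0, 1)
    x_t = (1 - t[:, None]) * x0 + t[:, None] * eps   # point on the straight path
    target = eps - x0                                 # d x_t / dt along that path
    return ((model(x_t, t) - target) ** 2).sum(dim=-1).mean()


@torch.no_grad()
def euler_sample(model, shape, steps=4):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data) with Euler steps."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt)
        x = x - dt * model(x, t)                      # step backward in time
    return x


# Toy check: if all data sits at one point, the ideal velocity is (x_t - point) / t,
# and Euler integration then lands on that point for any number of steps.
point = torch.tensor([2.0, -1.0])


def oracle(x, t):
    return (x - point) / t[:, None]


torch.manual_seed(0)
samples = euler_sample(oracle, (8, 2), steps=4)
loss = flow_matching_loss(oracle, point.repeat(8, 1))
print(samples[0], loss.item())
```

With real data the ideal velocity is not available in closed form, so `oracle` would be replaced by a trained network; the straighter the learned paths, the fewer Euler steps are needed.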

Rectified Flow repeatedly "straightens" these paths (reflow), producing high-quality samples even with few steps.

Stable Diffusion 3 adopted Rectified Flow, presenting a new paradigm for Diffusion Models alongside the transition from U-Net to Transformer.

11.9 DiT (Diffusion Transformer)

Peebles & Xie, 2023 | arXiv: 2212.09748

DiT replaced the U-Net backbone of the Diffusion Model with a Vision Transformer (ViT).

Key design choices:

Images are divided into patches and processed as tokens
Time step $t$ and class label $y$ are injected via Adaptive Layer Normalization (adaLN-Zero)
Composed of $L$ Transformer Blocks
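
These design choices can be sketched in PyTorch as follows: a strided convolution turns the latent into patch tokens, and each block derives shift, scale, and gate vectors from the conditioning embedding, with the modulation layer zero-initialized so that every block starts as the identity (the "Zero" in adaLN-Zero). The class names, sizes, and the use of `nn.MultiheadAttention` are assumptions for this sketch, which omits positional embeddings and the final output layer.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split a (B, C, H, W) latent into non-overlapping p x p patches, one token per patch."""

    def __init__(self, in_ch, hidden, patch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, hidden, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, hidden), N = (H/p) * (W/p)


class AdaLNZeroBlock(nn.Module):
    """Transformer block whose norm shift/scale and residual gates come from the condition c."""

    def __init__(self, hidden, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(hidden, 6 * hidden))
        nn.init.zeros_(self.modulation[-1].weight)  # zero init: all gates start at 0
        nn.init.zeros_(self.modulation[-1].bias)

    def forward(self, x, c):
        # c: (B, hidden), e.g. the sum of timestep and class-label embeddings
        mod = self.modulation(c)[:, None]  # (B, 1, 6 * hidden), broadcast over tokens
        shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)


torch.manual_seed(0)
latent = torch.randn(2, 4, 32, 32)  # hypothetical latent from the autoencoder
cond = torch.randn(2, 64)           # stand-in for embed(t) + embed(y)
tokens = PatchEmbed(in_ch=4, hidden=64, patch=2)(latent)
out = AdaLNZeroBlock(hidden=64, heads=4)(tokens, cond)
print(tokens.shape, torch.allclose(out, tokens))
```

Because the gates are zero at initialization, the block initially passes its tokens through unchanged, and the conditioning signal is learned gradually during training.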

DiT, combined with Latent Diffusion, achieved FID 2.27 on ImageNet $256 \times 256$ class-conditional generation, surpassing all previous Diffusion Models.

Significance of DiT: It empirically demonstrated that Transformer scaling laws can be applied to Diffusion Models. Performance consistently improves with increased model size and training compute. This finding directly influenced the architectural choices of the latest large-scale generative models such as Sora (OpenAI, Video generation) and Stable Diffusion 3.

12. PyTorch Code Examples: Simple DDPM Implementation

Below is a simplified PyTorch implementation of DDPM's core components. A more sophisticated U-Net and hyperparameter tuning would be needed for actual training.

12.1 Noise Schedule and Forward Process

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class DDPMScheduler:
    """Scheduler managing DDPM's Forward Process."""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
        self.num_timesteps = num_timesteps

        if schedule == 'linear':
            self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        elif schedule == 'cosine':
            self.betas = self._cosine_schedule(num_timesteps)
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        # Pre-compute key variables
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)          # ᾱ_t
        self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)

        # Forward process coefficients
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)        # √ᾱ_t
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)  # √(1-ᾱ_t)

        # Reverse process coefficients
        self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)           # 1/√α_t
        self.posterior_variance = (
            self.betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
        )  # β̃_t

    def _cosine_schedule(self, timesteps, s=0.008):
        steps = timesteps + 1
        t = torch.linspace(0, timesteps, steps) / timesteps
        alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        return torch.clip(betas, 0.0001, 0.9999)

    def add_noise(self, x_0, t, noise=None):
        """Forward process: compute q(x_t | x_0) in one step."""
        if noise is None:
            noise = torch.randn_like(x_0)

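        # The schedule tensors are created on CPU; move them to t's device before indexing
        sqrt_alpha_cumprod = self.sqrt_alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha_cumprod = self.sqrt_one_minus_alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)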

        # x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
        x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise
        return x_t

12.2 Simplified U-Net

class SinusoidalPositionEmbedding(nn.Module):
    """Transformer-style Sinusoidal Time Embedding."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=-1)
        return emb


class ResBlock(nn.Module):
    """Time-conditioned Residual Block."""

    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
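        # GroupNorm requires num_channels divisible by num_groups; use 1 group when in_ch=3 (RGB)
        self.norm1 = nn.GroupNorm(8 if in_ch % 8 == 0 else 1, in_ch)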
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_mlp = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, out_ch),
        )
        self.norm2 = nn.GroupNorm(8, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # Inject time embedding
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)                           # Residual connection


class SimpleUNet(nn.Module):
    """Simplified U-Net for DDPM training."""

    def __init__(self, in_channels=3, base_channels=64, time_emb_dim=256):
        super().__init__()

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbedding(base_channels),
            nn.Linear(base_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )

        # Encoder
        self.enc1 = ResBlock(in_channels, base_channels, time_emb_dim)
        self.enc2 = ResBlock(base_channels, base_channels * 2, time_emb_dim)
        self.enc3 = ResBlock(base_channels * 2, base_channels * 4, time_emb_dim)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bot = ResBlock(base_channels * 4, base_channels * 4, time_emb_dim)

        # Decoder (with skip connections)
        self.dec3 = ResBlock(base_channels * 8, base_channels * 2, time_emb_dim)
        self.dec2 = ResBlock(base_channels * 4, base_channels, time_emb_dim)
        self.dec1 = ResBlock(base_channels * 2, base_channels, time_emb_dim)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

        # Output
        self.out = nn.Conv2d(base_channels, in_channels, 1)

    def forward(self, x, t):
        t_emb = self.time_mlp(t)

        # Encoder
        e1 = self.enc1(x, t_emb)
        e2 = self.enc2(self.pool(e1), t_emb)
        e3 = self.enc3(self.pool(e2), t_emb)

        # Bottleneck
        b = self.bot(self.pool(e3), t_emb)

        # Decoder with skip connections
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1), t_emb)
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1), t_emb)
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1), t_emb)

        return self.out(d1)  # Predicted noise ε_θ

12.3 Training Loop

def train_ddpm(model, dataloader, scheduler, epochs=100, lr=2e-4, device='cuda'):
    """DDPM training loop (Algorithm 1 implementation)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        total_loss = 0
        for batch_idx, (x_0, _) in enumerate(dataloader):
            x_0 = x_0.to(device)

            # 1. Select random time step: t ~ Uniform({0, ..., T-1}), the zero-indexed form of {1, ..., T}
            t = torch.randint(0, scheduler.num_timesteps, (x_0.shape[0],), device=device)

            # 2. Sample noise: ε ~ N(0, I)
            noise = torch.randn_like(x_0)

            # 3. Forward process: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
            x_t = scheduler.add_noise(x_0, t, noise)

            # 4. Predict noise: ε_θ(x_t, t)
            noise_pred = model(x_t, t)

            # 5. Simplified loss: L = ||ε - ε_θ(x_t, t)||²
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

12.4 Sampling

@torch.no_grad()
def sample_ddpm(model, scheduler, image_shape, device='cuda'):
    """DDPM sampling (Algorithm 2 implementation)."""
    model.eval()

    # x_T ~ N(0, I)
    x = torch.randn(image_shape, device=device)

    for t in reversed(range(scheduler.num_timesteps)):
        t_batch = torch.full((image_shape[0],), t, device=device, dtype=torch.long)

        # Predict noise
        predicted_noise = model(x, t_batch)

        # Reverse process coefficients
        alpha_t = scheduler.alphas[t]
        alpha_cumprod_t = scheduler.alphas_cumprod[t]
        beta_t = scheduler.betas[t]

        # Compute mean: μ_θ = 1/√α_t · (x_t - β_t/√(1-ᾱ_t) · ε_θ)
        mean = (1.0 / torch.sqrt(alpha_t)) * (
            x - (beta_t / torch.sqrt(1.0 - alpha_cumprod_t)) * predicted_noise
        )

        if t > 0:
            # Add stochastic noise (except at the last step)
            noise = torch.randn_like(x)
            sigma_t = torch.sqrt(scheduler.posterior_variance[t])
            x = mean + sigma_t * noise
        else:
            x = mean

    return x

12.5 Usage Example

# Hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 32
batch_size = 128
num_timesteps = 1000

# Initialize scheduler and model
scheduler = DDPMScheduler(num_timesteps=num_timesteps, schedule='cosine')
model = SimpleUNet(in_channels=3, base_channels=64).to(device)

# Dataset (e.g., CIFAR-10)
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # Normalize to [-1, 1]
])
dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Train
train_ddpm(model, dataloader, scheduler, epochs=100, device=device)

# Sample
samples = sample_ddpm(model, scheduler, (16, 3, image_size, image_size), device=device)
# samples: 16 generated images in [-1, 1] range
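
Because the network is trained on inputs normalized to [-1, 1], the sampled tensor has to be mapped back to pixel values before it can be viewed or saved. A small sketch of that conversion follows; the helper name `to_uint8_images` and the random stand-in tensor are assumptions for illustration.

```python
import torch


def to_uint8_images(samples):
    """Map a (B, C, H, W) float tensor in [-1, 1] to (B, H, W, C) uint8 in [0, 255]."""
    x = samples.detach().cpu().clamp(-1.0, 1.0)      # sampler output may overshoot the range
    x = ((x + 1.0) * 127.5).round().to(torch.uint8)  # -1 -> 0, +1 -> 255
    return x.permute(0, 2, 3, 1).contiguous()        # channels-last for image libraries


fake_samples = torch.randn(16, 3, 32, 32)            # stand-in for the output of sample_ddpm
images = to_uint8_images(fake_samples)
print(images.shape, images.dtype)
```

The clamp matters in practice: the final denoising step is not constrained to the training range, so a few values can fall slightly outside [-1, 1].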

13. Diffusion Model vs GAN vs VAE: Comparative Analysis

13.1 Comprehensive Comparison Table

Property	Diffusion Model (DDPM)	GAN	VAE
Training Method	Noise prediction (MSE)	Adversarial training (Min-Max)	Variational inference (ELBO)
Training Stability	Very stable	Unstable (mode collapse, oscillation)	Stable
Generation Quality	Very high	Very high	Moderate (blurry)
Diversity	High (full distribution coverage)	Low (mode collapse risk)	High
Generation Speed	Slow (1000 steps)	Very fast (1 step)	Fast (1 step)
Log-likelihood	Lower bound (ELBO)	Not computable	Lower bound (ELBO)
Latent Space	Implicit	None (or limited)	Explicit, continuous
Mode Coverage	High	Low	High
Conditional Generation	Very effective via CFG	Possible via cGAN	Conditional VAE
Resolution Scaling	Efficient via LDM	Progressive training needed	Hierarchical VAE needed
Theoretical Basis	Thermodynamics, Score Matching	Game theory	Variational Bayes
Representative Models	Stable Diffusion, DALL-E 2	StyleGAN, BigGAN	VQ-VAE-2, NVAE
CIFAR-10 FID	~2.0 (latest)	~2.9 (StyleGAN2)	~23.5 (NVAE)

13.2 When to Choose Which Model?

Choose Diffusion Models when:

Both generation quality and diversity are important
Complex conditional generation like text-to-image is needed
Training stability is critical
Generation speed is not the top priority

Choose GANs when:

Real-time generation is needed
High-quality images for a specific domain are needed (faces, landscapes, etc.)
The dataset is relatively small and uniform

Choose VAEs when:

Meaningful Latent Space manipulation is needed
Likelihood-based anomaly detection is needed
Fast encoding/decoding is required
Semi-supervised learning or representation learning is the main purpose

14. Present and Future of Diffusion Models

14.1 Major Trends in 2024-2025

Architecture Transition: From U-Net to Transformer. The latest models such as Stable Diffusion 3, FLUX, and Sora adopt DiT-based architectures. Transformer scaling laws have been confirmed to apply to Diffusion Models, and model scale expansion (8B+ parameters) is actively underway.

Sampling Efficiency. With advances in Consistency Models, Flow Matching, and DPM-Solver, 1-4 step generation has become possible. Rectified Flow learns straight paths, achieving high quality even with few steps.

Multimodal Expansion. Diffusion Models are expanding beyond images to video (Sora, Runway Gen-3), audio (AudioLDM), 3D (DreamFusion, Zero-1-to-3), robotics (Diffusion Policy), and other domains.

Acceleration and Optimization. Techniques such as Distillation, Quantization, and Caching have greatly improved inference speed, approaching real-time image generation.

14.2 Historical Significance of DDPM

DDPM represents a turning point in generative model history in the following ways:

Demonstrated the competitiveness of Likelihood-based models in the image generation space dominated by GANs
Showed that high-quality generation is possible with an extremely simple training objective ( $L_\text{simple}$ )
Established a theoretical framework connecting thermodynamics and Score Matching
Became the direct foundation of the modern AI revolution including Stable Diffusion, DALL-E 2, and Midjourney

15. References

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585
Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models (DDIM). ICLR 2021. arXiv:2010.02502
Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672
Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021. arXiv:2105.05233
Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop 2021. arXiv:2207.12598
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). Consistency Models. ICML 2023. arXiv:2303.01469
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). ICCV 2023. arXiv:2212.09748
Song, Y. & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600
Weng, L. (2021). What are Diffusion Models? lilianweng.github.io
Hugging Face. The Annotated Diffusion Model. huggingface.co/blog/annotated-diffusion
