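A Complete Analysis of the GAN Paper: How Generative Adversarial Networks Opened the Era of AI Generative Models

1. Paper Overview and Historical Significance
2. The Core Idea of GAN
3. Mathematical Foundations
4. Training Algorithm
5. Key Problems of GANs
6. The GAN Family Tree
7. Implementing GANs in PyTorch
8. GAN vs Diffusion Models
9. The Present and Future of GANs
10. Conclusion
References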

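1. Paper Overview and Historical Significance

1.1 Basic Information

"Generative Adversarial Nets" was presented at NeurIPS (then NIPS) in 2014 and co-authored by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. According to a well-known anecdote, Goodfellow came up with the idea while discussing generative models with colleagues at a bar in Montreal, went home and coded it that same night, and the first prototype worked right away.

The core idea is remarkably intuitive: a counterfeiter (the Generator) and the police (the Discriminator) compete, so the counterfeiter produces ever more convincing counterfeit bills while the police become ever better at detecting them. If this adversarial process converges, the counterfeiter ends up producing bills that cannot be told apart from real ones.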

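1.2 Historical Context: The Generative Modeling Landscape in 2014

Around the time GAN appeared, the main families of generative models were as follows.

Variational Autoencoder (VAE, 2013): Proposed by Kingma and Welling, the VAE learns the data distribution by introducing stochastic latent variables into an Encoder-Decoder architecture. However, optimizing the ELBO (Evidence Lower Bound) tended to produce blurry images.

Boltzmann Machine family: Deep Boltzmann Machines, Restricted Boltzmann Machines, and related models are energy-based and theoretically elegant, but they rely on MCMC (Markov Chain Monte Carlo) sampling, which makes training slow and limits scalability.

Autoregressive models: These models generate an output one element at a time. Earlier examples such as NADE (2011) predate GAN, while PixelRNN (2016), the best-known image model of this kind, came afterwards. They can produce high-quality samples, but sequential pixel-by-pixel generation is extremely slow.

GAN sidestepped these limitations in one stroke. It could generate high-quality samples without defining an explicit probability distribution, and it could produce a sample in a single forward pass, with no Markov chain and no sequential generation process. This amounted to a paradigm shift in generative modeling.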

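1.3 Impact

As of 2024 the GAN paper had been cited roughly 65,000 times or more, and hundreds of GAN variants were proposed over the following decade. Yann LeCun praised GANs as "the most interesting idea in the last 10 years in machine learning." GANs have been applied to image generation, super-resolution, style transfer, data augmentation, drug discovery, and many other areas, and they were the dominant family of generative models until Diffusion Models emerged.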

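2. The Core Idea of GAN

2.1 Two-Player Game: Generator vs Discriminator

The GAN framework consists of two neural networks that compete with each other.

Generator (G): Takes a random noise vector $z$ as input and produces fake data $G(z)$ . Its goal is to produce samples similar enough to real data to fool the Discriminator.

G: z \sim p_z(z) \rightarrow G(z) \in \mathbb{R}^d

Discriminator (D): Decides whether its input came from the real data distribution ( $x \sim p_{data}$ ) or was produced by the Generator ( $G(z)$ ). Its output is a probability between 0 and 1; the closer to 1, the more confident it is that the input is real.

D: x \rightarrow [0, 1]

The two networks have opposing goals:

Generator: tries to maximize $D(G(z))$ (make the Discriminator classify fakes as real)
Discriminator: tries to assign high probability to real data and low probability to fake data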

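2.2 An Intuitive Analogy

GAN training is easy to understand through an analogy with the art market.

Component	Analogy	Role
Generator	Art forger	Aims to produce forgeries indistinguishable from genuine works
Discriminator	Art appraiser	Aims to tell genuine works from forgeries
Training Data	Genuine artworks	Samples from the real data distribution
Noise Vector $z$	The forger's inspiration/materials	A random point in latent space

At first the forger is so unskilled that the appraiser spots the forgeries easily. But the forger gradually improves using the appraiser's feedback (the gradient), and the appraiser in turn sharpens their skill to keep up with better forgeries. If this competition runs long enough, the forger produces works that cannot be distinguished from the real thing.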

2.3 Minimax Game Formulation

The GAN training objective is formulated as the following minimax game:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Let us look at each term of the value function $V(D, G)$ .

First term: $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$

This is the Discriminator's judgment on real data $x$ . The Discriminator wants to maximize it, so it aims for $D(x) \rightarrow 1$ (calling real data real). The Generator has no influence on this term.

Second term: $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

This is the Discriminator's judgment on fake data produced by the Generator.

The Discriminator wants to maximize it: when $D(G(z)) \rightarrow 0$ (calling fakes fake), $\log(1 - 0) = 0$ is the maximum
The Generator wants to minimize it: when $D(G(z)) \rightarrow 1$ (fakes judged real), $\log(1 - 1) = -\infty$ is the minimum

This is where the name adversarial comes from: the two players optimize the same value function in opposite directions.

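3. Mathematical Foundations

3.1 Optimal Discriminator

For a fixed Generator $G$ , let us derive the optimal Discriminator $D^*_G$ . Writing the expectations in the value function as integrals gives:

V(D, G) = \int_x p_{data}(x) \log D(x) \, dx + \int_x p_g(x) \log(1 - D(x)) \, dx

where $p_g$ is the distribution of the data produced by the Generator. Combining the two into a single integral:

V(D, G) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx

Differentiating the integrand with respect to $D(x)$ and setting the result to zero:

\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0

Solving for $D(x)$ gives the optimal discriminator:

D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}

This result is also intuitive. Suppose a sample is a priori equally likely to come from the real data or from the Generator, with densities $p_{data}(x)$ and $p_g(x)$ at the point $x$ . Then the optimal decision is exactly the posterior probability, by Bayes' rule, that $x$ is real.

Key observation: when $p_g = p_{data}$ , that is, when the Generator has learned the real data distribution perfectly, $D^*_G(x) = \frac{1}{2}$ for all $x$ . The Discriminator can no longer distinguish real from fake at all.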
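The closed form above can be checked numerically at a single point. The following is a minimal sketch using only the Python standard library; the density values `p_data_x` and `p_g_x` are made-up illustrative numbers, not taken from the paper. It compares the formula with a brute-force search over candidate discriminator outputs.

```python
import math


def value_at_point(d: float, p_data: float, p_g: float) -> float:
    """Integrand of V(D, G) at one x: p_data * log(d) + p_g * log(1 - d)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)


def optimal_discriminator(p_data: float, p_g: float) -> float:
    """Closed-form maximizer D*(x) = p_data / (p_data + p_g)."""
    return p_data / (p_data + p_g)


# Hypothetical density values at one point x (illustrative only).
p_data_x, p_g_x = 0.3, 0.1
d_star = optimal_discriminator(p_data_x, p_g_x)

# Brute-force search over a grid of candidate outputs in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
d_best = max(grid, key=lambda d: value_at_point(d, p_data_x, p_g_x))

print(f"closed form D* = {d_star:.3f}, grid-search argmax = {d_best:.3f}")
```

When the two densities are equal, `optimal_discriminator` returns 0.5, matching the key observation above.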

3.2 Relation to the Jensen-Shannon Divergence

Substituting the optimal discriminator $D^*_G$ into the value function gives:

V(D^*_G, G) = \mathbb{E}_{x \sim p_{data}} \left[ \log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g} \left[ \log \frac{p_g(x)}{p_{data}(x) + p_g(x)} \right]

Rewriting each denominator as $2 \cdot \frac{p_{data}(x) + p_g(x)}{2}$ and pulling out the two $-\log 2$ terms, this becomes:

V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)

where $JSD$ is the Jensen-Shannon Divergence, defined as:

JSD(p \| q) = \frac{1}{2} KL\left(p \left\| \frac{p+q}{2}\right.\right) + \frac{1}{2} KL\left(q \left\| \frac{p+q}{2}\right.\right)

JSD is a symmetrized version of the KL Divergence and always lies in the range $0 \leq JSD(p \| q) \leq \log 2$ . $JSD = 0$ holds only when $p = q$ , that is, when the two distributions are identical.
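The identity $V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)$ can be verified on small discrete distributions. The sketch below uses only the standard library; the two example distributions are arbitrary illustrative choices, and points where both probabilities are zero are skipped as an assumption, since they contribute nothing.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_with_optimal_d(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # neither distribution puts mass here
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total


p_data = [0.5, 0.3, 0.2]  # illustrative distributions
p_g = [0.2, 0.2, 0.6]

lhs = value_with_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
print(f"V(D*, G) = {lhs:.6f}, -log 4 + 2 JSD = {rhs:.6f}")
```

With disjoint supports such as `[1, 0]` and `[0, 1]`, `jsd` returns $\log 2$, the upper bound mentioned above.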

3.3 Proof of Global Optimality

Theorem (Goodfellow et al., 2014): The global minimum of $C(G) = \max_D V(D, G)$ is achieved if and only if $p_g = p_{data}$ , and at that point $C(G) = -\log 4$ .

Proof:

(1) $C(G) = V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)$

(2) $JSD(p_{data} \| p_g) \geq 0$ (non-negativity of JSD)

(3) $JSD(p_{data} \| p_g) = 0 \iff p_{data} = p_g$

(4) Therefore $C(G) \geq -\log 4$ , with equality exactly when $p_g = p_{data}$

This provides the theoretical guarantee behind GAN training. Given a Generator and Discriminator with sufficient capacity, the Generator recovers the real data distribution exactly at the Nash equilibrium of the minimax game.

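3.4 Nash Equilibrium

From a game-theoretic viewpoint, GAN training is the problem of finding a Nash equilibrium between two players. A Nash equilibrium is a state in which neither player can gain by changing its own strategy while the other player's strategy stays fixed.

The Nash equilibrium of a GAN is:

$G^*$ : a Generator that achieves $p_g = p_{data}$
$D^*$ : a Discriminator that outputs $D(x) = \frac{1}{2}$ for all $x$

In the idealized setting of Section 3.3 this equilibrium exists and is unique, but finding it in practice is very hard, because training is a non-convex game in which two networks must be optimized simultaneously. This is the central difficulty of GAN training and the starting point of a great deal of follow-up work.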

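3.5 KL Divergence vs JS Divergence

Why JSD in particular? Let us compare it with the KL Divergence.

Problems with the KL Divergence:

KL(p_{data} \| p_g) = \int p_{data}(x) \log \frac{p_{data}(x)}{p_g(x)} dx

The KL Divergence is asymmetric, and it diverges to infinity wherever $p_g(x) = 0$ but $p_{data}(x) > 0$ . This is a problem early in training, when the Generator's distribution does not yet cover the real distribution well.

Advantages of the JS Divergence:

Symmetric: $JSD(p \| q) = JSD(q \| p)$
Always finite: $0 \leq JSD \leq \log 2$
Each KL term is measured against the mixture $\frac{p+q}{2}$ , so it does not diverge even where one of the distributions is zero

JSD is not perfect either. If the supports of the two distributions do not overlap, JSD equals the constant $\log 2$ and its gradient is zero. This is the root cause of the vanishing gradient problem in GAN training, and the main motivation for WGAN's later introduction of the Wasserstein distance.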

4. Training Algorithm

4.1 Training Procedure

The training algorithm proposed in the original paper is as follows:

Algorithm 1: GAN Training (Goodfellow et al., 2014)

for number of training iterations do
    # --- Step 1: update the Discriminator (k steps) ---
    for k steps do
        - Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_z(z)
        - Sample a minibatch of m real examples {x^(1), ..., x^(m)} from the data distribution p_data(x)
        - Update the Discriminator's parameters by ascending its stochastic gradient:

          nabla_{theta_d} (1/m) sum_{i=1}^{m} [log D(x^(i)) + log(1 - D(G(z^(i))))]

    end for

    # --- Step 2: update the Generator (1 step) ---
    - Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_z(z)
    - Update the Generator's parameters by descending its stochastic gradient:

          nabla_{theta_g} (1/m) sum_{i=1}^{m} log(1 - D(G(z^(i))))

end for

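4.2 Alternating Optimization

The key idea is alternating optimization: the Discriminator and the Generator are updated in turn.

Why the Discriminator is updated k times for every Generator update:

In theory, the Generator receives the correct gradient direction only after the optimal discriminator $D^*_G$ has been found. Fully optimizing $D$ in the inner loop is computationally prohibitive, so it is approximated with $k$ gradient steps. The original paper used $k = 1$ , the cheapest option, in its experiments.

The importance of keeping the two in balance:

If the Discriminator becomes too strong: the Generator's gradient vanishes and learning stalls
If the Discriminator becomes too weak: it gives the Generator no useful learning signal
Ideally the Discriminator and the Generator improve together at a similar pace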

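4.3 Non-Saturating Loss (a Practical Modification)

In the theoretical minimax objective, the Generator minimizes $\log(1 - D(G(z)))$ . Early in training, however, the Generator is poor and the Discriminator rejects its samples with high confidence, so $D(G(z)) \approx 0$ . In that regime $\log(1 - D(G(z)))$ saturates: it is nearly flat as a function of the Discriminator's logit, so almost no gradient reaches the Generator.

To address this, Goodfellow et al. modified the Generator's objective as follows:

Original (Minimax):

\min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Modified (Non-Saturating):

\max_G \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]

Both objectives have the same fixed point (the Nash equilibrium), but the gradient magnitudes early in training differ greatly. The non-saturating loss keeps a strong gradient even when $D(G(z))$ is small, so the Generator can learn quickly. Writing $D(G(z)) = \sigma(a)$ , where $a$ is the Discriminator's logit and $\sigma$ is the sigmoid, makes the difference explicit:

\text{Minimax gradient}: \frac{\partial}{\partial a} \log(1 - \sigma(a)) = -\sigma(a) = -D(G(z)) \approx 0 \text{ when } D(G(z)) \approx 0

\text{Non-Saturating gradient}: \frac{\partial}{\partial a} \log \sigma(a) = 1 - \sigma(a) = 1 - D(G(z)) \approx 1 \text{ when } D(G(z)) \approx 0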
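The saturation effect is easy to see numerically. The sketch below assumes, as is standard but not stated in the original algorithm listing, that the Discriminator's output is a sigmoid of a logit `a`; it evaluates the derivative of each Generator objective with respect to that logit using only the standard library.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(a: float) -> float:
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)


def non_saturating_grad(a: float) -> float:
    """d/da log(sigmoid(a)) = 1 - sigmoid(a)."""
    return 1.0 - sigmoid(a)


# Negative logits mean the Discriminator confidently rejects the fake sample.
for logit in (-10.0, -5.0, 0.0, 5.0):
    print(
        f"logit={logit:+5.1f}  D(G(z))={sigmoid(logit):.5f}  "
        f"|minimax grad|={abs(saturating_grad(logit)):.5f}  "
        f"|non-saturating grad|={abs(non_saturating_grad(logit)):.5f}"
    )
```

At strongly negative logits the minimax gradient is almost zero while the non-saturating gradient stays close to 1, which is the regime that matters early in training.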

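4.4 Experimental Results in the Original Paper

The original paper ran experiments on the MNIST, Toronto Face Database (TFD), and CIFAR-10 datasets. Evaluation used Parzen-window-based log-likelihood estimates, and GAN was competitive with earlier generative models such as Deep Belief Networks (DBN), stacked contractive autoencoders, and deep generative stochastic networks (GSN).

By today's standards, however, the results were very simple. The Generator and Discriminator were mostly plain MLPs (Multi-Layer Perceptrons), and the resolution and quality of the generated images were limited. The real breakthroughs came later, through improved architectures and training techniques.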

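5. Key Problems of GANs

5.1 Mode Collapse

The most notorious problem of GANs is mode collapse. The Generator fails to learn all the modes (distinct patterns) of the data distribution and instead concentrates on a few of them, producing near-identical outputs over and over.

How it happens:

Once the Generator finds a small set of patterns that fool the Discriminator particularly well, it keeps producing those patterns instead of exploring others. For example, when training on MNIST the Generator may produce a perfect digit '1' and never produce any other digit.

Mathematical interpretation:

Mode collapse is related to the game being played as maximin rather than minimax:

\max_D \min_G V(D, G) \neq \min_G \max_D V(D, G)

In the theoretical minimax game the Generator must hold up against every possible Discriminator, so it has to cover the whole distribution. In actual training, however, the Generator only needs to fool the current Discriminator, so concentrating on a particular mode can be a "rational" strategy.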

5.2 Training Instability

GAN training is inherently the problem of finding the Nash equilibrium of a non-cooperative game, which is much harder than an ordinary optimization problem.

Oscillation: The Generator and Discriminator frequently fail to converge and instead oscillate around each other. On an ordinary loss landscape gradient descent moves toward a local minimum, but in a minimax game it can circle around a saddle point.

Difficulty of balancing: If the Discriminator converges too quickly the Generator cannot learn, and if the Discriminator is too weak it passes no meaningful learning signal to the Generator. Maintaining this delicate balance was the biggest practical challenge in training GANs.

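5.3 Vanishing Gradients

As explained in Section 3.5, when the supports of the two distributions do not overlap, the JS Divergence equals the constant $\log 2$ and its gradient is zero.

For high-dimensional data such as images, both the real data distribution and the Generator's distribution lie on low-dimensional manifolds in a high-dimensional space. The chance that these two manifolds overlap is very small, so early in training the supports typically have almost no overlap. In this situation a JSD-based GAN provides no useful gradient at all.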

\text{When } \text{supp}(p_{data}) \cap \text{supp}(p_g) = \emptyset: \quad JSD(p_{data} \| p_g) = \log 2 \quad (\text{constant})

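5.4 The Difficulty of Evaluation

Evaluating GAN performance objectively is itself a hard problem. The main metrics are:

Inception Score (IS): Measures the quality (sharpness) and diversity of generated images. Using a pretrained Inception network, the score is high when the class prediction for each image is confident (quality) and the predictions over all images are spread across many classes (diversity).

Fréchet Inception Distance (FID): Measures the Fréchet distance between the Inception feature distributions of real and generated data, each approximated by a Gaussian. Lower is better. It is widely used as a more reliable metric than IS.

FID = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features of the real and generated images, respectively.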
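To make the formula concrete, here is a minimal sketch for the special case of diagonal covariances, where the matrix square root reduces to an element-wise square root. This is a simplifying assumption made only so the example needs nothing beyond the standard library; real FID implementations use full covariance matrices of Inception features and a matrix square root.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between two Gaussians with diagonal covariances.

    For diagonal matrices, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) reduces to
    sum_i (var_r[i] + var_g[i] - 2 * sqrt(var_r[i] * var_g[i])).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Toy 2-D feature statistics (illustrative values, not real Inception features).
same = fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
wider = fid_diagonal([0.0], [1.0], [0.0], [4.0])
print(same, shifted, wider)
```

Identical statistics give 0, a unit shift of one mean component adds 1, and a mismatch in variance is penalized even when the means agree.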

6. The GAN Family Tree

6.1 DCGAN (2015): The Start of Stable CNN-Based Training

Radford, Metz, Chintala. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" (2015)

The original GAN was built mostly from MLPs (fully connected layers), so it could not exploit the strong spatial feature extraction of CNNs for image generation. DCGAN (Deep Convolutional GAN) was one of the first architectures to integrate CNNs into GANs successfully and at scale, and it proposed a set of architectural guidelines for stable training.

Core architectural rules of DCGAN:

Remove pooling layers: replace max pooling with strided convolutions (Discriminator) and fractional-strided / transposed convolutions (Generator)
Apply Batch Normalization: use Batch Normalization in both the Generator and the Discriminator, except at the Generator's output layer and the Discriminator's input layer
Remove fully connected hidden layers: connect the convolutional features directly to the Generator's input and the Discriminator's output (the paper notes that global average pooling improved stability but slowed convergence)
Activation functions: the Generator uses Tanh at the output layer and ReLU elsewhere; the Discriminator uses LeakyReLU in all layers

DCGAN Generator structure (conceptual):

z (100-dim) -> FC -> Reshape (4x4x1024) -> ConvT -> BN -> ReLU (8x8x512)
-> ConvT -> BN -> ReLU (16x16x256) -> ConvT -> BN -> ReLU (32x32x128)
-> ConvT -> Tanh (64x64x3)
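The spatial sizes in the outline above follow from the output-size rule of a transposed convolution. The sketch below assumes kernel size 4, stride 2, and padding 1 for each upsampling layer (a common choice, and the one used in the PyTorch code later in this article, but not stated in the outline itself), with dilation 1 and no output padding.

```python
def conv_transpose_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Start from the 4x4 feature map and apply four stride-2 transposed convolutions.
sizes = [4]
for _ in range(4):
    sizes.append(conv_transpose_out(sizes[-1], kernel=4, stride=2, padding=1))

print(sizes)  # [4, 8, 16, 32, 64]
```

The same rule with kernel 4, stride 1, and padding 0 maps a 1x1 input to 4x4, which is how a purely convolutional Generator can replace the initial FC-and-reshape step.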

Beyond generating good images, DCGAN showed that the learned latent space has meaningful structure. In a famous example, the authors demonstrated that vector arithmetic in latent space corresponds to semantic transformations:

\text{vec}(\text{"man with glasses"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"woman with glasses"})

6.2 WGAN (2017): Introducing the Wasserstein Distance

Arjovsky, Chintala, Bottou. "Wasserstein GAN" (2017)

WGAN is one of the most important advances in GAN theory. It introduced the Wasserstein distance (Earth Mover's distance) to address the fundamental limitation of the JS Divergence.

Wasserstein Distance (EM Distance):

W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma} [\|x - y\|]

where $\Pi(p_{data}, p_g)$ is the set of all joint distributions whose marginals are $p_{data}$ and $p_g$ . Intuitively, it is the minimum cost of "moving piles of earth" to turn one distribution into the other.

Key advantage of the Wasserstein distance:

Unlike JSD, it gives a continuous distance with a useful gradient even when the supports of the two distributions do not overlap. For example, consider two point masses $\delta_0$ and $\delta_\theta$ ( $\theta > 0$ ):

JSD(\delta_0 \| \delta_\theta) = \log 2 \quad \text{(constant, gradient = 0)}

W(\delta_0, \delta_\theta) = |\theta| \quad \text{(continuous, gradient} = \text{sign}(\theta)\text{)}
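The point-mass example can be reproduced on a discrete grid. This is a minimal standard-library sketch: it places the two point masses on integer grid positions (an assumption made for illustration) and uses the fact that in one dimension the Wasserstein-1 distance equals the accumulated absolute difference of the two CDFs.

```python
import math


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q, spacing=1.0):
    """W1 on an evenly spaced 1-D grid: sum of |CDF_p - CDF_q| times spacing."""
    cdf_p = cdf_q = total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q) * spacing
    return total


def point_mass(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


size = 8
delta_0 = point_mass(0, size)
for theta in range(1, 6):
    delta_theta = point_mass(theta, size)
    print(
        f"theta={theta}: JSD={jsd(delta_0, delta_theta):.4f}, "
        f"W={wasserstein_1d(delta_0, delta_theta):.1f}"
    )
```

JSD stays at $\log 2 \approx 0.6931$ no matter how far apart the masses are, while the Wasserstein distance grows linearly with $\theta$ and therefore tells the Generator which way to move.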

Kantorovich-Rubinstein Duality:

Computing the Wasserstein distance directly is intractable, so WGAN uses the Kantorovich-Rubinstein duality:

W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]

where the supremum is taken over all 1-Lipschitz functions. WGAN trains the Discriminator (now called the Critic) to approximate this 1-Lipschitz function.

Weight Clipping: The original WGAN enforced the Lipschitz constraint by clipping the critic's weights to the range $[-c, c]$ . However, this severely limits the critic's expressive power and can cause training instability.

6.3 WGAN-GP (2017): Gradient Penalty

Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville. "Improved Training of Wasserstein GANs" (2017)

The Gradient Penalty (GP) was proposed to fix the problems of weight clipping. Instead of enforcing the Lipschitz constraint through a hard bound on the weights, it adds a regularizer that pushes the norm of the critic's gradient toward 1.

L_{WGAN-GP} = \underbrace{\mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_{data}}[D(x)]}_{\text{Original Critic Loss}} + \underbrace{\lambda \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \left[ (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 \right]}_{\text{Gradient Penalty}}

where $\hat{x}$ is a random interpolation between real and generated data:

\hat{x} = \epsilon x + (1 - \epsilon) G(z), \quad \epsilon \sim \text{Uniform}[0, 1]

WGAN-GP uses $\lambda = 10$ and $n_{critic} = 5$ critic updates per Generator update as defaults, and it trains stably across a variety of architectures with almost no hyperparameter tuning.
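For intuition, the penalty term can be worked out by hand for a linear critic $D(x) = w \cdot x + b$, whose gradient with respect to the input is simply $w$ at every point. The sketch below is a toy under that assumption (a real critic is a neural network and needs automatic differentiation); the function names are illustrative.

```python
import math


def interpolate(real, fake, eps):
    """x_hat = eps * x + (1 - eps) * G(z), applied element-wise."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]


def linear_critic_gradient(weights, x_hat):
    """For D(x) = w . x + b the input gradient is w, whatever x_hat is."""
    return list(weights)


def gradient_penalty(weights, x_hat, lam=10.0):
    """lam * (||grad D(x_hat)||_2 - 1)^2 for a single interpolate."""
    grad = linear_critic_gradient(weights, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


x_hat = interpolate(real=[4.0, 0.0], fake=[0.0, 4.0], eps=0.25)
print(x_hat)                                 # [1.0, 3.0]
print(gradient_penalty([3.0, 4.0], x_hat))   # gradient norm 5 -> 10 * 16 = 160.0
print(gradient_penalty([0.6, 0.8], x_hat))   # gradient norm 1 -> approximately 0
```

A critic whose gradient norm is already 1 pays essentially no penalty, while a steeper critic is pulled back toward the 1-Lipschitz regime.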

6.4 Progressive GAN (2017): Progressively Increasing Resolution

Karras, Aila, Laine, Lehtinen. "Progressive Growing of GANs for Improved Quality, Stability, and Variation" (2017)

Progressive GAN (ProGAN), proposed by an NVIDIA research team, opened a new chapter in high-resolution image generation. The core idea is to start the Generator and Discriminator at low resolution and raise the resolution step by step by adding layers.

Training process:

Phase 1: train G and D at 4x4 resolution
Phase 2: add 8x8 layers, with a gradual transition via fade-in
Phase 3: add 16x16 layers
...
Phase N: reach the final 1024x1024 resolution

Fade-in mechanism: When a new layer is added, the output of the existing layers and the output of the new layer are combined as a weighted average. The weight $\alpha$ increases slowly from 0 to 1, so the new layer is activated gradually.

\text{output} = (1 - \alpha) \cdot \text{upsampled\_old} + \alpha \cdot \text{new\_layer\_output}
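The fade-in blend is a plain convex combination. Below is a minimal standard-library sketch operating on flat lists of pixel values; the linear `alpha_schedule` and its `fade_steps` parameter are assumptions chosen for illustration.

```python
def alpha_schedule(step: int, fade_steps: int) -> float:
    """Linearly ramp alpha from 0 to 1 over fade_steps, then hold at 1."""
    return min(1.0, step / fade_steps)


def fade_in(upsampled_old, new_layer_output, alpha):
    """output = (1 - alpha) * upsampled_old + alpha * new_layer_output."""
    return [
        (1.0 - alpha) * old + alpha * new
        for old, new in zip(upsampled_old, new_layer_output)
    ]


old_path = [1.0, 2.0]   # upsampled output of the previous resolution
new_path = [3.0, 4.0]   # output of the newly added layer

for step in (0, 50, 100, 200):
    a = alpha_schedule(step, fade_steps=100)
    print(step, a, fade_in(old_path, new_path, a))
```

At `alpha = 0` the network behaves exactly as before the new layer was added, which is why the transition does not shock the already trained lower-resolution layers.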

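Key contributions:

Much more stable training: the coarse structure is learned first at low resolution and fine detail is added gradually
1024x1024 resolution: produced realistic 1024x1024 face images on the CelebA-HQ dataset, a resolution that earlier GANs had not reached
Minibatch standard deviation: a technique that uses statistics across the minibatch to increase diversity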

6.5 The StyleGAN Series (2019-2021): Style-Based Generation

StyleGAN (2019)

Karras, Laine, Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks" (2019)

StyleGAN combines the progressive training of Progressive GAN with the style-separation idea from Neural Style Transfer.

Key structural changes:

Mapping Network: The input latent vector $z \in \mathcal{Z}$ is transformed into an intermediate latent space $\mathcal{W}$ by a nonlinear mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$ . It consists of 8 FC layers.
Adaptive Instance Normalization (AdaIN): The style vector $w$ from $\mathcal{W}$ is injected into each convolution layer.

\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}

where $y_s$ and $y_b$ are the scale and bias obtained from the style vector $w$ through a learned affine transformation.

Constant Input: The Generator starts from a learnable constant tensor (4x4x512). Style is injected only through AdaIN.
Noise Injection: Per-pixel noise is added after each convolution layer to control stochastic variation (for example, the placement of hairs or skin pores).
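The AdaIN formula above acts on one feature map (channel) at a time: normalize it to zero mean and unit standard deviation, then apply the style's scale and bias. Here is a minimal standard-library sketch on a flattened channel; the small `eps` for numerical stability is an added assumption, and the scale and bias values are arbitrary.

```python
import math


def adain(channel, scale, bias, eps=1e-8):
    """AdaIN for one channel: scale * (x - mean) / std + bias."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)
    return [scale * (v - mean) / std + bias for v in channel]


feature_map = [1.0, 2.0, 3.0, 4.0]          # one flattened channel (toy values)
styled = adain(feature_map, scale=2.0, bias=5.0)

out_mean = sum(styled) / len(styled)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in styled) / len(styled))
print(f"mean={out_mean:.4f}, std={out_std:.4f}")
```

Whatever statistics the incoming feature map had, after AdaIN its mean and standard deviation are set by the style, which is what lets each layer's style override the one before it.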

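Style hierarchy:

Resolution layers	Attributes controlled
$4^2 - 8^2$ (Coarse)	Pose, face shape, presence of eyeglasses
$16^2 - 32^2$ (Middle)	Facial features, hairstyle, eyes open/closed
$64^2 - 1024^2$ (Fine)	Color scheme, microstructure, background detail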

StyleGAN2 (2020)

Karras, Laine, Aittala, Hellsten, Lehtinen, Aila. "Analyzing and Improving the Image Quality of StyleGAN" (2020)

StyleGAN2 fixed several artifacts of StyleGAN and substantially improved image quality.

Main improvements:

Weight Demodulation: replaces AdaIN and removes the droplet-shaped blob artifacts. It fixes the problem that AdaIN's instance normalization destroys information about the relative magnitudes of feature maps
Removal of Progressive Growing: uses skip connections and residual connections to train stably at high resolution without progressive growing
Path Length Regularization: improves the smoothness of the latent space, encouraging a small change in the latent vector to produce a proportional change in the image
Lazy Regularization: applies the regularizers once every 16 steps rather than at every step, which improves efficiency

StyleGAN2-ADA: introduces Adaptive Discriminator Augmentation so that training does not overfit on limited data. High-quality generation became possible even with small datasets of a few thousand images.

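StyleGAN3 (2021)

Karras, Aittala, Laine, et al. "Alias-Free Generative Adversarial Networks" (2021)

StyleGAN3 addressed a fundamental signal-processing problem.

Problem: In StyleGAN2, fine details of the generated image appear "glued" to image coordinates. When the object should move, the texture stays in place instead of moving with it, a symptom of aliasing.

Solution: Redesign the network so that every signal inside it is treated as a continuous signal, suppressing at the source the aliasing that arises from discrete sampling.

Key changes:

Replaces the learned constant input with Fourier features
Guarantees that operations are equivariant in the continuous domain
Achieves equivariance to translation and rotation, even at subpixel scale
Matches the FID of StyleGAN2 while having fundamentally different internal representations

StyleGAN3 laid a better foundation for video generation and animation.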

6.6 Conditional GAN, Pix2Pix, CycleGAN

Conditional GAN (cGAN, 2014)

Mirza, Osindero. "Conditional Generative Adversarial Nets" (2014)

The original GAN offers no control over what is generated. Conditional GAN feeds additional conditioning information $y$ (for example, a class label) to both the Generator and the Discriminator, which makes it possible to generate data with desired properties.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)|y))]

Pix2Pix (2017)

Isola, Zhu, Zhou, Efros. "Image-to-Image Translation with Conditional Adversarial Networks" (2017)

Pix2Pix is an image-to-image translation framework trained on paired image data. It handles a range of tasks within one framework, such as colorizing black-and-white photos, turning satellite images into maps, and turning sketches into photos.

Key components:

U-Net Generator: an Encoder-Decoder architecture with added skip connections
PatchGAN Discriminator: classifies real vs fake per $N \times N$ patch rather than for the whole image
L1 Reconstruction Loss + Adversarial Loss: pursues structural similarity and realism at the same time

\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)

CycleGAN (2017)

Zhu, Park, Isola, Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks" (2017)

Pix2Pix has a major constraint: it needs paired data. CycleGAN learns a translation between two domains from unpaired data alone.

Core idea: Cycle Consistency Loss

It trains two Generators, $G: X \rightarrow Y$ and $F: Y \rightarrow X$ , and two Discriminators, $D_X$ and $D_Y$ .

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]

The constraint says that translating an image from domain $X$ to $Y$ and then back to $X$ should recover the original image. This makes it possible to learn a meaningful mapping without paired data.
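The cycle consistency loss can be sketched with toy "generators" that are ordinary functions on lists of numbers. Everything here is illustrative: `G` doubles each value, `F` halves it, and `F_bad` is a deliberately wrong inverse; in a real CycleGAN both mappings are neural networks.

```python
def l1_norm(a, b):
    """||a - b||_1 for two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def cycle_loss(G, F, xs, ys):
    """Mean ||F(G(x)) - x||_1 over xs plus mean ||G(F(y)) - y||_1 over ys."""
    forward = sum(l1_norm(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_norm(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def G(x):      # toy mapping X -> Y
    return [2.0 * v for v in x]


def F(y):      # toy mapping Y -> X, the exact inverse of G
    return [0.5 * v for v in y]


def F_bad(y):  # identity: not an inverse of G
    return list(y)


xs = [[1.0, 2.0]]   # samples from domain X
ys = [[4.0]]        # samples from domain Y

print(cycle_loss(G, F, xs, ys))      # 0.0
print(cycle_loss(G, F_bad, xs, ys))  # 7.0
```

The loss is zero only when the two mappings undo each other, which is the constraint that rules out arbitrary mappings between the two domains.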

Applications: turning horses into zebras, summer landscapes into winter ones, photos into Monet-style paintings, and so on.

6.7 BigGAN (2018): The Power of Scale

Brock, Donahue, Simonyan. "Large Scale GAN Training for High Fidelity Natural Image Synthesis" (2018)

BigGAN dramatically demonstrated that scale matters in GAN training. It was trained with two to four times as many parameters and eight times the batch size of prior work.

Key techniques:

Class-Conditional Batch Normalization: uses a shared class embedding to modulate the scale and bias of each Batch Normalization layer
Truncation Trick: at inference time, truncates the distribution of the latent vector $z$ to control the trade-off between quality and diversity

z \sim \mathcal{N}(0, I) \rightarrow z' = \text{truncate}(z, \text{threshold})
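One way to realize the truncation step is rejection resampling: draw each component from a standard normal and redraw it whenever its magnitude exceeds the threshold. This is a minimal standard-library sketch; the function name, the seed, and the threshold values are illustrative assumptions.

```python
import random


def truncated_latent(dim: int, threshold: float, rng: random.Random):
    """Sample z ~ N(0, I), resampling any component with |z_i| > threshold."""
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        z.append(value)
    return z


rng = random.Random(0)  # fixed seed so the sketch is reproducible
z_wide = truncated_latent(dim=128, threshold=2.0, rng=rng)
z_tight = truncated_latent(dim=128, threshold=0.5, rng=rng)

print(max(abs(v) for v in z_wide), max(abs(v) for v in z_tight))
```

A smaller threshold keeps latents closer to the mode of the prior, which tends to raise individual sample quality at the cost of diversity.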

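Orthogonal Regularization: applies an orthogonality regularizer to the Generator's weights, which makes the Generator more amenable to the truncation trick

Results: achieved IS 166.5 and FID 7.4 on ImageNet 128x128, far beyond the previous best (IS 52.52, FID 18.65).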

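6.8 GigaGAN (2023): A GAN Revival?

Kang, Zhu, et al. "Scaling up GANs for Text-to-Image Synthesis" (2023)

At a time when Diffusion Models dominated image generation, GigaGAN, a text-to-image GAN with about 1B parameters, demonstrated the potential of GANs once again.

Key innovations:

Adaptive Kernel Selection: generates different convolution filters for each image, chosen as a convex combination over a filter bank determined by the style vector
Stable Attention: computes attention scores from L2 distance to ensure Lipschitz continuity, and normalizes the attention weight matrix to unit variance
Query-Key Tying: shares the Query and Key matrices for stability
CLIP Text Encoder: extracts text embeddings with a pretrained CLIP model

Results and significance:

Reports lower FID than Stable Diffusion v1.5, DALL-E 2, and Parti-750M
0.13 seconds to generate a 512px image: inference tens to hundreds of times faster than diffusion models
Showed that GANs can be competitive in large-scale text-to-image synthesis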

7. Implementing GANs in PyTorch

7.1 Basic GAN Implementation (MNIST)

Below is PyTorch code for the most basic GAN on the MNIST dataset.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ============================================================
# Hyperparameters
# ============================================================
LATENT_DIM = 100        # dimension of the latent vector z
IMG_DIM = 28 * 28       # dimension of a flattened MNIST image
HIDDEN_DIM = 256        # hidden layer width
BATCH_SIZE = 64
EPOCHS = 200
LR = 0.0002
BETAS = (0.5, 0.999)   # beta parameters of the Adam optimizer
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================================================
# Generator
# ============================================================
class Generator(nn.Module):
    """
    Takes a latent vector z and generates a fake image.
    Structure: z(100) -> 256 -> 512 -> 1024 -> 784(28x28)
    """
    def __init__(self, latent_dim: int, img_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 2, hidden_dim * 4),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 4, img_dim),
            nn.Tanh(),  # squash outputs into [-1, 1], matching the normalized data
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


# ============================================================
# Discriminator
# ============================================================
class Discriminator(nn.Module):
    """
    Takes an image and outputs the probability that it is real.
    Structure: 784(28x28) -> 1024 -> 512 -> 256 -> 1
    """
    def __init__(self, img_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, hidden_dim * 4),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 4, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # map the output to a probability in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# ============================================================
# Data loader
# ============================================================
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # [0,1] -> [-1,1]
])

dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)


# ============================================================
# Models, optimizers, and loss function
# ============================================================
G = Generator(LATENT_DIM, IMG_DIM, HIDDEN_DIM).to(DEVICE)
D = Discriminator(IMG_DIM, HIDDEN_DIM).to(DEVICE)

opt_G = optim.Adam(G.parameters(), lr=LR, betas=BETAS)
opt_D = optim.Adam(D.parameters(), lr=LR, betas=BETAS)

criterion = nn.BCELoss()  # Binary Cross Entropy


# ============================================================
# Training loop
# ============================================================
for epoch in range(EPOCHS):
    d_loss_total, g_loss_total = 0.0, 0.0

    for batch_idx, (real_imgs, _) in enumerate(dataloader):
        real_imgs = real_imgs.view(-1, IMG_DIM).to(DEVICE)
        batch_size = real_imgs.size(0)

        # Real/fake labels
        real_labels = torch.ones(batch_size, 1, device=DEVICE)
        fake_labels = torch.zeros(batch_size, 1, device=DEVICE)

        # -----------------------------------------
        # Step 1: train the Discriminator
        # -----------------------------------------
        # Score the real images
        d_real = D(real_imgs)
        d_loss_real = criterion(d_real, real_labels)

        # Generate fake images and score them
        z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
        fake_imgs = G(z).detach()  # block gradients from flowing into the Generator
        d_fake = D(fake_imgs)
        d_loss_fake = criterion(d_fake, fake_labels)

        # Total Discriminator loss and update
        d_loss = d_loss_real + d_loss_fake
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # -----------------------------------------
        # Step 2: train the Generator
        # -----------------------------------------
        z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
        fake_imgs = G(z)
        d_fake = D(fake_imgs)

        # Non-saturating loss: the Generator tries to raise D(G(z))
        g_loss = criterion(d_fake, real_labels)
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

        d_loss_total += d_loss.item()
        g_loss_total += g_loss.item()

    # Per-epoch log output
    num_batches = len(dataloader)
    print(
        f"Epoch [{epoch+1}/{EPOCHS}] "
        f"D Loss: {d_loss_total/num_batches:.4f} | "
        f"G Loss: {g_loss_total/num_batches:.4f}"
    )

7.2 DCGAN Implementation (Core Parts)

This version replaces the Generator and Discriminator with convolutional DCGAN architectures.

class DCGANGenerator(nn.Module):
    """
    DCGAN Generator: generates images with transposed convolutions.
    z(100) -> 4x4x512 -> 8x8x256 -> 16x16x128 -> 32x32x64 -> 64x64x3
    """
    def __init__(self, latent_dim: int = 100, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: z (latent_dim x 1 x 1) -> (feature_map_size*8 x 4 x 4)
            nn.ConvTranspose2d(latent_dim, feature_map_size * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.ReLU(inplace=True),

            # (feature_map_size*8 x 4 x 4) -> (feature_map_size*4 x 8 x 8)
            nn.ConvTranspose2d(feature_map_size * 8, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.ReLU(inplace=True),

            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*2 x 16 x 16)
            nn.ConvTranspose2d(feature_map_size * 4, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.ReLU(inplace=True),

            # (feature_map_size*2 x 16 x 16) -> (feature_map_size x 32 x 32)
            nn.ConvTranspose2d(feature_map_size * 2, feature_map_size, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size),
            nn.ReLU(inplace=True),

            # (feature_map_size x 32 x 32) -> (channels x 64 x 64)
            nn.ConvTranspose2d(feature_map_size, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class DCGANDiscriminator(nn.Module):
    """
    DCGAN Discriminator: classifies real vs fake with strided convolutions.
    (3 x 64 x 64) -> (64 x 32 x 32) -> (128 x 16 x 16) ->
    (256 x 8 x 8) -> (512 x 4 x 4) -> 1
    """
    def __init__(self, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # (channels x 64 x 64) -> (feature_map_size x 32 x 32)
            nn.Conv2d(channels, feature_map_size, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size x 32 x 32) -> (feature_map_size*2 x 16 x 16)
            nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*2 x 16 x 16) -> (feature_map_size*4 x 8 x 8)
            nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*8 x 4 x 4)
            nn.Conv2d(feature_map_size * 4, feature_map_size * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*8 x 4 x 4) -> (1 x 1 x 1)
            nn.Conv2d(feature_map_size * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, 1)

7.3 WGAN-GP Core Loss Implementation

import torch
import torch.nn as nn
import torch.optim as optim


def compute_gradient_penalty(
    discriminator: nn.Module,
    real_samples: torch.Tensor,
    fake_samples: torch.Tensor,
    device: torch.device,
    lambda_gp: float = 10.0,
) -> torch.Tensor:
    """
    Compute the WGAN-GP gradient penalty.

    Penalizes the Discriminator (Critic) so that the L2 norm of its gradient
    is close to 1 at random interpolates between real and generated samples.
    Assumes 4-D image batches of shape (N, C, H, W).
    """
    batch_size = real_samples.size(0)

    # Random interpolation coefficient, one per sample
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device)

    # Interpolates between real and fake samples
    interpolated = (epsilon * real_samples + (1 - epsilon) * fake_samples).requires_grad_(True)

    # Critic output at the interpolates
    d_interpolated = discriminator(interpolated)

    # Gradient of the critic output with respect to the interpolates
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,  # keep the graph so the penalty can be backpropagated
    )[0]

    # Per-sample L2 norm of the gradient (reshape also handles non-contiguous tensors)
    gradients = gradients.reshape(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)

    # Gradient penalty: lambda times the mean of (||grad|| - 1)^2
    gradient_penalty = lambda_gp * ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty


# WGAN-GP training step (core part)
def train_wgan_gp_step(
    G: nn.Module,
    D: nn.Module,
    opt_G: optim.Optimizer,
    opt_D: optim.Optimizer,
    real_imgs: torch.Tensor,
    latent_dim: int,
    device: torch.device,
    n_critic: int = 5,
):
    """One WGAN-GP training iteration.

    D must be a critic with an unbounded scalar output: no final Sigmoid,
    and no Batch Normalization (the WGAN-GP paper recommends avoiding it
    in the critic).
    """
    batch_size = real_imgs.size(0)

    # --- Train the Critic (Discriminator): n_critic times ---
    # For brevity the same real batch is reused in every critic step;
    # the paper samples a fresh real batch each time.
    for _ in range(n_critic):
        z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
        fake_imgs = G(z).detach()

        # Wasserstein loss: maximize E[D(real)] - E[D(fake)]
        d_real = D(real_imgs).mean()
        d_fake = D(fake_imgs).mean()
        gp = compute_gradient_penalty(D, real_imgs, fake_imgs, device)

        d_loss = d_fake - d_real + gp  # the Critic minimizes this

        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # --- Train the Generator: once ---
    z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
    fake_imgs = G(z)
    g_loss = -D(fake_imgs).mean()  # the Generator maximizes D(G(z))

    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()

8. GAN vs Diffusion Models

In the 2020s, Diffusion Models (DDPM, score-based models) emerged as the new paradigm for image generation. After Dhariwal and Nichol published "Diffusion Models Beat GANs on Image Synthesis" in 2021, diffusion models became the mainstream of generative modeling through systems such as DALL-E 2, Stable Diffusion, and Midjourney. Let us compare GANs and diffusion models systematically.

8.1 Basic Principles

Aspect	GAN	Diffusion Model
Training method	Adversarial training (minimax game)	Denoising score matching
Generation process	Single forward pass	Iterative denoising (tens to hundreds of steps)
Probabilistic model	Implicit	Explicit
Loss function	Adversarial loss (+ auxiliary losses)	Simple MSE/L1 (noise prediction)
Distribution matching	$p_g \approx p_{data}$ via JSD/Wasserstein	$p_\theta(x_0) \approx p_{data}$ via ELBO

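8.2 Strengths and Weaknesses

Strengths of GANs:

Inference speed: generates an image in a single forward pass, suitable for real-time applications
Sample sharpness: adversarial training tends to produce sharp, realistic images
Latent space control: a well-structured latent space allows semantic manipulation
Compactness: can reach high quality with relatively few parameters

Weaknesses of GANs:

Training instability: mode collapse, training oscillation, and so on
Limited diversity: mode collapse can restrict the diversity of outputs
Limited scalability: does not extend as naturally as diffusion models to tasks such as text-conditional generation
Hard to evaluate: training progress is difficult to monitor with a reliable metric

Strengths of Diffusion Models:

Training stability: trains stably with a simple MSE loss
Sample diversity: almost no mode collapse
Text-conditional generation: natural conditional generation through techniques such as classifier-free guidance
Theoretical soundness: trained as an explicit probabilistic model by optimizing a variational bound (ELBO) on the likelihood

Weaknesses of Diffusion Models:

Inference speed: slow, because tens to hundreds of iterative denoising steps are required (being improved by distillation and related methods)
Compute cost: high compute requirements for both training and inference
Memory usage: high-resolution generation requires a U-Net with a large number of parameters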

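8.3 Comparison of Convergence Properties

Property	GAN	Diffusion Model
Convergence guarantee	Nash equilibrium guaranteed only in theory	Stable convergence via ELBO optimization
Mode coverage	Risk of mode collapse	Excellent mode coverage
Training curve	Unstable, hard to interpret	Stable, loss is directly interpretable
Hyperparameter sensitivity	High	Relatively low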

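8.4 The Landscape as of 2025

As of 2025, Diffusion Models are the mainstream of image generation. The most commercially successful image generation models, including Stable Diffusion, DALL-E 3, and Midjourney, are all Diffusion-based.

However, GAN has not been fully replaced. It remains strong in the following areas in particular:

Domains that require real-time generation: Video games, VR/AR, and so on
Image editing and manipulation: Precise StyleGAN-based face editing and attribute manipulation
Super-resolution: Real-time super-resolution processing
GAN-Diffusion hybrids: Combining a GAN loss with the diffusion process, or using GAN's fast inference in the distillation of Diffusion Models

The arrival of GigaGAN (2023) showed that GANs can be competitive even in large-scale text-to-image synthesis, and research combining the strengths of the two paradigms is active.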

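9. The Present and Future of GAN

9.1 The Current Standing of GAN

GAN sat at the center of generative modeling for much of the decade after its publication in 2014, but has ceded the mainstream to Diffusion Models since 2021. Its legacy and its current role nevertheless remain important.

Fields where GAN is actively used today:

Medical imaging: GANs are widely used to augment training data while protecting patient privacy
Data augmentation: Expanding the training data of small datasets to improve model performance
Image and video editing and restoration: Face restoration, denoising, super-resolution, and so on
Fashion and design: Virtual try-on, design prototyping
Games and simulation: Real-time content generation, texture synthesis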

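9.2 The Theoretical Legacy of GAN

GAN's greatest contribution goes beyond image generation technology.

Adversarial training paradigm: The adversarial training that GAN introduced has influenced many fields beyond generative modeling.

Adversarial Examples: Research on the robustness of deep learning models
Domain Adaptation: Knowledge transfer across domains using adversarial training
Self-supervised Learning: Self-supervised learning that uses adversarial signals
Inverse Reinforcement Learning: Learning reward functions adversarially

Implicit generative models: GAN's core insight, that a complex distribution can be learned without defining an explicit probability distribution, also influenced the later development of energy-based models, score-based models, and others.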

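9.3 Future Outlook

GAN-Diffusion fusion: One of the most promising directions is combining the strengths of GANs and Diffusion Models. Research is under way on replacing the denoising steps of the diffusion process with a GAN to speed up inference.

3D generation: Research on generating 3D content by combining GANs with 3D representations (Neural Radiance Fields, 3D Gaussian Splatting, and so on) is active. EG3D and GET3D are representative examples.

Video generation: The equivariance properties of StyleGAN3 apply naturally to video generation, and research on generating temporally consistent video is in progress.

Efficient training: Research continues on training high-quality generative models from little data through few-shot GANs, transfer learning for GANs, and related techniques.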

9.4 GAN Lineage Timeline Summary

Year	Model	Key Contribution	Resolution
2014	GAN	Adversarial training framework	Low
2014	cGAN	Conditional generation	Low
2015	DCGAN	CNN-based architecture guidelines	64x64
2017	WGAN	Wasserstein distance	64x64
2017	WGAN-GP	Gradient penalty	64x64
2017	Pix2Pix	Paired image-to-image translation	256x256
2017	CycleGAN	Unpaired image-to-image translation	256x256
2017	ProGAN	Progressive growing	1024x1024
2018	BigGAN	Large-scale training, truncation trick	512x512
2019	StyleGAN	Mapping network, AdaIN, style separation	1024x1024
2020	StyleGAN2	Weight demodulation, path regularization	1024x1024
2021	StyleGAN3	Alias-free, equivariant generation	1024x1024
2023	GigaGAN	1B-param text-to-image GAN	512x512+

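10. Conclusion

The GAN proposed by Ian Goodfellow in 2014 revolutionized the AI field with a simple yet powerful idea: "competition between two networks produces a better generative model." The mathematical framework of the minimax game was both elegant and practical, and it spawned hundreds of variants over the following decade, dramatically improving the quality of image generation.

DCGAN laid the groundwork for practical progress by combining GANs with CNNs, and WGAN addressed the training stability problem with the theoretical innovation of the Wasserstein distance. Progressive GAN and the StyleGAN series made photorealistic image generation at 1024x1024 resolution possible, while CycleGAN and Pix2Pix pioneered image translation as a new application area.

Although Diffusion Models have risen to become the mainstream of generative modeling since 2021, the legacy of GAN is immense. The adversarial training paradigm is still used across many fields, and hybrid research combining the strengths of GANs and Diffusion is active. As the arrival of GigaGAN shows, the story of GAN is not over yet.

In the history of generative models, GAN will be remembered as the milestone that first demonstrated the possibility that "artificial intelligence can truly create."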

References

Goodfellow, I. J. et al. (2014). "Generative Adversarial Nets." NeurIPS 2014. arXiv:1406.2661
Mirza, M. & Osindero, S. (2014). "Conditional Generative Adversarial Nets." arXiv:1411.1784
Radford, A., Metz, L. & Chintala, S. (2015). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." arXiv:1511.06434
Arjovsky, M., Chintala, S. & Bottou, L. (2017). "Wasserstein GAN." arXiv:1701.07875
Gulrajani, I. et al. (2017). "Improved Training of Wasserstein GANs." arXiv:1704.00028
Isola, P. et al. (2017). "Image-to-Image Translation with Conditional Adversarial Networks." CVPR 2017. arXiv:1611.07004
Zhu, J.-Y. et al. (2017). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." ICCV 2017. arXiv:1703.10593
Karras, T. et al. (2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation." ICLR 2018. arXiv:1710.10196
Brock, A., Donahue, J. & Simonyan, K. (2018). "Large Scale GAN Training for High Fidelity Natural Image Synthesis." ICLR 2019. arXiv:1809.11096
Karras, T., Laine, S. & Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks." CVPR 2019. arXiv:1812.04948
Karras, T. et al. (2020). "Analyzing and Improving the Image Quality of StyleGAN." CVPR 2020. arXiv:1912.04958
Karras, T. et al. (2021). "Alias-Free Generative Adversarial Networks." NeurIPS 2021. arXiv:2106.12423
Kang, M. et al. (2023). "Scaling up GANs for Text-to-Image Synthesis." CVPR 2023. arXiv:2303.05511
Dhariwal, P. & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. arXiv:2105.05233

GAN Paper Deep Dive: How Generative Adversarial Networks Ushered in the Era of AI-Generated Content

1. Paper Overview and Historical Significance
2. The Core Idea of GAN
3. Mathematical Foundations
4. Training Algorithm
5. Core Problems of GAN
6. The Complete GAN Lineage
7. Implementing GAN in PyTorch
8. GAN vs Diffusion Models Comparison
9. The Present and Future of GAN
10. Conclusion
References

1. Paper Overview and Historical Significance

1.1 Paper Information

"Generative Adversarial Nets" was published at NeurIPS 2014 (then known as NIPS), co-authored by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. According to the now-legendary anecdote, Goodfellow conceived the idea while discussing generative models with colleagues at a bar in Montreal. He went home that night, coded it up, and the first prototype worked right away.

The core idea of this paper is remarkably intuitive: A counterfeiter (Generator) and a police officer (Discriminator) compete against each other. The counterfeiter produces increasingly sophisticated forgeries, while the police officer develops ever sharper detection skills. When this adversarial process converges, the counterfeiter produces bills indistinguishable from genuine ones.

1.2 Historical Context: The Generative Model Landscape of 2014

Before GAN appeared, the dominant approaches in generative modeling were as follows.

Variational Autoencoder (VAE, 2013): Proposed by Kingma and Welling, VAE introduced probabilistic latent variables into an Encoder-Decoder architecture to learn data distributions. However, optimizing the ELBO (Evidence Lower Bound) resulted in blurry generated images.

Boltzmann Machine Family: Deep Boltzmann Machines, Restricted Boltzmann Machines, and similar energy-based models were theoretically elegant but relied on MCMC (Markov Chain Monte Carlo) sampling, making training slow and scalability limited.

Autoregressive Models: Autoregressive models generate data one element at a time. Image models in this family, such as PixelRNN (which appeared in 2016, after GAN), generate pixels sequentially. They can produce high-quality samples, but generation is extremely slow.

GAN sidestepped these limitations in one stroke. It could generate sharp samples without defining an explicit probability distribution, and it produced each sample in a single forward pass, with no Markov chain or sequential generation process. This represented a paradigm shift in the field of generative models.

1.3 Impact

The GAN paper has been cited approximately 65,000 times as of 2024, and hundreds of GAN variants have been proposed over the following decade. Yann LeCun praised GAN as "the most interesting idea in the last 10 years in ML." GAN has been applied to countless domains including image generation, super-resolution, style transfer, data augmentation, and drug discovery. It reigned as the dominant paradigm in generative modeling until the emergence of Diffusion Models.

2. The Core Idea of GAN

2.1 Two-Player Game: Generator vs Discriminator

The GAN framework consists of two neural networks competing against each other.

Generator (G): Takes a random noise vector $z$ as input and generates fake data $G(z)$ . The Generator's goal is to produce samples similar enough to real data to fool the Discriminator.

G: z \sim p_z(z) \rightarrow G(z) \in \mathbb{R}^d

Discriminator (D): Determines whether the input data comes from the real data distribution ( $x \sim p_{data}$ ) or is fake, produced by the Generator ( $G(z)$ ). The output is a probability value between 0 and 1, where closer to 1 means the input is judged as real.

D: x \rightarrow [0, 1]

These two networks have opposing objectives:

Generator: Tries to maximize $D(G(z))$ (making the Discriminator classify fakes as real)
Discriminator: Tries to assign high probability to real data and low probability to fake data

2.2 Intuitive Analogy

The GAN training process can be understood through an art market analogy.

Component	Analogy	Role
Generator	Art forger	Goal is to create forgeries indistinguishable from originals
Discriminator	Art appraiser	Goal is to distinguish originals from forgeries
Training Data	Authentic artworks	Samples from the real data distribution
Noise Vector $z$	Artist's inspiration	A random point in the latent space

Initially, the forger's skills are poor, so the appraiser easily identifies forgeries. But the forger improves through the appraiser's feedback (gradients), and the appraiser also enhances detection capabilities to counter increasingly sophisticated forgeries. When this competition progresses sufficiently, the forger produces works indistinguishable from originals.

2.3 Minimax Game Formulation

The GAN training objective is formalized as the following minimax game:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Let us analyze each term of this value function $V(D, G)$ .

First term: $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$

This is the Discriminator's judgment on real data $x$ . The Discriminator tries to maximize this value, aiming for $D(x) \rightarrow 1$ (judging real as real). The Generator has no influence on this term.

Second term: $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

This is the Discriminator's judgment on fake data produced by the Generator.

The Discriminator tries to maximize this: $D(G(z)) \rightarrow 0$ (judging fake as fake) gives $\log(1 - 0) = 0$ , the maximum
The Generator tries to minimize this: $D(G(z)) \rightarrow 1$ (judging fake as real) gives $\log(1 - 1) = -\infty$ , the minimum

This is precisely where the name Adversarial comes from. Two players optimize the same value function in opposite directions.
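As a rough illustration of how the two terms behave, the value function can be estimated from a handful of discriminator outputs. This is a minimal sketch in plain Python. The function name `value_function` and the sample probabilities are made up for illustration and do not come from the paper.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: values of D(x) on real samples.
    d_fake: values of D(G(z)) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes V toward its maximum of 0.
v_strong_d = value_function([0.99, 0.98], [0.01, 0.02])
# A fooled discriminator (D(G(z)) near 1) drives V strongly negative.
v_fooled_d = value_function([0.99, 0.98], [0.99, 0.98])
# A discriminator that always outputs 1/2 gives exactly -log 4.
v_chance = value_function([0.5, 0.5], [0.5, 0.5])
```

The Discriminator prefers the first situation and the Generator the second, which is the opposition the text describes.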

3. Mathematical Foundations

3.1 Optimal Discriminator

Let us derive the optimal Discriminator $D^*_G$ for a fixed Generator $G$ . Converting the value function to integral form using the definition of expectation:

V(D, G) = \int_x p_{data}(x) \log D(x) \, dx + \int_x p_g(x) \log(1 - D(x)) \, dx

where $p_g$ is the distribution of data generated by the Generator. Combining into a single integral:

V(D, G) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx

Differentiating the integrand with respect to $D(x)$ and setting it to zero:

\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0

Solving for $D(x)$ , the optimal discriminator is:

D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}

This result is intuitively sound. If a sample is equally likely a priori to be real or generated, with densities $p_{data}(x)$ and $p_g(x)$ respectively, then $D^*_G(x)$ is exactly the posterior probability that $x$ is real under Bayes' rule.

Key observation: When $p_g = p_{data}$ , i.e., when the Generator has perfectly learned the real data distribution, $D^*_G(x) = \frac{1}{2}$ for all $x$ . The Discriminator can no longer distinguish real from fake at all.
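The closed form can be checked numerically on a small discrete example. The sketch below assumes toy three-outcome distributions (hypothetical values) and confirms by brute force that no other value of $D(x)$ on a fine grid beats $D^*_G(x)$ pointwise.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*_G(x) = p_data(x) / (p_data(x) + p_g(x)) on a discrete support."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def integrand(d, pd, pg):
    """Pointwise term p_data(x) log D(x) + p_g(x) log(1 - D(x))."""
    return pd * math.log(d) + pg * math.log(1.0 - d)

# Hypothetical toy distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]
d_star = optimal_discriminator(p_data, p_g)

# Brute-force search: the best grid value at each x should sit next to D*.
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = [
    max(grid, key=lambda d: integrand(d, pd, pg))
    for pd, pg in zip(p_data, p_g)
]
```

Calling `optimal_discriminator(p_data, p_data)` returns 1/2 at every outcome, matching the key observation above.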

3.2 Relationship with Jensen-Shannon Divergence

Substituting the optimal discriminator $D^*_G$ into the value function:

V(D^*_G, G) = \mathbb{E}_{x \sim p_{data}} \left[ \log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g} \left[ \log \frac{p_g(x)}{p_{data}(x) + p_g(x)} \right]

Simplifying:

V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)

where $JSD$ is the Jensen-Shannon Divergence, defined as:

JSD(p \| q) = \frac{1}{2} KL\left(p \left\| \frac{p+q}{2}\right.\right) + \frac{1}{2} KL\left(q \left\| \frac{p+q}{2}\right.\right)

JSD is a symmetrized version of KL Divergence and is always bounded: $0 \leq JSD(p \| q) \leq \log 2$ . $JSD = 0$ occurs if and only if $p = q$ , i.e., when the two distributions are completely identical.
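The identity $V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)$ can be verified numerically. The following sketch uses hypothetical discrete distributions and evaluates both sides independently. The helper names are illustrative only.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(D*_G, G) with D*_G(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

# Hypothetical toy distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]
lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
```

Here `lhs` and `rhs` agree up to floating-point error, and `jsd(p_data, p_data)` is zero.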

3.3 Proof of Global Optimality

Theorem (Goodfellow et al., 2014): The global minimum of $C(G) = \max_D V(D, G)$ is achieved if and only if $p_g = p_{data}$ , at which point $C(G) = -\log 4$ .

Proof:

(1) $C(G) = V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)$

(2) $JSD(p_{data} \| p_g) \geq 0$ (non-negativity of JSD)

(3) $JSD(p_{data} \| p_g) = 0 \iff p_{data} = p_g$

(4) Therefore $C(G) \geq -\log 4$ , with equality if and only if $p_g = p_{data}$

This provides the theoretical guarantee for GAN training. Given a Generator and Discriminator with sufficient capacity, at the Nash equilibrium of the minimax game, the Generator perfectly recovers the real data distribution.

3.4 Nash Equilibrium

From a game-theoretic perspective, GAN training is the problem of finding a Nash equilibrium between two players. A Nash equilibrium is a state where neither player can benefit by unilaterally changing their strategy while the other player's strategy remains fixed.

The Nash equilibrium in GAN is:

$G^*$ : A Generator that achieves $p_g = p_{data}$
$D^*$ : A Discriminator that outputs $D(x) = \frac{1}{2}$ for all $x$

In the space of distributions, this equilibrium point exists and is unique, but finding it in practice is very difficult. In the space of network parameters the game is non-convex, and two networks must be optimized simultaneously. This is the fundamental difficulty of GAN training and became the starting point for numerous subsequent studies.

3.5 KL Divergence vs JS Divergence

Why JSD specifically? Let us compare with KL Divergence.

Problems with KL Divergence:

KL(p_{data} \| p_g) = \int p_{data}(x) \log \frac{p_{data}(x)}{p_g(x)} dx

KL Divergence is asymmetric and diverges to infinity in regions where $p_g(x) = 0$ but $p_{data}(x) > 0$ . This becomes problematic when the Generator's distribution fails to sufficiently cover the real distribution early in training.

Advantages of JS Divergence:

Symmetric: $JSD(p \| q) = JSD(q \| p)$
Always finite: $0 \leq JSD \leq \log 2$
Computes KL with respect to the mixture distribution $\frac{p+q}{2}$ , so it does not diverge even when one distribution is zero

However, JSD is not perfect either. When the supports of the two distributions do not overlap, JSD becomes the constant $\log 2$ , making the gradient zero. This is the root cause of the vanishing gradient problem in GAN training, and the key motivation for WGAN's introduction of Wasserstein distance.
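Both behaviors show up already on a four-outcome toy example. The sketch below assumes hypothetical distributions with disjoint supports: KL is infinite, while JSD is finite but stuck at $\log 2$ no matter how the generated mass is arranged.

```python
import math

def kl(p, q):
    """KL(p || q); returns infinity if p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical distributions: real mass on outcomes 0-1, generated on 2-3.
p_data = [0.5, 0.5, 0.0, 0.0]
p_g_a = [0.0, 0.0, 0.5, 0.5]
p_g_b = [0.0, 0.0, 0.9, 0.1]  # a different generator, still disjoint

kl_disjoint = kl(p_data, p_g_a)   # infinite
jsd_a = jsd(p_data, p_g_a)        # log 2
jsd_b = jsd(p_data, p_g_b)        # also log 2: no signal to tell a from b
```

Because `jsd_a` and `jsd_b` are equal, a JSD-based objective cannot rank the two generators, which is the zero-gradient situation described above.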

4. Training Algorithm

4.1 Training Procedure

The training algorithm proposed in the original paper is as follows:

Algorithm 1: GAN Training (Goodfellow et al., 2014)

for number of training iterations do
    # --- Step 1: Discriminator update (k steps) ---
    for k steps do
        - Sample m noise samples {z^(1), ..., z^(m)} from noise prior p_z(z)
        - Sample m real samples {x^(1), ..., x^(m)} from data distribution p_data(x)
        - Update Discriminator parameters by ascending the stochastic gradient:

          nabla_{theta_d} (1/m) sum_{i=1}^{m} [log D(x^(i)) + log(1 - D(G(z^(i))))]

    end for

    # --- Step 2: Generator update (1 step) ---
    - Sample m noise samples {z^(1), ..., z^(m)} from noise prior p_z(z)
    - Update Generator parameters by descending the stochastic gradient:

          nabla_{theta_g} (1/m) sum_{i=1}^{m} log(1 - D(G(z^(i))))

end for
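To make the alternating structure of Algorithm 1 concrete, here is a dependency-free sketch on a 1-D toy problem with hand-derived gradients. The models are assumptions made purely for illustration and are not from the paper: the Generator is a shift $G(z) = z + \theta$, the Discriminator is $D(x) = \sigma(wx + c)$, and the real data are drawn from $N(4, 1)$.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_toy_gan(iterations, k=1, m=64, lr=0.05, real_mean=4.0, seed=0):
    """Algorithm 1 on a 1-D toy problem (assumed models, see lead-in)."""
    rng = random.Random(seed)
    theta, w, c = 0.0, 0.0, 0.0
    for _ in range(iterations):
        # Step 1: k Discriminator updates by gradient ascent on V.
        for _ in range(k):
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(m)]
            real = [rng.gauss(real_mean, 1.0) for _ in range(m)]
            grad_w = grad_c = 0.0
            for x in real:  # d/da log(sigmoid(a)) = 1 - sigmoid(a)
                s = sigmoid(w * x + c)
                grad_w += (1.0 - s) * x / m
                grad_c += (1.0 - s) / m
            for x in fake:  # d/da log(1 - sigmoid(a)) = -sigmoid(a)
                s = sigmoid(w * x + c)
                grad_w += -s * x / m
                grad_c += -s / m
            w += lr * grad_w
            c += lr * grad_c
        # Step 2: one Generator update by gradient descent
        # on the minibatch mean of log(1 - D(G(z))).
        fake = [rng.gauss(0.0, 1.0) + theta for _ in range(m)]
        grad_theta = sum(-sigmoid(w * x + c) * w for x in fake) / m
        theta -= lr * grad_theta
    return theta, w, c

theta, w, c = train_toy_gan(iterations=1)
```

After the first iteration the Discriminator weight `w` is positive (real samples lie to the right of the fakes), so the Generator's shift `theta` moves toward the real data. Whether longer runs settle or oscillate depends on the learning rate and on `k`.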

4.2 Alternating Optimization

The key is alternating optimization. The Discriminator and Generator are updated in turn.

Why update the Discriminator k times before updating the Generator once:

Theoretically, the optimal discriminator $D^*_G$ must be found before updating the Generator to obtain the correct gradient direction. Since fully optimizing $D$ is infeasible in practice, it is approximated with $k$ gradient steps. The original paper used $k = 1$ as the default.

Importance of maintaining balance:

If the Discriminator becomes too strong: The Generator's gradients vanish and training stalls
If the Discriminator becomes too weak: It fails to provide useful learning signals to the Generator
Ideally, the Discriminator and Generator should advance at comparable levels

4.3 Non-Saturating Loss (Practical Modification)

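In the theoretical minimax objective, the Generator's goal is to minimize $\log(1 - D(G(z)))$ . However, early in training the Generator is very poor, so the Discriminator rejects its samples with high confidence and $D(G(z)) \approx 0$ . In this regime $\log(1 - D(G(z)))$ saturates: it sits on the flat part of the curve near $\log 1 = 0$ , and the gradient reaching the Generator is close to zero.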

Goodfellow addressed this by modifying the Generator's objective:

Original (Minimax):

\min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

Modified (Non-Saturating):

\max_G \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]

Both objectives share the same fixed point (Nash equilibrium), but the gradient magnitudes differ significantly early in training. The non-saturating loss provides strong gradients even when $D(G(z))$ is small, enabling the Generator to learn quickly.

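To see this, assume the Discriminator ends in a sigmoid, $D = \sigma(a)$ , where $a$ is its pre-sigmoid output (logit) on $G(z)$ . The gradients with respect to $a$ are:

\text{Minimax gradient}: \frac{\partial}{\partial a} \log(1 - \sigma(a)) = -\sigma(a) = -D(G(z)) \approx 0 \text{ when } D(G(z)) \approx 0

\text{Non-Saturating gradient}: \frac{\partial}{\partial a} \log \sigma(a) = 1 - \sigma(a) = 1 - D(G(z)) \approx 1 \text{ when } D(G(z)) \approx 0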
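The difference between the two Generator objectives is easy to see numerically. The sketch below assumes a sigmoid output layer, writes $D = \sigma(a)$ for a logit $a$, and compares the derivative of each objective with respect to $a$ in the regime where the Discriminator confidently rejects a fake. The function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def minimax_grad(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a)."""
    return 1.0 - sigmoid(a)

def numeric_grad(f, a, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)

a = -6.0  # D(G(z)) = sigmoid(-6), roughly 0.0025: a confidently rejected fake
g_minimax = minimax_grad(a)                # close to 0: saturated
g_non_saturating = non_saturating_grad(a)  # close to 1: still informative
```

At this logit the minimax gradient has almost vanished, while the non-saturating gradient keeps nearly its full magnitude.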

4.4 Experimental Results in the Original Paper

The original paper conducted experiments on MNIST, Toronto Face Database (TFD), and CIFAR-10 datasets. Evaluation used Parzen window-based log-likelihood estimation, and GAN showed competitive performance compared to baselines such as Deep Belief Networks, stacked contractive autoencoders, and deep generative stochastic networks.

However, by today's standards, the results were quite rudimentary. Both the Generator and Discriminator used simple MLPs (Multi-Layer Perceptrons), and the resolution and quality of generated images were limited. The true breakthroughs came through subsequent architectural improvements and training technique advancements.

5. Core Problems of GAN

5.1 Mode Collapse

The most notorious problem of GAN is Mode Collapse. This occurs when the Generator fails to learn all the modes (diverse patterns) of the data distribution and instead focuses on a small subset, repeatedly generating similar outputs.

Mechanism:

When the Generator discovers a few patterns that are particularly effective at fooling the Discriminator, it repeatedly generates those patterns instead of exploring diverse alternatives. For example, when training on MNIST, the Generator might perfectly generate only the digit '1' while failing to generate any other digits.

Mathematical interpretation:

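Mode collapse is related to the fact that the order of min and max matters. Swapping them turns the minimax game into a maximin game, which in general has a different solution: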

\max_D \min_G V(D, G) \neq \min_G \max_D V(D, G)

In the theoretical minimax, the Generator must defend against all possible Discriminators, requiring it to cover the entire distribution. However, in actual training, the Generator only needs to fool the current Discriminator, making it a "rational" strategy to focus on specific modes.

5.2 Training Instability

GAN training is inherently the problem of finding a Nash equilibrium in a non-cooperative game. This is far more difficult than a simple optimization problem.

Oscillation problem: The Generator and Discriminator frequently oscillate around each other without converging. In a typical loss landscape, gradient descent finds local minima, but gradient descent in a minimax game can circle around saddle points.
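The circling behavior can be reproduced on the simplest possible minimax problem, $f(x, y) = xy$, where one player minimizes over $x$ and the other maximizes over $y$. This toy game is a standard illustration chosen here as an assumption; it is not an experiment from the paper. With simultaneous gradient steps the iterates rotate around the saddle at the origin, and the distance from it grows by a factor of $\sqrt{1 + \eta^2}$ per step.

```python
import math

def simultaneous_gda(x, y, lr, steps):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y.

    Returns the distance from the saddle point (0, 0) after every step.
    """
    radii = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(math.hypot(x, y))
    return radii

radii = simultaneous_gda(x=1.0, y=1.0, lr=0.1, steps=100)
```

The sequence `radii` increases at every step, so the players spiral away from the equilibrium instead of converging to it.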

Difficulty of training balance: If the Discriminator converges too quickly, the Generator cannot learn; conversely, if the Discriminator is too weak, it fails to convey meaningful learning signals to the Generator. Maintaining this delicate balance was the greatest practical challenge in GAN training.

5.3 Vanishing Gradients

As explained in Section 3.5, JS Divergence becomes the constant $\log 2$ when the supports of the two distributions do not overlap, resulting in zero gradients.

In high-dimensional data (e.g., images), both the real data distribution and the Generator's distribution exist on low-dimensional manifolds within the high-dimensional space. The probability of these two manifolds overlapping is very low, so it is typical for the supports of the two distributions to barely overlap early in training. In this situation, JSD-based GAN provides no useful gradients at all.

\text{When } \text{supp}(p_{data}) \cap \text{supp}(p_g) = \emptyset: \quad JSD(p_{data} \| p_g) = \log 2 \quad (\text{constant})

5.4 Evaluation Challenges

Objectively evaluating GAN performance is itself a very challenging problem. The main evaluation metrics are:

Inception Score (IS): Measures the quality (sharpness) and diversity of generated images. Uses a pre-trained Inception network -- high scores indicate that individual images have confident class predictions (quality) while the overall distribution covers diverse classes (diversity).

Fréchet Inception Distance (FID): Measures the Fréchet distance between Gaussians fitted to the Inception feature distributions of real and generated data. Lower is better. Widely used as a more reliable metric than IS.

FID = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features for real and generated images, respectively.
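In one dimension the trace term collapses to $(\sigma_r - \sigma_g)^2$, which makes the formula easy to try out without a matrix square root. The sketch below covers only this 1-D special case, on made-up feature values. A real FID computation uses high-dimensional Inception features and a matrix square root.

```python
import statistics

def fid_1d(real_features, generated_features):
    """Fréchet distance between 1-D Gaussians fitted to two samples.

    With scalar variances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) reduces to
    (sigma_r - sigma_g)^2.
    """
    mu_r = statistics.fmean(real_features)
    mu_g = statistics.fmean(generated_features)
    sigma_r = statistics.pstdev(real_features)
    sigma_g = statistics.pstdev(generated_features)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2

real = [0.0, 1.0, 2.0, 3.0]            # hypothetical scalar "features"
same = fid_1d(real, real)              # identical statistics
shifted = fid_1d(real, [x + 2.0 for x in real])   # mean differs by 2
collapsed = fid_1d(real, [1.5, 1.5, 1.5, 1.5])    # same mean, zero variance
```

The `collapsed` case shows why FID penalizes mode collapse: the means match, but the missing variance still produces a positive distance.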

6. The Complete GAN Lineage

6.1 DCGAN (2015): The Beginning of Stable CNN-based Training

Radford, Metz, Chintala. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" (2015)

The original GAN used only MLPs (Fully Connected Layers), failing to leverage CNN's powerful spatial feature extraction capabilities for image generation. DCGAN (Deep Convolutional GAN) was the first architecture to successfully integrate CNNs into GANs, establishing several architectural guidelines for stable training.

DCGAN's key architectural rules:

Remove pooling layers: Use strided convolutions (Discriminator) and fractional-strided / transposed convolutions (Generator) instead of max pooling
Apply Batch Normalization: Apply to both Generator and Discriminator, except for the Generator's output layer and the Discriminator's input layer
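Remove fully connected hidden layers: Connect the highest convolutional features directly to the Generator's input and the Discriminator's output instead of stacking fully connected layers on top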
Activation functions: Generator uses Tanh for the output layer and ReLU elsewhere. Discriminator uses LeakyReLU for all layers

DCGAN Generator Architecture (Conceptual):

z (100-dim) -> FC -> Reshape (4x4x1024) -> ConvT -> BN -> ReLU (8x8x512)
-> ConvT -> BN -> ReLU (16x16x256) -> ConvT -> BN -> ReLU (32x32x128)
-> ConvT -> Tanh (64x64x3)
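The spatial sizes in the diagram (4, 8, 16, 32, 64) follow from the transposed-convolution output formula. The sketch below assumes kernel size 4, stride 2, and padding 1, a configuration commonly seen in PyTorch DCGAN implementations. The diagram above does not state these values, so treat them as an assumption.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

# Start from the 4x4 feature map and apply four upsampling layers.
sizes = [4]
for _ in range(4):
    sizes.append(conv_transpose_out(sizes[-1]))
```

With these settings each layer exactly doubles the resolution, which is why four transposed convolutions take the 4x4 map to 64x64.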

Beyond simply generating good images, DCGAN demonstrated that the learned latent space possesses meaningful structure. The famous demonstration showed that vector arithmetic in latent space corresponds to semantic transformations:

\text{vec}(\text{"man with glasses"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) = \text{vec}(\text{"woman with glasses"})

6.2 WGAN (2017): Introduction of Wasserstein Distance

Arjovsky, Chintala, Bottou. "Wasserstein GAN" (2017)

WGAN is one of the most important theoretical advances in GAN, introducing Wasserstein distance (Earth Mover's distance) to address the fundamental limitations of JS Divergence.

Wasserstein Distance (EM Distance):

W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma} [\|x - y\|]

where $\Pi(p_{data}, p_g)$ is the set of all joint distributions with marginals $p_{data}$ and $p_g$ . Intuitively, it is the minimum cost of "moving dirt" to transform one distribution into another.

Key advantages of Wasserstein Distance:

Unlike JSD, it varies continuously and yields a useful gradient even when the supports of the two distributions do not overlap. For example, considering two point distributions $\delta_0$ and $\delta_\theta$ ( $\theta > 0$ ):

JSD(\delta_0 \| \delta_\theta) = \log 2 \quad \text{(constant, gradient = 0)}

W(\delta_0, \delta_\theta) = |\theta| \quad \text{(continuous, gradient} = \text{sign}(\theta)\text{)}
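The point-mass example can be reproduced directly. In this sketch the helpers are illustrative: `jsd` works on discrete distributions stored as dictionaries, and `wasserstein_1d` uses the fact that in one dimension the optimal transport plan for equal-size samples matches sorted values.

```python
import math

def jsd(p, q):
    """JSD between discrete distributions given as {point: mass} dicts."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = (px + qx) / 2
        if px > 0:
            total += 0.5 * px * math.log(px / mx)
        if qx > 0:
            total += 0.5 * qx * math.log(qx / mx)
    return total

def wasserstein_1d(xs, ys):
    """W1 between equal-size 1-D samples: mean gap after sorting both."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Move the generated point mass toward the real one at 0.
thetas = [2.0, 1.0, 0.5, 0.1]
jsd_values = [jsd({0.0: 1.0}, {theta: 1.0}) for theta in thetas]
w_values = [wasserstein_1d([0.0], [theta]) for theta in thetas]
```

As `theta` shrinks, `w_values` shrinks with it while `jsd_values` stays at $\log 2$, so only the Wasserstein distance tells the Generator that it is getting closer.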

Kantorovich-Rubinstein Duality:

Since directly computing the Wasserstein distance is intractable, the Kantorovich-Rubinstein duality is leveraged:

W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]

where the supremum is taken over all 1-Lipschitz functions. WGAN trains the Discriminator (now called the Critic) to approximate this 1-Lipschitz function.

Weight Clipping: The original WGAN enforced the Lipschitz constraint by clipping critic weights to the range $[-c, c]$ . However, this severely limited the critic's representational power and could cause training instability.

6.3 WGAN-GP (2017): Gradient Penalty

Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville. "Improved Training of Wasserstein GANs" (2017)

To address weight clipping's problems, Gradient Penalty (GP) was proposed. Instead of directly enforcing the Lipschitz constraint, it regularizes the critic's gradient norm to stay close to 1.

L_{WGAN-GP} = \underbrace{\mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_{data}}[D(x)]}_{\text{Original Critic Loss}} + \underbrace{\lambda \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \left[ (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 \right]}_{\text{Gradient Penalty}}

where $\hat{x}$ is a random interpolation between real and generated data:

\hat{x} = \epsilon x + (1 - \epsilon) G(z), \quad \epsilon \sim \text{Uniform}[0, 1]

WGAN-GP uses $\lambda = 10$ and $n_{critic} = 5$ critic updates as defaults, training stably across diverse architectures with minimal hyperparameter tuning.
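The penalty term can be sketched without an autograd library by using a 1-D critic whose derivative is known in closed form. The `critic_grad` callback below is a hypothetical stand-in for $\nabla_{\hat{x}} D(\hat{x})$, which a real implementation would obtain by automatic differentiation.

```python
import random

def gradient_penalty(critic_grad, real, fake, lambda_gp=10.0, rng=random):
    """lambda * mean((|dD/dx at x_hat| - 1)^2) over random interpolates x_hat."""
    total = 0.0
    for x_real, x_fake in zip(real, fake):
        eps = rng.random()                           # epsilon ~ Uniform[0, 1]
        x_hat = eps * x_real + (1.0 - eps) * x_fake  # point between real and fake
        total += (abs(critic_grad(x_hat)) - 1.0) ** 2
    return lambda_gp * total / len(real)

real = [1.0, 2.0, 3.0]   # hypothetical 1-D "real" samples
fake = [-1.0, 0.0, 0.5]  # hypothetical 1-D generated samples

# Critic D(x) = x has gradient 1 everywhere: it is 1-Lipschitz, no penalty.
gp_unit = gradient_penalty(lambda x: 1.0, real, fake)
# Critic D(x) = 3x has gradient 3 everywhere: penalized by 10 * (3 - 1)^2.
gp_steep = gradient_penalty(lambda x: 3.0, real, fake)
```

The penalty is two-sided: a critic with gradient norm below 1 is penalized as well, since the term pulls the norm toward exactly 1.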

6.4 Progressive GAN (2017): Gradual Resolution Increase

Karras, Aila, Laine, Lehtinen. "Progressive Growing of GANs for Improved Quality, Stability, and Variation" (2017)

Progressive GAN (ProGAN), proposed by the NVIDIA research team, opened new horizons in high-resolution image generation. The core idea is to start training the Generator and Discriminator at low resolution and progressively add layers to increase resolution.

Training process:

Phase 1: Train G and D at 4x4 resolution
Phase 2: Add 8x8 layers with gradual fade-in transition
Phase 3: Add 16x16 layers
...
Phase N: Reach final 1024x1024 resolution

Fade-in mechanism: When adding new layers, the outputs of existing and new layers are combined via weighted averaging. The weight $\alpha$ gradually increases from 0 to 1, progressively activating the new layer.

\text{output} = (1 - \alpha) \cdot \text{upsampled\_old} + \alpha \cdot \text{new\_layer\_output}
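The blend is a plain convex combination, as this small sketch on flat lists of activations shows. The linear `alpha_schedule` is an assumption added for illustration, since the text only says that $\alpha$ increases gradually from 0 to 1.

```python
def fade_in(upsampled_old, new_layer_output, alpha):
    """Elementwise (1 - alpha) * old + alpha * new."""
    return [
        (1.0 - alpha) * old + alpha * new
        for old, new in zip(upsampled_old, new_layer_output)
    ]

def alpha_schedule(step, fade_steps):
    """Hypothetical linear ramp of alpha from 0 to 1 over fade_steps updates."""
    return min(1.0, step / fade_steps)

old = [2.0, 2.0]   # upsampled output of the existing low-resolution path
new = [10.0, 6.0]  # output of the newly added layer
blended = [fade_in(old, new, alpha_schedule(s, fade_steps=4)) for s in range(6)]
```

At step 0 the output equals the old path, and from step 4 onward it equals the new layer, so the network never sees an abrupt change.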

Key contributions:

Dramatically improved training stability: Learning coarse structure at low resolution first, then gradually adding fine details makes training much more stable
Achieved 1024x1024 resolution: First successful generation of photorealistic face images at 1024x1024 resolution on the CelebA-HQ dataset
Minibatch standard deviation: Introduced a technique using within-minibatch statistics to increase diversity

6.5 StyleGAN Series (2019-2021): The Pinnacle of Style-based Generation

StyleGAN (2019)

Karras, Laine, Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks" (2019)

StyleGAN is a revolutionary architecture that combines Progressive GAN's progressive training with the style separation concepts from Neural Style Transfer.

Key structural changes:

Mapping Network: Transforms the input latent vector $z \in \mathcal{Z}$ through a nonlinear mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$ to an intermediate latent space $\mathcal{W}$ . Consists of 8 FC layers.
Adaptive Instance Normalization (AdaIN): Injects style vectors $w$ from $\mathcal{W}$ space into each convolution layer.

\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}

where $y_s$ and $y_b$ are scale and bias obtained via learned affine transformation from the style vector $w$ .

Constant Input: Uses a learnable constant tensor (4x4x512) as the Generator's input. Style is injected solely through AdaIN.
Noise Injection: Adds per-pixel noise after each convolution layer to control stochastic variation (e.g., hair position, pores, etc.).
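The AdaIN formula listed above can be tried on a single feature map flattened to a list. In this sketch, `y_scale` and `y_bias` stand for $y_{s,i}$ and $y_{b,i}$. The small `eps` for numerical safety is an added assumption and is not part of the formula.

```python
import statistics

def adain(feature_map, y_scale, y_bias, eps=1e-8):
    """Normalize one feature map, then apply the style's scale and bias."""
    mu = statistics.fmean(feature_map)
    sigma = statistics.pstdev(feature_map)
    return [y_scale * (x - mu) / (sigma + eps) + y_bias for x in feature_map]

features = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations of one channel
styled = adain(features, y_scale=2.0, y_bias=5.0)

styled_mean = statistics.fmean(styled)  # equals the style bias
styled_std = statistics.pstdev(styled)  # equals the style scale
```

Whatever statistics the incoming feature map had, the output's mean and standard deviation are set by the style, which is how each layer's style overrides the previous one.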

Style hierarchy:

Resolution Layer	Controlled Attributes
$4^2 - 8^2$ (Coarse)	Pose, face shape, presence of glasses
$16^2 - 32^2$ (Middle)	Facial features, hairstyle, eye openness
$64^2 - 1024^2$ (Fine)	Color, fine structure, background details

StyleGAN2 (2020)

Karras, Laine, Aittala, Hellsten, Lehtinen, Aila. "Analyzing and Improving the Image Quality of StyleGAN" (2020)

StyleGAN2 resolved several artifacts in StyleGAN and significantly improved image quality.

Key improvements:

Weight Demodulation: Replaces AdaIN to eliminate blob artifacts. Solves the problem where AdaIN's instance normalization destroys relative magnitude information within feature maps
Removal of Progressive Growing: Achieves stable high-resolution training without progressive growing by using skip connections and residual connections
Path Length Regularization: Improves smoothness of the latent space so that small changes in latent vectors produce proportional changes in images
Lazy Regularization: Applies regularization every 16 steps instead of every step for improved efficiency

StyleGAN2-ADA: Introduced Adaptive Discriminator Augmentation to train without overfitting even with limited data. Enabled high-quality generation from datasets as small as a few thousand images.

StyleGAN3 (2021)

Karras, Aittala, Laine, et al. "Alias-Free Generative Adversarial Networks" (2021)

StyleGAN3 addressed a fundamental signal processing issue.

Problem: In StyleGAN2, fine details in generated images appeared "stuck" to image coordinates rather than to the surfaces they depict. When the camera or an object should move, textures stayed fixed in pixel space instead of moving with the object. This is an aliasing problem.

Solution: Redesigned all signals within the network to be processed in the continuous domain, fundamentally eliminating aliasing from discrete sampling.

Key changes:

Fourier feature-based input replacement
Guaranteed continuous equivariant operations
Achieved full equivariance to translation and rotation
FID comparable to StyleGAN2 while having fundamentally different internal representations

StyleGAN3 thereby made the generator better suited to video generation and animation.

6.6 Conditional GAN, Pix2Pix, CycleGAN

Conditional GAN (cGAN, 2014)

Mirza, Osindero. "Conditional Generative Adversarial Nets" (2014)

The original GAN cannot control what data is generated. Conditional GAN provides additional conditioning information $y$ (e.g., class labels) to both the Generator and Discriminator, enabling conditional generation of data with desired attributes.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)|y))]

Pix2Pix (2017)

Isola, Zhu, Zhou, Efros. "Image-to-Image Translation with Conditional Adversarial Networks" (2017)

Pix2Pix is an image-to-image translation framework using paired image data. It solved diverse tasks -- colorizing grayscale photos, converting satellite images to maps, transforming sketches to photos -- within a unified framework.

Key components:

U-Net Generator: Encoder-Decoder architecture with skip connections
PatchGAN Discriminator: Judges authenticity at the $N \times N$ patch level rather than the whole image
L1 Reconstruction Loss + Adversarial Loss: Simultaneously pursues structural similarity and realism

\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)
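
The combined objective can be sketched for the Generator side as follows. This is an illustrative helper, not the reference implementation: the function name `pix2pix_generator_loss` is hypothetical, the adversarial term is written in the common non-saturating form on raw logits, and the default weight of 100 follows the value of $\lambda$ reported in the Pix2Pix paper. The Discriminator output is assumed to be a grid of per-patch logits, as a PatchGAN produces.

```python
import torch
import torch.nn.functional as F


def pix2pix_generator_loss(
    d_fake_logits: torch.Tensor,
    fake: torch.Tensor,
    target: torch.Tensor,
    lambda_l1: float = 100.0,
) -> torch.Tensor:
    """Adversarial loss over all patches plus a weighted L1 reconstruction loss."""
    # Adversarial term: the Generator wants every patch to be classified as real
    adv = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))
    # Reconstruction term: mean absolute error against the paired ground truth
    l1 = F.l1_loss(fake, target)
    return adv + lambda_l1 * l1


target = torch.zeros(2, 3, 64, 64)
patch_logits = torch.zeros(2, 1, 6, 6)  # an undecided Discriminator: D = 0.5 on every patch
loss_perfect = pix2pix_generator_loss(patch_logits, target.clone(), target)
loss_offset = pix2pix_generator_loss(patch_logits, target + 0.1, target)
print(loss_perfect.item(), loss_offset.item())
```

With an undecided Discriminator the adversarial term is $\log 2$, so a uniform pixel error of 0.1 adds $100 \times 0.1 = 10$ on top of it. This shows how strongly the L1 term anchors the output to the paired target.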

CycleGAN (2017)

Zhu, Park, Isola, Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks" (2017)

Pix2Pix had the major constraint of requiring paired data. CycleGAN learns translation between two domains using only unpaired data.

Core idea: Cycle Consistency Loss

Two Generators $G: X \rightarrow Y$ and $F: Y \rightarrow X$ , and two Discriminators $D_X$ , $D_Y$ are trained.

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]

The constraint is that translating an image from domain $X$ to $Y$ and back to $X$ should recover the original image. This enables learning meaningful mappings without paired data.
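
A minimal sketch of the cycle consistency term is shown below. The helper name `cycle_consistency_loss` and the toy mappings are assumptions for illustration, and the second generator is called `F_net` to avoid clashing with `torch.nn.functional`. Note that `F.l1_loss` averages over elements, so it matches the formula up to a constant scale factor.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(G, F_net, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """L_cyc = E[|F(G(x)) - x|] + E[|G(F(y)) - y|], averaged per element."""
    forward_cycle = F.l1_loss(F_net(G(x)), x)   # X -> Y -> X
    backward_cycle = F.l1_loss(G(F_net(y)), y)  # Y -> X -> Y
    return forward_cycle + backward_cycle


def double(t: torch.Tensor) -> torch.Tensor:
    return 2.0 * t


def half(t: torch.Tensor) -> torch.Tensor:
    return 0.5 * t


torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)
y = torch.randn(4, 3, 8, 8)

loss_inverse = cycle_consistency_loss(double, half, x, y)   # mappings are exact inverses
loss_broken = cycle_consistency_loss(double, double, x, y)  # mappings are not inverses
print(loss_inverse.item(), loss_broken.item())
```

The loss vanishes only when the two mappings undo each other, which is what prevents a Generator from mapping every input to an arbitrary realistic image in the target domain.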

Applications: Converting horses to zebras, summer landscapes to winter, photographs to Monet-style paintings, etc.

6.7 BigGAN (2018): The Power of Scale

Brock, Donahue, Simonyan. "Large Scale GAN Training for High Fidelity Natural Image Synthesis" (2018)

BigGAN dramatically demonstrated that "scale matters in GAN training." It trained with 2-4x the parameters and 8x the batch size compared to prior work.

Key techniques:

Class-Conditional Batch Normalization: Shares class embeddings to adjust the scale and bias of each Batch Normalization layer
Truncation Trick: Truncates the distribution of latent vectors $z$ at inference time to control the quality-diversity tradeoff

z \sim \mathcal{N}(0, I) \rightarrow z' = \text{truncate}(z, \text{threshold})

Orthogonal Regularization: Applies a relaxed orthogonality penalty to Generator weights, which makes the model more amenable to the truncation trick
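
The truncation trick can be sketched as resampling: any latent component whose magnitude exceeds the threshold is redrawn until it falls inside the range. The helper name `sample_truncated_z` and the threshold of 0.5 are illustrative choices rather than values prescribed by the paper.

```python
import torch


def sample_truncated_z(batch_size: int, latent_dim: int, threshold: float,
                       generator: torch.Generator = None) -> torch.Tensor:
    """Draw z ~ N(0, I), redrawing every entry with |z| > threshold."""
    z = torch.randn(batch_size, latent_dim, generator=generator)
    out_of_range = z.abs() > threshold
    while out_of_range.any():
        # Redraw only the offending entries and check again
        z[out_of_range] = torch.randn(int(out_of_range.sum()), generator=generator)
        out_of_range = z.abs() > threshold
    return z


gen = torch.Generator().manual_seed(0)
z_full = torch.randn(1000, 128, generator=gen)
z_trunc = sample_truncated_z(1000, 128, threshold=0.5, generator=gen)
print(z_full.std().item(), z_trunc.std().item())
```

A smaller threshold concentrates latents near the mode of the prior, which tends to raise per-sample fidelity at the cost of variety. A very large threshold recovers ordinary sampling.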

Results: Achieved IS 166.5 and FID 7.4 on ImageNet 128x128, vastly surpassing the previous best (IS 52.52, FID 18.65).

6.8 GigaGAN (2023): The Return of GAN?

Kang, Zhu, et al. "Scaling up GANs for Text-to-Image Synthesis" (2023)

At a time when Diffusion Models dominated image generation, GigaGAN, a 1B-parameter text-to-image GAN, demonstrated that GANs could still compete at scale.

Key innovations:

Adaptive Kernel Selection: Generates different convolution filters for each image. Determined by convex combination from a filter bank using the style vector
Stable Attention: Computes attention scores based on L2 distance to guarantee Lipschitz continuity, and normalizes the attention weight matrix to unit variance
Query-Key Tying: Shares Query and Key matrices for stability
CLIP Text Encoder: Extracts text embeddings using a pre-trained CLIP model
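
The adaptive kernel selection idea, a convex combination of filters chosen by the style vector, can be sketched roughly as below. This is a simplified illustration under assumptions, not the GigaGAN implementation: the class name `AdaptiveKernelSelector`, the single linear layer producing the mixing logits, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn


class AdaptiveKernelSelector(nn.Module):
    """Mixes a bank of N kernels into one kernel per sample using softmax weights."""

    def __init__(self, style_dim: int, num_kernels: int, c_out: int, c_in: int, k: int):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_kernels, c_out, c_in, k, k))
        self.to_logits = nn.Linear(style_dim, num_kernels)

    def forward(self, style: torch.Tensor):
        # Softmax gives non-negative weights that sum to 1: a convex combination
        weights = torch.softmax(self.to_logits(style), dim=-1)             # (B, N)
        kernels = torch.einsum("bn,noikl->boikl", weights, self.bank)      # (B, C_out, C_in, k, k)
        return kernels, weights


torch.manual_seed(0)
selector = AdaptiveKernelSelector(style_dim=32, num_kernels=5, c_out=4, c_in=8, k=3)
style = torch.randn(3, 32)
kernels, weights = selector(style)
print(kernels.shape, weights.sum(dim=-1))
```

Each sample thus receives its own convolution kernel, which could then be applied with a grouped convolution, one group per sample.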

Results and significance:

Surpassed Stable Diffusion v1.5, DALL-E 2, and Parti-750M in FID
0.13 seconds for 512px image generation: Inference speed tens to hundreds of times faster than Diffusion models
Showed that GAN remains competitive in large-scale text-to-image synthesis

7. Implementing GAN in PyTorch

7.1 Basic GAN Implementation (MNIST)

Below is the most basic GAN implementation for the MNIST dataset in PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ============================================================
# Hyperparameter Settings
# ============================================================
LATENT_DIM = 100        # Latent vector dimension (dimension of z)
IMG_DIM = 28 * 28       # Flattened MNIST image dimension
HIDDEN_DIM = 256        # Hidden layer dimension
BATCH_SIZE = 64
EPOCHS = 200
LR = 0.0002
BETAS = (0.5, 0.999)   # Adam optimizer beta parameters
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================================================
# Generator Definition
# ============================================================
class Generator(nn.Module):
    """
    Takes a latent vector z as input and generates a fake image.
    Architecture: z(100) -> 256 -> 512 -> 1024 -> 784(28x28)
    """
    def __init__(self, latent_dim: int, img_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 2, hidden_dim * 4),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 4, img_dim),
            nn.Tanh(),  # Normalize output to [-1, 1] range
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


# ============================================================
# Discriminator Definition
# ============================================================
class Discriminator(nn.Module):
    """
    Takes an image as input and outputs the probability of being real/fake.
    Architecture: 784(28x28) -> 1024 -> 512 -> 256 -> 1
    """
    def __init__(self, img_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, hidden_dim * 4),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 4, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # Convert output to [0, 1] probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# ============================================================
# Data Loader Setup
# ============================================================
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # [0,1] -> [-1,1]
])

dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)


# ============================================================
# Model, Optimizer, and Loss Function Initialization
# ============================================================
G = Generator(LATENT_DIM, IMG_DIM, HIDDEN_DIM).to(DEVICE)
D = Discriminator(IMG_DIM, HIDDEN_DIM).to(DEVICE)

opt_G = optim.Adam(G.parameters(), lr=LR, betas=BETAS)
opt_D = optim.Adam(D.parameters(), lr=LR, betas=BETAS)

criterion = nn.BCELoss()  # Binary Cross Entropy


# ============================================================
# Training Loop
# ============================================================
for epoch in range(EPOCHS):
    d_loss_total, g_loss_total = 0.0, 0.0

    for batch_idx, (real_imgs, _) in enumerate(dataloader):
        real_imgs = real_imgs.view(-1, IMG_DIM).to(DEVICE)
        batch_size = real_imgs.size(0)

        # Real/fake labels
        real_labels = torch.ones(batch_size, 1, device=DEVICE)
        fake_labels = torch.zeros(batch_size, 1, device=DEVICE)

        # -----------------------------------------
        # Step 1: Train Discriminator
        # -----------------------------------------
        # Discriminate real images
        d_real = D(real_imgs)
        d_loss_real = criterion(d_real, real_labels)

        # Generate and discriminate fake images
        z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
        fake_imgs = G(z).detach()  # Block Generator's gradients
        d_fake = D(fake_imgs)
        d_loss_fake = criterion(d_fake, fake_labels)

        # Total Discriminator loss and update
        d_loss = d_loss_real + d_loss_fake
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # -----------------------------------------
        # Step 2: Train Generator
        # -----------------------------------------
        z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
        fake_imgs = G(z)
        d_fake = D(fake_imgs)

        # Non-saturating loss: Generator tries to maximize D(G(z))
        g_loss = criterion(d_fake, real_labels)
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

        d_loss_total += d_loss.item()
        g_loss_total += g_loss.item()

    # Per-epoch log output
    num_batches = len(dataloader)
    print(
        f"Epoch [{epoch+1}/{EPOCHS}] "
        f"D Loss: {d_loss_total/num_batches:.4f} | "
        f"G Loss: {g_loss_total/num_batches:.4f}"
    )
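
The training loop above uses the non-saturating Generator loss. The small numeric check below, a sketch with an arbitrarily chosen logit of -5, illustrates why: when the Discriminator confidently rejects a fake sample, the gradient of the original minimax loss $\log(1 - D(G(z)))$ with respect to the Discriminator's logit almost vanishes, while the gradient of $-\log D(G(z))$ stays close to 1 in magnitude.

```python
import torch

# Discriminator logit for a fake sample that D confidently rejects (assumed value)
logit_sat = torch.tensor(-5.0, requires_grad=True)
logit_ns = torch.tensor(-5.0, requires_grad=True)

# Saturating (minimax) Generator loss: minimize log(1 - D(G(z)))
loss_sat = torch.log(1 - torch.sigmoid(logit_sat))
loss_sat.backward()

# Non-saturating Generator loss: minimize -log D(G(z))
loss_ns = -torch.log(torch.sigmoid(logit_ns))
loss_ns.backward()

d_value = torch.sigmoid(logit_sat).item()
grad_sat = logit_sat.grad.item()
grad_ns = logit_ns.grad.item()
print(f"D(G(z)) = {d_value:.4f}")
print(f"gradient, saturating loss:     {grad_sat:.4f}")
print(f"gradient, non-saturating loss: {grad_ns:.4f}")
```

Analytically, the two gradients are $-D$ and $-(1 - D)$, so early in training, when $D(G(z))$ is near 0, the non-saturating form gives the Generator a far stronger learning signal.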

7.2 DCGAN Implementation (Key Parts)

The version below replaces the fully connected Generator and Discriminator with convolutional architectures. It reuses the imports from Section 7.1 and targets 64x64 RGB images.

class DCGANGenerator(nn.Module):
    """
    DCGAN Generator: Generates images using Transposed Convolutions.
    z(100) -> 4x4x512 -> 8x8x256 -> 16x16x128 -> 32x32x64 -> 64x64x3
    """
    def __init__(self, latent_dim: int = 100, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: z (latent_dim x 1 x 1) -> (feature_map_size*8 x 4 x 4)
            nn.ConvTranspose2d(latent_dim, feature_map_size * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.ReLU(inplace=True),

            # (feature_map_size*8 x 4 x 4) -> (feature_map_size*4 x 8 x 8)
            nn.ConvTranspose2d(feature_map_size * 8, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.ReLU(inplace=True),

            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*2 x 16 x 16)
            nn.ConvTranspose2d(feature_map_size * 4, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.ReLU(inplace=True),

            # (feature_map_size*2 x 16 x 16) -> (feature_map_size x 32 x 32)
            nn.ConvTranspose2d(feature_map_size * 2, feature_map_size, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size),
            nn.ReLU(inplace=True),

            # (feature_map_size x 32 x 32) -> (channels x 64 x 64)
            nn.ConvTranspose2d(feature_map_size, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class DCGANDiscriminator(nn.Module):
    """
    DCGAN Discriminator: Judges authenticity using Strided Convolutions.
    (3 x 64 x 64) -> (64 x 32 x 32) -> (128 x 16 x 16) ->
    (256 x 8 x 8) -> (512 x 4 x 4) -> 1
    """
    def __init__(self, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # (channels x 64 x 64) -> (feature_map_size x 32 x 32)
            nn.Conv2d(channels, feature_map_size, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size x 32 x 32) -> (feature_map_size*2 x 16 x 16)
            nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*2 x 16 x 16) -> (feature_map_size*4 x 8 x 8)
            nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*8 x 4 x 4)
            nn.Conv2d(feature_map_size * 4, feature_map_size * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*8 x 4 x 4) -> (1 x 1 x 1)
            nn.Conv2d(feature_map_size * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, 1)
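
The shape comments in both networks follow from the standard output-size formulas for convolutions. The sketch below traces them in plain Python, assuming dilation 1 and no output padding (the defaults used in the code above). The helper names are introduced only for this example.

```python
def conv_transpose_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of ConvTranspose2d (dilation=1, output_padding=0)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of Conv2d (dilation=1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for each layer, as in the DCGAN code above
generator_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
discriminator_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

size, generator_sizes = 1, []
for kernel, stride, padding in generator_layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    generator_sizes.append(size)

size, discriminator_sizes = 64, []
for kernel, stride, padding in discriminator_layers:
    size = conv_out(size, kernel, stride, padding)
    discriminator_sizes.append(size)

print("Generator:    ", generator_sizes)
print("Discriminator:", discriminator_sizes)
```

The combination of kernel 4, stride 2, and padding 1 doubles the resolution in a transposed convolution and halves it in a regular one, which is why the two networks mirror each other.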

7.3 WGAN-GP Core Loss Implementation

def compute_gradient_penalty(
    discriminator: nn.Module,
    real_samples: torch.Tensor,
    fake_samples: torch.Tensor,
    device: torch.device,
    lambda_gp: float = 10.0,
) -> torch.Tensor:
    """
    Computes the Gradient Penalty for WGAN-GP.

    Penalizes the Discriminator (Critic) so that the L2 norm of its gradient
    equals 1 at random interpolation points between real and generated data.
    """
    batch_size = real_samples.size(0)

    # Random interpolation coefficient, one per sample; the (B, 1, 1, 1) shape
    # assumes 4D image batches (B, C, H, W)
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device)

    # Interpolation between real and fake
    interpolated = (epsilon * real_samples + (1 - epsilon) * fake_samples).requires_grad_(True)

    # Critic output
    d_interpolated = discriminator(interpolated)

    # Gradient computation
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,
        retain_graph=True,
    )[0]

    # L2 norm of gradients
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)

    # Gradient Penalty: expectation of (||grad|| - 1)^2
    gradient_penalty = lambda_gp * ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty


# WGAN-GP Training Loop (Key Parts)
def train_wgan_gp_step(
    G: nn.Module,
    D: nn.Module,
    opt_G: optim.Optimizer,
    opt_D: optim.Optimizer,
    real_imgs: torch.Tensor,
    latent_dim: int,
    device: torch.device,
    n_critic: int = 5,
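):
    """One iteration of WGAN-GP training.

    Assumes D is a critic that returns unbounded scores (no final Sigmoid)
    and uses no BatchNorm, as the WGAN-GP paper recommends, so the
    DCGANDiscriminator from Section 7.2 cannot be passed in unchanged.
    Also assumes real_imgs is already on `device`.
    """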
    batch_size = real_imgs.size(0)

    # --- Critic (Discriminator) training: n_critic times ---
    for _ in range(n_critic):
        z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
        fake_imgs = G(z).detach()

        # Wasserstein Loss: maximize E[D(real)] - E[D(fake)]
        d_real = D(real_imgs).mean()
        d_fake = D(fake_imgs).mean()
        gp = compute_gradient_penalty(D, real_imgs, fake_imgs, device)

        d_loss = d_fake - d_real + gp  # Critic minimizes this

        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # --- Generator training: 1 time ---
    z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
    fake_imgs = G(z)
    g_loss = -D(fake_imgs).mean()  # Generator maximizes D(G(z))

    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
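
As a sanity check on the penalty term, the sketch below restates it compactly and evaluates it on a hypothetical linear critic whose input-gradient norm is known in closed form. `LinearCritic` and `gradient_penalty` are names introduced only for this example. A critic with gradient norm 1 should incur no penalty, and one with gradient norm 3 should incur $10 \times (3 - 1)^2 = 40$.

```python
import torch
import torch.nn as nn


def gradient_penalty(critic: nn.Module, real: torch.Tensor, fake: torch.Tensor,
                     lambda_gp: float = 10.0) -> torch.Tensor:
    """Compact restatement of the WGAN-GP penalty for 4D image batches."""
    eps = torch.rand(real.size(0), 1, 1, 1)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=mixed,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()


class LinearCritic(nn.Module):
    """D(x) = scale * <w, x> with ||w|| = 1, so the input-gradient norm is |scale|."""

    def __init__(self, num_features: int, scale: float):
        super().__init__()
        w = torch.randn(num_features)
        self.register_buffer("w", w / w.norm())
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x.flatten(1) @ self.w).unsqueeze(1)


torch.manual_seed(0)
real = torch.randn(8, 1, 4, 4)
fake = torch.randn(8, 1, 4, 4)

gp_unit = gradient_penalty(LinearCritic(16, scale=1.0), real, fake)
gp_steep = gradient_penalty(LinearCritic(16, scale=3.0), real, fake)
print(gp_unit.item(), gp_steep.item())
```

The penalty depends only on how steep the critic is at the interpolated points, not on the scores themselves, which is how it softly enforces the 1-Lipschitz constraint.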

8. GAN vs Diffusion Models Comparison

Entering the 2020s, Diffusion Models (DDPM, Score-based models) emerged as a new paradigm in image generation. After Dhariwal and Nichol's 2021 paper "Diffusion Models Beat GANs on Image Synthesis," Diffusion Models became the mainstream of generative modeling through DALL-E 2, Stable Diffusion, Midjourney, and others. Let us systematically compare GAN and Diffusion Models.

8.1 Fundamental Comparison

Aspect	GAN	Diffusion Model
Training Method	Adversarial Training (minimax game)	Denoising Score Matching
Generation	Single forward pass	Iterative denoising (tens to hundreds of steps)
Density Model	Implicit (no explicit likelihood)	Explicit (likelihood-based)
Loss Function	Adversarial loss (+ auxiliary losses)	Simple MSE/L1 (noise prediction)
Distribution Matching	$p_g \approx p_{data}$ via JSD/Wasserstein	$p_\theta(x_0) \approx p_{data}$ via ELBO
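
For contrast with the adversarial losses in Section 7, the noise-prediction objective on the Diffusion side can be sketched as below. It follows the commonly used DDPM formulation with a linear noise schedule over 1000 steps. The function name `ddpm_noise_prediction_loss` and the placeholder network `ZeroPredictor` are assumptions for this example, standing in for a real denoising network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention


def ddpm_noise_prediction_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """MSE between the true noise and the noise predicted from a noised sample."""
    batch = x0.size(0)
    t = torch.randint(0, T, (batch,))            # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(batch, 1, 1, 1)
    # Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)


class ZeroPredictor(nn.Module):
    """Placeholder denoiser that always predicts zero noise (illustration only)."""

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return torch.zeros_like(x_t)


torch.manual_seed(0)
x0 = torch.randn(16, 1, 28, 28)
loss = ddpm_noise_prediction_loss(ZeroPredictor(), x0)
print(loss.item(), alpha_bars[-1].item())
```

There is a single network and a single regression target, with no opponent to balance. This is the main reason the training curve is easier to interpret than a GAN's. The cost moves to sampling, where the network must be evaluated once per denoising step.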

8.2 Strengths and Weaknesses

GAN Strengths:

Inference speed: Generates images in a single forward pass. Suitable for real-time applications
Sample sharpness: Tends to produce sharp, realistic images through adversarial training
Latent space control: Semantic manipulation through a well-structured latent space
Lightweight: Can achieve high-quality generation with relatively few parameters

GAN Weaknesses:

Training instability: Mode collapse, training oscillation, etc.
Limited diversity: Mode collapse can restrict generation diversity
Scalability limitations: Does not scale as naturally to text-conditioned generation as Diffusion Models
Evaluation difficulty: Hard to monitor training progress with reliable metrics

Diffusion Model Strengths:

Training stability: Stable training with simple MSE loss
Sample diversity: Mode collapse is virtually nonexistent
Text-conditioned generation: Natural conditional generation through classifier-free guidance, etc.
Theoretical grounding: Likelihood-based model trained with a variational bound, which allows likelihood estimation

Diffusion Model Weaknesses:

Inference speed: Requires tens to hundreds of iterative denoising steps (being improved through distillation, etc.)
Computational cost: High compute requirements for both training and inference
Memory usage: Large U-Net parameters required for high-resolution generation

8.3 Convergence Characteristics

Property	GAN	Diffusion Model
Convergence guarantee	Nash equilibrium guaranteed only theoretically	Stable convergence via ELBO optimization
Mode Coverage	Risk of mode collapse	Excellent mode coverage
Training curve	Unstable, hard to interpret	Stable, loss directly interpretable
Hyperparameter sensitivity	High	Relatively low

8.4 The 2025 Landscape

As of 2025, Diffusion Models dominate image generation. The most commercially successful image generation models, such as Stable Diffusion, DALL-E 3, and Midjourney, are Diffusion-based or widely reported to be.

However, GAN has not been fully replaced. GAN still shows strength in the following areas:

Real-time generation: Video games, VR/AR, etc.
Image editing/manipulation: Precise face editing and attribute manipulation based on StyleGAN
Super-Resolution: Real-time super-resolution processing
GAN-Diffusion Hybrids: Combining GAN loss with Diffusion processes, or leveraging GAN's fast inference for Diffusion model distillation

The emergence of GigaGAN (2023) demonstrated that GAN can be competitive in large-scale text-to-image synthesis, and research combining the strengths of both paradigms is actively underway.

9. The Present and Future of GAN

9.1 GAN's Current Status

GAN was at the center of generative modeling for roughly seven years after its 2014 publication, before ceding its mainstream position to Diffusion Models after 2021. However, GAN's legacy and current role remain significant.

Fields where GAN is actively used today:

Medical imaging: Widely used for augmenting training data while preserving patient privacy
Data augmentation: Expanding small datasets to improve model performance
Image editing and restoration: Face restoration, denoising, super-resolution, etc.
Fashion and design: Virtual try-on, design prototyping
Gaming and simulation: Real-time content generation, texture synthesis

9.2 GAN's Theoretical Legacy

GAN's greatest contribution extends beyond image generation technology.

Adversarial Training Paradigm: The adversarial training introduced by GAN has influenced diverse fields beyond generative models.

Adversarial Examples: Robustness research on deep learning models
Domain Adaptation: Knowledge transfer across domains using adversarial training
Self-supervised Learning: Self-supervised learning leveraging adversarial signals
Inverse Reinforcement Learning: Learning reward functions adversarially

Implicit Generative Models: GAN's core insight, that complex distributions can be learned without defining an explicit probability distribution, has informed later work on other generative families, including Energy-based and Score-based Models.

9.3 Future Outlook

GAN-Diffusion Fusion: One of the most promising directions is combining the strengths of GAN and Diffusion Models. Research is underway to replace denoising steps in the Diffusion process with GANs to accelerate inference.

3D Generation: Research combining GAN with 3D representations (Neural Radiance Fields, 3D Gaussian Splatting, etc.) for 3D content generation is active. EG3D and GET3D are representative examples.

Video Generation: StyleGAN3's equivariant properties can naturally apply to video generation, with ongoing research in temporally consistent video generation.

Efficient Training: Research continues on high-quality generation model training with limited data through Few-shot GAN, transfer learning for GANs, and related approaches.

9.4 GAN Timeline Summary

Year	Model	Key Contribution	Resolution
2014	GAN	Adversarial training framework	Low
2014	cGAN	Conditional generation	Low
2015	DCGAN	CNN-based architecture guidelines	64x64
2017	WGAN	Wasserstein distance	64x64
2017	WGAN-GP	Gradient penalty	64x64
2017	Pix2Pix	Paired image-to-image translation	256x256
2017	CycleGAN	Unpaired image-to-image translation	256x256
2017	ProGAN	Progressive growing	1024x1024
2018	BigGAN	Large-scale training, truncation trick	512x512
2019	StyleGAN	Mapping network, AdaIN, style separation	1024x1024
2020	StyleGAN2	Weight demodulation, path regularization	1024x1024
2021	StyleGAN3	Alias-free, equivariant generation	1024x1024
2023	GigaGAN	1B-param text-to-image GAN	512x512+

10. Conclusion

The GAN proposed by Ian Goodfellow and colleagues in 2014 revolutionized the AI field with a simple yet powerful idea: "competition between two networks produces better generative models." The mathematical framework of the minimax game was both elegant and practical, spawning hundreds of variants over the following decade and dramatically advancing image generation quality.

DCGAN laid the practical foundation through its combination with CNNs, while WGAN substantially improved training stability with the theoretical innovation of Wasserstein distance. The Progressive GAN and StyleGAN series enabled photorealistic image generation at 1024x1024 resolution, and CycleGAN and Pix2Pix pioneered the new application domain of image translation.

Although Diffusion Models have risen to prominence in generative modeling since 2021, GAN's legacy is immense. The adversarial training paradigm continues to be utilized across diverse fields, and hybrid research combining the strengths of GAN and Diffusion Models is actively progressing. As the emergence of GigaGAN demonstrates, the GAN story is far from over.

In the history of generative models, GAN will be remembered as the milestone that first demonstrated the possibility that "artificial intelligence can truly create."

References

Goodfellow, I. J. et al. (2014). "Generative Adversarial Nets." NeurIPS 2014. arXiv:1406.2661
Mirza, M. & Osindero, S. (2014). "Conditional Generative Adversarial Nets." arXiv:1411.1784
Radford, A., Metz, L. & Chintala, S. (2015). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." arXiv:1511.06434
Arjovsky, M., Chintala, S. & Bottou, L. (2017). "Wasserstein GAN." arXiv:1701.07875
Gulrajani, I. et al. (2017). "Improved Training of Wasserstein GANs." arXiv:1704.00028
Isola, P. et al. (2017). "Image-to-Image Translation with Conditional Adversarial Networks." CVPR 2017. arXiv:1611.07004
Zhu, J.-Y. et al. (2017). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." ICCV 2017. arXiv:1703.10593
Karras, T. et al. (2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation." ICLR 2018. arXiv:1710.10196
Brock, A., Donahue, J. & Simonyan, K. (2018). "Large Scale GAN Training for High Fidelity Natural Image Synthesis." ICLR 2019. arXiv:1809.11096
Karras, T., Laine, S. & Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks." CVPR 2019. arXiv:1812.04948
Karras, T. et al. (2020). "Analyzing and Improving the Image Quality of StyleGAN." CVPR 2020. arXiv:1912.04958
Karras, T. et al. (2021). "Alias-Free Generative Adversarial Networks." NeurIPS 2021. arXiv:2106.12423
Kang, M. et al. (2023). "Scaling up GANs for Text-to-Image Synthesis." CVPR 2023. arXiv:2303.05511
Dhariwal, P. & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. arXiv:2105.05233