GAN Paper Deep Dive: How Generative Adversarial Networks Ushered in the Era of AI-Generated Content


1. Paper Overview and Historical Significance

1.1 Paper Information

"Generative Adversarial Nets" was published at NeurIPS 2014 (then known as NIPS), co-authored by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. According to the now-legendary anecdote, Goodfellow conceived the idea while discussing generative models with colleagues at a bar in Montreal. He went home that night, coded it up, and the first prototype worked right away.

The core idea of this paper is remarkably intuitive: A counterfeiter (Generator) and a police officer (Discriminator) compete against each other. The counterfeiter produces increasingly sophisticated forgeries, while the police officer develops ever sharper detection skills. When this adversarial process converges, the counterfeiter produces bills indistinguishable from genuine ones.

1.2 Historical Context: The Generative Model Landscape of 2014

Before GAN appeared, the dominant approaches in generative modeling were as follows.

Variational Autoencoder (VAE, 2013): Proposed by Kingma and Welling, VAE introduced probabilistic latent variables into an Encoder-Decoder architecture to learn data distributions. However, optimizing the ELBO (Evidence Lower Bound) resulted in blurry generated images.

Boltzmann Machine Family: Deep Boltzmann Machines, Restricted Boltzmann Machines, and similar energy-based models were theoretically elegant but relied on MCMC (Markov Chain Monte Carlo) sampling, making training slow and scalability limited.

Autoregressive Models: Models in this family generate pixels one at a time, sequentially. They can produce high-quality samples, but generation is extremely slow. (The best-known example, PixelRNN, actually appeared in 2016, two years after GAN, but it continues this earlier line of work.)

GAN broke through all these limitations at once. It could generate high-quality samples without defining an explicit probability distribution, and could generate samples instantly in a single forward pass without Markov chains or sequential generation processes. This represented a paradigm shift in the field of generative models.

1.3 Impact

The GAN paper has been cited approximately 65,000 times as of 2024, and hundreds of GAN variants have been proposed over the following decade. Yann LeCun praised adversarial training as "the most interesting idea in the last 10 years in machine learning." GAN has been applied to countless domains including image generation, super-resolution, style transfer, data augmentation, and drug discovery. It reigned as the dominant paradigm in generative modeling until the emergence of Diffusion Models.


2. The Core Idea of GAN

2.1 Two-Player Game: Generator vs Discriminator

The GAN framework consists of two neural networks competing against each other.

Generator (G): Takes a random noise vector $z$ as input and generates fake data $G(z)$. The Generator's goal is to produce samples similar enough to real data to fool the Discriminator.

$$G: z \sim p_z(z) \rightarrow G(z) \in \mathbb{R}^d$$

Discriminator (D): Determines whether the input data comes from the real data distribution ($x \sim p_{data}$) or is fake, produced by the Generator ($G(z)$). The output is a probability value between 0 and 1, where closer to 1 means the input is judged as real.

$$D: x \rightarrow [0, 1]$$

These two networks have opposing objectives:

  • Generator: Tries to maximize $D(G(z))$ (making the Discriminator classify fakes as real)
  • Discriminator: Tries to assign high probability to real data and low probability to fake data

2.2 Intuitive Analogy

The GAN training process can be understood through an art market analogy.

| Component | Analogy | Role |
| --- | --- | --- |
| Generator | Art forger | Goal is to create forgeries indistinguishable from originals |
| Discriminator | Art appraiser | Goal is to distinguish originals from forgeries |
| Training Data | Authentic artworks | Samples from the real data distribution |
| Noise Vector $z$ | Artist's inspiration | A random point in the latent space |

Initially, the forger's skills are poor, so the appraiser easily identifies forgeries. But the forger improves through the appraiser's feedback (gradients), and the appraiser also enhances detection capabilities to counter increasingly sophisticated forgeries. When this competition progresses sufficiently, the forger produces works indistinguishable from originals.

2.3 Minimax Game Formulation

The GAN training objective is formalized as the following minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Let us analyze each term of this value function $V(D, G)$.

First term: $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$

This is the Discriminator's judgment on real data $x$. The Discriminator tries to maximize this value, aiming for $D(x) \rightarrow 1$ (judging real as real). The Generator has no influence on this term.

Second term: $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

This is the Discriminator's judgment on fake data produced by the Generator.

  • The Discriminator tries to maximize this: $D(G(z)) \rightarrow 0$ (judging fake as fake) gives $\log(1 - 0) = 0$, the maximum
  • The Generator tries to minimize this: $D(G(z)) \rightarrow 1$ (judging fake as real) gives $\log(1 - 1) = -\infty$, the minimum

This is precisely where the name Adversarial comes from. Two players optimize the same value function in opposite directions.
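In practice, the Discriminator's side of this value function is simply the negated binary cross-entropy loss with label 1 for real samples and label 0 for fakes. A minimal numeric check in plain Python (the two discriminator outputs below are arbitrary toy values):

```python
import math

def bce(p, label):
    # Binary cross-entropy for a single prediction p in (0, 1)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Discriminator outputs on one real sample and one fake sample (toy values)
d_real, d_fake = 0.9, 0.2

# Value function contribution: log D(x) + log(1 - D(G(z)))
v = math.log(d_real) + math.log(1 - d_fake)

# Negated sum of BCE losses with label 1 for real and 0 for fake
neg_bce = -(bce(d_real, 1) + bce(d_fake, 0))

assert abs(v - neg_bce) < 1e-12
```

This is why GAN discriminators are routinely trained with a standard BCE loss: maximizing $V$ over $D$ and minimizing BCE are the same optimization.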


3. Mathematical Foundations

3.1 Optimal Discriminator

Let us derive the optimal Discriminator $D^*_G$ for a fixed Generator $G$. Converting the value function to integral form using the definition of expectation:

$$V(D, G) = \int_x p_{data}(x) \log D(x) \, dx + \int_x p_g(x) \log(1 - D(x)) \, dx$$

where $p_g$ is the distribution of data generated by the Generator. Combining into a single integral:

$$V(D, G) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$$

Differentiating the integrand with respect to $D(x)$ and setting it to zero:

$$\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0$$

Solving for $D(x)$, the optimal discriminator is:

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

This result is intuitively sound. If a point $x$ is a priori equally likely to come from the real distribution $p_{data}$ or the generated distribution $p_g$, then $D^*_G(x)$ is exactly the posterior probability that $x$ is real, by Bayes' rule.

Key observation: When $p_g = p_{data}$, i.e., when the Generator has perfectly learned the real data distribution, $D^*_G(x) = \frac{1}{2}$ for all $x$. The Discriminator can no longer distinguish real from fake at all.
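This closed form is easy to sanity-check numerically. Because the value function decomposes pointwise, each $D(x)$ can be optimized independently over a fine grid; the grid optimum should match the formula. A small sketch with arbitrary toy distributions:

```python
import math

# Toy discrete distributions over three points (arbitrary values)
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

# Closed-form optimum: D*(x) = p_data(x) / (p_data(x) + p_g(x))
d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

# Grid search per point: maximize p_data(x) log d + p_g(x) log(1 - d)
for i, (pd, pg) in enumerate(zip(p_data, p_g)):
    best = max((k / 1000 for k in range(1, 1000)),
               key=lambda d: pd * math.log(d) + pg * math.log(1 - d))
    assert abs(best - d_star[i]) < 2e-3  # agrees up to the grid resolution
```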

3.2 Relationship with Jensen-Shannon Divergence

Substituting the optimal discriminator $D^*_G$ into the value function:

$$V(D^*_G, G) = \mathbb{E}_{x \sim p_{data}} \left[ \log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g} \left[ \log \frac{p_g(x)}{p_{data}(x) + p_g(x)} \right]$$

Simplifying:

$$V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)$$

where $JSD$ is the Jensen-Shannon Divergence, defined as:

$$JSD(p \| q) = \frac{1}{2} KL\left(p \,\middle\|\, \frac{p+q}{2}\right) + \frac{1}{2} KL\left(q \,\middle\|\, \frac{p+q}{2}\right)$$

JSD is a symmetrized version of KL Divergence and is always bounded: $0 \leq JSD(p \| q) \leq \log 2$. $JSD = 0$ occurs if and only if $p = q$, i.e., when the two distributions are completely identical.
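These properties are straightforward to verify numerically for discrete distributions; a minimal sketch (the distribution values are arbitrary):

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; 0 * log(0/q) terms are skipped
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # JSD via KL against the mixture m = (p + q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

assert abs(jsd(p, p)) < 1e-12               # identical distributions -> 0
assert 0 <= jsd(p, q) <= math.log(2)        # always bounded

# Fully disjoint supports saturate the upper bound log 2
assert abs(jsd([1.0, 0.0], [0.0, 1.0]) - math.log(2)) < 1e-12
```

The last assertion previews the vanishing-gradient problem discussed in Section 3.5: once supports are disjoint, JSD is pinned at $\log 2$ no matter how far apart the distributions are.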

3.3 Proof of Global Optimality

Theorem (Goodfellow et al., 2014): The global minimum of $C(G) = \max_D V(D, G)$ is achieved if and only if $p_g = p_{data}$, at which point $C(G) = -\log 4$.

Proof:

(1) $C(G) = V(D^*_G, G) = -\log 4 + 2 \cdot JSD(p_{data} \| p_g)$

(2) $JSD(p_{data} \| p_g) \geq 0$ (non-negativity of JSD)

(3) $JSD(p_{data} \| p_g) = 0 \iff p_{data} = p_g$

(4) Therefore $C(G) \geq -\log 4$, with equality if and only if $p_g = p_{data}$

This provides the theoretical guarantee for GAN training. Given a Generator and Discriminator with sufficient capacity, at the Nash equilibrium of the minimax game, the Generator perfectly recovers the real data distribution.

3.4 Nash Equilibrium

From a game-theoretic perspective, GAN training is the problem of finding a Nash equilibrium between two players. A Nash equilibrium is a state where neither player can benefit by unilaterally changing their strategy while the other player's strategy remains fixed.

The Nash equilibrium in GAN is:

  • $G^*$: A Generator that achieves $p_g = p_{data}$
  • $D^*$: A Discriminator that outputs $D(x) = \frac{1}{2}$ for all $x$

Theoretically, this equilibrium exists, but finding it in practice is very difficult: it is a non-convex game in which two networks must be optimized simultaneously. This is the fundamental difficulty of GAN training and became the starting point for numerous subsequent studies.

3.5 KL Divergence vs JS Divergence

Why JSD specifically? Let us compare with KL Divergence.

Problems with KL Divergence:

$$KL(p_{data} \| p_g) = \int p_{data}(x) \log \frac{p_{data}(x)}{p_g(x)} dx$$

KL Divergence is asymmetric and diverges to infinity in regions where $p_g(x) = 0$ but $p_{data}(x) > 0$. This becomes problematic when the Generator's distribution fails to sufficiently cover the real distribution early in training.

Advantages of JS Divergence:

  • Symmetric: $JSD(p \| q) = JSD(q \| p)$
  • Always finite: $0 \leq JSD \leq \log 2$
  • Computes KL with respect to the mixture distribution $\frac{p+q}{2}$, so it does not diverge even when one distribution is zero

However, JSD is not perfect either. When the supports of the two distributions do not overlap, JSD becomes the constant $\log 2$, making the gradient zero. This is the root cause of the vanishing gradient problem in GAN training, and the key motivation for WGAN's introduction of the Wasserstein distance.


4. Training Algorithm

4.1 Training Procedure

The training algorithm proposed in the original paper is as follows:

Algorithm 1: GAN Training (Goodfellow et al., 2014)

for number of training iterations do
    # --- Step 1: Discriminator update (k steps) ---
    for k steps do
        - Sample m noise samples {z^(1), ..., z^(m)} from noise prior p_z(z)
        - Sample m real samples {x^(1), ..., x^(m)} from data distribution p_data(x)
        - Update Discriminator parameters by ascending the stochastic gradient:

          nabla_{theta_d} (1/m) sum_{i=1}^{m} [log D(x^(i)) + log(1 - D(G(z^(i))))]

    end for

    # --- Step 2: Generator update (1 step) ---
    - Sample m noise samples {z^(1), ..., z^(m)} from noise prior p_z(z)
    - Update Generator parameters by descending the stochastic gradient:

          nabla_{theta_g} (1/m) sum_{i=1}^{m} log(1 - D(G(z^(i))))

end for
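Algorithm 1 maps almost line-for-line onto a PyTorch training step. A minimal sketch with $k = 1$ and the minimax generator loss as written above; the tiny MLPs, learning rates, and toy 2-D "real" data are all illustrative assumptions, not the paper's setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
m, eps = 32, 1e-8  # minibatch size; eps guards the logs numerically

for step in range(100):
    # --- Step 1 (k = 1): ascend log D(x) + log(1 - D(G(z))) ---
    x = torch.randn(m, 2) + 3.0                  # toy "real" data ~ N(3, I)
    z = torch.randn(m, 4)
    d_loss = -(torch.log(D(x) + eps)
               + torch.log(1 - D(G(z).detach()) + eps)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Step 2: descend log(1 - D(G(z))) (minimax form) ---
    z = torch.randn(m, 4)
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Note the `.detach()` in the Discriminator step: the Generator's parameters must not receive gradients while $D$ is being updated.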

4.2 Alternating Optimization

The key is alternating optimization. The Discriminator and Generator are updated in turn.

Why update the Discriminator k times before updating the Generator once:

Theoretically, the optimal discriminator $D^*_G$ must be found before updating the Generator to obtain the correct gradient direction. Since fully optimizing $D$ is infeasible in practice, it is approximated with $k$ gradient steps. The original paper used $k = 1$ as the default.

Importance of maintaining balance:

  • If the Discriminator becomes too strong: The Generator's gradients vanish and training stalls
  • If the Discriminator becomes too weak: It fails to provide useful learning signals to the Generator
  • Ideally, the Discriminator and Generator should advance at comparable levels

4.3 Non-Saturating Loss (Practical Modification)

In the theoretical minimax objective, the Generator's goal is to minimize $\log(1 - D(G(z)))$. However, early in training when the Generator is very poor, $D(G(z)) \approx 0$, so $\log(1 - D(G(z))) \approx \log 1 = 0$, resulting in near-zero gradients.

Goodfellow addressed this by modifying the Generator's objective:

Original (Minimax):

$$\min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Modified (Non-Saturating):

$$\max_G \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]$$

Both objectives share the same fixed point (Nash equilibrium), but the gradient magnitudes differ significantly early in training. The non-saturating loss provides strong gradients even when $D(G(z))$ is small, enabling the Generator to learn quickly.

$$\text{Minimax gradient:} \quad \frac{\partial}{\partial G} \log(1 - D(G(z))) = \frac{-D'(G(z))}{1 - D(G(z))} \approx 0 \text{ when } D(G(z)) \approx 0$$

$$\text{Non-saturating gradient:} \quad \frac{\partial}{\partial G} \log D(G(z)) = \frac{D'(G(z))}{D(G(z))} \rightarrow \text{large when } D(G(z)) \approx 0$$
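Treating $d = D(G(z))$ as a scalar, the two slopes can be compared directly; a toy numeric check of the saturation effect:

```python
def minimax_grad(d):
    # d/dd of log(1 - d): the slope the generator sees through d = D(G(z))
    return -1.0 / (1.0 - d)

def non_saturating_grad(d):
    # d/dd of log d
    return 1.0 / d

# Early in training the Discriminator confidently rejects fakes: d ≈ 0
d = 0.01
assert abs(minimax_grad(d)) < 1.02       # nearly flat: learning stalls
assert non_saturating_grad(d) == 100.0   # strong learning signal
```

At $d = 0.01$ the minimax slope is barely above 1 while the non-saturating slope is 100, which is exactly why the modified loss trains so much faster early on.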

4.4 Experimental Results in the Original Paper

The original paper conducted experiments on MNIST, Toronto Face Database (TFD), and CIFAR-10 datasets. Evaluation used Parzen window-based log-likelihood estimation, and GAN showed competitive performance compared to Deep Boltzmann Machines and Stacked Denoising Autoencoders.

However, by today's standards, the results were quite rudimentary. Both the Generator and Discriminator used simple MLPs (Multi-Layer Perceptrons), and the resolution and quality of generated images were limited. The true breakthroughs came through subsequent architectural improvements and training technique advancements.


5. Core Problems of GAN

5.1 Mode Collapse

The most notorious problem of GAN is Mode Collapse. This occurs when the Generator fails to learn all the modes (diverse patterns) of the data distribution and instead focuses on a small subset, repeatedly generating similar outputs.

Mechanism:

When the Generator discovers a few patterns that are particularly effective at fooling the Discriminator, it repeatedly generates those patterns instead of exploring diverse alternatives. For example, when training on MNIST, the Generator might perfectly generate only the digit '1' while failing to generate any other digits.

Mathematical interpretation:

Mode collapse is related to the transformation from minimax to maximin game:

$$\max_D \min_G V(D, G) \neq \min_G \max_D V(D, G)$$

In the theoretical minimax, the Generator must defend against all possible Discriminators, requiring it to cover the entire distribution. However, in actual training, the Generator only needs to fool the current Discriminator, making it a "rational" strategy to focus on specific modes.

5.2 Training Instability

GAN training is inherently the problem of finding a Nash equilibrium in a non-cooperative game. This is far more difficult than a simple optimization problem.

Oscillation problem: The Generator and Discriminator frequently oscillate around each other without converging. In a typical loss landscape, gradient descent finds local minima, but gradient descent in a minimax game can circle around saddle points.

Difficulty of training balance: If the Discriminator converges too quickly, the Generator cannot learn; conversely, if the Discriminator is too weak, it fails to convey meaningful learning signals to the Generator. Maintaining this delicate balance was the greatest practical challenge in GAN training.

5.3 Vanishing Gradients

As explained in Section 3.5, JS Divergence becomes the constant $\log 2$ when the supports of the two distributions do not overlap, resulting in zero gradients.

In high-dimensional data (e.g., images), both the real data distribution and the Generator's distribution exist on low-dimensional manifolds within the high-dimensional space. The probability of these two manifolds overlapping is very low, so it is typical for the supports of the two distributions to barely overlap early in training. In this situation, JSD-based GAN provides no useful gradients at all.

$$\text{When } \mathrm{supp}(p_{data}) \cap \mathrm{supp}(p_g) = \emptyset: \quad JSD(p_{data} \| p_g) = \log 2 \quad (\text{constant})$$

5.4 Evaluation Challenges

Objectively evaluating GAN performance is itself a very challenging problem. The main evaluation metrics are:

Inception Score (IS): Measures the quality (sharpness) and diversity of generated images. Uses a pre-trained Inception network -- high scores indicate that individual images have confident class predictions (quality) while the overall distribution covers diverse classes (diversity).

Fréchet Inception Distance (FID): Measures the Fréchet distance between the Inception feature distributions of real and generated data. Lower is better. Widely used as a more reliable metric than IS.

$$FID = \|\mu_r - \mu_g\|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features for real and generated images, respectively.
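For diagonal covariances the matrix square root reduces to elementwise square roots, so the formula can be sketched in a few lines of plain Python (the feature statistics below are arbitrary toy values, not real Inception features):

```python
import math

def fid_diagonal(mu_r, sigma_r, mu_g, sigma_g):
    # FID restricted to diagonal covariances: Tr(Σr + Σg - 2(ΣrΣg)^{1/2})
    # becomes a sum of scalar terms per feature dimension.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(sr + sg - 2 * math.sqrt(sr * sg)
                   for sr, sg in zip(sigma_r, sigma_g))
    return mean_term + cov_term

# Identical feature statistics -> FID = 0
assert fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1]) == 0.0

# Shifting the generated mean increases FID by the squared distance
assert fid_diagonal([0, 0], [1, 1], [3, 4], [1, 1]) == 25.0
```

Real FID implementations use full covariance matrices and a proper matrix square root (e.g. `scipy.linalg.sqrtm`); this sketch only illustrates how the two terms behave.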


6. The Complete GAN Lineage

6.1 DCGAN (2015): The Beginning of Stable CNN-based Training

Radford, Metz, Chintala. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" (2015)

The original GAN used only MLPs (Fully Connected Layers), failing to leverage CNN's powerful spatial feature extraction capabilities for image generation. DCGAN (Deep Convolutional GAN) was the first architecture to successfully integrate CNNs into GANs, establishing several architectural guidelines for stable training.

DCGAN's key architectural rules:

  1. Remove pooling layers: Use strided convolutions (Discriminator) and fractional-strided / transposed convolutions (Generator) instead of max pooling
  2. Apply Batch Normalization: Apply to both Generator and Discriminator, except for the Generator's output layer and the Discriminator's input layer
  3. Remove fully connected layers: Use global average pooling or direct convolutional connections
  4. Activation functions: Generator uses Tanh for the output layer and ReLU elsewhere. Discriminator uses LeakyReLU for all layers

DCGAN Generator Architecture (Conceptual):

z (100-dim) -> FC -> Reshape (4x4x1024) -> ConvT -> BN -> ReLU (8x8x512)
-> ConvT -> BN -> ReLU (16x16x256) -> ConvT -> BN -> ReLU (32x32x128)
-> ConvT -> Tanh (64x64x3)
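The conceptual pipeline above translates directly into PyTorch. A minimal sketch; the kernel size, stride, and padding choices are assumptions picked so each transposed convolution exactly doubles spatial resolution:

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Sketch of the conceptual DCGAN generator above (sizes assumed)."""
    def __init__(self, latent_dim: int = 100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4 * 4 * 1024)

        def up_block(c_in, c_out):
            # k=4, s=2, p=1 doubles H and W: out = (in - 1)*2 - 2 + 4 = 2*in
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.net = nn.Sequential(
            up_block(1024, 512),   # 4x4   -> 8x8
            up_block(512, 256),    # 8x8   -> 16x16
            up_block(256, 128),    # 16x16 -> 32x32
            nn.ConvTranspose2d(128, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),             # 32x32 -> 64x64, output in [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc(z).view(-1, 1024, 4, 4)
        return self.net(x)

g = DCGANGenerator()
out = g(torch.randn(2, 100))
assert out.shape == (2, 3, 64, 64)
```

Note how the sketch follows the DCGAN rules: no pooling, BatchNorm everywhere except the output layer, ReLU inside, and Tanh at the end.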

Beyond simply generating good images, DCGAN demonstrated that the learned latent space possesses meaningful structure. The famous demonstration showed that vector arithmetic in latent space corresponds to semantic transformations:

$$\text{vec}(\text{"man with glasses"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) = \text{vec}(\text{"woman with glasses"})$$

6.2 WGAN (2017): Introduction of Wasserstein Distance

Arjovsky, Chintala, Bottou. "Wasserstein GAN" (2017)

WGAN is one of the most important theoretical advances in GAN, introducing Wasserstein distance (Earth Mover's distance) to address the fundamental limitations of JS Divergence.

Wasserstein Distance (EM Distance):

$$W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma} [\|x - y\|]$$

where $\Pi(p_{data}, p_g)$ is the set of all joint distributions with marginals $p_{data}$ and $p_g$. Intuitively, it is the minimum cost of "moving dirt" to transform one distribution into the other.

Key advantages of Wasserstein Distance:

Unlike JSD, it provides a continuous and differentiable distance even when the supports of the two distributions do not overlap. For example, consider two point distributions $\delta_0$ and $\delta_\theta$ ($\theta > 0$):

$$JSD(\delta_0 \| \delta_\theta) = \log 2 \quad \text{(constant, gradient = 0)}$$

$$W(\delta_0, \delta_\theta) = |\theta| \quad \text{(continuous, gradient} = \text{sign}(\theta)\text{)}$$

Kantorovich-Rubinstein Duality:

Since directly computing the Wasserstein distance is intractable, the Kantorovich-Rubinstein duality is leveraged:

$$W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

where the supremum is taken over all 1-Lipschitz functions. WGAN trains the Discriminator (now called the Critic) to approximate this 1-Lipschitz function.

Weight Clipping: The original WGAN enforced the Lipschitz constraint by clipping critic weights to the range $[-c, c]$. However, this severely limited the critic's representational power and could cause training instability.
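The clipping step itself is a one-liner per parameter. A minimal sketch, where the critic architecture is an arbitrary stand-in and $c = 0.01$ is the clipping constant used in the paper:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
c = 0.01  # clipping constant from the original WGAN paper

# After each critic update, force every weight into [-c, c]
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)

assert all(float(p.abs().max()) <= c for p in critic.parameters())
```

Crude as it is, this is what constrains the critic to (approximately) a family of Lipschitz functions; the representational cost of doing it this way motivated the gradient penalty that follows.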

6.3 WGAN-GP (2017): Gradient Penalty

Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville. "Improved Training of Wasserstein GANs" (2017)

To address weight clipping's problems, Gradient Penalty (GP) was proposed. Instead of directly enforcing the Lipschitz constraint, it regularizes the critic's gradient norm to stay close to 1.

$$L_{WGAN\text{-}GP} = \underbrace{\mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_{data}}[D(x)]}_{\text{Original Critic Loss}} + \underbrace{\lambda \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \left[ (\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2 \right]}_{\text{Gradient Penalty}}$$

where $\hat{x}$ is a random interpolation between real and generated data:

$$\hat{x} = \epsilon x + (1 - \epsilon) G(z), \quad \epsilon \sim \text{Uniform}[0, 1]$$

WGAN-GP uses $\lambda = 10$ and $n_{critic} = 5$ critic updates as defaults, training stably across diverse architectures with minimal hyperparameter tuning.
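The penalty term can be computed with `torch.autograd.grad`, which lets the gradient norm itself be differentiated during the critic update. A minimal sketch; the tiny critic and toy data are stand-ins for a real setup:

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP penalty on random interpolates between real and fake batches."""
    eps = torch.rand(real.size(0), 1)                    # epsilon ~ U[0, 1]
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=d_hat, inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True)                               # keep graph for backprop
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
gp = gradient_penalty(critic, torch.randn(8, 2), torch.randn(8, 2))
assert float(gp) >= 0.0
```

`create_graph=True` is the essential detail: without it, the penalty would be a constant with respect to the critic's parameters and contribute no gradient.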

6.4 Progressive GAN (2017): Gradual Resolution Increase

Karras, Aila, Laine, Lehtinen. "Progressive Growing of GANs for Improved Quality, Stability, and Variation" (2017)

Progressive GAN (ProGAN), proposed by the NVIDIA research team, opened new horizons in high-resolution image generation. The core idea is to start training the Generator and Discriminator at low resolution and progressively add layers to increase resolution.

Training process:

Phase 1: Train G and D at 4x4 resolution
Phase 2: Add 8x8 layers with gradual fade-in transition
Phase 3: Add 16x16 layers
...
Phase N: Reach final 1024x1024 resolution

Fade-in mechanism: When adding new layers, the outputs of existing and new layers are combined via a weighted average. The weight $\alpha$ gradually increases from 0 to 1, progressively activating the new layer.

$$\text{output} = (1 - \alpha) \cdot \text{upsampled\_old} + \alpha \cdot \text{new\_layer\_output}$$
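The blend is a single weighted average; a minimal sketch of the fade-in step (tensor shapes are illustrative):

```python
import torch

def fade_in(upsampled_old, new_layer_output, alpha):
    # Progressive GAN fade-in: blend the upsampled old output with the
    # new layer's output as alpha ramps from 0 to 1.
    return (1 - alpha) * upsampled_old + alpha * new_layer_output

old = torch.zeros(1, 3, 8, 8)
new = torch.ones(1, 3, 8, 8)
assert fade_in(old, new, 0.0).mean().item() == 0.0  # new layer inactive
assert fade_in(old, new, 1.0).mean().item() == 1.0  # fully transitioned
```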

Key contributions:

  • Dramatically improved training stability: Learning coarse structure at low resolution first, then gradually adding fine details makes training much more stable
  • Achieved 1024x1024 resolution: First successful generation of photorealistic face images at 1024x1024 resolution on the CelebA-HQ dataset
  • Minibatch standard deviation: Introduced a technique using within-minibatch statistics to increase diversity

6.5 StyleGAN Series (2019-2021): The Pinnacle of Style-based Generation

StyleGAN (2019)

Karras, Laine, Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks" (2019)

StyleGAN is a revolutionary architecture that combines Progressive GAN's progressive training with the style separation concepts from Neural Style Transfer.

Key structural changes:

  1. Mapping Network: Transforms the input latent vector $z \in \mathcal{Z}$ through a nonlinear mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$ to an intermediate latent space $\mathcal{W}$. Consists of 8 FC layers.

  2. Adaptive Instance Normalization (AdaIN): Injects style vectors $w$ from $\mathcal{W}$ space into each convolution layer.

$$\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $y_s$ and $y_b$ are the scale and bias obtained via a learned affine transformation from the style vector $w$.

  3. Constant Input: Uses a learnable constant tensor (4x4x512) as the Generator's input. Style is injected solely through AdaIN.

  4. Noise Injection: Adds per-pixel noise after each convolution layer to control stochastic variation (e.g., hair position, pores, etc.).
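The AdaIN operation from step 2 above can be sketched in a few lines; the tensor shapes and the way the style scale/bias are supplied are illustrative assumptions (in StyleGAN they come from a learned affine map of $w$):

```python
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """AdaIN: normalize each channel to zero mean / unit std over (H, W),
    then apply the style's scale y_s and bias y_b."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    return y_s * (x - mu) / (sigma + eps) + y_b

x = torch.randn(2, 8, 4, 4)             # (batch, channels, H, W)
y_s = torch.ones(2, 8, 1, 1) * 2.0      # style scale (stand-in for affine(w))
y_b = torch.zeros(2, 8, 1, 1)           # style bias (stand-in for affine(w))
out = adain(x, y_s, y_b)

assert out.shape == x.shape
# Each channel is recentered: per-channel means are ~0 after AdaIN
assert float(out.mean(dim=(2, 3)).abs().max()) < 1e-4
```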

Style hierarchy:

| Resolution Layer | Controlled Attributes |
| --- | --- |
| $4^2$ - $8^2$ (Coarse) | Pose, face shape, presence of glasses |
| $16^2$ - $32^2$ (Middle) | Facial features, hairstyle, eye openness |
| $64^2$ - $1024^2$ (Fine) | Color, fine structure, background details |

StyleGAN2 (2020)

Karras, Laine, Aittala, Hellsten, Lehtinen, Aila. "Analyzing and Improving the Image Quality of StyleGAN" (2020)

StyleGAN2 resolved several artifacts in StyleGAN and significantly improved image quality.

Key improvements:

  1. Weight Demodulation: Replaces AdaIN to eliminate blob artifacts. Solves the problem where AdaIN's instance normalization destroys relative magnitude information within feature maps
  2. Removal of Progressive Growing: Achieves stable high-resolution training without progressive growing by using skip connections and residual connections
  3. Path Length Regularization: Improves smoothness of the latent space so that small changes in latent vectors produce proportional changes in images
  4. Lazy Regularization: Applies regularization every 16 steps instead of every step for improved efficiency

StyleGAN2-ADA: Introduced Adaptive Discriminator Augmentation to train without overfitting even with limited data. Enabled high-quality generation from datasets as small as a few thousand images.

StyleGAN3 (2021)

Karras, Aittala, Laine, et al. "Alias-Free Generative Adversarial Networks" (2021)

StyleGAN3 addressed a fundamental signal processing issue.

Problem: In StyleGAN2, fine details in generated images appeared "stuck" to image coordinates. When the camera should move, textures did not move with objects but remained fixed -- an aliasing problem.

Solution: Redesigned all signals within the network to be processed in the continuous domain, fundamentally eliminating aliasing from discrete sampling.

Key changes:

  • Fourier feature-based input replacement
  • Guaranteed continuous equivariant operations
  • Achieved full equivariance to translation and rotation
  • FID comparable to StyleGAN2 while having fundamentally different internal representations

StyleGAN3 laid the foundation for better suitability in video generation and animation.

6.6 Conditional GAN, Pix2Pix, CycleGAN

Conditional GAN (cGAN, 2014)

Mirza, Osindero. "Conditional Generative Adversarial Nets" (2014)

The original GAN cannot control what data is generated. Conditional GAN provides additional conditioning information $y$ (e.g., class labels) to both the Generator and Discriminator, enabling conditional generation of data with desired attributes.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)|y))]$$
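A common way to realize $G(z|y)$ and $D(x|y)$ is to concatenate a learned label embedding onto each network's input. A minimal sketch; all sizes and the embedding approach are illustrative assumptions (cGAN variants condition in several ways):

```python
import torch
import torch.nn as nn

n_classes, latent_dim, img_dim, emb_dim = 10, 100, 784, 10
embed = nn.Embedding(n_classes, emb_dim)   # shared label embedding

# Generator sees [z ; embed(y)], Discriminator sees [x ; embed(y)]
G = nn.Sequential(nn.Linear(latent_dim + emb_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + emb_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(4, latent_dim)
y = torch.randint(0, n_classes, (4,))
fake = G(torch.cat([z, embed(y)], dim=1))       # G(z | y)
score = D(torch.cat([fake, embed(y)], dim=1))   # D(x | y)
assert fake.shape == (4, 784) and score.shape == (4, 1)
```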

Pix2Pix (2017)

Isola, Zhu, Zhou, Efros. "Image-to-Image Translation with Conditional Adversarial Networks" (2017)

Pix2Pix is an image-to-image translation framework using paired image data. It solved diverse tasks -- colorizing grayscale photos, converting satellite images to maps, transforming sketches to photos -- within a unified framework.

Key components:

  • U-Net Generator: Encoder-Decoder architecture with skip connections
  • PatchGAN Discriminator: Judges authenticity at the $N \times N$ patch level rather than the whole image
  • L1 Reconstruction Loss + Adversarial Loss: Simultaneously pursues structural similarity and realism

$$\mathcal{L} = \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)$$

CycleGAN (2017)

Zhu, Park, Isola, Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks" (2017)

Pix2Pix had the major constraint of requiring paired data. CycleGAN learns translation between two domains using only unpaired data.

Core idea: Cycle Consistency Loss

Two Generators $G: X \rightarrow Y$ and $F: Y \rightarrow X$, and two Discriminators $D_X$, $D_Y$ are trained.

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]$$

The constraint is that translating an image from domain $X$ to $Y$ and back to $X$ should recover the original image. This enables learning meaningful mappings without paired data.
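The cycle term is just an L1 reconstruction loss applied in both directions. A minimal sketch (identity functions stand in for trained generators, so the loss is exactly zero):

```python
import torch

def cycle_loss(G, F, x, y):
    # L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, averaged over the batch
    return (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()

# With identity "generators" the cycle is perfect and the loss is zero
identity = lambda t: t
x = torch.randn(4, 3)
y = torch.randn(4, 3)
assert cycle_loss(identity, identity, x, y).item() == 0.0
```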

Applications: Converting horses to zebras, summer landscapes to winter, photographs to Monet-style paintings, etc.

6.7 BigGAN (2018): The Power of Scale

Brock, Donahue, Simonyan. "Large Scale GAN Training for High Fidelity Natural Image Synthesis" (2018)

BigGAN dramatically demonstrated that "scale matters in GAN training." It trained with 2-4x the parameters and 8x the batch size compared to prior work.

Key techniques:

  1. Class-Conditional Batch Normalization: Shares class embeddings to adjust the scale and bias of each Batch Normalization layer
  2. Truncation Trick: Truncates the distribution of latent vectors $z$ at inference time to control the quality-diversity tradeoff

$$z \sim \mathcal{N}(0, I) \rightarrow z' = \text{truncate}(z, \text{threshold})$$

  3. Orthogonal Regularization: Applies orthogonal regularization to Generator weights for training stability
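The truncation step from item 2 can be sketched as resampling any latent entry whose magnitude exceeds the threshold (BigGAN uses resampling from a truncated normal; a simple clamp is a common cruder variant):

```python
import torch

def truncate(z, threshold):
    """Resample latent entries with |z| > threshold until all lie inside.
    A sketch of BigGAN-style truncation; clamping is a simpler alternative."""
    mask = z.abs() > threshold
    while mask.any():
        z = torch.where(mask, torch.randn_like(z), z)
        mask = z.abs() > threshold
    return z

z = truncate(torch.randn(16, 128), threshold=0.5)
assert float(z.abs().max()) <= 0.5
```

Lower thresholds concentrate samples near the mode of the latent prior, trading diversity for per-sample quality.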

Results: Achieved IS 166.5 and FID 7.4 on ImageNet 128x128, vastly surpassing the previous best (IS 52.52, FID 18.6).

6.8 GigaGAN (2023): The Return of GAN?

Kang, Zhu, et al. "Scaling up GANs for Text-to-Image Synthesis" (2023)

At a time when Diffusion Models dominated image generation, GigaGAN demonstrated the potential of GAN once again as a 1B-parameter text-to-image GAN.

Key innovations:

  1. Adaptive Kernel Selection: Generates different convolution filters for each image. Determined by convex combination from a filter bank using the style vector
  2. Stable Attention: Computes attention scores based on L2 distance to guarantee Lipschitz continuity, and normalizes the attention weight matrix to unit variance
  3. Query-Key Tying: Shares Query and Key matrices for stability
  4. CLIP Text Encoder: Extracts text embeddings using a pre-trained CLIP model

Results and significance:

  • Surpassed Stable Diffusion v1.5, DALL-E 2, and Parti-750M in FID
  • 0.13 seconds for 512px image generation: Inference speed tens to hundreds of times faster than Diffusion models
  • Proved that GAN remains competitive in large-scale text-to-image synthesis

7. Implementing GAN in PyTorch

7.1 Basic GAN Implementation (MNIST)

Below is the most basic GAN implementation for the MNIST dataset in PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ============================================================
# Hyperparameter Settings
# ============================================================
LATENT_DIM = 100        # Latent vector dimension (dimension of z)
IMG_DIM = 28 * 28       # Flattened MNIST image dimension
HIDDEN_DIM = 256        # Hidden layer dimension
BATCH_SIZE = 64
EPOCHS = 200
LR = 0.0002
BETAS = (0.5, 0.999)   # Adam optimizer beta parameters
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================================================
# Generator Definition
# ============================================================
class Generator(nn.Module):
    """
    Takes a latent vector z as input and generates a fake image.
    Architecture: z(100) -> 256 -> 512 -> 1024 -> 784(28x28)
    """
    def __init__(self, latent_dim: int, img_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 2, hidden_dim * 4),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim * 4, img_dim),
            nn.Tanh(),  # Normalize output to [-1, 1] range
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


# ============================================================
# Discriminator Definition
# ============================================================
class Discriminator(nn.Module):
    """
    Takes an image as input and outputs the probability of being real/fake.
    Architecture: 784(28x28) -> 1024 -> 512 -> 256 -> 1
    """
    def __init__(self, img_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, hidden_dim * 4),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 4, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # Convert output to [0, 1] probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# ============================================================
# Data Loader Setup
# ============================================================
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # [0,1] -> [-1,1]
])

dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)


# ============================================================
# Model, Optimizer, and Loss Function Initialization
# ============================================================
G = Generator(LATENT_DIM, IMG_DIM, HIDDEN_DIM).to(DEVICE)
D = Discriminator(IMG_DIM, HIDDEN_DIM).to(DEVICE)

opt_G = optim.Adam(G.parameters(), lr=LR, betas=BETAS)
opt_D = optim.Adam(D.parameters(), lr=LR, betas=BETAS)

criterion = nn.BCELoss()  # Binary Cross Entropy


# ============================================================
# Training Loop
# ============================================================
for epoch in range(EPOCHS):
    d_loss_total, g_loss_total = 0.0, 0.0

    for batch_idx, (real_imgs, _) in enumerate(dataloader):
        real_imgs = real_imgs.view(-1, IMG_DIM).to(DEVICE)
        batch_size = real_imgs.size(0)

        # Real/fake labels
        real_labels = torch.ones(batch_size, 1, device=DEVICE)
        fake_labels = torch.zeros(batch_size, 1, device=DEVICE)

        # -----------------------------------------
        # Step 1: Train Discriminator
        # -----------------------------------------
        # Discriminate real images
        d_real = D(real_imgs)
        d_loss_real = criterion(d_real, real_labels)

        # Generate and discriminate fake images
        z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
        fake_imgs = G(z).detach()  # Block Generator's gradients
        d_fake = D(fake_imgs)
        d_loss_fake = criterion(d_fake, fake_labels)

        # Total Discriminator loss and update
        d_loss = d_loss_real + d_loss_fake
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # -----------------------------------------
        # Step 2: Train Generator
        # -----------------------------------------
        z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
        fake_imgs = G(z)
        d_fake = D(fake_imgs)

        # Non-saturating loss: Generator tries to maximize D(G(z))
        g_loss = criterion(d_fake, real_labels)
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

        d_loss_total += d_loss.item()
        g_loss_total += g_loss.item()

    # Per-epoch log output
    num_batches = len(dataloader)
    print(
        f"Epoch [{epoch+1}/{EPOCHS}] "
        f"D Loss: {d_loss_total/num_batches:.4f} | "
        f"G Loss: {g_loss_total/num_batches:.4f}"
    )
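Once training finishes, generation is a single forward pass. A small helper (the name `sample_images` is ours) that maps fresh latent vectors through the trained Generator and rescales the Tanh output back to `[0, 1]` for display or saving:

```python
import torch

@torch.no_grad()
def sample_images(generator, num_samples: int, latent_dim: int, device) -> torch.Tensor:
    """Draw fresh latent vectors and decode them with the trained Generator.

    Returns images shaped (num_samples, 1, 28, 28), rescaled from the Tanh
    range [-1, 1] to [0, 1].
    """
    generator.eval()
    z = torch.randn(num_samples, latent_dim, device=device)
    imgs = generator(z).view(num_samples, 1, 28, 28)
    return (imgs + 1) / 2  # [-1, 1] -> [0, 1]
```

With the models above this would be called as `sample_images(G, 64, LATENT_DIM, DEVICE)`, e.g. to save a grid with `torchvision.utils.save_image`.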

7.2 DCGAN Implementation (Key Parts)

A version with the Generator and Discriminator changed to convolutional architectures.

class DCGANGenerator(nn.Module):
    """
    DCGAN Generator: Generates images using Transposed Convolutions.
    z(100) -> 4x4x512 -> 8x8x256 -> 16x16x128 -> 32x32x64 -> 64x64x3
    """
    def __init__(self, latent_dim: int = 100, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: z (latent_dim x 1 x 1) -> (feature_map_size*8 x 4 x 4)
            nn.ConvTranspose2d(latent_dim, feature_map_size * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.ReLU(inplace=True),

            # (feature_map_size*8 x 4 x 4) -> (feature_map_size*4 x 8 x 8)
            nn.ConvTranspose2d(feature_map_size * 8, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.ReLU(inplace=True),

            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*2 x 16 x 16)
            nn.ConvTranspose2d(feature_map_size * 4, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.ReLU(inplace=True),

            # (feature_map_size*2 x 16 x 16) -> (feature_map_size x 32 x 32)
            nn.ConvTranspose2d(feature_map_size * 2, feature_map_size, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size),
            nn.ReLU(inplace=True),

            # (feature_map_size x 32 x 32) -> (channels x 64 x 64)
            nn.ConvTranspose2d(feature_map_size, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class DCGANDiscriminator(nn.Module):
    """
    DCGAN Discriminator: Judges authenticity using Strided Convolutions.
    (3 x 64 x 64) -> (64 x 32 x 32) -> (128 x 16 x 16) ->
    (256 x 8 x 8) -> (512 x 4 x 4) -> 1
    """
    def __init__(self, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # (channels x 64 x 64) -> (feature_map_size x 32 x 32)
            nn.Conv2d(channels, feature_map_size, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size x 32 x 32) -> (feature_map_size*2 x 16 x 16)
            nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*2 x 16 x 16) -> (feature_map_size*4 x 8 x 8)
            nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*8 x 4 x 4)
            nn.Conv2d(feature_map_size * 4, feature_map_size * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.LeakyReLU(0.2, inplace=True),

            # (feature_map_size*8 x 4 x 4) -> (1 x 1 x 1)
            nn.Conv2d(feature_map_size * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, 1)
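One detail the snippet above omits: the DCGAN paper initializes all weights from a zero-centered Normal with standard deviation 0.02. A common helper applied via `Module.apply` (the name `weights_init` is conventional, not from the paper):

```python
import torch.nn as nn

def weights_init(m: nn.Module) -> None:
    """DCGAN weight initialization: N(0, 0.02) for (transposed) conv layers,
    N(1.0, 0.02) for BatchNorm scales with zero bias."""
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm") != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0.0)

# Usage: G = DCGANGenerator(); G.apply(weights_init)
```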

7.3 WGAN-GP Core Loss Implementation

def compute_gradient_penalty(
    discriminator: nn.Module,
    real_samples: torch.Tensor,
    fake_samples: torch.Tensor,
    device: torch.device,
    lambda_gp: float = 10.0,
) -> torch.Tensor:
    """
    Computes the Gradient Penalty for WGAN-GP.

    Penalizes the Discriminator (Critic) so that the L2 norm of its gradient
    equals 1 at random interpolation points between real and generated data.
    """
    batch_size = real_samples.size(0)

    # Random interpolation coefficient
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device)

    # Interpolation between real and fake
    interpolated = (epsilon * real_samples + (1 - epsilon) * fake_samples).requires_grad_(True)

    # Critic output
    d_interpolated = discriminator(interpolated)

    # Gradient computation
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,
        retain_graph=True,
    )[0]

    # L2 norm of gradients
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)

    # Gradient Penalty: expectation of (||grad|| - 1)^2
    gradient_penalty = lambda_gp * ((gradient_norm - 1) ** 2).mean()

    return gradient_penalty


# WGAN-GP Training Loop (Key Parts)
def train_wgan_gp_step(
    G: nn.Module,
    D: nn.Module,
    opt_G: optim.Optimizer,
    opt_D: optim.Optimizer,
    real_imgs: torch.Tensor,
    latent_dim: int,
    device: torch.device,
    n_critic: int = 5,
):
    """One iteration of WGAN-GP training."""
    batch_size = real_imgs.size(0)

    # --- Critic (Discriminator) training: n_critic times ---
    for _ in range(n_critic):
        z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
        fake_imgs = G(z).detach()

        # Wasserstein Loss: maximize E[D(real)] - E[D(fake)]
        d_real = D(real_imgs).mean()
        d_fake = D(fake_imgs).mean()
        gp = compute_gradient_penalty(D, real_imgs, fake_imgs, device)

        d_loss = d_fake - d_real + gp  # Critic minimizes this

        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # --- Generator training: 1 time ---
    z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
    fake_imgs = G(z)
    g_loss = -D(fake_imgs).mean()  # Generator maximizes D(G(z))

    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
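Note that `train_wgan_gp_step` assumes a critic, not the Sigmoid-capped DCGAN Discriminator above: a WGAN-GP critic outputs an unbounded score, and the per-sample gradient penalty sits poorly with BatchNorm's cross-sample statistics (the WGAN-GP paper recommends layer normalization or no normalization in the critic). A minimal critic sketch for 64x64 inputs, using InstanceNorm as one possible substitute:

```python
import torch
import torch.nn as nn

class WGANCritic(nn.Module):
    """Minimal critic sketch for WGAN-GP on 64x64 images.

    No Sigmoid at the output (scores are unbounded), and BatchNorm is
    replaced with InstanceNorm so the gradient penalty stays per-sample.
    """
    def __init__(self, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, feature_map_size, 4, 2, 1),            # 64 -> 32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1),  # 32 -> 16
            nn.InstanceNorm2d(feature_map_size * 2, affine=True),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1),  # 16 -> 8
            nn.InstanceNorm2d(feature_map_size * 4, affine=True),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feature_map_size * 4, 1, 8, 1, 0),  # 8x8 -> unbounded scalar score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, 1)
```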

8. GAN vs Diffusion Models Comparison

Entering the 2020s, Diffusion Models (DDPM, Score-based models) emerged as a new paradigm in image generation. After Dhariwal and Nichol's 2021 paper "Diffusion Models Beat GANs on Image Synthesis," Diffusion Models became the mainstream of generative modeling through DALL-E 2, Stable Diffusion, Midjourney, and others. Let us systematically compare GAN and Diffusion Models.

8.1 Fundamental Comparison

| Aspect | GAN | Diffusion Model |
| --- | --- | --- |
| Training method | Adversarial training (minimax game) | Denoising score matching |
| Generation | Single forward pass | Iterative denoising (tens to hundreds of steps) |
| Probabilistic model | Implicit | Explicit |
| Loss function | Adversarial loss (+ auxiliary losses) | Simple MSE/L1 (noise prediction) |
| Distribution matching | $p_g \approx p_{data}$ via JSD/Wasserstein | $p_\theta(x_0) \approx p_{data}$ via ELBO |

8.2 Strengths and Weaknesses

GAN Strengths:

  • Inference speed: Generates images in a single forward pass. Suitable for real-time applications
  • Sample sharpness: Tends to produce sharp, realistic images through adversarial training
  • Latent space control: Semantic manipulation through a well-structured latent space
  • Lightweight: Can achieve high-quality generation with relatively few parameters

GAN Weaknesses:

  • Training instability: Mode collapse, training oscillation, etc.
  • Limited diversity: Mode collapse can restrict generation diversity
  • Scalability limitations: Does not scale as naturally to text-conditioned generation as Diffusion Models
  • Evaluation difficulty: Hard to monitor training progress with reliable metrics

Diffusion Model Strengths:

  • Training stability: Stable training with simple MSE loss
  • Sample diversity: Mode collapse is virtually nonexistent
  • Text-conditioned generation: Natural conditional generation through classifier-free guidance, etc.
  • Theoretical robustness: Explicit probabilistic model enabling likelihood computation

Diffusion Model Weaknesses:

  • Inference speed: Requires tens to hundreds of iterative denoising steps (being improved through distillation, etc.)
  • Computational cost: High compute requirements for both training and inference
  • Memory usage: Large U-Net parameters required for high-resolution generation

8.3 Convergence Characteristics

| Property | GAN | Diffusion Model |
| --- | --- | --- |
| Convergence guarantee | Nash equilibrium guaranteed only in theory | Stable convergence via ELBO optimization |
| Mode coverage | Risk of mode collapse | Excellent mode coverage |
| Training curve | Unstable, hard to interpret | Stable, loss directly interpretable |
| Hyperparameter sensitivity | High | Relatively low |

8.4 The 2025 Landscape

As of 2025, Diffusion Models dominate image generation. The most commercially successful image generation models -- Stable Diffusion, DALL-E 3, Midjourney -- are all Diffusion-based.

However, GAN has not been fully replaced. GAN still shows strength in the following areas:

  • Real-time generation: Video games, VR/AR, etc.
  • Image editing/manipulation: Precise face editing and attribute manipulation based on StyleGAN
  • Super-Resolution: Real-time super-resolution processing
  • GAN-Diffusion Hybrids: Combining GAN loss with Diffusion processes, or leveraging GAN's fast inference for Diffusion model distillation

The emergence of GigaGAN (2023) demonstrated that GAN can be competitive in large-scale text-to-image synthesis, and research combining the strengths of both paradigms is actively underway.


9. The Present and Future of GAN

9.1 GAN's Current Status

GAN stood at the center of generative modeling for roughly seven years after its 2014 publication, ceding the mainstream position to Diffusion Models after 2021. However, GAN's legacy and current role remain significant.

Fields where GAN is actively used today:

  1. Medical imaging: Widely used for augmenting training data while preserving patient privacy
  2. Data augmentation: Expanding small datasets to improve model performance
  3. Image editing and restoration: Face restoration, denoising, super-resolution, etc.
  4. Fashion and design: Virtual try-on, design prototyping
  5. Gaming and simulation: Real-time content generation, texture synthesis

9.2 GAN's Theoretical Legacy

GAN's greatest contribution extends beyond image generation technology.

Adversarial Training Paradigm: The adversarial training introduced by GAN has influenced diverse fields beyond generative models.

  • Adversarial Examples: Robustness research on deep learning models
  • Domain Adaptation: Knowledge transfer across domains using adversarial training
  • Self-supervised Learning: Self-supervised learning leveraging adversarial signals
  • Inverse Reinforcement Learning: Learning reward functions adversarially

Implicit Generative Models: GAN's core insight that complex distributions can be learned without defining explicit probability distributions has influenced the development of Energy-based Models, Score-based Models, and more.

9.3 Future Outlook

GAN-Diffusion Fusion: One of the most promising directions is combining the strengths of GAN and Diffusion Models. Research is underway to replace denoising steps in the Diffusion process with GANs to accelerate inference.

3D Generation: Research combining GAN with 3D representations (Neural Radiance Fields, 3D Gaussian Splatting, etc.) for 3D content generation is active. EG3D and GET3D are representative examples.

Video Generation: StyleGAN3's equivariant properties can naturally apply to video generation, with ongoing research in temporally consistent video generation.

Efficient Training: Research continues on high-quality generation model training with limited data through Few-shot GAN, transfer learning for GANs, and related approaches.

9.4 GAN Timeline Summary

| Year | Model | Key Contribution | Resolution |
| --- | --- | --- | --- |
| 2014 | GAN | Adversarial training framework | Low |
| 2014 | cGAN | Conditional generation | Low |
| 2015 | DCGAN | CNN-based architecture guidelines | 64x64 |
| 2017 | WGAN | Wasserstein distance | 64x64 |
| 2017 | WGAN-GP | Gradient penalty | 64x64 |
| 2017 | Pix2Pix | Paired image-to-image translation | 256x256 |
| 2017 | CycleGAN | Unpaired image-to-image translation | 256x256 |
| 2017 | ProGAN | Progressive growing | 1024x1024 |
| 2018 | BigGAN | Large-scale training, truncation trick | 512x512 |
| 2019 | StyleGAN | Mapping network, AdaIN, style separation | 1024x1024 |
| 2020 | StyleGAN2 | Weight demodulation, path regularization | 1024x1024 |
| 2021 | StyleGAN3 | Alias-free, equivariant generation | 1024x1024 |
| 2023 | GigaGAN | 1B-param text-to-image GAN | 512x512+ |

10. Conclusion

The GAN proposed by Ian Goodfellow in 2014 revolutionized the AI field with a simple yet powerful idea: "competition between two networks produces better generative models." The mathematical framework of the minimax game was both elegant and practical, spawning hundreds of variants over the following decade and dramatically advancing image generation quality.

DCGAN laid the practical foundation through its combination with CNNs, while WGAN solved training stability issues with the theoretical innovation of Wasserstein distance. The Progressive GAN and StyleGAN series enabled photorealistic image generation at 1024x1024 resolution, and CycleGAN and Pix2Pix pioneered the new application domain of image translation.

Although Diffusion Models have risen to prominence in generative modeling since 2021, GAN's legacy is immense. The adversarial training paradigm continues to be utilized across diverse fields, and hybrid research combining the strengths of GAN and Diffusion Models is actively progressing. As the emergence of GigaGAN demonstrates, the GAN story is far from over.

In the history of generative models, GAN will be remembered as the milestone that first demonstrated the possibility that "artificial intelligence can truly create."


References

  1. Goodfellow, I. J. et al. (2014). "Generative Adversarial Nets." NeurIPS 2014. arXiv:1406.2661

  2. Mirza, M. & Osindero, S. (2014). "Conditional Generative Adversarial Nets." arXiv:1411.1784

  3. Radford, A., Metz, L. & Chintala, S. (2015). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." arXiv:1511.06434

  4. Arjovsky, M., Chintala, S. & Bottou, L. (2017). "Wasserstein GAN." arXiv:1701.07875

  5. Gulrajani, I. et al. (2017). "Improved Training of Wasserstein GANs." arXiv:1704.00028

  6. Isola, P. et al. (2017). "Image-to-Image Translation with Conditional Adversarial Networks." CVPR 2017. arXiv:1611.07004

  7. Zhu, J.-Y. et al. (2017). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." ICCV 2017. arXiv:1703.10593

  8. Karras, T. et al. (2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation." ICLR 2018. arXiv:1710.10196

  9. Brock, A., Donahue, J. & Simonyan, K. (2018). "Large Scale GAN Training for High Fidelity Natural Image Synthesis." ICLR 2019. arXiv:1809.11096

  10. Karras, T., Laine, S. & Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks." CVPR 2019. arXiv:1812.04948

  11. Karras, T. et al. (2020). "Analyzing and Improving the Image Quality of StyleGAN." CVPR 2020. arXiv:1912.04958

  12. Karras, T. et al. (2021). "Alias-Free Generative Adversarial Networks." NeurIPS 2021. arXiv:2106.12423

  13. Kang, M. et al. (2023). "Scaling up GANs for Text-to-Image Synthesis." CVPR 2023. arXiv:2303.05511

  14. Dhariwal, P. & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. arXiv:2105.05233