- 1. Paper Overview and Historical Significance
- 2. The Core Idea of GAN
- 3. Mathematical Foundations
- 4. Training Algorithm
- 5. Core Problems of GAN
- 6. The Complete GAN Lineage
- 6.1 DCGAN (2015): The Beginning of Stable CNN-based Training
- 6.2 WGAN (2017): Introduction of Wasserstein Distance
- 6.3 WGAN-GP (2017): Gradient Penalty
- 6.4 Progressive GAN (2017): Gradual Resolution Increase
- 6.5 StyleGAN Series (2019-2021): The Pinnacle of Style-based Generation
- 6.6 Conditional GAN, Pix2Pix, CycleGAN
- 6.7 BigGAN (2018): The Power of Scale
- 6.8 GigaGAN (2023): The Return of GAN?
- 7. Implementing GAN in PyTorch
- 8. GAN vs Diffusion Models Comparison
- 9. The Present and Future of GAN
- 10. Conclusion
- References
1. Paper Overview and Historical Significance
1.1 Paper Information
"Generative Adversarial Nets" was published at NeurIPS 2014 (then known as NIPS), co-authored by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. According to the now-legendary anecdote, Goodfellow conceived the idea while discussing generative models with colleagues at a bar in Montreal. He went home that night, coded it up, and the first prototype worked right away.
The core idea of this paper is remarkably intuitive: A counterfeiter (Generator) and a police officer (Discriminator) compete against each other. The counterfeiter produces increasingly sophisticated forgeries, while the police officer develops ever sharper detection skills. When this adversarial process converges, the counterfeiter produces bills indistinguishable from genuine ones.
1.2 Historical Context: The Generative Model Landscape of 2014
Before GAN appeared, the dominant approaches in generative modeling were as follows.
Variational Autoencoder (VAE, 2013): Proposed by Kingma and Welling, VAE introduced probabilistic latent variables into an Encoder-Decoder architecture to learn data distributions. However, optimizing the ELBO (Evidence Lower Bound) resulted in blurry generated images.
Boltzmann Machine Family: Deep Boltzmann Machines, Restricted Boltzmann Machines, and similar energy-based models were theoretically elegant but relied on MCMC (Markov Chain Monte Carlo) sampling, making training slow and scalability limited.
Autoregressive Models: Models like PixelRNN (2016) generated pixels one at a time sequentially. They could produce high-quality samples but generation speed was extremely slow.
GAN broke through all these limitations at once. It could generate high-quality samples without defining an explicit probability distribution, and could generate samples instantly in a single forward pass without Markov chains or sequential generation processes. This represented a paradigm shift in the field of generative models.
1.3 Impact
The GAN paper has been cited approximately 65,000 times as of 2024, and hundreds of GAN variants have been proposed over the following decade. Yann LeCun praised GAN as "the most interesting idea in the last 20 years in machine learning." GAN has been applied to countless domains including image generation, super-resolution, style transfer, data augmentation, and drug discovery. It reigned as the dominant paradigm in generative modeling until the emergence of Diffusion Models.
2. The Core Idea of GAN
2.1 Two-Player Game: Generator vs Discriminator
The GAN framework consists of two neural networks competing against each other.
Generator (G): Takes a random noise vector $z \sim p_z(z)$ as input and generates fake data $G(z)$. The Generator's goal is to produce samples similar enough to real data to fool the Discriminator.
Discriminator (D): Determines whether the input data comes from the real data distribution ($p_{data}$) or is fake, produced by the Generator ($p_g$). The output $D(x)$ is a probability value between 0 and 1, where closer to 1 means the input is judged as real.
These two networks have opposing objectives:
- Generator: Tries to maximize $D(G(z))$ (making the Discriminator classify fakes as real)
- Discriminator: Tries to assign high probability to real data and low probability to fake data
2.2 Intuitive Analogy
The GAN training process can be understood through an art market analogy.
| Component | Analogy | Role |
|---|---|---|
| Generator | Art forger | Goal is to create forgeries indistinguishable from originals |
| Discriminator | Art appraiser | Goal is to distinguish originals from forgeries |
| Training Data | Authentic artworks | Samples from the real data distribution |
| Noise Vector | Artist's inspiration | A random point in the latent space |
Initially, the forger's skills are poor, so the appraiser easily identifies forgeries. But the forger improves through the appraiser's feedback (gradients), and the appraiser also enhances detection capabilities to counter increasingly sophisticated forgeries. When this competition progresses sufficiently, the forger produces works indistinguishable from originals.
2.3 Minimax Game Formulation
The GAN training objective is formalized as the following minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Let us analyze each term of this value function $V(D, G)$.
First term: $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$
This is the Discriminator's judgment on real data $x$. The Discriminator tries to maximize this value, aiming for $D(x) \to 1$ (judging real as real). The Generator has no influence on this term.
Second term: $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
This is the Discriminator's judgment on fake data $G(z)$ produced by the Generator.
- The Discriminator tries to maximize this: $D(G(z)) \to 0$ (judging fake as fake) gives $\log(1 - D(G(z))) \to \log 1 = 0$, the maximum
- The Generator tries to minimize this: $D(G(z)) \to 1$ (judging fake as real) gives $\log(1 - D(G(z))) \to -\infty$, the minimum
This is precisely where the name Adversarial comes from. Two players optimize the same value function in opposite directions.
3. Mathematical Foundations
3.1 Optimal Discriminator
Let us derive the optimal Discriminator $D^*_G$ for a fixed Generator $G$. Converting the value function to integral form using the definition of expectation:

$$V(G, D) = \int_x p_{data}(x) \log D(x)\,dx + \int_z p_z(z) \log(1 - D(G(z)))\,dz$$

where $p_g$ is the distribution of data generated by the Generator. Combining into a single integral:

$$V(G, D) = \int_x \left[\, p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \,\right] dx$$

Differentiating the integrand with respect to $D(x)$ and setting it to zero:

$$\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0$$

Solving for $D(x)$, the optimal discriminator is:

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

This result is intuitively sound. If the probability density of a data point being real is $p_{data}(x)$ and being fake is $p_g(x)$, the optimal discriminator exactly matches the posterior probability from Bayes' rule.
Key observation: When $p_g = p_{data}$, i.e., when the Generator has perfectly learned the real data distribution, $D^*_G(x) = \frac{1}{2}$ for all $x$. The Discriminator can no longer distinguish real from fake at all.
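The closed-form optimum can be sanity-checked numerically. The sketch below (an illustration with hypothetical density values, not code from the paper) fixes the densities $p_{data}(x)$ and $p_g(x)$ at a single point and confirms that the integrand $p_{data}\log D + p_g\log(1-D)$ peaks at $D^* = p_{data}/(p_{data}+p_g)$:

```python
import numpy as np

# Densities of the real and generated distributions at one point x
# (hypothetical values for illustration)
p_data, p_g = 0.7, 0.3

# Integrand of V(G, D) at this point, as a function of D(x)
def f(d):
    return p_data * np.log(d) + p_g * np.log(1 - d)

d_grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
d_numeric = d_grid[np.argmax(f(d_grid))]       # grid-search maximizer
d_star = p_data / (p_data + p_g)               # closed-form optimum

print(d_star)                                  # ~0.7
print(abs(d_numeric - d_star) < 1e-3)          # True
```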
3.2 Relationship with Jensen-Shannon Divergence
Substituting the optimal discriminator $D^*_G$ into the value function:

$$C(G) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right]$$

Simplifying:

$$C(G) = -\log 4 + 2 \cdot JSD(p_{data} \,\|\, p_g)$$

where $JSD$ is the Jensen-Shannon Divergence, defined as:

$$JSD(P \,\|\, Q) = \frac{1}{2} KL\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} KL\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$$

JSD is a symmetrized version of KL Divergence and is always bounded: $0 \le JSD(P \,\|\, Q) \le \log 2$. $JSD = 0$ occurs if and only if $P = Q$, i.e., when the two distributions are completely identical.
3.3 Proof of Global Optimality
Theorem (Goodfellow et al., 2014): The global minimum of $C(G)$ is achieved if and only if $p_g = p_{data}$, at which point $C(G) = -\log 4$.
Proof:
(1) $C(G) = -\log 4 + 2 \cdot JSD(p_{data} \,\|\, p_g)$
(2) $JSD(p_{data} \,\|\, p_g) \ge 0$ (non-negativity of JSD)
(3) $JSD(p_{data} \,\|\, p_g) = 0 \iff p_{data} = p_g$
(4) Therefore $C(G) \ge -\log 4$, with equality if and only if $p_g = p_{data}$
This provides the theoretical guarantee for GAN training. Given a Generator and Discriminator with sufficient capacity, at the Nash equilibrium of the minimax game, the Generator perfectly recovers the real data distribution.
3.4 Nash Equilibrium
From a game-theoretic perspective, GAN training is the problem of finding a Nash equilibrium between two players. A Nash equilibrium is a state where neither player can benefit by unilaterally changing their strategy while the other player's strategy remains fixed.
The Nash equilibrium in GAN is:
- $G^*$: A Generator that achieves $p_g = p_{data}$
- $D^*$: A Discriminator that outputs $D^*(x) = \frac{1}{2}$ for all $x$
Theoretically, this equilibrium point exists (and the optimal distribution $p_g = p_{data}$ is unique), but finding it in practice is very difficult. It is a non-convex game where two networks must be optimized simultaneously. This is the fundamental difficulty of GAN training and became the starting point for numerous subsequent studies.
3.5 KL Divergence vs JS Divergence
Why JSD specifically? Let us compare with KL Divergence.
Problems with KL Divergence:
KL Divergence is asymmetric, and $KL(p_{data} \,\|\, p_g)$ diverges to infinity in regions where $p_{data}(x) > 0$ but $p_g(x) = 0$. This becomes problematic when the Generator's distribution fails to sufficiently cover the real distribution early in training.
Advantages of JS Divergence:
- Symmetric: $JSD(P \,\|\, Q) = JSD(Q \,\|\, P)$
- Always finite: $0 \le JSD(P \,\|\, Q) \le \log 2$
- Computes KL with respect to the mixture distribution $M = \frac{P + Q}{2}$, so it does not diverge even when one distribution is zero
However, JSD is not perfect either. When the supports of the two distributions do not overlap, JSD becomes the constant $\log 2$, making the gradient zero. This is the root cause of the vanishing gradient problem in GAN training, and the key motivation for WGAN's introduction of Wasserstein distance.
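A small numerical illustration of this failure mode (toy discrete distributions, assuming only numpy): for two distributions with disjoint supports, KL blows up while JSD sits exactly at its ceiling of $\log 2$, where it carries no useful gradient:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture M = (p+q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two distributions with disjoint support over 4 outcomes
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]

print(kl(p, q))    # huge (diverges as eps -> 0)
print(jsd(p, q))   # log 2 ~ 0.6931 -- saturated, no gradient signal
```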
4. Training Algorithm
4.1 Training Procedure
The training algorithm proposed in the original paper is as follows:
Algorithm 1: GAN Training (Goodfellow et al., 2014)
for number of training iterations do
# --- Step 1: Discriminator update (k steps) ---
for k steps do
- Sample m noise samples {z^(1), ..., z^(m)} from noise prior p_z(z)
- Sample m real samples {x^(1), ..., x^(m)} from data distribution p_data(x)
- Update Discriminator parameters by stochastic gradient ascent:
nabla_{theta_d} (1/m) sum_{i=1}^{m} [log D(x^(i)) + log(1 - D(G(z^(i))))]
end for
# --- Step 2: Generator update (1 step) ---
- Sample m noise samples {z^(1), ..., z^(m)} from noise prior p_z(z)
- Update Generator parameters by stochastic gradient descent:
nabla_{theta_g} (1/m) sum_{i=1}^{m} log(1 - D(G(z^(i))))
end for
4.2 Alternating Optimization
The key is alternating optimization. The Discriminator and Generator are updated in turn.
Why update the Discriminator k times before updating the Generator once:
Theoretically, the optimal discriminator must be found before updating the Generator to obtain the correct gradient direction. Since fully optimizing $D$ is infeasible in practice, it is approximated with $k$ gradient steps. The original paper used $k = 1$ as the default.
Importance of maintaining balance:
- If the Discriminator becomes too strong: The Generator's gradients vanish and training stalls
- If the Discriminator becomes too weak: It fails to provide useful learning signals to the Generator
- Ideally, the Discriminator and Generator should advance at comparable levels
4.3 Non-Saturating Loss (Practical Modification)
In the theoretical minimax objective, the Generator's goal is to minimize $\log(1 - D(G(z)))$. However, early in training when the Generator is very poor, $D(G(z)) \approx 0$, so $\log(1 - D(G(z))) \approx 0$ saturates, resulting in near-zero gradients.
Goodfellow addressed this by modifying the Generator's objective:
Original (Minimax): $\min_G \; \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
Modified (Non-Saturating): $\max_G \; \mathbb{E}_{z \sim p_z}[\log D(G(z))]$
Both objectives share the same fixed point (Nash equilibrium), but the gradient magnitudes differ significantly early in training. The non-saturating loss provides strong gradients even when $D(G(z))$ is small, enabling the Generator to learn quickly.
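This difference is easiest to see through the Discriminator's logit. Writing $D = \sigma(a)$ for a sigmoid Discriminator, the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a)$, which vanishes when $D(G(z)) \approx 0$, while the derivative of $\log \sigma(a)$ is $1 - \sigma(a)$, which stays near 1. A minimal check:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# a = Discriminator logit on a fake sample; early in training D is
# confident the sample is fake, so a is very negative and D(G(z)) ~ 0.
a = -8.0
d = sigmoid(a)

# d/da of the two Generator objectives (the gradient flowing into G):
grad_minimax = -sigmoid(a)      # d/da log(1 - sigmoid(a)) -> vanishes
grad_nonsat = 1.0 - sigmoid(a)  # d/da log(sigmoid(a))     -> stays ~1

print(d)             # ~0.0003
print(grad_minimax)  # ~ -0.0003 (vanishing)
print(grad_nonsat)   # ~  0.9997 (strong)
```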
4.4 Experimental Results in the Original Paper
The original paper conducted experiments on MNIST, Toronto Face Database (TFD), and CIFAR-10 datasets. Evaluation used Parzen window-based log-likelihood estimation, and GAN showed competitive performance compared to Deep Boltzmann Machines and Stacked Denoising Autoencoders.
However, by today's standards, the results were quite rudimentary. Both the Generator and Discriminator used simple MLPs (Multi-Layer Perceptrons), and the resolution and quality of generated images were limited. The true breakthroughs came through subsequent architectural improvements and training technique advancements.
5. Core Problems of GAN
5.1 Mode Collapse
The most notorious problem of GAN is Mode Collapse. This occurs when the Generator fails to learn all the modes (diverse patterns) of the data distribution and instead focuses on a small subset, repeatedly generating similar outputs.
Mechanism:
When the Generator discovers a few patterns that are particularly effective at fooling the Discriminator, it repeatedly generates those patterns instead of exploring diverse alternatives. For example, when training on MNIST, the Generator might perfectly generate only the digit '1' while failing to generate any other digits.
Mathematical interpretation:
Mode collapse is related to the swap from the minimax to the maximin game:

$$\min_G \max_D V(G, D) \quad \text{vs.} \quad \max_D \min_G V(G, D)$$
In the theoretical minimax, the Generator must defend against all possible Discriminators, requiring it to cover the entire distribution. However, in actual training, the Generator only needs to fool the current Discriminator, making it a "rational" strategy to focus on specific modes.
5.2 Training Instability
GAN training is inherently the problem of finding a Nash equilibrium in a non-cooperative game. This is far more difficult than a simple optimization problem.
Oscillation problem: The Generator and Discriminator frequently oscillate around each other without converging. In a typical loss landscape, gradient descent finds local minima, but simultaneous gradient updates in a minimax game can orbit around the equilibrium instead of converging to it.
Difficulty of training balance: If the Discriminator converges too quickly, the Generator cannot learn; conversely, if the Discriminator is too weak, it fails to convey meaningful learning signals to the Generator. Maintaining this delicate balance was the greatest practical challenge in GAN training.
5.3 Vanishing Gradients
As explained in Section 3.5, JS Divergence becomes the constant $\log 2$ when the supports of the two distributions do not overlap, resulting in zero gradients.
In high-dimensional data (e.g., images), both the real data distribution and the Generator's distribution exist on low-dimensional manifolds within the high-dimensional space. The probability of these two manifolds overlapping is very low, so it is typical for the supports of the two distributions to barely overlap early in training. In this situation, JSD-based GAN provides no useful gradients at all.
5.4 Evaluation Challenges
Objectively evaluating GAN performance is itself a very challenging problem. The main evaluation metrics are:
Inception Score (IS): Measures the quality (sharpness) and diversity of generated images. Uses a pre-trained Inception network -- high scores indicate that individual images have confident class predictions (quality) while the overall distribution covers diverse classes (diversity).
Frechet Inception Distance (FID): Measures the Frechet distance between the Inception feature distributions of real and generated data. Lower is better. Widely used as a more reliable metric than IS.

$$FID = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features for real and generated images, respectively.
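As a toy illustration of the FID formula (not the real pipeline, which estimates these statistics from Inception features of thousands of images), the sketch below assumes diagonal covariances so the matrix square root reduces to an elementwise square root:

```python
import numpy as np

def fid_diagonal(mu_r, sig_r, mu_g, sig_g):
    """FID between two Gaussians with DIAGONAL covariances.

    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2});
    with diagonal covariances the matrix square root is elementwise.
    """
    mu_r, mu_g = np.asarray(mu_r, float), np.asarray(mu_g, float)
    sig_r, sig_g = np.asarray(sig_r, float), np.asarray(sig_g, float)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(sig_r + sig_g - 2.0 * np.sqrt(sig_r * sig_g))
    return float(mean_term + cov_term)

# Identical distributions -> FID = 0
print(fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1]))  # 0.0
# Shifted mean, same covariance -> FID = squared distance between means
print(fid_diagonal([0, 0], [1, 1], [3, 4], [1, 1]))  # 25.0
```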
6. The Complete GAN Lineage
6.1 DCGAN (2015): The Beginning of Stable CNN-based Training
Radford, Metz, Chintala. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" (2015)
The original GAN used only MLPs (Fully Connected Layers), failing to leverage CNN's powerful spatial feature extraction capabilities for image generation. DCGAN (Deep Convolutional GAN) was the first architecture to successfully integrate CNNs into GANs, establishing several architectural guidelines for stable training.
DCGAN's key architectural rules:
- Remove pooling layers: Use strided convolutions (Discriminator) and fractional-strided / transposed convolutions (Generator) instead of max pooling
- Apply Batch Normalization: Apply to both Generator and Discriminator, except for the Generator's output layer and the Discriminator's input layer
- Remove fully connected layers: Use global average pooling or direct convolutional connections
- Activation functions: Generator uses Tanh for the output layer and ReLU elsewhere. Discriminator uses LeakyReLU for all layers
DCGAN Generator Architecture (Conceptual):
z (100-dim) -> FC -> Reshape (4x4x1024) -> ConvT -> BN -> ReLU (8x8x512)
-> ConvT -> BN -> ReLU (16x16x256) -> ConvT -> BN -> ReLU (32x32x128)
-> ConvT -> Tanh (64x64x3)
Beyond simply generating good images, DCGAN demonstrated that the learned latent space possesses meaningful structure. The famous demonstration showed that vector arithmetic in latent space corresponds to semantic transformations: for example, $\langle\text{man with glasses}\rangle - \langle\text{man}\rangle + \langle\text{woman}\rangle$ yields images of a woman with glasses.
6.2 WGAN (2017): Introduction of Wasserstein Distance
Arjovsky, Chintala, Bottou. "Wasserstein GAN" (2017)
WGAN is one of the most important theoretical advances in GAN, introducing Wasserstein distance (Earth Mover's distance) to address the fundamental limitations of JS Divergence.
Wasserstein Distance (EM Distance):

$$W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]$$

where $\Pi(p_r, p_g)$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are $p_r$ and $p_g$. Intuitively, it is the minimum cost of "moving dirt" to transform one distribution into another.
Key advantages of Wasserstein Distance:
Unlike JSD, it provides a continuous and differentiable distance even when the supports of the two distributions do not overlap. For example, consider two point-mass distributions $P_0 = \delta_0$ and $P_\theta = \delta_\theta$ ($\theta \ne 0$): $JSD(P_0 \,\|\, P_\theta) = \log 2$ regardless of $\theta$, while $W(P_0, P_\theta) = |\theta|$ shrinks smoothly as the two distributions approach each other.
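This smoothness is easy to verify in one dimension, where $W_1$ between two equal-size empirical samples reduces to the mean absolute difference of the sorted samples. The helper below (`w1_empirical` is an illustrative name, not the WGAN estimator) shows that shifting a distribution by $\theta$ gives $W_1 = |\theta|$, varying smoothly with $\theta$:

```python
import numpy as np

def w1_empirical(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with equal sample counts: mean |difference| of sorted samples."""
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)

# Shifting the whole sample by theta gives W1 = |theta| -- a smooth,
# informative signal even for arbitrarily small shifts.
for theta in [0.0, 0.1, 1.0]:
    print(round(w1_empirical(z, z + theta), 3))  # 0.0, 0.1, 1.0
```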
Kantorovich-Rubinstein Duality:
Since directly computing the Wasserstein distance is intractable, the Kantorovich-Rubinstein duality is leveraged:

$$W(p_r, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

where the supremum is taken over all 1-Lipschitz functions $f$. WGAN trains the Discriminator (now called the Critic) to approximate this 1-Lipschitz function.
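In practice the duality translates into very simple losses: the critic maximizes the score gap between real and fake samples, and the generator maximizes the critic's score on fakes. A sketch with illustrative function names (the Lipschitz constraint is enforced separately, e.g. by weight clipping or a gradient penalty):

```python
import numpy as np

def wgan_critic_loss(d_real, d_fake):
    """WGAN critic objective to MINIMIZE: -(E[D(x)] - E[D(G(z))]).
    The critic outputs unbounded real-valued scores, not probabilities."""
    return float(-(np.mean(d_real) - np.mean(d_fake)))

def wgan_generator_loss(d_fake):
    """WGAN generator objective to MINIMIZE: -E[D(G(z))]."""
    return float(-np.mean(d_fake))

# A critic that scores real samples above fakes yields a negative loss
critic_loss = wgan_critic_loss(np.array([3.0, 2.0]), np.array([-1.0, 0.0]))
gen_loss = wgan_generator_loss(np.array([-1.0, 0.0]))
print(critic_loss)  # -3.0
print(gen_loss)     # 0.5
```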
Weight Clipping: The original WGAN enforced the Lipschitz constraint by clipping critic weights to the range $[-c, c]$ (with $c = 0.01$ in the paper). However, this severely limited the critic's representational power and could cause training instability.
6.3 WGAN-GP (2017): Gradient Penalty
Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville. "Improved Training of Wasserstein GANs" (2017)
To address weight clipping's problems, Gradient Penalty (GP) was proposed. Instead of directly enforcing the Lipschitz constraint, it regularizes the critic's gradient norm to stay close to 1:

$$L = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]$$

where $\hat{x}$ is a random interpolation between real and generated data:

$$\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]$$
WGAN-GP uses $\lambda = 10$ and $n_{critic} = 5$ critic updates per generator update as defaults, training stably across diverse architectures with minimal hyperparameter tuning.
6.4 Progressive GAN (2017): Gradual Resolution Increase
Karras, Aila, Laine, Lehtinen. "Progressive Growing of GANs for Improved Quality, Stability, and Variation" (2017)
Progressive GAN (ProGAN), proposed by the NVIDIA research team, opened new horizons in high-resolution image generation. The core idea is to start training the Generator and Discriminator at low resolution and progressively add layers to increase resolution.
Training process:
Phase 1: Train G and D at 4x4 resolution
Phase 2: Add 8x8 layers with gradual fade-in transition
Phase 3: Add 16x16 layers
...
Phase N: Reach final 1024x1024 resolution
Fade-in mechanism: When adding new layers, the outputs of existing and new layers are combined via weighted averaging. The weight gradually increases from 0 to 1, progressively activating the new layer.
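The fade-in itself is just a convex combination of the two paths' outputs. A minimal numpy sketch (toy tensors, not the actual ProGAN code):

```python
import numpy as np

def fade_in(x_old, x_new, alpha):
    """ProGAN-style fade-in: blend the (upsampled) output of the existing
    layer stack with the output of the newly added layer.
    alpha ramps from 0 (only old path) to 1 (only new path)."""
    return (1.0 - alpha) * x_old + alpha * x_new

# Toy 8x8 outputs from the two paths
old = np.zeros((8, 8))  # upsampled 4x4 output (existing path)
new = np.ones((8, 8))   # output of the newly added 8x8 layer

print(fade_in(old, new, 0.0).mean())  # 0.0 -- new layer inactive
print(fade_in(old, new, 0.5).mean())  # 0.5 -- halfway through transition
print(fade_in(old, new, 1.0).mean())  # 1.0 -- new layer fully active
```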
Key contributions:
- Dramatically improved training stability: Learning coarse structure at low resolution first, then gradually adding fine details makes training much more stable
- Achieved 1024x1024 resolution: First successful generation of photorealistic face images at 1024x1024 resolution on the CelebA-HQ dataset
- Minibatch standard deviation: Introduced a technique using within-minibatch statistics to increase diversity
6.5 StyleGAN Series (2019-2021): The Pinnacle of Style-based Generation
StyleGAN (2019)
Karras, Laine, Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks" (2019)
StyleGAN is a revolutionary architecture that combines Progressive GAN's progressive training with the style separation concepts from Neural Style Transfer.
Key structural changes:
Mapping Network: Transforms the input latent vector $z \in \mathcal{Z}$ through a nonlinear mapping network to an intermediate latent space $w \in \mathcal{W}$. Consists of 8 FC layers.
Adaptive Instance Normalization (AdaIN): Injects style vectors from $\mathcal{W}$ space into each convolution layer:

$$\mathrm{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $y_{s,i}$ and $y_{b,i}$ are scale and bias obtained via a learned affine transformation of the style vector $w$.
Constant Input: Uses a learnable constant tensor (4x4x512) as the Generator's input. Style is injected solely through AdaIN.
Noise Injection: Adds per-pixel noise after each convolution layer to control stochastic variation (e.g., hair position, pores, etc.).
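The AdaIN operation can be sketched in a few lines of numpy: normalize each channel of each sample to zero mean and unit standard deviation over its spatial dimensions, then rescale and shift with the style-derived parameters (here passed in directly rather than produced by a learned affine layer):

```python
import numpy as np

def adain(x, y_scale, y_bias, eps=1e-8):
    """Adaptive Instance Normalization.

    x: (N, C, H, W) feature maps; y_scale, y_bias: (N, C) style
    parameters (in StyleGAN, a learned affine map of the style vector w).
    Each channel of each sample is normalized over its spatial axes,
    then rescaled and shifted by the style."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    return y_scale[:, :, None, None] * x_norm + y_bias[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 4, 8, 8))
out = adain(x, y_scale=np.full((2, 4), 2.0), y_bias=np.full((2, 4), 1.0))

# After AdaIN each channel has mean ~= bias (1.0) and std ~= scale (2.0)
print(round(out[0, 0].mean(), 3), round(out[0, 0].std(), 3))  # 1.0 2.0
```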
Style hierarchy:
| Resolution Layer | Controlled Attributes |
|---|---|
| $4^2$–$8^2$ (Coarse) | Pose, face shape, presence of glasses |
| $16^2$–$32^2$ (Middle) | Facial features, hairstyle, eye openness |
| $64^2$–$1024^2$ (Fine) | Color, fine structure, background details |
StyleGAN2 (2020)
Karras, Laine, Aittala, Hellsten, Lehtinen, Aila. "Analyzing and Improving the Image Quality of StyleGAN" (2020)
StyleGAN2 resolved several artifacts in StyleGAN and significantly improved image quality.
Key improvements:
- Weight Demodulation: Replaces AdaIN to eliminate blob artifacts. Solves the problem where AdaIN's instance normalization destroys relative magnitude information within feature maps
- Removal of Progressive Growing: Achieves stable high-resolution training without progressive growing by using skip connections and residual connections
- Path Length Regularization: Improves smoothness of the latent space so that small changes in latent vectors produce proportional changes in images
- Lazy Regularization: Applies regularization every 16 steps instead of every step for improved efficiency
StyleGAN2-ADA: Introduced Adaptive Discriminator Augmentation to train without overfitting even with limited data. Enabled high-quality generation from datasets as small as a few thousand images.
StyleGAN3 (2021)
Karras, Aittala, Laine, et al. "Alias-Free Generative Adversarial Networks" (2021)
StyleGAN3 addressed a fundamental signal processing issue.
Problem: In StyleGAN2, fine details in generated images appeared "stuck" to image coordinates. When the camera should move, textures did not move with objects but remained fixed -- an aliasing problem.
Solution: Redesigned all signals within the network to be processed in the continuous domain, fundamentally eliminating aliasing from discrete sampling.
Key changes:
- Fourier feature-based input replacement
- Guaranteed continuous equivariant operations
- Achieved full equivariance to translation and rotation
- FID comparable to StyleGAN2 while having fundamentally different internal representations
StyleGAN3 laid the foundation for better suitability in video generation and animation.
6.6 Conditional GAN, Pix2Pix, CycleGAN
Conditional GAN (cGAN, 2014)
Mirza, Osindero. "Conditional Generative Adversarial Nets" (2014)
The original GAN cannot control what data is generated. Conditional GAN provides additional conditioning information (e.g., class labels) to both the Generator and Discriminator, enabling conditional generation of data with desired attributes.
Pix2Pix (2017)
Isola, Zhu, Zhou, Efros. "Image-to-Image Translation with Conditional Adversarial Networks" (2017)
Pix2Pix is an image-to-image translation framework using paired image data. It solved diverse tasks -- colorizing grayscale photos, converting satellite images to maps, transforming sketches to photos -- within a unified framework.
Key components:
- U-Net Generator: Encoder-Decoder architecture with skip connections
- PatchGAN Discriminator: Judges authenticity at the patch level rather than the whole image
- L1 Reconstruction Loss + Adversarial Loss: Simultaneously pursues structural similarity and realism
CycleGAN (2017)
Zhu, Park, Isola, Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks" (2017)
Pix2Pix had the major constraint of requiring paired data. CycleGAN learns translation between two domains using only unpaired data.
Core idea: Cycle Consistency Loss
Two Generators $G: X \to Y$ and $F: Y \to X$, and two Discriminators $D_Y$, $D_X$ are trained.

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\|G(F(y)) - y\|_1\right]$$

The constraint is that translating an image from domain $X$ to $Y$ and back to $X$ should recover the original image. This enables learning meaningful mappings without paired data.
Applications: Converting horses to zebras, summer landscapes to winter, photographs to Monet-style paintings, etc.
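The cycle-consistency constraint itself is just a pair of L1 reconstruction terms. A minimal sketch with toy invertible "generators" standing in for the two translation networks:

```python
import numpy as np

def cycle_consistency_loss(x, y, g, f):
    """L1 cycle-consistency loss: x -> g(x) -> f(g(x)) should recover x,
    and y -> f(y) -> g(f(y)) should recover y. Here g and f are arbitrary
    callables standing in for the two generator networks."""
    forward = np.mean(np.abs(f(g(x)) - x))   # X -> Y -> X
    backward = np.mean(np.abs(g(f(y)) - y))  # Y -> X -> Y
    return float(forward + backward)

# Toy "generators": perfectly invertible mappings give ~zero cycle loss
g = lambda x: x * 2.0 + 1.0
f = lambda y: (y - 1.0) / 2.0

x = np.linspace(-1.0, 1.0, 100)
y = np.linspace(0.0, 3.0, 100)
print(cycle_consistency_loss(x, y, g, f))  # ~0 (perfect inversion)
```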
6.7 BigGAN (2018): The Power of Scale
Brock, Donahue, Simonyan. "Large Scale GAN Training for High Fidelity Natural Image Synthesis" (2018)
BigGAN dramatically demonstrated that "scale matters in GAN training." It trained with 2-4x the parameters and 8x the batch size compared to prior work.
Key techniques:
- Class-Conditional Batch Normalization: Shares class embeddings to adjust the scale and bias of each Batch Normalization layer
- Truncation Trick: Truncates the distribution of latent vectors at inference time to control the quality-diversity tradeoff
- Orthogonal Regularization: Applies orthogonal regularization to Generator weights for training stability
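The truncation trick can be sketched as simple resampling of out-of-range latent coordinates (an illustration only; BigGAN applies this at inference to a model trained with orthogonal regularization):

```python
import numpy as np

def truncated_z(batch, dim, threshold, rng):
    """Truncation trick: resample each coordinate of z ~ N(0, 1) that
    falls outside [-threshold, threshold]. Smaller thresholds pull
    samples toward the mode: higher fidelity, lower diversity."""
    z = rng.normal(size=(batch, dim))
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.normal(size=mask.sum())
        mask = np.abs(z) > threshold
    return z

rng = np.random.default_rng(0)
z = truncated_z(batch=64, dim=128, threshold=0.5, rng=rng)
print(np.abs(z).max() <= 0.5)  # True -- all coordinates within range
print(z.std() < 1.0)           # True -- reduced spread, hence diversity
```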
Results: Achieved IS 166.5 and FID 7.4 on ImageNet 128x128, vastly surpassing the previous best (IS 52.52, FID 18.6).
6.8 GigaGAN (2023): The Return of GAN?
Kang, Zhu, et al. "Scaling up GANs for Text-to-Image Synthesis" (2023)
At a time when Diffusion Models dominated image generation, GigaGAN demonstrated the potential of GAN once again as a 1B-parameter text-to-image GAN.
Key innovations:
- Adaptive Kernel Selection: Generates different convolution filters for each image. Determined by convex combination from a filter bank using the style vector
- Stable Attention: Computes attention scores based on L2 distance to guarantee Lipschitz continuity, and normalizes the attention weight matrix to unit variance
- Query-Key Tying: Shares Query and Key matrices for stability
- CLIP Text Encoder: Extracts text embeddings using a pre-trained CLIP model
Results and significance:
- Surpassed Stable Diffusion v1.5, DALL-E 2, and Parti-750M in FID
- 0.13 seconds for 512px image generation: Inference speed tens to hundreds of times faster than Diffusion models
- Proved that GAN remains competitive in large-scale text-to-image synthesis
7. Implementing GAN in PyTorch
7.1 Basic GAN Implementation (MNIST)
Below is the most basic GAN implementation for the MNIST dataset in PyTorch.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# ============================================================
# Hyperparameter Settings
# ============================================================
LATENT_DIM = 100 # Latent vector dimension (dimension of z)
IMG_DIM = 28 * 28 # Flattened MNIST image dimension
HIDDEN_DIM = 256 # Hidden layer dimension
BATCH_SIZE = 64
EPOCHS = 200
LR = 0.0002
BETAS = (0.5, 0.999) # Adam optimizer beta parameters
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# ============================================================
# Generator Definition
# ============================================================
class Generator(nn.Module):
"""
Takes a latent vector z as input and generates a fake image.
Architecture: z(100) -> 256 -> 512 -> 1024 -> 784(28x28)
"""
def __init__(self, latent_dim: int, img_dim: int, hidden_dim: int):
super().__init__()
self.net = nn.Sequential(
nn.Linear(latent_dim, hidden_dim),
nn.LeakyReLU(0.2),
nn.Linear(hidden_dim, hidden_dim * 2),
nn.LeakyReLU(0.2),
nn.Linear(hidden_dim * 2, hidden_dim * 4),
nn.LeakyReLU(0.2),
nn.Linear(hidden_dim * 4, img_dim),
nn.Tanh(), # Normalize output to [-1, 1] range
)
def forward(self, z: torch.Tensor) -> torch.Tensor:
return self.net(z)
# ============================================================
# Discriminator Definition
# ============================================================
class Discriminator(nn.Module):
"""
Takes an image as input and outputs the probability of being real/fake.
Architecture: 784(28x28) -> 1024 -> 512 -> 256 -> 1
"""
def __init__(self, img_dim: int, hidden_dim: int):
super().__init__()
self.net = nn.Sequential(
nn.Linear(img_dim, hidden_dim * 4),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(hidden_dim * 4, hidden_dim * 2),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(hidden_dim * 2, hidden_dim),
nn.LeakyReLU(0.2),
nn.Dropout(0.3),
nn.Linear(hidden_dim, 1),
nn.Sigmoid(), # Convert output to [0, 1] probability
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
# ============================================================
# Data Loader Setup
# ============================================================
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,)), # [0,1] -> [-1,1]
])
dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
# ============================================================
# Model, Optimizer, and Loss Function Initialization
# ============================================================
G = Generator(LATENT_DIM, IMG_DIM, HIDDEN_DIM).to(DEVICE)
D = Discriminator(IMG_DIM, HIDDEN_DIM).to(DEVICE)
opt_G = optim.Adam(G.parameters(), lr=LR, betas=BETAS)
opt_D = optim.Adam(D.parameters(), lr=LR, betas=BETAS)
criterion = nn.BCELoss() # Binary Cross Entropy
# ============================================================
# Training Loop
# ============================================================
for epoch in range(EPOCHS):
d_loss_total, g_loss_total = 0.0, 0.0
for batch_idx, (real_imgs, _) in enumerate(dataloader):
real_imgs = real_imgs.view(-1, IMG_DIM).to(DEVICE)
batch_size = real_imgs.size(0)
# Real/fake labels
real_labels = torch.ones(batch_size, 1, device=DEVICE)
fake_labels = torch.zeros(batch_size, 1, device=DEVICE)
# -----------------------------------------
# Step 1: Train Discriminator
# -----------------------------------------
# Discriminate real images
d_real = D(real_imgs)
d_loss_real = criterion(d_real, real_labels)
# Generate and discriminate fake images
z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
fake_imgs = G(z).detach() # Block Generator's gradients
d_fake = D(fake_imgs)
d_loss_fake = criterion(d_fake, fake_labels)
# Total Discriminator loss and update
d_loss = d_loss_real + d_loss_fake
opt_D.zero_grad()
d_loss.backward()
opt_D.step()
# -----------------------------------------
# Step 2: Train Generator
# -----------------------------------------
z = torch.randn(batch_size, LATENT_DIM, device=DEVICE)
fake_imgs = G(z)
d_fake = D(fake_imgs)
# Non-saturating loss: Generator tries to maximize D(G(z))
g_loss = criterion(d_fake, real_labels)
opt_G.zero_grad()
g_loss.backward()
opt_G.step()
d_loss_total += d_loss.item()
g_loss_total += g_loss.item()
# Per-epoch log output
num_batches = len(dataloader)
print(
f"Epoch [{epoch+1}/{EPOCHS}] "
f"D Loss: {d_loss_total/num_batches:.4f} | "
f"G Loss: {g_loss_total/num_batches:.4f}"
)
7.2 DCGAN Implementation (Key Parts)
A version with the Generator and Discriminator changed to convolutional architectures.
class DCGANGenerator(nn.Module):
"""
DCGAN Generator: Generates images using Transposed Convolutions.
z(100) -> 4x4x512 -> 8x8x256 -> 16x16x128 -> 32x32x64 -> 64x64x3
"""
def __init__(self, latent_dim: int = 100, feature_map_size: int = 64, channels: int = 3):
super().__init__()
self.net = nn.Sequential(
# Input: z (latent_dim x 1 x 1) -> (feature_map_size*8 x 4 x 4)
nn.ConvTranspose2d(latent_dim, feature_map_size * 8, 4, 1, 0, bias=False),
nn.BatchNorm2d(feature_map_size * 8),
nn.ReLU(inplace=True),
# (feature_map_size*8 x 4 x 4) -> (feature_map_size*4 x 8 x 8)
nn.ConvTranspose2d(feature_map_size * 8, feature_map_size * 4, 4, 2, 1, bias=False),
nn.BatchNorm2d(feature_map_size * 4),
nn.ReLU(inplace=True),
# (feature_map_size*4 x 8 x 8) -> (feature_map_size*2 x 16 x 16)
nn.ConvTranspose2d(feature_map_size * 4, feature_map_size * 2, 4, 2, 1, bias=False),
nn.BatchNorm2d(feature_map_size * 2),
nn.ReLU(inplace=True),
# (feature_map_size*2 x 16 x 16) -> (feature_map_size x 32 x 32)
nn.ConvTranspose2d(feature_map_size * 2, feature_map_size, 4, 2, 1, bias=False),
nn.BatchNorm2d(feature_map_size),
nn.ReLU(inplace=True),
# (feature_map_size x 32 x 32) -> (channels x 64 x 64)
nn.ConvTranspose2d(feature_map_size, channels, 4, 2, 1, bias=False),
nn.Tanh(),
)
def forward(self, z: torch.Tensor) -> torch.Tensor:
return self.net(z)
class DCGANDiscriminator(nn.Module):
    """
    DCGAN Discriminator: Judges authenticity using Strided Convolutions.
    (3 x 64 x 64) -> (64 x 32 x 32) -> (128 x 16 x 16) ->
    (256 x 8 x 8) -> (512 x 4 x 4) -> 1
    """
    def __init__(self, feature_map_size: int = 64, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # (channels x 64 x 64) -> (feature_map_size x 32 x 32)
            nn.Conv2d(channels, feature_map_size, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # (feature_map_size x 32 x 32) -> (feature_map_size*2 x 16 x 16)
            nn.Conv2d(feature_map_size, feature_map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # (feature_map_size*2 x 16 x 16) -> (feature_map_size*4 x 8 x 8)
            nn.Conv2d(feature_map_size * 2, feature_map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # (feature_map_size*4 x 8 x 8) -> (feature_map_size*8 x 4 x 4)
            nn.Conv2d(feature_map_size * 4, feature_map_size * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_map_size * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # (feature_map_size*8 x 4 x 4) -> (1 x 1 x 1)
            nn.Conv2d(feature_map_size * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, 1)
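One detail of the DCGAN guidelines worth keeping alongside these classes: the paper initializes all weights from a zero-centered Normal with standard deviation 0.02. A commonly used helper for this (a sketch; the function name is ours):

```python
import torch.nn as nn

def weights_init(m: nn.Module) -> None:
    """DCGAN initialization: Conv weights ~ N(0, 0.02);
    BatchNorm scale ~ N(1.0, 0.02) with zero bias."""
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm") != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0.0)
```

Applied recursively to every submodule via `G = DCGANGenerator().apply(weights_init)` and likewise for the Discriminator.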
7.3 WGAN-GP Core Loss Implementation
def compute_gradient_penalty(
    discriminator: nn.Module,
    real_samples: torch.Tensor,
    fake_samples: torch.Tensor,
    device: torch.device,
    lambda_gp: float = 10.0,
) -> torch.Tensor:
    """
    Computes the Gradient Penalty for WGAN-GP.
    Penalizes the Discriminator (Critic) so that the L2 norm of its gradient
    equals 1 at random interpolation points between real and generated data.
    """
    batch_size = real_samples.size(0)
    # Random interpolation coefficient
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device)
    # Interpolation between real and fake
    interpolated = (epsilon * real_samples + (1 - epsilon) * fake_samples).requires_grad_(True)
    # Critic output
    d_interpolated = discriminator(interpolated)
    # Gradient computation
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,
        retain_graph=True,
    )[0]
    # L2 norm of gradients
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)
    # Gradient Penalty: expectation of (||grad|| - 1)^2
    gradient_penalty = lambda_gp * ((gradient_norm - 1) ** 2).mean()
    return gradient_penalty
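To see what the penalty does numerically, here is the same formula evaluated on hand-picked gradient norms (plain arithmetic, not the autograd version; the helper name is ours). Note that the penalty is two-sided: norms are pushed toward 1 from below and from above.

```python
def penalty(grad_norms, lambda_gp=10.0):
    # lambda_gp * E[(||grad|| - 1)^2], mirroring compute_gradient_penalty
    return lambda_gp * sum((n - 1.0) ** 2 for n in grad_norms) / len(grad_norms)

print(penalty([1.0, 1.0]))  # 0.0  -- norms already meet the 1-Lipschitz target
print(penalty([0.5, 2.0]))  # 6.25 -- too-small and too-large norms both penalized
```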
# WGAN-GP Training Loop (Key Parts)
def train_wgan_gp_step(
    G: nn.Module,
    D: nn.Module,
    opt_G: optim.Optimizer,
    opt_D: optim.Optimizer,
    real_imgs: torch.Tensor,
    latent_dim: int,
    device: torch.device,
    n_critic: int = 5,
):
    """One iteration of WGAN-GP training."""
    batch_size = real_imgs.size(0)

    # --- Critic (Discriminator) training: n_critic times ---
    for _ in range(n_critic):
        z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
        fake_imgs = G(z).detach()
        # Wasserstein Loss: maximize E[D(real)] - E[D(fake)]
        d_real = D(real_imgs).mean()
        d_fake = D(fake_imgs).mean()
        gp = compute_gradient_penalty(D, real_imgs, fake_imgs, device)
        d_loss = d_fake - d_real + gp  # Critic minimizes this
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # --- Generator training: 1 time ---
    z = torch.randn(batch_size, latent_dim, 1, 1, device=device)
    fake_imgs = G(z)
    g_loss = -D(fake_imgs).mean()  # Generator maximizes D(G(z))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
8. GAN vs Diffusion Models Comparison
Entering the 2020s, Diffusion Models (DDPM, Score-based models) emerged as a new paradigm in image generation. After Dhariwal and Nichol's 2021 paper "Diffusion Models Beat GANs on Image Synthesis," Diffusion Models became the mainstream of generative modeling through DALL-E 2, Stable Diffusion, Midjourney, and others. Let us systematically compare GAN and Diffusion Models.
8.1 Fundamental Comparison
| Aspect | GAN | Diffusion Model |
|---|---|---|
| Training method | Adversarial training (minimax game) | Denoising score matching |
| Generation | Single forward pass | Iterative denoising (tens to hundreds of steps) |
| Likelihood | Implicit (no explicit density) | Explicit (ELBO / likelihood bound) |
| Loss function | Adversarial loss (+ auxiliary losses) | Simple MSE/L1 (noise prediction) |
| Distribution matching | JSD / Wasserstein distance | ELBO / score matching |
8.2 Strengths and Weaknesses
GAN Strengths:
- Inference speed: Generates images in a single forward pass. Suitable for real-time applications
- Sample sharpness: Tends to produce sharp, realistic images through adversarial training
- Latent space control: Semantic manipulation through a well-structured latent space
- Lightweight: Can achieve high-quality generation with relatively few parameters
GAN Weaknesses:
- Training instability: Mode collapse, training oscillation, etc.
- Limited diversity: Mode collapse can restrict generation diversity
- Scalability limitations: Does not scale as naturally to text-conditioned generation as Diffusion Models
- Evaluation difficulty: Hard to monitor training progress with reliable metrics
Diffusion Model Strengths:
- Training stability: Stable training with simple MSE loss
- Sample diversity: Mode collapse is virtually nonexistent
- Text-conditioned generation: Natural conditional generation through classifier-free guidance, etc.
- Theoretical robustness: Explicit probabilistic model enabling likelihood computation
Diffusion Model Weaknesses:
- Inference speed: Requires tens to hundreds of iterative denoising steps (being improved through distillation, etc.)
- Computational cost: High compute requirements for both training and inference
- Memory usage: Large U-Net parameters required for high-resolution generation
8.3 Convergence Characteristics
| Property | GAN | Diffusion Model |
|---|---|---|
| Convergence guarantee | Nash equilibrium exists in theory but is rarely reached in practice | Stable convergence via ELBO optimization |
| Mode Coverage | Risk of mode collapse | Excellent mode coverage |
| Training curve | Unstable, hard to interpret | Stable, loss directly interpretable |
| Hyperparameter sensitivity | High | Relatively low |
8.4 The 2025 Landscape
As of 2025, Diffusion Models dominate image generation. The most commercially successful image generation models (Stable Diffusion, DALL-E 3, Midjourney) are all Diffusion-based.
However, GAN has not been fully replaced. GAN still shows strength in the following areas:
- Real-time generation: Video games, VR/AR, etc.
- Image editing/manipulation: Precise face editing and attribute manipulation based on StyleGAN
- Super-Resolution: Real-time super-resolution processing
- GAN-Diffusion Hybrids: Combining GAN loss with Diffusion processes, or leveraging GAN's fast inference for Diffusion model distillation
The emergence of GigaGAN (2023) demonstrated that GAN can be competitive in large-scale text-to-image synthesis, and research combining the strengths of both paradigms is actively underway.
9. The Present and Future of GAN
9.1 GAN's Current Status
GAN sat at the center of generative modeling from its 2014 publication until it ceded the mainstream position to Diffusion Models after 2021. However, GAN's legacy and current role remain significant.
Fields where GAN is actively used today:
- Medical imaging: Widely used for augmenting training data while preserving patient privacy
- Data augmentation: Expanding small datasets to improve model performance
- Image editing and restoration: Face restoration, denoising, super-resolution, etc.
- Fashion and design: Virtual try-on, design prototyping
- Gaming and simulation: Real-time content generation, texture synthesis
9.2 GAN's Theoretical Legacy
GAN's greatest contribution extends beyond image generation technology.
Adversarial Training Paradigm: The adversarial training introduced by GAN has influenced diverse fields beyond generative models.
- Adversarial Examples: Robustness research on deep learning models
- Domain Adaptation: Knowledge transfer across domains using adversarial training
- Self-supervised Learning: Self-supervised learning leveraging adversarial signals
- Inverse Reinforcement Learning: Learning reward functions adversarially
Implicit Generative Models: GAN's core insight that complex distributions can be learned without defining explicit probability distributions has influenced the development of Energy-based Models, Score-based Models, and more.
9.3 Future Outlook
GAN-Diffusion Fusion: One of the most promising directions is combining the strengths of GAN and Diffusion Models. Research is underway to replace denoising steps in the Diffusion process with GANs to accelerate inference.
3D Generation: Research combining GAN with 3D representations (Neural Radiance Fields, 3D Gaussian Splatting, etc.) for 3D content generation is active. EG3D and GET3D are representative examples.
Video Generation: StyleGAN3's equivariant properties can naturally apply to video generation, with ongoing research in temporally consistent video generation.
Efficient Training: Research continues on high-quality generation model training with limited data through Few-shot GAN, transfer learning for GANs, and related approaches.
9.4 GAN Timeline Summary
| Year | Model | Key Contribution | Resolution |
|---|---|---|---|
| 2014 | GAN | Adversarial training framework | Low |
| 2014 | cGAN | Conditional generation | Low |
| 2015 | DCGAN | CNN-based architecture guidelines | 64x64 |
| 2017 | WGAN | Wasserstein distance | 64x64 |
| 2017 | WGAN-GP | Gradient penalty | 64x64 |
| 2017 | Pix2Pix | Paired image-to-image translation | 256x256 |
| 2017 | CycleGAN | Unpaired image-to-image translation | 256x256 |
| 2017 | ProGAN | Progressive growing | 1024x1024 |
| 2018 | BigGAN | Large-scale training, truncation trick | 512x512 |
| 2019 | StyleGAN | Mapping network, AdaIN, style separation | 1024x1024 |
| 2020 | StyleGAN2 | Weight demodulation, path regularization | 1024x1024 |
| 2021 | StyleGAN3 | Alias-free, equivariant generation | 1024x1024 |
| 2023 | GigaGAN | 1B-param text-to-image GAN | 512x512+ |
10. Conclusion
The GAN proposed by Ian Goodfellow in 2014 revolutionized the AI field with a simple yet powerful idea: competition between two networks produces better generative models. The mathematical framework of the minimax game was both elegant and practical, spawning hundreds of variants over the following decade and dramatically advancing image generation quality.
DCGAN laid the practical foundation through its combination with CNNs, while WGAN solved training stability issues with the theoretical innovation of Wasserstein distance. The Progressive GAN and StyleGAN series enabled photorealistic image generation at 1024x1024 resolution, and CycleGAN and Pix2Pix pioneered the new application domain of image translation.
Although Diffusion Models have risen to prominence in generative modeling since 2021, GAN's legacy is immense. The adversarial training paradigm continues to be utilized across diverse fields, and hybrid research combining the strengths of GAN and Diffusion Models is actively progressing. As the emergence of GigaGAN demonstrates, the GAN story is far from over.
In the history of generative models, GAN will be remembered as the milestone that first demonstrated the possibility that "artificial intelligence can truly create."
References
Goodfellow, I. J. et al. (2014). "Generative Adversarial Nets." NeurIPS 2014. arXiv:1406.2661
Mirza, M. & Osindero, S. (2014). "Conditional Generative Adversarial Nets." arXiv:1411.1784
Radford, A., Metz, L. & Chintala, S. (2015). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." arXiv:1511.06434
Arjovsky, M., Chintala, S. & Bottou, L. (2017). "Wasserstein GAN." arXiv:1701.07875
Gulrajani, I. et al. (2017). "Improved Training of Wasserstein GANs." arXiv:1704.00028
Isola, P. et al. (2017). "Image-to-Image Translation with Conditional Adversarial Networks." CVPR 2017. arXiv:1611.07004
Zhu, J.-Y. et al. (2017). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." ICCV 2017. arXiv:1703.10593
Karras, T. et al. (2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation." ICLR 2018. arXiv:1710.10196
Brock, A., Donahue, J. & Simonyan, K. (2018). "Large Scale GAN Training for High Fidelity Natural Image Synthesis." ICLR 2019. arXiv:1809.11096
Karras, T., Laine, S. & Aila, T. (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks." CVPR 2019. arXiv:1812.04948
Karras, T. et al. (2020). "Analyzing and Improving the Image Quality of StyleGAN." CVPR 2020. arXiv:1912.04958
Karras, T. et al. (2021). "Alias-Free Generative Adversarial Networks." NeurIPS 2021. arXiv:2106.12423
Kang, M. et al. (2023). "Scaling up GANs for Text-to-Image Synthesis." CVPR 2023. arXiv:2303.05511
Dhariwal, P. & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. arXiv:2105.05233