Week 7

Published

Wednesday, April 2, 2025

Generative Adversarial Networks (GANs)

Introduction

  • Origin and Context
    • GANs were introduced by Goodfellow et al. in 2014.
    • They became widely adopted and studied around 2016.
  • Implicit Generative Models
    • GANs are implicit generative models: they learn to represent a data distribution \(p_{data}(x)\) through a generative process, rather than by explicitly modeling the probability density function.
    • The model does not require an explicit likelihood function.

Limitations of Likelihood-based Methods

  • Challenges in Likelihood Estimation
    • Maximizing likelihood does not guarantee high-quality or representative samples.
      • A model may assign high likelihood to noisy or outlier regions, skewing the overall likelihood score, even if it fails to capture the main modes of the data.
  • Paradoxical Outcomes
    • Models can achieve high likelihood scores while generating poor-quality samples.
    • Conversely, models that generate visually realistic samples may have poor log-likelihood scores.
  • Memorization vs. Generalization
    • Likelihood-based models may overfit and memorize the training data, achieving high likelihood on the training set but failing to generalize and generate novel samples.

GANs as Implicit Generative Models

  • Key Characteristics
    • Likelihood-free: GANs do not require explicit definition or optimization of a likelihood function \(p(x)\).
    • Architectural Flexibility: The generator and discriminator can be implemented using various neural network architectures (e.g., MLP, CNN, RNN).
    • Transformation: GANs map low-dimensional random noise vectors \(z\) to high-dimensional structured data samples \(G(z)\).
    • Sample Quality: GANs often produce sharper and more realistic samples than Variational Autoencoders (VAEs), especially for images.
    • Composability: GANs can be integrated with other generative models (e.g., VAEs, Normalizing Flows, Diffusion Models) to potentially improve sample quality or training stability.

Architecture Components

  • Generator Network (\(G\))
    • Input: Latent vector \(z\) sampled from a prior distribution \(p_z(z)\) (e.g., standard Gaussian \(\mathcal{N}(0, I)\) or Uniform \(U[-1, 1]\)).
    • Output: Synthetic data sample \(x_{fake} = G(z)\), in the same space as the real data.
    • Objective: Generate samples \(G(z)\) that are indistinguishable from real data samples \(x \sim p_{data}(x)\).
  • Discriminator Network (\(D\))
    • Input: Data sample \(x\), which can be either a real sample (\(x \sim p_{data}(x)\)) or a fake sample (\(x = G(z)\)).
    • Output: Scalar probability \(D(x) \in [0, 1]\), representing the estimated probability that \(x\) is a real sample from \(p_{data}(x)\).
    • Objective: Accurately classify inputs as real (\(D(x) \to 1\)) or fake (\(D(G(z)) \to 0\)).
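
The two networks are ordinary neural networks. Below is a minimal PyTorch sketch; the MLP layer sizes, the 100-dimensional latent, and the 784-dimensional flattened-image data space are all illustrative assumptions, not requirements of the GAN framework:

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # dimension of z (illustrative choice)
DATA_DIM = 784    # e.g. a flattened 28x28 image (illustrative choice)

# Generator G: latent vector z -> synthetic sample G(z) in data space.
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, DATA_DIM), nn.Tanh(),  # Tanh assumes data scaled to [-1, 1]
)

# Discriminator D: sample x -> scalar D(x) in [0, 1],
# the estimated probability that x came from p_data.
D = nn.Sequential(
    nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)  # z ~ N(0, I), the prior p_z
x_fake = G(z)                    # a batch of synthetic samples
print(D(x_fake).shape)           # torch.Size([16, 1]) -- probabilities
```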

Training Dynamics

  • Adversarial Framework
    • The process is a zero-sum, two-player minimax game between \(G\) and \(D\).
      • \(G\) tries to fool \(D\) by generating increasingly realistic samples.
      • \(D\) tries to improve its ability to distinguish real samples from fake ones generated by \(G\).
    • Training uses backpropagation: gradients of the objective update \(D\) directly, and they also flow back through a frozen \(D\) into \(G\), telling the generator how to make its samples harder to reject.
  • Mathematical Formulation
    • The core value function \(V(G, D)\) defines the minimax objective: \[ \min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \]
  • Optimization
    • \(D\) is trained to maximize \(V(G, D)\) (maximize classification accuracy).
    • \(G\) is trained to minimize \(V(G, D)\) (minimize the log-probability of \(D\) being correct about fake samples).
  • Theoretical Equilibrium (Nash Equilibrium)
    • If \(G\) and \(D\) have sufficient capacity and training converges optimally:
      • The generator’s distribution \(p_g\) matches the real data distribution \(p_{data}\).
      • The discriminator outputs \(D(x) = 1/2\) everywhere, unable to distinguish real from fake.
    • Assumptions: This requires strong assumptions like infinite model capacity and the ability to find the Nash equilibrium, which are often not met in practice.

Practical Training Algorithm

  • Alternating Optimization
    • Simultaneous gradient descent for minimax games is unstable, so training typically alternates:
      • Train Discriminator: Freeze \(G\), update \(D\) for one or more (\(k\)) steps by ascending the gradient of \(V(G, D)\) with respect to \(D\)'s parameters: \(\nabla_{\theta_D} V(G, D)\).
      • Train Generator: Freeze \(D\), update \(G\) for one step by descending the gradient with respect to \(G\)'s parameters: \(\nabla_{\theta_G} V(G, D)\).
      • Repeat.
    • Often \(k > 1\) (e.g., \(k=5\)) to ensure \(D\) remains relatively well-optimized.
  • Training Considerations
    • Balance: Crucial to maintain balance between \(G\) and \(D\). If \(D\) becomes too strong too quickly, gradients for \(G\) can vanish. If \(D\) is too weak, \(G\) receives poor guidance.
    • Computational Cost: Training is computationally intensive due to the need to train two networks and potentially perform multiple discriminator updates per generator update.
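
A minimal sketch of this alternating loop on a toy one-dimensional problem; everything here (the \(\mathcal{N}(4, 1)\) "real" data, the latent dimension, the learning rates, and \(k = 5\)) is an illustrative assumption. It implements the original minimax losses, with \(D\)'s maximization written as binary cross-entropy:

```python
import torch
import torch.nn as nn

# Toy setup: "real" data is 1-D samples from N(4, 1).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()
k, batch = 5, 64

for step in range(200):
    # Train D for k steps; G is "frozen" because G(z) is detached
    # and only opt_D is stepped.
    for _ in range(k):
        x_real = 4.0 + torch.randn(batch, 1)        # x ~ p_data
        x_fake = G(torch.randn(batch, 8)).detach()  # z ~ p_z
        # Maximizing V(G, D) == minimizing BCE with labels real=1, fake=0.
        loss_D = bce(D(x_real), torch.ones(batch, 1)) \
               + bce(D(x_fake), torch.zeros(batch, 1))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train G for one step; D is "frozen" because only opt_G is stepped.
    d_fake = D(G(torch.randn(batch, 8)))
    loss_G = torch.log(1.0 - d_fake + 1e-8).mean()  # minimize E[log(1 - D(G(z)))]
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(1000, 8)).mean().item())  # with luck, drifts toward 4.0
```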

Proposition 1: Optimal Discriminator

  • For a fixed generator \(G\), the optimal discriminator \(D^*(x)\) that maximizes \(V(G, D)\) is:

    \[ D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \]

    where \(p_g(x)\) is the probability density of the samples generated by \(G\).

  • Derivation:

    • The discriminator’s objective is: \[ \max_D \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]
    • By the law of the unconscious statistician, this can be rewritten as: \[ \max_D \int_x p_{data}(x) \log D(x) dx + \int_x p_g(x) \log(1 - D(x)) dx \]
    • For each \(x\), maximize \(a \log y + b \log(1-y)\) with \(a = p_{data}(x)\), \(b = p_g(x)\), \(y = D(x)\).
    • Setting the derivative \(\frac{a}{y} - \frac{b}{1-y}\) to zero gives the maximum at \(y^* = \frac{a}{a+b}\), so: \[ D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \]

Proposition 2: GANs Minimize Jensen-Shannon Divergence

  • When \(D\) is optimal (\(D^*\)), the training objective for \(G\) corresponds to minimizing the Jensen-Shannon (JS) divergence between \(p_{data}\) and \(p_g\).
  • Kullback-Leibler (KL) Divergence: \[ D_{KL}(P || Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx \]
  • Jensen-Shannon (JS) Divergence: \[ D_{JS}(P || Q) = \frac{1}{2} D_{KL}(P || M) + \frac{1}{2} D_{KL}(Q || M) \] where \(M = \frac{1}{2}(P + Q)\). \(0 \le D_{JS}(P || Q) \le \log 2\).
  • Proof Sketch:
    • Substitute \(D^*(x)\) into \(V(G, D)\), writing \(m(x) = \frac{1}{2}(p_{data}(x) + p_g(x))\): \[ \begin{aligned} C(G) &= \max_D V(G, D) = V(G, D^*) \\ &= \mathbb{E}_{x \sim p_{data}}[\log D^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*(x))] \\ &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \\ &= \int p_{data}(x) \log \frac{p_{data}(x)}{2 m(x)} dx + \int p_g(x) \log \frac{p_g(x)}{2 m(x)} dx \\ &= \int p_{data}(x) \left( \log \frac{p_{data}(x)}{m(x)} - \log 2 \right) dx + \int p_g(x) \left( \log \frac{p_g(x)}{m(x)} - \log 2 \right) dx \\ &= (D_{KL}(p_{data} || m) - \log 2) + (D_{KL}(p_g || m) - \log 2) \\ &= D_{KL}(p_{data} || m) + D_{KL}(p_g || m) - 2 \log 2 \\ &= 2 D_{JS}(p_{data} || p_g) - 2 \log 2 \end{aligned} \]
    • Minimizing \(C(G)\) with respect to \(G\) is equivalent to minimizing \(D_{JS}(p_{data} || p_g)\), since \(-2 \log 2\) is a constant.
    • The minimum value of \(D_{JS}\) is \(0\), achieved if and only if \(p_{data} = p_g\).
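
Both propositions are easy to check numerically for discrete distributions. The sketch below (the two 5-outcome distributions are arbitrary examples) plugs the optimal discriminator \(D^*\) into \(V(G, D)\) and confirms it equals \(2 D_{JS}(p_{data} || p_g) - 2 \log 2\):

```python
import numpy as np

# Two discrete distributions over 5 outcomes, standing in for p_data and p_g.
p_data = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
p_g    = np.array([0.3, 0.1, 0.1, 0.2, 0.3])

# Proposition 1: the optimal discriminator.
d_star = p_data / (p_data + p_g)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# V(G, D*) computed directly from its definition...
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# ...and via Proposition 2: 2 * JSD - 2 log 2.
m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
print(np.isclose(v, 2 * jsd - 2 * np.log(2)))  # True
```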

Practical Training Algorithm (continued)

  • The theoretical analysis assumes \(D\) reaches its optimum \(D^*\) for every \(G\). In practice, \(D\) is optimized for only a finite number of steps (\(k\)): fully solving the inner maximization at every generator update would be computationally prohibitive and, on a finite dataset, would cause \(D\) to overfit.
  • Alternating optimization (typically \(k \in \{1, ..., 5\}\) steps for \(D\) per \(G\) update) is used to keep \(D\) near optimal while allowing \(G\) to change slowly.

Issues with Training

  • Vanishing Gradients
    • Problem: When \(D\) is confident and accurate (outputs close to 0 or 1), the gradient of \(\log(1 - D(G(z)))\) with respect to \(G\)'s parameters becomes very small (the loss saturates). This happens especially early in training, when \(G\) produces poor samples that \(D\) easily rejects, and it stalls the learning of \(G\).
    • Solution (Non-Saturating Loss): Instead of minimizing \(\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\), the generator maximizes \(\mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]\), which provides strong gradients early in training (see the first sketch after this list). The generator's objective becomes: \[ \mathcal{L}_G = - \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))] \]
  • Mode Collapse
    • Problem: The generator \(G\) learns to produce only a limited variety of samples, collapsing many modes of the true data distribution \(p_{data}\) into one or a few output modes. \(G\) finds a few samples that can easily fool the current \(D\), and stops exploring.
    • Potential Causes: \(G\) finds a local minimum in the adversarial game.
    • Mitigation Strategies:
      • Architectural changes
      • Alternative loss functions (e.g., Wasserstein GAN, which uses the Wasserstein distance as a similarity measure between distributions, providing smoother gradients and better mode coverage)
      • Unrolled GANs: The generator is optimized with respect to the state of \(D\) after the next \(k\) steps, discouraging \(G\) from exploiting local minima.
      • Mini-batch discrimination and gradient accumulation can also help.
  • Training Instability
    • Problem: The adversarial training process involves finding a Nash equilibrium, which is inherently more unstable than standard minimization problems. Updates improving one player (\(G\) or \(D\)) might worsen the objective for the other, leading to oscillations or divergence.
    • Mitigation Strategies:
      • Careful hyperparameter tuning
      • Architectural choices (e.g., Spectral Normalization)
      • Gradient penalties (e.g., WGAN-GP adds a penalty term related to the norm of the discriminator’s gradient with respect to its input, often evaluated at points interpolated between real and fake samples: \(\|\nabla_{\hat{x}} D(\hat{x})\|_2\)).
        • Explanation: In WGAN-GP, the gradient penalty enforces the 1-Lipschitz constraint on the critic by penalizing the squared deviation of the gradient norm from 1: \[ \lambda \mathbb{E}_{\hat{x}} \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \] where \(\hat{x}\) is sampled uniformly along straight lines between pairs of points from the real and generated data distributions (see the second sketch after this list).
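
The first sketch contrasts the two generator losses from the vanishing-gradients discussion above (a minimal illustration; the \(10^{-8}\) epsilon is just numerical padding). Per unit change in \(D(G(z))\), the saturating loss has derivative \(-1/(1 - D) \approx -1\) when \(D \approx 0\), while the non-saturating loss has derivative \(-1/D\), which amplifies the learning signal exactly where \(D\) confidently rejects the samples:

```python
import torch

def g_loss_saturating(d_fake):
    # Original minimax loss: minimize E[log(1 - D(G(z)))].
    return torch.log(1.0 - d_fake + 1e-8).mean()

def g_loss_non_saturating(d_fake):
    # Heuristic loss: minimize -E[log D(G(z))].
    return -torch.log(d_fake + 1e-8).mean()

# A confident discriminator assigns the fakes probability 0.01.
d = torch.full((1,), 0.01, requires_grad=True)
g_loss_saturating(d).backward()
print(d.grad)   # ~ -1.01: weak signal per unit change in D(G(z))
d.grad = None
g_loss_non_saturating(d).backward()
print(d.grad)   # ~ -100: two orders of magnitude stronger
```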
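The second sketch implements the WGAN-GP penalty term. Shapes assume flat feature vectors; \(\lambda = 10\) is the value commonly used with WGAN-GP, and the linear critic is an arbitrary stand-in (WGAN critics output unbounded scores, with no sigmoid):

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    # x_hat: points sampled uniformly on segments between real and fake samples.
    eps = torch.rand(x_real.size(0), 1)  # one interpolation weight per sample
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # the penalty itself must be differentiable
    )
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()  # push ||grad|| toward 1

critic = torch.nn.Linear(8, 1)  # stand-in critic: unbounded scores, no sigmoid
print(gradient_penalty(critic, torch.randn(16, 8), torch.randn(16, 8)))
```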

StyleGANs

  • Overview
    • StyleGANs are a family of GAN architectures known for generating high-resolution, high-fidelity images (e.g., 1024x1024 faces).
  • Key Innovations
    • Progressive Growing (introduced in PGGAN, adopted by StyleGAN)
      • Training starts with low-resolution images (e.g., 4x4).
      • New layers are progressively added to both \(G\) and \(D\) to increase the resolution (e.g., 8x8, 16x16, …), fine-tuning the network at each stage. This stabilizes training for high resolutions.
    • Style-Based Generator
      • Mapping Network: A separate network (e.g., an MLP) maps the initial latent code \(z \sim p_z(z)\) to an intermediate latent vector \(w \in W\); this intermediate space tends to be better disentangled than the input latent space.
      • Adaptive Instance Normalization (AdaIN): The intermediate latent \(w\) is transformed into style parameters (scale \(\mathbf{y}_s\) and bias \(\mathbf{y}_b\)) which modulate the activations at each layer of the synthesis network. \[ \text{AdaIN}(x_i, w) = \mathbf{y}_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + \mathbf{y}_{b,i} \] where \(x_i\) are the activations of the \(i\)-th feature map, \(\mu(x_i)\) and \(\sigma(x_i)\) are the mean and standard deviation, and \(\mathbf{y}_{s,i}, \mathbf{y}_{b,i}\) are produced from \(w\) by learned affine transformations (see the sketch after this list).
      • Constant Input & Noise Injection: The synthesis network starts from a learned constant tensor instead of the latent code directly. Stochasticity (e.g., hair placement, freckles) is introduced by adding learned per-pixel noise at each layer, scaled by learned factors.
      • Hierarchical Control: Injecting \(w\) (via AdaIN) at different layers controls different levels of attributes: early layers affect coarse features (pose, shape), while later layers affect finer details (color scheme, micro-textures).
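
A sketch of the AdaIN operation; the 512-dim \(w\), the scale initialized around 1, and the tensor shapes are illustrative choices, and StyleGAN's actual implementation differs in details:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize each feature map per
    sample, then apply per-channel scale y_s and bias y_b predicted from w."""
    def __init__(self, w_dim: int, num_channels: int):
        super().__init__()
        # Learned affine transform: w -> (y_s, y_b), one pair per channel.
        self.affine = nn.Linear(w_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); w: (batch, w_dim)
        y_s, y_b = self.affine(w).chunk(2, dim=1)  # (batch, C) each
        y_s = y_s[:, :, None, None] + 1.0          # scale around 1 (illustrative)
        y_b = y_b[:, :, None, None]
        mu = x.mean(dim=(2, 3), keepdim=True)      # per-sample, per-channel stats
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        return y_s * (x - mu) / sigma + y_b

# Illustrative shapes: 512-dim w modulating a 64-channel 16x16 feature map.
ada = AdaIN(w_dim=512, num_channels=64)
out = ada(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 16, 16])
```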

Image-to-Image Translation

  • Task: Learn a mapping function to translate an image from a source domain A to a target domain B (e.g., segmentation maps to photos, sketches to photos, day to night).
  • Pix2Pix
    • A conditional GAN (cGAN) approach where the generator is conditioned on the input image from domain A. \(G: A \to B\).
    • Requires Paired Data: Training needs datasets where each image \(x_A\) has a corresponding ground truth image \(x_B\).
    • Loss Function: Combines a conditional GAN loss (discriminator \(D\) tries to distinguish real pairs \((x_A, x_B)\) from fake pairs \((x_A, G(x_A))\)) with a reconstruction loss, typically L1 loss: \[ \mathcal{L}_{L1}(G) = \mathbb{E}_{x_A, x_B}[||x_B - G(x_A)||_1] \] L1 loss is often preferred over L2 (MSE) as it encourages less blurring in the generated images.
    • Architecture: Often uses a “U-Net” architecture for the generator and a “PatchGAN” discriminator (classifies patches of the image as real/fake).
  • CycleGAN
    • Designed for Unpaired Data: Can learn translations when there is no direct one-to-one mapping between images in domain A and domain B (e.g., translating photos of horses to zebras, or photographs to Monet paintings).
    • Architecture: Uses two generators (\(G_{A \to B}\) and \(G_{B \to A}\)) and two discriminators (\(D_A\) and \(D_B\)).
      • \(G_{A \to B}\): Translates A to B. \(D_B\): Distinguishes real B images from generated \(\hat{x}_B = G_{A \to B}(x_A)\).
      • \(G_{B \to A}\): Translates B to A. \(D_A\): Distinguishes real A images from generated \(\hat{x}_A = G_{B \to A}(x_B)\).
    • Cycle Consistency Loss: Ensures that translating an image from A to B and back to A should recover the original image (and vice-versa). \[ \mathcal{L}_{cyc}(G_{A \to B}, G_{B \to A}) = \mathbb{E}_{x_A}[||G_{B \to A}(G_{A \to B}(x_A)) - x_A||_1] + \mathbb{E}_{x_B}[||G_{A \to B}(G_{B \to A}(x_B)) - x_B||_1] \]
    • Total Loss: Sum of the adversarial losses for both mappings and the cycle consistency loss. \[ \mathcal{L}_{total} = \mathcal{L}_{GAN}(G_{A \to B}, D_B) + \mathcal{L}_{GAN}(G_{B \to A}, D_A) + \lambda \mathcal{L}_{cyc}(G_{A \to B}, G_{B \to A}) \] where \(\lambda\) is a hyperparameter controlling the cycle consistency term.
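
A sketch of the cycle-consistency term; the linear stand-in generators and 8-dim "images" are only there to make the call runnable. The same mean-absolute-error form is the L1 reconstruction loss Pix2Pix uses:

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_AB, G_BA, x_A, x_B):
    """L1 cycle loss: A -> B -> A and B -> A -> B should reconstruct the input."""
    loss_A = (G_BA(G_AB(x_A)) - x_A).abs().mean()  # ||G_BA(G_AB(x_A)) - x_A||_1
    loss_B = (G_AB(G_BA(x_B)) - x_B).abs().mean()  # ||G_AB(G_BA(x_B)) - x_B||_1
    return loss_A + loss_B

# Stand-in generators (same-sized linear maps) just to show the call.
G_AB = nn.Linear(8, 8)
G_BA = nn.Linear(8, 8)
print(cycle_consistency_loss(G_AB, G_BA, torch.randn(4, 8), torch.randn(4, 8)))
```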

Other Applications & Extensions

  • Video-to-Video Translation (Vid2Vid)
    • Extends image translation to video, requiring temporal consistency across generated frames.
  • 3D GANs
    • Generate 3D shapes or scenes, often represented as voxel grids, point clouds, or implicit neural representations.