Week 7

Published

Wednesday, April 2, 2025

Generative Adversarial Networks (GANs)

Introduction

  • Origin and Context
    • GANs were introduced by Goodfellow et al. in 2014.
    • They became widely adopted and studied around 2016.
  • Implicit Generative Models
    • GANs are implicit generative models: they learn to represent a data distribution \(p_{data}(x)\) through a generative process, rather than by explicitly modeling the probability density function.
    • The model does not require an explicit likelihood function.

Limitations of Likelihood-based Methods

  • Challenges in Likelihood Estimation
    • Maximizing likelihood does not guarantee high-quality or representative samples.
      • A model may assign high likelihood to noisy or outlier regions, skewing the overall likelihood score, even if it fails to capture the main modes of the data.
  • Paradoxical Outcomes
    • Models can achieve high likelihood scores while generating poor-quality samples.
    • Conversely, models that generate visually realistic samples may have poor log-likelihood scores.
  • Memorization vs. Generalization
    • Likelihood-based models may overfit and memorize the training data, achieving high likelihood on the training set but failing to generalize and generate novel samples.

GANs as Implicit Generative Models

  • Key Characteristics
    • Likelihood-free: GANs do not require explicit definition or optimization of a likelihood function \(p(x)\).
    • Architectural Flexibility: The generator and discriminator can be implemented using various neural network architectures (e.g., MLP, CNN, RNN).
    • Transformation: GANs map low-dimensional random noise vectors \(z\) to high-dimensional structured data samples \(G(z)\).
    • Sample Quality: GANs often produce sharper and more realistic samples than Variational Autoencoders (VAEs), especially for images.
    • Composability: GANs can be integrated with other generative models (e.g., VAEs, Normalizing Flows, Diffusion Models) to potentially improve sample quality or training stability.

Architecture Components

  • Generator Network (\(G\))
    • Input: Latent vector \(z\) sampled from a prior distribution \(p_z(z)\) (e.g., standard Gaussian \(\mathcal{N}(0, I)\) or Uniform \(U[-1, 1]\)).
    • Output: Synthetic data sample \(x_{fake} = G(z)\), in the same space as the real data.
    • Objective: Generate samples \(G(z)\) that are indistinguishable from real data samples \(x \sim p_{data}(x)\).
  • Discriminator Network (\(D\))
    • Input: Data sample \(x\), which can be either a real sample (\(x \sim p_{data}(x)\)) or a fake sample (\(x = G(z)\)).
    • Output: Scalar probability \(D(x) \in [0, 1]\), representing the estimated probability that \(x\) is a real sample from \(p_{data}(x)\).
    • Objective: Accurately classify inputs as real (\(D(x) \to 1\)) or fake (\(D(G(z)) \to 0\)).
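
The two networks are ordinary neural networks. Below is a minimal PyTorch sketch; the MLP layer sizes, the 100-dimensional latent, and the 784-dimensional flattened-image data space are all illustrative assumptions, not requirements of the GAN framework:

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # dimension of z (illustrative choice)
DATA_DIM = 784    # e.g. a flattened 28x28 image (illustrative choice)

# Generator G: latent vector z -> synthetic sample G(z) in data space.
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, DATA_DIM), nn.Tanh(),  # Tanh assumes data scaled to [-1, 1]
)

# Discriminator D: sample x -> scalar D(x) in [0, 1],
# the estimated probability that x came from p_data.
D = nn.Sequential(
    nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)  # z ~ N(0, I), the prior p_z
x_fake = G(z)                    # a batch of synthetic samples
print(D(x_fake).shape)           # torch.Size([16, 1]) -- probabilities
```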

Training Dynamics

  • Adversarial Framework
    • The process is a zero-sum, two-player minimax game between \(G\) and \(D\).
      • \(G\) tries to fool \(D\) by generating increasingly realistic samples.
      • \(D\) tries to improve its ability to distinguish real samples from fake ones generated by \(G\).
    • Training uses backpropagation: gradients of the objective update \(D\) directly, and they also flow back through a frozen \(D\) into \(G\), telling the generator how to make its samples harder to reject.
  • Mathematical Formulation
    • The core value function \(V(G, D)\) defines the minimax objective: \[ \min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \]
  • Optimization
    • \(D\) is trained to maximize \(V(G, D)\) (maximize classification accuracy).
    • \(G\) is trained to minimize \(V(G, D)\) (minimize the log-probability of \(D\) being correct about fake samples).
  • Theoretical Equilibrium (Nash Equilibrium)
    • If \(G\) and \(D\) have sufficient capacity and training converges optimally:
      • The generator’s distribution \(p_g\) matches the real data distribution \(p_{data}\).
      • The discriminator outputs \(D(x) = 1/2\) everywhere, unable to distinguish real from fake.
    • Assumptions: This requires strong assumptions like infinite model capacity and the ability to find the Nash equilibrium, which are often not met in practice.

Practical Training Algorithm

  • Alternating Optimization
    • Simultaneous gradient descent for minimax games is unstable, so training typically alternates:
      • Train Discriminator: Freeze \(G\), update \(D\) for one or more (\(k\)) steps by ascending the gradient of \(V(G, D)\) with respect to \(D\)'s parameters: \(\nabla_{\theta_D} V(G, D)\).
      • Train Generator: Freeze \(D\), update \(G\) for one step by descending the gradient with respect to \(G\)'s parameters: \(\nabla_{\theta_G} V(G, D)\).
      • Repeat.
    • Often \(k > 1\) (e.g., \(k=5\)) to ensure \(D\) remains relatively well-optimized.
  • Training Considerations
    • Balance: Crucial to maintain balance between \(G\) and \(D\). If \(D\) becomes too strong too quickly, gradients for \(G\) can vanish. If \(D\) is too weak, \(G\) receives poor guidance.
    • Computational Cost: Training is computationally intensive due to the need to train two networks and potentially perform multiple discriminator updates per generator update.
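
A minimal sketch of this alternating loop on a toy one-dimensional problem; everything here (the \(\mathcal{N}(4, 1)\) "real" data, the latent dimension, the learning rates, and \(k = 5\)) is an illustrative assumption. It implements the original minimax losses, with \(D\)'s maximization written as binary cross-entropy:

```python
import torch
import torch.nn as nn

# Toy setup: "real" data is 1-D samples from N(4, 1).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()
k, batch = 5, 64

for step in range(200):
    # Train D for k steps; G is "frozen" because G(z) is detached
    # and only opt_D is stepped.
    for _ in range(k):
        x_real = 4.0 + torch.randn(batch, 1)        # x ~ p_data
        x_fake = G(torch.randn(batch, 8)).detach()  # z ~ p_z
        # Maximizing V(G, D) == minimizing BCE with labels real=1, fake=0.
        loss_D = bce(D(x_real), torch.ones(batch, 1)) \
               + bce(D(x_fake), torch.zeros(batch, 1))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Train G for one step; D is "frozen" because only opt_G is stepped.
    d_fake = D(G(torch.randn(batch, 8)))
    loss_G = torch.log(1.0 - d_fake + 1e-8).mean()  # minimize E[log(1 - D(G(z)))]
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(1000, 8)).mean().item())  # with luck, drifts toward 4.0
```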

Proposition 1: Optimal Discriminator

  • For a fixed generator \(G\), the optimal discriminator \(D^*(x)\) that maximizes \(V(G, D)\) is:

    \[ D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \]

    where \(p_g(x)\) is the probability density of the samples generated by \(G\).

  • Derivation:

    • The discriminator’s objective is: \[ \max_D \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]
    • By the law of the unconscious statistician, this can be rewritten as: \[ \max_D \int_x p_{data}(x) \log D(x) dx + \int_x p_g(x) \log(1 - D(x)) dx \]
    • For each \(x\), maximize \(a \log y + b \log(1-y)\) with \(a = p_{data}(x)\), \(b = p_g(x)\), \(y = D(x)\).
    • Setting the derivative \(\frac{a}{y} - \frac{b}{1-y}\) to zero gives the maximum at \(y^* = \frac{a}{a+b}\), so: \[ D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \]

Proposition 2: GANs Minimize Jensen-Shannon Divergence

  • When \(D\) is optimal (\(D^*\)), the training objective for \(G\) corresponds to minimizing the Jensen-Shannon (JS) divergence between \(p_{data}\) and \(p_g\).
  • Kullback-Leibler (KL) Divergence: \[ D_{KL}(P || Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx \]
  • Jensen-Shannon (JS) Divergence: \[ D_{JS}(P || Q) = \frac{1}{2} D_{KL}(P || M) + \frac{1}{2} D_{KL}(Q || M) \] where \(M = \frac{1}{2}(P + Q)\). \(0 \le D_{JS}(P || Q) \le \log 2\).
  • Proof Sketch:
    • Substitute \(D^*(x)\) into \(V(G, D)\), writing \(m(x) = \frac{1}{2}(p_{data}(x) + p_g(x))\): \[ \begin{aligned} C(G) &= \max_D V(G, D) = V(G, D^*) \\ &= \mathbb{E}_{x \sim p_{data}}[\log D^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*(x))] \\ &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \\ &= \int p_{data}(x) \log \frac{p_{data}(x)}{2 m(x)} dx + \int p_g(x) \log \frac{p_g(x)}{2 m(x)} dx \\ &= \int p_{data}(x) \left( \log \frac{p_{data}(x)}{m(x)} - \log 2 \right) dx + \int p_g(x) \left( \log \frac{p_g(x)}{m(x)} - \log 2 \right) dx \\ &= (D_{KL}(p_{data} || m) - \log 2) + (D_{KL}(p_g || m) - \log 2) \\ &= D_{KL}(p_{data} || m) + D_{KL}(p_g || m) - 2 \log 2 \\ &= 2 D_{JS}(p_{data} || p_g) - 2 \log 2 \end{aligned} \]
    • Minimizing \(C(G)\) with respect to \(G\) is equivalent to minimizing \(D_{JS}(p_{data} || p_g)\), since \(-2 \log 2\) is a constant.
    • The minimum value of \(D_{JS}\) is \(0\), achieved if and only if \(p_{data} = p_g\).
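
Both propositions are easy to check numerically for discrete distributions. The sketch below (the two 5-outcome distributions are arbitrary examples) plugs the optimal discriminator \(D^*\) into \(V(G, D)\) and confirms it equals \(2 D_{JS}(p_{data} || p_g) - 2 \log 2\):

```python
import numpy as np

# Two discrete distributions over 5 outcomes, standing in for p_data and p_g.
p_data = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
p_g    = np.array([0.3, 0.1, 0.1, 0.2, 0.3])

# Proposition 1: the optimal discriminator.
d_star = p_data / (p_data + p_g)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# V(G, D*) computed directly from its definition...
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# ...and via Proposition 2: 2 * JSD - 2 log 2.
m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
print(np.isclose(v, 2 * jsd - 2 * np.log(2)))  # True
```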

Practical Training Algorithm (continued)

  • The theoretical analysis assumes \(D\) reaches its optimum \(D^*\) for every \(G\). In practice, \(D\) is optimized for only a finite number of steps (\(k\)): fully solving the inner maximization at every generator update would be computationally prohibitive and, on a finite dataset, would cause \(D\) to overfit.
  • Alternating optimization (typically \(k \in \{1, ..., 5\}\) steps for \(D\) per \(G\) update) is used to keep \(D\) near optimal while allowing \(G\) to change slowly.

Issues with Training

  • Vanishing Gradients
    • Problem: When \(D\) is confident and accurate (outputs close to 0 or 1), the gradient of \(\log(1 - D(G(z)))\) with respect to \(G\)'s parameters becomes very small (the loss saturates). This happens especially early in training, when \(G\) produces poor samples that \(D\) easily rejects, and it stalls the learning of \(G\).
    • Solution (Non-Saturating Loss): Instead of minimizing \(\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\), the generator maximizes \(\mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]\), which provides strong gradients early in training (see the first sketch after this list). The generator's objective becomes: \[ \mathcal{L}_G = - \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))] \]
  • Mode Collapse
    • Problem: The generator \(G\) learns to produce only a limited variety of samples, collapsing many modes of the true data distribution \(p_{data}\) into one or a few output modes. \(G\) finds a few samples that can easily fool the current \(D\), and stops exploring.
    • Potential Causes: \(G\) finds a local minimum in the adversarial game.
    • Mitigation Strategies:
      • Architectural changes
      • Alternative loss functions (e.g., Wasserstein GAN, which uses the Wasserstein distance as a similarity measure between distributions, providing smoother gradients and better mode coverage)
      • Unrolled GANs: The generator is optimized with respect to the state of \(D\) after the next \(k\) steps, discouraging \(G\) from exploiting local minima.
      • Mini-batch discrimination and gradient accumulation can also help.
  • Training Instability
    • Problem: The adversarial training process involves finding a Nash equilibrium, which is inherently more unstable than standard minimization problems. Updates improving one player (\(G\) or \(D\)) might worsen the objective for the other, leading to oscillations or divergence.
    • Mitigation Strategies:
      • Careful hyperparameter tuning
      • Architectural choices (e.g., Spectral Normalization)
      • Gradient penalties (e.g., WGAN-GP adds a penalty term related to the norm of the discriminator’s gradient with respect to its input, often evaluated at points interpolated between real and fake samples: \(\|\nabla_{\hat{x}} D(\hat{x})\|_2\)).
        • Explanation: In WGAN-GP, the gradient penalty enforces the 1-Lipschitz constraint on the critic by penalizing the squared deviation of the gradient norm from 1: \[ \lambda \mathbb{E}_{\hat{x}} \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \] where \(\hat{x}\) is sampled uniformly along straight lines between pairs of points from the real and generated data distributions (see the second sketch after this list).
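
The first sketch contrasts the two generator losses from the vanishing-gradients discussion above (a minimal illustration; the \(10^{-8}\) epsilon is just numerical padding). Per unit change in \(D(G(z))\), the saturating loss has derivative \(-1/(1 - D) \approx -1\) when \(D \approx 0\), while the non-saturating loss has derivative \(-1/D\), which amplifies the learning signal exactly where \(D\) confidently rejects the samples:

```python
import torch

def g_loss_saturating(d_fake):
    # Original minimax loss: minimize E[log(1 - D(G(z)))].
    return torch.log(1.0 - d_fake + 1e-8).mean()

def g_loss_non_saturating(d_fake):
    # Heuristic loss: minimize -E[log D(G(z))].
    return -torch.log(d_fake + 1e-8).mean()

# A confident discriminator assigns the fakes probability 0.01.
d = torch.full((1,), 0.01, requires_grad=True)
g_loss_saturating(d).backward()
print(d.grad)   # ~ -1.01: weak signal per unit change in D(G(z))
d.grad = None
g_loss_non_saturating(d).backward()
print(d.grad)   # ~ -100: two orders of magnitude stronger
```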
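The second sketch implements the WGAN-GP penalty term. Shapes assume flat feature vectors; \(\lambda = 10\) is the value commonly used with WGAN-GP, and the linear critic is an arbitrary stand-in (WGAN critics output unbounded scores, with no sigmoid):

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    # x_hat: points sampled uniformly on segments between real and fake samples.
    eps = torch.rand(x_real.size(0), 1)  # one interpolation weight per sample
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # the penalty itself must be differentiable
    )
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()  # push ||grad|| toward 1

critic = torch.nn.Linear(8, 1)  # stand-in critic: unbounded scores, no sigmoid
print(gradient_penalty(critic, torch.randn(16, 8), torch.randn(16, 8)))
```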

StyleGANs

  • Overview
    • StyleGANs are a family of GAN architectures known for generating high-resolution, high-fidelity images (e.g., 1024x1024 faces).
  • Key Innovations
    • Progressive Growing (introduced in PGGAN, adopted by StyleGAN)
      • Training starts with low-resolution images (e.g., 4x4).
      • New layers are progressively added to both \(G\) and \(D\) to increase the resolution (e.g., 8x8, 16x16, …), fine-tuning the network at each stage. This stabilizes training for high resolutions.
    • Style-Based Generator
      • Mapping Network: A separate network (e.g., an MLP) maps the initial latent code \(z \sim p_z(z)\) to an intermediate latent vector \(w \in W\); this intermediate space tends to be better disentangled than the input latent space.
      • Adaptive Instance Normalization (AdaIN): The intermediate latent \(w\) is transformed into style parameters (scale \(\mathbf{y}_s\) and bias \(\mathbf{y}_b\)) which modulate the activations at each layer of the synthesis network. \[ \text{AdaIN}(x_i, w) = \mathbf{y}_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + \mathbf{y}_{b,i} \] where \(x_i\) are the activations of the \(i\)-th feature map, \(\mu(x_i)\) and \(\sigma(x_i)\) are the mean and standard deviation, and \(\mathbf{y}_{s,i}, \mathbf{y}_{b,i}\) are produced from \(w\) by learned affine transformations (see the sketch after this list).
      • Constant Input & Noise Injection: The synthesis network starts from a learned constant tensor instead of the latent code directly. Stochasticity (e.g., hair placement, freckles) is introduced by adding learned per-pixel noise at each layer, scaled by learned factors.
      • Hierarchical Control: Injecting \(w\) (via AdaIN) at different layers controls different levels of attributes: early layers affect coarse features (pose, shape), while later layers affect finer details (color scheme, micro-textures).
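
A sketch of the AdaIN operation; the 512-dim \(w\), the scale initialized around 1, and the tensor shapes are illustrative choices, and StyleGAN's actual implementation differs in details:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize each feature map per
    sample, then apply per-channel scale y_s and bias y_b predicted from w."""
    def __init__(self, w_dim: int, num_channels: int):
        super().__init__()
        # Learned affine transform: w -> (y_s, y_b), one pair per channel.
        self.affine = nn.Linear(w_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); w: (batch, w_dim)
        y_s, y_b = self.affine(w).chunk(2, dim=1)  # (batch, C) each
        y_s = y_s[:, :, None, None] + 1.0          # scale around 1 (illustrative)
        y_b = y_b[:, :, None, None]
        mu = x.mean(dim=(2, 3), keepdim=True)      # per-sample, per-channel stats
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        return y_s * (x - mu) / sigma + y_b

# Illustrative shapes: 512-dim w modulating a 64-channel 16x16 feature map.
ada = AdaIN(w_dim=512, num_channels=64)
out = ada(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 16, 16])
```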

Image-to-Image Translation

  • Task: Learn a mapping function to translate an image from a source domain A to a target domain B (e.g., segmentation maps to photos, sketches to photos, day to night).
  • Pix2Pix
    • A conditional GAN (cGAN) approach where the generator is conditioned on the input image from domain A. \(G: A \to B\).
    • Requires Paired Data: Training needs datasets where each image \(x_A\) has a corresponding ground truth image \(x_B\).
    • Loss Function: Combines a conditional GAN loss (discriminator \(D\) tries to distinguish real pairs \((x_A, x_B)\) from fake pairs \((x_A, G(x_A))\)) with a reconstruction loss, typically L1 loss: \[ \mathcal{L}_{L1}(G) = \mathbb{E}_{x_A, x_B}[||x_B - G(x_A)||_1] \] L1 loss is often preferred over L2 (MSE) as it encourages less blurring in the generated images.
    • Architecture: Often uses a “U-Net” architecture for the generator and a “PatchGAN” discriminator (classifies patches of the image as real/fake).
  • CycleGAN
    • Designed for Unpaired Data: Can learn translations when there is no direct one-to-one mapping between images in domain A and domain B (e.g., translating photos of horses to zebras, or photographs to Monet paintings).
    • Architecture: Uses two generators (\(G_{A \to B}\) and \(G_{B \to A}\)) and two discriminators (\(D_A\) and \(D_B\)).
      • \(G_{A \to B}\): Translates A to B. \(D_B\): Distinguishes real B images from generated \(\hat{x}_B = G_{A \to B}(x_A)\).
      • \(G_{B \to A}\): Translates B to A. \(D_A\): Distinguishes real A images from generated \(\hat{x}_A = G_{B \to A}(x_B)\).
    • Cycle Consistency Loss: Ensures that translating an image from A to B and back to A should recover the original image (and vice-versa). \[ \mathcal{L}_{cyc}(G_{A \to B}, G_{B \to A}) = \mathbb{E}_{x_A}[||G_{B \to A}(G_{A \to B}(x_A)) - x_A||_1] + \mathbb{E}_{x_B}[||G_{A \to B}(G_{B \to A}(x_B)) - x_B||_1] \]
    • Total Loss: Sum of the adversarial losses for both mappings and the cycle consistency loss. \[ \mathcal{L}_{total} = \mathcal{L}_{GAN}(G_{A \to B}, D_B) + \mathcal{L}_{GAN}(G_{B \to A}, D_A) + \lambda \mathcal{L}_{cyc}(G_{A \to B}, G_{B \to A}) \] where \(\lambda\) is a hyperparameter controlling the cycle consistency term.
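
A sketch of the cycle-consistency term; the linear stand-in generators and 8-dim "images" are only there to make the call runnable. The same mean-absolute-error form is the L1 reconstruction loss Pix2Pix uses:

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_AB, G_BA, x_A, x_B):
    """L1 cycle loss: A -> B -> A and B -> A -> B should reconstruct the input."""
    loss_A = (G_BA(G_AB(x_A)) - x_A).abs().mean()  # ||G_BA(G_AB(x_A)) - x_A||_1
    loss_B = (G_AB(G_BA(x_B)) - x_B).abs().mean()  # ||G_AB(G_BA(x_B)) - x_B||_1
    return loss_A + loss_B

# Stand-in generators (same-sized linear maps) just to show the call.
G_AB = nn.Linear(8, 8)
G_BA = nn.Linear(8, 8)
print(cycle_consistency_loss(G_AB, G_BA, torch.randn(4, 8), torch.randn(4, 8)))
```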

Other Applications & Extensions

  • Video-to-Video Translation (Vid2Vid)
    • Extends image translation to video, requiring temporal consistency across generated frames.
  • 3D GANs
    • Generate 3D shapes or scenes, often represented as voxel grids, point clouds, or implicit neural representations.