Week 5

Published: Wednesday, March 19, 2025

Generative Models

  • The primary objective of generative modeling is to learn the underlying probability distribution \(p(\text{data})\) of a dataset—either explicitly or implicitly—so that we can sample new data points from it.
    • Implicit Models: Define a stochastic procedure to generate samples without an explicit density.
      • Examples: Generative Adversarial Networks (GANs), Markov Chains.
    • Explicit Models: Specify a parametric form for the density \(p(\text{data})\).
      • Tractable Explicit Models: The density can be evaluated directly.
        • Examples: Autoregressive models (PixelRNN, PixelCNN), Normalizing Flows.
      • Approximate Explicit Models: The density is intractable; use approximations.
        • Examples: Variational Autoencoders (VAEs), Boltzmann Machines (via MCMC).

Autoencoders (AEs)

  • Autoencoders are neural networks for unsupervised representation learning and dimensionality reduction.
    • Encoder \(f_{\phi}\): maps \(x\) to latent \(z = f_{\phi}(x)\).
    • Decoder \(g_{\theta}\): reconstructs \(\hat{x} = g_{\theta}(z)\) from \(z\).
    • Assumes data lie on a low-dimensional manifold embedded in the input space.
    • Desirable latent properties:
      • Smoothness: small changes in \(z\) yield small, meaningful changes in \(\hat{x}\).
      • Disentanglement: each dimension of \(z\) corresponds to a distinct factor of variation.
  • PCA as a Linear Autoencoder: PCA minimizes \(L_2\) reconstruction loss under orthonormality constraints.
  • Nonlinear Autoencoders: use neural networks for the encoder and decoder and minimize the reconstruction error \(\|x - \hat{x}\|_2^2\) (MSE); see the sketch after this list.
  • Latent Dimension:
    • Undercomplete: \(\dim(Z) < \dim(X)\) for compression.
    • Overcomplete: \(\dim(Z) \ge \dim(X)\); needs extra regularization so the network does not simply learn the identity map (e.g. denoising, sparse coding).
      • Denoising Autoencoders: Train to reconstruct clean \(x\) from corrupted \(x'\), loss computed on \(x\).
  • Limitation: Standard AEs lack a structured latent space; sampling arbitrary \(z\) often yields poor outputs.
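
A minimal sketch of an undercomplete autoencoder trained with the MSE reconstruction loss, assuming PyTorch; the layer widths, latent size, and the 784-dimensional (flattened-image) input are illustrative choices, and the final lines show the denoising variant (corrupt the input, score against the clean \(x\)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """Undercomplete AE: dim(Z) = 32 < dim(X) = 784 forces compression."""
    def __init__(self, x_dim: int = 784, z_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(              # f_phi: x -> z
            nn.Linear(x_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        self.decoder = nn.Sequential(              # g_theta: z -> x_hat
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, x_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                            # stand-in batch of flattened images

# Denoising variant: reconstruct the clean x from a corrupted input x'.
x_noisy = x + 0.1 * torch.randn_like(x)
opt.zero_grad()
x_hat = model(x_noisy)
loss = F.mse_loss(x_hat, x)                        # ||x - x_hat||_2^2, computed on the clean x
loss.backward()
opt.step()
```
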

Variational Autoencoders (VAEs)

  • VAEs learn a continuous, structured latent space suitable for generation.

  • Encoder (Recognition Model) \(q_{\phi}(z|x)\):

    • Outputs parameters of a Gaussian: mean \(\mu_{\phi}(x)\) and log-variance \(\log \sigma^2_{\phi}(x)\).
    • Defines \(q_{\phi}(z|x) = \mathcal{N}(z; \mu_{\phi}(x), \mathrm{diag}(\sigma^2_{\phi}(x)))\).
  • Reparameterization Trick:

    • Sample \(\epsilon \sim \mathcal{N}(0,I)\).
    • Set \(z = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon\), permitting backpropagation.
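
A minimal sketch of this step, assuming PyTorch; `mu` and `log_var` stand for the encoder outputs \(\mu_{\phi}(x)\) and \(\log \sigma^2_{\phi}(x)\).

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients flow through mu and log_var.
    """
    sigma = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(sigma)      # eps ~ N(0, I), same shape as sigma
    return mu + sigma * eps
```
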
  • Regularization & Posterior Collapse:

    • Without a regularizer on the latent distribution, the encoder can drive \(\sigma_{\phi}(x)\to 0\), and the model degenerates into a deterministic autoencoder.
    • Posterior collapse: \(q_{\phi}(z|x)\) collapses to the prior, so \(z\) carries no information about \(x\) and the decoder ignores it.
    • Remedy: add \(D_{KL}(q_{\phi}(z|x)\|p(z))\) to the loss, where \(p(z)=\mathcal{N}(0,I)\); this penalizes vanishing variance and keeps \(q_{\phi}(z|x)\) close to the prior.
  • Derivation of the Objective Function (Evidence Lower Bound - ELBO):

    • Goal: maximize the marginal likelihood \[ p_{\theta}(x) = \int p_{\theta}(x,z)\,dz. \]
    • Insert approximate posterior \(q_{\phi}(z|x)\): \[ \log p_{\theta}(x) = \log \int q_{\phi}(z|x)\,\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\,dz = \log \mathbb{E}_{q_{\phi}(z|x)}\Bigl[\tfrac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\Bigr]. \]
    • Apply Jensen’s inequality (\(\log\) concave): \[ \log p_{\theta}(x) \;\ge\; \mathbb{E}_{q_{\phi}(z|x)}\Bigl[\log\tfrac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\Bigr]. \]
    • Decompose the expectation: \[ \mathbb{E}_{q_{\phi}(z|x)}\Bigl[\log\tfrac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\Bigr] = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \mathbb{E}_{q_{\phi}(z|x)}[\log p(z)] - \mathbb{E}_{q_{\phi}(z|x)}[\log q_{\phi}(z|x)]. \]
    • Recognize \[ \mathbb{E}_{q_{\phi}(z|x)}[\log p(z)] - \mathbb{E}_{q_{\phi}(z|x)}[\log q_{\phi}(z|x)] = -D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
    • In fact, using \(p_{\theta}(x,z) = p_{\theta}(z|x)\,p_{\theta}(x)\), the bound tightens to an exact decomposition: \[ \log p_{\theta}(x) = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{(1)\,\text{Reconstruction}} \;-\;\underbrace{D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr)}_{(2)\,\text{Prior matching}} \;+\;\underbrace{D_{KL}\bigl(q_{\phi}(z|x)\|p_{\theta}(z|x)\bigr)}_{(3)\,\text{Gap}\,\ge\,0}. \]
    • The ELBO is terms (1) and (2): \[ \mathrm{ELBO}(\phi,\theta;x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
    • Maximizing the ELBO maximizes a lower bound on \(\log p_{\theta}(x)\).
    • The VAE loss (to minimize) is the negative ELBO: \[ L_{VAE}(\phi,\theta;x) = -\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
  • Analytical KL Divergence:

    • If \(q_{\phi}(z|x)=\mathcal{N}(z;\mu_{\phi}(x),\mathrm{diag}(\sigma^2_{\phi}(x)))\) and \(p(z)=\mathcal{N}(0,I)\), then \[ D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr) = \tfrac{1}{2}\sum_{j=1}^{D}\bigl(\sigma_j^2(x)+\mu_j^2(x) -\log\sigma_j^2(x)-1\bigr). \]
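    • Sanity check: if \(\mu_j(x)=0\) and \(\sigma_j^2(x)=1\) for every \(j\), each summand is \(1 + 0 - \log 1 - 1 = 0\), so \(D_{KL}=0\), exactly as expected when \(q_{\phi}(z|x)\) already matches the prior.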
  • Generation After Training:

    • Sample \(z\sim p(z)\) and decode via \(p_{\theta}(x|z)\).
    • Without conditioning, attributes (e.g. class) are uncontrolled.
    • Conditional VAEs (CVAEs): condition on labels \(y\), use \(q_{\phi}(z|x,y)\) and \(p_{\theta}(x|z,y)\) for controlled generation.
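
A minimal end-to-end sketch tying this section together, assuming PyTorch, a Gaussian encoder, and a Bernoulli decoder over 784-dimensional inputs (so the reconstruction term is a binary cross-entropy); the layer widths and latent dimension are illustrative, not prescribed by the notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim: int = 784, h_dim: int = 256, z_dim: int = 20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)          # mu_phi(x)
        self.log_var = nn.Linear(h_dim, z_dim)     # log sigma^2_phi(x)
        self.dec = nn.Sequential(                  # outputs logits of p_theta(x|z)
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), mu, log_var

def vae_loss(x, logits, mu, log_var):
    """Negative ELBO: reconstruction NLL + analytical KL(q(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - log_var - 1)
    return recon + kl

model = VAE()
x = torch.rand(64, 784)                            # stand-in batch scaled to [0, 1]
logits, mu, log_var = model(x)
loss = vae_loss(x, logits, mu, log_var)
loss.backward()

# Generation after training: sample z ~ p(z) = N(0, I) and decode.
with torch.no_grad():
    z = torch.randn(16, 20)
    samples = torch.sigmoid(model.dec(z))
```
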

β-VAEs

  • β-VAEs introduce a hyperparameter β (>1) to the ELBO: \[ L_{\beta\text{-VAE}}(\phi,\theta;x) = -\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \beta\,D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]

  • Rationale for Disentanglement:

    • The KL term matches \(q_{\phi}(z|x)\) to an isotropic Gaussian \(p(z)=\mathcal{N}(0,I)\).
    • Isotropy implies zero mean, unit variance, and independence across latent dimensions.
    • By upweighting the KL term (β>1), β-VAEs more strongly penalize any deviation from:
      • unit variance in each \(z_j\),
      • zero covariance between distinct latent axes.
    • This stronger pressure reduces correlations among latent dimensions, encouraging each \(z_j\) to capture an independent factor of variation.
  • Constrained Optimization Perspective:

    • Maximize the reconstruction term \(\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]\) subject to \(D_{KL}(q_{\phi}(z|x)\|p(z))\le \epsilon_0\).
    • β acts as the Lagrange multiplier of this constraint: a larger β corresponds to a smaller ε₀, i.e. a tighter limit on the information capacity of the latent code.
    • A tighter constraint forces a more compressed, factorized (disentangled) representation.
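    • Written out (a sketch of the standard Lagrangian/KKT form, with multiplier \(\beta \ge 0\)): \[ \mathcal{F}(\phi,\theta,\beta;x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \beta\bigl(D_{KL}(q_{\phi}(z|x)\|p(z)) - \epsilon_0\bigr); \] dropping the constant \(\beta\,\epsilon_0\) and negating recovers \(L_{\beta\text{-VAE}}\) above.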
  • Disentanglement vs. Reconstruction Trade-off:

    • Increasing β improves disentanglement but may degrade reconstruction quality.
    • The model prioritizes matching \(p(z)\) over reconstructing \(x\) perfectly.
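
Relative to the VAE sketch above, the only change is the β weight on the KL term (again assuming PyTorch; the arguments mirror the hypothetical `vae_loss` defined there, and β = 4 is just an illustrative value).

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, logits, mu, log_var, beta: float = 4.0):
    """Negative beta-ELBO: reconstruction NLL + beta * KL (beta > 1 upweights the KL)."""
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - log_var - 1)
    return recon + beta * kl
```
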