Week 5
Generative Models
- The primary objective of generative modeling is to learn the underlying probability distribution \(p_{\text{data}}(x)\) of a dataset—either explicitly or implicitly—so that we can sample new data points from it.
- Implicit Models: Define a stochastic procedure to generate samples without an explicit density.
- Examples: Generative Adversarial Networks (GANs), Markov chain–based models (e.g. Generative Stochastic Networks).
- Explicit Models: Specify a parametric form for the density \(p(\text{data})\).
- Tractable Explicit Models: The density can be evaluated directly.
- Examples: Autoregressive models (PixelRNN, PixelCNN), Normalizing Flows.
- Approximate Explicit Models: The density is intractable; use approximations.
- Examples: Variational Autoencoders (VAEs), Boltzmann Machines (via MCMC).
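- For intuition (background, not stated explicitly above): autoregressive models are tractable because the chain rule factorizes the joint density into conditionals that can each be evaluated exactly, \[ p_{\theta}(x) = \prod_{i=1}^{D} p_{\theta}\bigl(x_i \mid x_1,\dots,x_{i-1}\bigr). \]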
Autoencoders (AEs)
- Autoencoders are neural networks for unsupervised representation learning and dimensionality reduction.
- Encoder \(f_{\phi}\): maps \(x\) to latent \(z = f_{\phi}(x)\).
- Decoder \(g_{\theta}\): reconstructs \(\hat{x} = g_{\theta}(z)\) from \(z\).
- Assumes data lie on a low-dimensional manifold embedded in the input space.
- Desirable latent properties:
- Smoothness: small changes in \(z\) yield small, meaningful changes in \(\hat{x}\).
- Disentanglement: each dimension of \(z\) corresponds to a distinct factor of variation.
- PCA as a Linear Autoencoder: PCA minimizes \(L_2\) reconstruction loss under orthonormality constraints.
- Nonlinear Autoencoders: Use neural networks for \(f_{\phi}\) and \(g_{\theta}\), trained with the MSE reconstruction loss \(\|x - \hat{x}\|_2^2\) (see the sketch after this list).
- Latent Dimension:
- Undercomplete: \(\dim(Z) < \dim(X)\) for compression.
- Overcomplete: \(\dim(Z) \ge \dim(X)\) (e.g. denoising, sparse coding).
- Denoising Autoencoders: Train to reconstruct clean \(x\) from corrupted \(x'\), loss computed on \(x\).
- Limitation: Standard AEs lack a structured latent space; sampling arbitrary \(z\) often yields poor outputs.
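A minimal PyTorch sketch of an undercomplete nonlinear autoencoder with the MSE reconstruction objective. The layer widths, `latent_dim = 32`, and the MNIST-sized `input_dim = 784` are illustrative assumptions, not from the notes:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: dim(Z) < dim(X), trained with MSE reconstruction loss."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder f_phi: x -> z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder g_theta: z -> x_hat
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.randn(16, 784)                      # dummy batch
loss = nn.functional.mse_loss(model(x), x)    # ||x - x_hat||^2
loss.backward()
```

For a denoising autoencoder, the same model would be fed a corrupted input (e.g. `x + torch.randn_like(x) * 0.1`) while the loss is still computed against the clean `x`.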
Variational Autoencoders (VAEs)
VAEs learn a continuous, structured latent space suitable for generation.
Encoder (Recognition Model) \(q_{\phi}(z|x)\):
- Outputs parameters of a Gaussian: mean \(\mu_{\phi}(x)\) and log-variance \(\log \sigma^2_{\phi}(x)\).
- Defines \(q_{\phi}(z|x) = \mathcal{N}(z; \mu_{\phi}(x), \mathrm{diag}(\sigma^2_{\phi}(x)))\).
Reparameterization Trick:
- Sample \(\epsilon \sim \mathcal{N}(0,I)\).
- Set \(z = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon\), permitting backpropagation.
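A minimal sketch of the Gaussian recognition model with the reparameterization step. The architecture and the names `mu_head`/`logvar_head` are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Outputs mu_phi(x) and log sigma^2_phi(x), then samples z via reparameterization."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and logvar.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return z, mu, logvar
```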
Regularization & Posterior Collapse:
- Without regularization, the encoder can drive \(\sigma_{\phi}(x)\to 0\), so the VAE degenerates into a deterministic autoencoder.
- Posterior collapse: the decoder learns to ignore \(z\), so the latent code carries no information about \(x\).
- Remedy: add \(D_{KL}(q_{\phi}(z|x)\|p(z))\) to the loss, where \(p(z)=\mathcal{N}(0,I)\).
Derivation of the Objective Function (Evidence Lower Bound - ELBO):
- Goal: maximize the marginal likelihood \[ p_{\theta}(x) = \int p_{\theta}(x,z)\,dz. \]
- Insert approximate posterior \(q_{\phi}(z|x)\): \[ \log p_{\theta}(x) = \log \int q_{\phi}(z|x)\,\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\,dz = \log \mathbb{E}_{q_{\phi}(z|x)}\Bigl[\tfrac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\Bigr]. \]
- Apply Jensen’s inequality (\(\log\) concave): \[ \log p_{\theta}(x) \;\ge\; \mathbb{E}_{q_{\phi}(z|x)}\Bigl[\log\tfrac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\Bigr]. \]
- Decompose the expectation: \[ \mathbb{E}_{q_{\phi}(z|x)}\Bigl[\log\tfrac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\Bigr] = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \mathbb{E}_{q_{\phi}(z|x)}[\log p(z)] - \mathbb{E}_{q_{\phi}(z|x)}[\log q_{\phi}(z|x)]. \]
- Recognize \[ \mathbb{E}_{q_{\phi}(z|x)}[\log p(z)] - \mathbb{E}_{q_{\phi}(z|x)}[\log q_{\phi}(z|x)] = -D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
- In fact, the bound can be tightened to an exact decomposition, where the gap is the KL divergence between the approximate and true posteriors: \[ \log p_{\theta}(x) = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{(1)\,\text{Reconstruction}} \;-\;\underbrace{D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr)}_{(2)\,\text{Prior matching}} \;+\;\underbrace{D_{KL}\bigl(q_{\phi}(z|x)\|p_{\theta}(z|x)\bigr)}_{(3)\,\text{Gap}\,\ge\,0}. \]
- The ELBO is terms (1) and (2): \[ \mathrm{ELBO}(\phi,\theta;x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
- Maximizing the ELBO maximizes a lower bound on \(\log p_{\theta}(x)\).
- The VAE loss (to minimize) is the negative ELBO: \[ L_{VAE}(\phi,\theta;x) = -\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
Analytical KL Divergence:
- If \(q_{\phi}(z|x)=\mathcal{N}(z;\mu_{\phi}(x),\mathrm{diag}(\sigma^2_{\phi}(x)))\) and \(p(z)=\mathcal{N}(0,I)\), then \[ D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr) = \tfrac{1}{2}\sum_{j=1}^{D}\bigl(\sigma_j^2(x)+\mu_j^2(x) -\log\sigma_j^2(x)-1\bigr). \]
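Putting the pieces together, a sketch of the VAE loss \(L_{VAE}\) using the closed-form KL above. The MSE reconstruction term assumes a Gaussian decoder with fixed variance and a single Monte Carlo sample of \(z\); this is a common choice rather than something specified in the notes:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO = reconstruction term + KL(q_phi(z|x) || N(0, I))."""
    # Reconstruction: -E_q[log p_theta(x|z)], approximated with one z sample;
    # under a fixed-variance Gaussian decoder this is MSE up to constants.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Analytical KL: 0.5 * sum(sigma^2 + mu^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - logvar - 1)
    return recon + kl
```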
Generation After Training:
- Sample \(z\sim p(z)\) and decode via \(p_{\theta}(x|z)\).
- Without conditioning, attributes (e.g. class) are uncontrolled.
- Conditional VAEs (CVAEs): condition on labels \(y\), use \(q_{\phi}(z|x,y)\) and \(p_{\theta}(x|z,y)\) for controlled generation.
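A common way to implement the conditioning (an illustrative choice; the notes do not specify the mechanism) is to concatenate a one-hot encoding of \(y\) to the inputs of both networks:

```python
import torch

def cvae_inputs(x, z, y, num_classes=10):
    """Concatenate a one-hot label to the encoder and decoder inputs (x, z are 2D batches)."""
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    encoder_in = torch.cat([x, y_onehot], dim=1)  # fed to q_phi(z | x, y)
    decoder_in = torch.cat([z, y_onehot], dim=1)  # fed to p_theta(x | z, y)
    return encoder_in, decoder_in
```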
β-VAEs
β-VAEs introduce a hyperparameter \(\beta\) (typically \(\beta > 1\)) that reweights the KL term of the VAE loss: \[ L_{\beta\text{-VAE}}(\phi,\theta;x) = -\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \beta\,D_{KL}\bigl(q_{\phi}(z|x)\|p(z)\bigr). \]
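In code, \(\beta\) simply reweights the KL term of the VAE loss sketched earlier (the default `beta=4.0` is an arbitrary illustrative value):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """Negative beta-ELBO: reconstruction + beta * KL; beta > 1 upweights prior matching."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - logvar - 1)
    return recon + beta * kl
```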
Rationale for Disentanglement:
- The KL term matches \(q_{\phi}(z|x)\) to an isotropic Gaussian \(p(z)=\mathcal{N}(0,I)\).
- Isotropy implies zero mean, unit variance, and independence across latent dimensions.
- By upweighting the KL term (β>1), β-VAEs more strongly penalize any deviation from:
- unit variance in each \(z_j\),
- zero covariance between distinct latent axes.
- This stronger pressure reduces correlations among latent dimensions, encouraging each \(z_j\) to capture an independent factor of variation.
Constrained Optimization Perspective:
- The β-VAE objective can be derived by maximizing the reconstruction term \(\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]\) subject to the constraint \(D_{KL}(q_{\phi}(z|x)\|p(z))\le \epsilon_0\).
- β acts as the Lagrange multiplier of this constraint: a larger β corresponds to a smaller \(\epsilon_0\), i.e. a tighter constraint on the information capacity of the latent code.
- A tighter constraint forces a more compressed, factorized (disentangled) representation.
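- Written out (following the standard constrained-optimization derivation of β-VAE), the Lagrangian is \[ \mathcal{F}(\phi,\theta;x,\beta) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \beta\bigl(D_{KL}(q_{\phi}(z|x)\|p(z)) - \epsilon_0\bigr); \] since \(\beta\epsilon_0\) does not depend on \(\phi\) or \(\theta\), maximizing it is equivalent to minimizing \(L_{\beta\text{-VAE}}\) above.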
Disentanglement vs. Reconstruction Trade-off:
- Increasing β improves disentanglement but may degrade reconstruction quality.
- The model prioritizes matching \(p(z)\) over reconstructing \(x\) perfectly.