Markov Chain Monte Carlo

Overview

Turn posterior sampling into Markov-chain design for cases where variational approximations fall short.

Figure 1: Directed graphical model of a Markov chain. The random variable \(X_{t+1}\) is conditionally independent of the random variables \(X_{0:t-1}\) given \(X_t\).

Figure 2: Transition graphs of Markov chains: (1) is not ergodic as its transition diagram is not strongly connected; (2) is not ergodic for the same reason; (3) is irreducible but periodic and therefore not ergodic; (4) is ergodic with stationary distribution \(\pi(1) = \frac{2}{3}, \pi(2) = \frac{1}{3}\).

Markov-chain preliminaries

  • Homogeneous chain with transition kernel \(P(x'\mid x)\); \(k\)-step transition \(P^k\).
  • Stationary distribution \(\pi\) solves \(\pi^\top P = \pi^\top\). If chain is irreducible and aperiodic (ergodic), \(q_t\to\pi\) for any start (Fundamental theorem; see Figure 2 ).
  • Detailed balance \(\pi(x)P(x'\mid x)=\pi(x')P(x\mid x')\) suffices for stationarity (reversibility).
  • Distance to stationarity measured in total variation \(\|q_t-\pi\|_{\mathrm{TV}}\); mixing time \(\tau_{\mathrm{mix}}(\epsilon)\) ≈ steps to get within \(\epsilon\).

Metropolis–Hastings (MH)

  • Goal: target density \(p(x)=q(x)/Z\) (unnormalized \(q\) known).
  • Proposal \(r(x'\mid x)\); accept with \[\alpha(x'\mid x)=\min\left\{1,\frac{q(x')r(x\mid x')}{q(x)r(x'\mid x)}\right\}.\]
  • Ensures detailed balance w.r.t. \(p\).
  • Tuning: poor proposals ⇒ slow mixing (random walk). Adaptive schemes adjust step size to hit desired acceptance (≈0.234 in high-dims for isotropic random walk).

Gradient-informed proposals

  • MALA: \(x' = x + \tfrac{\epsilon^2}{2}\nabla\log q(x) + \epsilon\xi\) (Langevin step) then MH accept; leverages gradient, useful in high dimension.
  • HMC: introduce momentum \(p\sim\mathcal{N}(0,M)\), simulate Hamiltonian dynamics, accept via MH. Long, low-rejection moves reduce random walk behaviour; requires gradient of \(\log q\).
  • Riemannian manifold HMC: adapt metric to local curvature.
Figure 3: Metropolis–Hastings proposes uninformed jumps; Langevin dynamics follows the gradient of \(f(\theta)\) for more directed moves.
Figure 4: Hamiltonian Monte Carlo lifts states into \((x,y)\) space, simulates Hamiltonian dynamics, and projects back to propose distant points with high acceptance.

Gibbs sampling

  • Cycle through coordinates (or blocks): sample \(x_i \sim p(x_i\mid x_{-i})\).
  • Special case of MH with unit acceptance. Works when all conditionals easy; order can be fixed or random (random scan preserves detailed balance).
  • Block Gibbs improves mixing when variables correlated.

Practical workflow

  1. Initialize (multiple chains if possible).
  2. Burn-in until diagnostics (trace plots, potential scale reduction \(\hat{R}\)) stabilize.
  3. Collect samples, optionally thin to reduce storage (does not improve ESS).
  4. Estimate expectations via Monte Carlo averages; report Monte Carlo error using ESS or batch means.

Energy interpretation

  • Target \(p(x) \propto e^{-E(x)}\) with energy \(E(x)=-\log q(x)\).
  • Random-walk MH explores energy landscape slowly; Langevin/HMC follow gradient/level sets ⇒ faster exploration.
  • Simulated tempering / parallel tempering mix chains at different temperatures to escape local modes.

When to choose MCMC

  • High-fidelity posterior samples needed (e.g., small models with multimodality) and gradient evaluations affordable.
  • Combine with VI: use VI for initialization, then MCMC for refinement (e.g., Laplace+MCMC).
  • Expensive in high dimension; mixing diagnostics essential.

MCMC keeps normalization outside the loop; designing good proposals is the art. Understanding irreducibility, acceptance ratios, and diagnostics is crucial before scaling to Bayesian deep learning or world-model uncertainty.