Model-based Reinforcement Learning

Overview

Final piece: learn a “world model”, plan through it, and fold epistemic uncertainty into exploration and safety.

Figure 1: Illustration of planning with epistemic uncertainty and Monte Carlo sampling. The agent considers \(m\) alternative “worlds”. Within each world, it plans a sequence of actions over a finite time horizon. Then, the agent averages the optimal initial actions across all worlds. Crucially, each world by itself is fixed. That is, its transition model (i.e., the aleatoric uncertainty of the model) is constant.

Outer loop recap

  1. Collect experience with current policy \(\pi\) (real environment).
  2. Learn dynamics \(f_\theta(x,a)\approx p(x'\mid x,a)\) and reward model \(r_\theta(x,a)\) (a.k.a. world model).
  3. Plan/improve policy using the learned model (MPC, trajectory optimization, policy optimization in imagination).
  4. Repeat until convergence or performance threshold.

This loop reuses familiar value-learning machinery while adding model learning and planning; the key benefits are sample efficiency and the option of explicit safety analysis.
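
A minimal sketch of this outer loop, assuming hypothetical `env`, `model`, and `planner` objects with `reset`/`step`, `fit`, and `act` methods (placeholder interfaces chosen for illustration, not a specific library API):

```python
# Sketch of the model-based RL outer loop; env/model/planner are placeholders.
def model_based_rl(env, model, planner, num_iterations=10, episode_len=200):
    dataset = []                                   # real transitions (x, a, r, x')
    for _ in range(num_iterations):
        # 1. Collect experience with the current planner-induced policy.
        x = env.reset()
        for _ in range(episode_len):
            a = planner.act(model, x)              # 3. plan through the learned model
            x_next, r, done = env.step(a)
            dataset.append((x, a, r, x_next))
            x = x_next if not done else env.reset()
        # 2. Re-fit the dynamics and reward model on all data collected so far.
        model.fit(dataset)
    return model, planner
```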

Planning with the learned model

Figure 2: Illustration of trajectory sampling. High-reward states are shown in brighter colors. The agent iteratively plans a finite number of time steps into the future and picks the best initial action.

Deterministic dynamics

  • MPC / receding-horizon control: choose horizon \(H\), optimize action sequence \(a_{t:t+H-1}\), apply first action, replan next step.
  • Objective with terminal bootstrap (Eq. 13.2): \[J_H(a_{t:t+H-1}) = \sum_{\tau=t}^{t+H-1}\gamma^{\tau-t} r(x_\tau,a_\tau) + \gamma^H V(x_{t+H}).\]
  • Optimization tools: shooting (random action sampling), gradient-based optimization (backprop through the dynamics), and the cross-entropy method (CEM), which iteratively reweights samples toward high-return elites (see the sketch after this list).
  • Special case \(H=1\) recovers the greedy policy from dynamic programming.
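
A minimal sketch of a CEM-based receding-horizon planner under the deterministic-dynamics assumption; `f`, `r`, and `V` are assumed callables for the learned dynamics, reward, and terminal value, and the hyperparameters are illustrative defaults rather than recommended values.

```python
# Sketch of MPC via the cross-entropy method (CEM) with deterministic dynamics.
import numpy as np

def cem_plan(f, r, V, x0, horizon=15, action_dim=2, gamma=0.99,
             n_samples=500, n_elites=50, n_iters=5):
    """Return the first action of the best action sequence found by CEM."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        seqs = mean + std * np.random.randn(n_samples, horizon, action_dim)
        returns = np.empty(n_samples)
        for i, seq in enumerate(seqs):
            x, ret = x0, 0.0
            for t, a in enumerate(seq):            # roll out the learned model
                ret += gamma**t * r(x, a)
                x = f(x, a)
            ret += gamma**horizon * V(x)           # terminal bootstrap (Eq. 13.2)
            returns[i] = ret
        # Refit the sampling distribution to the highest-return "elite" sequences.
        elites = seqs[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]                                 # MPC: apply only the first action
```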

Stochastic dynamics

  • Trajectory sampling / sample average approximation: simulate multiple rollouts under a candidate action sequence and average the returns (sketched after this list). Use the reparameterization trick to express stochastic dynamics as \(x_{t+1}=g(x_t,a_t;\varepsilon_t)\) for Monte Carlo gradients.
  • Always replan to mitigate model error accumulation; keep horizon short (5–30 steps) to balance foresight and compounding error.
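
A minimal sketch of trajectory sampling for a fixed action sequence, assuming the reparameterized dynamics \(g\), reward \(r\), and terminal value \(V\) are available as callables (names chosen for illustration):

```python
# Sketch of Monte Carlo return estimation under stochastic dynamics
# x' = g(x, a, eps), i.e. the reparameterized form from the bullet above.
import numpy as np

def estimate_return(g, r, V, x0, actions, gamma=0.99, n_rollouts=20):
    """Average discounted return of a fixed action sequence over sampled noise."""
    horizon = len(actions)
    total = 0.0
    for _ in range(n_rollouts):
        x, ret = x0, 0.0
        for t, a in enumerate(actions):
            eps = np.random.randn(*np.shape(x0))   # exogenous noise sample
            ret += gamma**t * r(x, a)
            x = g(x, a, eps)                       # reparameterized transition
        ret += gamma**horizon * V(x)               # terminal bootstrap
        total += ret
    return total / n_rollouts
```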

Learning world models

  • Deterministic regressors: neural nets, RFF-based GPs, latent linear models.
  • Probabilistic dynamics: output a distribution over the next state (e.g., Gaussian with mean/covariance). Distinguish aleatoric uncertainty \(p(x_{t+1}\mid f, x_t, a_t)\) (transition noise given the model) from epistemic uncertainty \(p(f\mid \mathcal{D})\) (uncertainty about the model itself).
  • Ensembles / BNNs: maintain multiple models \(\{f^{(k)}\}\) (bootstrapped, variational) to approximate the epistemic distribution; used heavily in PETS (see the sketch after this list).
  • Latent SSMs: PlaNet, Dreamer learn stochastic latent dynamics via variational inference (ELBO). Planner operates in latent space; Dreamer also learns actor/critic in imagination (backprop through latent rollouts).
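
A minimal sketch of a bootstrapped dynamics ensemble used to approximate the epistemic distribution; `make_model` is assumed to return a probabilistic regressor with `fit` and `predict` (Gaussian mean/std) methods, e.g. a small neural network. These interfaces are placeholders, not the actual PETS code.

```python
# Sketch of a bootstrapped ensemble of probabilistic dynamics models.
import numpy as np

class DynamicsEnsemble:
    def __init__(self, make_model, n_members=5):
        self.members = [make_model() for _ in range(n_members)]

    def fit(self, transitions):
        """Fit each member on its own bootstrap resample of the data."""
        n = len(transitions)
        for member in self.members:
            idx = np.random.randint(0, n, size=n)      # sample with replacement
            member.fit([transitions[i] for i in idx])

    def sample_model(self):
        """Thompson-style draw: one member ~ approximate posterior p(f | D)."""
        return self.members[np.random.randint(len(self.members))]

    def predict_all(self, x, a):
        """Per-member Gaussian predictions: spread across members ~ epistemic
        uncertainty, per-member std ~ aleatoric uncertainty."""
        return [member.predict(x, a) for member in self.members]  # [(mean, std), ...]
```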

Planning under epistemic uncertainty

  • For each sampled model \(f^{(k)}\), perform standard planning (respecting aleatoric noise), then average returns/gradients across \(k\). This marginalizes over epistemic uncertainty and provides robustness (see the sketch after Figure 3).
  • PETS: ensemble of probabilistic networks + MPC via CEM. Handles both uncertainty types and achieves strong sample efficiency on MuJoCo-style continuous-control benchmarks; epistemic variance is largest in poorly explored regions along candidate plans.
  • Dreamer/PlaNet: optimize policy/value inside the learned latent world; imagined rollouts train the actor-critic entirely in imagination, without further environment interaction.
Figure 3: Information gain. The first graph shows the prior. The second graph shows a selection of samples with large information gain (large uncertainty reduction). The third graph shows a selection of samples with small information gain (small uncertainty reduction).
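
A minimal sketch of scoring a candidate action sequence by averaging over the ensemble, in the spirit of PETS and Figure 1: each member plays the role of one fixed “world” (aleatoric noise only), and returns are averaged across members to marginalize epistemic uncertainty. The `members` interface matches the hypothetical ensemble sketch above; `r` and `V` are assumed callables.

```python
# Sketch of PETS-style plan scoring: aleatoric noise inside each world,
# epistemic uncertainty marginalized by averaging across worlds.
import numpy as np

def score_plan(members, r, V, x0, actions, gamma=0.99, rollouts_per_member=4):
    returns = []
    for f in members:                                  # one fixed world per member
        for _ in range(rollouts_per_member):
            x, ret = x0, 0.0
            for t, a in enumerate(actions):
                ret += gamma**t * r(x, a)
                mean, std = f.predict(x, a)            # Gaussian next-state prediction
                x = mean + std * np.random.randn(*np.shape(mean))
            ret += gamma**len(actions) * V(x)
            returns.append(ret)
    return np.mean(returns)                            # average over all worlds
```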

Exploration strategies

  • Thompson sampling: draw one model sample, plan greedily against it (sketched after Figure 4). Randomizing over models encourages exploration of uncertain regions.
  • Optimism: introduce “luck” variables \(\eta\) to allow transitions within confidence bounds that favour high reward; ensures optimistic value estimates until uncertainty shrinks.
  • Information-directed sampling (IDS): trade off squared regret against information gain by minimizing their ratio; Figure 4 illustrates the surrogate ratio used to pick the next action.
  • Safe exploration: propagate uncertainty to enforce chance constraints or reachable sets; combine with MPC for high-probability safety.
Figure 4: Plot of the surrogate information ratio \(\widehat{\Psi}\): IDS selects its minimizer. The first two plots use the “global” information gain measure with \(\beta = 0.25\) and \(\beta = 0.5\), respectively. The third plot uses the “greedy” information gain measure and \(\beta = 1\).
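
A minimal sketch of Thompson-sampling exploration, reusing the hypothetical `DynamicsEnsemble` and `cem_plan` sketches from earlier; the `env` interface is the same placeholder as in the outer-loop sketch.

```python
# Sketch of Thompson sampling: one posterior draw per episode, greedy planning
# against the sampled model.
def thompson_episode(env, ensemble, r, V, episode_len=200):
    f = ensemble.sample_model()                        # one approximate posterior sample
    transitions = []
    x = env.reset()
    for _ in range(episode_len):
        # Planning greedily against the sampled model drives the agent toward
        # regions where ensemble members disagree.
        a = cem_plan(lambda s, u: f.predict(s, u)[0], r, V, x)
        x_next, reward, done = env.step(a)
        transitions.append((x, a, reward, x_next))
        x = x_next
        if done:
            break
    return transitions
```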

Bridging back to model-free methods

  • Use the world model to generate synthetic data for TD/Q/actor-critic updates (Dyna style; sketched after this list).
  • Terminal value \(V\) in MPC learned via TD or deep critics.
  • Planning horizon \(H=1\) reduces to model-free greedy; longer horizons add foresight absent in purely value-based methods.
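
A minimal Dyna-style sketch: imagined transitions generated by the learned model are fed to an otherwise model-free TD/Q update. `model.predict`, `model.reward`, `q_update`, and `policy` are assumed interfaces chosen for illustration.

```python
# Sketch of Dyna-style synthetic updates from the learned world model.
import random

def dyna_updates(model, q_update, policy, visited_states, n_updates=100):
    for _ in range(n_updates):
        x = random.choice(visited_states)              # start from real experience
        a = policy(x)
        mean, std = model.predict(x, a)                # imagined next state
        x_next = mean                                  # (or sample: mean + std * noise)
        r = model.reward(x, a)                         # learned reward model
        q_update(x, a, r, x_next)                      # standard TD / Q-learning step
```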

Practical heuristics

  • Short horizons + frequent replanning mitigate model error; warm-start optimization from the previous solution (see the sketch after this list).
  • Keep track of epistemic uncertainty (ensembles, Bayesian neural networks) to decide when additional real data is needed.
  • Use CEM or evolutionary strategies as robust default for optimizing action sequences.
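
A minimal sketch of warm-starting, assuming the planner (e.g. the `cem_plan` sketch above, extended to accept an initial mean) is seeded with the previous solution shifted by one step:

```python
# Sketch of warm-starting a receding-horizon planner.
import numpy as np

def warm_start(prev_mean):
    """Shift the previous H-step action-sequence mean forward by one step and
    pad the tail with zeros (or a repeat of the last action)."""
    return np.concatenate([prev_mean[1:], np.zeros_like(prev_mean[-1:])], axis=0)
```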

Safety & guarantees

  • Confidence sets on dynamics ⇒ safe reachable tubes; combine with MPC to keep agent within safe set despite model error.
  • Optimistic-yet-safe planners mix optimism (exploration) with reachable-set constraints; the H-UCRL illustration (Figure 5) highlights how confidence bounds widen when the agent strays outside proven-safe regions (see the sketch after Figure 5).
  • Sets stage for constrained RL and control-barrier approaches.
Figure 5: Illustration of H-UCRL in a one-dimensional state space. The agent hallucinates that it takes the black trajectory when, in reality, the outcomes of its actions are as shown in blue. The agent can hallucinate to land anywhere within the gray confidence regions (i.e., the epistemic uncertainty in the model) using the luck decision variables \(\eta\). This allows agents to discover long sequences of actions leading to sparse rewards.
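
A minimal sketch of the hallucinated transition behind H-UCRL, under the assumption that epistemic uncertainty is approximated by the spread of ensemble-member means (interfaces as in the ensemble sketch above); the planner optimizes over the luck variables \(\eta \in [-1,1]^d\) in addition to the actions, and `beta` is a confidence-scaling parameter.

```python
# Sketch of a hallucinated ("lucky") transition inside the confidence region.
import numpy as np

def hallucinated_step(members, x, a, eta, beta=1.0):
    preds = np.array([m.predict(x, a)[0] for m in members])   # per-member means
    mean = preds.mean(axis=0)
    epistemic_std = preds.std(axis=0)
    # The luck variable eta steers the imagined transition anywhere inside the
    # confidence region mean +/- beta * epistemic_std (gray region in Figure 5).
    return mean + beta * epistemic_std * np.clip(eta, -1.0, 1.0)
```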