Implicit Surfaces, Neural Radiance Fields (NeRFs), and 3D Gaussian Splatting
- This lecture extends concepts from 2D generative modeling to 3D, focusing on representations, rendering, and learning methods for 3D scenes.
3D Representations
- Voxels
- Discretize 3D space into a regular grid of volumetric pixels (voxels).
- Memory complexity is cubic: \(O(N^3)\) for an \(N \times N \times N\) grid.
- Resolution is limited by grid size.
- Point Primitives / Volumetric Primitives
- Represent geometry as a collection of points (point cloud) or simple volumetric shapes (e.g., spheres).
- Do not explicitly encode connectivity or topology.
- Easily acquired with depth sensors (e.g., LiDAR).
- Meshes
- Composed of vertices (points), edges (connections), and faces (typically triangles).
- Resolution is limited by the number of vertices; deforming a low-resolution mesh can produce self-intersections.
- Some methods require class-specific templates, but general mesh learning is also common.
- Remain popular due to compatibility with standard rendering pipelines.
Implicit Surfaces
- Explicit vs. Implicit Shape Representation
- Explicit (mesh-like): Defined by discrete vertices and faces; representation is discontinuous.
- Implicit: Defined by a function \(f(x, y) = 0\) (e.g., a circle: \(x^2 + y^2 - r^2 = 0\)); representation is continuous.
- To compute \(\frac{dy}{dx}\) for \(F(x, y) = 0\) (e.g., \(x^2 + y^2 - 1 = 0\)): \[
\begin{aligned}
\frac{d}{dx}[x^2 + y^2] &= \frac{d}{dx}[1] \\
2x + 2y \frac{dy}{dx} &= 0 \\
\frac{dy}{dx} &= -\frac{x}{y}
\end{aligned}
\]
- More generally, for \(F(x, y) = 0\): \[
\frac{dy}{dx} = -\frac{\partial F / \partial x}{\partial F / \partial y}
\] (assuming \(\partial F / \partial y \neq 0\)).
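- As a quick numerical check (a minimal sketch assuming PyTorch; the point \((0.6, 0.8)\) is just an illustrative choice on the unit circle), autodifferentiation reproduces the same slope:

```python
import torch

# Implicit curve F(x, y) = x^2 + y^2 - 1 = 0 (unit circle).
x = torch.tensor(0.6, requires_grad=True)
y = torch.tensor(0.8, requires_grad=True)  # (0.6, 0.8) lies on the circle

F = x**2 + y**2 - 1.0
dF_dx, dF_dy = torch.autograd.grad(F, (x, y))

dy_dx = -dF_dx / dF_dy   # implicit differentiation: -F_x / F_y
print(dy_dx.item())      # -0.75, matching -x/y = -0.6/0.8
```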
- Level Sets
- Represent the surface as the zero level set of a continuous function \(f: \mathbb{R}^3 \to \mathbb{R}\): \[
S = \{ \mathbf{x} \in \mathbb{R}^3 \mid f(\mathbf{x}) = 0 \}
\]
- \(f\) can be approximated by a neural network \(f_\theta(\mathbf{x})\).
- Signed Distance Functions (SDFs)
- \(f(\mathbf{x})\) gives the signed distance from \(\mathbf{x}\) to the surface; sign indicates inside/outside.
- Storing SDFs on a grid leads to \(O(N^3)\) memory and limited resolution.
- Neural Implicit Representations (e.g., Occupancy Networks, DeepSDF)
- Neural network predicts occupancy probability or SDF value for any \(\mathbf{x} \in \mathbb{R}^3\).
- Can be conditioned on class labels or images.
- Advantages:
- Low memory (just network weights).
- Continuous, theoretically infinite resolution.
- Can represent arbitrary topologies.
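- A minimal sketch of such a network (assuming PyTorch; the layer sizes and the optional conditioning code are illustrative, not the exact architectures of Occupancy Networks or DeepSDF):

```python
import torch
import torch.nn as nn

class ImplicitNet(nn.Module):
    """Maps a 3D point (optionally with a conditioning code) to an SDF value
    or occupancy logit; the shape is stored in the weights, not in a grid."""
    def __init__(self, latent_dim=0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # SDF value (or occupancy logit)
        )

    def forward(self, x, z=None):
        if z is not None:                    # condition on a class/image/shape code
            x = torch.cat([x, z.expand(x.shape[0], -1)], dim=-1)
        return self.net(x).squeeze(-1)

f_theta = ImplicitNet()
points = torch.rand(1024, 3) * 2 - 1         # query points in [-1, 1]^3
sdf_values = f_theta(points)                 # one scalar per query point
```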
Neural Fields
- Neural fields generalize neural implicit representations to predict not only geometry but also color, lighting, and other properties.
- Surface Normals from SDFs
- Surface normal at \(\mathbf{x}\): \(\mathbf{n}(\mathbf{x}) = \nabla f(\mathbf{x}) / \| \nabla f(\mathbf{x}) \|\).
- Computed efficiently via auto-differentiation.
- Used for regularization (e.g., Eikonal term).
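- A minimal sketch of this computation (assuming PyTorch and an SDF network `f` such as the `ImplicitNet` sketched above):

```python
import torch

def sdf_normals(f, x):
    """Unit normals n(x) = grad f(x) / ||grad f(x)||, computed via autograd.
    x is assumed to be a leaf tensor of query points, shape (N, 3)."""
    x = x.requires_grad_(True)
    sdf = f(x)
    # summing is safe: each point's SDF depends only on that point
    (grad,) = torch.autograd.grad(sdf.sum(), x, create_graph=True)
    return grad / grad.norm(dim=-1, keepdim=True)
```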
Training Neural Implicit Surfaces
- Supervision Levels (from easiest to hardest for geometry learning):
- Watertight Meshes
- Sample points and query occupancy or SDF.
- Train with cross-entropy (occupancy) or regression (SDF).
- Point Clouds
- Supervise so that \(f_\theta(\mathbf{x}) \approx 0\) for observed points.
- Images
- Requires differentiable rendering to compare rendered and real images.
- Visualizing Implicit Surfaces
- Evaluate \(f_\theta(\mathbf{x})\) on a grid and extract a mesh (e.g., Marching Cubes).
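- A minimal sketch of this extraction step (assuming scikit-image for Marching Cubes and a trained SDF-style network `f`; the resolution and bounding box are illustrative):

```python
import torch
from skimage import measure

@torch.no_grad()
def extract_mesh(f, resolution=128, bound=1.0):
    """Evaluate f on a dense grid and run Marching Cubes on the zero level set."""
    lin = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1)
    sdf = f(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(sdf.numpy(), level=0.0)
    # rescale vertices from grid indices back to world coordinates
    verts = verts / (resolution - 1) * 2 * bound - bound
    return verts, faces
```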
- Overfitting for Representation
- Networks can be trained to overfit a single shape/class, compressing it into the weights.
- For implicit surfaces, this means accurate 3D reconstruction, not novel view synthesis.
- Eikonal Regularization
- Enforces \(\| \nabla f(\mathbf{x}) \| = 1\) for SDFs.
- Encourages smooth, well-behaved level sets.
- Loss: \[
L = \sum_{\mathbf{x}} |f_\theta(\mathbf{x})| + \lambda \sum_{\mathbf{x}} (\| \nabla f_\theta(\mathbf{x}) \| - 1)^2
\]
- Alternatively, an L1 penalty: \[
L = \sum_{\mathbf{x}} |f_\theta(\mathbf{x})| + \lambda \sum_{\mathbf{x}} |\| \nabla f_\theta(\mathbf{x}) \| - 1|
\]
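- A minimal sketch of one training step under this loss (assuming PyTorch, a surface point cloud `surface_pts` in \([-1, 1]^3\) as in the point-cloud supervision above, and the `ImplicitNet` sketched earlier; the weight `lam` is illustrative):

```python
import torch

def train_step(f, optimizer, surface_pts, lam=0.1):
    """Pull the zero level set onto the observed points and push ||grad f||
    toward 1 at random points (Eikonal regularization)."""
    optimizer.zero_grad()

    surface_loss = f(surface_pts).abs().mean()               # f(x) ~ 0 on observed points

    rand_pts = (torch.rand_like(surface_pts) * 2 - 1).requires_grad_(True)
    (grad,) = torch.autograd.grad(f(rand_pts).sum(), rand_pts, create_graph=True)
    eikonal_loss = ((grad.norm(dim=-1) - 1.0) ** 2).mean()    # ||grad f|| = 1

    loss = surface_loss + lam * eikonal_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```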
Differentiable Rendering for Neural Fields
- Goal: Learn \(f_\theta\) (geometry) and \(t_\theta\) (texture) from 2D images by rendering and comparing to ground truth.
- Rendering Pipeline:
- For each pixel \(u\), cast a ray \(\mathbf{r}(d) = \mathbf{r}_0 + \mathbf{w}d\).
- Find intersection \(\hat{\mathbf{p}}\) where \(f_\theta(\hat{\mathbf{p}}) = \tau\) (e.g., \(\tau = 0\)).
- Use root-finding (e.g., the secant method) between consecutive samples with opposite signs (see the sketch at the end of this section).
- Query \(t_\theta(\hat{\mathbf{p}})\) for color.
- Assign color to pixel \(u\).
- Forward Pass:
- Render the predicted image \(\hat{I}\) by applying the pipeline above to every pixel.
- Backward Pass:
- For a photometric loss \(L(I, \hat{I}) = \sum_u \| \hat{I}_u - I_u \|_1\), where \(\hat{I}_u = t_\theta(\hat{\mathbf{p}})\) is the rendered color at pixel \(u\).
- Both \(t_\theta\) and \(\hat{\mathbf{p}}\) depend on \(\theta\), so by the chain rule: \[
\frac{\partial \hat{I}_u}{\partial \theta} = \frac{\partial t_\theta(\hat{\mathbf{p}})}{\partial \theta} + \nabla_{\hat{\mathbf{p}}} t_\theta(\hat{\mathbf{p}}) \cdot \frac{\partial \hat{\mathbf{p}}}{\partial \theta}
\]
- Implicit differentiation for \(\frac{\partial \hat{\mathbf{p}}}{\partial \theta}\): differentiating the constraint \(f_\theta(\hat{\mathbf{p}}) = \tau\) with respect to \(\theta\) gives \[
\frac{\partial f_\theta(\hat{\mathbf{p}})}{\partial \theta} + \nabla_{\hat{\mathbf{p}}} f_\theta(\hat{\mathbf{p}}) \cdot \frac{\partial \hat{\mathbf{p}}}{\partial \theta} = 0
\]
- Since \(\hat{\mathbf{p}} = \mathbf{r}_0 + \mathbf{w}\hat{d}\) can only move along the ray direction, \(\frac{\partial \hat{\mathbf{p}}}{\partial \theta} = \mathbf{w}\frac{\partial \hat{d}}{\partial \theta}\); substituting and solving yields \[
\frac{\partial \hat{\mathbf{p}}}{\partial \theta} = -\frac{\frac{\partial f_\theta(\hat{\mathbf{p}})}{\partial \theta}}{\nabla_{\hat{\mathbf{p}}} f_\theta(\hat{\mathbf{p}}) \cdot \mathbf{w}} \mathbf{w}
\]
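- A minimal sketch of the intersection search used in the forward pass above (assuming an SDF-like network `f` that is positive outside and negative inside the surface, and \(\tau = 0\); sample counts are illustrative):

```python
import torch

@torch.no_grad()
def ray_surface_intersection(f, r0, w, near=0.0, far=2.0, n_samples=64, n_secant=8):
    """March along r(d) = r0 + w*d, find the first sign change of f,
    then refine the crossing with the secant method."""
    d = torch.linspace(near, far, n_samples)
    vals = f(r0 + d[:, None] * w)                      # f at each sample point
    sign_change = (vals[:-1] > 0) & (vals[1:] <= 0)    # outside -> inside
    if not sign_change.any():
        return None                                    # ray misses the surface
    i = int(sign_change.nonzero()[0])
    d_lo, d_hi, f_lo, f_hi = d[i], d[i + 1], vals[i], vals[i + 1]
    for _ in range(n_secant):                          # secant refinement
        d_mid = d_lo - f_lo * (d_hi - d_lo) / (f_hi - f_lo)
        f_mid = f(r0 + d_mid * w)
        if f_mid > 0:
            d_lo, f_lo = d_mid, f_mid
        else:
            d_hi, f_hi = d_mid, f_mid
    return r0 + d_mid * w                              # p_hat with f(p_hat) ~ 0
```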
Neural Radiance Fields (NeRFs)
- Motivation: Model complex scenes with thin structures, transparency, and view-dependent effects.
- Task: Given images with known camera poses, learn a volumetric scene representation for novel view synthesis.
- Network Architecture:
- Input: 3D point \(\mathbf{x}\) and viewing direction \(\mathbf{d}\).
- Output: Color \(\mathbf{c}\) and density \(\sigma\).
- Structure:
- \(\mathbf{x}\) (after positional encoding) passes through several fully connected layers.
- Outputs \(\sigma\) and a feature vector.
- \(\mathbf{d}\) (after positional encoding) is concatenated with the feature vector.
- Final layers output view-dependent color \(\mathbf{c}\).
- Density \(\sigma\) depends only on \(\mathbf{x}\); color \(\mathbf{c}\) depends on both \(\mathbf{x}\) and \(\mathbf{d}\).
- Positional Encoding:
- MLPs are biased toward low-frequency functions.
- Encode each input coordinate \(p\) as: \[
\gamma(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))
\]
- Typically \(L=10\) for \(\mathbf{x}\), \(L=4\) for \(\mathbf{d}\).
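- A minimal sketch of this encoding (assuming PyTorch; the exact interleaving of the sin/cos terms differs between implementations and does not matter):

```python
import math
import torch

def positional_encoding(p, L):
    """gamma(p): encode each coordinate with sin/cos at L octave frequencies.
    p: (..., D) coordinates  ->  (..., 2*L*D) encoding."""
    freqs = (2.0 ** torch.arange(L)) * math.pi           # 2^0*pi, ..., 2^(L-1)*pi
    angles = p[..., None] * freqs                        # (..., D, L)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., 2*L*D)

x_enc = positional_encoding(torch.rand(1024, 3), L=10)   # 3 -> 60 dims
d_enc = positional_encoding(torch.rand(1024, 3), L=4)    # 3 -> 24 dims
```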
- Volume Rendering Process:
- For each pixel, cast a ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\).
- Sample \(N\) points along the ray.
- For each sample \(i\) at \(t_i\):
- Query MLP: \((\mathbf{c}_i, \sigma_i) = \text{MLP}(\mathbf{r}(t_i), \mathbf{d})\).
- Compute \(\delta_i = t_{i+1} - t_i\).
- Compute opacity: \(\alpha_i = 1 - \exp(-\sigma_i \delta_i)\).
- Compute transmittance: \(T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)\), with \(T_1 = 1\).
- Color Integration: \[
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i
\]
- \(N\): Number of samples along the ray.
- \(\mathbf{c}_i, \sigma_i\): Color and density from MLP at sample \(i\).
- \(\delta_i\): Distance between samples.
- \(\alpha_i\): Opacity for interval \(i\).
- \(T_i\): Transmittance up to sample \(i\).
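- A minimal sketch of this quadrature for a single ray (assuming PyTorch; the near-infinite last interval mirrors a common implementation trick):

```python
import torch

def composite(colors, sigmas, ts):
    """Volume rendering quadrature for one ray.
    colors: (N, 3), sigmas: (N,), ts: (N,) sample depths along the ray."""
    deltas = ts[1:] - ts[:-1]                              # delta_i = t_{i+1} - t_i
    deltas = torch.cat([deltas, torch.tensor([1e10])])     # last interval ~ infinite
    alphas = 1.0 - torch.exp(-sigmas * deltas)             # alpha_i = 1 - exp(-sigma_i * delta_i)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)     # prod_{j<=i} (1 - alpha_j)
    trans = torch.cat([torch.ones(1), trans[:-1]])         # shift so that T_1 = 1
    weights = trans * alphas                               # w_i = T_i * alpha_i
    return (weights[:, None] * colors).sum(dim=0)          # C_hat(r)
```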
- Hierarchical Volume Sampling (HVS):
- First sample uniformly and evaluate a coarse network; then sample additional points for a fine network where the coarse rendering weights \(w_i = T_i \alpha_i\) are large.
- Training & Characteristics:
- Volumetric, models transparency and thin structures.
- Geometry can be noisy compared to explicit surface methods.
- Requires many calibrated images.
- Rendering is slow due to many MLP queries.
- Original NeRF is for static scenes.
- Animating NeRFs / Dynamic Scenes
- Use a canonical space (e.g., T-pose) and learn a deformation field to map to observed poses.
- Find correspondences between observed and canonical space (may require a separate network).
- If multiple correspondences, select or aggregate (e.g., max confidence) [Verification Needed].
- Example: Vid2Avatar.
- Alternative Parametrizations
- Use explicit primitives (e.g., voxels, cubes) with local NeRFs.
- Point-based primitives (e.g., spheres, ellipsoids) with neural features can be optimized and rendered efficiently.
- Ellipsoids better capture thin structures than spheres.
3D Gaussian Splatting (Kerbl et al., 2023)
- Overview: Represents a scene as a set of 3D Gaussians, each with learnable parameters, and renders by projecting and compositing them.
- Process:
- Initialization:
- Start from a sparse point cloud (e.g., from Structure-from-Motion).
- Initialize one Gaussian per point.
- Optimization Loop:
- Project 3D Gaussians to 2D using camera parameters.
- Render using a differentiable rasterizer (often tile-based).
- Compute loss (e.g., L1, D-SSIM) between rendered and ground truth images.
- Backpropagate to update Gaussian parameters.
- Adaptive Density Control:
- Prune Gaussians with low opacity or excessive size.
- Densify by cloning/splitting in under-reconstructed regions.
- Gaussian Parameters:
- 3D mean \(\mathbf{\mu} \in \mathbb{R}^3\).
- 3x3 covariance \(\mathbf{\Sigma}\), factored as \(\mathbf{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T\) with a per-axis scale vector and a rotation quaternion, which keeps \(\mathbf{\Sigma}\) positive semi-definite during optimization.
- Color \(\mathbf{c}\) (often as Spherical Harmonics coefficients).
- Opacity \(o\) (scalar, passed through sigmoid).
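- A minimal sketch of building the covariance from these parameters (assuming PyTorch; scale and quaternion are the raw learnable quantities):

```python
import torch

def covariance_from_scale_rotation(scale, quat):
    """Sigma = R S S^T R^T from a per-axis scale (3,) and a quaternion (w, x, y, z)."""
    w, x, y, z = quat / quat.norm()                      # normalize to a unit quaternion
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    S = torch.diag(scale)
    return R @ S @ S.T @ R.T                             # symmetric, positive semi-definite
```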
- Rendering:
- Project the 3D Gaussians to 2D and sort them by depth (front-to-back); in practice this is done per image tile rather than per pixel.
- For each pixel, iterate over the overlapping Gaussians \(i\):
- Compute effective opacity at pixel: \[
\alpha_i' = o_i \cdot \exp\left(-\frac{1}{2} (\mathbf{p} - \mathbf{\mu}_i')^T (\mathbf{\Sigma}_i')^{-1} (\mathbf{p} - \mathbf{\mu}_i')\right)
\]
- \(o_i\): Learned opacity.
- \(\mathbf{p}\): 2D pixel coordinate.
- \(\mathbf{\mu}_i'\): Projected 2D mean.
- \(\mathbf{\Sigma}_i'\): Projected 2D covariance.
- Evaluate color \(\mathbf{k}_i\) (from SH coefficients).
- Color Blending: \[
C = \sum_{i=1}^{N_g} \mathbf{k}_i \alpha_i' \prod_{j=1}^{i-1} (1 - \alpha_j')
\]
- \(N_g\): Number of Gaussians overlapping the pixel, sorted by depth.
- \(\prod_{j=1}^{i-1} (1 - \alpha_j')\): Accumulated transmittance.
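- A minimal sketch of this blending for one pixel (assuming PyTorch and that the Gaussians have already been projected and depth-sorted; a real implementation runs tile-wise on the GPU rather than looping per pixel):

```python
import torch

def blend_pixel(p, means2d, covs2d, opacities, colors):
    """Front-to-back alpha blending of depth-sorted projected Gaussians at pixel p.
    means2d: (N, 2), covs2d: (N, 2, 2), opacities: (N,), colors: (N, 3)."""
    C = torch.zeros(3)
    T = torch.tensor(1.0)                                # accumulated transmittance
    for mu, cov, o, k in zip(means2d, covs2d, opacities, colors):
        d = p - mu
        alpha = o * torch.exp(-0.5 * d @ torch.linalg.inv(cov) @ d)
        C = C + T * alpha * k                            # k_i * alpha_i' * prod_{j<i}(1 - alpha_j')
        T = T * (1.0 - alpha)
        if T < 1e-4:                                     # early termination
            break
    return C
```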
- Advantages:
- Faster than NeRF for both training and rendering (often real-time).
- State-of-the-art rendering quality.
- Once optimized, rendering is efficient—no need for repeated neural network queries.