Mathematical Foundations & Deep Learning Cheat Sheet

Matrix Determinant Lemma

For an invertible matrix \(A \in \mathbb{R}^{n \times n}\) and vectors \(u, v \in \mathbb{R}^n\): \[\det(A + uv^T) = \det(A)(1 + v^T A^{-1} u)\]

  • Related identity: \(\det(A + \alpha I) = \det(A) \prod_{i=1}^n (1 + \alpha/\lambda_i)\), where \(\lambda_i\) are the eigenvalues of \(A\) (this follows from the eigendecomposition rather than from the rank-one lemma itself); both identities are checked numerically below.
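
A quick NumPy sanity check of both identities; the random \(A\), \(u\), \(v\) are placeholder inputs, with \(A\) shifted to keep it comfortably invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)    # shift keeps A well away from singular
u, v = rng.standard_normal(n), rng.standard_normal(n)

# Matrix determinant lemma: det(A + u v^T) = det(A) * (1 + v^T A^{-1} u)
lhs = np.linalg.det(A + np.outer(u, v))
rhs = np.linalg.det(A) * (1 + v @ np.linalg.solve(A, u))
print(np.isclose(lhs, rhs))                        # True

# Eigenvalue identity: det(A + alpha*I) = det(A) * prod_i (1 + alpha / lambda_i)
alpha = 0.5
lam = np.linalg.eigvals(A)                         # may be complex for a general real A
lhs2 = np.linalg.det(A + alpha * np.eye(n))
rhs2 = (np.linalg.det(A) * np.prod(1 + alpha / lam)).real
print(np.isclose(lhs2, rhs2))                      # True
```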

Change of Variables Formula

For transformation \(y = g(x)\) where \(g: \mathbb{R}^n \to \mathbb{R}^n\) is differentiable and invertible: \[p_Y(y) = p_X(g^{-1}(y)) \left|\det\left(\frac{\partial g^{-1}}{\partial y}\right)\right| = p_X(x) \left|\det\left(\frac{\partial g}{\partial x}\right)\right|^{-1}\]

  • Jacobian: \(J_g = \frac{\partial g}{\partial x}\) is the Jacobian matrix of the transformation.
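
A minimal 1D illustration in NumPy, assuming the map \(g(x) = e^x\) applied to a standard normal; both forms of the formula give the same (log-normal) density:

```python
import numpy as np

# Density of X ~ N(0, 1)
p_x = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Invertible map y = g(x) = exp(x), with g^{-1}(y) = log(y)
g_inv = np.log
dg_dx = np.exp                       # dg/dx
dginv_dy = lambda y: 1.0 / y         # dg^{-1}/dy

y = 2.5
x = g_inv(y)
p_y_inverse_form = p_x(g_inv(y)) * abs(dginv_dy(y))    # p_X(g^{-1}(y)) |d g^{-1}/dy|
p_y_forward_form = p_x(x) / abs(dg_dx(x))              # p_X(x) |dg/dx|^{-1}
print(np.isclose(p_y_inverse_form, p_y_forward_form))  # True: the log-normal density at y
```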

Model Collapse

A phenomenon where a generative model progressively loses diversity in its outputs, eventually producing only a limited set of samples that lack the variability of the training data. Common in:

  • GANs: Generator focuses on “easy” samples that fool the discriminator (commonly called mode collapse)
  • Autoregressive models: Repetitive or generic text generation
  • Diffusion models: Loss of sample diversity with certain training procedures

GAN Objective Function

  • Minimax objective: \[\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]
  • Optimal discriminator (for fixed \(G\)): \[D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\]
  • Optimal value at equilibrium (when \(p_g = p_{data}\)): \[V(D^*, G^*) = -\log(4)\]
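
A Monte Carlo sketch in NumPy of the equilibrium value: taking \(p_g = p_{data}\) to be the same standard normal (a placeholder choice), the optimal discriminator is \(0.5\) everywhere and \(V \approx -\log 4\):

```python
import numpy as np

rng = np.random.default_rng(0)
p_data = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
p_g = p_data                                         # equilibrium: p_g = p_data

d_star = lambda x: p_data(x) / (p_data(x) + p_g(x))  # optimal discriminator, here 0.5 everywhere

x_real = rng.standard_normal(100_000)                # samples ~ p_data
x_fake = rng.standard_normal(100_000)                # samples ~ p_g
value = np.mean(np.log(d_star(x_real))) + np.mean(np.log(1 - d_star(x_fake)))
print(value, -np.log(4))                             # both approximately -1.386
```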

Plane Equation (Intercept Form)

Given x, y, z intercepts \(a\), \(b\), \(c\): \[\frac{x}{a} + \frac{y}{b} + \frac{z}{c} = 1\]

Convolution Output Size & Receptive Field

  • Output size: \[H_{out} = \left\lfloor \frac{H_{in} + 2P - K}{S} \right\rfloor + 1\] where \(H_{in}\) = input height, \(P\) = padding, \(K\) = kernel size, \(S\) = stride
  • Receptive field: The region in the input that affects a single output unit
    • Formula: \(RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i\), with \(RF_0 = 1\) (see the helper functions below)
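
Small Python helpers implementing the two formulas above; the layer configuration at the end is an arbitrary example:

```python
def conv_out_size(h_in, k, s, p):
    """Spatial output size: floor((H_in + 2P - K) / S) + 1."""
    return (h_in + 2 * p - k) // s + 1

def receptive_field(kernels, strides):
    """RF_l = RF_{l-1} + (K_l - 1) * prod(S_1 .. S_{l-1}), starting from RF_0 = 1."""
    rf, stride_prod = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * stride_prod
        stride_prod *= s
    return rf

print(conv_out_size(32, k=3, s=1, p=1))        # 32 ("same" padding)
print(receptive_field([3, 3, 3], [1, 2, 1]))   # 9
```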

What is a Neuron?

  • MLP: \(y = \sigma(w^T x + b)\) where \(\sigma\) is activation function, \(w\) weights, \(b\) bias
  • CNN: A convolution operation \(y_{i,j} = \sigma\left(\sum_{m,n} w_{m,n} \cdot x_{i+m,j+n} + b\right)\) applied across spatial locations
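
A minimal NumPy sketch of both views with arbitrary placeholder weights; the CNN case is the same weighted sum, applied to one \(3 \times 3\) patch and reused at every spatial location:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # sigmoid activation

# MLP neuron: y = sigma(w^T x + b)
x = np.array([0.5, -1.0, 2.0])
w, b = np.array([0.1, 0.3, -0.2]), 0.05
y_mlp = sigma(w @ x + b)

# CNN "neuron": the same dot product over a local K x K patch, with shared weights
image = np.arange(25.0).reshape(5, 5)
kernel = np.full((3, 3), 1.0 / 9.0)               # placeholder 3x3 kernel
i, j = 1, 1                                       # one output location
y_conv = sigma(np.sum(kernel * image[i:i + 3, j:j + 3]) + b)
print(y_mlp, y_conv)
```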

Immediate Derivative (\(\partial^+\))

The right-hand derivative for non-differentiable functions: \[\partial^+ f(x) = \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}\]

  • Used in subgradient methods and RNN backpropagation through non-smooth activations

Regularization

Any technique that aims to reduce the generalization error of a model, typically by:

  • Adding penalty terms to the loss function (L1, L2)
  • Modifying the training procedure (dropout, early stopping)
  • Constraining effective model capacity (weight decay, which is equivalent to an L2 penalty under plain SGD; batch normalization also has an implicit regularizing effect)
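
A minimal sketch of the first kind of regularization: an L2 penalty \(\lambda \|w\|^2\) added to a squared-error loss and minimized by plain gradient descent (data, model, and hyperparameters are placeholders):

```python
import numpy as np

def ridge_loss_and_grad(w, X, y, lam):
    """Mean squared error plus an L2 penalty lam * ||w||^2 (ridge / weight decay)."""
    residual = X @ w - y
    loss = np.mean(residual**2) + lam * np.sum(w**2)
    grad = 2 * X.T @ residual / len(y) + 2 * lam * w
    return loss, grad

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
w = np.zeros(5)
for _ in range(200):                          # plain gradient descent
    loss, grad = ridge_loss_and_grad(w, X, y, lam=0.1)
    w -= 0.05 * grad
print(loss, w)                                # the penalty keeps ||w|| small
```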

CNN Bias Parameters

For a convolutional layer with kernel size \(k \times k\), \(c_{in}\) input channels, \(c_{out}\) output channels:

  • Weights: \(k \times k \times c_{in} \times c_{out}\)
  • Biases: \(c_{out}\) (one bias per output channel)
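
A one-line check of the count; the \(3 \times 3\), 64-to-128-channel layer is an arbitrary example:

```python
def conv_param_count(k, c_in, c_out, bias=True):
    """k*k*c_in*c_out weights plus c_out biases (one per output channel)."""
    return k * k * c_in * c_out + (c_out if bias else 0)

print(conv_param_count(3, c_in=64, c_out=128))   # 3*3*64*128 + 128 = 73856
```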

LSTM Cell Equations

Given input \(x_t\), hidden state \(h_{t-1}\), cell state \(c_{t-1}\): \[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}\] \[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}\] \[\tilde{c}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad \text{(candidate values)}\] \[c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(cell state)}\] \[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}\] \[h_t = o_t \odot \tanh(c_t) \quad \text{(hidden state)}\]
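
A minimal NumPy sketch of one step, assuming the four gate weight matrices \(W_f, W_i, W_C, W_o\) are stacked into a single \(4H \times (H + D)\) matrix (sizes and random weights are placeholders):

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W stacks [W_f; W_i; W_C; W_o] acting on the concatenated [h_{t-1}, x_t]."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ np.concatenate([h_prev, x_t]) + b                  # all gate pre-activations at once
    H = len(h_prev)
    f, i = sigma(z[:H]), sigma(z[H:2 * H])                     # forget and input gates
    c_tilde, o = np.tanh(z[2 * H:3 * H]), sigma(z[3 * H:])     # candidate values, output gate
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

H, D = 4, 3                                                    # hidden and input sizes
rng = np.random.default_rng(0)
W, b = 0.1 * rng.standard_normal((4 * H, H + D)), np.zeros(4 * H)
h_t, c_t = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```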

RNN Gradient Unrolling & Immediate Derivative

  • Backpropagation through time (BPTT): Unroll RNN for \(T\) steps and compute gradients
  • Gradient flow: \(\frac{\partial L}{\partial h_{t-1}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial h_{t-1}}\)
  • Vanishing/exploding gradients: Due to repeated multiplication of \(\frac{\partial h_t}{\partial h_{t-1}}\)
  • Immediate derivative: Used when activation functions are non-differentiable (e.g., ReLU at 0)

VAE Gaussian Latent Advantages

  1. Analytical KL divergence: \(D_{KL}(q(z|x) \| p(z))\) has closed form when both are Gaussian
  2. Easy sampling: Reparameterization trick \(z = \mu + \sigma \odot \epsilon\) where \(\epsilon \sim \mathcal{N}(0,I)\)
  3. Smooth latent space: Gaussian assumption promotes smooth interpolation between data points
  4. Mathematical tractability: Enables end-to-end gradient-based optimization
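
A short NumPy sketch of points 1 and 2: the reparameterized sample and the closed-form KL \(\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)\) for a diagonal Gaussian posterior against \(\mathcal{N}(0, I)\) (the encoder outputs here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.standard_normal(8)                  # encoder mean (placeholder)
log_var = 0.1 * rng.standard_normal(8)       # encoder log-variance (placeholder)

# Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable w.r.t. mu, sigma
eps = rng.standard_normal(8)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)
print(z, kl)
```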

Neural Radiance Fields (NeRF) - Color Integration

The color \(\hat{C}(\mathbf{r})\) for a camera ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\) is:

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i \]

Where:

  • \(N\): Number of samples along the ray.
  • \(\mathbf{c}_i, \sigma_i = \text{MLP}(\mathbf{r}(t_i), \mathbf{d})\): Color and volume density from MLP for sample \(i\) at position \(\mathbf{r}(t_i)\) viewed from direction \(\mathbf{d}\).
  • \(\delta_i = t_{i+1} - t_i\): Distance between adjacent samples.
  • \(\alpha_i = 1 - \exp(-\sigma_i \delta_i)\): Opacity of interval \(i\).
  • \(T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)\): Transmittance up to sample \(i\) (\(T_1 = 1\)).
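
A minimal NumPy sketch of the compositing sum for a single ray; the colors and densities are random placeholders standing in for MLP outputs, and the last sample is dropped since it has no \(\delta_i\):

```python
import numpy as np

def composite_ray(colors, sigmas, ts):
    """Discrete volume rendering: C_hat = sum_i T_i * alpha_i * c_i along one ray."""
    deltas = np.diff(ts)                                      # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas[:-1] * deltas)              # opacity of each interval
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])  # T_i, with T_1 = 1
    weights = trans * alphas
    return weights @ colors[:-1]                              # estimated RGB

rng = np.random.default_rng(0)
N = 64
ts = np.linspace(2.0, 6.0, N)                                 # sample depths along the ray
colors = rng.random((N, 3))                                   # c_i (placeholders)
sigmas = 5.0 * rng.random(N)                                  # volume densities (placeholders)
print(composite_ray(colors, sigmas, ts))
```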

3D Gaussian Splatting - Color Blending

The color \(C\) for a pixel is:

\[ C = \sum_{i=1}^{N_g} \mathbf{k}_i \alpha_i' \prod_{j=1}^{i-1} (1 - \alpha_j') \]

Where the sum is over \(N_g\) Gaussians sorted by depth (front-to-back):

  • \(\mathbf{k}_i\): Color of the \(i\)-th Gaussian (e.g., from Spherical Harmonics).
  • \(\alpha_i'\): Effective opacity of Gaussian \(i\) at the pixel: \[ \alpha_i' = o_i \cdot \exp\left(-\frac{1}{2} (\mathbf{p} - \mathbf{\mu}_i')^T (\mathbf{\Sigma}_i')^{-1} (\mathbf{p} - \mathbf{\mu}_i')\right) \]
    • \(o_i\): Learned opacity scalar of the \(i\)-th 3D Gaussian.
    • \(\mathbf{p}\): 2D pixel coordinate.
    • \(\mathbf{\mu}_i'\): Projected 2D mean of Gaussian \(i\).
    • \(\mathbf{\Sigma}_i'\): Projected 2D covariance of Gaussian \(i\).
  • \(\prod_{j=1}^{i-1} (1 - \alpha_j')\): Accumulated transmittance from Gaussians closer to the camera.
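
A minimal NumPy sketch of the blending loop, assuming the Gaussians have already been depth-sorted and projected to 2D (all values are placeholders):

```python
import numpy as np

def gaussian_alpha(p, mu2d, cov2d, opacity):
    """Effective opacity alpha'_i of one projected Gaussian evaluated at pixel p."""
    d = p - mu2d
    return opacity * np.exp(-0.5 * d @ np.linalg.solve(cov2d, d))

def blend(p, gaussians):
    """Front-to-back alpha blending over depth-sorted Gaussians."""
    color, transmittance = np.zeros(3), 1.0
    for mu2d, cov2d, opacity, k in gaussians:         # assumed sorted near -> far
        a = gaussian_alpha(p, mu2d, cov2d, opacity)
        color += transmittance * a * k                # k_i * alpha'_i * accumulated (1 - alpha'_j)
        transmittance *= 1.0 - a
    return color

p = np.array([10.0, 12.0])                            # pixel coordinate
gaussians = [(np.array([10.5, 11.5]), 4.0 * np.eye(2), 0.8, np.array([1.0, 0.2, 0.1])),
             (np.array([9.0, 13.0]), 9.0 * np.eye(2), 0.5, np.array([0.1, 0.3, 0.9]))]
print(blend(p, gaussians))
```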

SMPL (Skinned Multi-Person Linear Model) Cheatsheet

Big Picture: SMPL is a realistic, learned 3D parametric model of the human body. It can represent a wide variety of human body shapes and poses using a compact set of parameters. It’s widely used in computer vision and graphics for tasks like 3D human pose and shape estimation, animation, virtual try-on, and generating synthetic human data. The model is differentiable, allowing it to be integrated into deep learning pipelines.

Core Idea: Map low-dimensional shape (\(\vec{\beta}\)) and pose (\(\vec{\theta}\)) parameters to a 3D mesh (\(V\) vertices).

Key Parameters & Components:

  1. Shape Parameters (\(\vec{\beta}\)):

    • Vector \(\vec{\beta} = [\beta_1, ..., \beta_N]\) (e.g., \(N=10\) or \(20\)).
    • Controls identity-dependent body proportions (height, weight, etc.).
    • Coefficients for shape blend shapes.
  2. Pose Parameters (\(\vec{\theta}\)):

    • Vector \(\vec{\theta} \in \mathbb{R}^{3K}\) representing joint rotations for \(K\) joints (e.g., \(K=24\), including root).
    • Typically axis-angle representation for each joint.
    • \(\theta_0\) denotes the rest pose (canonical T-pose).
  3. Template Mesh (\(\bar{T}\)):

    • A base 3D mesh with \(V\) vertices \(\{\bar{t}_1, ..., \bar{t}_V\}\).
    • Represents the mean shape in the rest pose (\(\theta_0\)).
  4. Shape Blend Shapes (\(S\)):

    • A set of \(N\) vertex displacement fields. \(S_{n,i} \in \mathbb{R}^3\) is the displacement for vertex \(i\) due to the \(n\)-th shape component.
    • \(B_S(\vec{\beta})_i = \sum_{n=1}^{N} \beta_n S_{n,i}\) is the total shape-induced displacement for vertex \(i\).
  5. Pose Blend Shapes (\(P\)):

    • Vertex displacements \(B_P(\vec{\theta})_i\) that correct LBS artifacts for non-rigid effects of pose (e.g., muscle bulging).
    • Function of joint rotations \(\vec{\theta}\) relative to the rest pose \(\theta_0\).
    • Calculated as a linear combination of \(9(K-1)\) basis pose blend shapes, where coefficients depend on \((R_k(\vec{\theta}) - R_k(\vec{\theta}_0))\) for \(k=1,...,K-1\) non-root joints.
  6. Joint Regressor (\(J(\vec{\beta})\)):

    • A function (often a linear regressor/sparse matrix) that computes the 3D locations of \(K\) skeleton joints from the vertices of the shaped mesh \(T_S = \bar{T} + B_S(\vec{\beta})\).
  7. Skinning Weights (\(W\)):

    • Matrix \(W \in \mathbb{R}^{V \times K}\). \(w_{k,i}\) is the influence of joint \(k\) on vertex \(i\).
    • \(\sum_{k=1}^{K} w_{k,i} = 1\) for each vertex \(i\).
  8. Joint Transformations:

    • \(G_k(\vec{\theta})\): World transformation ( \(4 \times 4\) matrix) of joint \(k\) under pose \(\vec{\theta}\), derived from \(\vec{\theta}\) and joint locations \(J(\vec{\beta})\) via forward kinematics.
    • \(G_k(\vec{\theta}_0)\): World transformation of joint \(k\) in the rest pose.

SMPL Model Formulation (Vertex Positions \(M(\vec{\beta}, \vec{\theta})\)):

Step 1: Apply Shape Blend Shapes (Identity) Calculate the personalized rest shape \(T_S\) by adding shape-dependent displacements to the template mesh: For each vertex \(i\): \[ t_{S,i} = \bar{t}_i + B_S(\vec{\beta})_i = \bar{t}_i + \sum_{n=1}^{N} \beta_n S_{n,i} \]

Step 2: Apply Pose Blend Shapes (Pose-dependent corrections) Add corrective pose-dependent displacements to \(t_{S,i}\) to get vertices \(t'_i\) in the “posed canonical space”: For each vertex \(i\): \[ t'_{i} = t_{S,i} + B_P(\vec{\theta})_i \]

Step 3: Linear Blend Skinning (LBS) Transform vertices \(t'_i\) from the canonical space to the final world pose using LBS. Let \(\tilde{t}'_i\) be the homogeneous coordinate of \(t'_i\). The effective transformation for joint \(k\) from canonical to world space is \(G'_k = G_k(\vec{\theta})G_k(\vec{\theta}_0)^{-1}\). The final homogeneous position \(\tilde{p}_i\) of vertex \(i\) is: \[ \tilde{p}_i = \sum_{k=1}^{K} w_{k,i} (G_k(\vec{\theta})G_k(\vec{\theta}_0)^{-1} \tilde{t}'_i) \] The 3D position \(p_i\) is obtained from \(\tilde{p}_i\). The set of all \(p_i\) forms the final mesh \(M(\vec{\beta}, \vec{\theta})\).
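
A minimal NumPy sketch of Step 3 (LBS) under the formulation above; vertices, skinning weights, and joint transforms are placeholders (`weights[i, k]` stores \(w_{k,i}\), and \(V = 6890\), \(K = 24\) are the usual SMPL sizes):

```python
import numpy as np

def linear_blend_skinning(verts, weights, G_pose, G_rest):
    """LBS: p_i = sum_k w_{k,i} * (G_k(theta) G_k(theta_0)^{-1}) t'_i in homogeneous coordinates."""
    V = verts.shape[0]
    verts_h = np.concatenate([verts, np.ones((V, 1))], axis=1)                    # (V, 4)
    G_eff = np.stack([Gp @ np.linalg.inv(Gr) for Gp, Gr in zip(G_pose, G_rest)])  # (K, 4, 4)
    blended = np.einsum('vk,kab->vab', weights, G_eff)   # per-vertex blended 4x4 transform
    posed_h = np.einsum('vab,vb->va', blended, verts_h)  # apply it to each vertex
    return posed_h[:, :3]

V, K = 6890, 24
verts = np.zeros((V, 3))                   # placeholder canonical vertices t'_i
weights = np.full((V, K), 1.0 / K)         # placeholder skinning weights; each row sums to 1
G_pose = np.tile(np.eye(4), (K, 1, 1))     # placeholder G_k(theta)
G_rest = np.tile(np.eye(4), (K, 1, 1))     # placeholder G_k(theta_0)
posed = linear_blend_skinning(verts, weights, G_pose, G_rest)   # (V, 3) final mesh vertices
```

Blending the \(4 \times 4\) transforms per vertex and then applying them once is equivalent, by linearity, to summing the individually transformed points as written in the formula.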

Backpropagation Through Time & Matrix Calculus

Layout Conventions

Choose one and stick to it:

  • Numerator layout: Gradient \(\nabla f(x)\) is a row vector; for \(y \in \mathbb{R}^m\), \(x \in \mathbb{R}^n\), the Jacobian \(\frac{\partial y}{\partial x}\) is an \(m \times n\) matrix
  • Denominator layout: Gradient \(\nabla f(x)\) is a column vector; the Jacobian \(\frac{\partial y}{\partial x}\) is an \(n \times m\) matrix

Key Rule: Matrix multiplications within derived quantities must always be dimensionally valid.

Gradient Definition

  • Strict definition: \(\nabla f(x) = \frac{\partial f(x)}{\partial x}\) for \(f: \mathbb{R}^n \to \mathbb{R}\) (scalar function)
  • Deep Learning usage: Term “gradient” used loosely for all matrix/vector-valued derivatives

Practical Tips

  1. Plan ahead: Determine dimensions of input/output before deriving
    • Scalar by vector → vector
    • Vector by vector → matrix
    • Vector by scalar → vector
  2. Check dimensions: Verify matrix operations are well-defined
  3. When stuck: Work element-wise, then collect into matrix form

Common Derivatives

Scalar Functions:

  • \(\frac{\partial}{\partial x}(ax) = a\)
  • \(\frac{\partial}{\partial x}(x^n) = nx^{n-1}\)
  • \(\frac{\partial}{\partial x}(\log x) = \frac{1}{x}\)
  • \(\frac{\partial}{\partial x}(e^x) = e^x\)

Vector/Matrix Functions (Numerator Layout):

  • \(\frac{\partial \mathbf{a}^T \mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}^T\)
  • \(\frac{\partial \mathbf{x}^T \mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}^T\)
  • \(\frac{\partial \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^T(\mathbf{A} + \mathbf{A}^T)\)
  • \(\frac{\partial \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}\)
  • \(\frac{\partial ||\mathbf{x}||^2}{\partial \mathbf{x}} = 2\mathbf{x}^T\)
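
A finite-difference check in NumPy of three of these identities; the gradient is returned as a flat array, read as a row vector in numerator layout (the random \(x\), \(\mathbf{a}\), \(\mathbf{A}\) are placeholders):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, a, A = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal((4, 4))

print(np.allclose(numerical_grad(lambda v: a @ v, x), a))                  # d(a^T x)/dx = a^T
print(np.allclose(numerical_grad(lambda v: v @ A @ v, x), x @ (A + A.T)))  # d(x^T A x)/dx
print(np.allclose(numerical_grad(lambda v: v @ v, x), 2 * x))              # d(||x||^2)/dx = 2 x^T
```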

Activation Functions:

  • \(\frac{\partial}{\partial x} \sigma(x) = \sigma(x)(1-\sigma(x))\) (sigmoid)
  • \(\frac{\partial}{\partial x} \tanh(x) = 1 - \tanh^2(x)\)
  • \(\frac{\partial}{\partial x} \text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \partial^+ & \text{if } x = 0 \end{cases}\)

Chain Rule: \[\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}\]

BPTT Key Insight: \[\frac{\partial L}{\partial h_{t-k}} = \frac{\partial L}{\partial h_t} \prod_{i=0}^{k-1} \frac{\partial h_{t-i}}{\partial h_{t-i-1}}\]

  • Vanishing gradients: when \(\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| < 1\), the repeated products shrink toward zero (see the numerical sketch below)
  • Exploding gradients: when \(\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| > 1\), the products grow without bound
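
A numerical sketch of this effect, under the simplifying assumption of a linear RNN with identity activation so that \(\frac{\partial h_t}{\partial h_{t-1}} = W\) at every step; whether the backpropagated gradient shrinks or blows up is governed by \(W\)'s spectral radius:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 50, 16                                 # sequence length, hidden size

for scale in (0.5, 1.5):
    W = scale * rng.standard_normal((H, H)) / np.sqrt(H)   # spectral radius roughly `scale`
    grad = np.ones(H)                         # stands in for dL/dh_T
    for _ in range(T):                        # backpropagate through T steps
        grad = grad @ W                       # multiply by dh_t/dh_{t-1}
    print(scale, np.linalg.norm(grad))        # tiny for 0.5 (vanishing), huge for 1.5 (exploding)
```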