Week 2

Published

Wednesday, February 26, 2025

Training Neural Networks

Regularization

Regularization encompasses a set of strategies designed to reduce a model’s test error (generalization error), sometimes at the cost of a slight increase in training error. The primary goal is to prevent overfitting, thereby improving the model’s performance on unseen data. Informally, these strategies often “make training harder” by constraining the model or introducing noise, compelling it to learn more robust and generalizable features.

  • Definition: Techniques aimed at minimizing the generalization gap (the difference between test error and training error).

Parameter Norm Penalties

These methods aim to limit the model’s effective capacity by adding a penalty term to the loss function based on the magnitude of the model parameters (weights).

  • L2 Regularization (Ridge Regression / Weight Decay):
    • Adds a penalty proportional to the sum of the squares of the weights: \(L_{reg} = \lambda ||w||_2^2 = \lambda \sum_i w_i^2\).
    • This encourages smaller, more diffuse weight values.
    • The gradient of this penalty term with respect to a weight \(w_j\) is \(2 \lambda w_j\). Substituting into the weight update rule \(w \leftarrow w - \eta (\nabla L_{data} + \nabla L_{reg})\) gives \(w_j \leftarrow (1 - 2 \eta \lambda)\, w_j - \eta \nabla_{w_j} L_{data}\), so the weights are shrunk by a constant factor in each step (hence “weight decay”); see the sketch after this list.
  • L1 Regularization (Lasso):
    • Adds a penalty proportional to the sum of the absolute values of the weights: \(L_{reg} = \lambda ||w||_1 = \lambda \sum_i |w_i|\).
    • This is known for inducing sparsity, meaning it tends to drive many weights to exactly zero, effectively performing feature selection.
    • The gradient of this penalty term with respect to \(w_j\) is \(\lambda \text{sign}(w_j)\) (for \(w_j \neq 0\)). The update rule subtracts a constant value (\(\eta \lambda\)) from positive weights and adds it to negative weights, pushing them towards zero, irrespective of their current magnitude (as long as they are non-zero).
  • Interpretations:
    • Bayesian Prior: Parameter norm penalties can be interpreted as imposing a prior distribution on the weights in a Maximum A Posteriori (MAP) estimation framework. L2 regularization corresponds to a Gaussian prior, while L1 regularization corresponds to a Laplace prior.
    • Constrained Optimization: They are also equivalent to solving a constrained optimization problem where the weights are constrained to lie within an \(L_2\) ball (for L2) or an \(L_1\) ball (for L1).
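
The two penalties differ only in the extra term added to the data gradient. A minimal NumPy sketch of the resulting update steps (the function names and toy values are ours, not from the notes):

```python
import numpy as np

def sgd_step_l2(w, grad_data, lr, lam):
    """One SGD step with an L2 penalty: the gradient of lam * ||w||_2^2 is 2*lam*w,
    so the weights shrink by a constant factor each step ("weight decay")."""
    return w - lr * (grad_data + 2.0 * lam * w)

def sgd_step_l1(w, grad_data, lr, lam):
    """One SGD step with an L1 penalty: the gradient of lam * ||w||_1 is lam*sign(w),
    pushing every non-zero weight towards zero by a constant amount."""
    return w - lr * (grad_data + lam * np.sign(w))

# Toy usage with made-up values
w = np.array([0.5, -1.2, 0.0])
grad_data = np.array([0.1, -0.3, 0.2])
w = sgd_step_l2(w, grad_data, lr=0.1, lam=0.01)
```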

Ensemble Methods

Ensemble methods combine predictions from multiple models to achieve better performance and robustness than any single constituent model.

  • General Idea: Train several different models and then average their predictions (for regression) or take a majority vote (for classification).
  • Sources of Diversity:
    • Train different model architectures on the same data.
    • Train the same model architecture on different subsets of the data or with different initializations.
  • Bagging (Bootstrap Aggregating):
    • Trains multiple instances of the same base model on different bootstrap samples (random subsets of the training data, sampled with replacement).
    • Primarily reduces variance and helps prevent overfitting. Random Forests are a prominent example.
  • Boosting:
    • Trains multiple base models sequentially. Each new model focuses on correcting the errors made by the previous models (e.g., by up-weighting misclassified examples).
    • Primarily reduces bias. Examples include AdaBoost, Gradient Boosting Machines (GBMs), and XGBoost.
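
As a concrete illustration with scikit-learn (assuming it is available), both flavours are a single estimator call; the synthetic dataset and hyperparameters below are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic classification data for demonstration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging-style ensemble: bootstrap samples + random feature subsets (variance reduction)
bagging = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting-style ensemble: sequential models correcting previous errors (bias reduction)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
```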

Dropout

Dropout is a regularization technique specifically for neural networks that prevents complex co-adaptations of neurons.

  • Mechanism: During training, for each forward pass, a fraction of neuron outputs in a layer are randomly set to zero (dropped out). The dropout probability (e.g., \(p=0.5\)) is a hyperparameter.
  • Ensemble Interpretation: Dropout can be viewed as implicitly training a large ensemble of thinned networks (networks with different subsets of neurons).
  • Inference Time:
    • Weight Scaling: The most common approach is to use the full network at test time but scale the outputs (equivalently, the outgoing weights) of layers that used dropout by the keep probability \(1-p\). Even more common is inverted dropout: activations are scaled up by \(1/(1-p)\) during training, so no scaling is needed at test time. Either way, this approximates averaging the predictions of the exponentially many thinned networks; see the sketch after this list.
    • Monte Carlo (MC) Dropout: Alternatively, one can perform multiple forward passes with dropout still active at test time and average the predictions. This provides an estimate of model uncertainty.
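
A minimal NumPy sketch of inverted dropout and of MC dropout at test time, with \(p\) the dropout probability as above (function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, training=True):
    """Inverted dropout: during training, zero each activation with probability p
    and scale the survivors by 1/(1-p); at test time, pass x through unchanged."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

# Monte Carlo dropout: keep dropout active at test time and average several passes
x = np.ones((4, 8))
mc_mean = np.mean([dropout_forward(x, p=0.5, training=True) for _ in range(100)], axis=0)
```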

Data Normalization

Normalizing input data can stabilize and accelerate training.

  • Motivation: Large input values or features with vastly different scales can lead to correspondingly large weight updates and an ill-conditioned loss landscape, potentially causing training instability or slow convergence.
  • Common Practice: Standardize features to have zero mean and unit variance: \(x'_{ij} = (x_{ij} - \mu_j) / \sigma_j\), where \(\mu_j\) and \(\sigma_j\) are the mean and standard deviation of feature \(j\) over the training set.
  • Test Time: The same mean and standard deviation values computed from the training data must be used to normalize test data to prevent data leakage and ensure consistency.
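
A short NumPy sketch of this convention, where the statistics are fit on the training split only and then reused everywhere (names and toy data are ours):

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps  # eps avoids division by zero for constant features
    return mu, sigma

def standardize(X, mu, sigma):
    """Apply the training-set statistics to any split (train, validation, or test)."""
    return (X - mu) / sigma

X_train = np.random.randn(100, 3) * [1.0, 10.0, 100.0]  # features on very different scales
X_test = np.random.randn(20, 3) * [1.0, 10.0, 100.0]
mu, sigma = fit_standardizer(X_train)
X_train_n, X_test_n = standardize(X_train, mu, sigma), standardize(X_test, mu, sigma)
```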

Batch Normalization (BN)

Batch Normalization normalizes the activations within the network, layer by layer, to address the problem of internal covariate shift.

  • Mechanism:
    1. For each mini-batch during training, compute the mean \(\mu_{\mathcal{B}}\) and variance \(\sigma_{\mathcal{B}}^2\) of the activations for each feature/channel, along the batch dimension.
    2. Normalize the activations: \(\hat{x}_i = (x_i - \mu_{\mathcal{B}}) / \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}\).
    3. Scale and shift the normalized activations using learned parameters \(\gamma\) (scale) and \(\beta\) (shift): \(y_i = \gamma \hat{x}_i + \beta\). These parameters allow the network to learn the optimal distribution for activations, potentially undoing the normalization if necessary.
  • Inference Time: During inference, the mini-batch statistics are unavailable. Instead, population statistics (running means and variances) estimated during training (e.g., via exponential moving averages of mini-batch means/variances) are used for normalization.
  • Benefits:
    • Reduces internal covariate shift (changes in the distribution of layer inputs during training), stabilizing learning.
    • Allows for higher learning rates and can make training less sensitive to weight initialization.
    • The bias terms in layers preceding BN become redundant (as \(\beta\) can fulfill their role) and can often be omitted.
    • The stochasticity introduced by using mini-batch statistics for normalization has a slight regularization effect.
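
A simplified NumPy sketch of a BN layer's forward pass for fully connected activations, including the running statistics used at inference (a toy sketch, not a framework implementation; gradient handling is omitted):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # learned scale
        self.beta = np.zeros(num_features)   # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu = x.mean(axis=0)   # mini-batch mean per feature
            var = x.var(axis=0)   # mini-batch variance per feature
            # exponential moving averages kept for use at inference time
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```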

Data Augmentation

Data augmentation artificially expands the training dataset by creating modified copies of existing data or synthesizing new data.

  • Principle: More diverse training data generally leads to better generalization and robustness.
  • Technique: Apply transformations to input data while preserving their labels.
  • Examples for Images:
    • Geometric Transformations: Rotation, scaling, translation, shearing, flipping.
    • Color Jittering: Adjusting brightness, contrast, saturation, hue.
  • Noise Injection: Adding random noise (e.g., Gaussian noise) to inputs can enforce robustness against small perturbations.
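
One possible image pipeline using torchvision (assuming it is installed; the specific transforms and magnitudes are illustrative, not prescriptive):

```python
import torch
from torchvision import transforms

# Geometric transforms, color jittering, and noise injection on the tensor image
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Lambda(lambda img: img + 0.01 * torch.randn_like(img)),  # Gaussian noise
])
```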

Transfer Learning

Transfer learning leverages knowledge gained from solving one problem to improve performance on a different but related problem.

  • Process:
    1. Pre-training: A model is trained on a large-scale dataset (e.g., ImageNet for computer vision tasks). This model learns general-purpose features.
    2. Fine-tuning: The pre-trained model (or parts of it) is then adapted to a smaller, target dataset. This typically involves replacing the final classification layer and re-training some or all of the network’s layers with a smaller learning rate.
  • Using Pre-trained Models: One can directly use well-known architectures (e.g., ResNet, VGG, BERT) that have been pre-trained on extensive datasets, saving significant training time and resources.
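
A typical fine-tuning sketch with a pre-trained ResNet-18 from torchvision, assuming a 10-class target task (the class count is hypothetical, and the weight-loading argument varies across torchvision versions):

```python
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # hypothetical target task

# Model pre-trained on ImageNet (API in torchvision >= 0.13; older versions use pretrained=True)
model = models.resnet18(weights="IMAGENET1K_V1")

# Optionally freeze the pre-trained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for the target task; only this layer (and any
# unfrozen layers) is then re-trained, typically with a small learning rate.
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
```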

Activation Functions

Activation functions introduce non-linearities into the network, enabling it to learn complex patterns.

  • Sigmoid: \(f(x) = 1 / (1 + e^{-x})\)
    • Output range: \((0, 1)\).
    • Saturates for large positive or negative inputs, leading to vanishing gradients (gradients close to zero), which can slow down or stall learning in deep networks.
    • Not zero-centered, which can be a minor issue for gradient updates.
  • Tanh (Hyperbolic Tangent): \(f(x) = (e^x - e^{-x}) / (e^x + e^{-x})\)
    • Output range: \((-1, 1)\).
    • Zero-centered, which is generally preferred over Sigmoid.
    • Also saturates and suffers from vanishing gradients, though its gradients are typically larger than Sigmoid’s around zero.
  • ReLU (Rectified Linear Unit): \(f(x) = \max(0, x)\)
    • Non-saturating in the positive domain, which helps alleviate the vanishing gradient problem and accelerates convergence.
    • Computationally efficient.
    • Can lead to “dead neurons”: if a neuron’s input is consistently negative, it will always output zero, and its gradient will also be zero, preventing weight updates for that neuron.
    • Unbounded in the positive direction.
  • Leaky ReLU (LReLU) / Parametric ReLU (PReLU): \(f(x) = \max(\alpha x, x)\)
    • Addresses the dead neuron problem by allowing a small, non-zero gradient when the unit is not active (e.g., \(\alpha = 0.01\) for LReLU).
    • In PReLU, \(\alpha\) is a learnable parameter.
    • Randomized Leaky ReLU (RReLU) samples \(\alpha\) from a distribution during training.
  • GELU (Gaussian Error Linear Unit): \(f(x) = x \Phi(x)\), where \(\Phi(x)\) is the cumulative distribution function (CDF) of the standard Gaussian distribution.
    • A smooth approximation of ReLU, often performing well in Transformer models.
    • Approximation: \(f(x) \approx 0.5x (1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])\).
    • Non-saturating in the positive domain.
  • Initialization: Proper weight initialization techniques (e.g., He initialization for ReLU-like activations) can also help mitigate issues like dead neurons and improve training stability.
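
For reference, minimal NumPy versions of the activations above, using the tanh approximation for GELU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```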

Optimization Algorithms

Optimization algorithms adjust the model’s parameters (weights and biases) to minimize the loss function.

  • Gradient Descent Variants:
    • Batch Gradient Descent (Full GD): Computes the gradient over the entire training dataset. The gradient estimate is exact, but each update is computationally expensive for large datasets, and the absence of gradient noise makes it easier to get stuck in poor local minima.
    • Stochastic Gradient Descent (SGD): Computes the gradient from a single training example at a time. Updates are cheap, and their noise can help escape local minima, but the gradient estimates have high variance.
    • Mini-Batch Gradient Descent: Computes the gradient using a small batch of training examples. Strikes a balance between the accuracy of full GD and the efficiency/stochasticity of SGD. This is the most common approach.
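
A minimal mini-batch SGD loop for a linear model with squared loss (entirely illustrative: the synthetic data, model, and hyperparameters are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
w, lr, batch_size = np.zeros(5), 0.01, 32

for epoch in range(10):
    perm = rng.permutation(len(X))                 # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]       # indices of one mini-batch
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of the mean squared error
        w -= lr * grad
```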

Momentum-Based Methods

These methods accelerate SGD by incorporating information from past updates.

  • Polyak’s Momentum (Heavy Ball Method):
    • Addresses oscillations in SGD, particularly in ravines (areas where the surface curves more steeply in one dimension than another).
    • Maintains an exponentially moving average (EMA) of past gradients, which acts as a “velocity” term.
    • Update equations: Let \(g_t = \nabla_{\theta} J(\theta_t)\) be the gradient at time \(t\). Velocity (EMA of gradients): \[ v_t = \beta v_{t-1} + (1 - \beta) g_t \] Parameter update: \[ \theta_{t+1} = \theta_t - \eta v_t \] where \(\beta\) is the momentum coefficient (e.g., 0.9) and \(\eta\) is the learning rate.
    • Convergence: While standard SGD with momentum is widely used and effective for non-convex optimization, its convergence properties can be complex. Polyak’s original heavy ball method has convergence guarantees for strongly convex functions.
  • Nesterov Accelerated Gradient (NAG):
    • A modification of momentum that provides a “lookahead” capability. It calculates the gradient not at the current position \(\theta_t\), but at an approximated future position \(\theta_t - \eta \beta v_{t-1}\) (where the parameters would be after applying the previous momentum). This allows for a more informed gradient step.
    • Update equations (one common formulation): Lookahead point: \[ \theta_{lookahead} = \theta_t - \eta \beta v_{t-1} \] Gradient at lookahead: \[ g_{lookahead} = \nabla J(\theta_{lookahead}) \] Velocity EMA: \[ v_t = \beta v_{t-1} + (1-\beta) g_{lookahead} \] Parameter update: \[ \theta_{t+1} = \theta_t - \eta v_t \]
    • Alternatively, a common implementation form (e.g., Hinton et al.): \[ v_{t+1} = \mu v_t - \epsilon \nabla J(\theta_t + \mu v_t) \] \[ \theta_{t+1} = \theta_t + v_{t+1} \] where \(\mu\) is the momentum coefficient and \(\epsilon\) is the learning rate.
    • The “lookahead” often provides faster convergence and better performance than standard momentum. The computation of gradients at different points can be re-written through a change of variables for implementation.
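
The two rules written out in NumPy, following the EMA formulation above (grad_fn, the toy objective, and the hyperparameters are placeholders):

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Polyak momentum: update the EMA of gradients, then step along the velocity."""
    g = grad_fn(theta)
    v = beta * v + (1.0 - beta) * g
    return theta - lr * v, v

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov momentum: evaluate the gradient at the lookahead point instead."""
    lookahead = theta - lr * beta * v
    g = grad_fn(lookahead)
    v = beta * v + (1.0 - beta) * g
    return theta - lr * v, v

# Toy usage on f(theta) = ||theta||^2, whose gradient is 2*theta
grad_fn = lambda th: 2.0 * th
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(50):
    theta, v = nesterov_step(theta, v, grad_fn)
```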

Adaptive Learning Rate Methods

These methods adjust the learning rate for each parameter individually based on the history of gradients.

  • Adagrad (Adaptive Gradient Algorithm):

    • Adapts the learning rate for each parameter, performing larger updates for infrequent parameters and smaller updates for frequent parameters.
    • Accumulates the sum of squared past gradients for each parameter.
    • Update for parameter \(\theta_i\): \[ G_{t,ii} = G_{t-1,ii} + g_{t,i}^2 \] \[ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i} \] where \(g_{t,i}\) is the gradient of the loss with respect to \(\theta_i\) at time \(t\), \(G_t\) is a diagonal matrix where each diagonal element \(G_{t,ii}\) is the sum of squares of past gradients for \(\theta_i\), \(\eta\) is a global learning rate, and \(\epsilon\) is a small constant for numerical stability.
    • Strength: Good for sparse data.
    • Weakness: The learning rate monotonically decreases and can become too small prematurely, effectively stopping learning.
  • RMSProp (Root Mean Square Propagation):

    • Addresses Adagrad’s aggressively diminishing learning rates by using an exponentially decaying moving average of squared gradients instead of summing all past squared gradients.
    • Update for parameter \(\theta_i\): \[ E[g^2]_{t,i} = \gamma E[g^2]_{t-1,i} + (1-\gamma) g_{t,i}^2 \] \[ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{E[g^2]_{t,i} + \epsilon}} g_{t,i} \] where \(E[g^2]_{t,i}\) is the moving average of squared gradients for \(\theta_i\), and \(\gamma\) is the decay rate (e.g., 0.9).
  • Adam (Adaptive Moment Estimation):

    • Combines the ideas of momentum (using an EMA of past gradients, first moment) and adaptive learning rates (using an EMA of past squared gradients, second moment, similar to RMSProp).
    • Includes bias correction terms for the first and second moment estimates, which are particularly important during the initial stages of training when the EMAs are initialized at zero.
    • Update equations: EMA of gradients (1st moment): \[ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \] EMA of squared gradients (2nd moment): \[ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \] Bias-corrected estimates: \[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \] \[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \] Parameter update: \[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \] where \(\beta_1\) (e.g., 0.9) and \(\beta_2\) (e.g., 0.999) are decay rates for the moment estimates. Adam is often a good default choice for many problems.
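
A direct transcription of the Adam equations into NumPy (the toy objective and hyperparameter defaults are illustrative):

```python
import numpy as np

def adam(grad_fn, theta0, steps=1000, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    theta = theta0.astype(float)
    m = np.zeros_like(theta)  # EMA of gradients (1st moment)
    v = np.zeros_like(theta)  # EMA of squared gradients (2nd moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction (moments start at zero)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2
theta_star = adam(lambda th: 2.0 * th, np.array([5.0, -3.0]))
```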

Learning Rate Schedules

A learning rate schedule dynamically adjusts the learning rate during training.

  • Step Decay: Reduces the learning rate by a fixed factor (e.g., 0.1) every few epochs.
  • Exponential Decay: Reduces the learning rate exponentially: \(\eta_t = \eta_0 \exp(-kt)\), where \(k\) is a decay rate.
  • Time-Based Decay: Reduces the learning rate as a function of iteration number, e.g., \(\eta_t = \eta_0 / (1 + kt)\).
  • Cosine Annealing: Varies the learning rate according to a cosine function, typically from an initial high value \(\eta_{max}\) down to a minimum value \(\eta_{min}\) (often 0) over a specified number of epochs (a “cycle”). \[ \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{cycle}}\pi\right)\right) \] where \(T_{cur}\) is the current epoch/iteration within the cycle and \(T_{cycle}\) is the total duration of the cycle. This schedule smoothly decreases the learning rate. It can be used with “restarts” (SGDR - Stochastic Gradient Descent with Warm Restarts), where the cosine cycle is repeated multiple times.
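
The cosine annealing formula translates directly into a small helper (names and default values are ours):

```python
import math

def cosine_annealing_lr(t_cur, t_cycle, eta_max=0.1, eta_min=0.0):
    """Learning rate at step t_cur within a cycle of length t_cycle."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_cycle))

# With warm restarts (SGDR), the cycle simply repeats: t_cur = step % t_cycle
lrs = [cosine_annealing_lr(t % 50, 50) for t in range(150)]  # three 50-step cycles
```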

Advanced Ensemble and Weight Averaging Techniques

Beyond basic bagging and boosting, several techniques leverage model weights or training dynamics for ensembling.

  • Snapshot Ensembling: Train a single model, but save “snapshots” (checkpoints) of its weights at different points during training (often at local minima found by a cyclical learning rate schedule). These snapshots are then ensembled.
  • Polyak-Ruppert Averaging (or EMA of weights): Maintain an exponential moving average of the model’s weights during the later stages of training. This averaged model often generalizes better.
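
An EMA-of-weights sketch in PyTorch (the decay value is typical but illustrative; model and ema_model are assumed to share the same architecture):

```python
import torch

def update_ema(ema_model, model, decay=0.999):
    """Polyak-Ruppert style averaging: ema <- decay * ema + (1 - decay) * current."""
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage: create ema_model as a deep copy of model, call update_ema(...) after each
# optimizer step in the later stages of training, and evaluate with ema_model.
```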

Fast Geometric Ensembling (FGE)

FGE leverages the observation that local minima in the loss landscape of deep neural networks are often connected by simple, low-loss paths.

  • Concept: Train a single network using a cyclical learning rate schedule (e.g., multiple cycles of cosine annealing).
  • Procedure:
    1. Define a learning rate schedule with multiple cycles (e.g., \(M\) cycles over \(T\) total epochs). Each cycle involves a period of higher learning rate followed by a decrease to a very low learning rate.
    2. At the end of each cycle, when the learning rate is at its minimum (allowing the model to converge to a local optimum), save a snapshot of the model weights.
    3. The ensemble consists of these \(M\) saved models. Their predictions are averaged at inference time.
  • Benefit: Allows finding multiple diverse, high-performing models with roughly the same computational cost as training a single model.
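
At inference time, ensembling the cycle-end snapshots amounts to averaging their predicted probabilities. A PyTorch-flavoured sketch (make_model, the snapshot paths, and the classification setting are placeholders):

```python
import torch

def fge_predict(make_model, snapshot_paths, x):
    """Average the softmax outputs of the M snapshots saved at the end of each cycle."""
    probs = []
    for path in snapshot_paths:
        model = make_model()                        # same architecture for every snapshot
        model.load_state_dict(torch.load(path))     # weights saved at a cycle minimum
        model.eval()
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)
```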

Stochastic Weight Averaging (SWA)

SWA is a simpler alternative that often yields benefits similar to FGE: instead of averaging the predictions of several snapshots, it averages their weights, which tends to find wider optima and improve generalization.

  • Procedure:
    1. Initial Training: Train a model using a standard learning rate schedule for a certain number of epochs (e.g., 75% of the total training budget).
    2. SWA Phase: For the remaining epochs (e.g., 25%), switch to a modified learning rate schedule:
      • Typically a constant high learning rate or a cyclical schedule with a relatively high average (e.g., short cosine annealing cycles).
    3. Weight Collection: During this SWA phase, collect model weights at regular intervals (e.g., at the end of each epoch or every few iterations).
    4. Averaging: Compute the average of these collected weights: \(w_{SWA} = \frac{1}{N_{models}} \sum_{i=1}^{N_{models}} w_i\).
    5. Batch Normalization Update: After obtaining \(w_{SWA}\), perform a forward pass over the entire training data using the \(w_{SWA}\) model to recompute the Batch Normalization running statistics. This is crucial for good performance.
  • Inference: Use the \(w_{SWA}\) model for predictions.
  • Advantages: Easy to implement and often provides a significant performance boost with minimal additional computational cost beyond the initial training.
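
A sketch of the SWA phase in PyTorch, averaging the state dicts collected at the end of each SWA-phase epoch and then refreshing the BN statistics (train_one_epoch, model, and loader are placeholders; PyTorch also ships torch.optim.swa_utils with ready-made helpers for this):

```python
import copy
import torch

def run_swa_phase(model, train_one_epoch, loader, swa_epochs=25):
    """Collect weights each epoch, average them, then recompute BN running statistics."""
    sums, n = None, 0
    for _ in range(swa_epochs):
        train_one_epoch(model, loader)          # constant or cyclical high learning rate
        state = copy.deepcopy(model.state_dict())
        if sums is None:
            sums = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in sums:
                sums[k] += state[k].float()
        n += 1
    # w_SWA = average of the collected weights
    reference = model.state_dict()
    swa_state = {k: (v / n).to(reference[k].dtype) for k, v in sums.items()}
    model.load_state_dict(swa_state)
    # Crucial step: one pass over the training data in train mode so the BN layers
    # recompute their running statistics under the averaged weights.
    model.train()
    with torch.no_grad():
        for xb, _ in loader:
            model(xb)
    return model
```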