4. Optimization

4.1. Gradient Descent

  • Simple idea: move in the direction of the negative gradient until we reach the minimum.
  • Saddle point: the gradient is zero but it is neither a minimum nor a maximum.
  • We take the gradient of all the training data at once at each epoch.
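
A minimal sketch of one full-batch gradient descent step; grad_loss is a hypothetical function that returns the gradient of the loss over the entire training set:

def gd_step(theta, X_train, y_train, grad_loss, lr=0.1):
    # One GD step: gradient over ALL training data, then move against it.
    g = grad_loss(theta, X_train, y_train)
    return theta - lr * g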

4.2. Stochastic Gradient Descent

  • Instead of doing the update at the end of the epoch, we do it after each batch (iteration).
  • Less accurate but faster.
  • The noise in the mini-batch gradient has a regularizing effect; plain GD tends to overfit the training loss. The noise also helps escape local minima.
  • Almost always better than GD in practice; Adam, in turn, is almost always better than plain SGD.
  • Optimizer step in every iteration rather than every epoch.
  • The theoretical definition uses a batch size of 1, but in practice we use a mini-batch.
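
A sketch of one epoch of mini-batch SGD (grad_loss is the same hypothetical gradient function as above); the parameters are updated after every batch instead of once per epoch:

import numpy as np

def sgd_epoch(theta, X, y, grad_loss, lr=0.1, batch_size=32):
    # Shuffle, then do an optimizer step after every mini-batch.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        g = grad_loss(theta, X[batch], y[batch])  # noisy mini-batch gradient
        theta = theta - lr * g
    return theta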

4.3. SGD with Momentum

  • We add a fraction of the previous update to the current update.
  • \(\gamma = 0.9\) is a common value for the momentum coefficient.
  • \(\eta\) is the learning rate. \[ m_{t+1} = \gamma m_t + \eta \nabla L \] \[ \theta_{t+1} = \theta_t - m_{t+1} \]
  • First order optimization method.
  • When the slope is steep the velocity will increase.
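
A sketch of the momentum update above, using the same symbols as in the formulas (grad stands for the current gradient of the loss):

def momentum_step(theta, m, grad, lr=0.01, gamma=0.9):
    # Accumulate a velocity from past gradients and step along it.
    m = gamma * m + lr * grad   # m_{t+1} = gamma * m_t + eta * grad
    theta = theta - m           # theta_{t+1} = theta_t - m_{t+1}
    return theta, m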

4.4. Nesterov Momentum

  • It calculates the gradient not at the current point but at the point where the momentum would take us, i.e. it looks ahead.
  • Algorithm:
    • Compute the new point just ahead of the current point, given the current momentum.
    • Compute the gradient at that new point.
    • Update the momentum and the current point.
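
A sketch of the look-ahead step, assuming a hypothetical grad_fn(theta) that evaluates the gradient at a given point:

def nesterov_step(theta, m, grad_fn, lr=0.01, gamma=0.9):
    lookahead = theta - gamma * m   # where the momentum alone would take us
    g = grad_fn(lookahead)          # gradient at the look-ahead point
    m = gamma * m + lr * g          # update the momentum
    theta = theta - m               # update the parameters
    return theta, m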

4.5. RMSprop

\[ v_{t+1} = \gamma v_t + (1 - \gamma) (\nabla L)^2 \]

\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_{t+1} + \epsilon}} \nabla L \]

  • Sometimes called a second-order method because it uses the second moment of the loss gradient, not second derivatives.
  • Idea: divide the learning rate by the square root of the sum of the squared gradients (exponential moving average).
  • Dampens the oscillations, adapts the learning rate to the gradient.
  • Squared gradients approximate the variance.
  • Biased towards 0 in the beginning.
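
A sketch of the RMSprop update above (element-wise operations on NumPy arrays):

import numpy as np

def rmsprop_step(theta, v, grad, lr=0.001, gamma=0.9, eps=1e-8):
    v = gamma * v + (1 - gamma) * grad**2         # EMA of squared gradients
    theta = theta - lr / np.sqrt(v + eps) * grad  # per-parameter step size
    return theta, v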

4.6. Adam

  • Combines RMSprop and momentum.

    \[ m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla L \]

    \[ v_{t+1} = \beta_2 v_t + (1 - \beta_2) (\nabla L)^2 \]

    \[ \hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}} \]

    \[ \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}} \]

    \[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_{t+1} + \epsilon}} \hat{m}_{t+1} \]

  • The momentum and velocity are very small in the beginning because they are initialized to 0; that is why we normalize them. After a few iterations \(\hat{m} \approx m\) and \(\hat{v} \approx v\). This is called bias correction.

  • m for mean or first moment, v for variance or second moment.

  • Adam and RMSprop are called adaptive optimizers since they adapt the learning rate to the gradient.

    • Each parameter has its own learning rate depending on the velocity.
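
A sketch of the Adam update above, including the bias correction (t is the iteration counter, starting at 0):

import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - b1**(t + 1))      # bias correction for the zero init
    v_hat = v / (1 - b2**(t + 1))
    theta = theta - lr / np.sqrt(v_hat + eps) * m_hat
    return theta, m, v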

4.7. Newton's method and its variants

  • Second order optimization method.

  • Used to find the roots of a function. Our function is the gradient of the loss.

    \[ x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)} \]

    This requires computing the inverse Hessian, which is expensive, especially if we do it on the whole dataset (which would be needed for the nice convergence properties). In matrix form:

    \[ \theta_{t+1} = \theta_t - H^{-1} \nabla L \]

  • Variants that only approximate the Hessian:

    • L-BFGS (Limited-memory BFGS)
  • Still needs the whole dataset in RAM.
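
A 1-D sketch of the Newton update above: to minimize f we look for a root of f', dividing by the second derivative:

def newton_minimize_1d(x, df, d2f, n_steps=20):
    # x <- x - f'(x) / f''(x): a curvature-aware (second-order) step.
    for _ in range(n_steps):
        x = x - df(x) / d2f(x)
    return x

# Example: f(x) = (x - 3)^2, f'(x) = 2(x - 3), f''(x) = 2; converges to 3.0
x_min = newton_minimize_1d(0.0, lambda x: 2 * (x - 3), lambda x: 2.0)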

4.8. Optimization problems and solutions

4.8.1. Overfitting and Underfitting

  • We are optimizing the training loss; if we fit it too well we will overfit.
  • There are multiple reasons for this:
    • Model too complex, too many parameters. Memorizing the training data.
    • Too few data, not enough to generalize.
    • Not stopping the training at the right time: the model learns the noise in the data.
      • Note: Early stopping is not a regularization method. It doesn’t make the training harder.
  • We can detect it with the validation loss or a validation metric.
  • Generalization gap: difference between the training and validation loss becomes bigger and bigger.
  • Solutions, regularization:
    • Weight decay / L2 penalty: add the sum of the squared weights to the loss (or shrink the weights in the optimizer step).
    • Dropout. Randomly set some neurons to 0.
    • Data augmentation. Add noise to the data.
    • Hyperparameter tuning. Find the right model complexity.
  • Underfitting: the model is too simple.
    • Increase the model complexity.
    • Increase the number of epochs.
    • Learning rate decay or a better optimization method.

4.8.2. Vanishing and Exploding Gradients

  • Exploding gradients: the gradient is too big and training diverges.
    • Recurrent cells: the gradients are multiplied by the same matrix at each time step; if its eigenvalues are larger than 1 the gradient explodes (if smaller than 1, it vanishes).
    • No normalization: If activations become very large the gradients will also become very large.
    • Bad initialization: If the weights are too big the gradients will also be too big, use Xavier or Kaiming initialization.
    • Solutions: gradient clipping (sketched below).
    • Normalize activations: Batch normalization, layer normalization.
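
A sketch of gradient clipping by global norm (the threshold max_norm is an assumed hyperparameter):

import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Rescale all gradients if their joint L2 norm exceeds max_norm.
    norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads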

4.8.3. Learning rate scheduling - decay

  • Theoretical conditions for convergence on strictly convex functions: the sum of the learning rates should be infinite, while the sum of their squares should be finite. Example: \(\eta_t = 1/t\).
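
A sketch of a 1/t schedule that satisfies those conditions (the base learning rate is an assumption):

def lr_at_step(t, base_lr=0.1):
    # Sum of eta_t diverges, sum of eta_t^2 is finite.
    return base_lr / (t + 1)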

4.8.4. Regularization

  • Technique to make training harder to prevent overfitting.

L1 and L2 regularization + weight decay

  • The model has to split the load between the weights instead of making any single weight too big.
  • We introduce a second objective to the loss function.
  • L1 vs L2: L1 will make the weights sparse, L2 will make the weights small and spread out.
  • With weight decay we don't add the regularization term to the loss function but apply it directly in the update step. It is exactly equivalent to L2 regularization for SGD, but different for Adam, where its magnitude is then controlled by the learning rate.
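
A sketch contrasting the two formulations for plain SGD, where they coincide (lam is the regularization strength, an assumed hyperparameter):

def sgd_l2_in_loss(theta, grad_data, lr=0.01, lam=1e-4):
    # L2 term in the loss: its gradient lam * theta is added to the data gradient.
    return theta - lr * (grad_data + lam * theta)

def sgd_weight_decay(theta, grad_data, lr=0.01, lam=1e-4):
    # Weight decay: shrink the weights directly in the update step.
    return theta * (1 - lr * lam) - lr * grad_data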

Dropout

  • Ensemble = multiple models trained on the same data; dropout is a cheap way to approximate an ensemble.
  • With a probability p we drop a neuron (not the weights).
    • Neuron has multiple weights.
    • The model learns not to rely on a single neuron.
    • During inference we don’t drop neurons. We scale the output by 1-p.
  • It is typically applied last in a block: linear, normalization, activation, dropout.
  • Inverted dropout: we scale the kept activations up during training so we don't have to scale the output during inference.
  • Dropout for convolutions: we drop the whole channel.
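
A minimal sketch of an inverted-dropout forward pass (p is the probability of dropping a neuron):

import numpy as np

def dropout_forward(x, p=0.5, training=True):
    if not training:
        return x  # no dropping and no rescaling at inference
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)  # drop with prob p, scale the rest up
    return x * mask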

Data Augmentation

  • Add noise to the data or virtually increase the dataset size by applying transformations.
  • Usually image data.
  • Applied only to the training set in the dataloader.
  • Rotation, flipping, cropping, translation, scaling, color jittering, cutout.
  • We need to transform the labels as well when they depend on geometry (e.g. segmentation masks or bounding boxes).
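
A sketch of a geometric augmentation where the label (here a segmentation mask) must be transformed together with the image:

import numpy as np

def random_hflip(image, mask, p=0.5):
    # image: (H, W, C), mask: (H, W); flip both or neither.
    if np.random.rand() < p:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return image, mask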

Batch Normalization

  • Makes samples in the batch interact with each other, unlike all other layers.

  • Keep magnitudes of the activations in the same range.

  • Normalize a group of neurons in the same layer that are created by the same weight.

  • So in an FC layer, a neuron corresponds to a feature and we normalize each feature over the batch.

    \[ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \]

    \[ y = \gamma \hat{x} + \beta \]

  • During training we calculate the mean and variance of the batch and keep a running average of both.

  • During inference we use the running averages to normalize the data; the learned \(\gamma\) and \(\beta\) scale and shift the normalized data back.

  • The main difference is that batch normalization in a convolutional layer normalizes the activations across the mini-batch and spatial dimensions for each channel independently, while in a fully connected layer it normalizes each feature across the mini-batch.

    • We flatten the spatial dimensions.


\[ \mu_{\text{running}} = \alpha \mu_{\text{running}} + (1 - \alpha) \mu_{\text{batch}} \]

  • If the batch size is too small (<16) the normalization will be noisy.
  • It can be seen as a regularizer. It adds noise to the training. It also makes the optimization easier.
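
A sketch of a batch-norm forward pass for a fully connected layer, keeping running statistics for inference (the momentum value alpha is an assumption):

import numpy as np

def batchnorm_forward(x, gamma, beta, run_mean, run_var,
                      training=True, alpha=0.9, eps=1e-5):
    # x: (N, D); gamma, beta, run_mean, run_var: (D,)
    if training:
        mu = x.mean(axis=0)                              # batch statistics per feature
        var = x.var(axis=0)
        run_mean = alpha * run_mean + (1 - alpha) * mu   # running averages for inference
        run_var = alpha * run_var + (1 - alpha) * var
    else:
        mu, var = run_mean, run_var                      # use the stored running averages
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta, run_mean, run_var      # learned scale and shift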

4.8.5. Weight Initialization

Xavier Initialization

  • Also known as Glorot initialization. It aims to keep the gradients the same size in all layers. Useful for tanh and sigmoid activation functions.
  • The variance of the input and output should be the same: Gaussian distribution with mean 0 and variance 1/n, where n is the number of input features.

Kaiming Initialization

  • Also known as He initialization. It is used for ReLU and its variants.
  • ReLU kills half of the activations, so we need to double the variance.
  • Variance of the input should be 2/n.
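
A sketch of both initialization schemes for a weight matrix of shape (n_in, n_out):

import numpy as np

def xavier_init(n_in, n_out):
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)  # variance 1/n_in (tanh/sigmoid)

def kaiming_init(n_in, n_out):
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)  # variance 2/n_in (ReLU)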


4.9. Transfer Learning

  • When the task or data distribution is similar, we can reuse the feature extractor of a trained model.
  • Low level features are similar in all images.
  • Pretrained models: Trained on large datasets like ImageNet. Can be transferred to other tasks.
  • Adaptation to new tasks: replacing the classifier and freezing the feature extractor, or setting a very low learning rate for the feature extractor.
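
A PyTorch-style sketch of the adaptation step; the fc attribute (a ResNet-style classifier head) and num_classes are assumptions, not from the notes:

import torch.nn as nn

def adapt_pretrained(model, num_classes):
    for param in model.parameters():
        param.requires_grad = False  # freeze the pretrained feature extractor
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable classifier head
    return model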

4.10. Questions

  1. Optimizers:

    1. What is the main difference between gradient descent (GD) and stochastic gradient descent (SGD)?
    • GD performs the optimization step after the entire dataset is processed, while SGD performs the optimization step after each batch.
    1. Name two advantages of SGD over GD.
    • Faster convergence, less memory consumption.
    • Can escape local minima by introducing noise. Acts as a regularizer.
    1. Why can we call RMS prop an adaptive optimizer?
    • The learning rate is divided by the square root of the running average of squared gradients (the velocity), so each parameter effectively has its own learning rate.
    1. What is the bias correction in Adam? Why isn’t it implemented in RMS prop?
    • Since the velocity and momentum are initialized to 0, they are biased towards 0 in the beginning. In RMSprop it's not done simply because bias correction hadn't been introduced yet.
    1. Adam with bias correction and Adam without bias correction will NOT converge to the same minimum. True or False?
    • In practice they should converge to roughly the same minimum, just more slowly without the correction; in theory it is not guaranteed.
    1. What two optimizers does Adam combine?
    • RMS prop and momentum.
    1. Why is SGD+momentum usually better than SGD?
    • It uses information from previous gradients to smooth the optimization path, while speeding up convergence in directions with consistent gradients. Also can escape saddle points.
    1. Why do we always step in the direction of the negative gradient?
    • Since the direction of the positive gradient is the direction of the maximum increase of the function.
    1. Write down a modified version of the SGD optimizer step, in case we want to maximize the loss instead of minimizing it.
    • \(\theta_{t+1} = \theta_t + \eta \nabla L\)
  2. Overfitting and underfitting:

    1. Define overfitting in a short sentence.
    • The model is too complex and is learning the noise and memorizing the data. Performs well on the training data but poorly on the validation data.
    1. Define underfitting in a short sentence.
    • Model too simple or undertrained to capture the training data. Performs poorly on both the training and validation data.
    1. State two different behaviors that indicate underfitting.
    • Validation loss is still decreasing at the end of training.
    • The accuracy is low on both the training and validation data.
    1. State a possible reason for a situation where the validation loss is lower than the training loss.
    • Data leakage, a validation set that is easier than (not representative of) the training data, or a bug in the code; regularization such as dropout being active only during training can also cause this.
    1. In a single word, what is the go-to solution to overfitting?
    • Regularization.
  3. Regularization:

    1. The term “regularization term” is ambiguous. What are the two usages of such “terms”?
    • A term that makes the training more difficult, hence forcing the model to generalize better.
    • A term added to the loss function that represents an additional objective.
    1. What is the effect of L1 and L2 regularization terms on the weights?
    • L1 makes the weights sparse, L2 makes the weights small and spread out.
    1. What are the two differences between L1 and L2 regularization terms and “weight decay”?
    • The L1 and L2 regularization terms are a part of the loss function.
    • Weight decay is a part of the optimizer, in the case of Adam it varies with the learning rate.
    1. How can we avoid the computational overhead of the dropout layer during inference?
    • We can divide the output by 1-p during training. This is called inverse dropout.
    1. Explain how regular dropout works during training and during inference. Hint: crucial to distinguish between the definitions of p.
    • p = probability of dropping a neuron
    • Training: We drop a neuron with probability p.
    • Inference: We don’t drop any neurons but multiply the output by 1-p.
    1. Why is data augmentation considered a regularization technique?
    • Since it makes the training harder, the model learns to generalize better.
    1. Why don’t I allow you to consider Early-stopping as a regularization technique?
    • Since it doesn’t make the training harder, it just keeps the model from overfitting. Saving checkpoints is a better approach.
  4. Batch Normalization:

    1. Given a loss value L, and a fully connected layer with output shape 8 × 16, followed by a batch normalization layer: \(y = BN(x) = \gamma x_{\text{norm}} + \beta\)

    1. What are the dimensions of the parameters \(\gamma\) and \(\beta\)? - The same as the number of features, 16.
    2. Show a derivation of the gradients of those parameters and show how to use NumPy to calculate it.
    • \[ y = BN(x) = 1_{N} \cdot \gamma \cdot x_{\text{norm}} + 1_{N} \cdot \beta \] \[ \frac{\partial L}{\partial \gamma} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial \gamma} = 1_N^\top \left( \frac{\partial L}{\partial y} \odot x_{\text{norm}} \right) \]

\[ \frac{\partial{L} }{ \partial{\beta}} = \frac{\partial L}{ \partial y} \frac{\partial y}{\partial \beta} = 1_N^T \frac{\partial L}{\partial y} \]

dgamma = np.sum(dout * x_norm, axis=0)
dbeta = np.sum(dout, axis=0)
    1. Why is batch normalization sometimes referred to as a regularization technique?
    • Since it adds noise to the training, making it harder. It also makes the optimization easier.
    1. Explain why we need to save the running averages of the mean and variance in the batch norm to memory during training.
    • To use them during inference to normalize the data.
    1. What optimization problems could batch norm help us solve?
    • It normalizes the activations, preventing vanishing and exploding gradients.
  5. Hyperparameter tuning:

    1. What is the main bottleneck for grid search?
    • We are searching in a high-dimensional space; the number of combinations grows exponentially with the number of hyperparameters.
  6. Weight initialization:

    1. What weight initialization scheme fits the ReLU activation function? How does it affect the output of the activation function?
    • Kaiming/He initialization: it keeps the variance of the input and output the same, doubling the variance to account for ReLU killing half of the activations. \(w \sim \mathcal{N}(0, 2/n_{\text{in}})\)
    1. What do we expect to observe if we initialize the weights of the model to the same value?
    • All neurons receive the same gradients and learn the same thing; the model stays symmetric, so groups of neurons end up with identical features. If initialized to zero with ReLU activations, this can also lead to the dead ReLU problem.
  7. Transfer Learning:

    1. Explain one of the scenarios in the exercises where we used transfer learning, and how.
    • We trained an autoencoder on a large dataset on a reconstruction task and then used the encoder as a feature extractor for a classification task.
    • Trained a segmentation model using a pretrained model for feature extraction.