1. ML Basics
1.0.1. Supervised vs Unsupervised Learning
1.0.2. Dataset and Data-loaders
- ML is data-driven.
- Data is split into training, validation, and test sets.
- We minimize over the training set and usually overfit.
- Preventing data leakage is crucial.
- Good performance on the test set is essential.
- Common splits: 80/10/10 or 70/15/15; training should be 50% or more.
- Validation and test sets must be large enough to be good estimators of the distribution.
- Shuffling data is important to avoid biases (e.g., daytime vs. nighttime images).
- The Dataset class in PyTorch needs __len__ and __getitem__ methods. __len__ returns the length of the dataset; __getitem__ returns the i-th element, which can be loaded from disk, normalized, transformed, etc.
- The DataLoader takes a dataset, creates batches, and loads them into RAM efficiently.
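A minimal sketch of a custom PyTorch Dataset and DataLoader, assuming images that already fit in memory; the class name and the toy data are illustrative, not a fixed API.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ImageDataset(Dataset):          # hypothetical example dataset
    def __init__(self, images, labels, transform=None):
        self.images = images          # e.g. a tensor of shape (N, C, H, W)
        self.labels = labels
        self.transform = transform    # normalization / augmentation callable

    def __len__(self):
        # number of samples in the dataset
        return len(self.images)

    def __getitem__(self, i):
        # return the i-th (possibly transformed) sample and its label
        x, y = self.images[i], self.labels[i]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

# toy data: 100 RGB images of size 8x8 with binary labels
ds = ImageDataset(torch.randn(100, 3, 8, 8), torch.randint(0, 2, (100,)))
loader = DataLoader(ds, batch_size=16, shuffle=True)   # batching + shuffling
for x, y in loader:
    print(x.shape, y.shape)   # torch.Size([16, 3, 8, 8]) torch.Size([16])
    break
```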
1.0.3. Data Augmentation
- Artificially augmenting data improves training.
- Techniques include flipping, rotating, cropping images, Gaussian blur, etc., with a probability \(p\).
- Acts as a regularization method to prevent overfitting.
- Not applied to validation and test sets, unlike data processing.
- Should be problem-specific (e.g., cannot flip a 9).
1.0.4. Data Processing
- Applied to all samples.
- Normalization ensures all dimensions are on the same scale, easing learning and preventing exploding gradients.
1.0.5. Exam Questions
Difference between supervised and unsupervised learning?
- Supervised: Trained on existing and fixed ground-truth data, e.g., classes for classification, depth maps for depth estimation.
- Unsupervised: No labels, find patterns in the data. Clustering, dimensionality reduction, anomaly detection.
Difference between classification and regression?
- Classification predicts discrete labels, while regression predicts continuous values.
Advantages of unsupervised learning over supervised learning?
- Labeling data is expensive and sometimes impossible to obtain. Can be used to find patterns in data without prior knowledge.
- In semi-/self-supervised settings we create labels from the data itself, as in masked language modeling.
Example of using unsupervised learning to improve accuracy.
- Autoencoders for feature extraction.
- Training a language model on a masked language modeling task and then fine-tuning it on a classification task.
K-means:
- What is the K-means clustering algorithm?
- Unsupervised learning algorithm used for clustering into \(k\) groups. Each data point is assigned to the cluster with the nearest centroid, then the centroids are recalculated. Iterative algorithm that converges to a local minimum.
- What is its key hyperparameter?
- The number of clusters, \(k\).
- Consequences of a too small or too large value of its hyperparameter?
- Too small \(k\) can underfit (oversimplify) the data, while too large \(k\) can overfit (make it too complex).
PCA:
- What is PCA?
- Unsupervised dimensionality reduction technique. Uses a linear transformation to project data into a lower-dimensional space that preserves the most variance.
- What Deep Learning architecture could perform a similar task? Why is the deep learning method preferable for real-world problems?
- Autoencoders: they can approximate non-linear functions and reduce dimensions.
Data splits:
- Importance of shuffling data before splitting?
- To avoid biases and ensure the data distribution is uniform across splits.
- Common split ratios?
- 80/10/10 or 70/15/15 for training/validation/test.
- Why is the training set bigger?
- To provide more data for the model to learn from.
- When can we relax the ratio between the splits to be more even? When the other way around?
- More even when the dataset is small; more skewed towards training when the dataset is large.
- Consequence of a too small validation set?
- It may not provide a reliable estimate of model performance.
- How to overcome a too small training set?
- Use data augmentation to artificially increase the dataset size.
Why not use the validation set within the training set and just use the test set for validation?
- To monitor overfitting and perform hyperparameter tuning without biasing the test set.
Main issue with modern dataset sizes? Why is it so?
- They can be extremely large, making them difficult to handle and process efficiently.
- Still limited and cannot represent the entire population.
To which part of the dataset should we apply data augmentation? Why?
- To the training set, to increase the diversity of data the model learns from and reduce overfitting.
Difference between data augmentation and data processing techniques like normalization?
- Data augmentation: artificially increases the dataset size by applying transformations to the data, it is a regularization method.
- Data processing: applied to all samples and all splits. Normalization ensures all dimensions are on the same scale, easing learning and preventing exploding gradients.
- Why is the mean and standard deviation of the data calculated only over the training set?
- To prevent data leakage and keep validation and test sets unbiased.
2. Neural Networks
- Composed of a series of functions.
- Circles (nodes) represent neurons.
- Lines (edges) represent connections between neurons.
- Each neuron holds a single value.
- Computations usually occur in parallel (except for batch normalization).
2.0.1. Common Notations
- Index connections as $ W_{lio} $: $ l $ = layer, $ i $ = input neuron, \(o\) = output neuron.
- Apply activation functions, which are usually element-wise non-linearities.
- Softmax is an exception: it is not element-wise, so its derivative is a full Jacobian rather than an element-wise multiplication.
2.0.2. Fully Connected Layer
- Output neuron is a weighted function of all the input neurons.
- Equation:
\[ Y = XW + b \]
where
\[ X \in \mathbb{R}^{n \times d}, \, W \in \mathbb{R}^{d \times m}, \, b \in \mathbb{R}^{m}, \, Y \in \mathbb{R}^{n \times m} \]
- Fully connected networks (MLP, FCN, affine layer) capture global features (e.g., overall color).
2.0.3. Backpropagation
- Computes gradients layer by layer.
- Gradients flow backward, from the loss at the output towards the input layers.
- Only take gradients with respect to the loss function (scalar).
2.0.4. Activation Functions
- Make neural networks universal function approximators.
- Activation map: output of a linear layer after an activation.
- Features: meaning of a group of individual numbers together.
- Zero-centered functions output both positive and negative values.
Sigmoid
- Range: \((0, 1)\).
- Derivative: \[ \text{sigm}(1 - \text{sigm}) \] .
- Issues: vanishing gradients, not zero-centered.
Hyperbolic Tangent (tanh)
- Range: \((-1, 1)\).
- Zero-centered.
- Derivative: \[ 1 - \text{tanh}^2 \]
ReLU
- Range: \((0, \infty)\).
- Simple to compute, not differentiable at 0, not zero-centered.
- Issues: dying ReLU problem (negative inputs become 0).
Leaky ReLU
\[ \max(\alpha x, x) \]
where \(\alpha\) is a small positive number.
- Zero-centered.
Parametric ReLU (PReLU)
- Like Leaky ReLU, but \(\alpha\) is a learnable parameter.
ELU (Exponential Linear Unit)
- Maps \(\mathbb{R}\) to \(\mathbb{R}\).
- Zero-centered, solves the dying ReLU problem.
MaxOut
- Takes the maximum of two affine layers’ outputs.
- Inefficient.
Gelu
- Gaussian Error Linear Unit.
- Smooth approximation of ReLU.
- Can be interpreted as the input scaled by the standard normal CDF \(\Phi\): \[ \text{GELU}(x) = x \cdot \Phi(x) \] Approximation: \[ \text{GELU}(x) = 0.5x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right) \]
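A small NumPy check comparing the exact GELU (via the error function) with the tanh approximation above; the test points are arbitrary.

```python
import numpy as np
from math import erf

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation from the notes
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(gelu_exact(x), 4), round(float(gelu_tanh(x)), 4))
```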
Softmax
- Maps \(\mathbb{R}^k\) to \([0, 1]^k\).
- Works row-wise for multi-class classification or attention.
- Magnifies input differences, making the network more confident in predictions.
- Derivative: \[ \frac{\partial\, \text{softmax}_i}{\partial x_j} = \text{softmax}_i (\delta_{ij} - \text{softmax}_j) \]
- For numerical stability, shift by the maximum value \(x_i - \max(x)\).
2.1. Loss Functions
- Minimizing the training loss is the goal of training.
- Always a scalar value.
2.1.1. Classification Losses
3.1.1. Binary Cross-Entropy
\[ \text{BCE}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]
- N is the number of residuals, which is not always the number of samples (e.g., per-pixel predictions in semantic segmentation).
- We have to clip the model outputs to avoid \(\log(0)\), which is undefined.
- \(\hat{y}\) is the output after the sigmoid activation function.
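A minimal NumPy sketch of BCE with clipping, as described above; the eps value is an illustrative choice.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # clip predictions so log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # average over all residuals (samples, pixels, ...)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.6, 0.0])   # the 0.0 would break log() without clipping
print(bce(y_true, y_pred))                # finite thanks to clipping
```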
3.1.2. Categorical Cross-Entropy
\[ \text{CCE}(y, \hat{y}) = -1/R \sum_{i=1}^R \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij}) \]
- \(y_{ij}\) is the one-hot encoded vector of the target.
- R is the number of residuals.
2.1.2. Regression Losses
3.2.1. Mean Squared Error
\[ \text{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \]
- Also called L2 loss.
- Penalizes large errors quadratically.
- Not robust to outliers.
- Predictions tend to look blurry (MSE favors the mean of the data).
3.2.2. Mean Absolute Error
\[ \text{MAE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i| \]
- Also called L1 loss.
- Manhattan distance.
- Less sensitive to outliers.
- Predictions look sharper (MAE favors the median of the data).
2.2. Models
2.2.1. Linear Regression
\[ y = Wx + b \]
\[ W = (X^T X)^{-1} X^T y \]
- Linear model
- Has closed-form solution but not scalable.
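A small NumPy sketch of the closed-form (normal equation) solution on toy data; variable names and sizes are illustrative. As noted above, this approach does not scale to very large datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=(D,))
y = X @ w_true + 0.01 * rng.normal(size=N)     # noisy linear targets

# closed-form solution; a bias column absorbs b
Xb = np.hstack([X, np.ones((N, 1))])
w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # more stable than an explicit inverse
print(w_hat[:-1])   # close to w_true
print(w_hat[-1])    # bias, close to 0
```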
2.2.2. Logistic Regression
\[ y = \text{sigmoid}(Wx + b) \]
- Binary classification.
- Non-linear model.
- MLE gets us back to BCE loss.
- We set a threshold to classify the output, usually 0.5.
2.3. Questions
- Linear Regression
Assume a linear problem of fitting with 5 billion points, each with 100 features. Is it feasible to find the optimal without a neural network? State if yes or no, and explain why. Answer: No. The computational complexity and memory requirements for handling such a massive dataset with traditional linear regression methods would be prohibitive.
How can we transform a linear problem (with only affine layers) into a classification problem? Answer: We can transform a linear problem into a classification problem by adding a softmax layer at the end of the affine layers to produce probability distributions over the classes.
- Fully Connected Layers 1. Given the following layer \[ y = f(X, W, Z, R, T, A) = XW + ZR + T + A \]
\[ X \in \mathbb{R}^{N \times D} , W \in \mathbb{R}^{D \times M} , Z \in \mathbb{R}^{N \times D} , R \in \mathbb{R}^{D \times M} , T \in \mathbb{R}^{N \times M} , A \in \mathbb{R}^{N \times M} \]
and a loss function defined as
\[ L = \sum_{i=1}^N \sum_{j=1}^M y_{ij} \]
Answer:
\[ \frac{\partial L}{\partial Z} = \frac{\partial L}{\partial y} R^T \]
\[ \frac{\partial L}{\partial A} = 1_{N \times M} \circ \frac{\partial L}{\partial y} = \frac{\partial L}{\partial y} \]
- Given a batch of images of shape \((8 \times 3 \times 8 \times 8)\). How can we process them through a fully connected layer? What would be the shape of the weight matrix \(W\) in the case of logistic regression and BCE as a loss function?
Answer: Flatten each image to shape \((8, 192)\). The weight matrix $ W $ would have shape \((192, 1)\).
- Given a batch of images of shape \(8 \times 3 \times 8 \times 8\), a fully connected network (FCN), and a task to detect eyes of people in the images. Name two disadvantages for trying to solve the task with the current setup.
Answer: 1. Loss of spatial information due to flattening. 2. High computational cost and overfitting risk due to large number of parameters.
- Given two affine layers: \[ y_1 = f_1(X, W_1, B_1) = XW_1 + B_1 \]
\[ y_2 = f_2(y_1, W_2, B_2) = y_1W_2 + B_2 \]
Show that it could be described as a single affine layer. Answer: Combine the layers:
\[ y_2 = (XW_1 + B_1)W_2 + B_2 = X(W_1W_2) + (B_1W_2 + B_2) \]
This is equivalent to a single affine transformation:
\[ y_2 = XW' + B' \]
where $ W' = W_1W_2 $ and $ B' = B_1W_2 + B_2 $.
Activation Functions
What is the purpose of the activation functions in a neural network? Answer: To introduce non-linearity, allowing the network to learn complex patterns.
- Give two advantages of the Tanh function over the Sigmoid function. Answer:
- Zero-centered output.
- Stronger gradients (steeper derivatives) which can mitigate the vanishing gradient problem.
- Explain the vanishing gradient problem, and describe one method that could help us solve it. Answer:
The vanishing gradient problem occurs when gradients become exponentially small as they are propagated backward through many layers, so the earliest layers are barely updated. One method to address this is to use activation functions like ReLU or LeakyReLU.
Residual connections: skip connections that allow gradients to flow directly to earlier layers, mitigating the vanishing gradient problem.
We’ve learned that the sigmoid function can cause the “vanishing gradient” problem. Therefore, explain why the sigmoid function is still sometimes used on the logits (the output of the last layer). Answer: The sigmoid function is used on the logits in binary classification tasks to map the output to a probability between 0 and 1.
Assume the LeakyReLU activation function. If we take \(\alpha < 0\) what property of the function would we lose? Answer: The function would no longer be zero-centered.
The Softmax function could suffer from numerical instability, given the logits. Show that reducing the maximum value of the logits from each one of the values in the vector would not change the output of the function: \[ \text{Softmax}(x - \max(x))_i = \text{Softmax}(x)_i \] Answer:
\[ \text{Softmax}(x - \max(x))_i = \frac{e^{x_i - \max(x)}}{\sum_{j} e^{x_j - \max(x)}} = \frac{e^{x_i} \cdot e^{-\max(x)}}{\sum_{j} e^{x_j} \cdot e^{-\max(x)}} = \frac{e^{x_i}}{\sum_{j} e^{x_j}} = \text{Softmax}(x)_i \]
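A NumPy sketch of the max-shift trick from the derivation above; the logits are chosen so that a naive implementation would overflow.

```python
import numpy as np

def softmax(x):
    # shift by the row maximum for numerical stability (does not change the result)
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])   # naive exp() would overflow here
print(softmax(logits))                          # approximately [[0.090, 0.245, 0.665]]
```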
Loss Functions
Which property of loss functions allows us to perform backpropagation without computing Jacobians? Answer: Scalar output.
Assume that you’re using the MSE loss function to compare two batches of images, of shape \((4 \times 3 \times 8 \times 8)\) what would be the value of $ N $, that we should divide the loss value by? Answer: \(N = 4 \times 3 \times 8 \times 8 = 768\)
In depth estimation, where we predict the depth of each pixel in an image, it was found that the loss for the majority of pixels is very small, while for some few pixels it is very large. What would be a better loss function to use in this case out of [MAE, MSE, BCE, CE], and why? Answer: MAE (Mean Absolute Error) would be better as it is less sensitive to outliers compared to MSE.
What could cause numerical instabilities in the BCE and CE functions? How could we solve that? Answer: Numerical instabilities can be caused by very large or very small logits. This can be solved by using techniques like clipping logits or adding a small constant to the logits.
Why is the MAE loss function still used, although it is not differentiable at $ x = 0 $? Answer: MAE is still used because it is robust to outliers and provides a meaningful measure of average error.
Assume that in the first iteration of training, some of the logits are values above 1000, while the ground-truth values are in the range of [0, 1]. If we’re using the MSE loss function, what optimization problem should we expect to observe? Answer: The gradients will be extremely large, causing the weights to update excessively and potentially destabilize the training process.
Let’s use the CE loss function for a task of classification of 100 classes. What is the expected loss value after the first iteration, and why? Answer: The expected loss value is approximately \(-\log (1/100) = \log(100) \approx 4.605\) because the initial predictions are likely to be uniformly distributed over the classes.
BCE: how many neurons are at the output layer? Answer: 1 neuron for binary classification.
BCE: why do we multiply the result by $ -1 $? Answer: To convert the maximization problem (log-likelihood) into a minimization problem, as loss functions are typically minimized with gradient descent, and to make the loss positive, since the argument of the log is between 0 and 1.
Why don’t we multiply by $ -1 $ in MSE or MAE? Answer: Because MSE and MAE are inherently designed to measure error and are minimized directly, unlike the log-likelihood in BCE.
You are given a neural network for a classification task with 4 classes and the CE as a loss function. The batch size is 1000. After the very first iteration of training, what is the expected loss value? Answer: The expected loss value is approximately \(\log(4) \approx 1.386\), assuming the initial predictions are uniformly distributed over the classes.
3. Convolutions
We can extract local features.
Linear operations with shared weights.
Translation-equivariant.
- Can find the same object in different parts of the image.
- Not rotation-equivariant.
- Not scale-equivariant, if we change the resolution, we need to change the weights.
Global look over the image is possible in the deeper layers where the receptive field is larger.
Very efficient parameter-wise.
Kernel size is usually an odd number; the height and width of the kernel can differ.
Stride is the step size of the kernel.
The number of pixels added to the edges of the image is called padding.
\[ \text{output size} = \frac{W - F + 2P}{S} + 1 \]
Popular options
- k=1, s=1, p=0 pointwise convolution. Processing each pixel independently keeping the dimensions. Used to reduce the number of channels.
- k=3, s=1, p=1 standard convolution.
- k=3, s=2, p=1 downsampling. Halving the dimensions.
- k=7, s=4, p=3 aggressive downsampling. Spatial size is decreased by a factor of 4.
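A small helper (hypothetical, for illustration) that applies the output-size formula to the popular options above on a 32-pixel input.

```python
def conv_out(w, k, s, p):
    # output size = (W - F + 2P) / S + 1
    return (w - k + 2 * p) // s + 1

for k, s, p in [(1, 1, 0), (3, 1, 1), (3, 2, 1), (7, 4, 3)]:
    print(f"k={k}, s={s}, p={p}: 32 -> {conv_out(32, k, s, p)}")
# k=1, s=1, p=0: 32 -> 32   (pointwise, keeps the size)
# k=3, s=1, p=1: 32 -> 32   (standard, keeps the size)
# k=3, s=2, p=1: 32 -> 16   (halves the size)
# k=7, s=4, p=3: 32 -> 8    (quarters the size)
```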
3.0.1. Max Pooling
- Works channel-wise independently.
- In a kernel size it takes the maximum value.
- We have to keep track of the indices to backpropagate.
- Usually k=2, s=2, p=0.
- Only a quarter of the inputs receive a gradient (with k=2, s=2), one reason it is rarely used anymore.
- If there is a tie both values get the gradient.
3.0.2. Average Pooling
- Averages the values in the kernel.
- !! Works channel-wise independently. !! Different from a convolution, which mixes the channels together.
- Usually k=2, s=2, p=0.
3.0.3. Special Convolutions
- Depthwise convolution. Special case of convolution where each channel is processed independently with a different kernel.
- Global Max Pooling. Takes the maximum value of the whole feature map.
- Upsample
- Nearest neighbor. Just repeats the pixels.
- Bilinear. Takes the average of the 4 nearest pixels.
- Bi-cubic. Takes the average of the 16 nearest pixels.
- Doesn't have learnable parameters.
- Transposed convolution. Upsampling with learnable parameters.
- Also called fractionally strided convolution. Adds zeros between the pixels and to the edges.
- Not the same as deconvolution or inverse convolution.
- Dilated convolution. Increases the receptive field without increasing the number of parameters.
- Also called atrous convolution.
- The kernel is applied to every n-th pixel.
- The receptive field is increased by a factor of n.
- Used in the encoder part of the U-Net.
3.0.4. Receptive Field
Receptive field is the area of the input image that affects the output of a neuron. For layer \(l\):
\[ r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i \]
It's a tuple (height, width).
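A sketch of the receptive-field recursion above for a stack of layers given as (kernel, stride) pairs; the function name and the example stacks are illustrative.

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, first layer first
    r, jump = 1, 1                   # receptive field and product of previous strides
    for k, s in layers:
        r = r + (k - 1) * jump       # r_l = r_{l-1} + (k_l - 1) * prod(s_1..s_{l-1})
        jump *= s
    return r

# three 3x3 convs with stride 1 -> receptive field 7
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7
# conv k=3, s=2 followed by k=3, s=1 -> 1 + 2 + 2*2 = 7
print(receptive_field([(3, 2), (3, 1)]))           # 7
```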
3.0.5. Handcrafted Kernels
Don’t need to remember the values, just the concept.
!! Each kernel also has a bias term. !!
3.1. Questions
How can we represent a FC layer with 5 output neurons with a convolutional layer, over an image in a batch of size 4 × 3 × 8 × 8? And with 1x1 Convolution?
- Convolution with 5 filters and kernel size 8x8. The weight dimensions would be (5, 3, 8, 8). The output shape would be (4, 5, 1, 1).
- With a 1x1 convolution we reshape the input to (4, 192, 1, 1) and use a 1x1 convolution with 5 output channels; the weight shape is (5, 192, 1, 1) and the output shape is (4, 5, 1, 1).
Given a 1 × 1 convolutional layer with input tensor of 10 channels, that outputs a tensor with 5 channels.
- Write the shape of the weight matrix.
- The shape of the weight matrix would be (5, 10, 1, 1).
- State the number of parameters in the layer.
- The number of parameters would be 55 (5x10 + 5).
Give two reasons to use a convolution or a pooling layer that reduce the spatial size of the input tensor.
- Reduces the number of parameters and computational cost. Allowing the network to be deeper.
- Compression of the feature map, focusing on the most important features.
Does reducing the spatial size throughout the network reduce the number of parameters in the down-the-stream convolutional layers?
- No, the number of parameters depends on the kernel size and the number of channels.
Does reducing the spatial size throughout the network reduce the number of parameters in the down-the-stream FC layers?
- Yes, here the tensor is flattened and the number of parameters is reduced.
State two differences between a convolutional layer and a pooling layer.
- Convolutional layers have learnable parameters.
- Pooling layers work channel-wise independently.
Assume a fully convolutional model for some task:
- Can we feed the model with images that are double the resolution of the original training set? Yes, we can but the prediction will be bad since the model was trained on a different resolution.
- Can we expect in such case the same performance?
- No, the model was trained on a different resolution.
Can we apply a pooling layer without reducing the spatial size of the input tensor?
- Yes, e.g., by using a pooling layer with kernel size 3, stride 1, and padding 1 (or kernel size 1, stride 1, padding 0), which preserves the spatial size.
Assume that when using maxpool with (k = 2, s = 2, p = 0), where only a single entry in a given window holds the maximum value in that window. How many of the total pixels in the tensor would get a live gradient? What is the value of said gradient?
- Only 1/4 of the pixels would get a live gradient; the local gradient of the max operation is 1, so the upstream gradient is passed through unchanged.
State one advantage and one disadvantage of a 1 × 1 convolution over a 3 × 3 convolution.
- 1x1 convolutions don't capture local spatial features; they are pointwise convolutions, used to reduce the number of channels.
- 3x3 convolutions have a larger receptive field but more parameters. Can change the spatial size.
Why don’t we use hand-crafted kernels (e.g. Sobel filter for edge detection) within our deep learning models?
- They are not learnable, they are fixed. They are not able to learn the features from the data.
In what technique, however, can we use hand-crafted kernels such as Gaussian blur?
- Data augmentation.
- Assume that we want to use a transformer to process an intermediate feature map (an output tensor of a convolutional layer), to learn meaningful relations between the pixels. How can we change the tensor to do that?
- We can reshape the tensor from (B, C, H, W) to (B, HxW, C) and feed it to the transformer. Now each pixel is a C-dimensional vector, used as a token in the transformer.
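A PyTorch sketch of this reshaping; the use of nn.MultiheadAttention here is just an illustrative stand-in for a full transformer block, and the shapes are toy values.

```python
import torch

B, C, H, W = 4, 32, 8, 8
feat = torch.randn(B, C, H, W)             # output of a convolutional layer

tokens = feat.flatten(2).transpose(1, 2)   # (B, C, H*W) -> (B, H*W, C)
print(tokens.shape)                        # torch.Size([4, 64, 32])

# each of the H*W tokens is a C-dimensional vector and can attend to all others
attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)      # self-attention over the pixel tokens
print(out.shape)                           # torch.Size([4, 64, 32])
```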
4. Optimization
4.1. Gradient Descent
- Simple idea: move in the direction of the negative gradient. Until we reach the minimum.
- Saddle point: the gradient is zero but it is not a minimum nor a maximum.
- We take the gradient of all the training data at once at each epoch.
4.2. Stochastic Gradient Descent
- Instead of doing the update at the end of the epoch, we do it after each batch/iteration.
- Less accurate but faster.
- The noise in batch gradient has a regularizing effect. GD tends to underfit. We can escape local minima.
- Almost always better than GD. (Adam, in turn, is almost always better than plain SGD.)
- Optimizer step in every iteration rather than every epoch.
- Theoretical definition with batch size of 1 but in practice we use a mini-batch.
4.3. SGD with Momentum
- We add a fraction of the previous update to the current update.
- \(\gamma = 0.9\) is a common value.
- \(\eta\) is the learning rate. \[ m_{t+1} = \gamma m_t + \eta \nabla L \] \[ \theta_{t+1} = \theta_t - m_{t+1} \]
- First order optimization method.
- When the slope is steep the velocity will increase.
4.4. Nesterov Momentum
- It calculates the gradient not at the current point but at the point where the momentum would take us (look-ahead).
- Algorithm:
- Compute the new point just ahead of the current point, given the current momentum.
- Compute the gradient at that new point.
- Update the momentum and the current point.
4.5. RMSprop
\[ v_{t+1} = \gamma v_t + (1 - \gamma) (\nabla L)^2 \]
\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_{t+1} + \epsilon}} \nabla L \]
- Uses the second moment of the loss gradient (not a second-order method in the sense of second derivatives).
- Idea: divide the learning rate by the square root of the sum of the squared gradients (exponential moving average).
- Dampens the oscillations; adapts the learning rate to the gradient.
- Squared gradients approximate the variance.
- Biased towards 0 in the beginning.
4.6. Adam
Combines RMSprop and momentum.
\[ m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla L \]
\[ v_{t+1} = \beta_2 v_t + (1 - \beta_2) (\nabla L)^2 \]
\[ \hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}} \]
\[ \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}} \]
\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_{t+1} + \epsilon}} \hat{m}_{t+1} \]
The momentum would be very small in the beginning, which is why we normalize it; after a few iterations \(\hat{m} \approx m\) and \(\hat{v} \approx v\). This is called bias correction.
m for mean or first moment, v for variance or second moment.
Adam and RMSprop are called adaptive optimizers since they adapt the learning rate to the gradient.
- Each parameter has its own learning rate depending on the velocity.
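A NumPy sketch of a single Adam parameter update with bias correction, following the formulas above; the toy quadratic loss and the hyperparameter values are illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # first moment (momentum) and second moment (RMSprop-style) estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # bias correction: m and v start at 0 and are biased towards 0 early on
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v

# toy quadratic loss L = 0.5 * ||theta||^2, so grad = theta
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, theta.copy(), m, v, t, lr=0.05)
print(theta)   # approaches [0, 0] (small oscillations around the minimum are expected)
```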
4.7. Newton's method and its variants
Second order optimization method.
Used to find the roots of a function. Our function is the gradient of the loss.
\[ x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)} \]
Computes the inverse Hessian, which is expensive, especially if done over the whole dataset (which would be needed for its nice convergence properties). In matrix form:
\[ \theta_{t+1} = \theta_t - H^{-1} \nabla L \]
Variants that only approximate the Hessian:
- L-BFGS (Limited-memory BFGS)
Still needs the whole dataset in the RAM.
4.8. Optimization problems and solutions
4.8.1. Overfitting and Underfitting
- We are optimizing the training loss; if we fit it too well we will overfit.
- There are multiple reasons for this:
- Model too complex, too many parameters. Memorizing the training data.
- Too little data, not enough to generalize.
- Not stopping the training at the right time, so the model learns the noise in the data.
- Note: Early stopping is not a regularization method. It doesn’t make the training harder.
- We can detect it with the validation loss or a validation metric.
- Generalization gap: difference between the training and validation loss becomes bigger and bigger.
- Solutions, regularization:
- Weight decay. Add the sum of the weights to the loss.
- Dropout. Randomly set some neurons to 0.
- Data augmentation. Add noise to the data.
- Hyperparameter tuning. Find the right model complexity.
- Underfitting: the model is too simple.
- Increase the model complexity.
- Increase the number of epochs.
- Learning rate decay. Better optimization method.
4.8.2. Vanishing and Exploding Gradients
- Exploding gradients: the gradient is too big, diverges.
- Recurrent cells: the gradients are multiplied by the same matrix at each time step; if its eigenvalues are bigger than 1 they will explode.
- No normalization: If activations become very large the gradients will also become very large.
- Bad initialization: If the weights are too big the gradients will also be too big, use Xavier or Kaiming initialization.
- Solutions: Gradient clipping
- Normalize activations: Batch normalization, layer normalization.
4.8.3. Learning rate scheduling - decay
- Theoretical conditions for strictly convex functions: the sum of the learning rates should be infinite, while the sum of their squares should be finite, e.g., \(\eta_t = 1/t\).
4.8.4. Regularization
- Technique to make training harder to prevent overfitting.
L1 and L2 regularization + weight decay
- It has to split the load between the weights. Not make any weight too big.
- We introduce a second objective to the loss function.
- L1 vs L2: L1 makes the weights sparse, L2 makes the weights small and spread out.
- With weight decay we don't add the regularization term to the loss function but apply it directly in the gradient update. It is exactly equivalent to L2 regularization for SGD, but differs for Adam, where its magnitude is scaled by the adaptive learning rate.
Dropout
- Ensemble = multiple models trained on the same data. Dropout is a cheap way to approximate this.
- With a probability p we drop a neuron (not the weights).
- Neuron has multiple weights.
- The model learns not to rely on a single neuron.
- During inference we don’t drop neurons. We scale the output by 1-p.
- Being applied last in the network. Linear, normalization, activation, dropout.
- Inverted dropout: we scale the kept activations up by \(1/(1-p)\) during training, so we don't have to scale the output during inference.
- Dropout for convolutions: we drop the whole channel.
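A NumPy sketch of inverted dropout, with p the drop probability as defined above; the function name and example values are illustrative.

```python
import numpy as np

def inverted_dropout(x, p=0.5, training=True):
    # p = probability of dropping a neuron
    if not training or p == 0.0:
        return x                          # inference: no dropping, no scaling needed
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)           # scale up kept activations during training

x = np.ones((2, 4))
print(inverted_dropout(x, p=0.5, training=True))    # zeros and 2.0s, expectation stays 1
print(inverted_dropout(x, p=0.5, training=False))   # unchanged at inference
```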
Data Augmentation
- Add noise to the data or virtually increase the dataset size by applying transformations.
- Usually image data.
- Applied only to the training set in the dataloader.
- Rotation, flipping, cropping, translation, scaling, color jittering, cutout.
- We need to transform the labels as well (e.g., flipping an image for segmentation also flips its mask).
Batch Normalization
Makes samples in the batch interact, unlike all other layers.
Keep magnitudes of the activations in the same range.
Normalize a group of neurons in the same layer that are created by the same weight.
So in a FC layer, neuron corresponds to a feature and we normalize the features by the batch.
\[ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \]
\[ y = \gamma \hat{x} + \beta \]
During training we calculate the mean and variance of the batch. And keep the running average of the mean and variance.
During inference we use the running average to normalize the data. And learn the gamma and beta to scale back the data.
The main difference is that batch normalization in a convolutional layer normalizes the activations across the mini-batch and spatial dimensions for each channel independently, while in a fully connected layer it normalizes each feature across the mini-batch.
- We flatten the spatial dimensions.
\[ \mu_{\text{running}} = \alpha \mu_{\text{running}} + (1 - \alpha) \mu_{\text{batch}} \]
- If the batch size is too small (<16) the normalization will be noisy.
- It can be seen as a regularizer. It adds noise to the training. It also makes the optimization easier.
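A NumPy sketch of batch normalization for a fully connected layer, keeping running statistics during training and reusing them at inference; the momentum value and shapes are illustrative.

```python
import numpy as np

def batchnorm_fc(x, gamma, beta, running_mean, running_var,
                 training=True, momentum=0.9, eps=1e-5):
    # x: (N, D); gamma, beta, running stats: (D,)
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # exponential moving average of the batch statistics
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var   # use the stored statistics at inference
    x_norm = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_norm + beta                 # learnable scale and shift
    return y, running_mean, running_var

x = np.random.randn(8, 16) * 3 + 5
gamma, beta = np.ones(16), np.zeros(16)
rm, rv = np.zeros(16), np.ones(16)
y, rm, rv = batchnorm_fc(x, gamma, beta, rm, rv, training=True)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 mean, ~1 std per feature
```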
4.8.5. Weight Initialization
Xavier Initialization
- Also known as Glorot initialization. It aims to keep the scale of the gradients the same in all layers. Useful for tanh and sigmoid activation functions.
- The variance of the input and output should be the same: Gaussian distribution with mean 0 and variance 1/n, where n is the number of input features.
Kaiming Initialization
- Also known as He initialization. It is used for ReLU and its variants.
- ReLU zeroes out roughly half of the activations, so we need to double the variance.
- Variance of the input should be 2/n.
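A NumPy sketch of the two schemes (fan-in variants); the layer sizes are illustrative.

```python
import numpy as np

def xavier_init(n_in, n_out):
    # variance 1/n_in, suited for tanh / sigmoid
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

def kaiming_init(n_in, n_out):
    # variance 2/n_in: doubled because ReLU zeroes out half the activations
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

W = kaiming_init(512, 256)
print(W.var())   # approximately 2/512 ~= 0.0039
```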
Some problems
4.9. Transfer Learning
- We can reuse the feature extractor of a model when the task or data distribution is similar.
- Low level features are similar in all images.
- Pretrained models: Trained on large datasets like ImageNet. Can be transferred to other tasks.
- Adaptation to new tasks: replacing the classifier and freezing the feature extractor, or setting a very low learning rate for the feature extractor.
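A PyTorch-style sketch of this adaptation using a torchvision ResNet as the pretrained backbone; the exact weights argument depends on the torchvision version, and the 10-class head is an illustrative choice.

```python
import torch
import torch.nn as nn
from torchvision import models

# load a backbone pretrained on ImageNet
model = models.resnet18(weights="IMAGENET1K_V1")

# freeze the feature extractor
for p in model.parameters():
    p.requires_grad = False

# replace the classifier head for a new task with 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)   # new head, trainable by default

# optimize only the parameters that still require gradients (the new head)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```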
4.10. Questions
Optimizers:
- What is the main difference between gradient descent (GD) and stochastic gradient descent (SGD)?
- GD performs the optimization step after the entire dataset is processed, while SGD performs the optimization step after each batch.
- Name two advantages of SGD over GD.
- Faster convergence, less memory consumption.
- Can escape local minima by introducing noise; acts as a regularizer.
- Why can we call RMS prop an adaptive optimizer?
- The learning rate is divided by the velocity, each parameter has its own learning rate.
- What is the bias correction in Adam? Why isn’t it implemented in RMS prop?
- Since the velocity and momentum are initialized to 0, they are biased towards 0 in the beginning. In RMSprop it is not done simply because it had not been introduced yet.
- Adam with bias correction and Adam without bias correction will NOT converge to the same minimum. True or False?
- In practice they typically behave similarly (just slower early on without correction); in theory, convergence to the same minimum is not guaranteed.
- What two optimizers does Adam combine?
- RMS prop and momentum.
- Why is SGD+momentum usually better than SGD?
- It uses information from previous gradients to smooth the optimization path, while speeding up convergence in directions with consistent gradients. Also can escape saddle points.
- Why do we always step in the direction of the negative gradient?
- Since the direction of the positive gradient is the direction of the maximum increase of the function.
- Write down a modified version of the SGD optimizer step, in case we want to maximize the loss instead of minimizing it.
- \(\theta_{t+1} = \theta_t + \eta \nabla L\)
Overfitting and underfitting:
- Define overfitting in a short sentence.
- The model is too complex and is learning the noise and memorizing the data. Performs well on the training data but poorly on the validation data.
- Define underfitting in a short sentence.
- Model too simple or undertrained to capture the training data. Performs poorly on both the training and validation data.
- State two different behaviors that indicate underfitting.
- Validation loss is still decreasing at the end of training.
- The accuracy is low on both the training and validation data.
- State a possible reason for a situation where the validation loss is lower than the training loss.
- Data leakage, the validation data is not representative of the test data. Bug in the code.
- In a single word, what is the go-to solution to overfitting?
- Regularization.
Regularization:
- The term “regularization term” is ambiguous. What are the two usages of such “terms”?
- A term that makes training harder, hence forcing the model to generalize better.
- A term added to the loss function that represents an additional objective.
- What is the effect of L1 and L2 regularization terms on the weights?
- L1 makes the weights sparse, L2 makes the weights small and spread out.
- What are the two differences between L1 and L2 regularization terms and “weight decay”?
- The L1 and L2 regularization terms are a part of the loss function.
- Weight decay is a part of the optimizer, in the case of Adam it varies with the learning rate.
- How can we avoid the computational overhead of the dropout layer during inference?
- We can divide the output by 1-p during training. This is called inverse dropout.
- Explain how regular dropout works during training and during inference. Hint: crucial to distinguish between the definitions of p.
- p = probability of dropping a neuron
- Training: We drop a neuron with probability p.
- Inference: We don't drop any neurons but multiply the output by 1-p.
- Why is data augmentation considered a regularization technique?
- Since it makes the training harder, the model learns to generalize better.
- Why don’t I allow you to consider Early-stopping as a regularization technique?
- Since it doesn’t make the training harder, it just keeps the model from overfitting. Saving checkpoints is a better approach.
Batch Normalization: 1. Given a loss value L, and a fully connected layer with output shape 8 × 16, followed by a batch normalization layer: \(y = \text{BN}(x) = \gamma x_{\text{norm}} + \beta\)
- What are the dimensions of the parameters \(\gamma\) and \(\beta\)? - The same as the number of features, 16.
- Show a derivation of the gradients of those parameters and show how to use NumPy to calculate it.
- \[ y = \text{BN}(x) = x_{\text{norm}} \odot \gamma + \beta \quad \text{(broadcast over the batch)} \] \[ \frac{\partial L}{\partial \gamma} = 1_N^\top \left( \frac{\partial L}{\partial y} \odot x_{\text{norm}} \right) \]
\[ \frac{\partial L}{\partial \beta} = 1_N^\top \frac{\partial L}{\partial y} \]
dgamma = np.sum(dout * x_norm, axis=0)
dbeta = np.sum(dout, axis=0)
- Why is batch normalization sometimes referred to as a regularization technique?
- Since it adds noise to the training, making it harder. It also makes the optimization easier.
- Explain why we need to save the running averages of the mean and variance in the batch norm to the memory during training.
- To use them during inference to normalize the data.
- What optimization problems could batch norm help us solve?
- It normalizes the activations, preventing vanishing and exploding gradients.
Hyperparameter tuning:
- What is the main bottleneck for grid search?
- We are searching in a high-dimensional space; the number of combinations grows exponentially with the number of hyperparameters.
Weight initialization:
- What weight initialization scheme fits the ReLU activation function? How does it affect the output of the activation function?
- Kaiming/He: it keeps the variance of the input and output the same, doubling the variance to account for ReLU zeroing half of the activations. \(w \sim \mathcal{N}(0, 2/n_{\text{in}})\)
- What do we expect to observe if we initialize the weights of the model to the same value?
- All neurons in a layer will receive the same gradients and learn the same thing; the model is symmetric, so groups of neurons learn identical features and the effective capacity is greatly reduced.
Transfer Learning:
- Explain one of the scenarios in the exercises where we used transfer learning, and how.
- We trained an autoencoder on a large dataset on a reconstruction task and then used the encoder as a feature extractor for a classification task.
- Trained a segmentation model using a pretrained model for feature extraction.
5. Popular Models and Architectures
5.1. CNNS
All models were designed for the ImageNet challenge (1000 classes).
- Top-1 accuracy: the model predicts the correct class.
- Top-5 accuracy: the correct class is in the top 5 predictions.
- Top-5 error: 1 - top-5 accuracy.
5.1.1. LeNet
- Original CNN by Yann LeCun.
- For handwritten digit recognition.
- Average pooling, kernel size of 5.
- 60k parameters.
5.1.2. AlexNet
- 60 million parameters.
- 8 layers.
- ReLU activation instead of tanh.
- Max pooling.
- Most of the parameters are in the last fully connected layer.
- Couldn’t go deeper because of the vanishing gradient problem.
5.1.3. VGG
- CONV = kernel size 3x3, stride 1, padding 1.
- Max pooling = kernel size 2x2, stride 2.
- 138 million parameters.
- 16-19 layers.
5.1.4. ResNet
Residual Blocks
- Skip connections (residual connections), highway for gradients.
- Solves the vanishing gradient problem.
- Can learn the identity function, so performance should be at least as good as without the skip connection.
- If we just add the input to the output we need to make sure the dimensions are exactly the same.
- If we use concatenation we only need to make sure that batch and spatial dimensions are the same. Then we can use a 1x1 convolution to make the number of channels smaller.
- U-Net uses this.
- Residual block = Conv -> Relu -> Conv + Skip connection -> Relu.
- The gradient stops flowing when we reduce the spatial dimensions.
Architecture
- 152 layers, 60 million parameters.
5.1.5. GoogLeNet (Inception Layer)
- Each block has 4 different convolutions or pooling operations that are concatenated.
- Very expensive so we use 1x1 convolutions to reduce the number of channels.
5.2. Autoencoders
- Original autoencoder: fully connected layers.
- With convolutional layers: Convolutional Autoencoder or U-Net.
- Consists of 3 main parts: encoder, bottleneck, decoder.
- Encoder: reduces the input to a lower-dimensional representation. Collects the most important features (feature extraction).
- Bottleneck or latent space: the lowest-dimensional representation. The latents can be used for other tasks and can model the distribution of the data. Represents the high-level features.
- Calculations in the latent space are faster.
- Decoder: reconstructs the input from the latent space. The output should be as close to the input as possible.
- Without non-linearities it is very similar to PCA.
- We train in an unsupervised way. We don’t need labels.
- Then we remove the decoder and use the encoder as a feature extractor for a supervised task and fine-tune it.
5.3. Fully Convolutional Networks
- No fully connected layers. Can work with any input size (but convolutions aren’t scale-invariant).
5.4. U-Net
- Fully convolutional autoencoder with skip connections.
- Introduced for biomedical image segmentation.
- As we decrease the spatial dimensions we increase the number of channels.
- To match the spatial dimensions for the skip connection we use transposed convolutions.
- Compared to ResNet, the skip connections are concatenated rather than added; the gradient can still flow through the whole network.
- The latent space here is a tensor of HxWxC in the middle of the network.
5.5. Variational Autoencoders (VAE)
- Probabilistic autoencoder. Learn the distribution of the data.
- The bottleneck represents the mean and variance of the distribution.
- We can sample from the distribution to generate new data.
- The loss function is the reconstruction loss (MAE / MSE) + KL divergence.
- Reparametrization trick: we sample \(\epsilon\) from a standard normal distribution and compute \(z = \mu + \sigma \epsilon\). This way we can backpropagate through \(\mu\) and \(\sigma\), since the learned parameters are not inside the sampling operation; the sampling node itself is a dead end in the computation graph.
- The covariance matrix is diagonal.
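A PyTorch sketch of the reparametrization trick and the diagonal-Gaussian KL term; the shapes and the stand-in reconstruction term are illustrative.

```python
import torch

def reparameterize(mu, logvar):
    # sample eps from a standard normal; the sampling node itself needs no gradient
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps              # gradients flow through mu and logvar

def kl_divergence(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, summed over latent dims
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
loss = kl_divergence(mu, logvar) + z.pow(2).mean()   # stand-in for a reconstruction loss
loss.backward()
print(mu.grad.shape)   # gradients reach the encoder outputs
```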
5.6. Generative Adversarial Networks (GANs)
- Two networks: generator and discriminator.
- Generator: generates fake data.
- Discriminator: distinguishes between real and fake data.
- Minimax game: generator tries to fool the discriminator.
- Very flexible architecture, can generate images, music, text, etc.
- The loss doesn't converge; it should reach an equilibrium, since we are pulling the generator in one direction and the discriminator in the other.
- Very hard to train, mode collapse, vanishing gradients.
\[ L_D = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(D(x_i)) + (1 - y_i) \log(1 - D(G(z_i))) \right] \]
\[ L_G = - \frac{1}{N} \sum_{i=1}^N \log(D(G(z_i))) \]
Questions
Architectures:
- LeNet uses average pooling to reduce the spatial size. Give one advantage and one disadvantage of using average pooling over max pooling.
- Average pooling: the gradient propagates through all the pixels in the window.
- Max pooling is a non-linear operation that keeps only the most salient feature in each window.
- In LeNet, what is the receptive field of a neuron in the first FC layer?
- The receptive field is the whole image.
- AlexNet uses an 11 × 11 convolutional filter in the first layer. Name two disadvantages of using such a large filter.
- Lot of parameters, computationally expensive.
- The receptive field is too big, we lose the local features.
- AlexNet uses ReLU instead of sigmoid or Tanh, as used in LeNet. Explain why it allows AlexNet to be deeper than LeNet, when coupled with the Kaiming initialization.
- It mitigates the vanishing gradient problem: the gradients are not squashed towards 0 by saturation.
- VGGNet: What is the purpose of the convolutional part of the model? Why do we need the FC layers at the end?
- Extract the features, the FC layers are the classifier on top of high-level features.
- InceptionNet:
- What was the problem with the first version of InceptionNet? How was it solved?
- Very expensive, because convolutions with large kernels were applied to inputs with many channels. 1x1 convolutions were used to reduce the number of channels first.
- We learned in class that MaxPool is usually used to reduce the spatial dimensions. Therefore, how was it possible to use it inside the Inception block, and concatenate its output to all other outputs?
- It used stride 1 (with padding), so the spatial dimensions stay the same.
Skip connections:
- Why can we say that skip connections introduce a “highway of gradients”?
- Allows skipping whole blocks of layers, by adding the input to the output. Allows the gradient to bypass the block when backpropagating.
- Given a residual block $ X_{l+1} = X_l + F(X_l) $, show the highway of gradients in the chain-rule formula for $ \frac{\partial L}{\partial X_l} $, given some loss value $ L $.
- \[ \frac{\partial L}{\partial X_l} = \frac{\partial L}{ \partial X_{l+1}} \frac{\partial X_{l+1}}{ \partial X_{l}} = \frac{\partial L}{ \partial X_{l+1}} (1 + \frac{\partial F(X_l)}{ \partial X_l}) \]
- Can you give a Python-like implementation of the residual block?
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU

def residual_block(x):
    z = Conv2D(64, 3, padding='same')(x)
    z = BatchNormalization()(z)
    z = ReLU()(z)
    z = Conv2D(64, 3, padding='same')(z)
    z = BatchNormalization()(z)
    z = z + x  # skip connection (x must also have 64 channels)
    return ReLU()(z)
AutoEncoders:
- Assume we use an autoencoder to reconstruct an image. What could be used as a loss function? What do we compare between? What kind of learning is it (supervised or unsupervised)?
- Reconstruction loss, MSE, MAE. We compare the output to the input. Unsupervised learning.
- What is the effect of a latent space that is too small? What is the effect of a latent space that is too big?
- Too small: underfitting, we cannot capture meaningful features.
- Too big: overfitting, the network can simply copy the input.
- What linear approach does this kind of autoencoder resemble? What is the advantage of an autoencoder over this method?
- PCA. Autoencoders can learn non-linear features.
- State a scenario in which we would like to use an autoencoder for feature extraction.
- Pretraining a model on a large dataset and then using the encoder as a feature extractor for a supervised task.
U-net:
- Give 3 advantages of U-net over the vanilla Autoencoder
- Uses skip connections for better gradient flow.
- Is fully convolutional, can work with any input size.
- Reduces the spatial size and increases the number of channels, therefore extracting more features.
- How do we mitigate the drop in spatial size, so not too much information is lost?
- We increase the number of channels.
- For the task of image reconstruction, does it make sense to use skip-connections between the encoder and the decoder? Explain.
- No, since the model would just copy the input.
Generative Networks:
- What is the difference between GANs and VAEs in the way they learn the distribution of the training set?
- VAEs learn the distribution explicitly (an explicit latent density we can sample from and evaluate). GANs learn the distribution implicitly: the generator produces samples, but there is no explicit density.
- In the Vanilla GAN, why is it prone to underfitting and mode collapse?
- If the discriminator is too good, the generator will not learn anything. If the generator learns to produce one image that fools the discriminator, it will just keep generating that same image (mode collapse).
- In GAN, how do we train the generator to fool the discriminator?
- We maximize the probability of the discriminator being wrong.
- VAE: What are the two loss functions we use, and what is their purpose?
- Reconstruction loss, to make the output as close to the input as possible. KL divergence, to make the distribution of the latent space close to the assumed prior (a standard normal distribution).
- What do we assume the training set distribution to be in both VAEs and GANs?
- Standard normal distribution.
- Sampling from a random distribution, that is not the normal distribution, is very hard. How do VAEs solve this problem?
- By using the reparametrization trick. We sample from a normal distribution and then multiply by the standard deviation and add the mean.
6. Recurrent Neural Networks and Transformers
6.1 RNNs
Used for sequential data where the independence assumption doesn’t hold.
They produce an output and also a hidden state that is passed to the next time step.
There are some variations:
- Many-to-one: sentiment analysis. Sentences to a single output.
- Many-to-many: translation. Sentences to sentences (shifted). Video segmentation.
- One-to-many: image captioning. Image to a sentence.
- Multi-layer RNNs: stack RNNs on top of each other. \[ h_t = \sigma (W_{hh} h_{t-1} + W_{xh} x_t) \] \[ y_t = \sigma (W_{hy} h_t) \]
The hidden state has to compress the whole sequence into a fixed-size vector.
If we unroll the sequence we get a "polynomial" of the weight matrices; the influence of the first input decreases with each time step (forgetting).
Exploding gradients: we can clip the gradients. Vanishing gradients: hard to solve, LSTMs and GRUs.
- Gates instead of weight matrices that are multiplied over and over again.
6.1.1 LSTM
- Highways for gradients to flow through in the cell state.
- Still struggles with long-term dependencies.
- Hard to train, we still feed the model autoregressively so we have to wait for the output to feed it back in.
- Transformers can do it in parallel.
6.2 Transformers
- Attention mechanism: every token can attend to every other token (a minimal sketch of scaled dot-product attention follows after this list).
- Self-attention: queries, keys, and values all come from the same sequence.
- Multi-head attention: several attention heads run in parallel and their outputs are concatenated.
- Positional encoding: adds the position of the token to its embedding, since attention itself is permutation-invariant.
\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V \]
- First we have an encoder that processes the input sequence with self-attention.
- The decoder takes the partial output and processes it with masked self-attention; this processed output is then used to attend to the encoder output with cross-attention.
- Then a feed-forward network follows, and finally a classification head predicts the next token.
- Autoregressive, we have to wait for the output to feed it back in.
- Quadratic complexity in the sequence length; we can use sparse attention to reduce it.
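A PyTorch sketch of scaled dot-product attention following the formula above; the function name, the optional mask handling, and the toy shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (B, N, d_k), k: (B, M, d_k), v: (B, M, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5      # (B, N, M)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # each query attends over all keys
    return weights @ v                               # (B, N, d_v): one row per query token

B, N, M, d = 2, 5, 7, 16
q, k, v = torch.randn(B, N, d), torch.randn(B, M, d), torch.randn(B, M, d)
out = scaled_dot_product_attention(q, k, v)          # cross-attention-style shapes
print(out.shape)                                     # torch.Size([2, 5, 16])
```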
6.3 Questions
- Give two drawbacks of using RNNs that are not the exploding or vanishing gradients.
- How did LSTM solve the vanishing gradient problem in RNNs?
- Is it a good idea to use ReLU instead of Sigmoid as the activation of the input gate of LSTM?
- Transformers:
- Can the transformer architecture take embeddings of different sizes?
- Can it take sequences of different sizes as inputs to the encoder and the decoder?
- Cross-attention layer. Given the encoder outputs of shape \(X_e \in \mathbb{R}^{N \times M}\) and the decoder outputs of shape \(X_d \in \mathbb{R}^{K \times M}\), what is the dimension of the output of that layer?
- Is the task of predicting the next token in the sequence a classification or a regression task?
- Why is the self-attention mechanism prone to exploding gradients? How does the original architecture of the transformer solve that?
- Why is it important in transformers to use positional encoding, in comparison to RNNs?