2. Neural Networks

  • Composed of a series of functions.
  • Circles (nodes) represent neurons.
  • Lines (edges) represent connections between neurons.
  • Each neuron holds a single value.
  • Computations for the samples in a batch are usually independent and run in parallel (batch normalization is an exception, since it couples samples within a batch).

2.0.1. Common Notations

  • Index connections as \(W_{lio}\): \(l\) = layer, \(i\) = input neuron, \(o\) = output neuron.
  • Apply activation functions, which are usually element-wise non-linearities; for these, the backward pass reduces to an element-wise multiplication.
  • Softmax is an exception: each output depends on all inputs, so its derivative is a full Jacobian (see the Softmax subsection).

2.0.2. Fully Connected Layer

  • Each output neuron is a weighted sum of all the input neurons, plus a bias.
  • Equation:

\[ Y = XW + b \]

where

\[ X \in \mathbb{R}^{n \times d}, \, W \in \mathbb{R}^{d \times m}, \, b \in \mathbb{R}^{m} \]

  • Fully connected networks (MLP, FCN; a single layer is also called an affine layer) capture global features (e.g., overall color); see the shape sketch below.
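
A NumPy sketch of the fully connected layer above; the shapes \(n = 4\), \(d = 3\), \(m = 2\) are arbitrary choices for illustration:

```python
import numpy as np

n, d, m = 4, 3, 2            # batch size, input features, output features
X = np.random.randn(n, d)    # inputs,  X in R^{n x d}
W = np.random.randn(d, m)    # weights, W in R^{d x m}
b = np.random.randn(m)       # bias,    b in R^{m}, broadcast over the batch

Y = X @ W + b                # fully connected (affine) layer
print(Y.shape)               # (4, 2) = (n, m)
```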

2.0.3. Backpropagation

  • Computes gradients layer by layer using the chain rule.
  • Gradients flow backward through the network, from the loss at the output toward the input.
  • Gradients are only taken of the loss function, which is a scalar; this avoids computing full Jacobians.

2.0.4. Activation Functions

  • Make neural networks universal function approximators.
  • Activation map: output of a linear layer after an activation.
  • Features: the meaning carried by a group of individual numbers taken together.
  • Zero-centered functions output both positive and negative values.

Sigmoid

  • Range: \((0, 1)\).
  • Derivative: \(\text{sigm}(x)\,(1 - \text{sigm}(x))\).
  • Issues: vanishing gradients, not zero-centered.

Hyperbolic Tangent (tanh)

  • Range: \((-1, 1)\).
  • Zero-centered.
  • Derivative: \(1 - \tanh^2(x)\).

ReLU

  • Range: \([0, \infty)\).
  • Simple to compute, not differentiable at 0, not zero-centered.
  • Issues: dying ReLU problem (neurons whose inputs stay negative always output 0, receive zero gradient, and stop learning).

Leaky ReLU

\[ \max(\alpha x, x) \]

where \(\alpha\) is a small positive number.

  • Zero-centered.

Parametric ReLU (PReLU)

  • Like Leaky ReLU, but \(\alpha\) is a learnable parameter.

ELU (Exponential Linear Unit)

  • Maps \(\mathbb{R}\) to \((-\alpha, \infty)\), where \(\alpha\) scales the exponential branch for negative inputs.
  • Zero-centered; mitigates the dying ReLU problem (non-zero gradients for negative inputs).

MaxOut

  • Takes the maximum of two affine layers’ outputs.
  • Parameter-inefficient: it doubles the number of weights per layer.

GELU (Gaussian Error Linear Unit)

  • Smooth approximation of ReLU.
  • The input scaled by the standard Gaussian CDF \(\Phi\): \[ \text{GELU}(x) = x \cdot \Phi(x) \]
  • Tanh approximation: \[ \text{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right) \]
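
A short sketch comparing the exact GELU with the tanh approximation above; using `scipy.special.erf` for \(\Phi\) is just one convenient choice:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small approximation error
```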

Softmax

  • Maps \(\mathbb{R}^k\) to \((0, 1)^k\); the outputs sum to 1.
  • Works row-wise for multi-class classification or attention.
  • Magnifies input differences, making the network more confident in predictions.
  • Derivative: \[ \frac{d\, \text{softmax}_i}{d x_j} = \text{softmax}_i \, (\delta_{ij} - \text{softmax}_j) \]
  • For numerical stability, shift the logits by their maximum value, \(x_i - \max(x)\) (see the sketch below).
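
A minimal sketch of a row-wise, numerically stable softmax using the max-shift trick from the last bullet:

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max; the output is unchanged because the shift
    # cancels between the numerator and the denominator.
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow here
print(softmax(logits))                         # [[0.0900 0.2447 0.6652]]
```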

2.1. Loss Functions

  • Minimizing the training loss is the goal of training.
  • Always a scalar value.

2.1.1. Classification Losses

Binary Cross-Entropy

\[ \text{BCE}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]

  • \(N\) is the number of residuals, which is not always the number of samples (e.g., in per-pixel semantic segmentation it is the number of pixels).
  • Clip the model outputs to avoid \(\log(0)\), which is undefined (see the sketch below).
  • \(\hat{y}\) is the output after the sigmoid activation.
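
A sketch of the BCE above with clipping to avoid \(\log(0)\); the epsilon value is an arbitrary choice for illustration:

```python
import numpy as np

def bce(y, y_hat, eps=1e-7):
    # y_hat is the sigmoid output in (0, 1); clip so log() never sees 0
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y     = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.1, 0.8, 0.3])
print(bce(y, y_hat))   # ~0.198
```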

Categorical Cross-Entropy

\[ \text{CCE}(y, \hat{y}) = -\frac{1}{R} \sum_{i=1}^R \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij}) \]

  • \(y_{ij}\) is the one-hot encoded target vector.
  • \(R\) is the number of residuals.
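
A sketch of the categorical cross-entropy with one-hot targets, again clipping to avoid \(\log(0)\); shapes and values are illustrative:

```python
import numpy as np

def cce(y, y_hat, eps=1e-7):
    # y is one-hot, y_hat holds per-class probabilities (e.g., softmax output)
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.mean(np.sum(y * np.log(y_hat), axis=-1))

y     = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])   # one-hot targets
y_hat = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
print(cce(y, y_hat))   # ~0.434
```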

2.1.2. Regression Losses

Mean Squared Error

\[ \text{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \]

  • Also called L2 loss.
  • Penalizes large errors quadratically.
  • Not robust to outliers.
  • In image prediction tasks, outputs tend to look blurry, because MSE is minimized by the mean of the data.

Mean Absolute Error

\[ \text{MAE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i| \]

  • Also called L1 loss.
  • Manhattan distance.
  • Less sensitive to outliers.
  • Outputs tend to look sharper, because MAE is minimized by the median of the data (see the comparison below).
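
A small sketch contrasting the two losses on data with one outlier; the numbers are arbitrary:

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 2.1, 2.9, 14.0])   # last prediction is a large outlier

mse = np.mean((y - y_hat) ** 2)    # dominated by the outlier: ~25.0
mae = np.mean(np.abs(y - y_hat))   # far less affected:        ~2.58
print(mse, mae)
```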

2.2. Models

2.2.1. Linear Regression

\[ y = Wx + b \]

\[ W = (X^T X)^{-1} X^T y \]

  • Linear model
  • Has a closed-form solution (the normal equation above), but it does not scale to very large datasets; see the sketch below.
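
A sketch of the closed-form (normal equation) solution on a small synthetic problem; the data and true weights are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # 1000 samples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.01 * rng.normal(size=1000)   # targets with small noise

# Normal equation: W = (X^T X)^{-1} X^T y, solved without an explicit inverse
W = np.linalg.solve(X.T @ X, X.T @ y)
print(W)   # close to w_true
```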

2.2.2. Logistic Regression

\[ y = \text{sigmoid}(Wx + b) \]

  • Binary classification.
  • Non-linear output (sigmoid), but a linear decision boundary.
  • Maximum likelihood estimation (MLE) leads back to the BCE loss.
  • We set a threshold to classify the output, usually 0.5.
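
A minimal sketch of logistic regression inference with a 0.5 threshold; the weights here are random placeholders rather than trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))        # 4 samples, 3 features
W = rng.normal(size=(3, 1))        # would be learned by minimizing BCE
b = 0.0

p = sigmoid(X @ W + b)             # probabilities in (0, 1)
y_pred = (p >= 0.5).astype(int)    # threshold at 0.5
print(p.ravel(), y_pred.ravel())
```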

2.3. Questions

  1. Linear Regression
    1. Assume a linear fitting problem with 5 billion points, each with 100 features. Is it feasible to find the optimal solution without a neural network? State yes or no, and explain why. Answer: No. The computational complexity and memory requirements for handling such a massive dataset with traditional linear regression methods (e.g., the closed-form normal equation) would be prohibitive.

    2. How can we transform a linear problem (with only affine layers) into a classification problem? Answer: We can transform a linear problem into a classification problem by adding a softmax layer at the end of the affine layers to produce probability distributions over the classes.

  2. Fully Connected Layers
    1. Given the following layer \[ y = f(X, W, Z, R, T, A) = XW + ZR + T + A \]

\[ X \in \mathbb{R}^{N \times D} , W \in \mathbb{R}^{D \times M} , Z \in \mathbb{R}^{N \times D} , R \in \mathbb{R}^{D \times M} , T \in \mathbb{R}^{N \times M} , A \in \mathbb{R}^{N \times M} \]

and a loss function defined as

\[ L = \sum_{i=1}^N \sum_{j=1}^M y_{ij} \]

Compute \(\frac{\partial L}{\partial Z}\) and \(\frac{\partial L}{\partial A}\).

Answer:

\[ \frac{\partial L}{\partial Z} = \frac{\partial L}{\partial y} R^T \]

\[ \frac{\partial L}{\partial A} = \mathbf{1}_{N \times M} \circ \frac{\partial L}{\partial y} \]

    2. Given a batch of images of shape \((8 \times 3 \times 8 \times 8)\). How can we process them through a fully connected layer? What would be the shape of the weight matrix \(W\) in the case of logistic regression and BCE as a loss function?

Answer: Flatten each image, giving a batch of shape \((8, 192)\) since \(3 \times 8 \times 8 = 192\). The weight matrix \(W\) would have shape \((192, 1)\); see the shape check below.
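
A shape check for the answer above, with zero-filled placeholders standing in for real images and weights:

```python
import numpy as np

batch = np.zeros((8, 3, 8, 8))     # (batch, channels, height, width)
flat = batch.reshape(8, -1)        # (8, 192): 3 * 8 * 8 = 192 features per image
W = np.zeros((192, 1))             # logistic regression weights
b = 0.0

logits = flat @ W + b              # (8, 1), then sigmoid + BCE
print(flat.shape, logits.shape)    # (8, 192) (8, 1)
```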

    3. Given a batch of images of shape \((8 \times 3 \times 8 \times 8)\), a fully connected network (FCN), and a task to detect eyes of people in the images. Name two disadvantages of trying to solve the task with the current setup.

Answer: 1. Loss of spatial information due to flattening. 2. High computational cost and overfitting risk due to the large number of parameters.

    4. Given two affine layers: \[ y_1 = f_1(X, W_1, B_1) = XW_1 + B_1 \]

\[ y_2 = f_2(y_1, W_2, B_2) = y_1W_2 + B_2 \]

Show that it could be described as a single affine layer. Answer: Combine the layers:

\[ y_2 = (XW_1 + B_1)W_2 + B_2 = X(W_1W_2) + (B_1W_2 + B_2) \]

This is equivalent to a single affine transformation:

\[ y_2 = XW' + B' \]

where \(W' = W_1 W_2\) and \(B' = B_1 W_2 + B_2\).
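
A quick numeric check of the composition above; the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, M = 4, 3, 5, 2
X  = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, H)); B1 = rng.normal(size=(H,))
W2 = rng.normal(size=(H, M)); B2 = rng.normal(size=(M,))

y2 = (X @ W1 + B1) @ W2 + B2                    # two stacked affine layers
W_prime = W1 @ W2                               # collapsed weights
B_prime = B1 @ W2 + B2                          # collapsed bias
print(np.allclose(y2, X @ W_prime + B_prime))   # True
```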

  3. Activation Functions

    1. What is the purpose of the activation functions in a neural network? Answer: To introduce non-linearity, allowing the network to learn complex patterns.

    2. Give two advantages of the Tanh function over the Sigmoid function. Answer:
      1. Zero-centered output.
      2. Stronger gradients (steeper derivative), which can mitigate the vanishing gradient problem.
    3. Explain the vanishing gradient problem, and describe one method that could help us solve it. Answer:
      • The vanishing gradient problem occurs when gradients become exponentially small as they propagate back toward the earlier layers, so those layers receive much smaller updates. One method to address this is to use activation functions like ReLU or LeakyReLU.

      • Residual connections: skip connections that allow gradients to flow directly to earlier layers, mitigating the vanishing gradient problem.

    4. We’ve learned that the sigmoid function can cause the “vanishing gradient” problem. Therefore, explain why the sigmoid function is still sometimes used on the logits (the output of the last layer). Answer: The sigmoid function is used on the logits in binary classification tasks to map the output to a probability between 0 and 1.

    5. Assume the LeakyReLU activation function. If we take \(\alpha < 0\), what property of the function would we lose? Answer: The function would no longer be zero-centered.

    6. The Softmax function could suffer from numerical instability, given the logits. Show that subtracting the maximum value of the logits from each one of the values in the vector would not change the output of the function: \[ \text{Softmax}(x - \max(x))_i = \text{Softmax}(x)_i \] Answer:

    \[ \text{Softmax}(x - \max(x))_i = \frac{e^{x_i - \max(x)}}{\sum_{j} e^{x_j - \max(x)}} = \frac{e^{x_i} \cdot e^{-\max(x)}}{\sum_{j} e^{x_j} \cdot e^{-\max(x)}} = \frac{e^{x_i}}{\sum_{j} e^{x_j}} = \text{Softmax}(x)_i \]

  4. Loss Functions

    1. Which property of loss functions allows us to perform backpropagation without computing Jacobians? Answer: Scalar output.

    2. Assume that you’re using the MSE loss function to compare two batches of images of shape \((4 \times 3 \times 8 \times 8)\). What would be the value of \(N\) that we should divide the loss value by? Answer: \(N = 4 \times 3 \times 8 \times 8 = 768\)

    3. In depth estimation, where we predict the depth of each pixel in an image, it was found that the loss for the majority of pixels is very small, while for some few pixels it is very large. What would be a better loss function to use in this case out of [MAE, MSE, BCE, CE], and why? Answer: MAE (Mean Absolute Error) would be better as it is less sensitive to outliers compared to MSE.

    4. What could cause numerical instabilities in the BCE and CE functions? How could we solve that? Answer: Numerical instabilities arise when predicted probabilities get very close to 0 or 1, so \(\log(0)\) can appear (for CE, very large logits can also overflow the softmax). This can be solved by clipping the probabilities (or adding a small \(\epsilon\) inside the log) and by using numerically stable formulations that work directly on the logits.

    5. Why is the MAE loss function still used, although it is not differentiable at \(x = 0\)? Answer: MAE is still used because it is robust to outliers and provides a meaningful measure of average error; in practice the non-differentiable point is rarely hit exactly, and a subgradient (e.g., 0) is used there.

    6. Assume that in the first iteration of training, some of the logits are values above 1000, while the ground-truth values are in the range of [0, 1]. If we’re using the MSE loss function, what optimization problem should we expect to observe? Answer: The gradients will be extremely large, causing the weights to update excessively and potentially destabilize the training process.

    7. Let’s use the CE loss function for a task of classification of 100 classes. What is the expected loss value after the first iteration, and why? Answer: The expected loss value is approximately \(-\log (1/100) = \log(100) \approx 4.605\) because the initial predictions are likely to be uniformly distributed over the classes.

    8. BCE: how many neurons are at the output layer? Answer: 1 neuron for binary classification.

    9. BCE: why do we multiply the result by \(-1\)? Answer: To convert the maximization of the log-likelihood into a minimization problem, as loss functions are typically minimized in gradient descent. It also makes the loss positive, since the argument of the log is between 0 and 1, so the log itself is negative.

    10. Why don’t we multiply by \(-1\) in MSE or MAE? Answer: Because squared and absolute errors are already non-negative measures of error that are minimized directly, unlike the log-likelihood behind BCE.

    11. You are given a neural network for a classification task with 4 classes and the CE as a loss function. The batch size is 1000. After the very first iteration of training, what is the expected loss value? Answer: The expected loss value is approximately \(\log(4) \approx 1.386\), assuming the initial predictions are uniformly distributed over the classes; the batch size does not change the mean loss.
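
A quick check of the two expected-loss answers above (uniform initial predictions over \(C\) classes give a cross-entropy of \(\log C\)):

```python
import numpy as np

for C in (100, 4):
    y_hat = np.full(C, 1.0 / C)   # uniform prediction over C classes
    loss = -np.log(y_hat[0])      # CE against any one-hot target
    print(C, loss)                # 100 -> ~4.605, 4 -> ~1.386
```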