5. Popular Models and Architectures
5.1. CNNs
All models below (except LeNet) were designed for the ImageNet 1000-class classification task.
- Top-1 accuracy: the model's highest-scoring prediction is the correct class.
- Top-5 accuracy: the correct class is in the top 5 predictions.
- Top-5 error: 1 - top-5 accuracy.
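A minimal sketch of how these metrics can be computed from a model's class scores (NumPy; the function name and array shapes are illustrative assumptions, not from the notes):

import numpy as np

def top_k_accuracy(scores, labels, k=5):
    # scores: (N, 1000) class scores, labels: (N,) true class indices (assumed shapes)
    topk = np.argsort(scores, axis=1)[:, -k:]        # indices of the k highest-scoring classes
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return np.mean(hits)                             # top-5 error = 1 - top_k_accuracy(scores, labels, k=5)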
5.1.1. LeNet
- Original CNN by Yann LeCun.
- For handwritten digit recognition.
- Average pooling, kernel size of 5.
- 60k parameters.
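A rough Keras-style sketch of a LeNet-5-like model; the layer sizes follow the classic LeNet-5 description, but details such as the activations are assumptions rather than the exact original:

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(32, 32, 1))                 # grayscale digit image
x = layers.Conv2D(6, 5, activation='tanh')(inputs)       # 5x5 kernels
x = layers.AveragePooling2D(2)(x)                        # average pooling
x = layers.Conv2D(16, 5, activation='tanh')(x)
x = layers.AveragePooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(120, activation='tanh')(x)
x = layers.Dense(84, activation='tanh')(x)
outputs = layers.Dense(10, activation='softmax')(x)      # 10 digit classes
lenet = models.Model(inputs, outputs)                    # roughly 60k parameters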
5.1.2. AlexNet
- 60 million parameters.
- 8 layers.
- ReLU activation instead of tanh.
- Max pooling.
- Most of the parameters are in the fully connected layers at the end.
- Couldn’t go deeper because of the vanishing gradient problem.
5.1.3. VGG
- CONV = kernel size 3x3, stride 1, padding 1.
- Max pooling = kernel size 2x2, stride 2.
- 138 million parameters.
- 16-19 layers.
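A minimal sketch of one VGG block built from these two operations (Keras; the helper name and filter counts are assumptions):

from tensorflow.keras import layers

def vgg_block(x, filters, n_convs):
    # n_convs stacked 3x3 convolutions, stride 1, padding 1 ('same')
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, strides=1, padding='same', activation='relu')(x)
    # 2x2 max pooling with stride 2 halves the spatial size
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

VGG-16 stacks such blocks with 64, 128, 256, 512 and 512 filters before the fully connected layers.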
5.1.4. ResNet
Residual Blocks
- Skip connections (residual connections), highway for gradients.
- Solves the vanishing gradient problem.
- Can learn the identity function, so performance should be at least as good as without the skip connection.
- If we just add the input to the output we need to make sure the dimensions are exactly the same.
- If we use concatenation we only need to make sure that batch and spatial dimensions are the same. Then we can use a 1x1 convolution to make the number of channels smaller.
- U-Net uses this.
- Residual block = Conv -> ReLU -> Conv + skip connection -> ReLU.
- The identity highway is interrupted when we reduce the spatial dimensions; the skip path then needs a 1x1 convolution with stride 2 so the shapes match (see the downsampling block sketch below).
Architecture
- 152 layers, 60 million parameters.
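A sketch of a downsampling residual block, where the shortcut needs a 1x1 convolution with stride 2 so its shape matches the main path (Keras; the helper name and filter counts are assumptions, not the exact ResNet code):

from tensorflow.keras import layers

def down_residual_block(x, filters):
    z = layers.Conv2D(filters, 3, strides=2, padding='same')(x)   # halves the spatial size
    z = layers.BatchNormalization()(z)
    z = layers.ReLU()(z)
    z = layers.Conv2D(filters, 3, padding='same')(z)
    z = layers.BatchNormalization()(z)
    shortcut = layers.Conv2D(filters, 1, strides=2)(x)            # 1x1 projection so dimensions match
    return layers.ReLU()(z + shortcut)                            # add the (projected) input, then ReLU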
5.1.5. GoogLeNet (Inception Layer)
- Each block has 4 different convolutions or pooling operations that are concatenated.
- Very expensive, so 1x1 convolutions are used to reduce the number of channels before the larger convolutions (sketched below).
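A sketch of one Inception block with 1x1 reductions before the expensive convolutions (Keras; the helper name and the branch filter counts are illustrative assumptions):

from tensorflow.keras import layers

def inception_block(x, f1, f3, f5, f_pool):
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)          # 1x1 branch
    b3 = layers.Conv2D(f3 // 2, 1, padding='same', activation='relu')(x)     # 1x1 reduction
    b3 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b3)         # 3x3 branch
    b5 = layers.Conv2D(f5 // 2, 1, padding='same', activation='relu')(x)     # 1x1 reduction
    b5 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b5)         # 5x5 branch
    bp = layers.MaxPooling2D(3, strides=1, padding='same')(x)                # stride 1 keeps the spatial size
    bp = layers.Conv2D(f_pool, 1, padding='same', activation='relu')(bp)
    return layers.Concatenate()([b1, b3, b5, bp])                            # concatenate along the channel axis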
5.2. Autoencoders
- Original autoencoder: fully connected layers.
- With convolutional layers: Convolutional Autoencoder or U-Net.
- Consists of 3 main parts: encoder, bottleneck, decoder.
- Encoder: reduces the input to a lower-dimensional representation. Collects the most important features (feature extraction).
- Bottleneck or latent space: the lowest-dimensional representation. The latent representation can be used for other tasks and can model the distribution of the data. Represents the high-level features.
- Calculations in the latent space are faster.
- Decoder: reconstructs the input from the latent space. The output should be as close to the input as possible.
- Without non-linearities it is very similar to PCA.
- We train in an unsupervised way. We don’t need labels.
- Then we remove the decoder and use the encoder as a feature extractor for a supervised task and fine-tune it.
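A minimal convolutional autoencoder sketch (Keras; the input size and channel counts are illustrative assumptions):

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(28, 28, 1))
# Encoder: reduce the spatial size, extract features
x = layers.Conv2D(16, 3, strides=2, padding='same', activation='relu')(inputs)   # 14x14
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(x)        # 7x7
latent = layers.Conv2D(8, 3, padding='same', activation='relu')(x)               # bottleneck / latent space
# Decoder: reconstruct the input from the latent representation
x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(latent)
x = layers.Conv2DTranspose(16, 3, strides=2, padding='same', activation='relu')(x)
outputs = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')   # reconstruction loss

Training would then call autoencoder.fit(x_train, x_train, ...), i.e. the input and the target are the same image, so no labels are needed.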
5.3. Fully Convolutional Networks
- No fully connected layers. Can work with any input size (but convolutions aren’t scale-invariant).
5.4. U-Net
- Fully convolutional autoencoder with skip connections.
- Introduced for biomedical image segmentation.
- As we decrease the spatial dimensions we increase the number of channels.
- To match the spatial dimensions for the skip connections we use transposed convolutions.
- Compared to ResNet, the skip connections are concatenated instead of added; either way, the gradient can flow through the whole network.
- The latent space here is a tensor of HxWxC in the middle of the network.
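A sketch of one U-Net decoder step: upsample with a transposed convolution, concatenate the matching encoder feature map, then convolve (Keras; the helper name and filter counts are assumptions):

from tensorflow.keras import layers

def up_block(x, skip, filters):
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding='same')(x)   # doubles the spatial size to match the skip
    x = layers.Concatenate()([x, skip])                                    # concatenation, not addition
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x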
5.5. Variational Autoencoders (VAE)
- Probabilistic autoencoder. Learn the distribution of the data.
- The bottleneck represents the mean and variance of the distribution.
- We can sample from the distribution to generate new data.
- The loss function is the reconstruction loss (MAE / MSE) + KL divergence.
- Reparametrization trick: we sample noise from a standard normal distribution, multiply it by the standard deviation and add the mean. This way we can backpropagate through the sampling, because the random draw is a dead end in the computation graph and the learned parameters (mean, standard deviation) sit outside the sampling operation.
- The covariance matrix is diagonal.
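A minimal sketch of the reparametrization trick and the two VAE loss terms (TensorFlow; the function names are assumptions, and the encoder is assumed to output a mean and a log-variance per latent dimension):

import tensorflow as tf

def sample_latent(mu, log_var):
    eps = tf.random.normal(tf.shape(mu))        # random draw: a dead end in the computation graph
    return mu + tf.exp(0.5 * log_var) * eps     # z = mu + sigma * eps, differentiable w.r.t. mu and log_var

def vae_loss(x, x_recon, mu, log_var):
    recon = tf.reduce_mean(tf.square(x - x_recon))    # MSE reconstruction loss (MAE also possible)
    # KL divergence between N(mu, diag(sigma^2)) and the standard normal prior N(0, I)
    kl = -0.5 * tf.reduce_mean(1.0 + log_var - tf.square(mu) - tf.exp(log_var))
    return recon + kl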
5.6. Generative Adversarial Networks (GANs)
- Two networks: generator and discriminator.
- Generator: generates fake data.
- Discriminator: distinguishes between real and fake data.
- Minimax game: generator tries to fool the discriminator.
- Very flexible architecture, can generate images, music, text, etc.
- The loss does not converge; training should instead settle into an equilibrium, since the generator and the discriminator pull each other in opposite directions.
- Very hard to train, mode collapse, vanishing gradients.
\[ L_D = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(D(x_i)) + (1 - y_i) \log(1 - D(G(z_i))) \right] \]
\[ L_G = -\frac{1}{N} \sum_{i=1}^N \log(D(G(z_i))) \]
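A sketch of one GAN training step implementing these two losses with binary cross-entropy (TensorFlow; G, D, the optimizers and z_dim are assumed to exist, and D is assumed to output a probability):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def gan_train_step(real_images, G, D, g_opt, d_opt, z_dim=100):
    z = tf.random.normal([tf.shape(real_images)[0], z_dim])
    # Discriminator step (L_D): real images -> label 1, generated images -> label 0
    with tf.GradientTape() as tape:
        d_real = D(real_images, training=True)
        d_fake = D(G(z, training=True), training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables), D.trainable_variables))
    # Generator step (L_G): try to make D output 1 on generated images
    with tf.GradientTape() as tape:
        d_fake = D(G(z, training=True), training=True)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables), G.trainable_variables))
    return d_loss, g_loss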
Questions
Architectures:
- LeNet uses average pooling to reduce the spatial size. Give one advantage and one disadvantage of using average pooling over max pooling.
- Advantage: the gradient propagates through all the pixels in the pooling window, not only through the maximum.
- Disadvantage: max pooling is a non-linear operation that keeps the strongest activation, so average pooling can dilute the most important features.
- In LeNet, what is the receptive field of a neuron in the first FC layer?
- The receptive field is the whole image.
- AlexNet uses an 11 × 11 convolutional filter in the first layer. Name two disadvantages of using such a large filter.
- Many parameters, computationally expensive.
- The receptive field is too big, so fine local features are lost.
- AlexNet uses ReLU instead of sigmoid or Tanh, as used in LeNet. Explain why it allows AlexNet to be deeper than LeNet, when coupled with the Kaiming initialization.
- It mitigates the vanishing gradient problem: ReLU does not saturate for positive inputs, so gradients are not squashed towards zero, and Kaiming initialization keeps the activation variance stable across layers, so the network can be made deeper.
- VGGNet: What is the purpose of the convolutional part of the model? Why do we need the FC layers at the end?
- Extract the features, the FC layers are the classifier on top of high-level features.
- InceptionNet:
- What was the problem with the first version of InceptionNet? How was it solved?
- The first version was very expensive, because the convolutions with large kernel sizes operated on a large number of input channels. It was solved with 1x1 convolutions that reduce the number of channels before the expensive convolutions.
- We learned in class that MaxPool is usually used to reduce the spatial dimensions. Therefore, how was it possible to use it inside the Inception block, and concatenate its output to all other outputs?
- MaxPool is used inside the block with stride 1 (and padding), so the spatial dimensions stay the same and the outputs can be concatenated.
Skip connections:
- Why can we say that skip connections introduce a “highway of gradients”?
- Allows skipping whole blocks of layers, by adding the input to the output. Allows the gradient to bypass the block when backpropagating.
- Given a residual block $ X_{l+1} = X_l + F(X_l) $, where $ X_l, X_{l+1} $ are the block input and output, show the highway of gradients in the chain rule formula of $ \frac{\partial L}{\partial X_l} $, given some loss value $ L $.
- \[ \frac{\partial L}{\partial X_l} = \frac{\partial L}{ \partial X_{l+1}} \frac{\partial X_{l+1}}{ \partial X_{l}} = \frac{\partial L}{ \partial X_{l+1}} (1 + \frac{\partial F(X_l)}{ \partial X_l}) \]
- Can you give a Python-like implementation of the residual block?
# Keras functional API; x is the block input and is assumed to have 64 channels
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU

z = Conv2D(64, 3, padding='same')(x)
z = BatchNormalization()(z)
z = ReLU()(z)
z = Conv2D(64, 3, padding='same')(z)
z = BatchNormalization()(z)
z = z + x            # skip connection: add the block input to the output
z = ReLU()(z)        # final activation after the addition
AutoEncoders:
- Assume we use an autoencoder to reconstruct an image. What could be used as a loss function? What do we compare between? What kind of learning is it (supervised or unsupervised)?
- Reconstruction loss, MSE, MAE. We compare the output to the input. Unsupervised learning.
- What is the effect of a latent space that is too small? What is the effect of a latent space that is too big?
- Too small: underfitting, we cannot capture meaningful features.
- Too big: overfitting, the network can simply copy the input.
- What linear approach does this kind of autoencoder resemble? What is the advantage of an autoencoder over this method?
- PCA. Autoencoders can learn non-linear features.
- State a scenario in which we would like to use an autoencoder for feature extraction.
- Pretraining a model on a large dataset and then using the encoder as a feature extractor for a supervised task.
U-net:
- Give 3 advantages of U-net over the vanilla Autoencoder
- Uses skip connections for better gradient flow.
- Is fully convolutional, can work with any input size.
- Reduces the spatial size and increases the number of channels, therefore extracting more features.
- How do we mitigate the drop in spatial size, so not too much information is lost?
- We increase the number of channels.
- For the task of image reconstruction, does it make sense to use skip-connections between the encoder and the decoder? Explain.
- No, since the model could just copy the input through the skip connections and bypass the bottleneck.
Generative Networks:
- What is the difference between GANs and VAEs in the way they learn the distribution of the training set?
- VAEs learn the distribution explicitly: they model the latent distribution, so we can evaluate and sample from it directly. GANs learn the distribution implicitly: we can only draw samples through the generator, not evaluate the density.
- In the Vanilla GAN, why is it prone to underfitting and mode collapse?
- If the discriminator is too good, the generator gets no useful gradient and does not learn (underfitting). If the generator learns a single image that fools the discriminator, it will just keep generating that same image (mode collapse).
- In GAN, how do we train the generator to fool the discriminator?
- We maximize the probability of the discriminator being wrong.
- VAE: What are the two loss functions we use, and what is their purpose?
- Reconstruction loss, to make the output as close to the input as possible. KL divergence, to keep the latent distribution close to the prior (a standard normal distribution).
- What do we assume the training set distribution to be in both VAEs and GANs?
- Standard normal distribution.
- Sampling from a random distribution, that is not the normal distribution, is very hard. How do VAEs solve this problem?
- By using the reparametrization trick. We sample from a normal distribution and then multiply by the standard deviation and add the mean.