3. Convolutions
Convolutions extract local features.
They are linear operations with shared weights.
Translation-equivariant:
- Can find the same object in different parts of the image.
- Not rotation-equivariant.
- Not scale-equivariant: if we change the resolution, we need to change the weights.
A global view of the image is possible in the deeper layers, where the receptive field is larger.
Very efficient parameter-wise.
Kernel size is usually an odd number; height and width can differ.
Stride is the step size of the kernel.
The number of pixels added to the edges of the image is called padding.
\[ \text{output size} = \frac{W - F + 2P}{S} + 1 \]
Popular options
- k=1, s=1, p=0 pointwise convolution. Processes each pixel independently while keeping the spatial dimensions. Used to change (typically reduce) the number of channels.
- k=3, s=1, p=1 standard convolution.
- k=3, s=2, p=1 downsampling. Halving the dimensions.
- k=7, s=4, p=3 aggressive downsampling. Spatial size is decreased by a factor of 4.
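The output-size formula and the popular options above can be sanity-checked in a few lines of plain Python (a minimal sketch; `conv_output_size` is a hypothetical helper name):

```python
def conv_output_size(w, k, s, p):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# k=1, s=1, p=0: pointwise convolution, keeps the spatial size
assert conv_output_size(32, 1, 1, 0) == 32
# k=3, s=1, p=1: standard convolution, keeps the spatial size
assert conv_output_size(32, 3, 1, 1) == 32
# k=3, s=2, p=1: halves the spatial size
assert conv_output_size(32, 3, 2, 1) == 16
# k=7, s=4, p=3: decreases the spatial size by a factor of 4
assert conv_output_size(32, 7, 4, 3) == 8
```

The floor division matches the convention of common frameworks, which discard any partial window at the border.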
3.0.1. Max Pooling
- Works channel-wise independently.
- Within each kernel-sized window, it takes the maximum value.
- We have to keep track of the indices to backpropagate.
- Usually k=2, s=2, p=0.
- Only a quarter of the input gets a gradient (for k=2, s=2); one reason it is rarely used in modern architectures.
- If there is a tie, both values get the gradient.
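The gradient routing can be illustrated in plain Python (a sketch with a hypothetical helper; for simplicity, ties here route the gradient to a single entry, whereas some frameworks split it):

```python
def maxpool2x2_with_grad(x, upstream):
    """2x2 max pooling (k=2, s=2, p=0) on a 2D list of floats, plus the
    gradient it routes back: only the argmax of each window receives the
    upstream gradient (local derivative 1); every other input gets zero."""
    h, w = len(x), len(x[0])
    out = [[0.0] * (w // 2) for _ in range(h // 2)]
    grad = [[0.0] * w for _ in range(h)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            # Collect (value, row, col) for the four entries in the window.
            window = [(x[i + di][j + dj], i + di, j + dj)
                      for di in (0, 1) for dj in (0, 1)]
            val, mi, mj = max(window)
            out[i // 2][j // 2] = val
            # The upstream gradient passes through to the max entry only.
            grad[mi][mj] = upstream[i // 2][j // 2]
    return out, grad

x = [[1.0, 2.0],
     [3.0, 4.0]]
out, grad = maxpool2x2_with_grad(x, [[1.0]])
# Only x[1][1] (the maximum) receives a gradient.
```

This makes the 1/4-of-the-pixels observation concrete: each 2x2 window forwards exactly one gradient entry.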
3.0.2. Average Pooling
- Averages the values in the kernel.
- !! Works channel-wise independently. !! Unlike convolution, which combines all input channels.
- Usually k=2, s=2, p=0.
3.0.3. Special Convolutions
- Depthwise convolution. Special case of convolution where each channel is processed independently with a different kernel.
- Global Max Pooling. Takes the maximum value over the whole spatial extent of each channel.
- Upsample
- Nearest neighbor. Just repeats the pixels.
- Bilinear. Interpolates (a weighted average) from the 4 nearest pixels.
- Bi-cubic. Interpolates from the 16 nearest pixels (a 4×4 neighborhood).
- Doesn’t have learnable parameters.
- Transposed convolution. Upsampling with learnable parameters.
- Also called fractionally strided convolution. Adds zeros between the pixels and to the edges.
- Not the same as deconvolution or inverse convolution.
- Dilated convolution. Increases the receptive field without increasing the number of parameters.
- Also called atrous convolution.
- The kernel is applied to every n-th pixel (dilation rate n).
- The effective kernel size grows to n(k − 1) + 1, enlarging the receptive field without adding parameters.
- Used in segmentation networks (e.g., the DeepLab family) to enlarge the encoder’s receptive field.
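The shape arithmetic of these two special convolutions can be sketched in plain Python (hypothetical helper names; the transposed-conv formula assumes no output padding and no dilation):

```python
def transposed_conv_output_size(w, k, s, p):
    """Transposed (fractionally strided) convolution: inverts the shape
    arithmetic of a strided convolution by inserting zeros between pixels."""
    return (w - 1) * s - 2 * p + k

def dilated_conv_output_size(w, k, s, p, d):
    """Dilated (atrous) convolution: the kernel samples every d-th pixel,
    so the effective kernel size grows to d*(k-1)+1."""
    effective_k = d * (k - 1) + 1
    return (w - effective_k + 2 * p) // s + 1

# A k=3, s=2, p=1 transposed convolution roughly doubles the spatial size:
assert transposed_conv_output_size(16, 3, 2, 1) == 31
# A 3x3 kernel dilated by d=2 covers a 5x5 region; with p=2 the size is kept:
assert dilated_conv_output_size(32, 3, 1, 2, 2) == 32
```

Note the transposed convolution yields 31, not 32: frameworks expose an output-padding parameter to resolve this ambiguity when exactly doubling the size.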
3.0.4. Receptive Field
Receptive field is the area of the input image that affects the output of a neuron. For layer l:
\[ r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i \]
It’s a tuple (height, width).
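The recursion above can be unrolled in a few lines of plain Python (a sketch for one dimension; `receptive_field` is a hypothetical helper name):

```python
def receptive_field(layers):
    """Receptive field after a stack of layers, each given as
    (kernel_size, stride): r_l = r_{l-1} + (k_l - 1) * prod(s_1..s_{l-1})."""
    r, jump = 1, 1  # jump tracks the product of strides so far
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Two stacked 3x3 stride-1 convolutions see a 5x5 input region:
assert receptive_field([(3, 1), (3, 1)]) == 5
# An early stride-2 layer makes later kernels grow the field twice as fast:
assert receptive_field([(3, 2), (3, 1)]) == 7
```

For non-square kernels the same function would be run once for height and once for width, giving the (height, width) tuple.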
3.0.5. Handcrafted Kernels
Don’t need to remember the values, just the concept.
!! Each kernel also has a bias term. !!
3.1. Questions
How can we represent a FC layer with 5 output neurons with a convolutional layer, over an image in a batch of size 4 × 3 × 8 × 8? And with 1x1 Convolution?
- Convolution with 5 filters and kernel size 8x8. The weight dimensions would be (5, 3, 8, 8). The output shape would be (4, 5, 1, 1).
- With a 1x1 convolution: reshape the input from (4, 3, 8, 8) to (4, 192, 1, 1) and apply a 1x1 convolution with weight shape (5, 192, 1, 1). The output shape would be (4, 5, 1, 1).
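The shape bookkeeping for both options can be checked in plain Python (a sketch; the tuples mirror the (out_channels, in_channels, height, width) weight convention):

```python
# FC-as-convolution equivalence: batch 4, 3x8x8 images, 5 output neurons.
B, C, H, W, OUT = 4, 3, 8, 8, 5

# Option 1: a convolution whose kernel covers the whole image.
conv_weight = (OUT, C, H, W)          # (5, 3, 8, 8)
conv_params = OUT * C * H * W + OUT   # weights + biases
conv_output = (B, OUT, 1, 1)

# Option 2: flatten the image into channels, then a 1x1 convolution.
flat_input = (B, C * H * W, 1, 1)     # (4, 192, 1, 1)
pw_weight = (OUT, C * H * W, 1, 1)    # (5, 192, 1, 1)
pw_params = OUT * C * H * W + OUT

# Both match the fully connected layer parameter-for-parameter:
fc_params = OUT * (C * H * W) + OUT
assert conv_params == pw_params == fc_params == 965
```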
Given a 1 × 1 convolutional layer with input tensor of 10 channels, that outputs a tensor with 5 channels.
- Write the shape of the weight matrix.
- The shape of the weight matrix would be (5, 10, 1, 1).
- State the number of parameters in the layer.
- The number of parameters would be 55 (5x10 + 5).
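The same count follows from the general parameter formula for a convolutional layer (a plain-Python sketch; `conv_params` is a hypothetical helper name):

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Parameter count of a conv layer: one k_h x k_w filter per
    (input channel, output channel) pair, plus one bias per output channel."""
    return c_out * c_in * k_h * k_w + c_out

# The 1x1 layer from the question: 10 input channels, 5 output channels.
assert conv_params(10, 5, 1, 1) == 55
```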
Give two reasons to use a convolution or a pooling layer that reduces the spatial size of the input tensor.
- Reduces the computational cost and memory usage (and the parameter count of any downstream FC layers), allowing the network to be deeper.
- Compression of the feature map, focusing on the most important features.
Does reducing the spatial size throughout the network reduce the number of parameters in the downstream convolutional layers?
- No, the number of parameters depends on the kernel size and the number of channels.
Does reducing the spatial size throughout the network reduce the number of parameters in the downstream FC layers?
- Yes; the tensor is flattened before the FC layer, so a smaller spatial size means fewer parameters.
State two differences between a convolutional layer and a pooling layer.
- Convolutional layers have learnable parameters.
- Pooling layers work channel-wise independently.
Assume a fully convolutional model for some task:
- Can we feed the model with images that are double the resolution of the original training set? Yes, a fully convolutional model accepts any input resolution.
- Can we expect in such a case the same performance?
- No; the model was trained on a different resolution, so the learned features may not match the new scale.
Can we apply a pooling layer without reducing the spatial size of the input tensor?
- Yes, e.g. with kernel size 3, stride 1, and padding 1: (W − 3 + 2·1)/1 + 1 = W.
Assume that when using maxpool with (k = 2, s = 2, p = 0), where only a single entry in a given window holds the maximum value in that window. How many of the total pixels in the tensor would get a live gradient? What is the value of said gradient?
- Only 1/4 of the pixels get a live gradient. The local gradient at each maximum is 1, so the upstream gradient passes through unchanged.
State one advantage and one disadvantage of a 1 × 1 convolution over a 3 × 3 convolution.
- Advantage of 1x1: far fewer parameters and less computation; commonly used to reduce the number of channels.
- Disadvantage: 1x1 convolutions are pointwise and cannot capture local spatial features, whereas 3x3 convolutions have a larger receptive field (and can change the spatial size).
Why don’t we use hand-crafted kernels (e.g. Sobel filter for edge detection) within our deep learning models?
- They are fixed rather than learned, so they cannot adapt to the features present in the data.
In what technique, however, can we use hand-crafted kernels such as Gaussian blur?
- Data augmentation.
Assume that we want to use a transformer to process an intermediate feature map (the output tensor of a convolutional layer), to learn meaningful relations between the pixels. How can we change the tensor to do that?
- We can reshape the tensor from (B, C, H, W) to (B, H×W, C) and feed it to the transformer. Each pixel is now a C-dimensional vector, used as a token.
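The reshape can be sketched in plain Python on nested lists (frameworks do this in one flatten-and-transpose call; `feature_map_to_tokens` is a hypothetical helper name):

```python
def feature_map_to_tokens(x):
    """Reshape a (B, C, H, W) feature map (nested lists) into (B, H*W, C):
    each spatial position becomes one C-dimensional token."""
    return [[[x[b][c][i][j] for c in range(len(x[b]))]   # gather channels
             for i in range(len(x[b][0]))                # row-major over H
             for j in range(len(x[b][0][0]))]            # ... and W
            for b in range(len(x))]

# A (1, 2, 2, 2) feature map becomes 4 tokens of dimension 2.
fmap = [[[[1, 2], [3, 4]],
         [[5, 6], [7, 8]]]]
tokens = feature_map_to_tokens(fmap)
assert len(tokens[0]) == 4 and tokens[0][0] == [1, 5]
```

Each token collects the values of one pixel across all channels, which is exactly the (B, H×W, C) layout a transformer expects.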