Week 3

Published: Wednesday, March 5, 2025

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a class of deep neural networks particularly well-suited for processing grid-like data, such as images.

  • Wide Range of Applications:
    • Image classification, object localization, instance segmentation, semantic segmentation, 3D pose estimation, eye gaze estimation, dynamic gesture recognition, and more.
  • Biological Inspiration:
    • CNNs are inspired by the organization of the animal visual cortex.
    • Hubel and Wiesel (1950s-1960s) discovered a hierarchy in the visual cortex:
      • Simple cells: Respond to specific features (e.g., oriented edges) at particular locations; sensitive to noise and to small shifts of the stimulus.
      • Complex cells: Aggregate responses from simple cells, providing spatial invariance and robustness.
  • Neocognitron (Fukushima, 1980):
    • Early hierarchical neural network model for visual pattern recognition.
    • S-cells: Simple cells, analogous to convolutional filters.
    • C-cells: Complex cells, analogous to pooling layers.

The Convolution Operation in CNNs

  • Properties:
    • Linear transformation.
    • Shift equivariance: Shifting the input shifts the output in the same way.
  • Mathematical Definition (Convolution):
    • Given input image \(I\) and kernel \(K\), the output \(I'\) is: \[ I'(i, j) = \sum_{m} \sum_{n} K(m, n) \, I(i - m, j - n) \]
  • Cross-Correlation (as used in most deep learning libraries): \[ I'(i, j) = \sum_{m} \sum_{n} K(m, n) \, I(i + m, j + n) \]
    • True convolution is cross-correlation with a 180-degree rotated kernel: \(I * K = I \star K_{\text{flipped}}\), where \(K_{\text{flipped}}(m, n) = K(-m, -n)\).
  • Commutativity: Convolution is commutative (\(I * K = K * I\)), cross-correlation is not.
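
The relationship between the two operations can be spot-checked with a minimal NumPy sketch (scipy.signal is an assumed dependency here): true convolution matches cross-correlation with the 180-degree-rotated kernel, while plain cross-correlation generally differs.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

# Small example image and an asymmetric 3x3 kernel so the two operations differ.
I = np.arange(25, dtype=float).reshape(5, 5)
K = np.arange(9, dtype=float).reshape(3, 3)

# True convolution vs. cross-correlation ("valid" = no padding).
conv = convolve2d(I, K, mode="valid")
corr = correlate2d(I, K, mode="valid")

# Convolution equals cross-correlation with the kernel rotated by 180 degrees.
K_flipped = np.rot90(K, 2)
assert np.allclose(conv, correlate2d(I, K_flipped, mode="valid"))

print(np.allclose(conv, corr))  # False: they differ for an asymmetric kernel
```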

CNN Architecture Overview

A typical CNN consists of:

  • Convolutional layers: Learn spatial features.
  • Activation functions: Non-linearities (e.g., ReLU, sigmoid, tanh).
  • Pooling layers: Downsample feature maps.
  • Fully connected layers: Used at the end for classification/regression.
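
As an illustrative sketch of this stack, here is a small PyTorch model; the 32×32 RGB input size, channel counts, and 10 output classes are arbitrary choices, not prescribed by anything above.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Minimal conv -> ReLU -> pool -> fully connected architecture."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn spatial features
            nn.ReLU(),                                    # non-linearity
            nn.MaxPool2d(2),                              # downsample 32 -> 16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 16 -> 8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected head

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

logits = SmallCNN()(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```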

Mathematical Derivation of a Convolutional Layer

Let \(z^{[l-1]}_{u, v}\) be the output of layer \(l-1\) at position \((u, v)\). Let \(w^{[l]}_{m, n}\) be the filter weights, and \(b^{[l]}\) the bias.

Forward Pass:

\[ z^{[l]}_{i, j} = \sum_{m} \sum_{n} w^{[l]}_{m, n} \, z^{[l-1]}_{i - m, j - n} + b^{[l]} \]
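
A minimal single-channel NumPy sketch of this forward pass, keeping only the "valid" positions where the kernel fully overlaps the input and shifting output indices to start at 0 (the function name conv_forward is just illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

def conv_forward(z_prev, w, b):
    """z[l](i, j) = sum_{m,n} w(m, n) * z[l-1](i - m, j - n) + b for a single
    channel, keeping only the 'valid' output positions (indices shifted to 0)."""
    kH, kW = w.shape
    H_out = z_prev.shape[0] - kH + 1
    W_out = z_prev.shape[1] - kW + 1
    z = np.full((H_out, W_out), float(b))
    for i in range(H_out):
        for j in range(W_out):
            for m in range(kH):
                for n in range(kW):
                    # True convolution: z_prev is read "backwards" relative to
                    # the kernel index (the i - m, j - n in the formula above).
                    z[i, j] += w[m, n] * z_prev[i + kH - 1 - m, j + kW - 1 - n]
    return z

# Sanity check against scipy's true convolution.
z_prev, w = np.random.randn(6, 6), np.random.randn(3, 3)
assert np.allclose(conv_forward(z_prev, w, 0.5),
                   convolve2d(z_prev, w, mode="valid") + 0.5)
```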

Backward Pass:

Let \(L\) be the loss, and \(\delta^{[l]}_{i, j} = \frac{\partial L}{\partial z^{[l]}_{i, j}}\).

  1. Gradient w.r.t. previous layer activations:

    \[ \delta^{[l-1]}_{i, j} = \sum_{m} \sum_{n} \delta^{[l]}_{i + m, j + n} \, w^{[l]}_{m, n} \]

    • This is equivalent to convolving \(\delta^{[l]}\) with the kernel \(w^{[l]}\) rotated by 180 degrees: \[ \delta^{[l-1]} = \delta^{[l]} * \text{rot180}(w^{[l]}) \]
  2. Gradient w.r.t. weights:

    \[ \frac{\partial L}{\partial w^{[l]}_{m, n}} = \sum_{i} \sum_{j} \delta^{[l]}_{i, j} \, z^{[l-1]}_{i - m, j - n} \]

    • This is the cross-correlation of \(z^{[l-1]}\) with \(\delta^{[l]}\).
  3. Gradient w.r.t. bias: \[ \frac{\partial L}{\partial b^{[l]}} = \sum_{i} \sum_{j} \delta^{[l]}_{i, j} \]
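
These three gradients can be written compactly with scipy.signal; the sketch below matches the "valid" true-convolution forward pass above. The extra rot180 on the weight gradient is only re-indexing bookkeeping: the equations above use unshifted output indices, while the arrays here index the valid region from 0. The finite-difference check reuses conv_forward from the previous sketch.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def conv_backward(delta, z_prev, w):
    """Gradients for the 'valid' true-convolution forward pass sketched above.
    delta is dL/dz[l]; returns (dL/dz[l-1], dL/dw, dL/db)."""
    # 1. Gradient w.r.t. previous activations: full convolution of delta with
    #    the 180-degree-rotated kernel, i.e. delta * rot180(w).
    d_z_prev = convolve2d(delta, np.rot90(w, 2), mode="full")
    # 2. Gradient w.r.t. weights: cross-correlate the input with delta; the
    #    rot180 accounts for re-indexing the valid region from 0.
    d_w = np.rot90(correlate2d(z_prev, delta, mode="valid"), 2)
    # 3. Gradient w.r.t. bias: sum delta over all output positions.
    d_b = delta.sum()
    return d_z_prev, d_w, d_b

# Finite-difference check of one weight (uses conv_forward from the sketch above).
x, w = np.random.randn(5, 5), np.random.randn(3, 3)
delta = np.ones((3, 3))                       # i.e. L = sum of the layer's outputs
_, d_w, _ = conv_backward(delta, x, w)
eps = 1e-6
w_pert = w.copy()
w_pert[1, 2] += eps
fd = (conv_forward(x, w_pert, 0.0).sum() - conv_forward(x, w, 0.0).sum()) / eps
assert np.isclose(d_w[1, 2], fd, atol=1e-4)
```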

Pooling Layers

  • Purpose: Downsample feature maps, reduce computation, and provide a degree of local translation invariance.
  • Max pooling: Takes the maximum value in a local region.
    • No parameters.
    • During backpropagation, the gradient is passed only to the input that had the maximum value.
  • Average pooling: Takes the average value in a local region.
  • Modern trend: Strided convolutions often replace pooling.
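
A minimal NumPy sketch of 2×2 max pooling with stride 2, returning an argmax mask so the backward pass can route gradients as described (assumes even height and width; the function names are illustrative):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pooling with stride 2; also returns the argmax mask used to
    route gradients in the backward pass (assumes even H and W)."""
    H, W = x.shape
    patches = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3)  # (H/2, W/2, 2, 2)
    out = patches.max(axis=(2, 3))
    mask = patches == out[..., None, None]   # True where the max was taken
    return out, mask

def maxpool2x2_backward(delta, mask):
    """Route each upstream gradient to the position that held the max
    (ties receive the gradient at every maximal position in this simple version)."""
    grad_patches = mask * delta[..., None, None]          # (H/2, W/2, 2, 2)
    H2, W2 = delta.shape
    return grad_patches.transpose(0, 2, 1, 3).reshape(H2 * 2, W2 * 2)
```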

Convolution and Cross-Correlation Identities

  • \(A * \text{rot180}(B) = A \star B\)
  • \(A \star B = \text{rot180}(B) * A\)
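
Both identities can be spot-checked numerically with scipy.signal (using "full" mode so no border information is lost):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

A = np.random.randn(6, 6)
B = np.random.randn(3, 3)
rotB = np.rot90(B, 2)  # 180-degree rotation

# A * rot180(B) == A cross-correlated with B, and (by commutativity of
# convolution) this also equals rot180(B) * A.
assert np.allclose(convolve2d(A, rotB, mode="full"), correlate2d(A, B, mode="full"))
assert np.allclose(correlate2d(A, B, mode="full"), convolve2d(rotB, A, mode="full"))
```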

Fully Convolutional Networks (FCNs)

  • Semantic segmentation: Classify each pixel.
  • Input and output dimensions: Typically the same.
  • Architecture: Downsample (encoder), then upsample (decoder).
  • Upsampling methods:
    • Fixed: Nearest neighbor, “bed of nails”, max unpooling.
    • Learnable: Transposed convolution.
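
A small NumPy illustration of the two simplest fixed upsampling schemes on a 2×2 input; max unpooling additionally needs the argmax locations remembered from the matching pooling layer (e.g. torch.nn.MaxUnpool2d in PyTorch):

```python
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])

# Nearest neighbour: repeat each value over a 2x2 block.
nearest = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# "Bed of nails": place each value in the top-left corner of a 2x2 block of zeros.
nails = np.zeros((4, 4))
nails[::2, ::2] = x

print(nearest)
print(nails)
```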

Transposed Convolution

  • Also called: Fractionally strided convolution, “deconvolution” (not a true inverse).
  • Mechanism:
    • Upsample by inserting zeros between input elements.
    • Apply a standard convolution to the upsampled input.
    • Output size for input height \(H_{in}\), kernel size \(k\), stride \(s\), and padding \(p\): \[ H_{out} = s (H_{in} - 1) + k - 2p \]
  • Visualization: Each input value is multiplied by the kernel and “spread” over the output; overlapping contributions are summed.
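
The mechanism and the output-size formula can be checked against PyTorch's built-in transposed convolution; this single-channel sketch inserts zeros and then applies a "full" convolution (cross-correlation with the 180-degree-rotated kernel):

```python
import torch
import torch.nn.functional as F

# Single-channel example: 4x4 input, 3x3 kernel, stride 2, no padding.
x = torch.randn(1, 1, 4, 4)
w = torch.randn(1, 1, 3, 3)   # (in_channels, out_channels, kH, kW) for conv_transpose2d
s, k, H_in = 2, 3, 4

# Reference: PyTorch's transposed convolution.
ref = F.conv_transpose2d(x, w, stride=s)

# Manual version: insert s - 1 zeros between input elements...
up = torch.zeros(1, 1, s * (H_in - 1) + 1, s * (H_in - 1) + 1)
up[..., ::s, ::s] = x
# ...then apply a "full" convolution: pad by k - 1 and cross-correlate with
# the 180-degree-rotated kernel.
manual = F.conv2d(up, torch.flip(w, dims=[-2, -1]), padding=k - 1)

print(torch.allclose(ref, manual, atol=1e-6))  # True
print(ref.shape)  # (1, 1, 9, 9): 9 = s*(H_in - 1) + k - 2p with p = 0
```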

U-Net Architecture

  • Symmetric encoder-decoder: Contracting path (encoder) for context, expanding path (decoder) for localization.
  • Skip connections: Concatenate encoder features with decoder features at corresponding resolutions for better localization.
  • Applications: Semantic segmentation, image generation from segmentation maps, 3D human pose/shape estimation, and more.
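
To make the skip-connection idea concrete, here is a deliberately tiny U-Net-style model in PyTorch with a single skip connection; the depth and channel counts are illustrative and much smaller than the original U-Net.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 conv + ReLU blocks, applied at each resolution level.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net-style encoder-decoder with one skip connection."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 32)            # full resolution
        self.down = nn.MaxPool2d(2)                   # contracting path
        self.enc2 = double_conv(32, 64)               # half resolution (bottleneck)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)  # learnable upsampling
        self.dec1 = double_conv(64, 32)               # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x):
        s1 = self.enc1(x)                 # encoder features kept for the skip
        b = self.enc2(self.down(s1))      # contracting path
        u = self.up(b)                    # expanding path
        u = torch.cat([s1, u], dim=1)     # skip connection: concatenate features
        return self.head(self.dec1(u))    # logits with the same HxW as the input

out = TinyUNet()(torch.randn(1, 3, 64, 64))  # -> shape (1, 2, 64, 64)
```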