6. Recurrent Neural Networks and Transformers
6.1 RNNs
Used for sequential data where the independence assumption doesn’t hold.
They produce an output and also a hidden state that is passed to the next time step.
There are some variations:
- Many-to-one: sentiment analysis. Sentences to a single output.
- Many-to-many: translation. Sentences to sentences (shifted). Video segmentation.
- One-to-many: image captioning. Image to a sentence.
- Multi-layer RNNs: stack RNNs on top of each other, feeding the hidden states of one layer as inputs to the next. The basic recurrence is \[ h_t = \sigma (W_{hh} h_{t-1} + W_{xh} x_t) \] \[ y_t = \sigma (W_{hy} h_t) \]
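A minimal NumPy sketch of this recurrence (the sizes, the tanh nonlinearity, and the linear output are illustrative assumptions, not part of the notes):

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, W_hy, h0):
    """Vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h, outputs = h0, []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # new hidden state
        outputs.append(W_hy @ h)             # output (a nonlinearity could be applied here too)
    return np.stack(outputs), h

# Illustrative sizes: sequence length 6, input dim 3, hidden dim 5, output dim 2.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(6, 3))
y_seq, h_T = rnn_forward(
    x_seq,
    W_hh=rng.normal(size=(5, 5)),
    W_xh=rng.normal(size=(5, 3)),
    W_hy=rng.normal(size=(2, 5)),
    h0=np.zeros(5),
)
print(y_seq.shape, h_T.shape)   # (6, 2) (5,)
```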
The hidden state has to compress the whole sequence into a fixed-size vector.
If we unroll the sequence we get a “polynomial” of the weight matrices (repeated powers of the same matrix). The influence of the first input decreases with each time step = forgetting.
Exploding gradients: we can clip the gradients (see the sketch below). Vanishing gradients: harder to solve; addressed by LSTMs and GRUs.
- These use gates instead of a single weight matrix that gets multiplied over and over again.
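A minimal gradient-clipping sketch in PyTorch; the toy model, data, and the choice of max_norm=1.0 are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Toy many-to-one RNN classifier on random data (all sizes are illustrative).
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 3)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(4, 10, 8)          # (batch, time, features)
y = torch.randint(0, 3, (4,))      # class labels

for _ in range(5):
    optimizer.zero_grad()
    _, h_n = rnn(x)                # h_n: (1, batch, hidden)
    loss = nn.functional.cross_entropy(head(h_n[-1]), y)
    loss.backward()
    # Rescale gradients so their global norm is at most 1.0,
    # keeping exploding gradients in check.
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
```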
6.1.1 LSTM
- The cell state acts as a highway for gradients to flow through.
- Still struggles with long-term dependencies.
- Hard to parallelize: the hidden state is computed sequentially, so we still process the sequence step by step and have to wait for each step before feeding its result back in.
- Transformers can do it in parallel.
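A minimal usage sketch with PyTorch's built-in LSTM (all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Two-layer LSTM over a batch of sequences.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(4, 10, 8)        # (batch, time, features)
output, (h_n, c_n) = lstm(x)     # output: (4, 10, 16)

# h_n / c_n hold the last hidden and cell state per layer: (2, 4, 16).
# The cell state c_n is the "highway" along which gradients can flow.
print(output.shape, h_n.shape, c_n.shape)
```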
6.2 Transformers
- Attention mechanism: each token can attend to all other tokens.
- Self-attention: the queries, keys, and values all come from the same sequence, so each token attends to the other tokens of that sequence (including itself).
- Multi-head attention: multiple attention heads.
- Positional encoding: add an encoding of the token’s position to its embedding, since attention by itself is permutation-invariant.
\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V \]
- First we have an encoder that processes the input sequence with self-attention.
- The decoder takes the partial output and processes it with masked self-attention (each position only attends to earlier positions). This processed output then attends to the encoder output with cross-attention.
- These blocks are stacked (so there is again self-attention and cross-attention), followed by a final classification layer over the vocabulary.
- Generation is autoregressive: at inference time we have to wait for each output token before feeding it back in (training is parallelized via masking / teacher forcing).
- Attention has quadratic complexity in the sequence length; sparse attention variants reduce this.
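As a concrete illustration of the attention formula above, here is a minimal NumPy sketch of scaled dot-product attention (the shapes and the toy input are assumptions for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    Returns an (n_q, d_v) matrix: one weighted average of V per query.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

# Self-attention: queries, keys, and values all come from the same sequence.
X = np.random.randn(5, 4)        # 5 tokens, embedding size 4 (illustrative)
out = scaled_dot_product_attention(X, X, X)
print(out.shape)                 # (5, 4)
```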
6.3 Questions
- Give two drawbacks of using RNNs that are not the exploding or vanishing gradients.
- How did LSTM solve the vanishing gradient problem in RNNs?
- Is it a good idea to use ReLU instead of Sigmoid as the activation of the input gate of LSTM?
- Transformers:
- Can the transformer architecture take embeddings of different sizes?
- Can it take sequences of different sizes as inputs to the encoder and the decoder?
- Cross-attention layer. Given the encoder outputs of shape \( X_e \in \mathbb{R}^{N \times M} \) and the decoder outputs of shape \( X_d \in \mathbb{R}^{K \times M} \), what is the dimension of the output of that layer?
- Is the task of predicting the next token in the sequence a classification or a regression task?
- Why is the self-attention mechanism prone to exploding gradients? How does the original architecture of the transformer solve that?
- Why is it important in transformers to use positional encoding, in comparison to RNNs?