6. Recurrent Neural Networks and Transformers
6.1 RNNs
Used for sequential data where the independence assumption doesn’t hold.
They produce an output and also a hidden state that is passed to the next time step.
There are some variations:
- Many-to-one: sentiment analysis. Sentences to a single output.
- Many-to-many: translation. Sentences to sentences (shifted). Video segmentation.
- One-to-many: image captioning. Image to a sentence.
- Multi-layer RNNs: stack RNNs on top of each other, feeding the hidden states of one layer as inputs to the next. The basic recurrence is \[ h_t = \sigma (W_{hh} h_{t-1} + W_{xh} x_t) \] \[ y_t = \sigma (W_{hy} h_t) \]
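A minimal NumPy sketch of this recurrence (the sizes, the tanh nonlinearity, and the linear output are illustrative assumptions, not part of the notes):

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, W_hy, h0):
    """Vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h, outputs = h0, []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # new hidden state
        outputs.append(W_hy @ h)             # output (a nonlinearity could be applied here too)
    return np.stack(outputs), h

# Illustrative sizes: sequence length 6, input dim 3, hidden dim 5, output dim 2.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(6, 3))
y_seq, h_T = rnn_forward(
    x_seq,
    W_hh=rng.normal(size=(5, 5)),
    W_xh=rng.normal(size=(5, 3)),
    W_hy=rng.normal(size=(2, 5)),
    h0=np.zeros(5),
)
print(y_seq.shape, h_T.shape)   # (6, 2) (5,)
```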
The hidden state has to compress the whole sequence into a fixed-size vector.
If we unroll the sequence we get a “polynomial” of the weight matrices (repeated powers of the same matrix). The influence of the first input decreases with each time step = forgetting.
Exploding gradients: we can clip the gradients (see the sketch below). Vanishing gradients: harder to solve; addressed by LSTMs and GRUs.
- These use gates instead of a single weight matrix that gets multiplied over and over again.
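A minimal gradient-clipping sketch in PyTorch; the toy model, data, and the choice of max_norm=1.0 are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Toy many-to-one RNN classifier on random data (all sizes are illustrative).
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 3)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(4, 10, 8)          # (batch, time, features)
y = torch.randint(0, 3, (4,))      # class labels

for _ in range(5):
    optimizer.zero_grad()
    _, h_n = rnn(x)                # h_n: (1, batch, hidden)
    loss = nn.functional.cross_entropy(head(h_n[-1]), y)
    loss.backward()
    # Rescale gradients so their global norm is at most 1.0,
    # keeping exploding gradients in check.
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
```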
6.1.1 LSTM
- The cell state acts as a highway for gradients to flow through.
- Still struggles with long-term dependencies.
- Hard to parallelize: the hidden state is computed sequentially, so we still process the sequence step by step and have to wait for each step before feeding its result back in.
- Transformers can do it in parallel.
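A minimal usage sketch with PyTorch's built-in LSTM (all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Two-layer LSTM over a batch of sequences.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(4, 10, 8)        # (batch, time, features)
output, (h_n, c_n) = lstm(x)     # output: (4, 10, 16)

# h_n / c_n hold the last hidden and cell state per layer: (2, 4, 16).
# The cell state c_n is the "highway" along which gradients can flow.
print(output.shape, h_n.shape, c_n.shape)
```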
6.2 Transformers
- Attention mechanism: each token can attend to all other tokens.
- Self-attention: the queries, keys, and values all come from the same sequence, so each token attends to the other tokens of that sequence (including itself).
- Multi-head attention: multiple attention heads.
- Positional encoding: add an encoding of the token’s position to its embedding, since attention by itself is permutation-invariant.
\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V \]
- First we have an encoder that processes the input sequence with self-attention.
- The decoder takes the partial output and processes it with masked self-attention (each position only attends to earlier positions). This processed output then attends to the encoder output with cross-attention.
- These blocks are stacked (so there is again self-attention and cross-attention), followed by a final classification layer over the vocabulary.
- Generation is autoregressive: at inference time we have to wait for each output token before feeding it back in (training is parallelized via masking / teacher forcing).
- Attention has quadratic complexity in the sequence length; sparse attention variants reduce this.
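As a concrete illustration of the attention formula above, here is a minimal NumPy sketch of scaled dot-product attention (the shapes and the toy input are assumptions for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    Returns an (n_q, d_v) matrix: one weighted average of V per query.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

# Self-attention: queries, keys, and values all come from the same sequence.
X = np.random.randn(5, 4)        # 5 tokens, embedding size 4 (illustrative)
out = scaled_dot_product_attention(X, X, X)
print(out.shape)                 # (5, 4)
```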
6.3 Questions
- Give two drawbacks of using RNNs that are not the exploding or vanishing gradients.
- How did LSTM solve the vanishing gradient problem in RNNs?
- Is it a good idea to use ReLU instead of Sigmoid as the activation of the input gate of LSTM?
- Transformers:
- Can the transformer architecture take embeddings of different sizes?
- Can it take sequences of different sizes as inputs to the encoder and the decoder?
- Cross-attention layer. Given the encoder outputs of shape \( X_e \in \mathbb{R}^{N \times M} \) and the decoder outputs of shape \( X_d \in \mathbb{R}^{K \times M} \), what is the dimension of the output of that layer?
- Is the task of predicting the next token in the sequence a classification or a regression task?
- Why is the self-attention mechanism prone to exploding gradients? How does the original architecture of the transformer solve that?
- Why is it important in transformers to use positional encoding, in comparison to RNNs?