Part 2 - Week 1
Mrinmaya Sachan
Transfer Learning
- Definition: Transfer learning leverages knowledge from a source task/domain to improve performance on a target task/domain, often with limited target data.
- Key principle: Learn general-purpose representations from a large, diverse dataset, then adapt them to specific downstream tasks.
- Advantages:
- Reduces the amount of labeled data required for the target task.
- Speeds up convergence during training.
- Can improve performance, especially in low-resource settings.
- Two-phase training:
- Pre-training: Learn general representations from a large corpus (unsupervised/self-supervised or supervised).
- Fine-tuning: Adapt the pre-trained model to the target task using a smaller, task-specific dataset.
- Few-shot / zero-shot learning: In some cases, no weight updates are needed; the model can perform tasks directly from prompts and examples.
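A minimal sketch of the fine-tuning phase in PyTorch, assuming a pre-trained encoder is already available (the encoder, dimensions, and task head below are illustrative placeholders, not a specific library API):

```python
import torch
import torch.nn as nn

# Placeholder for a pre-trained encoder (in practice, a large pre-trained LM).
encoder = nn.LSTM(input_size=300, hidden_size=512, batch_first=True)

# Fine-tuning: attach a small task-specific head and train on the (small) target dataset.
task_head = nn.Linear(512, 2)   # e.g., binary sentiment classification

# Feature-extraction variant: freeze the encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)

x = torch.randn(8, 20, 300)          # 8 sequences of 20 tokens with 300-dim embeddings
hidden, _ = encoder(x)               # (8, 20, 512)
logits = task_head(hidden[:, -1])    # last hidden state as a crude sequence representation
loss = nn.functional.cross_entropy(logits, torch.randint(2, (8,)))
loss.backward()
optimizer.step()
```

Full fine-tuning would instead leave all encoder parameters trainable and typically use a smaller learning rate.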
Early Contextual Embedding Models
CoVe — Contextualized Word Vectors (2017)
- Architecture: Sequence-to-sequence model for machine translation with a bidirectional LSTM encoder.
- Process:
- Input: GloVe embeddings → BiLSTM encoder → contextualized word embeddings.
- Output: Hidden states from the encoder used as features for downstream tasks.
- Usage:
- Concatenate CoVe embeddings with GloVe embeddings for task-specific models (e.g., question classification, entailment, sentiment analysis).
- Training: Pre-trained on machine translation, then transferred to other NLP tasks.
- Impact: Demonstrated that contextualized embeddings improve over static embeddings.
ELMo — Deep Contextual Word Representations (2018)
- Architecture: Two-layer bidirectional LSTM language model (forward and backward).
- Key innovations:
- Combines hidden states from all LSTM layers, not just the top layer.
- Learns task-specific scalar weights for combining layers (weights sum to 1).
- Training: Pre-trained on a large language modeling task.
- Impact: Achieved SOTA results at the time; showed the benefit of deep contextualization.
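The layer-wise combination described above can be sketched as a softmax-weighted sum over layer representations (a minimal sketch; the number of layers and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of ELMo-style layer representations."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized -> weights sum to 1
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scaling factor

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, dim) tensors, one per layer (embeddings + biLSTM layers)
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed

# Example: 3 "layers" (token embeddings + 2 biLSTM layers), each (2, 5, 1024)
states = [torch.randn(2, 5, 1024) for _ in range(3)]
elmo_repr = ScalarMix(num_layers=3)(states)
```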
Transformer-based Transfer Learning
BERT (2019) — Bidirectional Encoder Representations from Transformers
- Architecture: Transformer encoder stack (no decoder).
- Tokenization: WordPiece (subword units, similar to BPE).
- Input format:
- [CLS] token at start (used for classification tasks).
- [SEP] token between sentence pairs.
- Embeddings = token embeddings + positional embeddings + segment embeddings.
- Pre-training objectives:
- Masked Language Modeling (MLM): Randomly mask ~15% of tokens; predict them.
- To reduce pretrain–finetune mismatch: 80% [MASK], 10% random token, 10% unchanged (see the masking sketch after this list).
- Next Sentence Prediction (NSP): Predict whether sentence B follows sentence A.
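A minimal sketch of the masking rule, assuming plain token-ID tensors (the [MASK] ID and vocabulary size are typical of BERT's WordPiece vocabulary; special-token handling is omitted):

```python
import torch

def mlm_mask(input_ids, mask_token_id=103, vocab_size=30522, mask_prob=0.15):
    """Apply BERT-style MLM corruption; returns corrupted inputs and labels (-100 = not predicted)."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob     # ~15% of positions
    labels[~selected] = -100                                # only selected positions contribute to the loss

    corrupted = input_ids.clone()
    decision = torch.rand(input_ids.shape)
    corrupted[selected & (decision < 0.8)] = mask_token_id  # 80% -> [MASK]
    random_pos = selected & (decision >= 0.8) & (decision < 0.9)
    corrupted[random_pos] = torch.randint(vocab_size, input_ids.shape)[random_pos]  # 10% -> random token
    # remaining 10%: keep the original token unchanged
    return corrupted, labels

ids = torch.randint(1000, (2, 16))   # toy batch of token IDs
corrupted, labels = mlm_mask(ids)
```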
- Fine-tuning:
- Classification: Use the [CLS] representation (see the classifier sketch below).
- QA: Predict start/end token positions.
- Performance: Outperformed GPT-1 despite smaller size; usable for both fine-tuning and feature extraction.
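A minimal sketch of a classification head over the [CLS] position (the encoder call itself is omitted; hidden size and label count are illustrative):

```python
import torch
import torch.nn as nn

class ClsClassifier(nn.Module):
    """Sentence classifier: pool the [CLS] position, then apply dropout and a linear layer."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden) from the Transformer encoder
        cls = hidden_states[:, 0]   # [CLS] is the first position
        return self.classifier(self.dropout(cls))

logits = ClsClassifier()(torch.randn(4, 128, 768))   # (4, 2)
```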
RoBERTa — Robustly Optimized BERT Pretraining
- Changes from BERT:
- Larger training corpus (~160GB vs. 16GB).
- Dynamic masking (masking pattern changes each epoch).
- Removed NSP objective.
- Longer training and larger batches.
ALBERT — A Lite BERT
- Efficiency improvements:
- Cross-layer parameter sharing (reduces parameters).
- Factorized embedding parameterization (separate smaller embedding matrix projected to hidden size).
- Objective change: Replaced NSP with Sentence Order Prediction (SOP) to better model discourse coherence.
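A minimal sketch of the factorized embedding, with illustrative sizes (vocabulary 30k, embedding size 128, hidden size 4096):

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 30000, 128, 4096

# Factorized embeddings: a small V x E table plus an E x H projection, instead of one V x H table.
#   V x H         : 30000 * 4096              ≈ 123M parameters
#   V x E + E x H : 30000 * 128 + 128 * 4096  ≈ 4.4M parameters
word_embeddings = nn.Embedding(vocab_size, emb_size)
embedding_projection = nn.Linear(emb_size, hidden_size, bias=False)

token_ids = torch.randint(vocab_size, (2, 16))
hidden_in = embedding_projection(word_embeddings(token_ids))   # (2, 16, 4096)
```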
ELECTRA (2020)
- Objective: Replaced token detection, a discriminative objective used in place of (generative) MLM.
- Train a small generator to replace some tokens.
- Train a discriminator to detect replaced tokens.
- Advantages: More sample-efficient than MLM; trains on all tokens, not just masked ones.
- Note: Not adversarial like GANs; generator is discarded after pre-training.
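A minimal sketch of the replaced-token-detection loss; the generator's sampling step and the discriminator encoder are assumed to have already produced the tensors below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 256
discriminator_head = nn.Linear(hidden_size, 1)   # one logit per position: replaced or original?

# Assume a small generator has already sampled replacements for some masked positions,
# and the discriminator encoder has produced hidden states for the corrupted sequence.
hidden_states = torch.randn(2, 16, hidden_size)      # placeholder encoder output
is_replaced = torch.randint(0, 2, (2, 16)).float()   # 1 where the generator swapped the token

logits = discriminator_head(hidden_states).squeeze(-1)           # (2, 16)
loss = F.binary_cross_entropy_with_logits(logits, is_replaced)   # loss over ALL positions, not just ~15%
```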
GPT Family — Decoder-only Transformers
GPT-1 (2018)
- Architecture: 12-layer Transformer decoder.
- Training: Left-to-right language modeling (next-token prediction).
- Task adaptation: Special tokens for classification, multiple-choice QA, etc.
- Difference from BERT: Unidirectional context; generative LM objective.
GPT-2 (2019)
- Scale: 1.5B parameters.
- Key result: Strong zero-shot performance via prompting.
- Tasks: Summarization, translation, QA without fine-tuning.
GPT-3 (2020)
- Scale: 175B parameters.
- Key result: In-context learning — few-shot and zero-shot capabilities without parameter updates.
- Impact: Popularized prompt engineering.
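A minimal illustration of few-shot in-context learning: the entire task specification lives in the prompt, the model's continuation is the prediction, and no parameters are updated (the reviews are made-up examples):

```python
prompt = """Classify the sentiment of each review.

Review: The plot was predictable and the acting wooden.
Sentiment: negative

Review: A delightful film with a stellar cast.
Sentiment: positive

Review: I couldn't stop checking my watch.
Sentiment:"""
# The model continues the text after "Sentiment:", ideally with "negative".
```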
Seq2Seq Transformers
- Architecture: Encoder–decoder Transformer.
- Applications: Translation, summarization, text generation, etc.
T5 (2020) — Text-to-Text Transfer Transformer
- Unified framework: All tasks cast as text-to-text.
- Pre-training objective: Span corruption (text infilling) — mask contiguous spans.
- Data: C4 corpus (~750GB cleaned Common Crawl).
- Advantage: Single model for multiple NLP tasks.
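A minimal illustration of span corruption, following the example format used in the T5 paper (tokenization details are simplified):

```python
original = "Thank you for inviting me to your party last week."

# Contiguous spans are dropped from the input and replaced by sentinel tokens <X>, <Y>, ...
model_input  = "Thank you <X> me to your party <Y> week."
# The target reconstructs only the dropped spans, each introduced by its sentinel,
# and ends with a final sentinel.
model_target = "<X> for inviting <Y> last <Z>"
```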
BART (2020)
- Architecture: Encoder–decoder Transformer.
- Pre-training: Denoising autoencoder with:
- Text infilling (mask spans).
- Sentence permutation.
- Applications: Summarization, translation, text generation.
Fine-tuning Challenges
- Overfitting: Risk when target dataset is small.
- Catastrophic forgetting: Fine-tuning can degrade performance on pre-trained knowledge.
- Memory inefficiency: Storing separate fine-tuned models for each task is costly.
Parameter-Efficient Fine-Tuning (PEFT)
- Goal: Adapt large models by training only a small subset of parameters.
- Benefits:
- Lower compute and storage cost.
- Easier multi-task deployment (store small task-specific modules).
- Note: Tuning only a designated subset of existing parameters (e.g., BitFit, diff pruning) is sometimes called specification-based tuning.
Methods
Adapters (2019)
- Mechanism:
- Insert small bottleneck feedforward modules with residual connections between Transformer sub-layers (after attention and feedforward).
- Only adapter parameters are trained (~1–3% of model size).
- Structure: Down-projection → nonlinearity → up-projection → residual add.
- Advantages: Comparable performance to full fine-tuning; store only small adapter weights per task.
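A minimal sketch of a bottleneck adapter module (hidden and bottleneck sizes are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> nonlinearity -> up-project -> residual add."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual connection around the bottleneck

# Inserted after a Transformer sub-layer's output; only these weights are trained per task.
out = Adapter()(torch.randn(2, 16, 768))
```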
Prefix-Tuning
- Idea: Learn continuous task-specific prefix vectors prepended to the input at each Transformer layer.
- Setup:
- Freeze base model parameters \(\theta\).
- Train prefix parameters \(\phi\) only.
- Implementation:
- Trainable prefix matrix \(P_\phi \in \mathbb{R}^{T' \times D'}\).
- MLP maps \(D' \to D\) to match model hidden size.
- Concatenate prefix activations to input embeddings:
\(\mathbf{X'} = [\text{Prefix}_\phi; \mathbf{X}]\).
- Feed \(\mathbf{X'}\) into the frozen Transformer.
- Advantages:
- Very parameter-efficient.
- Modular — swap prefixes for different tasks.
- No modification to base model weights.
- Note: Prefix vectors influence computation via self-attention.
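A minimal sketch of the reparameterization above (the frozen Transformer itself is omitted; sizes are illustrative):

```python
import torch
import torch.nn as nn

T_prefix, D_small, D_model = 10, 64, 768   # T', D', D in the notation above

# Trainable prefix parameters P_phi and the MLP that maps D' -> D
P_phi = nn.Parameter(torch.randn(T_prefix, D_small))
prefix_mlp = nn.Sequential(nn.Linear(D_small, D_model), nn.Tanh(), nn.Linear(D_model, D_model))

X = torch.randn(2, 16, D_model)                            # frozen model's input embeddings (placeholder)
prefix = prefix_mlp(P_phi).unsqueeze(0).expand(2, -1, -1)  # (batch, T', D)
X_prime = torch.cat([prefix, X], dim=1)                    # X' = [Prefix_phi ; X], fed to the frozen model
```

In the original method, prefix activations are injected at every Transformer layer (via the attention keys/values); a single input-level prefix is shown here for brevity.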
LoRA — Low-Rank Adaptation
- Mechanism:
- Decompose weight update into low-rank matrices: \(W' = W_0 + AB\).
- \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\) with \(r \ll \min(d,k)\).
- Advantages:
- No extra inference latency (can merge \(AB\) into \(W_0\)).
- Preserves full sequence length (unlike prefix-tuning).
- Often easier to optimize than prefix-tuning.
- Use case: Widely adopted for fine-tuning large LMs with minimal compute.
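A minimal sketch of a LoRA-adapted linear layer in the \(W' = W_0 + AB\) notation above (rank, scaling, and sizes are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update A @ B."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False              # W0 stays frozen
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # d x r
        self.B = nn.Parameter(torch.zeros(r, d_out))        # r x k, zero-init so W' = W0 at the start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

# At inference time, A @ B can be merged into W0, so there is no extra latency.
y = LoRALinear(768, 768)(torch.randn(2, 16, 768))
```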
BitFit (2021)
- Mechanism: Fine-tune only bias terms in the model.
- Performance: Achieves ~95% of full fine-tuning performance on many tasks.
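A minimal sketch of the BitFit selection rule, applied here to a generic PyTorch Transformer encoder as a stand-in for a pre-trained LM:

```python
import torch.nn as nn

# Any pre-trained model would do; a generic Transformer encoder serves as a placeholder here.
model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=2)

# BitFit: freeze everything except the bias terms.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total} parameters")   # only a tiny fraction of the model
```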
Diff Pruning (2021)
- Mechanism: Represent weights as \(W = W_0 + \Delta W\), where \(\Delta W\) is sparse.
- Regularization: A sparsity penalty on \(\Delta W\) (the paper uses a differentiable relaxation of the \(L_0\) norm) to encourage most entries to remain zero.
- Trade-off: Training can require more memory than full fine-tuning, since a dense \(\Delta W\) (and its mask) is learned before pruning.
Summary Table — Model Families & Objectives
| Model | Architecture | Pre-training Objective(s) | Notable Features |
|---|---|---|---|
| CoVe | BiLSTM encoder | MT seq2seq | Contextual embeddings from MT |
| ELMo | BiLSTM LM | Forward + backward LM | Layer-wise learned combination |
| BERT | Transformer encoder | MLM + NSP | Bidirectional context |
| RoBERTa | Transformer encoder | MLM | Larger data, dynamic masking, no NSP |
| ALBERT | Transformer encoder | MLM + SOP | Parameter sharing, factorized embeddings |
| ELECTRA | Transformer encoder | Replaced token detection | Discriminative pre-training |
| GPT-1/2/3 | Transformer decoder | Next-token prediction | Generative LM, scaling to 175B |
| T5 | Transformer encoder–decoder | Span corruption | Unified text-to-text |
| BART | Transformer encoder–decoder | Text infilling + permutation | Denoising autoencoder |