Part 2 - Week 1

Mrinmaya Sachan
Published Tuesday, April 1, 2025

Transfer Learning

  • Definition: Transfer learning leverages knowledge from a source task/domain to improve performance on a target task/domain, often with limited target data.
  • Key principle: Learn general-purpose representations from a large, diverse dataset, then adapt them to specific downstream tasks.
  • Advantages:
    • Reduces the amount of labeled data required for the target task.
    • Speeds up convergence during training.
    • Can improve performance, especially in low-resource settings.
  • Two-phase training:
    1. Pre-training: Learn general representations from a large corpus (unsupervised/self-supervised or supervised).
    2. Fine-tuning: Adapt the pre-trained model to the target task using a smaller, task-specific dataset.
  • Few-shot / zero-shot learning: In some cases, no weight updates are needed; the model can perform tasks directly from prompts and examples.
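
The two-phase recipe above is easiest to see in code. Below is a minimal sketch of phase 2 (fine-tuning a pre-trained encoder for binary classification) using the Hugging Face transformers API; the checkpoint name, toy dataset, and hyperparameters are placeholders rather than a prescribed setup.

```python
# Minimal fine-tuning sketch, assuming a pre-trained BERT checkpoint and a tiny
# binary sentiment dataset; data and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]          # placeholder target-task data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                # a few passes over the small dataset
    outputs = model(**batch, labels=labels)       # cross-entropy loss on the [CLS] representation
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```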

Early Contextual Embedding Models

CoVe — Contextualized Word Vectors (2017)

  • Architecture: Sequence-to-sequence model for machine translation with a bidirectional LSTM encoder.
  • Process:
    • Input: GloVe embeddings → BiLSTM encoder → contextualized word embeddings.
    • Output: Hidden states from the encoder used as features for downstream tasks.
  • Usage:
    • Concatenate CoVe embeddings with GloVe embeddings for task-specific models (e.g., question classification, entailment, sentiment analysis).
  • Training: Pre-trained on machine translation, then transferred to other NLP tasks.
  • Impact: Demonstrated that contextualized embeddings improve over static embeddings.

ELMo — Deep Contextual Word Representations (2018)

  • Architecture: Two-layer bidirectional LSTM language model (forward and backward).
  • Key innovations:
    • Combines hidden states from all LSTM layers, not just the top layer.
    • Learns task-specific scalar weights for combining layers (weights sum to 1).
  • Training: Pre-trained on a large language modeling task.
  • Impact: Achieved SOTA results at the time; showed the benefit of deep contextualization.
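
The layer combination can be written explicitly, following the formulation in the ELMo paper: \(\mathbf{h}_{k,j}\) is the hidden state of layer \(j\) for token \(k\) (with \(j = 0\) the token embedding), the \(s_j^{task}\) are softmax-normalized scalar weights, and \(\gamma^{task}\) is a task-specific scale:

\[
\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}
\]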

Transformer-based Transfer Learning

BERT (2019) — Bidirectional Encoder Representations from Transformers

  • Architecture: Transformer encoder stack (no decoder).
  • Tokenization: WordPiece (subword units, similar to BPE).
  • Input format:
    • [CLS] token at start (used for classification tasks).
    • [SEP] token between sentence pairs.
    • Embeddings = token embeddings + positional embeddings + segment embeddings.
  • Pre-training objectives:
    1. Masked Language Modeling (MLM): Randomly mask ~15% of tokens; predict them.
      • To reduce pretrain–finetune mismatch: 80% [MASK], 10% random token, 10% unchanged (sketched in code after this section).
    2. Next Sentence Prediction (NSP): Predict if sentence B follows sentence A.
  • Fine-tuning:
    • Classification: Use [CLS] representation.
    • QA: Predict start/end token positions.
  • Performance: Outperformed GPT-1 at a comparable parameter count (BERT-base); usable for both fine-tuning and feature extraction.
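
The 80/10/10 masking rule above can be sketched directly. The function below is a simplified illustration (it ignores special tokens and whole-word masking) rather than the exact BERT implementation; the -100 label convention for ignored positions is an assumption borrowed from common MLM training code.

```python
# Simplified BERT-style MLM masking: select ~15% of positions, then apply the
# 80% [MASK] / 10% random / 10% unchanged rule. Special tokens are not handled.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob   # positions to predict
    labels[~selected] = -100                             # ignored in the loss

    masked = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    masked[selected & (rand < 0.8)] = mask_token_id      # 80%: replace with [MASK]

    random_ids = torch.randint(vocab_size, input_ids.shape)
    use_random = selected & (rand >= 0.8) & (rand < 0.9)
    masked[use_random] = random_ids[use_random]          # 10%: random token
    # remaining 10%: keep the original token unchanged
    return masked, labels
```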

RoBERTa — Robustly Optimized BERT Pretraining

  • Changes from BERT:
    • Larger training corpus (~160GB vs. 16GB).
    • Dynamic masking (a fresh masking pattern is sampled each time a sequence is fed to the model, rather than being fixed during preprocessing).
    • Removed NSP objective.
    • Longer training and larger batches.

ALBERT — A Lite BERT

  • Efficiency improvements:
    • Cross-layer parameter sharing (reduces parameters).
    • Factorized embedding parameterization (separate smaller embedding matrix projected to hidden size).
  • Objective change: Replaced NSP with Sentence Order Prediction (SOP) to better model discourse coherence.
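
A quick parameter count illustrates why the factorization helps; the numbers below are illustrative (vocabulary \(V = 30{,}000\), hidden size \(H = 1024\), embedding size \(E = 128\)):

\[
V \times H = 30{,}000 \times 1024 \approx 30.7\text{M}
\quad \text{vs.} \quad
V \times E + E \times H = 3{,}840{,}000 + 131{,}072 \approx 3.97\text{M}
\]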

ELECTRA (2020)

  • Objective: Replaced-token detection (discriminative) rather than MLM (generative); a loss sketch follows this list.
    • Train a small generator to replace some tokens.
    • Train a discriminator to detect replaced tokens.
  • Advantages: More sample-efficient than MLM; trains on all tokens, not just masked ones.
  • Note: Not adversarial like GANs; generator is discarded after pre-training.
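
To make the discriminator's objective concrete, here is a toy sketch of the replaced-token-detection loss. The tensors `disc_logits` (per-token discriminator scores), `corrupted_ids` (the generator's output sequence), and `original_ids` are assumed inputs; names and shapes are illustrative only.

```python
# Toy replaced-token-detection loss: label each position as replaced (1) or
# original (0) and train the discriminator with binary cross-entropy.
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, corrupted_ids, original_ids):
    is_replaced = (corrupted_ids != original_ids).float()
    # loss is computed over *all* token positions, not just the masked 15%
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```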

GPT Family — Decoder-only Transformers

GPT-1 (2018)

  • Architecture: 12-layer Transformer decoder.
  • Training: Left-to-right language modeling (next-token prediction).
  • Task adaptation: Special tokens for classification, multiple-choice QA, etc.
  • Difference from BERT: Unidirectional context; generative LM objective.

GPT-2 (2019)

  • Scale: 1.5B parameters.
  • Key result: Strong zero-shot performance via prompting.
  • Tasks: Summarization, translation, QA without fine-tuning.

GPT-3 (2020)

  • Scale: 175B parameters.
  • Key result: In-context learning — few-shot and zero-shot capabilities without parameter updates.
  • Impact: Popularized prompt engineering.
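
As an illustration of in-context learning, a few-shot prompt might look like the following, loosely following the translation example popularized by the GPT-3 paper; the exact wording is an illustrative assumption.

```python
# Illustrative few-shot prompt: the task is "taught" entirely in the prompt,
# with no gradient updates to the frozen language model.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)
# Feeding `prompt` to the model, the expected continuation is "girafe en peluche".
```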

Seq2Seq Transformers

  • Architecture: Encoder–decoder Transformer.
  • Applications: Translation, summarization, text generation, etc.

T5 (2020) — Text-to-Text Transfer Transformer

  • Unified framework: All tasks cast as text-to-text.
  • Pre-training objective: Span corruption (text infilling) — mask contiguous spans.
  • Data: C4 corpus (~750GB cleaned Common Crawl).
  • Advantage: Single model for multiple NLP tasks.
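
A small example shows what span corruption looks like in T5's sentinel-token format; the sentence and span choices below are illustrative.

```python
# Illustrative span corruption (text infilling): masked spans are replaced by
# sentinel tokens in the input, and the target reproduces the dropped spans.
original = "Thank you for inviting me to your party last week ."
inp      = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
target   = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```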

BART (2020)

  • Architecture: Encoder–decoder Transformer.
  • Pre-training: Denoising autoencoder with:
    • Text infilling (mask spans).
    • Sentence permutation.
  • Applications: Summarization, translation, text generation.
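
An illustrative corruption/reconstruction pair for BART's denoising objective; the sentences, mask placement, and permutation below are made up for illustration.

```python
# Illustrative BART-style noising: spans are infilled with a single mask token
# and sentence order is permuted; the decoder is trained to reconstruct `original`.
original  = "The cat sat on the mat. It was a sunny day. Birds sang outside."
corrupted = "Birds sang outside. The cat <mask> the mat. It was <mask> day."
```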

Fine-tuning Challenges

  • Overfitting: Risk when target dataset is small.
  • Catastrophic forgetting: Fine-tuning can degrade performance on pre-trained knowledge.
  • Memory inefficiency: Storing separate fine-tuned models for each task is costly.

Parameter-Efficient Fine-Tuning (PEFT)

  • Goal: Adapt large models by training only a small subset of parameters.
  • Benefits:
    • Lower compute and storage cost.
    • Easier multi-task deployment (store small task-specific modules).
  • Taxonomy: PEFT methods are commonly grouped into addition-based (adapters, prefixes), specification-based (e.g., BitFit), and reparameterization-based (e.g., LoRA) approaches.

Methods

Adapters (2019)

  • Mechanism:
    • Insert small bottleneck feedforward modules with residual connections between Transformer sub-layers (after attention and feedforward).
    • Only adapter parameters are trained (~1–3% of model size).
  • Structure: Down-projection → nonlinearity → up-projection → residual add.
  • Advantages: Comparable performance to full fine-tuning; store only small adapter weights per task.
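
A minimal adapter module sketch, roughly following the bottleneck structure described above; the hidden and bottleneck sizes are illustrative, and wiring the module into a specific Transformer implementation is omitted.

```python
# Bottleneck adapter: down-projection -> nonlinearity -> up-projection -> residual add.
# Only these parameters would be trained; the surrounding Transformer stays frozen.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # down-projection
        self.up = nn.Linear(bottleneck, hidden_size)     # up-projection
        self.act = nn.GELU()

    def forward(self, x):
        # residual connection: output = sub-layer output + small learned correction
        return x + self.up(self.act(self.down(x)))
```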

Prefix-Tuning

  • Idea: Learn continuous task-specific prefix vectors prepended to the input at each Transformer layer.
  • Setup:
    • Freeze base model parameters \(\theta\).
    • Train prefix parameters \(\phi\) only.
  • Implementation:
    1. Trainable prefix matrix \(P_\phi \in \mathbb{R}^{T' \times D'}\).
    2. MLP maps \(D' \to D\) to match model hidden size.
    3. Concatenate prefix activations to input embeddings:
      \(\mathbf{X'} = [\text{Prefix}_\phi; \mathbf{X}]\).
    4. Feed \(\mathbf{X'}\) into frozen Transformer.
  • Advantages:
    • Very parameter-efficient.
    • Modular — swap prefixes for different tasks.
    • No modification to base model weights.
  • Note: Prefix vectors influence computation via self-attention.
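
A minimal sketch of prefix-tuning's trainable parameters, matching the steps above: a small matrix \(P_\phi\) is reparameterized through an MLP to the model dimension and concatenated in front of the input. Sizes are illustrative, and prepending to every layer's key/value activations (as in the full method) is simplified here to the input sequence.

```python
# Trainable prefix: P_phi (T' x D') -> MLP -> (T' x D), concatenated before the
# input embeddings X; the base Transformer that consumes X' stays frozen.
import torch
import torch.nn as nn

class Prefix(nn.Module):
    def __init__(self, prefix_len=20, d_small=64, d_model=768):
        super().__init__()
        self.P = nn.Parameter(torch.randn(prefix_len, d_small))     # P_phi
        self.mlp = nn.Sequential(nn.Linear(d_small, d_model), nn.Tanh(),
                                 nn.Linear(d_model, d_model))       # maps D' -> D

    def forward(self, x):                      # x: (batch, seq, d_model) embeddings
        prefix = self.mlp(self.P)                                   # (prefix_len, d_model)
        prefix = prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([prefix, x], dim=1)   # X' = [Prefix_phi; X]
```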

LoRA — Low-Rank Adaptation

  • Mechanism:
    • Decompose weight update into low-rank matrices: \(W' = W_0 + AB\).
    • \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\) with \(r \ll \min(d,k)\).
  • Advantages:
    • No extra inference latency (can merge \(AB\) into \(W_0\)).
    • Preserves full sequence length (unlike prefix-tuning).
    • Often easier to optimize than prefix-tuning.
  • Use case: Widely adopted for fine-tuning large LMs with minimal compute.
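
A minimal LoRA linear layer matching the notation above (\(W' = W_0 + AB\)); the rank, scaling factor, and initialization follow common practice and are illustrative assumptions.

```python
# LoRA: freeze the pre-trained weight W0 and learn a low-rank update AB.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(d, k, bias=False)             # pre-trained weight, frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)   # d x r
        self.B = nn.Parameter(torch.zeros(r, k))          # r x k, zero-init so AB = 0 at start
        self.scale = alpha / r

    def forward(self, x):                                 # x: (..., d)
        return self.W0(x) + self.scale * (x @ self.A @ self.B)
```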

BitFit (2021)

  • Mechanism: Fine-tune only bias terms in the model.
  • Performance: Achieves ~95% of full fine-tuning performance on many tasks.
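
BitFit amounts to a single loop over the parameters; the sketch below uses a stand-in `nn.TransformerEncoder` in place of a real pre-trained model.

```python
# BitFit: freeze everything except bias terms, then train as usual.
import torch.nn as nn

model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=768, nhead=12),
                              num_layers=2)              # stand-in for a pre-trained model
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name                  # train only bias terms

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4f}")     # a small fraction of all weights
```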

Diff Pruning (2021)

  • Mechanism: Represent task-specific weights as \(W = W_0 + \Delta W\), where \(\Delta W\) is sparse (a sketch follows this list).
  • Regularization: A sparsity penalty on \(\Delta W\) (a differentiable relaxation of the \(L_0\) norm in the original paper) pushes most entries to zero.
  • Trade-off: Can require more memory than full fine-tuning during training, since a full-size \(\Delta W\) is maintained before pruning; the savings come from storing only the sparse diff per task.
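
A toy sketch of the diff-pruning idea: keep the pre-trained weight frozen and learn a delta with a sparsity penalty. An \(L_1\) penalty stands in here for the paper's differentiable \(L_0\) relaxation, and the bias term is omitted for brevity.

```python
# Frozen base weight plus a trainable (ideally sparse) delta, with a simple
# L1 penalty to be added to the task loss as lambda * sparsity_penalty().
import torch
import torch.nn as nn

class DiffLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        base = nn.Linear(d_in, d_out)
        self.W0 = nn.Parameter(base.weight.detach(), requires_grad=False)  # frozen W_0
        self.delta = nn.Parameter(torch.zeros(d_out, d_in))                # trainable diff

    def forward(self, x):
        return x @ (self.W0 + self.delta).t()

    def sparsity_penalty(self):
        return self.delta.abs().sum()
```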

Summary Table — Model Families & Objectives

| Model | Architecture | Pre-training Objective(s) | Notable Features |
| --- | --- | --- | --- |
| CoVe | BiLSTM encoder | MT seq2seq | Contextual embeddings from MT |
| ELMo | BiLSTM LM | Forward + backward LM | Layer-wise learned combination |
| BERT | Transformer encoder | MLM + NSP | Bidirectional context |
| RoBERTa | Transformer encoder | MLM | Larger data, dynamic masking, no NSP |
| ALBERT | Transformer encoder | MLM + SOP | Parameter sharing, factorized embeddings |
| ELECTRA | Transformer encoder | Replaced token detection | Discriminative pre-training |
| GPT-1/2/3 | Transformer decoder | Next-token prediction | Generative LM, scaling to 175B |
| T5 | Transformer encoder–decoder | Span corruption | Unified text-to-text |
| BART | Transformer encoder–decoder | Text infilling + permutation | Denoising autoencoder |