Part 2 - Week 1

Mrinmaya Sachan
Published Tuesday, April 1, 2025

Transfer Learning

  • Definition: Transfer learning leverages knowledge from a source task/domain to improve performance on a target task/domain, often with limited target data.
  • Key principle: Learn general-purpose representations from a large, diverse dataset, then adapt them to specific downstream tasks.
  • Advantages:
    • Reduces the amount of labeled data required for the target task.
    • Speeds up convergence during training.
    • Can improve performance, especially in low-resource settings.
  • Two-phase training:
    1. Pre-training: Learn general representations from a large corpus (unsupervised/self-supervised or supervised).
    2. Fine-tuning: Adapt the pre-trained model to the target task using a smaller, task-specific dataset.
  • Few-shot / zero-shot learning: In some cases, no weight updates are needed; the model can perform tasks directly from prompts and examples.
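
The two-phase recipe above is easiest to see in code. Below is a minimal sketch of phase 2 (fine-tuning a pre-trained encoder for binary classification) using the Hugging Face transformers API; the checkpoint name, toy dataset, and hyperparameters are placeholders rather than a prescribed setup.

```python
# Minimal fine-tuning sketch, assuming a pre-trained BERT checkpoint and a tiny
# binary sentiment dataset; data and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]          # placeholder target-task data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                # a few passes over the small dataset
    outputs = model(**batch, labels=labels)       # cross-entropy loss on the [CLS] representation
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```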

Early Contextual Embedding Models

CoVe — Contextualized Word Vectors (2017)

  • Architecture: Sequence-to-sequence model for machine translation with a bidirectional LSTM encoder.
  • Process:
    • Input: GloVe embeddings → BiLSTM encoder → contextualized word embeddings.
    • Output: Hidden states from the encoder used as features for downstream tasks.
  • Usage:
    • Concatenate CoVe embeddings with GloVe embeddings for task-specific models (e.g., question classification, entailment, sentiment analysis).
  • Training: Pre-trained on machine translation, then transferred to other NLP tasks.
  • Impact: Demonstrated that contextualized embeddings improve over static embeddings.

ELMo — Deep Contextual Word Representations (2018)

  • Architecture: Two-layer bidirectional LSTM language model (forward and backward).
  • Key innovations:
    • Combines hidden states from all LSTM layers, not just the top layer.
    • Learns task-specific scalar weights for combining layers (weights sum to 1).
  • Training: Pre-trained on a large language modeling task.
  • Impact: Achieved SOTA results at the time; showed the benefit of deep contextualization.
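
The layer combination can be written explicitly, following the formulation in the ELMo paper: \(\mathbf{h}_{k,j}\) is the hidden state of layer \(j\) for token \(k\) (with \(j = 0\) the token embedding), the \(s_j^{task}\) are softmax-normalized scalar weights, and \(\gamma^{task}\) is a task-specific scale:

\[
\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}
\]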

Transformer-based Transfer Learning

BERT (2019) — Bidirectional Encoder Representations from Transformers

  • Architecture: Transformer encoder stack (no decoder).
  • Tokenization: WordPiece (subword units, similar to BPE).
  • Input format:
    • [CLS] token at start (used for classification tasks).
    • [SEP] token between sentence pairs.
    • Embeddings = token embeddings + positional embeddings + segment embeddings.
  • Pre-training objectives:
    1. Masked Language Modeling (MLM): Randomly mask ~15% of tokens; predict them.
      • To reduce pretrain–finetune mismatch: 80% [MASK], 10% random token, 10% unchanged (sketched in code after this section).
    2. Next Sentence Prediction (NSP): Predict if sentence B follows sentence A.
  • Fine-tuning:
    • Classification: Use [CLS] representation.
    • QA: Predict start/end token positions.
  • Performance: Outperformed GPT-1 at a comparable parameter count (BERT-base); usable for both fine-tuning and feature extraction.
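
The 80/10/10 masking rule above can be sketched directly. The function below is a simplified illustration (it ignores special tokens and whole-word masking) rather than the exact BERT implementation; the -100 label convention for ignored positions is an assumption borrowed from common MLM training code.

```python
# Simplified BERT-style MLM masking: select ~15% of positions, then apply the
# 80% [MASK] / 10% random / 10% unchanged rule. Special tokens are not handled.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob   # positions to predict
    labels[~selected] = -100                             # ignored in the loss

    masked = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    masked[selected & (rand < 0.8)] = mask_token_id      # 80%: replace with [MASK]

    random_ids = torch.randint(vocab_size, input_ids.shape)
    use_random = selected & (rand >= 0.8) & (rand < 0.9)
    masked[use_random] = random_ids[use_random]          # 10%: random token
    # remaining 10%: keep the original token unchanged
    return masked, labels
```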

RoBERTa — Robustly Optimized BERT Pretraining

  • Changes from BERT:
    • Larger training corpus (~160GB vs. 16GB).
    • Dynamic masking (a fresh masking pattern is sampled each time a sequence is fed to the model, rather than being fixed during preprocessing).
    • Removed NSP objective.
    • Longer training and larger batches.

ALBERT — A Lite BERT

  • Efficiency improvements:
    • Cross-layer parameter sharing (reduces parameters).
    • Factorized embedding parameterization (separate smaller embedding matrix projected to hidden size).
  • Objective change: Replaced NSP with Sentence Order Prediction (SOP) to better model discourse coherence.
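
A quick parameter count illustrates why the factorization helps; the numbers below are illustrative (vocabulary \(V = 30{,}000\), hidden size \(H = 1024\), embedding size \(E = 128\)):

\[
V \times H = 30{,}000 \times 1024 \approx 30.7\text{M}
\quad \text{vs.} \quad
V \times E + E \times H = 3{,}840{,}000 + 131{,}072 \approx 3.97\text{M}
\]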

ELECTRA (2020)

  • Objective: Replaced-token detection (discriminative) rather than MLM (generative); a loss sketch follows this list.
    • Train a small generator to replace some tokens.
    • Train a discriminator to detect replaced tokens.
  • Advantages: More sample-efficient than MLM; trains on all tokens, not just masked ones.
  • Note: Not adversarial like GANs; generator is discarded after pre-training.
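
To make the discriminator's objective concrete, here is a toy sketch of the replaced-token-detection loss. The tensors `disc_logits` (per-token discriminator scores), `corrupted_ids` (the generator's output sequence), and `original_ids` are assumed inputs; names and shapes are illustrative only.

```python
# Toy replaced-token-detection loss: label each position as replaced (1) or
# original (0) and train the discriminator with binary cross-entropy.
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, corrupted_ids, original_ids):
    is_replaced = (corrupted_ids != original_ids).float()
    # loss is computed over *all* token positions, not just the masked 15%
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```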

GPT Family — Decoder-only Transformers

GPT-1 (2018)

  • Architecture: 12-layer Transformer decoder.
  • Training: Left-to-right language modeling (next-token prediction).
  • Task adaptation: Special tokens for classification, multiple-choice QA, etc.
  • Difference from BERT: Unidirectional context; generative LM objective.

GPT-2 (2019)

  • Scale: 1.5B parameters.
  • Key result: Strong zero-shot performance via prompting.
  • Tasks: Summarization, translation, QA without fine-tuning.

GPT-3 (2020)

  • Scale: 175B parameters.
  • Key result: In-context learning — few-shot and zero-shot capabilities without parameter updates.
  • Impact: Popularized prompt engineering.
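
As an illustration of in-context learning, a few-shot prompt might look like the following, loosely following the translation example popularized by the GPT-3 paper; the exact wording is an illustrative assumption.

```python
# Illustrative few-shot prompt: the task is "taught" entirely in the prompt,
# with no gradient updates to the frozen language model.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)
# Feeding `prompt` to the model, the expected continuation is "girafe en peluche".
```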

Seq2Seq Transformers

  • Architecture: Encoder–decoder Transformer.
  • Applications: Translation, summarization, text generation, etc.

T5 (2020) — Text-to-Text Transfer Transformer

  • Unified framework: All tasks cast as text-to-text.
  • Pre-training objective: Span corruption (text infilling) — mask contiguous spans.
  • Data: C4 corpus (~750GB cleaned Common Crawl).
  • Advantage: Single model for multiple NLP tasks.
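
A small example shows what span corruption looks like in T5's sentinel-token format; the sentence and span choices below are illustrative.

```python
# Illustrative span corruption (text infilling): masked spans are replaced by
# sentinel tokens in the input, and the target reproduces the dropped spans.
original = "Thank you for inviting me to your party last week ."
inp      = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
target   = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```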

BART (2020)

  • Architecture: Encoder–decoder Transformer.
  • Pre-training: Denoising autoencoder with:
    • Text infilling (mask spans).
    • Sentence permutation.
  • Applications: Summarization, translation, text generation.
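
An illustrative corruption/reconstruction pair for BART's denoising objective; the sentences, mask placement, and permutation below are made up for illustration.

```python
# Illustrative BART-style noising: spans are infilled with a single mask token
# and sentence order is permuted; the decoder is trained to reconstruct `original`.
original  = "The cat sat on the mat. It was a sunny day. Birds sang outside."
corrupted = "Birds sang outside. The cat <mask> the mat. It was <mask> day."
```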

Fine-tuning Challenges

  • Overfitting: Risk when target dataset is small.
  • Catastrophic forgetting: Fine-tuning can degrade performance on pre-trained knowledge.
  • Memory inefficiency: Storing separate fine-tuned models for each task is costly.

Parameter-Efficient Fine-Tuning (PEFT)

  • Goal: Adapt large models by training only a small subset of parameters.
  • Benefits:
    • Lower compute and storage cost.
    • Easier multi-task deployment (store small task-specific modules).
  • Taxonomy: PEFT methods are commonly grouped into addition-based (adapters, prefixes), specification-based (e.g., BitFit), and reparameterization-based (e.g., LoRA) approaches.

Methods

Adapters (2019)

  • Mechanism:
    • Insert small bottleneck feedforward modules with residual connections between Transformer sub-layers (after attention and feedforward).
    • Only adapter parameters are trained (~1–3% of model size).
  • Structure: Down-projection → nonlinearity → up-projection → residual add.
  • Advantages: Comparable performance to full fine-tuning; store only small adapter weights per task.
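
A minimal adapter module sketch, roughly following the bottleneck structure described above; the hidden and bottleneck sizes are illustrative, and wiring the module into a specific Transformer implementation is omitted.

```python
# Bottleneck adapter: down-projection -> nonlinearity -> up-projection -> residual add.
# Only these parameters would be trained; the surrounding Transformer stays frozen.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # down-projection
        self.up = nn.Linear(bottleneck, hidden_size)     # up-projection
        self.act = nn.GELU()

    def forward(self, x):
        # residual connection: output = sub-layer output + small learned correction
        return x + self.up(self.act(self.down(x)))
```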

Prefix-Tuning

  • Idea: Learn continuous task-specific prefix vectors prepended to the input at each Transformer layer.
  • Setup:
    • Freeze base model parameters \(\theta\).
    • Train prefix parameters \(\phi\) only.
  • Implementation:
    1. Trainable prefix matrix \(P_\phi \in \mathbb{R}^{T' \times D'}\).
    2. MLP maps \(D' \to D\) to match model hidden size.
    3. Concatenate prefix activations to input embeddings:
      \(\mathbf{X'} = [\text{Prefix}_\phi; \mathbf{X}]\).
    4. Feed \(\mathbf{X'}\) into frozen Transformer.
  • Advantages:
    • Very parameter-efficient.
    • Modular — swap prefixes for different tasks.
    • No modification to base model weights.
  • Note: Prefix vectors influence computation via self-attention.
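
A minimal sketch of prefix-tuning's trainable parameters, matching the steps above: a small matrix \(P_\phi\) is reparameterized through an MLP to the model dimension and concatenated in front of the input. Sizes are illustrative, and prepending to every layer's key/value activations (as in the full method) is simplified here to the input sequence.

```python
# Trainable prefix: P_phi (T' x D') -> MLP -> (T' x D), concatenated before the
# input embeddings X; the base Transformer that consumes X' stays frozen.
import torch
import torch.nn as nn

class Prefix(nn.Module):
    def __init__(self, prefix_len=20, d_small=64, d_model=768):
        super().__init__()
        self.P = nn.Parameter(torch.randn(prefix_len, d_small))     # P_phi
        self.mlp = nn.Sequential(nn.Linear(d_small, d_model), nn.Tanh(),
                                 nn.Linear(d_model, d_model))       # maps D' -> D

    def forward(self, x):                      # x: (batch, seq, d_model) embeddings
        prefix = self.mlp(self.P)                                   # (prefix_len, d_model)
        prefix = prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([prefix, x], dim=1)   # X' = [Prefix_phi; X]
```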

LoRA — Low-Rank Adaptation

  • Mechanism:
    • Decompose weight update into low-rank matrices: \(W' = W_0 + AB\).
    • \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times k}\) with \(r \ll \min(d,k)\).
  • Advantages:
    • No extra inference latency (can merge \(AB\) into \(W_0\)).
    • Preserves full sequence length (unlike prefix-tuning).
    • Often easier to optimize than prefix-tuning.
  • Use case: Widely adopted for fine-tuning large LMs with minimal compute.
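
A minimal LoRA linear layer matching the notation above (\(W' = W_0 + AB\)); the rank, scaling factor, and initialization follow common practice and are illustrative assumptions.

```python
# LoRA: freeze the pre-trained weight W0 and learn a low-rank update AB.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(d, k, bias=False)             # pre-trained weight, frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)   # d x r
        self.B = nn.Parameter(torch.zeros(r, k))          # r x k, zero-init so AB = 0 at start
        self.scale = alpha / r

    def forward(self, x):                                 # x: (..., d)
        return self.W0(x) + self.scale * (x @ self.A @ self.B)
```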

BitFit (2021)

  • Mechanism: Fine-tune only bias terms in the model.
  • Performance: Achieves ~95% of full fine-tuning performance on many tasks.
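
BitFit amounts to a single loop over the parameters; the sketch below uses a stand-in `nn.TransformerEncoder` in place of a real pre-trained model.

```python
# BitFit: freeze everything except bias terms, then train as usual.
import torch.nn as nn

model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=768, nhead=12),
                              num_layers=2)              # stand-in for a pre-trained model
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name                  # train only bias terms

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4f}")     # a small fraction of all weights
```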

Diff Pruning (2021)

  • Mechanism: Represent task-specific weights as \(W = W_0 + \Delta W\), where \(\Delta W\) is sparse (a sketch follows this list).
  • Regularization: A sparsity penalty on \(\Delta W\) (a differentiable relaxation of the \(L_0\) norm in the original paper) pushes most entries to zero.
  • Trade-off: Can require more memory than full fine-tuning during training, since a full-size \(\Delta W\) is maintained before pruning; the savings come from storing only the sparse diff per task.
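
A toy sketch of the diff-pruning idea: keep the pre-trained weight frozen and learn a delta with a sparsity penalty. An \(L_1\) penalty stands in here for the paper's differentiable \(L_0\) relaxation, and the bias term is omitted for brevity.

```python
# Frozen base weight plus a trainable (ideally sparse) delta, with a simple
# L1 penalty to be added to the task loss as lambda * sparsity_penalty().
import torch
import torch.nn as nn

class DiffLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        base = nn.Linear(d_in, d_out)
        self.W0 = nn.Parameter(base.weight.detach(), requires_grad=False)  # frozen W_0
        self.delta = nn.Parameter(torch.zeros(d_out, d_in))                # trainable diff

    def forward(self, x):
        return x @ (self.W0 + self.delta).t()

    def sparsity_penalty(self):
        return self.delta.abs().sum()
```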

Summary Table — Model Families & Objectives

| Model | Architecture | Pre-training Objective(s) | Notable Features |
| --- | --- | --- | --- |
| CoVe | BiLSTM encoder | MT seq2seq | Contextual embeddings from MT |
| ELMo | BiLSTM LM | Forward + backward LM | Layer-wise learned combination |
| BERT | Transformer encoder | MLM + NSP | Bidirectional context |
| RoBERTa | Transformer encoder | MLM | Larger data, dynamic masking, no NSP |
| ALBERT | Transformer encoder | MLM + SOP | Parameter sharing, factorized embeddings |
| ELECTRA | Transformer encoder | Replaced token detection | Discriminative pre-training |
| GPT-1/2/3 | Transformer decoder | Next-token prediction | Generative LM, scaling to 175B |
| T5 | Transformer encoder–decoder | Span corruption | Unified text-to-text |
| BART | Transformer encoder–decoder | Text infilling + permutation | Denoising autoencoder |