Part 2 - Week 2
Mrinmaya Sachan
In-context Learning (ICL)
- Definition: The ability of large language models (LLMs) to perform novel tasks at inference time by conditioning on task descriptions and/or examples provided in the input prompt — without updating model parameters.
- Historical emergence:
- The phenomenon was first systematically highlighted in Brown et al., 2020 with GPT‑3, though earlier transformer LMs (e.g., GPT‑2, 2019) showed rudimentary forms.
- Marked a shift from the pre‑2020 paradigm where pre‑trained LMs were primarily used as initialization for fine‑tuning.
- Emergent capability:
- Appears only in sufficiently large models (scaling laws, Kaplan et al., 2020).
- Contrasts with traditional supervised learning where each task requires explicit parameter updates.
- Advantages:
- Reduces or eliminates the need for fine-tuning.
- Particularly valuable when task-specific labeled data is scarce.
- Enables rapid prototyping and task adaptation.
- View as conditional generation:
- Map an input prompt \(x\) to an augmented prompt \(x'\) that includes instructions/examples.
- Insert a placeholder \(z\) in the text where the model should produce the answer.
- Search over possible completions for \(z\):
- Greedy decoding: Choose the highest-probability answer.
- Sampling: Generate multiple candidates and select based on scoring or consistency.
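A minimal sketch of ICL as conditional generation, assuming the Hugging Face transformers library with GPT-2 as a stand-in for a large LM; the prompt wording and decoding settings are illustrative only.
```python
# Sketch: ICL as conditional generation with greedy decoding vs. sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# x' = instruction + demonstration + test input; the answer slot z is left open
# at the end of the prompt and filled in by the decoder.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: take the single highest-probability continuation.
greedy = model.generate(
    **inputs, max_new_tokens=5, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

# Sampling: draw several candidates, then pick one by a score or by consistency.
samples = model.generate(
    **inputs, max_new_tokens=5, do_sample=True,
    temperature=0.8, num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
for s in samples:
    print(tokenizer.decode(s, skip_special_tokens=True))
```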
Prompting Techniques
Zero-shot Prompting
- Definition: Provide only a natural language instruction describing the task, without examples.
- Chronology: Became viable with very large LMs (GPT‑3, 2020) that had enough world knowledge and pattern recognition to generalize from instructions alone.
- Key property: Relies heavily on the model’s pretraining distribution and instruction-following ability.
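An illustrative zero-shot prompt (task and wording chosen purely as an example): the task is described in natural language only, with no demonstrations, so the model must rely on its pretraining knowledge.
```python
# Illustrative zero-shot prompt: instruction only, no input-output examples.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The film was slow, predictable, and far too long.\n"
    "Sentiment:"
)
```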
Few-shot Prompting
- Definition: Provide a small number of input–output demonstrations in the prompt before the test query.
- Origins: Demonstrated in Brown et al., 2020 as a way to elicit task-specific behavior without gradient updates.
- Benefits:
- Establishes the expected output format.
- Guides the model toward the desired output distribution.
- Relation to meta-learning: Functions as a form of “on-the-fly” adaptation using context as training data.
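A minimal few-shot prompt builder, sketched under the assumption of a simple text-classification task; the helper name `build_few_shot_prompt`, the template, and the demonstrations are illustrative, not a prescribed format.
```python
# Sketch of a few-shot prompt: k demonstrations establish the input/output format
# before the test query is appended.
def build_few_shot_prompt(demonstrations, query):
    lines = ["Classify the sentiment as positive or negative.\n"]
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("A delightful, heartfelt story.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]
print(build_few_shot_prompt(demos, "The acting was superb."))
```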
Effective Prompt Design
- Key principles:
- Provide enough context to disambiguate the task without exceeding model context limits.
- Place the question or task instruction in a position that maximizes attention (often before the context).
- Use clear, unambiguous language.
- Limitations:
- Prompt design is still heuristic; performance can vary significantly with small changes.
- Prompt sensitivity documented in Zhao et al., 2021 (“Calibrate Before Use”).
Automated Prompt Optimization
- Goal: Systematically improve prompts to maximize model performance.
- Methods:
- Corpus mining:
- Search large text corpora for naturally occurring patterns that connect inputs to outputs.
- Example: Extracting bridging phrases that connect inputs to outputs in parallel/bilingual corpora to build translation prompts (cf. mining-based prompt generation, Jiang et al., 2020).
- Paraphrasing approaches:
- Back-translation: Translate the prompt into another language and back to generate variations (a minimal sketch follows this section).
- Synonym substitution: Replace words with synonyms to test robustness.
- Learned rewriters:
- Train models to rewrite prompts for better performance (meta-optimization; cf. Jiang et al., 2020, “How Can We Know What Language Models Know?”).
- Impact: Moves prompt engineering from manual trial-and-error to data-driven search.
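The back-translation sketch referenced above, assuming the Hugging Face transformers library and the Helsinki-NLP MarianMT checkpoints (any translation pair would do); the prompt and the choice of German as pivot language are illustrative.
```python
# Sketch of prompt paraphrasing via back-translation (English -> German -> English).
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True)
    out = model.generate(**batch)
    return [tok.decode(t, skip_special_tokens=True) for t in out]

prompt = "Summarize the following article in one sentence."
german = translate([prompt], "Helsinki-NLP/opus-mt-en-de")
paraphrases = translate(german, "Helsinki-NLP/opus-mt-de-en")
print(paraphrases)  # candidate prompt variants to evaluate on a dev set
```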
Continuous Prompting
- Definition: Learn prompt representations directly in the model’s embedding space.
- Relation to PEFT: Similar to prefix-tuning (Li and Liang, 2021) — prepend trainable continuous vectors to the input embeddings.
- Variants:
- Start from a manually written prompt and fine-tune it in embedding space.
- Fully learned continuous prompts without human-readable text (Lester et al., 2021, “The Power of Scale”).
- Properties:
- Can be sensitive to small changes in the learned vectors.
- Not constrained by natural language syntax.
- Often more parameter-efficient than full fine-tuning.
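A minimal sketch of a continuous ("soft") prompt in the spirit of prefix-/prompt-tuning: trainable vectors are prepended to the token embeddings while the LM stays frozen. GPT-2 is a stand-in model; the prompt length, initialization scale, and helper name `forward_with_soft_prompt` are illustrative assumptions.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():          # freeze all LM weights
    p.requires_grad = False

n_prompt_tokens, hidden = 10, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt_tokens, hidden) * 0.02)

def forward_with_soft_prompt(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_emb = model.get_input_embeddings()(ids)                  # (1, T, hidden)
    emb = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)  # prepend soft prompt
    return model(inputs_embeds=emb).logits

logits = forward_with_soft_prompt("The movie was great. Sentiment:")
# During training, only `soft_prompt` would receive gradients, e.g.
# torch.optim.Adam([soft_prompt], lr=1e-3)
```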
Theoretical Perspectives on ICL
- Hypothesis 1 — Task Selection:
- The model stores many task-specific behaviors from pretraining.
- The prompt acts as a key to retrieve the relevant behavior.
- Related to retrieval-augmented generation ideas.
- Hypothesis 2 — Meta-learning:
- The model has learned general learning algorithms during pretraining (Xie et al., 2022).
- At inference, it uses in-context examples to adapt to the task (learning from the prompt).
- Analogous to meta-learners such as MAML, but adaptation happens without any gradient updates, implicitly in the model’s hidden states/activations.
- Hypothesis 3 — Structured Task Composition:
- The model decomposes complex prompts into familiar subtasks and composes solutions.
- Supported by findings in compositional generalization research.
Advanced Prompting for Complex Tasks
Step-by-step Solutions
- Explicitly request intermediate reasoning steps.
- Improves performance by breaking down large inferential leaps into smaller, verifiable steps.
- Early evidence in Wei et al., 2022 (Chain-of-Thought).
Chain-of-Thought (CoT) Prompting
- Definition: Include examples in the prompt that demonstrate explicit reasoning before the final answer.
- Effect: Encourages the model to generate intermediate reasoning steps.
- Chronology: Popularized by Wei et al., 2022; shown to significantly improve reasoning in large models (≥62B parameters).
- Note: Different from training models to produce chains of thought (e.g., OpenAI o1, DeepSeek-R1), where reasoning is reinforced via RL.
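An illustrative chain-of-thought prompt, adapted from the exemplars in Wei et al., 2022: the demonstration shows the reasoning steps before the final answer, and the test question is appended so the model imitates the same step-by-step format.
```python
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?\n"
    "A:"
)
```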
Least-to-Most Prompting
- Idea: Decompose a complex problem into a sequence of simpler subproblems.
- Process:
- Solve the simplest subproblem.
- Use its solution as context for the next subproblem.
- Repeat until the final solution is reached.
- Origin: Zhou et al., 2022; shown to improve compositional reasoning.
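A rough sketch of the least-to-most loop: first ask the model to decompose the problem, then solve subproblems in order, feeding each answer back into the context. GPT-2 via the transformers pipeline is only a stand-in for a capable LLM; the question (from Zhou et al., 2022), the prompts, and the helper `ask` are illustrative.
```python
from transformers import pipeline

llm = pipeline("text-generation", model="gpt2")

def ask(prompt):
    out = llm(prompt, max_new_tokens=64, return_full_text=False)
    return out[0]["generated_text"].strip()

question = ("It takes Amy 4 minutes to climb to the top of a slide and 1 minute "
            "to slide down. The slide closes in 15 minutes. "
            "How many times can she slide before it closes?")

# Stage 1: decomposition into simpler subquestions.
subquestions = ask("Break this problem into simpler subquestions:\n" + question + "\n")

# Stage 2: solve subquestions sequentially, accumulating answers in the context.
context = question + "\n"
for sub in subquestions.split("\n"):
    sub = sub.strip()
    if not sub:
        continue
    answer = ask(context + "Q: " + sub + "\nA:")
    context += "Q: " + sub + "\nA: " + answer + "\n"

final_answer = ask(context + "Q: " + question + "\nA:")
```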
Program-of-Thought Prompting
- Definition: Express reasoning as executable code (e.g., Python).
- Advantage: Offloading computation to an interpreter guarantees that well-defined calculations are executed exactly, avoiding arithmetic slips in generated text (though the generated program itself may still be wrong).
- Origin: Chen et al., 2022 (Program of Thoughts); closely related to PAL (Program-Aided Language Models, Gao et al., 2022).
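A minimal sketch of program-of-thought execution: the model is asked to emit Python whose execution yields the answer. Here `generated_program` is a hard-coded example of what a model might produce; in practice it would come from an LLM call, and untrusted generated code should be run in a sandbox.
```python
generated_program = """
balls = 5            # Roger starts with 5 tennis balls
balls += 2 * 3       # 2 cans of 3 balls each
answer = balls
"""

namespace = {}
exec(generated_program, namespace)   # deterministic execution of the reasoning
print(namespace["answer"])           # 11
```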
Self-Consistency for Reasoning
- Motivation: CoT outputs can vary; a single reasoning path may be incorrect.
- Method:
- Sample \(n\) reasoning paths from the model for the same question.
- Extract the final answer from each path.
- Select the most frequent answer (majority vote).
- Effect: Improves robustness by aggregating over multiple reasoning trajectories.
- Origin: Wang et al., 2022; demonstrated large gains on math word problems and commonsense reasoning benchmarks.
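A minimal self-consistency sketch: sample several chain-of-thought completions for the same question, extract each final answer, and take a majority vote. The answer-extraction regex assumes answers end with "The answer is <number>" (as in the CoT exemplars); the sampled completions shown are mock stand-ins for real model outputs.
```python
import re
from collections import Counter

def extract_answer(completion):
    match = re.search(r"The answer is (-?\d+)", completion)
    return match.group(1) if match else None

def self_consistent_answer(completions):
    answers = [a for a in (extract_answer(c) for c in completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# `completions` would normally be n sampled CoT outputs for the same question.
completions = [
    "5 + 6 = 11. The answer is 11.",
    "Roger has 5 + 2 * 3 = 11 balls. The answer is 11.",
    "5 + 2 = 7. The answer is 7.",
]
print(self_consistent_answer(completions))  # 11 (majority vote over 3 paths)
```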