Part 2 - Week 2
Mrinmaya Sachan
In-context Learning (ICL)
- Definition: The ability of large language models (LLMs) to perform novel tasks at inference time by conditioning on task descriptions and/or examples provided in the input prompt — without updating model parameters.
- Historical emergence:
- The phenomenon was first systematically highlighted in Brown et al., 2020 with GPT‑3, though earlier transformer LMs (e.g., GPT‑2, 2019) showed rudimentary forms.
- Marked a shift from the pre‑2020 paradigm where pre‑trained LMs were primarily used as initialization for fine‑tuning.
- Emergent capability:
- Appears only in sufficiently large models (scaling laws, Kaplan et al., 2020).
- Contrasts with traditional supervised learning where each task requires explicit parameter updates.
- Advantages:
- Reduces or eliminates the need for fine-tuning.
- Particularly valuable when task-specific labeled data is scarce.
- Enables rapid prototyping and task adaptation.
- View as conditional generation:
- Map an input prompt \(x\) to an augmented prompt \(x'\) that includes instructions/examples.
- Insert a placeholder \(z\) in the text where the model should produce the answer.
- Search over possible completions for \(z\):
- Greedy decoding: Choose the highest-probability answer.
- Sampling: Generate multiple candidates and select based on scoring or consistency.
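A minimal sketch of ICL as conditional generation, assuming the Hugging Face transformers library with GPT-2 as a stand-in for a large LM; the prompt wording and decoding settings are illustrative only.
```python
# Sketch: ICL as conditional generation with greedy decoding vs. sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# x' = instruction + demonstration + test input; the answer slot z is left open
# at the end of the prompt and filled in by the decoder.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: take the single highest-probability continuation.
greedy = model.generate(
    **inputs, max_new_tokens=5, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

# Sampling: draw several candidates, then pick one by a score or by consistency.
samples = model.generate(
    **inputs, max_new_tokens=5, do_sample=True,
    temperature=0.8, num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
for s in samples:
    print(tokenizer.decode(s, skip_special_tokens=True))
```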
Prompting Techniques
Zero-shot Prompting
- Definition: Provide only a natural language instruction describing the task, without examples.
- Chronology: Became viable with very large LMs (GPT‑3, 2020) that had enough world knowledge and pattern recognition to generalize from instructions alone.
- Key property: Relies heavily on the model’s pretraining distribution and instruction-following ability.
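An illustrative zero-shot prompt (task and wording chosen purely as an example): the task is described in natural language only, with no demonstrations, so the model must rely on its pretraining knowledge.
```python
# Illustrative zero-shot prompt: instruction only, no input-output examples.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The film was slow, predictable, and far too long.\n"
    "Sentiment:"
)
```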
Few-shot Prompting
- Definition: Provide a small number of input–output demonstrations in the prompt before the test query.
- Origins: Demonstrated in Brown et al., 2020 as a way to elicit task-specific behavior without gradient updates.
- Benefits:
- Establishes the expected output format.
- Guides the model toward the desired output distribution.
- Relation to meta-learning: Functions as a form of “on-the-fly” adaptation using context as training data.
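A minimal few-shot prompt builder, sketched under the assumption of a simple text-classification task; the helper name `build_few_shot_prompt`, the template, and the demonstrations are illustrative, not a prescribed format.
```python
# Sketch of a few-shot prompt: k demonstrations establish the input/output format
# before the test query is appended.
def build_few_shot_prompt(demonstrations, query):
    lines = ["Classify the sentiment as positive or negative.\n"]
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("A delightful, heartfelt story.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]
print(build_few_shot_prompt(demos, "The acting was superb."))
```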
Effective Prompt Design
- Key principles:
- Provide enough context to disambiguate the task without exceeding model context limits.
- Place the question or task instruction in a position that maximizes attention (often before the context).
- Use clear, unambiguous language.
- Limitations:
- Prompt design is still heuristic; performance can vary significantly with small changes.
- Prompt sensitivity documented in Zhao et al., 2021 (“Calibrate Before Use”).
Automated Prompt Optimization
- Goal: Systematically improve prompts to maximize model performance.
- Methods:
- Corpus mining:
- Search large text corpora for naturally occurring patterns that connect inputs to outputs.
- Example: Extracting bridging phrases that connect inputs to outputs in parallel/bilingual corpora to build translation prompts (cf. mining-based prompt generation, Jiang et al., 2020).
- Paraphrasing approaches:
- Back-translation: Translate the prompt into another language and back to generate variations (a minimal sketch follows this section).
- Synonym substitution: Replace words with synonyms to test robustness.
- Learned rewriters:
- Train models to rewrite prompts for better performance (meta-optimization; cf. Jiang et al., 2020, “How Can We Know What Language Models Know?”).
- Impact: Moves prompt engineering from manual trial-and-error to data-driven search.
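The back-translation sketch referenced above, assuming the Hugging Face transformers library and the Helsinki-NLP MarianMT checkpoints (any translation pair would do); the prompt and the choice of German as pivot language are illustrative.
```python
# Sketch of prompt paraphrasing via back-translation (English -> German -> English).
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True)
    out = model.generate(**batch)
    return [tok.decode(t, skip_special_tokens=True) for t in out]

prompt = "Summarize the following article in one sentence."
german = translate([prompt], "Helsinki-NLP/opus-mt-en-de")
paraphrases = translate(german, "Helsinki-NLP/opus-mt-de-en")
print(paraphrases)  # candidate prompt variants to evaluate on a dev set
```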
Continuous Prompting
- Definition: Learn prompt representations directly in the model’s embedding space.
- Relation to PEFT: Similar to prefix-tuning (Li and Liang, 2021) — prepend trainable continuous vectors to the input embeddings.
- Variants:
- Start from a manually written prompt and fine-tune it in embedding space.
- Fully learned continuous prompts without human-readable text (Lester et al., 2021, “The Power of Scale”).
- Properties:
- Can be sensitive to small changes in the learned vectors.
- Not constrained by natural language syntax.
- Often more parameter-efficient than full fine-tuning.
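A minimal sketch of a continuous ("soft") prompt in the spirit of prefix-/prompt-tuning: trainable vectors are prepended to the token embeddings while the LM stays frozen. GPT-2 is a stand-in model; the prompt length, initialization scale, and helper name `forward_with_soft_prompt` are illustrative assumptions.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():          # freeze all LM weights
    p.requires_grad = False

n_prompt_tokens, hidden = 10, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt_tokens, hidden) * 0.02)

def forward_with_soft_prompt(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_emb = model.get_input_embeddings()(ids)                  # (1, T, hidden)
    emb = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)  # prepend soft prompt
    return model(inputs_embeds=emb).logits

logits = forward_with_soft_prompt("The movie was great. Sentiment:")
# During training, only `soft_prompt` would receive gradients, e.g.
# torch.optim.Adam([soft_prompt], lr=1e-3)
```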
Theoretical Perspectives on ICL
- Hypothesis 1 — Task Selection:
- The model stores many task-specific behaviors from pretraining.
- The prompt acts as a key to retrieve the relevant behavior.
- Related to retrieval-augmented generation ideas.
- Hypothesis 2 — Meta-learning:
- The model has learned general learning algorithms during pretraining (Xie et al., 2022).
- At inference, it uses in-context examples to adapt to the task (learning from the prompt).
- Analogous to meta-learners such as MAML, but adaptation happens without any gradient updates, implicitly in the model’s hidden states/activations.
- Hypothesis 3 — Structured Task Composition:
- The model decomposes complex prompts into familiar subtasks and composes solutions.
- Supported by findings in compositional generalization research.
Advanced Prompting for Complex Tasks
Step-by-step Solutions
- Explicitly request intermediate reasoning steps.
- Improves performance by breaking down large inferential leaps into smaller, verifiable steps.
- Early evidence in Wei et al., 2022 (Chain-of-Thought).
Chain-of-Thought (CoT) Prompting
- Definition: Include examples in the prompt that demonstrate explicit reasoning before the final answer.
- Effect: Encourages the model to generate intermediate reasoning steps.
- Chronology: Popularized by Wei et al., 2022; shown to significantly improve reasoning in large models (≥62B parameters).
- Note: Different from training models to produce chains of thought (e.g., OpenAI o1, DeepSeek-R1), where reasoning is reinforced via RL.
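An illustrative chain-of-thought prompt, adapted from the exemplars in Wei et al., 2022: the demonstration shows the reasoning steps before the final answer, and the test question is appended so the model imitates the same step-by-step format.
```python
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?\n"
    "A:"
)
```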
Least-to-Most Prompting
- Idea: Decompose a complex problem into a sequence of simpler subproblems.
- Process:
- Solve the simplest subproblem.
- Use its solution as context for the next subproblem.
- Repeat until the final solution is reached.
- Origin: Zhou et al., 2022; shown to improve compositional reasoning.
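A rough sketch of the least-to-most loop: first ask the model to decompose the problem, then solve subproblems in order, feeding each answer back into the context. GPT-2 via the transformers pipeline is only a stand-in for a capable LLM; the question (from Zhou et al., 2022), the prompts, and the helper `ask` are illustrative.
```python
from transformers import pipeline

llm = pipeline("text-generation", model="gpt2")

def ask(prompt):
    out = llm(prompt, max_new_tokens=64, return_full_text=False)
    return out[0]["generated_text"].strip()

question = ("It takes Amy 4 minutes to climb to the top of a slide and 1 minute "
            "to slide down. The slide closes in 15 minutes. "
            "How many times can she slide before it closes?")

# Stage 1: decomposition into simpler subquestions.
subquestions = ask("Break this problem into simpler subquestions:\n" + question + "\n")

# Stage 2: solve subquestions sequentially, accumulating answers in the context.
context = question + "\n"
for sub in subquestions.split("\n"):
    sub = sub.strip()
    if not sub:
        continue
    answer = ask(context + "Q: " + sub + "\nA:")
    context += "Q: " + sub + "\nA: " + answer + "\n"

final_answer = ask(context + "Q: " + question + "\nA:")
```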
Program-of-Thought Prompting
- Definition: Express reasoning as executable code (e.g., Python).
- Advantage: Offloading computation to an interpreter guarantees that well-defined calculations are executed exactly, avoiding arithmetic slips in generated text (though the generated program itself may still be wrong).
- Origin: Chen et al., 2022 (Program of Thoughts); closely related to PAL (Program-Aided Language Models, Gao et al., 2022).
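A minimal sketch of program-of-thought execution: the model is asked to emit Python whose execution yields the answer. Here `generated_program` is a hard-coded example of what a model might produce; in practice it would come from an LLM call, and untrusted generated code should be run in a sandbox.
```python
generated_program = """
balls = 5            # Roger starts with 5 tennis balls
balls += 2 * 3       # 2 cans of 3 balls each
answer = balls
"""

namespace = {}
exec(generated_program, namespace)   # deterministic execution of the reasoning
print(namespace["answer"])           # 11
```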
Self-Consistency for Reasoning
- Motivation: CoT outputs can vary; a single reasoning path may be incorrect.
- Method:
- Sample \(n\) reasoning paths from the model for the same question.
- Extract the final answer from each path.
- Select the most frequent answer (majority vote).
- Effect: Improves robustness by aggregating over multiple reasoning trajectories.
- Origin: Wang et al., 2022; demonstrated large gains on math word problems and commonsense reasoning benchmarks.
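A minimal self-consistency sketch: sample several chain-of-thought completions for the same question, extract each final answer, and take a majority vote. The answer-extraction regex assumes answers end with "The answer is <number>" (as in the CoT exemplars); the sampled completions shown are mock stand-ins for real model outputs.
```python
import re
from collections import Counter

def extract_answer(completion):
    match = re.search(r"The answer is (-?\d+)", completion)
    return match.group(1) if match else None

def self_consistent_answer(completions):
    answers = [a for a in (extract_answer(c) for c in completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# `completions` would normally be n sampled CoT outputs for the same question.
completions = [
    "5 + 6 = 11. The answer is 11.",
    "Roger has 5 + 2 * 3 = 11 balls. The answer is 11.",
    "5 + 2 = 7. The answer is 7.",
]
print(self_consistent_answer(completions))  # 11 (majority vote over 3 paths)
```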