Part 2 - Week 2

Mrinmaya Sachan
Published: Tuesday, April 8, 2025

In-context Learning (ICL)

  • Definition: The ability of large language models (LLMs) to perform novel tasks at inference time by conditioning on task descriptions and/or examples provided in the input prompt — without updating model parameters.
  • Historical emergence:
    • The phenomenon was first systematically highlighted in Brown et al., 2020 with GPT‑3, though earlier transformer LMs (e.g., GPT‑2, 2019) showed rudimentary forms.
    • Marked a shift from the pre‑2020 paradigm where pre‑trained LMs were primarily used as initialization for fine‑tuning.
  • Emergent capability:
    • Appears only in sufficiently large models (scaling laws, Kaplan et al., 2020).
    • Contrasts with traditional supervised learning where each task requires explicit parameter updates.
  • Advantages:
    • Reduces or eliminates the need for fine-tuning.
    • Particularly valuable when task-specific labeled data is scarce.
    • Enables rapid prototyping and task adaptation.
  • View as conditional generation:
    • Map an input prompt \(x\) to an augmented prompt \(x'\) that includes instructions/examples.
    • Insert a placeholder \(z\) in the text where the model should produce the answer.
    • Search over possible completions for \(z\) (a minimal sketch follows this list):
      • Greedy decoding: Choose the highest-probability answer.
      • Sampling: Generate multiple candidates and select based on scoring or consistency.
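
A minimal sketch of this conditional-generation view, assuming a Hugging Face `transformers` causal LM; the `build_prompt` helper and the demonstration pair are illustrative, not part of the lecture:

```python
# In-context learning as conditional generation: build an augmented
# prompt x' and decode the answer slot z at the end.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def build_prompt(x, demonstrations):
    """Map input x to augmented prompt x'; the answer slot z is at the end."""
    demos = "\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in demonstrations)
    return f"{demos}\nInput: {x}\nOutput:"

x_prime = build_prompt("cheese", [("sea otter", "loutre de mer")])
inputs = tokenizer(x_prime, return_tensors="pt")

# Greedy decoding: take the single highest-probability completion for z.
greedy = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Sampling: draw several candidates for z, then pick one by scoring or
# consistency (e.g., majority vote over extracted answers).
samples = model.generate(**inputs, max_new_tokens=10, do_sample=True,
                         temperature=0.7, num_return_sequences=5)
```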

Prompting Techniques

Zero-shot Prompting

  • Definition: Provide only a natural language instruction describing the task, without examples.
  • Chronology: Became viable with very large LMs (GPT‑3, 2020) that had enough world knowledge and pattern recognition to generalize from instructions alone.
  • Key property: Relies heavily on the model’s pretraining distribution and instruction-following ability (a minimal example follows).
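
For instance, a zero-shot prompt contains only the instruction and the test input; the translation task below mirrors the examples in Brown et al., 2020:

```python
# Zero-shot prompt: a task instruction and the test input, no demonstrations.
prompt = (
    "Translate English to French:\n"
    "cheese =>"
)
```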

Few-shot Prompting

  • Definition: Provide a small number of input–output demonstrations in the prompt before the test query.
  • Origins: Demonstrated in Brown et al., 2020 as a way to elicit task-specific behavior without gradient updates.
  • Benefits:
    • Establishes the expected output format.
    • Guides the model toward the desired output distribution.
  • Relation to meta-learning: Functions as a form of “on-the-fly” adaptation using context as training data (see the example below).
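
The few-shot version of the same translation task prepends demonstrations; the examples below are taken from Brown et al., 2020:

```python
# Few-shot prompt: input–output demonstrations establish the expected
# format before the test query (demonstrations from Brown et al., 2020).
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
```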

Effective Prompt Design

  • Key principles:
    • Provide enough context to disambiguate the task without exceeding model context limits.
    • Place the question or task instruction in a position that maximizes attention (often before the context).
    • Use clear, unambiguous language.
  • Limitations:
    • Prompt design is still heuristic; performance can vary significantly with small changes.
    • Prompt sensitivity documented in Zhao et al., 2021 (“Calibrate Before Use”).

Automated Prompt Optimization

  • Goal: Systematically improve prompts to maximize model performance.
  • Methods:
    • Corpus mining:
      • Search large text corpora for naturally occurring patterns that connect inputs to outputs.
      • Example: Mining Wikipedia for phrases that link subject–object pairs, yielding prompts for knowledge probing (Jiang et al., 2020, “How Can We Know What Language Models Know?”).
    • Paraphrasing approaches (a back-translation sketch follows this section):
      • Back-translation: Translate the prompt to another language and back to generate variations.
      • Synonym substitution: Replace words with synonyms to test robustness.
    • Learned rewriters and prompt search:
      • Train models to rewrite prompts, or search for prompt tokens directly with gradient guidance, as in AutoPrompt (Shin et al., 2020).
  • Impact: Moves prompt engineering from manual trial-and-error to data-driven search.
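
A hedged sketch of back-translation paraphrasing, using the publicly available Helsinki-NLP MarianMT models via the `transformers` pipeline; the pivot language and prompt are illustrative choices:

```python
# Back-translation: round-trip a prompt through a pivot language (here
# French) to generate paraphrased prompt variants.
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(prompt: str) -> str:
    """Return a paraphrase of `prompt` via English -> French -> English."""
    french = en_fr(prompt)[0]["translation_text"]
    return fr_en(french)[0]["translation_text"]

variant = back_translate("Summarize the following article in one sentence:")
```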

Continuous Prompting

  • Definition: Learn prompt representations directly in the model’s embedding space.
  • Relation to PEFT: Similar to prefix-tuning (Li and Liang, 2021) — prepend trainable continuous vectors to the input embeddings.
  • Variants:
    • Start from a manually written prompt and fine-tune it in embedding space.
    • Fully learned continuous prompts without human-readable text (Lester et al., 2021, “The Power of Scale”).
  • Properties:
    • Can be more sensitive to small changes in learned vectors.
    • Not constrained by natural language syntax.
    • Often more parameter-efficient than full fine-tuning (a minimal sketch follows).
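
A minimal sketch of soft prompt tuning in the spirit of Lester et al., 2021, assuming a Hugging Face causal LM; the prompt length and initialization scale are illustrative assumptions:

```python
# Prompt tuning: freeze the LM and learn only a small matrix of
# continuous prompt embeddings prepended to the input embeddings.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # freeze all LM weights

n_prompt, dim = 20, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)  # only the prompt trains

def loss_with_soft_prompt(input_ids, labels):
    tok_emb = model.get_input_embeddings()(input_ids)          # (B, T, D)
    prefix = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)        # (B, P+T, D)
    # Ignore the loss on prompt positions by padding labels with -100.
    pad = torch.full((input_ids.size(0), n_prompt), -100, dtype=labels.dtype)
    return model(inputs_embeds=inputs_embeds,
                 labels=torch.cat([pad, labels], dim=1)).loss
```

In a training loop, `loss_with_soft_prompt(...).backward()` followed by `optimizer.step()` updates only the 20×D prompt matrix, which is what makes the method parameter-efficient.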

Theoretical Perspectives on ICL

  • Hypothesis 1 — Task Selection:
    • The model stores many task-specific behaviors from pretraining.
    • The prompt acts as a key to retrieve the relevant behavior.
    • Related to retrieval-augmented generation ideas; the implicit Bayesian inference account (Xie et al., 2022), in which the prompt helps the model locate a latent pretraining task, is in this spirit.
  • Hypothesis 2 — Meta-learning:
    • The model has learned general-purpose learning algorithms during pretraining (e.g., Akyürek et al., 2022; von Oswald et al., 2023).
    • At inference, it uses in-context examples to adapt to the task (learning from the prompt).
    • Analogous to meta-learners such as MAML, but adaptation happens in the forward pass (hidden states) without any gradient updates.
  • Hypothesis 3 — Structured Task Composition:
    • The model decomposes complex prompts into familiar subtasks and composes solutions.
    • Supported by findings in compositional generalization research.

Advanced Prompting for Complex Tasks

Step-by-step Solutions

  • Explicitly request intermediate reasoning steps.
  • Improves performance by breaking down large inferential leaps into smaller, verifiable steps.
  • Early evidence in Wei et al., 2022 (Chain-of-Thought).

Chain-of-Thought (CoT) Prompting

  • Definition: Include examples in the prompt that demonstrate explicit reasoning before the final answer.
  • Effect: Encourages the model to generate intermediate reasoning steps.
  • Chronology: Popularized by Wei et al., 2022; shown to significantly improve reasoning only in sufficiently large models (gains emerged around the 60–100B-parameter scale in the original study).
  • Note: Distinct from CoT training (e.g., OpenAI o1, DeepSeek‑R1), where reasoning is reinforced via RL rather than elicited purely by prompting; an example CoT prompt follows.
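
An example few-shot CoT prompt; the arithmetic demonstrations are taken from the figures in Wei et al., 2022:

```python
# Chain-of-thought prompt: the demonstration works through intermediate
# steps before stating the final answer (example from Wei et al., 2022).
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
```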

Least-to-Most Prompting

  • Idea: Decompose a complex problem into a sequence of simpler subproblems.
  • Process:
    1. Solve the simplest subproblem.
    2. Use its solution as context for the next subproblem.
    3. Repeat until the final solution is reached.
  • Origin: Zhou et al., 2022; shown to improve compositional reasoning (a sketch of the loop follows).
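
A minimal sketch of the least-to-most loop; `generate` is a hypothetical helper that queries an LM, and the prompt wording is illustrative:

```python
# Least-to-most prompting: decompose, then solve subproblems in order,
# feeding each answer back into the context for the next one.
def least_to_most(question, generate):
    # Stage 1: ask the model for subquestions, simplest first.
    decomposition = generate(
        f"Break this problem into simpler subquestions, easiest first.\n"
        f"Q: {question}\nSubquestions:"
    )
    subquestions = [s for s in decomposition.splitlines() if s.strip()]

    # Stage 2: solve each subquestion with earlier answers as context.
    context, answer = "", ""
    for sub in subquestions:
        answer = generate(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer  # the answer to the last (hardest) subquestion
```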

Program-of-Thought Prompting

  • Definition: Express reasoning as executable code (e.g., Python).
  • Advantage: Offloads computation to an interpreter, so well-defined calculations are executed exactly (though the generated program itself may still be wrong).
  • Origin: Chen et al., 2022; closely related to PAL (Program-Aided Language Models, Gao et al., 2022). An execution sketch follows.
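
A minimal sketch of the execution step, assuming the model has already returned a program as a string; real systems sandbox this rather than calling `exec` directly:

```python
# Program-of-thought: the reasoning is Python code; running it yields
# the answer deterministically, with no further LM involvement.
generated_code = """
blue = 2                  # bolts of blue fiber
white = blue / 2          # half that much white fiber
answer = blue + white
"""

namespace = {}
exec(generated_code, namespace)   # unsafe outside a sandbox
print(namespace["answer"])        # -> 3.0
```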

Self-Consistency for Reasoning

  • Motivation: CoT outputs can vary; a single reasoning path may be incorrect.
  • Method:
    1. Sample \(n\) reasoning paths from the model for the same question.
    2. Extract the final answer from each path.
    3. Select the most frequent answer (majority vote).
  • Effect: Improves robustness by aggregating over multiple reasoning trajectories (see the sketch below).
  • Origin: Wang et al., 2022; demonstrated large gains on math word problems and commonsense reasoning benchmarks.
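
A minimal sketch of self-consistency; `generate` is again a hypothetical LM helper, and the answer-extraction regex assumes the “The answer is N” convention used in the CoT demonstrations above:

```python
# Self-consistency: sample several CoT paths at nonzero temperature,
# extract each final answer, and return the majority vote.
import re
from collections import Counter

def extract_answer(text):
    """Pull the final number after 'The answer is' from a CoT completion."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None

def self_consistency(prompt, generate, n=10):
    answers = [extract_answer(generate(prompt, temperature=0.7)) for _ in range(n)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```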