Part 3 - Week 1

Florian Tramèr
Published

Tuesday, May 6, 2025

Security and Adversarial Examples

  • In security contexts, achieving 99% accuracy constitutes a failure.
    • Developing models that behave reliably under all conditions is challenging.

Adversarial Examples

  • Adversarial examples are inputs modified with targeted perturbations that appear normal to humans but cause machine learning models to produce incorrect outputs.
  • These can be generated easily with access to model weights by performing gradient ascent on the input to maximize misclassification.
  • Models exhibit robustness to random noise, but adversarial perturbations are directed non-randomly, exploiting specific vulnerabilities.
  • This can be formulated as a constrained optimization problem: maximize the model’s loss subject to a constraint on the perturbation size, such as the \(L_\infty\) norm (infinity norm), ensuring the change remains semantically small.
  • From a security standpoint, this is problematic because, for nearly every input, an adversarial example exists that can fool the model.
  • Optimization requires model weights (white-box attack); however, even when models are hidden behind an Application Programming Interface (API), attacks are possible.
  • Black-box attacks, where weights are unavailable, include:
    • Transfer attacks: Train a surrogate model on similar data, generate adversarial examples on it, and transfer them to the target model, exploiting shared brittleness.
    • Black-box optimization methods, such as derivative-free optimization algorithms that rely on function evaluations via sampling rather than gradients.

Adversarial Examples in Large Language Models (LLMs)

  • LLMs are transformer-based models trained on vast text corpora for tasks like generation and classification; adversarial examples for LLMs involve inputs that elicit undesired behaviors.
  • Key considerations include defining the attack goal, perturbation metric, and optimization strategy.
  • Jailbreaking aims to induce outputs violating safety guidelines, such as generating harmful content.
  • Unlike continuous image spaces, text operates in a discrete space, preventing continuous input modifications.
  • For tasks like sentiment analysis, prefixes can be added to flip outputs, though such perturbations are often visible and not stealthy.
  • Examples include biased outputs (e.g., Wallace et al., 2021, where prefixes induced racist responses) and eliciting dangerous instructions (e.g., bypassing safeguards to describe bomb-making).
  • Tool hijacking involves prompt injections to misuse integrated tools, analogous to SQL injection attacks, potentially executing harmful code.

Jailbreaking Techniques

  • Jailbreaking involves crafting prefixes or prompts that elicit unsafe content from LLMs.
  • Models can be tricked via persuasive role-playing or seemingly random, unintelligible text.
  • Unlike traditional adversarial examples, the focus is on preventing any unsafe output rather than minimizing perturbation distance.
  • Optimization challenges include defining “bad” outputs quantitatively; one approach exploits autoregressive generation by forcing an initial affirmative token (e.g., “Yes”), limiting backtracking.

Attacking Multimodal LLMs

  • Multimodal LLMs combine text and image encoders.
  • Adversarial images can be optimized via backpropagation through the image encoder to jailbreak the model, which is relatively straightforward.

Pure Text Attacks

  • Manual search or role-playing (e.g., “grandma hack” where the model is prompted to recall a deceased relative sharing forbidden knowledge) can succeed.
  • Translations to other languages sometimes bypass filters.
  • Attacks can be human-interpretable or simplistic.
  • Greedy optimization via APIs (e.g., ChatGPT) involves querying next-token probabilities and appending suffixes to increase the likelihood of affirmative responses (hill climbing).

Gradient Descent over Text

  • Gradient descent in the embedding space yields vectors not corresponding to valid tokens.
  • Combine with hill climbing: Project gradients to the nearest valid token and evaluate loss improvements.
  • Greedy Coordinate Gradient Descent (GCG)
    • From Zou et al. (2023): Identifies top-k token substitutions using gradients.
    • Randomly selects B positions in the suffix, evaluates loss for candidate substitutions, and chooses the best.
    • This is a white-box attack but transfers to black-box settings, especially for models distilled from targets like ChatGPT.
    • Hypothesis from Ilyas et al. (2019): Adversarial examples exploit meaningful but non-robust features in training data, which models learn for generalization.

Defenses Against Jailbreaks

  • Numerous attacks exist, but defenses are limited and often ineffective.
  • Content filters: Deploy a secondary model to detect and flag unsafe outputs.
  • Perplexity filters: Identify anomalous (high-perplexity) inputs like gibberish jailbreaks, but fail against coherent ones; stronger optimization can yield low-perplexity jailbreaks.
  • Representation engineering (Zou et al., 2023): Leverages mechanistic interpretability (analyzing internal model circuits) to identify activations or features correlated with harmful behaviors; interventions (e.g., steering representations) suppress these during inference to prevent unsafe outputs. [Further explanation: This involves techniques like activation patching or steering vectors derived from contrastive examples (harmful vs. safe), allowing targeted modifications to model internals without full retraining, assuming familiarity with transformer architectures.]
  • Circuit Breakers (Zou et al., 2024): A defense mechanism that trains LLMs to detect and interrupt “circuits” (subnetworks) responsible for harmful generations; it uses representation-based interventions to halt unsafe trajectories early in generation, improving robustness against jailbreaks while preserving utility. [Filled information: Builds on representation engineering by automating circuit identification and breaking via fine-tuning or prompting.]
  • Other approaches: Perturb inputs slightly before processing, adversarial training on known jailbreaks, and more; however, no defense is universally effective.

LLM Misuses

  • LLMs enable generation of low-quality content (“AI slop”) at scale, including fake news, spam, phishing emails, and articles.
  • Dual-use potential: LLMs can enhance defenses like spam filters and vulnerability detection.
  • Asymmetry exists: Attackers need only exploit one vulnerability, while defenders must address all; this favors attackers in cost and effort.
  • “Malware 2.0” (inspired by Karpathy’s “Software 2.0” paradigm, where software is learned rather than programmed): Deploy swarms of AI agents to exploit systems at scale, targeting humans with personalized phishing or automatically scanning small applications for vulnerabilities.
  • Offensive cybersecurity: AI-driven attacks on systems, including prompt injection attacks on LLM agents (e.g., AgentDojo framework evaluates robustness against such attacks, where agents are hijacked to execute malicious tasks via untrusted data).
  • Abusing inference capabilities: Exploiting LLM inference for malicious purposes, such as generating deceptive content, automating attacks, or privacy breaches through side channels and model extraction.

Watermarking

  • Watermarking aims to detect LLM-generated text, e.g., distinguishing human from AI essays, to mitigate misuses.
  • Detectors based on statistical distributions or likelihood are unreliable, as they may flag memorized human text as AI-generated.
  • Embed imperceptible signals into generated text, typically for closed models by biasing token distributions during generation.
  • Analogy: Lipogrammatic writing, like Georges Perec’s “A Void” (1969), which avoids the letter ‘e’.
  • While deployed for images, text watermarking is nascent and not widely used.
  • One method: For a given prefix, partition the vocabulary into “green” and “red” lists (e.g., based on a hash); bias sampling toward the green list.
  • Detection: Unwatermarked text has ~50% chance per token of being green; watermarked text deviates, with probability decaying exponentially over tokens.
  • Soft variants subtract from red-list logits to avoid blocking useful tokens, preserving utility.
  • Without the secret key (e.g., hash seed), expectations remain 50/50, but long watermarked texts show bias.
  • Limitations: Brittle, may degrade output quality, and can be reverse-engineered to forge or remove watermarks.
  • Advanced schemes enable public verifiability, using cryptographic primitives like public-key systems where generation requires a private key, but verification is public.

Conclusion

  • Security emphasizes worst-case performance over average accuracy.
  • Adversarial examples pose a significant threat to model reliability.
  • Defenses are largely ad-hoc and insufficient, lacking robust solutions.