Part 3 - Week 1
Florian Tramèr
Security and Adversarial Examples
- In security contexts, achieving 99% accuracy constitutes a failure.
- Developing models that behave reliably under all conditions is challenging.
Adversarial Examples
- Adversarial examples are inputs modified with targeted perturbations that appear normal to humans but cause machine learning models to produce incorrect outputs.
- These can be generated easily with access to model weights by performing gradient ascent on the input to maximize misclassification.
- Models exhibit robustness to random noise, but adversarial perturbations are directed non-randomly, exploiting specific vulnerabilities.
- This can be formulated as a constrained optimization problem: maximize the model’s loss subject to a constraint on the perturbation size, such as the \(L_\infty\) norm (infinity norm), ensuring the change remains semantically small.
- From a security standpoint, this is problematic because, for nearly every input, an adversarial example exists that can fool the model.
- Optimization requires model weights (white-box attack); however, even when models are hidden behind an Application Programming Interface (API), attacks are possible.
- Black-box attacks, where weights are unavailable, include:
- Transfer attacks: Train a surrogate model on similar data, generate adversarial examples on it, and transfer them to the target model, exploiting shared brittleness.
- Black-box optimization methods, such as derivative-free optimization algorithms that rely on function evaluations via sampling rather than gradients.
Adversarial Examples in Large Language Models (LLMs)
- LLMs are transformer-based models trained on vast text corpora for tasks like generation and classification; adversarial examples for LLMs involve inputs that elicit undesired behaviors.
- Key considerations include defining the attack goal, perturbation metric, and optimization strategy.
- Jailbreaking aims to induce outputs violating safety guidelines, such as generating harmful content.
- Unlike continuous image spaces, text operates in a discrete space, preventing continuous input modifications.
- For tasks like sentiment analysis, prefixes can be added to flip outputs, though such perturbations are often visible and not stealthy.
- Examples include biased outputs (e.g., Wallace et al., 2021, where prefixes induced racist responses) and eliciting dangerous instructions (e.g., bypassing safeguards to describe bomb-making).
- Tool hijacking involves prompt injections to misuse integrated tools, analogous to SQL injection attacks, potentially executing harmful code.
Jailbreaking Techniques
- Jailbreaking involves crafting prefixes or prompts that elicit unsafe content from LLMs.
- Models can be tricked via persuasive role-playing or seemingly random, unintelligible text.
- Unlike traditional adversarial examples, the focus is on preventing any unsafe output rather than minimizing perturbation distance.
- Optimization challenges include defining “bad” outputs quantitatively; one approach exploits autoregressive generation by forcing an initial affirmative token (e.g., “Yes”), limiting backtracking.
Attacking Multimodal LLMs
- Multimodal LLMs combine text and image encoders.
- Adversarial images can be optimized via backpropagation through the image encoder to jailbreak the model, which is relatively straightforward.
Pure Text Attacks
- Manual search or role-playing (e.g., “grandma hack” where the model is prompted to recall a deceased relative sharing forbidden knowledge) can succeed.
- Translations to other languages sometimes bypass filters.
- Attacks can be human-interpretable or simplistic.
- Greedy optimization via APIs (e.g., ChatGPT) involves querying next-token probabilities and appending suffixes to increase the likelihood of affirmative responses (hill climbing).
Gradient Descent over Text
- Gradient descent in the embedding space yields vectors not corresponding to valid tokens.
- Combine with hill climbing: Project gradients to the nearest valid token and evaluate loss improvements.
- Greedy Coordinate Gradient Descent (GCG)
- From Zou et al. (2023): Identifies top-k token substitutions using gradients.
- Randomly selects B positions in the suffix, evaluates loss for candidate substitutions, and chooses the best.
- This is a white-box attack but transfers to black-box settings, especially for models distilled from targets like ChatGPT.
- Hypothesis from Ilyas et al. (2019): Adversarial examples exploit meaningful but non-robust features in training data, which models learn for generalization.
Defenses Against Jailbreaks
- Numerous attacks exist, but defenses are limited and often ineffective.
- Content filters: Deploy a secondary model to detect and flag unsafe outputs.
- Perplexity filters: Identify anomalous (high-perplexity) inputs like gibberish jailbreaks, but fail against coherent ones; stronger optimization can yield low-perplexity jailbreaks.
- Representation engineering (Zou et al., 2023): Leverages mechanistic interpretability (analyzing internal model circuits) to identify activations or features correlated with harmful behaviors; interventions (e.g., steering representations) suppress these during inference to prevent unsafe outputs. [Further explanation: This involves techniques like activation patching or steering vectors derived from contrastive examples (harmful vs. safe), allowing targeted modifications to model internals without full retraining, assuming familiarity with transformer architectures.]
- Circuit Breakers (Zou et al., 2024): A defense mechanism that trains LLMs to detect and interrupt “circuits” (subnetworks) responsible for harmful generations; it uses representation-based interventions to halt unsafe trajectories early in generation, improving robustness against jailbreaks while preserving utility. [Filled information: Builds on representation engineering by automating circuit identification and breaking via fine-tuning or prompting.]
- Other approaches: Perturb inputs slightly before processing, adversarial training on known jailbreaks, and more; however, no defense is universally effective.
LLM Misuses
- LLMs enable generation of low-quality content (“AI slop”) at scale, including fake news, spam, phishing emails, and articles.
- Dual-use potential: LLMs can enhance defenses like spam filters and vulnerability detection.
- Asymmetry exists: Attackers need only exploit one vulnerability, while defenders must address all; this favors attackers in cost and effort.
- “Malware 2.0” (inspired by Karpathy’s “Software 2.0” paradigm, where software is learned rather than programmed): Deploy swarms of AI agents to exploit systems at scale, targeting humans with personalized phishing or automatically scanning small applications for vulnerabilities.
- Offensive cybersecurity: AI-driven attacks on systems, including prompt injection attacks on LLM agents (e.g., AgentDojo framework evaluates robustness against such attacks, where agents are hijacked to execute malicious tasks via untrusted data).
- Abusing inference capabilities: Exploiting LLM inference for malicious purposes, such as generating deceptive content, automating attacks, or privacy breaches through side channels and model extraction.
Watermarking
- Watermarking aims to detect LLM-generated text, e.g., distinguishing human from AI essays, to mitigate misuses.
- Detectors based on statistical distributions or likelihood are unreliable, as they may flag memorized human text as AI-generated.
- Embed imperceptible signals into generated text, typically for closed models by biasing token distributions during generation.
- Analogy: Lipogrammatic writing, like Georges Perec’s “A Void” (1969), which avoids the letter ‘e’.
- While deployed for images, text watermarking is nascent and not widely used.
- One method: For a given prefix, partition the vocabulary into “green” and “red” lists (e.g., based on a hash); bias sampling toward the green list.
- Detection: Unwatermarked text has ~50% chance per token of being green; watermarked text deviates, with probability decaying exponentially over tokens.
- Soft variants subtract from red-list logits to avoid blocking useful tokens, preserving utility.
- Without the secret key (e.g., hash seed), expectations remain 50/50, but long watermarked texts show bias.
- Limitations: Brittle, may degrade output quality, and can be reverse-engineered to forge or remove watermarks.
- Advanced schemes enable public verifiability, using cryptographic primitives like public-key systems where generation requires a private key, but verification is public.
Conclusion
- Security emphasizes worst-case performance over average accuracy.
- Adversarial examples pose a significant threat to model reliability.
- Defenses are largely ad-hoc and insufficient, lacking robust solutions.