Part 3 - Week 1

Florian Tramèr

Published

Tuesday, May 6, 2025

Security and Adversarial Examples

In security contexts, achieving 99% accuracy constitutes a failure.
- Developing models that behave reliably under all conditions is challenging.

Adversarial Examples

Adversarial examples are inputs modified with targeted perturbations that appear normal to humans but cause machine learning models to produce incorrect outputs.
These can be generated easily with access to model weights by performing gradient ascent on the input to maximize misclassification.
Models exhibit robustness to random noise, but adversarial perturbations are directed non-randomly, exploiting specific vulnerabilities.
This can be formulated as a constrained optimization problem: maximize the model’s loss subject to a constraint on the perturbation size, such as the \(L_\infty\) norm (infinity norm), ensuring the change remains semantically small.
From a security standpoint, this is problematic because, for nearly every input, an adversarial example exists that can fool the model.
Optimization requires model weights (white-box attack); however, even when models are hidden behind an Application Programming Interface (API), attacks are possible.
Black-box attacks, where weights are unavailable, include:
- Transfer attacks: Train a surrogate model on similar data, generate adversarial examples on it, and transfer them to the target model, exploiting shared brittleness.
- Black-box optimization methods, such as derivative-free optimization algorithms that rely on function evaluations via sampling rather than gradients.

Adversarial Examples in Large Language Models (LLMs)

LLMs are transformer-based models trained on vast text corpora for tasks like generation and classification; adversarial examples for LLMs involve inputs that elicit undesired behaviors.
Key considerations include defining the attack goal, perturbation metric, and optimization strategy.
Jailbreaking aims to induce outputs violating safety guidelines, such as generating harmful content.
Unlike continuous image spaces, text operates in a discrete space, preventing continuous input modifications.
For tasks like sentiment analysis, prefixes can be added to flip outputs, though such perturbations are often visible and not stealthy.
Examples include biased outputs (e.g., Wallace et al., 2021, where prefixes induced racist responses) and eliciting dangerous instructions (e.g., bypassing safeguards to describe bomb-making).
Tool hijacking involves prompt injections to misuse integrated tools, analogous to SQL injection attacks, potentially executing harmful code.

Jailbreaking Techniques

Jailbreaking involves crafting prefixes or prompts that elicit unsafe content from LLMs.
Models can be tricked via persuasive role-playing or seemingly random, unintelligible text.
Unlike traditional adversarial examples, the focus is on preventing any unsafe output rather than minimizing perturbation distance.
Optimization challenges include defining “bad” outputs quantitatively; one approach exploits autoregressive generation by forcing an initial affirmative token (e.g., “Yes”), limiting backtracking.

Attacking Multimodal LLMs

Multimodal LLMs combine text and image encoders.
Adversarial images can be optimized via backpropagation through the image encoder to jailbreak the model, which is relatively straightforward.

Pure Text Attacks

Manual search or role-playing (e.g., “grandma hack” where the model is prompted to recall a deceased relative sharing forbidden knowledge) can succeed.
Translations to other languages sometimes bypass filters.
Attacks can be human-interpretable or simplistic.
Greedy optimization via APIs (e.g., ChatGPT) involves querying next-token probabilities and appending suffixes to increase the likelihood of affirmative responses (hill climbing).

Gradient Descent over Text

Gradient descent in the embedding space yields vectors not corresponding to valid tokens.
Combine with hill climbing: Project gradients to the nearest valid token and evaluate loss improvements.
Greedy Coordinate Gradient Descent (GCG)
- From Zou et al. (2023): Identifies top-k token substitutions using gradients.
- Randomly selects B positions in the suffix, evaluates loss for candidate substitutions, and chooses the best.
- This is a white-box attack but transfers to black-box settings, especially for models distilled from targets like ChatGPT.
- Hypothesis from Ilyas et al. (2019): Adversarial examples exploit meaningful but non-robust features in training data, which models learn for generalization.

Defenses Against Jailbreaks

Numerous attacks exist, but defenses are limited and often ineffective.
Content filters: Deploy a secondary model to detect and flag unsafe outputs.
Perplexity filters: Identify anomalous (high-perplexity) inputs like gibberish jailbreaks, but fail against coherent ones; stronger optimization can yield low-perplexity jailbreaks.
Representation engineering (Zou et al., 2023): Leverages mechanistic interpretability (analyzing internal model circuits) to identify activations or features correlated with harmful behaviors; interventions (e.g., steering representations) suppress these during inference to prevent unsafe outputs. [Further explanation: This involves techniques like activation patching or steering vectors derived from contrastive examples (harmful vs. safe), allowing targeted modifications to model internals without full retraining, assuming familiarity with transformer architectures.]
Circuit Breakers (Zou et al., 2024): A defense mechanism that trains LLMs to detect and interrupt “circuits” (subnetworks) responsible for harmful generations; it uses representation-based interventions to halt unsafe trajectories early in generation, improving robustness against jailbreaks while preserving utility. [Filled information: Builds on representation engineering by automating circuit identification and breaking via fine-tuning or prompting.]
Other approaches: Perturb inputs slightly before processing, adversarial training on known jailbreaks, and more; however, no defense is universally effective.

LLM Misuses

LLMs enable generation of low-quality content (“AI slop”) at scale, including fake news, spam, phishing emails, and articles.
Dual-use potential: LLMs can enhance defenses like spam filters and vulnerability detection.
Asymmetry exists: Attackers need only exploit one vulnerability, while defenders must address all; this favors attackers in cost and effort.
“Malware 2.0” (inspired by Karpathy’s “Software 2.0” paradigm, where software is learned rather than programmed): Deploy swarms of AI agents to exploit systems at scale, targeting humans with personalized phishing or automatically scanning small applications for vulnerabilities.
Offensive cybersecurity: AI-driven attacks on systems, including prompt injection attacks on LLM agents (e.g., AgentDojo framework evaluates robustness against such attacks, where agents are hijacked to execute malicious tasks via untrusted data).
Abusing inference capabilities: Exploiting LLM inference for malicious purposes, such as generating deceptive content, automating attacks, or privacy breaches through side channels and model extraction.

Watermarking

Watermarking aims to detect LLM-generated text, e.g., distinguishing human from AI essays, to mitigate misuses.
Detectors based on statistical distributions or likelihood are unreliable, as they may flag memorized human text as AI-generated.
Embed imperceptible signals into generated text, typically for closed models by biasing token distributions during generation.
Analogy: Lipogrammatic writing, like Georges Perec’s “A Void” (1969), which avoids the letter ‘e’.
While deployed for images, text watermarking is nascent and not widely used.
One method: For a given prefix, partition the vocabulary into “green” and “red” lists (e.g., based on a hash); bias sampling toward the green list.
Detection: Unwatermarked text has ~50% chance per token of being green; watermarked text deviates, with probability decaying exponentially over tokens.
Soft variants subtract from red-list logits to avoid blocking useful tokens, preserving utility.
Without the secret key (e.g., hash seed), expectations remain 50/50, but long watermarked texts show bias.
Limitations: Brittle, may degrade output quality, and can be reverse-engineered to forge or remove watermarks.
Advanced schemes enable public verifiability, using cryptographic primitives like public-key systems where generation requires a private key, but verification is public.

Conclusion

Security emphasizes worst-case performance over average accuracy.
Adversarial examples pose a significant threat to model reliability.
Defenses are largely ad-hoc and insufficient, lacking robust solutions.