Part 3 - Week 2

Florian Tramèr

Published

Tuesday, May 13, 2025

Prompt Injections & Poisoning Attacks

Importance of the Topic
Prompt injections and poisoning attacks pose significant security risks to Large Language Models (LLMs), which are neural networks trained on vast datasets to generate human-like text. These attacks can compromise system integrity, leading to unauthorized actions or data leaks. As LLMs integrate into applications with access to sensitive resources (e.g., email, codebases, or shells), such vulnerabilities become critical. Real-world examples include early integrations like Bing with GPT, where search queries could inject malicious instructions, and cases in Google Docs with Bard. Unlike traditional computers, LLMs lack clear separation between instructions and data, as everything is processed as text, exacerbating these issues.
Manual Jailbreaks
- Basic attacks involve prompts like “disregard your previous instructions” or hidden text instructing the model to perform unauthorized actions (e.g., “instead do X”).
- These extend beyond simple jailbreaks, enabling more severe exploits in integrated systems.
Challenges in Securing Prompts
- Prompts are intended to remain secret, but maintaining secrecy is difficult.
- Agentic models (LLMs that act autonomously with tools) amplify risks by accessing real-world resources.
- Attacks are practical, increasingly common with application integrations, and no inherent separation exists between memory and data in LLMs.

Partial Solutions

Training Models Against Attacks
- Fine-tuning LLMs on adversarial examples to resist injections; however, this is not fully reliable due to evolving attack methods.
Separation Symbols
- Using delimiters to distinguish instructions from data; unreliable as LLMs may still misinterpret or ignore them.
Additional AI Layers
- Employing another LLM to scan for injections; vulnerable to recursive injections targeting the scanner itself.
Speculative Approaches
- Separate pipelines for processing data versus instructions; not yet implemented as a practical solution.
Instruction Hierarchy (Wallace 2024)
- Training LLMs to prioritize privileged instructions, embedding a hierarchy where system prompts (highest priority) override user prompts, which override other inputs.
- Demonstrated to improve robustness against unseen attacks, with minimal impact on general capabilities.
Training to Prioritize Privileged Instructions (2024)
- During training, emphasize that system prompts are paramount, followed by user prompts, then other data.
Quarantined LLMs: Dual LLM Pattern
- Uses a planner LLM that spawns a secondary, quarantined LLM to process untrusted data without tool access, preventing injection-based exploits.
- Limitation: Planner decisions may still depend on untrusted data, requiring eventual exposure.
CaMeL: Defeating Prompt Injections by Design
- The planner generates code to process data, ensuring control flow remains unaltered.
- Data is handled by a quarantined LLM resistant to injections, providing a structured defense.

Poisoning Attacks

Overview
- Unlike test-time attacks (e.g., prompt injections), poisoning targets the training phase to induce malicious behavior.
- Can degrade model performance through continuous training or introduce backdoors.
Backdoor Attacks
- Embed triggers (e.g., words like “James Bond”) that activate incorrect behavior only when present, making detection difficult during training.
- Applicable across training stages, such as Reinforcement Learning from Human Feedback (RLHF), where poisoned data includes easy-to-learn triggers exploitable at inference.
- In RLHF, poisoning involves mislabeling completions and corrupting the reward model; minimal poisoning suffices for the reward model, but more is needed for the base model, reducing practicality.
- During pretraining, mimicking system-user interactions can embed persistent manipulations, especially for underrepresented beliefs in the dataset.
Attacker Data Injection Methods
- Corrupting moderate amounts of training data enables significant harm.
- Vectors include editable sources like Wikipedia, which is often upsampled in training.
- Modern datasets (e.g., LAION-5B for image-text pairs) are vulnerable: distributed as captions with links, where links can be altered (e.g., via domain purchases for 404 errors).
- This issue persists in text-image datasets as of 2025.
- Text-Only Datasets
  - Sources like Wikipedia are prone to vandalism.
  - Dumps occur monthly, allowing persistent alterations.

Defenses Against Poisoning

Dataset Integrity Measures
- Adding hashes to verify dataset authenticity; however, this can be overly aggressive, triggering false positives from minor changes (e.g., watermarks or aspect ratio adjustments).
Wikipedia-Specific Solutions
- Randomize article saving or include only edits stable for a defined period to mitigate vandalism.
General Dataset Protections
- Apply similar randomization or stability checks to Common Crawl and other services, often managed by small teams.
- Recent research (as of 2025) highlights new poisoning variants, such as overthinking backdoors and finetuning-activated backdoors, emphasizing the need for ongoing defenses.

Model Stealing

Overview
Model stealing involves accessing a remote model via an Application Programming Interface (API) and training a surrogate model to replicate its behavior. This process, also known as model distillation, uses a teacher-student paradigm where the student model (typically smaller) learns from the teacher’s outputs. Distillation is effective in practice, as the teacher provides richer signals like probability distributions (logits) beyond just final predictions. This technique has been studied in the context of function approximation and reverse engineering.
Historical Examples
- Stanford’s Alpaca project fine-tuned the LLaMA model (a Large Language Model, or LLM, developed by Meta) on data generated by GPT-3 outputs.
- Student models rarely match the teacher’s performance exactly, often due to differences in scale and training data.
Real-World Implications
- Distillation of Chinese open models raises concerns, as these models may incorporate censorship or biases (e.g., avoiding certain topics) from their training.
- Allegations surfaced in early 2025 that DeepSeek queried OpenAI servers to distill models, particularly for mathematics and reasoning traces, leading to investigations by OpenAI and U.S. government involvement. DeepSeek denied wrongdoing, but this highlights potential intellectual property theft via distillation.
Provider Responses
- As of 2025, many providers (e.g., OpenAI, Anthropic) have begun hiding reasoning traces to prevent distillation, driven by concerns over model stealing and privacy. Joint research from major labs warns that transparency in AI reasoning may soon diminish, as models evolve to internalize thoughts without external visibility.
Analogy to Cryptanalysis
- Model stealing resembles cryptanalysis, where a known algorithm generates data from a seed, and reversal is intentionally hardened. Neural networks (NNs) are not optimized for such hardness, allowing adaptation of cryptographic reversal techniques.
Specific Stealing Techniques
- Linear Classifiers
  - For a model with \(d+1\) variables, exact recovery is possible with \(d+1\) queries, solving the system of equations.
- Extensions to Deeper Models
  - Applicable to networks with Rectified Linear Unit (ReLU) activations, which produce piecewise linear outputs.
  - Recovery involves identifying linear functions per subregion; this is complex but feasible with access to output derivatives.
  - Limited to small models under specific conditions, enabling one-to-one model theft.

Reverse Engineering Alternatives

If Full Model Theft Fails
- Estimate model parameters like size, embedding dimensions, and vocabulary size.
- Vocabulary size often exceeds hidden dimensions; knowing width allows depth estimation via empirical ratios.
- With logit access, linear algebra techniques can recover these details, successfully demonstrated on models like ChatGPT.