Part 3 - Week 4
Membership Inference
Overview
Membership Inference (MI) attacks aim to determine whether a specific data point was included in a model’s training set. This can be formalized as a security game: a challenger trains a model on either dataset \(D\) or \(D \cup \{x\}\), and the adversary guesses which was used. MI exploits differences in model behavior on seen versus unseen data, often due to overfitting.
Relationship with Differential Privacy
- There is a formal connection between MI and Differential Privacy (DP), a framework that bounds information leakage about individuals in a dataset.
- If an algorithm is \(\epsilon\)-DP, MI success is bounded: the adversary’s advantage is at most \(e^\epsilon - 1\) (a short derivation follows this list). Conversely, if MI is impossible (adversary advantage near zero), the algorithm provides DP-like guarantees.
- MI attacks can be used to empirically audit privacy: a successful attack demonstrates a lower bound on data leakage, which can falsify a claimed DP guarantee.
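A short derivation of the advantage bound, assuming the standard definition of the adversary’s advantage as true-positive rate minus false-positive rate in the membership game: \(\epsilon\)-DP means adding one record changes the probability of any adversary decision by at most a factor of \(e^\epsilon\), so
\[
\mathrm{TPR} \le e^{\epsilon}\,\mathrm{FPR}
\quad\Longrightarrow\quad
\mathrm{Adv} = \mathrm{TPR} - \mathrm{FPR} \le (e^{\epsilon} - 1)\,\mathrm{FPR} \le e^{\epsilon} - 1 .
\]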
Implications
- MI itself leaks only a single bit (membership), but even that bit can be sensitive in the worst case; e.g., membership in a clinical dataset can reveal disease status.
- It serves as a building block for broader attacks, such as data extraction.
- Applications include proving data provenance for copyright litigation, though this may encourage adversarial practices.
Methods for Membership Inference
Basic Approaches
- Prompt the model with a partial sample and check whether it completes the rest accurately; success is strong evidence of memorization, but failure does not rule out membership.
- Compare loss values: Training data typically has lower loss than held-out data. Statistical tests can detect this, but distributions overlap, complicating thresholds.
- High loss on a sample is strong evidence of non-membership, since trained-on samples rarely retain high loss.
- Evaluation focuses on the true-positive rate at low false-positive rates (or the full ROC curve) rather than overall accuracy, since confident, conservative membership claims are what matter (see the sketch after this list).
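A minimal sketch of the loss-threshold attack and of evaluating it at a fixed low false-positive rate. The losses below are synthetic stand-ins; in practice they would be per-example losses from the target model on known members and non-members.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example losses: members (training data) tend to have lower
# loss than held-out data, but the two distributions overlap substantially.
member_losses = rng.normal(loc=1.0, scale=0.6, size=5000)
nonmember_losses = rng.normal(loc=1.6, scale=0.6, size=5000)

def loss_threshold_attack(losses, threshold):
    """Predict 'member' whenever the loss falls below the threshold."""
    return losses < threshold

def tpr_at_fpr(member_losses, nonmember_losses, target_fpr=0.001):
    """Pick the threshold that flags only `target_fpr` of non-members,
    then report how many true members are caught at that operating point."""
    threshold = np.quantile(nonmember_losses, target_fpr)
    tpr = loss_threshold_attack(member_losses, threshold).mean()
    return threshold, tpr

thr, tpr = tpr_at_fpr(member_losses, nonmember_losses)
print(f"threshold={thr:.3f}  TPR@0.1%FPR={tpr:.3%}")
```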
LiRA Attack (Carlini et al., 2022)
- Improves on basic loss tests by accounting for varying sample difficulty.
- Low loss on an easy sample (e.g., a clear cat image) is uninformative, since such samples have low loss whether or not they were trained on; low loss on a hard sample (e.g., a noisy truck image) is suspicious.
- Uses a likelihood ratio: \(\frac{P(\text{loss} \mid x \in \text{training})}{P(\text{loss} \mid x \notin \text{training})}\).
- Approximates both distributions with shadow models (surrogate models trained on subsets with and without the sample), fitting a Gaussian to each set of shadow losses for the membership test (a toy sketch follows this list).
- Confidently identifies a subset of members at very low false-positive rates, even though most samples remain ambiguous.
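A toy sketch of the LiRA scoring rule, not the authors’ released code: the shadow-model losses below are synthetic, and in practice each example gets its own sets of “in” and “out” shadow losses.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def lira_score(target_loss, in_shadow_losses, out_shadow_losses):
    """Log of the likelihood ratio P(loss | x in training) / P(loss | x not in training),
    with each distribution approximated by a Gaussian fit to shadow-model losses
    for this specific example (this is what corrects for sample difficulty)."""
    mu_in, sd_in = in_shadow_losses.mean(), in_shadow_losses.std() + 1e-8
    mu_out, sd_out = out_shadow_losses.mean(), out_shadow_losses.std() + 1e-8
    return norm.logpdf(target_loss, mu_in, sd_in) - norm.logpdf(target_loss, mu_out, sd_out)

# Synthetic shadow losses for one "hard" example: even shadow models trained on it
# keep a fairly high loss, so a single absolute threshold would misjudge it.
in_losses = rng.normal(2.0, 0.3, size=64)   # shadow models trained WITH the example
out_losses = rng.normal(3.0, 0.4, size=64)  # shadow models trained WITHOUT it
target_loss = 2.1                           # loss of the actual target model

score = lira_score(target_loss, in_losses, out_losses)
print(score, "-> member" if score > 0 else "-> non-member")
```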
Applying LiRA to Large Language Models (LLMs)
Adapting Loss Metrics
- For token sequences, use the cross-entropy loss and model its distribution as a univariate Gaussian (see the sketch after this list).
- An enhanced version treats each token’s loss as one dimension of a multivariate Gaussian, either assuming independence across tokens or estimating a covariance for better accuracy (computationally intensive).
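A sketch of extracting the per-token losses that these Gaussian models are fit to, using Hugging Face transformers; “gpt2” is only a placeholder for the target LM.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder target model; any causal LM exposes the same interface.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def per_token_losses(text: str) -> torch.Tensor:
    """Cross-entropy of each token given its prefix (one value per predicted token)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that the logits at position t are scored against token t+1.
    return F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

losses = per_token_losses("The quick brown fox jumps over the lazy dog.")
# Univariate variant: summarize the sequence by its mean loss.
# Multivariate variant: keep the whole vector, one Gaussian dimension per token.
print(losses.mean().item(), losses.shape)
```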
Challenges and Alternatives
- Training shadow models per sample is expensive, requiring full LLM retraining.
- Approximation: train shadow models only on data excluding the sample and test whether the target model’s loss is an outlier relative to that held-out (“out”) distribution.
- Without retraining: compare against reference LMs; a sample with high loss under the references but low loss under the target suggests membership.
- Use simple proxies like gzip compression size to approximate out-distribution loss.
- Perturb samples (e.g., change capitalization) and observe loss sensitivity.
- These ad-hoc methods are unreliable and perform poorly in practice.
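A sketch of the compression and perturbation proxies mentioned above, under the assumption that a `loss_fn(text)` backed by the target model is available; the `dummy_loss` below is only a placeholder so the snippet runs.

```python
import zlib

def zlib_ratio(text: str, model_loss: float) -> float:
    """Ratio of the model's loss to the text's compressed size: a crude proxy for
    'loss relative to how predictable the text is in general'. Unusually low
    values are treated as weak evidence of membership."""
    compressed_bits = 8 * len(zlib.compress(text.encode("utf-8")))
    return model_loss / compressed_bits

def perturbation_gap(text: str, loss_fn) -> float:
    """Compare the loss of the original text with a trivially perturbed copy
    (here: flipped capitalization). Memorized text tends to show a sharper
    loss increase under perturbation than text that is merely easy."""
    return loss_fn(text.swapcase()) - loss_fn(text)

# Placeholder loss function so the sketch runs; replace with real target-model loss.
def dummy_loss(text: str) -> float:
    return 0.1 * len(text)

sample = "Call me Ishmael. Some years ago..."
print(zlib_ratio(sample, dummy_loss(sample)))
print(perturbation_gap(sample, dummy_loss))
```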
Evaluation Challenges
- Ground truth is hard to obtain; proxies include post-cutoff books (e.g., after 2021 for older models) as non-members and classics (e.g., Charles Dickens) as members.
- Distribution shifts (e.g., pre-2022 vs. post-2022 data) bias results; simple temporal classifiers that never query the target model outperform many MI attacks.
- Published evaluations are often flawed without such blind-baseline comparisons (a sketch follows this list).
- Reliable tests require scratch-trained models like the Pythia suite, where exact training data is known.
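A sketch of such a blind baseline, with placeholder candidate texts standing in for the assumed member/non-member sets; if this model-free classifier already separates the two sets, any MI result on them is confounded by distribution shift.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder candidate sets; in a real evaluation these might be pre-cutoff
# classics (assumed members) and post-cutoff books (assumed non-members).
members = [f"old placeholder text {i} about victorian london" for i in range(50)]
nonmembers = [f"new placeholder text {i} about events in 2023" for i in range(50)]

texts = members + nonmembers
labels = np.array([1] * len(members) + [0] * len(nonmembers))

# The baseline never queries the target model: it classifies the candidate
# texts themselves from surface features alone.
features = TfidfVectorizer().fit_transform(texts)
auc = cross_val_score(LogisticRegression(max_iter=1000), features, labels,
                      cv=5, scoring="roc_auc").mean()
print(f"blind-baseline AUC: {auc:.2f}  (an informative MI attack must beat this)")
```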
Data Provenance Tests
Legal and Practical Considerations
- The goal is to prove that a model was trained on copyrighted data at a low false-positive rate, which is difficult to achieve reliably.
- Weak claims (e.g., model summarizing a book) are insufficient, as knowledge could stem from summaries.
- All methods remain speculative and “sketchy” without strong statistical backing.
Effective Techniques
- Data Canaries: select one string uniformly at random from a large set \(C\) of random strings (none of which appear in existing training data), insert it into your document, and later measure the target model’s loss on every string in \(C\). The inserted canary’s rank enables a proper statistical test for membership (see the sketch after this list). Requires proactive insertion, not post-hoc analysis.
- Data Extraction: Prompt the model to regurgitate training data verbatim; providers mitigate this, but jailbreaks can succeed.
- Basis for lawsuits like The New York Times vs. OpenAI, claiming verbatim reproduction implies training data inclusion.
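A sketch of the canary rank test; the losses below are synthetic, and in practice they would be the target model’s losses on each string in \(C\).

```python
import numpy as np

rng = np.random.default_rng(2)

def canary_p_value(canary_losses, inserted_index):
    """Rank-based test: if the model never saw the document, the inserted canary
    is exchangeable with the unused ones, so its loss rank is uniform over 1..n.
    A small p-value is evidence that the document was trained on."""
    n = len(canary_losses)
    rank = 1 + np.sum(canary_losses < canary_losses[inserted_index])
    return rank / n

# Synthetic losses for a set C of 1000 random strings, one of which (index 123)
# was inserted into the document before it may have been scraped for training.
losses = rng.normal(5.0, 1.0, size=1000)
losses[123] = 2.0  # the inserted canary comes out unusually easy for the model
print(canary_p_value(losses, inserted_index=123))  # ~0.001
```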
Niloofar Mireshghallah Guest Lecture
Non-Literal Copying
Beyond Verbatim Reproduction
Copyright lawsuits can extend beyond word-for-word copying; imitation of an artist’s style or semantic elements may constitute infringement, as seen in cases involving generative models producing outputs in the style of protected works.
Utility vs. Copyright Tradeoff
Models must balance utility (e.g., answering factual questions about copyrighted books) with avoiding infringement, ensuring knowledge dissemination without unauthorized reproduction.
Impact of Instruction Tuning
Instruction tuning (fine-tuning LLMs on task-specific instructions to improve alignment) reduces literal copying but increases semantic copying, where concepts or structures are replicated without exact phrasing.
Proposed Fixes
- Reminder Prompts: Inject prompts reminding the model not to copy training data; ineffective in practice, as models can still generate infringing content.
- Memfree Decoding: During generation, check outputs against the training set to prevent memorization-based regurgitation; reduces literal copying but allows non-literal infringement and degrades factual recall.
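A simplified sketch of the Memfree idea: filter out any candidate token that would complete an n-gram present in the training set. The real method uses an efficient lookup structure over training n-grams; the literal set and the 4-gram here are only illustrative.

```python
from typing import List, Set, Tuple

def memfree_filter(prefix: List[str], candidates: List[Tuple[str, float]],
                   training_ngrams: Set[Tuple[str, ...]], n: int = 4):
    """Drop any candidate token that would complete an n-gram seen in training.
    This blocks verbatim regurgitation, but does nothing about paraphrase-level
    (non-literal) copying and can hurt recall of facts whose natural phrasing
    matches the training text."""
    allowed = []
    for token, score in candidates:
        ngram = tuple(prefix[-(n - 1):] + [token])
        if len(ngram) < n or ngram not in training_ngrams:
            allowed.append((token, score))
    return allowed or candidates  # fall back if everything got blocked

# Toy example with one banned 4-gram standing in for the training set.
training_ngrams = {("it", "was", "the", "best")}
prefix = ["it", "was", "the"]
candidates = [("best", 0.9), ("worst", 0.8), ("strangest", 0.1)]
print(max(memfree_filter(prefix, candidates, training_ngrams), key=lambda p: p[1]))
# -> ('worst', 0.8): the memorized continuation is blocked, the next-best token wins.
```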
Extractability vs. Memorization
Extractability refers to the ability to prompt the model to output specific training data, distinct from memorization (internal storage of data). Additional fine-tuning can make previously non-extractable data accessible, highlighting evolving vulnerabilities.
Does Sanitization or Synthetic Data Help?
Sanitization Techniques
- Assess leakage in sanitized datasets, where sensitive information is removed or altered.
- Remove Personally Identifiable Information (PII), such as names or identifiers.
- Obfuscate attributes like location or age by reducing granularity (e.g., “city” instead of exact address).
- Use LLMs to depersonalize text, creating a one-to-one mapping from original to sanitized versions.
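A toy illustration of regex-based PII removal and granularity reduction; real pipelines use NER or dedicated PII detectors, and the patterns and example names here are purely illustrative.

```python
import re

# Illustrative patterns only; real systems rely on NER / dedicated PII detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "NAME":  re.compile(r"\b(?:Alice Smith|Bob Jones)\b"),  # placeholder name list
}

def sanitize(text: str) -> str:
    """Replace detected PII spans with type tags and coarsen granular attributes."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    # Granularity reduction: exact ages -> decade buckets.
    return re.sub(r"\b(\d)\d years old\b", r"\1x years old", text)

print(sanitize("Alice Smith (alice@example.com, +1 415 555 0100) is 34 years old."))
# -> "[NAME] ([EMAIL], [PHONE]) is 3x years old."
```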
Synthetic Data Generation
- Treat original data as a seed to sample synthetic equivalents, aiming to preserve utility while enhancing privacy.
- Vulnerable to reidentification and linking attacks, where anonymized records are matched to individuals using auxiliary information (as in the classic census and Netflix deanonymizations, where unique attribute combinations identify individuals).
- For tabular data, these attacks are straightforward; for text, convert to structured formats and apply semantic matching.
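A minimal sketch of a linking attack on tabular data, with made-up records: matching on quasi-identifiers is just a join, and any unique combination re-identifies its owner.

```python
import pandas as pd

# Made-up "anonymized" records: direct identifiers dropped, quasi-identifiers kept.
anonymized = pd.DataFrame([
    {"zip": "98105", "birth_year": 1987, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "98105", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
])

# Auxiliary data with names attached (e.g., a public voter roll).
auxiliary = pd.DataFrame([
    {"name": "A. Example", "zip": "98105", "birth_year": 1987, "sex": "F"},
    {"name": "B. Example", "zip": "98040", "birth_year": 1990, "sex": "M"},
])

# The linking attack is a join on the quasi-identifiers: any record whose
# (zip, birth_year, sex) combination is unique gets re-identified.
linked = anonymized.merge(auxiliary, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
# For free text, the same idea applies after extracting attributes into a
# structured table and matching them semantically rather than exactly.
```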
Memorization as a Performance Indicator
In some scenarios, memorization correlates with model performance, suggesting that complete elimination may hinder generalization.
Open Questions and Broader Concerns
Unresolved Research Questions
- Does memorization occur during the Reinforcement Learning (RL) phase of training, such as in Reinforcement Learning from Human Feedback (RLHF)?
- Can models leak memorized data in languages other than the primary training language?
Ethical and Societal Issues
- Consent forms for data usage (e.g., in chatbots) are often unclear, leaving users unaware that interactions are recorded and potentially usable in legal contexts, as highlighted in recent statements by figures like Sam Altman.
- Deepfake research and generative models enable misuse for doxing, stalking, or phishing.
- The field exhibits an “arms race” mentality, prioritizing rapid advancement with limited regard for regular end-users’ privacy and security.