Introduction
Often, ML models are optimized for simple metrics like misclassification rate. As AI becomes more prevalent and is used in critical applications, we must consider robustness. How do models behave in edge cases when data is corrupted or the model is attacked?
Construction of Adversarial Examples
Adversarial examples are specially crafted inputs designed to deceive machine learning models. Adversarial attacks (AA) have been demonstrated for essentially every class of model.
Why Should We Care?
- Security: Protecting models from adversarial attacks.
- Theory: Undermines the assumption that models learn meaningful features, creating a conceptual gap.
- Interpretability: How can we interpret the model if a small change changes the output?
What does “small” mean? The perturbation set \(P(x)\) should not change the semantics of the input. We say \(\hat{x} \in P(x)\) is adversarial if \(f(\hat{x}) \neq f(x)\) (misclassified). The set is usually defined via an \(L_p\) norm: \[ P(x) = \{x' \mid \|x - x'\|_p \leq \epsilon\} \]
While convenient, the \(L_p\) norm does not capture all semantics-preserving perturbations; a small rotation, for example, leaves the meaning of an image unchanged but can change the \(L_p\) distance significantly. It is only a proxy.
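As a concrete illustration, here is a minimal sketch (assuming PyTorch tensors with a leading batch dimension and inputs in \([0, 1]\)) of how a perturbed input can be projected back into an \(L_\infty\) or \(L_2\) ball around \(x\); the helper names are illustrative.

```python
import torch

def project_linf(x_adv, x, eps):
    # Project x_adv onto the L_inf ball of radius eps around x,
    # then clip to the valid input range [0, 1].
    delta = torch.clamp(x_adv - x, -eps, eps)
    return torch.clamp(x + delta, 0.0, 1.0)

def project_l2(x_adv, x, eps):
    # Project x_adv onto the L_2 ball of radius eps around x.
    delta = x_adv - x
    norms = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
    scale = (eps / norms).clamp(max=1.0).view(-1, *([1] * (x.dim() - 1)))
    return x + delta * scale
```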
How Do We Construct Them?
- Black Box vs. White Box:
- White Box: We know the model and its parameters.
- Black Box: We only know the input and output, e.g., an API.
Related concept: unlearning, i.e., removing sensitive user information from a trained model, which can be reversed with adversarial attacks.
Optimization Formulation
Formulated as an optimization problem: \[ x^* = \arg\max_{\hat{x} \in P(x)} \text{loss}(f(\hat{x}), y) \] where the loss is the 0-1 loss or a surrogate loss such as cross-entropy.
This is solved like standard gradient-based training, except that we update the input instead of the weights; with the constraint \(\hat{x} \in P(x)\) this becomes projected gradient ascent.
Fast Gradient Sign Method
\[ x' = \text{projection}(x + \eta \cdot \text{sign}(\nabla_x \text{ loss}(f(x), y))) \]
When \(P(x)\) is an \(\epsilon\)-ball in the \(L_\infty\) norm, setting \(\eta = \epsilon\) yields a valid perturbation in a single step, so the projection is often unnecessary.
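A minimal FGSM sketch along these lines, assuming a PyTorch `model` that returns logits and inputs in \([0, 1]\):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step L_inf attack: x' = clip(x + eps * sign(grad_x loss))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    # With eta = eps the step stays inside the L_inf ball, so only the
    # valid input range has to be enforced.
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```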
Alternative Optimization Problem
\[ \min_{x'} D(x, x') \quad \text{s.t. } l_{01}(f(x'), y) > 0 \]
Finding the closest adversarial example is difficult. Using a Lagrangian relaxation: \[ \min_{x'} \; D(x, x') + \lambda \, \text{loss}(f(x'), y) \]
Margin Loss
\[ L(x', y) = [Z(x')_y - \max_{t \neq y} Z(x')_t]_+ \] where \(Z\) denotes the model's logits (unnormalized log probabilities). Minimizing this loss raises the logit of some other class above that of the true class until the input is misclassified, at which point the loss is zero.
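Putting the Lagrangian formulation and the margin loss together gives an attack in the spirit of Carlini-Wagner. The sketch below is a simplified version (fixed \(\lambda\), plain Adam on the perturbation, squared \(L_2\) distance), not the full method.

```python
import torch
import torch.nn.functional as F

def margin_loss(logits, y):
    # [Z(x')_y - max_{t != y} Z(x')_t]_+ ; zero once the input is misclassified.
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    mask = F.one_hot(y, logits.size(1)).bool()
    best_other = logits.masked_fill(mask, float("-inf")).max(dim=1).values
    return torch.clamp(true_logit - best_other, min=0.0)

def cw_style_attack(model, x, y, lam=1.0, steps=100, lr=0.01):
    # Minimize  D(x, x') + lambda * margin_loss  over the perturbation delta.
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = torch.clamp(x + delta, 0.0, 1.0)
        dist = (delta.flatten(1) ** 2).sum(dim=1)        # squared L2 distance
        loss = (dist + lam * margin_loss(model(x_adv), y)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.clamp(x + delta, 0.0, 1.0).detach()
```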
Adversarial Attacks on GNNs
We can use both node features and edges to construct AA. Graph structure is a discrete object, making structure attacks challenging.
- Targeted Attack: Perturb one node (or its neighbors) to misclassify a specific target node (e.g., Nettack).
- Global Attack: Perturb the graph to degrade classification across many nodes at once.
Adversarial Attack for Model Evaluation
- Evaluates model robustness (accuracy under worst-case noise).
- The attack must be specific to the model.
Improving Robustness
Ineffective Methods
- Post hoc prevention of attacks: adding defenses to an already-trained model, e.g., gradient obfuscation (randomizing or hiding gradients so gradient-based attacks fail). Such defenses can typically be broken by adaptive attacks.
- Out-of-distribution detection: using anomaly detection to flag inputs from a different distribution, which is often not effective against adversarial examples.
Robust Training
- Optimizing for the worst-case scenario: Known as robust loss optimization, minimizing the worst-case loss.
- Adversarial examples as proxies: Using adversarial examples to approximate worst-case perturbations.
Robust Loss
- Non-robust loss: \[ R = \mathbb{E}_{(x, y) \in \text{data}} [l(f(x), y)] \]
- Robust loss: \[ R = \mathbb{E} [ \sup_{x' \in P(x)} l(f(x'), y) ] \]
To perform Stochastic Gradient Descent (SGD) on the robust loss, we need \[ \nabla_\theta R = \mathbb{E} \left[ \nabla_\theta \sup_{x' \in P(x)} l(f_\theta(x'), y) \right]. \]
By Danskin's theorem, the gradient of the inner supremum equals the gradient of the loss evaluated at a maximizer \(x'^*\). In practice, the maximizer is approximated with an adversarial attack such as the Fast Gradient Sign Method (FGSM) or multi-step projected gradient ascent, and we differentiate through the loss at that point.
Algorithm
- Sample a data point \((x, y)\).
- Find \(x'\) using an adversarial attack procedure.
- Update the model weights via gradient descent on \(x'\).
Each adversarial example is generated against the current weights, so this can be viewed as adaptive data augmentation.
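A sketch of one such training step, reusing the `fgsm` helper from above as the inner attack (a multi-step attack is the stronger but slower choice):

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps):
    # 1) Generate adversarial examples against the *current* weights.
    model.eval()
    x_adv = fgsm(model, x, y, eps)
    # 2) Take a gradient step on the adversarial batch.
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```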
Pros:
- Empirically increases model robustness.
- Easy to implement.
Cons:
- Slows down training by approximately 10x with powerful attacks.
- Often results in lower clean accuracy.
- No theoretical guarantees on final robustness.
Improving Robustness for GNNs
Standard robustness approaches often fail for Graph Neural Networks (GNNs).
Certifiable Robustness
While adversarial training can improve robustness, it is crucial to certify that models are reliably robust.
Exact Certification
The goal is to develop algorithms that decide whether a classifier admits any adversarial example within a given radius. We want an “if and only if” statement: certified exactly when no adversarial example exists.
For networks with ReLU activations, exact certification can be formulated as a Mixed-Integer Linear Program (MILP), but the verification problem is NP-complete.
Expressing Verification as MILP
- Classification margin: \[ m_t = f(x)_{c^*} - f(x)_t \] where \(c^*\) is the class predicted at \(x\).
- Worst-case margin: \[ m_t^* = \min_{x'} \; f(x')_{c^*} - f(x')_t \quad \text{subject to} \quad \|x - x'\| \leq \epsilon \]
To certify robustness:
- If \(m_t^* > 0\) for all \(t \neq c^*\), the classifier is robust at \(x\).
- Otherwise, an adversarial example exists.
Convex Relaxation
By solving a relaxed version of the problem, we obtain a lower bound \(\tilde{m}_t^* \leq m_t^*\) in polynomial time:
- If \(\tilde{m}_t^* > 0\), robustness is certified, exactly as before.
- If not, the result is inconclusive and further analysis is required.
Since the relaxation drops constraints, we optimize over a larger feasible set, so the relaxed margin can only be smaller than (or equal to) the exact one. Concretely, each ReLU is replaced by three linear constraints on its input-output pair (for pre-activation bounds \(l < 0 < u\), the standard triangle relaxation uses \(y \geq 0\), \(y \geq x\), and \(y \leq \frac{u(x - l)}{u - l}\)). Note that ReLU is a convex function, meaning the set above its graph (its epigraph) is convex, but the graph itself, i.e., the set of input-output pairs, is not a convex set, which is why the relaxation is needed. The resulting problem is a linear program.
If the relaxed margin is negative, we cannot tell whether the relaxation is simply loose or the network is genuinely not robust. We can feed the relaxed minimizer \(\tilde{x}\) into the network: if the prediction changes, \(\tilde{x}\) is an actual adversarial example and the network is not robust; if it does not change, \(\tilde{x}\) was not a real attack and the question remains open.
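The LP relaxation itself requires an LP solver; as a cheaper illustration of the same certify-if-the-lower-bound-is-positive logic, here is an interval-bound sketch for a ReLU MLP given as lists of weight matrices and biases. It is coarser than the triangle relaxation above, but it also lower-bounds the worst-case margins in polynomial time.

```python
import torch

def interval_margin_lower_bounds(weights, biases, x, eps, c_star):
    """Lower-bound m_t = f(x')_{c*} - f(x')_t over ||x' - x||_inf <= eps
    by propagating interval bounds through affine and ReLU layers."""
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(zip(weights, biases)):
        mid, rad = (lo + hi) / 2, (hi - lo) / 2
        mid, rad = W @ mid + b, W.abs() @ rad
        lo, hi = mid - rad, mid + rad
        if i < len(weights) - 1:             # ReLU on hidden layers only
            lo, hi = lo.clamp(min=0), hi.clamp(min=0)
    # f(x')_{c*} >= lo[c*] and f(x')_t <= hi[t], so m_t >= lo[c*] - hi[t].
    margins = lo[c_star] - hi
    margins[c_star] = float("inf")
    return margins    # certified robust if margins.min() > 0
```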
The same idea carries over to GNNs: graph structure and node attributes can be perturbed simultaneously, and the nodes are not i.i.d. For discrete data, the \(L_0\) norm (counting the number of changed entries) is a natural choice of perturbation budget.
Duality is used for training: the dual of the relaxed LP is again a linear program (and strong duality holds for LPs). Solving the primal gives the relaxed \(m^*\) but not a convenient gradient with respect to the model parameters. By weak duality, any feasible dual solution is a lower bound on the relaxed \(m^*\), and evaluated at a fixed dual point this bound is an explicit, differentiable function of the parameters, so we can compute gradients through the dual formulation instead of through an inner optimization.
If the dual lower bound is too loose, we end up making the model more robust than intended, which can hurt clean accuracy.
Lipschitz Continuity
Adversarial robustness can also be analyzed through the lens of Lipschitz continuity:
- A function \(f\) is Lipschitz continuous if \[ D(f(x_1), f(x_2)) \leq K \, D(x_1, x_2) \] where \(K\) is the Lipschitz constant.
- The smaller \(K\) is, the smoother the function.
- Adversarial examples exploit a lack of smoothness: small input changes produce large output changes.
- The two occurrences of \(D\) can be different metrics on the input and output spaces.
- The Lipschitz constant of a network can be bounded by the product of the operator norms of its weight matrices.
We can bound the Lipschitz constant of a feed-forward network by the product of per-layer constants: \[ L(f) \leq \prod_i L(f_i) \] where \(f_i\) is the \(i\)-th layer. For an affine layer \(x \mapsto Wx + b\), the Lipschitz constant is the operator norm of \(W\) (\(\sup_{x \neq 0} \|Wx\| / \|x\|\)); ReLU itself has Lipschitz constant 1.
If \(f\) has Lipschitz constant \(K\), then \[ \|x - \hat{x}\| < \frac{\epsilon}{K} \quad \Longrightarrow \quad \|f(x) - f(\hat{x})\| < \epsilon. \]
This provides a maximum certifiable radius of \(\frac{\epsilon}{K}\): this is our certificate. However, forcing a small Lipschitz constant trades off the expressivity of the network.
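A sketch of the product bound and the resulting certificate for a feed-forward network, assuming \(L_2\) norms (the operator norm of a linear layer is then its spectral norm) and assuming `eps` is the output change needed to flip the prediction, as in the text:

```python
import torch

def lipschitz_upper_bound(model):
    # Product of spectral norms over the linear layers; ReLU contributes a factor of 1.
    k = 1.0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            k *= torch.linalg.matrix_norm(module.weight, ord=2).item()
    return k

# Usage sketch:
# K = lipschitz_upper_bound(model)
# certified_radius = eps / K   # inside this radius the output changes by less than eps
```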
Random Smoothing
Random smoothing transforms a base classifier \(f\) into a smoothed classifier \(g\): \[ g(x) = \arg\max_c P(f(x + \epsilon) = c), \qquad \epsilon \sim N(0, \sigma^2 I). \] This approach certifies robustness by ensuring predictions remain stable under Gaussian noise.
Because the smoothed classifier takes a majority vote over a Gaussian neighborhood, we can derive a certificate for \(g\). Write \(g(x)_c = P(f(x + \epsilon) = c)\); then \[ c^* = \arg\max_c g(x)_c \] is the most-voted class and \[ p^* = \max_c g(x)_c = g(x)_{c^*}. \]
We want \[ \arg\max_c g(x + \delta)_c = c^* \quad \text{for all } \|\delta\|_2 < r. \]
We want a certificate that works for any classifier, a black-box certificate: it should not require the weights or the architecture, only the outputs.
Since we don’t know what our classifier does, we assume the worst case:
- Consider all classifiers \(h\) that satisfy \[ P(h(x + \epsilon) = c^*) = p^* \] under the Gaussian noise.
- Minimize over this class: \[ \min_h P(h(x + \delta + \epsilon) = c^*). \]
- The minimizing \(h\) has the same \(p^*\) at \(x\), but at \(x + \delta\) its probability of predicting \(c^*\) may drop below 1/2, in which case the prediction is no longer robust.
Our actual classifier \(f\) belongs to this set, so its true probability at \(x + \delta\) is never below this worst-case value.
The worst case is attained by a linear classifier whose decision boundary is a hyperplane perpendicular to \(\delta\): \[ p^*_{x + \delta} = \Phi\!\left(\Phi^{-1}(p^*) - \frac{\|\delta\|_2}{\sigma}\right) \] where \(\Phi\) is the standard normal CDF. If \(p^*_{x + \delta} \geq 0.5\), we are still robust; this yields the certified radius \(r = \sigma \, \Phi^{-1}(p^*)\).
The smoothed classifier is a convolution of \(f\) with a Gaussian (the Weierstrass transform), whose Lipschitz constant is at most \(\sqrt{2/\pi} < 1\) for \(\sigma = 1\), so smoothing indeed makes the classifier smoother.
How to determine \(p^*\):
- First, find the candidate class \(c^*\) by sampling a small number of noisy inputs.
- Then draw more samples, count how often \(f\) predicts \(c^*\), and take the lower end of a one-sided confidence interval for that frequency as \(p^*\).
So with probability at least \(1 - \alpha\), the smoothed classifier is robust within the certified radius.
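A sketch of this Monte Carlo certification procedure: guess the top class with a few samples, estimate a one-sided Clopper-Pearson lower bound on \(p^*\) with more samples, and convert it into a certified \(L_2\) radius. The sample sizes and the abstention rule are illustrative choices.

```python
import torch
from scipy.stats import beta, norm

@torch.no_grad()
def certify(model, x, num_classes, sigma, n0=100, n=1000, alpha=0.001):
    """Returns (c_star, certified L2 radius), or (None, 0.0) when abstaining."""
    def class_counts(num_samples):
        noise = sigma * torch.randn(num_samples, *x.shape)
        preds = model(x.unsqueeze(0) + noise).argmax(dim=1)
        return torch.bincount(preds, minlength=num_classes)

    c_star = class_counts(n0).argmax().item()      # 1) guess the top class
    k = class_counts(n)[c_star].item()             # 2) count hits for c_star
    # One-sided (1 - alpha) Clopper-Pearson lower confidence bound on p*.
    p_lower = beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0
    if p_lower <= 0.5:
        return None, 0.0                           # cannot certify: abstain
    return c_star, sigma * norm.ppf(p_lower)       # radius sigma * Phi^{-1}(p*)
```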
For Discrete Data and Graphs
Random smoothing for graphs involves flipping edges probabilistically, but this is challenging because graphs are sparse: flipping every entry of the adjacency matrix with the same probability would mostly add edges. A better approach assigns different probabilities for adding and removing edges so that the expected number of edges is preserved.
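A small sketch of this choice: fix the removal probability and set the addition probability so that the expected number of added edges matches the expected number of removed ones (an undirected graph without self-loops is assumed; the names are illustrative).

```python
def flip_probabilities(num_nodes, num_edges, p_remove=0.1):
    """Choose p_add so that expected additions equal expected deletions."""
    num_possible = num_nodes * (num_nodes - 1) // 2   # candidate edge slots
    num_non_edges = num_possible - num_edges
    p_add = p_remove * num_edges / num_non_edges      # much smaller for sparse graphs
    return p_add, p_remove
```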
In conclusion, enhancing robustness in machine learning models involves a mix of robust training, certification techniques, and specialized methods for different data types, such as graphs. These strategies provide a multi-faceted approach to defend against adversarial attacks and ensure reliable model performance.