Introduction
Graph Neural Networks (GNNs), often described as deep learning with symmetries, have gained prominence following successes such as AlphaFold. Despite these advances, challenges remain in the graph domain due to the complexity of graph structures. GNNs provide effective methods for graph-based classification and prediction tasks.
Graph Classification - Semi-Supervised Learning
Tasks
- Classifying a protein as an enzyme or a non-enzyme
- Detecting fraud in a financial transaction
- Predicting users’ preferences in a social network
Collective Classification
- A graph represents a social network
- Labels are known for some nodes (labeled nodes)
- Goal: Classify the remaining nodes
- Standard assumption: Nodes with similar labels are likely to be connected (smoothness assumption or homophily)
Label Propagation
- \(V = S \cup U\)
- \(S\): Seed nodes
- \(U\): Unlabeled nodes
- \(W\): Symmetric weight adjacency matrix
- \(y_i \in \{0,1\}\): Node labels
- Idea: Energy minimization
- \[E = \sum_{i,j} W_{ij}(y_i - y_j)^2\]
- Smoothness: Adjacent nodes should have the same class
- Subject to \(y_i = \hat{y}_i\) for \(i \in S\)
- Constrained integer optimization problem
- Rewritten as \(\min_y y^T L y\), where \(L = D - W\) is the graph Laplacian
- Continuous relaxation: \(y_i \in [0,1]\)
- \[L = \begin{bmatrix} L_{ss} & L_{su} \\ L_{us} & L_{uu} \end{bmatrix}\]
- \[y = \begin{bmatrix} y_s \\ y_u \end{bmatrix} = \begin{bmatrix} \hat{y}_s \\ y_u \end{bmatrix}\]
- \(y_u = -L_{uu}^{-1} L_{us} y_s\) (closed-form solution)
- Unconstrained optimization problem
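As an illustration, here is a minimal NumPy sketch of the closed-form solution \(y_u = -L_{uu}^{-1} L_{us} y_s\); the tiny graph, seed set, and labels are made up for the example.

```python
# Minimal NumPy sketch of label propagation with hard seed constraints.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # symmetric weight matrix
L = np.diag(W.sum(axis=1)) - W              # graph Laplacian L = D - W

seeds = np.array([0, 3])                    # labeled nodes S
unlabeled = np.array([1, 2])                # unlabeled nodes U
y_s = np.array([1.0, 0.0])                  # observed seed labels

L_uu = L[np.ix_(unlabeled, unlabeled)]
L_us = L[np.ix_(unlabeled, seeds)]
y_u = -np.linalg.solve(L_uu, L_us @ y_s)    # closed-form solution for y_u
print(y_u)                                  # relaxed labels in [0, 1]
```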
Generalization to \(k\) Labels
- \(y_{ik} = 1\) if node \(i\) is of class \(k\)
- \[E(y) = \sum_{i,j} W_{ij} \|y_i - y_j\|^2\]
- Compatibility matrix \(H\)
- Models the compatibility between labels
- \[E(y) = \sum_{i,j} W_{ij} (y_i - y_j)^T H (y_i - y_j)\]
- Example: \[H = \begin{bmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{bmatrix}\]
- Allows modeling heterophily: Nodes with different labels are likely to be connected (e.g., leaders and followers)
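A small NumPy sketch of this energy with the example compatibility matrix \(H\) above; the toy graph and the one-hot labels are invented for illustration.

```python
# Energy with a compatibility matrix H; off-diagonal entries larger than the
# diagonal (heterophily) give lower energy to configurations where connected
# nodes have different labels.
import numpy as np

H = np.array([[0.2, 0.8],
              [0.8, 0.2]])
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
Y = np.array([[1, 0],                       # one-hot label vectors y_i
              [0, 1],
              [1, 0]], dtype=float)

E = sum(W[i, j] * (Y[i] - Y[j]) @ H @ (Y[i] - Y[j])
        for i in range(len(W)) for j in range(len(W)))
print(E)                                    # lower than for the all-equal labeling
```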
Alternative Variant
- Soft constraint on \(y_i = \hat{y}_i\)
- \[\min_{F \in \mathbb{R}^{n \times k}} \sum_{i,j} W_{ij}\|f_i - f_j\|^2 + \mu \sum_i \|f_i - y_i\|^2\]
- \(f_i\): \(i\)-th row of \(F\)
- \(y_i\): \(i\)-th row of \(Y\) (one-hot encoded)
- Normalize \(f_i\) and \(y_i\) by the degree of the node
- \(\mu\): Regularization parameter (softens the constraint)
- Solution: \[F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y\] (see the sketch after this list)
- \(\alpha = \frac{2}{2 + \mu}\)
- Personalized PageRank where \(Y\) encodes the seed labels
- Compute a PageRank score for each class and assign each node the class with the highest score
- \(W D^{-1}\) is the normalized random walk matrix \(M\)
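A NumPy sketch of the soft-constraint solution with the random-walk normalization \(M = W D^{-1}\) used above; the graph, the seed labels, and the value of \(\alpha\) are arbitrary choices for the example.

```python
# Soft-constraint label propagation: F = (1 - a) (I - a W D^{-1})^{-1} Y.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = W @ np.diag(1.0 / W.sum(axis=0))        # random-walk matrix W D^{-1}

Y = np.zeros((4, 2))
Y[0, 0] = 1.0                               # node 0 observed as class 0
Y[3, 1] = 1.0                               # node 3 observed as class 1

alpha = 0.8                                 # alpha = 2 / (2 + mu)
F = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * M, Y)
print(F.argmax(axis=1))                     # class with the highest PPR score per node
```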
Label Propagation vs. Stochastic Block Models (SBM)
- Labels \(y_i\) look like communities \(z_i\)
- Compatibility matrix \(H\) resembles the SBM matrix
- SBM:
- Can have observed labels
- Generative model: Models \(p(W|z)\) and \(p(z)\)
- Can be used to generate new graphs (unlike LP)
- Estimates the parameters that generated a given graph (unsupervised)
- Used for link prediction, etc.
- LP:
- Discriminative model: Models \(p(y|W)\)
- Predicts labels of \(U\) given the observed labels and \(W\) (supervised)
- A different way of approaching the problem
Transductive Learning
- Label propagation is a special case of transductive learning
- Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
- May include additional structural information like \(A\)
- Goal: Predict labels of only the unlabeled set
- Easier task: We are not solving the general problem for arbitrary new graphs and new labels
- Vapnik's principle: When solving a problem, do not solve a more general problem as an intermediate step
Transductive Learning vs. Semi-Supervised Learning
- LP is also a form of graph-based semi-supervised learning (more precisely: semi-supervised transductive learning)
- Semi-supervised learning: Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
- Goal: Use both \(T\) and \(U\) to learn a mapping \(f\)
- \(f\) can be transductive or inductive
- Transductive learning is almost always semi-supervised because both \(T\) and \(U\) are used
Why Does Semi-Supervised Learning Work?
- Unlabeled data helps to better model the data distribution
- Assumptions about the data and labels provide a useful learning signal
- Unlabeled data is typically available in much larger quantities than labeled data
Self-Learning
- Learn embeddings first, then the classifier
Attributed Graphs
- Each node \(v_i\) has a feature vector \(x_i \in \mathbb{R}^n\)
- Traditional methods: Markov random fields (similar to HMMs)
- Latent variables are not sequential but graph-structured
- Parts of the latent variables are observed
- Modern methods: Graph Neural Networks (GNNs)
Graph Construction
- Handle vector data when no structure is available
- Construct a graph connecting similar data points (similarity function)
- Apply label propagation
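A minimal sketch of this pipeline, assuming a Gaussian similarity function and an arbitrary sparsification threshold; any of the label-propagation variants above can then be run on the resulting weight matrix.

```python
# Construct a similarity graph from plain vector data, then apply label propagation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                          # unstructured vector data
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 0.5)                           # Gaussian similarity (bandwidth 0.5)
np.fill_diagonal(W, 0.0)                              # no self-loops
W[W < 0.1] = 0.0                                      # sparsify: drop weak edges
# W can now be plugged into the closed-form solution y_u = -L_uu^{-1} L_us y_s.
```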
Graph Classification vs. Node Classification
- Node classification: Classify individual nodes
- Graph classification: Classify the entire graph as a single instance (standard iid problem)
Graph Kernel Methods
- Shortest path kernel
- Random walk kernel
- Use an SVM or GNNs
Node Embeddings
- Alternative to specialized ML methods for graphs
- Embed nodes as vectors, then apply traditional ML methods (e.g., k-means, SVM) designed for iid data
Encoding Graph Structure in a Vector
- Naive approaches:
- Adjacency matrix as an image: Not permutation invariant
- Concatenating features of neighbors: Variable length, \(O(n^2)\) scaling
- Learning node representations (like word embeddings)
- \(\phi: V \rightarrow \mathbb{R}^d\)
- Similar to graph clustering with \(d\) smallest eigenvectors of the Laplacian
- Close in graph space \(\Rightarrow\) Close in embedding space
Applications of Node Embeddings
- Clustering: k-means
- Semi-supervised classification: Classify remaining nodes based on learned embeddings
- Link prediction: Measure likelihood of a link between nodes in embedding space
Types of Embeddings
- Role-based embeddings: Nodes with similar roles are close, even if far in the graph
- Spectral embeddings: Nodes have similar embeddings if in the same cluster
- Deep embeddings: Similarity depends on the loss function and architecture
Spectral Embeddings
- \(L \in \mathbb{R}^{|V| \times |V|}\) is projected to \(\mathbb{R}^{|V| \times k}\) using the eigenvectors of the \(k\) smallest eigenvalues of \(L\)
- Linear transformation, similar to PCA
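A short NumPy sketch of spectral embeddings on a made-up graph: project the nodes onto the eigenvectors belonging to the \(k\) smallest eigenvalues of the Laplacian.

```python
# Spectral node embeddings from the k smallest eigenvectors of the Laplacian.
import numpy as np

W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W              # unnormalized Laplacian

eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues sorted ascending
k = 2
embedding = eigvecs[:, :k]                  # |V| x k node embeddings
print(embedding)
```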
Deep Embeddings
- Inspired by NLP methods (e.g., word2vec)
- DeepWalk: Transform the graph into random walks and apply word2vec
- For every node \(v_i\), sample multiple random walks
- Random walks correspond to sentences (sequences of discrete tokens)
- Result: Nodes close in the graph are close in the embedding space
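A hedged sketch of the DeepWalk idea: sample random walks from an adjacency matrix and feed them to word2vec as sentences. The walk parameters are arbitrary, and the (commented) word2vec call assumes gensim 4.x is available.

```python
# DeepWalk-style random walks; each walk is a "sentence" of node tokens.
import numpy as np

def sample_walks(W, walks_per_node=10, walk_length=20, seed=0):
    rng = np.random.default_rng(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in range(W.shape[0]):
            walk = [start]
            for _ in range(walk_length - 1):
                neighbors = np.flatnonzero(W[walk[-1]])
                if len(neighbors) == 0:
                    break
                walk.append(int(rng.choice(neighbors)))
            walks.append([str(v) for v in walk])   # word2vec expects string tokens
    return walks

# Feeding the walks to word2vec (assumes gensim >= 4; W is an adjacency matrix as above):
# from gensim.models import Word2Vec
# model = Word2Vec(sample_walks(W), vector_size=64, window=5, min_count=0, sg=1)
# model.wv["3"]   # embedding of node 3
```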
Graph2Gauss
- Map each node \(v_i\) to a Gaussian distribution in embedding space
- Neighbors are closer than second neighbors, preserving ranking
- Learn mapping: \(f: v_i \rightarrow N(\mu(x_i), \sigma(x_i))\)
- \(x_i\): Feature vector of \(v_i\)
- Define loss for each node: Ranking loss (triplet \(u, v, w\))
- \(u\) is closer to \(v\) than to \(w\)
- \[L(u) = \sum_{v, w} \left( E_{uv}^2 + \exp(-E_{uw}) \right)\]
- \[E_{uv} = KL(f(u), f(v))\]
- Result: Distances in graph correlate with distances in embedding space, variances encode uncertainty
- Inductive learning: Can compute embeddings for new, unseen nodes (unlike DeepWalk or spectral embeddings)
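A hedged NumPy sketch of the ranking term for a single triplet \((u, v, w)\) with diagonal Gaussians; the means and variances below are placeholders, whereas in Graph2Gauss they come from an encoder applied to the node features \(x_i\).

```python
# Graph2Gauss-style ranking term E_uv^2 + exp(-E_uw) with diagonal Gaussians.
import numpy as np

def kl_diag(mu1, var1, mu2, var2):
    # KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) )
    return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1.0 + np.log(var2 / var1))

mu_u, var_u = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mu_v, var_v = np.array([0.5, 0.0]), np.array([1.0, 1.0])   # a 1-hop neighbor of u
mu_w, var_w = np.array([3.0, 3.0]), np.array([2.0, 2.0])   # a more distant node

E_uv = kl_diag(mu_u, var_u, mu_v, var_v)
E_uw = kl_diag(mu_u, var_u, mu_w, var_w)
loss = E_uv ** 2 + np.exp(-E_uw)            # small when u is ranked closer to v than to w
print(loss)
```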
Node Embeddings vs. Graph Embeddings
- Leverage node embeddings to get graph embeddings
- Different number of nodes: Use aggregation function (mean, sum, max)
Graph Neural Networks
Introduction
Graph Neural Networks (GNNs) represent a form of deep learning tailored to exploit the inherent symmetries or geometric properties often encoded in graphs. They have demonstrated exceptional performance in various domains due to their flexibility in considering attributes of both nodes and edges.
Graph Convolutional Networks (GCNs)
GCNs are inspired by convolutions on images, where a sliding window aggregates information from neighboring pixels. Analogously, in GCNs information is aggregated from neighboring nodes to update each node's representation.
Graph Convolution
Graph convolutions need to be permutation equivariant and can be defined in multiple ways:
- Spatial GNNs: Utilize differentiable message passing.
- Spectral GNNs: Employ graph signal processing and spectral filtering.
Both methods can achieve equivalent results.
Differentiable Message Passing (Spatial)
This method relies on the local aggregation of node representations. For a graph \(G\) with adjacency matrix \(A\) and node features \(X\):
- Initialization: \(h_v^0 = X_v\)
- Message Passing: For each node \(v\), \[ m_v^k = \sum_{u \in N(v)} M(h_u^{k-1}, h_v^{k-1}, E_{vu}) \]
- Node Update: \[ h_v^k = U(h_v^{k-1}, m_v^k) \] where \(M\) and \(U\) are differentiable functions.
Example formulation: \[ m_v^k = \sum_{u \in N(v)} \frac{W^k h_u^{k-1} + b^k}{d_v} \] \[ h_v^k = \text{ReLU}(Q^k h_v^{k-1} + m_v^k + p^k) \] Trainable parameters: \(W^k\), \(Q^k\), \(p^k\), and \(b^k\).
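A minimal NumPy sketch of one message-passing step using the example formulation above; the weight matrices are random placeholders rather than trained parameters.

```python
# One message-passing step: m_v = sum_u (W h_u + b) / d_v, h_v' = ReLU(Q h_v + m_v + p).
import numpy as np

def message_passing_step(A, H, W_k, b_k, Q_k, p_k):
    n = A.shape[0]
    deg = A.sum(axis=1)
    H_new = np.zeros((n, W_k.shape[0]))
    for v in range(n):
        neighbors = np.flatnonzero(A[v])
        m_v = sum(W_k @ H[u] + b_k for u in neighbors) / max(deg[v], 1.0)  # messages
        H_new[v] = np.maximum(0.0, Q_k @ H[v] + m_v + p_k)                 # ReLU update
    return H_new

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))                        # h_v^{k-1} for all nodes
W_k, Q_k = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
b_k, p_k = rng.normal(size=4), rng.normal(size=4)
print(message_passing_step(A, H, W_k, b_k, Q_k, p_k))
```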
Graph Convolutional Network (GCN)
GCNs implement a specific form of message passing: \[ H^{l+1} = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^l W^l) \] where \(\hat{A} = A + I\) (adjacency matrix with self-loops) and \(\hat{D}\) is the diagonal degree matrix of \(\hat{A}\).
Transformation Depth vs. Propagation Depth:
- Propagation depth: Number of hops considered for information aggregation.
- Transformation depth: Number of stacked layers, which can involve more complex neural networks beyond linear transformations.
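A compact NumPy sketch of the GCN propagation rule above, with random placeholder weights.

```python
# GCN layer: H' = ReLU(D^{-1/2} (A+I) D^{-1/2} H W) with self-loops added.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # normalized adjacency
    return np.maximum(0.0, A_norm @ H @ W)                       # ReLU as the nonlinearity

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))     # H^l
W = rng.normal(size=(4, 2))     # W^l
print(gcn_layer(A, H, W))
```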
Spectral Graph Convolution
Given the normalized Laplacian \(L = I - D^{-1/2} A D^{-1/2}\) with eigendecomposition \(L = U \Lambda U^T\):
- Graph Fourier transform: \(\hat{x} = U^T x\)
- Inverse graph Fourier transform: \(x = U \hat{x}\)
Spectral filtering on graphs is defined as: \[ y = U g(\Lambda) U^T x \] where \(g\) is a spectral filter. For example, choosing \(g(\Lambda) = \Lambda\) results in 1-hop neighborhood aggregation, while \(g(\Lambda) = \Lambda^k\) results in \(k\)-hop aggregation.
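A NumPy sketch of spectral filtering \(y = U g(\Lambda) U^T x\) on a toy graph, using the polynomial filter \(g(\lambda) = \lambda^2\) as an example.

```python
# Spectral filtering y = U g(Lambda) U^T x on the normalized Laplacian.
import numpy as np

def spectral_filter(A, x, g):
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(A.shape[0]) - A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)                 # L = U Lambda U^T
    return U @ (g(lam) * (U.T @ x))            # filter applied in the spectral domain

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0])                  # a signal on the nodes
print(spectral_filter(A, x, lambda lam: lam ** 2))   # polynomial filter g(lambda) = lambda^2
```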
GCNs can be seen as a first-order approximation of localized spectral filters.
Transferability:
- Transferability to new graphs (inductive learning) is challenging due to the global nature of spectral properties: adding a single node already alters the spectrum.
Semi-Supervised Node Classification
In semi-supervised node classification, the final layer outputs logits, with softmax and cross-entropy loss used only on labeled nodes. The entire graph is used to propagate embeddings.
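A hedged PyTorch sketch of this setup: logits are computed for every node, but the cross-entropy loss is evaluated only on the labeled ones. The tensors are placeholders standing in for the output of a GNN.

```python
# Semi-supervised node classification loss: cross-entropy only on labeled nodes.
import torch
import torch.nn.functional as F

logits = torch.randn(6, 3, requires_grad=True)    # final-layer logits for all 6 nodes
labels = torch.tensor([0, 2, 1, 0, 0, 1])         # labels (only some are observed)
labeled_mask = torch.tensor([True, True, False, False, True, False])

loss = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
loss.backward()       # in a real GNN, gradients flow back through the full propagation
print(loss.item())
```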
Graph classification:
- Aggregate all node embeddings and use a Multi-Layer Perceptron (MLP) for classification.
Equivariant Machine Learning
For GNNs, the node order should not affect the representation or the classification. If two graphs \(G_1 = (A_1, X_1)\) and \(G_2 = (A_2, X_2)\) are isomorphic, i.e., related by a permutation matrix \(P\), then:
- Invariant GNN: \(f(PX, PAP^T) = f(X, A)\)
- Equivariant GNN: \(f(PX, PAP^T) = Pf(X, A)\)
In the message passing framework, this is achieved with aggregation functions that operate on sets, such as sum or max, which are invariant to the ordering of the neighbors.
Spatial Graphs
Spatial graphs have nodes and edges corresponding to geometric objects in vector spaces with metrics (e.g., molecules with atoms as nodes and bonds as edges). For these graphs, invariance to rotations and translations is necessary, leading to \(E(n)\)-equivariant GNNs, where \(E(n) = \{x \mapsto Rx + t : R \in O(n), t \in \mathbb{R}^n\}\) is the Euclidean group.
Limitations of Graph Neural Networks
- Expressiveness:
Message-passing GNNs are at most as expressive as the Weisfeiler-Lehman (WL) test for graph isomorphism, and often weaker. Exact graph isomorphism testing itself is hard: no polynomial-time algorithm is known, and it is unclear whether the problem is in P or NP-complete.
WL Test:
- Label all nodes with 1.
- Collect neighboring labels in a labeled multiset.
- Relabel each node by hashing its current label concatenated with the multiset (using some hash function).
- Compare the sorted node labels or plot the labels in a histogram.
- If the histograms are different, the graphs are not isomorphic. If they are the same, repeat the process. Once the labeling converges without revealing a difference, we stop without a definitive answer: the WL test is only a heuristic.
Standard GNNs are usually weaker than the WL test; max aggregation, for example, loses information about the neighbor multiset. Expressiveness can be improved with carefully designed (injective) aggregation functions or by using higher-order structures such as triangles.
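A minimal Python sketch of the one-dimensional WL color refinement described above; the two toy graphs are made up, and the number of refinement iterations is fixed for simplicity.

```python
# One-dimensional WL color refinement as a (heuristic) isomorphism test.
from collections import Counter

def wl_histogram(adj, iterations=3):
    # adj: dict mapping each node to a list of its neighbors
    labels = {v: 1 for v in adj}                       # start: every node labeled 1
    for _ in range(iterations):
        labels = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in adj}                        # relabel with hashed multiset
    return Counter(labels.values())                    # histogram of node labels

G1 = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}      # triangle with a pendant node
G2 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}            # path on four nodes
if wl_histogram(G1) != wl_histogram(G2):
    print("not isomorphic")                            # differing histograms are conclusive
else:
    print("possibly isomorphic (the heuristic cannot decide)")
```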
- Oversmoothing:
- GCN-like models tend to oversmooth predictions with increasing depth. In the limit of infinitely many layers, a node's influence becomes proportional to the stationary distribution of a random walk on the graph (PageRank). The influence is then no longer local: very deep GCNs cannot represent local neighborhoods.
- Oversquashing:
- GNNs often fail to propagate messages over long distances due to bottleneck behavior. An exponential amount of information is squashed into a fixed-size vector. This can result in low performance on long-range dependency tasks.
- Oversmoothing vs. Oversquashing:
- Fully connected graphs: oversmoothing but not oversquashing.
- Directed tree graphs: oversquashing but not oversmoothing (subtrees are independent).
PageRank in the Message Passing Framework
\[ H^{l+1} = D^{-1/2} A D^{-1/2} \sigma(H^l W^l) \] In this standard formulation, each layer combines transformation and propagation. Replacing the propagation with personalized PageRank prevents oversmoothing by keeping each node's influence focused on its local neighborhood.
Personalized Propagation of Neural Predictions (PPNP)
PPNP separates transformation and propagation:
Transformation: \[ H_0 = f(X) \] where \(f\) is modeled by a neural network.
Personalized Propagation: \[ H^{l+1} = (1-\alpha)D^{-1/2} A D^{-1/2} H^l + \alpha H_0 \]
Personalized Teleportation: \[ r^{t+1} = (1-\alpha)AD^{-1}r^t + \alpha p_i \] where \(p_i\) is the \(i\)-th unit vector.
Final prediction using personalized PageRank: \[ z = \text{softmax}\left(\alpha(I - (1-\alpha)D^{-1/2} A D^{-1/2})^{-1} H_0\right) \] Here, \(H_0\) is the output of the neural network, and personalized PageRank models the influence of nodes for each class.
This approach improves scalability because \(H_0\) is modeled with a normal neural network and can be distributed. The propagation then operates on a low-dimensional latent space. It prevents oversmoothing and provides a clear separation of transformation and propagation (infinite propagation depth using the stationary distribution).
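A NumPy sketch of the separation described above: a placeholder network produces \(H_0\), and personalized-PageRank propagation is approximated by a few power-iteration steps of \(H^{l+1} = (1-\alpha)\,D^{-1/2} A D^{-1/2} H^l + \alpha H_0\).

```python
# PPNP/APPNP sketch: neural prediction H0, then personalized PageRank propagation.
import numpy as np

def appnp_propagate(A, H0, alpha=0.2, num_steps=10):
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    H = H0
    for _ in range(num_steps):                               # approximates the PPR limit
        H = (1 - alpha) * A_norm @ H + alpha * H0
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 5))
W1 = rng.normal(size=(5, 3))
H0 = np.maximum(0.0, X @ W1)       # stand-in for f(X), the neural network output
Z = appnp_propagate(A, H0)
print(Z.argmax(axis=1))            # predicted class per node (argmax of the softmax)
```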
Connection to Label Propagation
PPNP is related to label propagation, where \(H_0\) plays the role of the label matrix \(Y\). Here, however, we do not propagate observed labels; the labels are predicted by the network instead.
Scalability and Reliability
- GNNs require extensive communication and synchronization, making scalability to large graphs challenging.
- Susceptible to adversarial attacks.
- Adding nodes requires significant recomputation due to the small-world phenomenon.
Graph Transformers
Graph Transformers address oversquashing issues by using positional encodings to encode graph structure for attention mechanisms.
Position Encodings
In sequences, sinusoidal position encodings are used: \[ \text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \] where \(d_{\text{model}}\) is the model dimension, \(\text{pos}\) is the position in the sequence, and \(i\) is the dimension index of the encoding. This is inspired by the Discrete Fourier Transform (DFT). For graphs, an analogous choice is to use eigenvectors of the graph Laplacian as positional encodings.
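A NumPy sketch of the sinusoidal encodings above; following the original Transformer, even dimensions use \(\sin\) and odd dimensions the corresponding \(\cos\).

```python
# Sinusoidal position encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)).
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]          # positions in the sequence
    i = np.arange(d_model // 2)[None, :]             # dimension index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                     # PE(pos, 2i+1)
    return pe

print(sinusoidal_pe(num_positions=4, d_model=8).shape)   # (4, 8)
```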
Directed Graphs
For directed graphs, the Laplacian is not symmetric, so a real eigendecomposition with orthogonal eigenvectors is not guaranteed.