Machine Learning on Graphs

Notes on Machine Learning techniques for Graphs at TUM 2024
Tags: Graphs, GNNs, TUM

Author: Peter Nutter

Published: 2024-08-02

Introduction

Graph Neural Networks (GNNs), a form of deep learning that exploits symmetries in the data, have gained prominence following successes such as AlphaFold. Despite these advances, challenges remain in the graph domain due to the complexity of graph structures. GNNs provide effective methods for graph-based classification and prediction tasks.

Graph Classification - Semi-Supervised Learning

Tasks

  • Classifying a protein as an enzyme or a non-enzyme
  • Detecting fraud in a financial transaction
  • Predicting users’ preferences in a social network

Collective Classification

  • A graph represents a social network
  • Labels are known for some nodes (labeled nodes)
  • Goal: Classify the remaining nodes
  • Standard assumption: Nodes with similar labels are likely to be connected (smoothness assumption or homophily)

Label Propagation

  • \(V = S \cup U\)
    • \(S\): Seed nodes
    • \(U\): Unlabeled nodes
    • \(W\): Symmetric weight adjacency matrix
    • \(y_i \in \{0,1\}\): binary node labels
  • Idea: Energy minimization
    • \[E = \sum_{ij} W_{ij}(y_i - y_j)^2\]
    • Smoothness: Adjacent nodes should have the same class
    • Subject to \(y_i = \hat{y}_i\) for \(i \in S\)
  • Constrained integer optimization problem
    • Rewritten as \(\min y^T L y\)
    • Relaxation: \(y_i \in [0,1]\) (continuous relaxation of the integer constraint)
    • \[L = \begin{bmatrix} L_{ss} & L_{su} \\ L_{us} & L_{uu} \end{bmatrix}\]
    • \[y = \begin{bmatrix} y_s \\ y_u \end{bmatrix} = \begin{bmatrix} \hat{y}_s \\ y_u \end{bmatrix}\]
    • \(y_u = -L_{uu}^{-1} L_{us} y_s\) (closed-form solution)
    • Unconstrained optimization problem
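A minimal NumPy sketch of this closed-form solution on a toy path graph (the graph, seed choice, and labels are illustrative, not from the notes):

```python
import numpy as np

# Toy symmetric weight matrix: a path graph on 5 nodes.
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W

seed_idx = np.array([0, 4])               # S: nodes with observed labels
unlab_idx = np.array([1, 2, 3])           # U: nodes to be labeled
y_s = np.array([0.0, 1.0])                # observed labels \hat{y}_s

# Closed-form solution: y_u = -L_uu^{-1} L_us y_s
L_uu = L[np.ix_(unlab_idx, unlab_idx)]
L_us = L[np.ix_(unlab_idx, seed_idx)]
y_u = -np.linalg.solve(L_uu, L_us @ y_s)
print(y_u)                                # soft labels; threshold at 0.5 to classify
```

On this toy graph the unlabeled nodes receive soft labels (0.25, 0.5, 0.75) that interpolate between the two seeds.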

Generalization to \(k\) Labels

  • \(y_{ik} = 1\) if node \(i\) is of class \(k\)
  • \[E(Y) = \sum_{ij} W_{ij}\,\lVert y_i - y_j\rVert^2\]
  • Compatibility matrix \(H\)
    • Models the compatibility between labels
    • \[E(Y) = \sum_{ij} W_{ij}\,(y_i - y_j)^T H\,(y_i - y_j)\]
    • Example: \[H = \begin{bmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{bmatrix}\]
      • Using heterophily: Nodes with different labels are likely to be connected (leaders and followers)

Alternative Variant

  • Soft constraint on \(y_i = \hat{y}_i\)
  • \[\min_{F \in \mathbb{R}^{n \times k}} \sum_{ij} W_{ij}\,\lVert f_i - f_j\rVert^2 + \mu \sum_i \lVert f_i - y_i\rVert^2\]
    • \(f_i\): \(i\)-th row of \(F\)
    • \(y_i\): \(i\)-th row of \(Y\) (one-hot encoded)
    • Normalize \(f_i\) and \(y_i\) by the degree of the node
    • \(\mu\): Regularization parameter (softens the constraint)
    • Solution: \[F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y\]
      • \(\alpha = \frac{2}{2 + \mu}\)
      • Personalized PageRank where \(Y\) encodes the seed labels
      • PageRank score for each class and assign the class with the highest score
      • \(W D^{-1}\) is the normalized random walk matrix \(M\)
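A sketch of this soft-constraint variant in its iterative fixed-point form rather than via the matrix inverse; \(W\), \(Y\), and the default \(\alpha\) are placeholders:

```python
import numpy as np

def soft_label_propagation(W, Y, alpha=0.8, n_iter=100):
    """Propagate one-hot seed labels Y (n x k); rows for unlabeled nodes are zero.
    Assumes every node has at least one edge (no division by zero)."""
    M = W / W.sum(axis=0, keepdims=True)       # random-walk matrix W D^{-1}
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (M @ F) + (1 - alpha) * Y  # propagate, then pull back to the seeds
    return F.argmax(axis=1)                    # class with the highest score per node
```

Iterating \(F \leftarrow \alpha M F + (1-\alpha) Y\) converges to the closed-form solution \(F = (1-\alpha)(I - \alpha M)^{-1} Y\) given above.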

Label Propagation vs. Stochastic Block Models (SBM)

  • Labels \(y_i\) look like communities \(z_i\)
  • Compatibility matrix \(H\) resembles the SBM matrix
  • SBM:
    • Generative model: Models \(p(W|z)\) and \(p(z)\)
    • Can incorporate observed labels
    • Can be used to generate new graphs (unlike LP)
    • Estimates the parameters that generated a given graph (unsupervised)
    • Used for link prediction, etc.
  • LP:
    • Discriminative: Models \(p(y|W)\) directly
    • Predicts labels of \(U\) given the observed labels and \(W\) (supervised)
    • Solves a different problem with a different approach

Transductive Learning

  • Label propagation is a special case of transductive learning
  • Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
    • May include additional structural information like \(A\)
    • Goal: Predict labels of only the unlabeled set
    • Easier task: Not solving the general problem for any new graph and any new labels
  • Vapnik Principle: Solve only the specific problem, not a more difficult intermediate step

Transductive Learning vs. Semi-Supervised Learning

  • LP is also a graph-based semi-supervised learning method (more precisely, transductive semi-supervised learning)
  • Semi-supervised learning: Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
    • Goal: Use both \(T\) and \(U\) to learn a mapping \(f\)
    • \(f\) can be transductive or inductive
  • Transductive learning is almost always semi-supervised because both \(T\) and \(U\) are used

Why Does Semi-Supervised Learning Work?

  • Helps better model the distribution
  • Assumptions about the data and labels provide a good learning signal
  • More unlabeled data available

Self-Learning

  • Learn embeddings on all nodes (unsupervised) first, then train the classifier on the labeled nodes

Attributed Graphs

  • Each node has a feature vector \(x_i \in \mathbb{R}^n\)
  • Traditional methods: Markov random fields (similar to HMMs)
    • Latent variables are not sequential but graph-structured
    • Parts of the latent variables are observed
  • Modern methods: Graph Neural Networks (GNNs)

Graph Construction

  • Handle vector data when no structure is available
    • Construct a graph connecting similar data points (similarity function)
    • Apply label propagation
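A sketch of constructing such a graph from plain feature vectors with a Gaussian-kernel k-nearest-neighbour rule; `k` and `bandwidth` are hypothetical hyperparameters:

```python
import numpy as np

def knn_graph(X, k=10, bandwidth=1.0):
    """Symmetric weighted adjacency matrix from vector data X (n x d)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    W = np.zeros_like(d2)
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:k + 1]                   # k nearest, excluding the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * bandwidth ** 2))
    return np.maximum(W, W.T)                               # symmetrize
```

Label propagation from the previous sections can then be run directly on the returned `W`.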

Graph Classification vs. Node Classification

  • Node classification: Classify individual nodes
  • Graph classification: Classify the entire graph as a single instance (standard iid problem)

Graph Kernel Methods

  • Shortest path kernel
  • Random walk kernel
  • Use an SVM or GNNs

Node Embeddings

  • An alternative to specialized graph ML methods: embed each node in a vector space
  • Then apply traditional ML methods (e.g., k-means, SVM) to the embeddings as if the data were iid

Encoding Graph Structure in a Vector

  • Naive approaches:
    • Adjacency matrix as an image: Not permutation invariant
    • Concatenating features of neighbors: Variable length, \(O(n^2)\) scaling
  • Learning node representations (like word embeddings)
    • \(\phi: V \rightarrow \mathbb{R}^d\)
    • Similar to graph clustering with \(d\) smallest eigenvectors of the Laplacian
    • Close in graph space \(\Rightarrow\) close in embedding space

Applications of Node Embeddings

  • Clustering: k-means
  • Semi-supervised classification: Classify remaining nodes based on learned embeddings
  • Link prediction: Measure likelihood of a link between nodes in embedding space

Types of Embeddings

  • Role-based embeddings: Nodes with similar roles are close, even if far in the graph
  • Spectral embeddings: Nodes have similar embeddings if in the same cluster
  • Deep embeddings: Similarity depends on the loss function and architecture

Spectral Embeddings

  • Stack the \(k\) eigenvectors of \(L \in \mathbb{R}^{|V| \times |V|}\) with the smallest eigenvalues into a \(|V| \times k\) embedding matrix
  • Linear transformation, similar to PCA
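A NumPy sketch of such a spectral embedding, assuming the unnormalized Laplacian and skipping the trivial constant eigenvector (the variant and `k` are illustrative):

```python
import numpy as np

def spectral_embedding(W, k=2):
    """Rows of the returned |V| x k matrix are the node embeddings."""
    L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]              # k smallest, skipping the constant eigenvector
```

Running k-means on the rows of the returned matrix recovers standard spectral clustering.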

Deep Embeddings

  • Inspired by NLP methods (e.g., word2vec)
  • DeepWalk: Transform a graph into random walks and apply word2vec
    • For every node \(v_i\), sample multiple random walks
    • Random walks correspond to sentences (sequences of discrete tokens)
    • Result: Nodes close in the graph are close in the embedding space
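A sketch of the random-walk sampling step; walk length and walks per node are hypothetical defaults, and the resulting token sequences can be fed to any word2vec implementation (e.g., gensim) as if they were sentences:

```python
import random

def sample_walks(adj_list, walks_per_node=10, walk_length=40, seed=0):
    """adj_list: dict mapping node -> list of neighbours. Returns token sequences."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj_list:
            walk = [start]
            while len(walk) < walk_length and adj_list[walk[-1]]:
                walk.append(rng.choice(adj_list[walk[-1]]))
            walks.append([str(v) for v in walk])   # word2vec treats each walk as a sentence
    return walks
```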

Graph2Gauss

  • Map each node \(v_i\) to a Gaussian distribution in embedding space
    • Neighbors are closer than second neighbors, preserving ranking
    • Learn mapping: \(f: v_i \rightarrow N(\mu(x_i), \sigma(x_i))\)
      • \(x_i\): Feature vector of \(v_i\)
    • Define loss for each node: Ranking loss (triplet \(u, v, w\))
      • \(u\) is closer to \(v\) than to \(w\)
      • \[L(u) = \sum_{v, w} E_{uv}^2 + \exp(-E_{uw})\]
      • \[E_{uv} = KL(f(u), f(v))\]
    • Result: Distances in graph correlate with distances in embedding space, variances encode uncertainty
    • Inductive learning: Can produce embeddings for new nodes (unlike DeepWalk or spectral embeddings)
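A NumPy sketch of the ranking loss above for a single triplet \((u, v, w)\), assuming diagonal-covariance Gaussians; the encoder producing \(\mu\) and \(\sigma^2\) from the features is not shown, and the KL direction is one possible choice:

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu1, var1) || N(mu0, var0) ) for diagonal covariances."""
    return 0.5 * np.sum(var1 / var0 + (mu0 - mu1) ** 2 / var0
                        - 1.0 + np.log(var0 / var1))

def triplet_loss(mu_u, var_u, mu_v, var_v, mu_w, var_w):
    """Square-exponential ranking term: node v should be closer to u than w is."""
    E_uv = kl_diag_gauss(mu_u, var_u, mu_v, var_v)
    E_uw = kl_diag_gauss(mu_u, var_u, mu_w, var_w)
    return E_uv ** 2 + np.exp(-E_uw)
```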

Node Embeddings vs. Graph Embeddings

  • Leverage node embeddings to get graph embeddings
  • Different number of nodes: Use aggregation function (mean, sum, max)

Graph Neural Networks

Date: 2024-08-04

Introduction

Graph Neural Networks (GNNs) represent a form of deep learning tailored to exploit the inherent symmetries or geometric properties often encoded in graphs. They have demonstrated exceptional performance in various domains due to their flexibility in considering attributes of both nodes and edges.

Graph Convolutional Networks (GCNs)

GCNs are inspired by convolutions on images, where a sliding window aggregates local information. In GCNs, information is aggregated from neighboring nodes to update each node’s representation.

Graph Convolution

Graph convolutions need to be permutation equivariant and can be defined in multiple ways:

  • Spatial GNNs: Utilize differentiable message passing.
  • Spectral GNNs: Employ graph signal processing and spectral filtering.

Both methods can achieve equivalent results.

Differentiable Message Passing (Spatial)

This method relies on the local aggregation of node representations. For a graph \(G\) with adjacency matrix \(A\) and node features \(X\):

  1. Initialization: \(h_v^0 = X_v\)
  2. Message Passing: For each node \(v\), \[ m_v^k = \sum_{u \in N(v)} M(h_u^{k-1}, h_v^{k-1}, E_{vu}) \]
  3. Node Update: \[ h_v^k = U(h_v^{k-1}, m_v^k) \] where \(M\) and \(U\) are differentiable functions.

Example formulation: \[ m_v^k = \sum_{u \in N(v)} \frac{W^k h_u^{k-1} + b^k}{d_v} \] \[ h_v^k = \text{ReLU}(Q^k h_v^{k-1} + m_v^k + p^k) \] Trainable parameters are \(W^k\), \(Q^k\), \(p^k\), and \(b^k\); \(d_v\) denotes the degree of node \(v\).
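A node-wise NumPy sketch of this example formulation (degree-normalized message aggregation followed by a ReLU update); shapes and names are illustrative:

```python
import numpy as np

def message_passing_step(A, H, W, b, Q, p):
    """A: (n, n) adjacency, H: (n, d_in) node states, W, Q: (d_in, d_out), b, p: (d_out,)."""
    n = A.shape[0]
    H_new = np.zeros((n, Q.shape[1]))
    for v in range(n):
        nbrs = np.nonzero(A[v])[0]
        d_v = max(len(nbrs), 1)
        m_v = sum(H[u] @ W + b for u in nbrs) / d_v         # degree-normalized messages
        H_new[v] = np.maximum(0.0, H[v] @ Q + m_v + p)       # ReLU node update
    return H_new
```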

Graph Convolutional Network (GCN)

GCNs implement a specific form of message passing: \[ H^{l+1} = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^l W^l) \] where \(\hat{A} = A + I\) (adjacency matrix with self-loops) and \(\hat{D}\) is the diagonal degree matrix of \(\hat{A}\).
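A vectorized sketch of a single GCN layer as defined above, using dense NumPy arrays for clarity (real implementations use sparse matrices) and ReLU as \(\sigma\):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)            # propagate, transform, nonlinearity
```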

Transformation depth vs. propagation depth:

  • Propagation depth: Number of hops considered for information aggregation.
  • Transformation depth: Number of layers stacked, which can involve more complex neural networks beyond linear transformations.

Spectral Graph Convolution

Given the normalized Laplacian \(L = I - D^{-1/2} A D^{-1/2}\), with eigendecomposition \(L = U \Lambda U^T\):

  • Graph Fourier transform: \(\hat{x} = U^T x\)
  • Inverse graph Fourier transform: \(x = U \hat{x}\)

Spectral filtering on graphs is defined as: \[ y = U g(\Lambda) U^T x \] where \(g\) is a spectral filter applied to the eigenvalues. For example, choosing \(g(\lambda) = \lambda\) results in 1-hop neighborhood aggregation, while \(g(\lambda) = \lambda^k\) results in \(k\)-hop aggregation.
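A sketch of spectral filtering of a graph signal \(x\) via an explicit eigendecomposition of the normalized Laplacian; the default filter \(g\) is the identity on the eigenvalues, and the graph is assumed to have no isolated nodes:

```python
import numpy as np

def spectral_filter(A, x, g=lambda lam: lam):
    """Apply y = U g(Lambda) U^T x for a symmetric adjacency A and graph signal x."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(A.shape[0]) - A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)                  # graph Fourier basis
    x_hat = U.T @ x                             # graph Fourier transform
    return U @ (g(lam) * x_hat)                 # filter in the spectral domain, transform back
```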

GCNs can be seen as a first-order approximation of localized spectral filters.

Transferability: Generalizing to new graphs (inductive learning) is challenging because spectral properties are global; adding a single node already alters the spectrum, which limits transferability.

Semi-Supervised Node Classification

In semi-supervised node classification, the final layer outputs logits, with softmax and cross-entropy loss used only on labeled nodes. The entire graph is used to propagate embeddings.

Graph Classification: Aggregate all node embeddings and use a Multi-Layer Perceptron (MLP) for classification.

Equivariant Machine Learning

For GNNs, the node order should not affect the representation or the classification. If two graphs \(G_1 = (A_1, X_1)\) and \(G_2 = (A_2, X_2)\) are isomorphic, i.e., \(A_2 = P A_1 P^T\) and \(X_2 = P X_1\) for a permutation matrix \(P\), then:

  • Invariant GNN: \(f(PX, PAP^T) = f(X, A)\)
  • Equivariant GNN: \(f(PX, PAP^T) = Pf(X, A)\)

In the message passing framework, permutation equivariance is preserved by aggregating neighbor messages with permutation-invariant set functions such as sum or max, as the sketch below illustrates numerically.
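A quick numerical check of the equivariance property on a toy GCN layer (repeated here so the snippet is self-contained; sizes and the random graph are arbitrary):

```python
import numpy as np

def gcn_layer(A, X, W):
    """Minimal GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])
    d = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return np.maximum(0.0, (A_hat * d[:, None] * d[None, :]) @ X @ W)

rng = np.random.default_rng(0)
n, d_in, d_out = 6, 4, 3
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = (A + A.T).astype(float)                          # random undirected graph
X = rng.normal(size=(n, d_in))
W = rng.normal(size=(d_in, d_out))
P = np.eye(n)[rng.permutation(n)]                    # random permutation matrix

# Permuting the inputs permutes the outputs in the same way: f(PX, PAP^T) = P f(X, A).
print(np.allclose(gcn_layer(P @ A @ P.T, P @ X, W), P @ gcn_layer(A, X, W)))  # True
```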

Spatial Graphs

Spatial graphs have nodes and edges corresponding to geometric objects in vector spaces with metrics (e.g., molecules with atoms as nodes and bonds as edges). For these graphs, invariance to rotations and translations is also required, leading to \(E(n)\)-equivariant GNNs, where \(E(n) = \{x \mapsto Rx + t : R \in O(n),\, t \in \mathbb{R}^n\}\) is the Euclidean group.

Limitations of Graph Neural Networks

  1. Expressiveness:
    • Standard message-passing GNNs are at most as powerful as the Weisfeiler-Lehman (WL) test at distinguishing non-isomorphic graphs, and are often strictly weaker. Graph isomorphism itself is a hard problem: the best known algorithms have super-polynomial worst-case runtime, and it is open whether the problem is solvable in polynomial time or NP-complete.

    • WL Test:

      • Label all nodes with 1.
      • Collect neighboring labels in a labeled multiset.
      • Relabel the node by hashing the current label concatenated with the multiset with some hash function.
      • Compare the sorted node labels or plot the labels in a histogram.
      • If the histograms differ, the graphs are definitely not isomorphic. If they are equal, repeat the process; once the refinement converges without the histograms ever differing, we stop without an answer. The WL test is only a heuristic and cannot certify isomorphism (a minimal code sketch follows this list).
    • GNNs are usually at most as expressive as the WL test. Aggregation functions such as max lose information about the multiset of neighbor labels; expressiveness can be improved with carefully designed (injective) aggregation or by counting higher-order structures such as triangles.

    • GCN-like models tend to oversmooth predictions with increasing depth. In the limit of infinitely many layers, a node’s influence on another becomes proportional to the stationary distribution of the random walk on the graph (as in PageRank), so the influence is no longer local: very deep GCNs cannot represent local neighborhoods.

  2. Oversquashing:
    • GNNs often fail to propagate messages over long distances due to bottleneck behavior. An exponential amount of information is squashed into a fixed-size vector. This can result in low performance on long-range dependency tasks.
  3. Oversmoothing vs. Oversquashing:
    • Fully connected graphs: oversmoothing but not oversquashing.
    • Directed tree graphs: oversquashing but not oversmoothing (subtrees are independent).
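A minimal sketch of the WL color-refinement heuristic from item 1, using Python's built-in hashing of tuples in place of an explicit hash function; histograms computed in the same process can be compared across two graphs:

```python
from collections import Counter

def wl_histogram(adj_list, n_iter=3):
    """adj_list: dict node -> list of neighbours. Returns the colour histogram."""
    colors = {v: 1 for v in adj_list}                      # step 1: label all nodes with 1
    for _ in range(n_iter):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj_list[v]))))
                  for v in adj_list}                       # relabel: hash of own label + neighbour multiset
    return Counter(colors.values())

# Different histograms => the graphs are certainly not isomorphic;
# equal histograms are inconclusive (the WL test is only a heuristic).
```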

PageRank in the Message Passing Framework

Standard message passing combines transformation and propagation in every layer: \[ H^{l+1} = D^{-1/2} A D^{-1/2} \sigma(H^l W^l) \] In the limit of many layers this propagation behaves like PageRank and oversmooths; replacing it with personalized PageRank keeps the influence focused on local neighborhoods and prevents oversmoothing.

Personalized Propagation of Neural Predictions (PPNP)

PPNP separates transformation and propagation:

  1. Transformation: \[ H_0 = f(X) \] where \(f\) is modeled by a neural network.

  2. Personalized Propagation: \[ H^{l+1} = (1-\alpha)D^{-1/2} A D^{-1/2} H^l + \alpha H_0 \]

  3. Personalized Teleportation: \[ r^{t+1} = (1-\alpha)AD^{-1}r^t + \alpha p_i \] where \(p_i\) is the \(i\)-th unit vector.

    Final prediction using personalized PageRank: \[ z = \text{softmax}\left(\alpha(I - (1-\alpha)D^{-1/2} A D^{-1/2})^{-1} H_0\right) \] Here, \(H_0\) is the output of the neural network, and personalized PageRank models the influence of nodes for each class.

This approach improves scalability because \(H_0\) is modeled with a normal neural network and can be distributed. The propagation then operates on a low-dimensional latent space. It prevents oversmoothing and provides a clear separation of transformation and propagation (infinite propagation depth using the stationary distribution).
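A sketch of the propagation step in its iterative (power-iteration) form, assuming \(H_0 = f(X)\) has already been computed by some neural network, a graph without isolated nodes, and a hypothetical teleport parameter \(\alpha\):

```python
import numpy as np

def ppnp_propagate(A, H0, alpha=0.1, n_iter=10):
    """Iterate H <- (1 - alpha) * A_norm H + alpha * H0, then apply a row-wise softmax."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    H = H0.copy()
    for _ in range(n_iter):
        H = (1 - alpha) * (A_norm @ H) + alpha * H0          # propagate + teleport back to H0
    Z = np.exp(H - H.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)                  # softmax over classes
```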

Connection to Label Propagation

PPNP is related to label propagation, with \(H_0\) playing the role of the label matrix \(Y\): instead of propagating observed labels, PPNP propagates labels predicted by a neural network from the node features.

Scalability and Reliability

  • GNNs require extensive communication and synchronization, making scalability to large graphs challenging.
  • Susceptible to adversarial attacks.
  • Adding nodes requires significant recomputation due to the small-world phenomenon.

Graph Transformers

Graph Transformers address oversquashing by letting every node attend to all others; the graph structure is injected through positional encodings used by the attention mechanism.

Position Encodings

In sequences, sinusoidal position encodings are used: \[ \text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right), \qquad \text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \] where \(d_{\text{model}}\) is the model dimension, \(\text{pos}\) is the position in the sequence, and \(i\) indexes the encoding dimensions. This is inspired by the Discrete Fourier Transform (DFT).
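A sketch of these sinusoidal encodings for a sequence, pairing each sine dimension with its cosine counterpart:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Returns an (n_positions, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(n_positions)[:, None]                    # positions 0..n-1
    i = np.arange(0, d_model, 2)[None, :]                    # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])          # odd dimensions: cosine
    return pe
```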

Directed Graphs

For directed graphs, the Laplacian is not symmetric, so a real, orthogonal eigenbasis is not guaranteed, which complicates Laplacian-based positional encodings.