Introduction
Graph Neural Networks (GNNs), often described as deep learning with symmetries, have gained prominence following successes such as AlphaFold. Despite these advances, challenges remain in the graph domain due to the complexity of graph structures. GNNs provide effective methods for graph-based classification and prediction tasks.
Graph Classification - Semi-Supervised Learning
Tasks
- Classifying a protein as an enzyme or a non-enzyme
- Detecting fraud in a financial transaction
- Predicting users’ preferences in a social network
Collective Classification
- A graph represents a social network
- Labels are known for some nodes (labeled nodes)
- Goal: Classify the remaining nodes
- Standard assumption: Nodes with similar labels are likely to be connected (smoothness assumption or homophily)
Label Propagation
- \(V = S \cup U\)
- \(S\): Seed nodes
- \(U\): Unlabeled nodes
- \(W\): Symmetric weight adjacency matrix
- \(y_i \in \{0,1\}\): Node labels
- Idea: Energy minimization
- \[E = \sum_{i,j} W_{ij}(y_i - y_j)^2\]
- Smoothness: Adjacent nodes should have the same class
- Subject to \(y_i = \hat{y}_i\) for \(i \in S\)
- Constrained integer optimization problem
- Rewritten as \(\min_y y^T L y\), where \(L = D - W\) is the graph Laplacian
- Continuous relaxation: \(y_i \in [0,1]\)
- \[L = \begin{bmatrix} L_{ss} & L_{su} \\ L_{us} & L_{uu} \end{bmatrix}\]
- \[y = \begin{bmatrix} y_s \\ y_u \end{bmatrix} = \begin{bmatrix} \hat{y}_s \\ y_u \end{bmatrix}\]
- \(y_u = -L_{uu}^{-1} L_{us} y_s\) (closed-form solution)
- Unconstrained optimization problem
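As an illustration, here is a minimal NumPy sketch of the closed-form solution \(y_u = -L_{uu}^{-1} L_{us} y_s\); the tiny graph, seed set, and labels are made up for the example.

```python
# Minimal NumPy sketch of label propagation with hard seed constraints.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # symmetric weight matrix
L = np.diag(W.sum(axis=1)) - W              # graph Laplacian L = D - W

seeds = np.array([0, 3])                    # labeled nodes S
unlabeled = np.array([1, 2])                # unlabeled nodes U
y_s = np.array([1.0, 0.0])                  # observed seed labels

L_uu = L[np.ix_(unlabeled, unlabeled)]
L_us = L[np.ix_(unlabeled, seeds)]
y_u = -np.linalg.solve(L_uu, L_us @ y_s)    # closed-form solution for y_u
print(y_u)                                  # relaxed labels in [0, 1]
```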
Generalization to \(k\) Labels
- \(y_{ik} = 1\) if node \(i\) is of class \(k\)
- \[E(y) = \sum_{i,j} W_{ij} \|y_i - y_j\|^2\]
- Compatibility matrix \(H\)
- Models the compatibility between labels
- \[E(y) = \sum_{i,j} W_{ij} (y_i - y_j)^T H (y_i - y_j)\]
- Example: \[H = \begin{bmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{bmatrix}\]
- Allows modeling heterophily: Nodes with different labels are likely to be connected (e.g., leaders and followers)
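A small NumPy sketch of this energy with the example compatibility matrix \(H\) above; the toy graph and the one-hot labels are invented for illustration.

```python
# Energy with a compatibility matrix H; off-diagonal entries larger than the
# diagonal (heterophily) give lower energy to configurations where connected
# nodes have different labels.
import numpy as np

H = np.array([[0.2, 0.8],
              [0.8, 0.2]])
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
Y = np.array([[1, 0],                       # one-hot label vectors y_i
              [0, 1],
              [1, 0]], dtype=float)

E = sum(W[i, j] * (Y[i] - Y[j]) @ H @ (Y[i] - Y[j])
        for i in range(len(W)) for j in range(len(W)))
print(E)                                    # lower than for the all-equal labeling
```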
Alternative Variant
- Soft constraint on \(y_i = \hat{y}_i\)
- \[\min_{F \in \mathbb{R}^{n \times k}} \sum_{i,j} W_{ij}\|f_i - f_j\|^2 + \mu \sum_i \|f_i - y_i\|^2\]
- \(f_i\): \(i\)-th row of \(F\)
- \(y_i\): \(i\)-th row of \(Y\) (one-hot encoded)
- Normalize \(f_i\) and \(y_i\) by the degree of the node
- \(\mu\): Regularization parameter (softens the constraint)
- Solution: \[F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y\] (see the sketch after this list)
- \(\alpha = \frac{2}{2 + \mu}\)
- Personalized PageRank where \(Y\) encodes the seed labels
- Compute a PageRank score for each class and assign each node the class with the highest score
- \(W D^{-1}\) is the normalized random walk matrix \(M\)
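A NumPy sketch of the soft-constraint solution with the random-walk normalization \(M = W D^{-1}\) used above; the graph, the seed labels, and the value of \(\alpha\) are arbitrary choices for the example.

```python
# Soft-constraint label propagation: F = (1 - a) (I - a W D^{-1})^{-1} Y.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = W @ np.diag(1.0 / W.sum(axis=0))        # random-walk matrix W D^{-1}

Y = np.zeros((4, 2))
Y[0, 0] = 1.0                               # node 0 observed as class 0
Y[3, 1] = 1.0                               # node 3 observed as class 1

alpha = 0.8                                 # alpha = 2 / (2 + mu)
F = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * M, Y)
print(F.argmax(axis=1))                     # class with the highest PPR score per node
```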
Label Propagation vs. Stochastic Block Models (SBM)
- Labels \(y_i\) look like communities \(z_i\)
- Compatibility matrix \(H\) resembles the SBM matrix
- SBM:
- Can have observed labels
- Generative model: Models \(p(W|z)\) and \(p(z)\)
- Can be used to generate new graphs (unlike LP)
- Estimates the parameters that generated a given graph (unsupervised)
- Used for link prediction, etc.
- LP:
- Discriminative model: Models \(p(y|W)\)
- Predicts labels of \(U\) given the observed labels and \(W\) (supervised)
- A different way of approaching the problem
Transductive Learning
- Label propagation is a special case of transductive learning
- Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
- May include additional structural information like \(A\)
- Goal: Predict labels of only the unlabeled set
- Easier task: We are not solving the general problem for arbitrary new graphs and new labels
- Vapnik's principle: When solving a problem, do not solve a more general problem as an intermediate step
Transductive Learning vs. Semi-Supervised Learning
- LP is also a form of graph-based semi-supervised learning (more precisely: semi-supervised transductive learning)
- Semi-supervised learning: Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
- Goal: Use both \(T\) and \(U\) to learn a mapping \(f\)
- \(f\) can be transductive or inductive
- Transductive learning is almost always semi-supervised because both \(T\) and \(U\) are used
Why Does Semi-Supervised Learning Work?
- Unlabeled data helps to better model the data distribution
- Assumptions about the data and labels provide a useful learning signal
- Unlabeled data is typically available in much larger quantities than labeled data
Self-Learning
- Learn embeddings first, then the classifier
Attributed Graphs
- Each node \(v_i\) has a feature vector \(x_i \in \mathbb{R}^n\)
- Traditional methods: Markov random fields (similar to HMMs)
- Latent variables are not sequential but graph-structured
- Parts of the latent variables are observed
- Modern methods: Graph Neural Networks (GNNs)
Graph Construction
- Handle vector data when no structure is available
- Construct a graph connecting similar data points (similarity function)
- Apply label propagation
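A minimal sketch of this pipeline, assuming a Gaussian similarity function and an arbitrary sparsification threshold; any of the label-propagation variants above can then be run on the resulting weight matrix.

```python
# Construct a similarity graph from plain vector data, then apply label propagation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                          # unstructured vector data
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 0.5)                           # Gaussian similarity (bandwidth 0.5)
np.fill_diagonal(W, 0.0)                              # no self-loops
W[W < 0.1] = 0.0                                      # sparsify: drop weak edges
# W can now be plugged into the closed-form solution y_u = -L_uu^{-1} L_us y_s.
```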
Graph Classification vs. Node Classification
- Node classification: Classify individual nodes
- Graph classification: Classify the entire graph as a single instance (standard iid problem)
Graph Kernel Methods
- Shortest path kernel
- Random walk kernel
- Use an SVM or GNNs
Node Embeddings
- Alternative to specialized ML methods for graphs
- Embed nodes as vectors, then apply traditional ML methods (e.g., k-means, SVM) designed for iid data
Encoding Graph Structure in a Vector
- Naive approaches:
- Adjacency matrix as an image: Not permutation invariant
- Concatenating features of neighbors: Variable length, \(O(n^2)\) scaling
- Learning node representations (like word embeddings)
- \(\phi: V \rightarrow \mathbb{R}^d\)
- Similar to graph clustering with \(d\) smallest eigenvectors of the Laplacian
- Close in graph space \(\Rightarrow\) Close in embedding space
Applications of Node Embeddings
- Clustering: k-means
- Semi-supervised classification: Classify remaining nodes based on learned embeddings
- Link prediction: Measure likelihood of a link between nodes in embedding space
Types of Embeddings
- Role-based embeddings: Nodes with similar roles are close, even if far in the graph
- Spectral embeddings: Nodes have similar embeddings if in the same cluster
- Deep embeddings: Similarity depends on the loss function and architecture
Spectral Embeddings
- \(L \in \mathbb{R}^{|V| \times |V|}\) is projected to \(\mathbb{R}^{|V| \times k}\) using the eigenvectors of the \(k\) smallest eigenvalues of \(L\)
- Linear transformation, similar to PCA
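A short NumPy sketch of spectral embeddings on a made-up graph: project the nodes onto the eigenvectors belonging to the \(k\) smallest eigenvalues of the Laplacian.

```python
# Spectral node embeddings from the k smallest eigenvectors of the Laplacian.
import numpy as np

W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W              # unnormalized Laplacian

eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues sorted ascending
k = 2
embedding = eigvecs[:, :k]                  # |V| x k node embeddings
print(embedding)
```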
Deep Embeddings
- Inspired by NLP methods (e.g., word2vec)
- DeepWalk: Transform the graph into random walks and apply word2vec
- For every node \(v_i\), sample multiple random walks
- Random walks correspond to sentences (sequences of discrete tokens)
- Result: Nodes close in the graph are close in the embedding space
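A hedged sketch of the DeepWalk idea: sample random walks from an adjacency matrix and feed them to word2vec as sentences. The walk parameters are arbitrary, and the (commented) word2vec call assumes gensim 4.x is available.

```python
# DeepWalk-style random walks; each walk is a "sentence" of node tokens.
import numpy as np

def sample_walks(W, walks_per_node=10, walk_length=20, seed=0):
    rng = np.random.default_rng(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in range(W.shape[0]):
            walk = [start]
            for _ in range(walk_length - 1):
                neighbors = np.flatnonzero(W[walk[-1]])
                if len(neighbors) == 0:
                    break
                walk.append(int(rng.choice(neighbors)))
            walks.append([str(v) for v in walk])   # word2vec expects string tokens
    return walks

# Feeding the walks to word2vec (assumes gensim >= 4; W is an adjacency matrix as above):
# from gensim.models import Word2Vec
# model = Word2Vec(sample_walks(W), vector_size=64, window=5, min_count=0, sg=1)
# model.wv["3"]   # embedding of node 3
```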
Graph2Gauss
- Map each node \(v_i\) to a Gaussian distribution in embedding space
- Neighbors are closer than second neighbors, preserving ranking
- Learn mapping: \(f: v_i \rightarrow N(\mu(x_i), \sigma(x_i))\)
- \(x_i\): Feature vector of \(v_i\)
- Define loss for each node: Ranking loss (triplet \(u, v, w\))
- \(u\) is closer to \(v\) than to \(w\)
- \[L(u) = \sum_{v, w} \left( E_{uv}^2 + \exp(-E_{uw}) \right)\]
- \[E_{uv} = KL(f(u), f(v))\]
- Result: Distances in graph correlate with distances in embedding space, variances encode uncertainty
- Inductive learning: Can compute embeddings for new, unseen nodes (unlike DeepWalk or spectral embeddings)
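A hedged NumPy sketch of the ranking term for a single triplet \((u, v, w)\) with diagonal Gaussians; the means and variances below are placeholders, whereas in Graph2Gauss they come from an encoder applied to the node features \(x_i\).

```python
# Graph2Gauss-style ranking term E_uv^2 + exp(-E_uw) with diagonal Gaussians.
import numpy as np

def kl_diag(mu1, var1, mu2, var2):
    # KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) )
    return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1.0 + np.log(var2 / var1))

mu_u, var_u = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mu_v, var_v = np.array([0.5, 0.0]), np.array([1.0, 1.0])   # a 1-hop neighbor of u
mu_w, var_w = np.array([3.0, 3.0]), np.array([2.0, 2.0])   # a more distant node

E_uv = kl_diag(mu_u, var_u, mu_v, var_v)
E_uw = kl_diag(mu_u, var_u, mu_w, var_w)
loss = E_uv ** 2 + np.exp(-E_uw)            # small when u is ranked closer to v than to w
print(loss)
```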
Node Embeddings vs. Graph Embeddings
- Leverage node embeddings to get graph embeddings
- Different number of nodes: Use aggregation function (mean, sum, max)
Graph Neural Networks
Introduction
Graph Neural Networks (GNNs) represent a form of deep learning tailored to exploit the inherent symmetries or geometric properties often encoded in graphs. They have demonstrated exceptional performance in various domains due to their flexibility in considering attributes of both nodes and edges.
Graph Convolutional Networks (GCNs)
GCNs are inspired by convolutions on images, where a sliding window aggregates information from neighboring pixels. Analogously, in GCNs information is aggregated from neighboring nodes to update each node's representation.
Graph Convolution
Graph convolutions need to be permutation equivariant and can be defined in multiple ways:
- Spatial GNNs: Utilize differentiable message passing.
- Spectral GNNs: Employ graph signal processing and spectral filtering.
Both methods can achieve equivalent results.
Differentiable Message Passing (Spatial)
This method relies on the local aggregation of node representations. For a graph \(G\) with adjacency matrix \(A\) and node features \(X\):
- Initialization: \(h_v^0 = X_v\)
- Message Passing: For each node \(v\), \[ m_v^k = \sum_{u \in N(v)} M(h_u^{k-1}, h_v^{k-1}, E_{vu}) \]
- Node Update: \[ h_v^k = U(h_v^{k-1}, m_v^k) \] where \(M\) and \(U\) are differentiable functions.
Example formulation: \[ m_v^k = \sum_{u \in N(v)} \frac{W^k h_u^{k-1} + b^k}{d_v} \] \[ h_v^k = \text{ReLU}(Q^k h_v^{k-1} + m_v^k + p^k) \] Trainable parameters: \(W^k\), \(Q^k\), \(p^k\), and \(b^k\).
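A minimal NumPy sketch of one message-passing step using the example formulation above; the weight matrices are random placeholders rather than trained parameters.

```python
# One message-passing step: m_v = sum_u (W h_u + b) / d_v, h_v' = ReLU(Q h_v + m_v + p).
import numpy as np

def message_passing_step(A, H, W_k, b_k, Q_k, p_k):
    n = A.shape[0]
    deg = A.sum(axis=1)
    H_new = np.zeros((n, W_k.shape[0]))
    for v in range(n):
        neighbors = np.flatnonzero(A[v])
        m_v = sum(W_k @ H[u] + b_k for u in neighbors) / max(deg[v], 1.0)  # messages
        H_new[v] = np.maximum(0.0, Q_k @ H[v] + m_v + p_k)                 # ReLU update
    return H_new

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))                        # h_v^{k-1} for all nodes
W_k, Q_k = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
b_k, p_k = rng.normal(size=4), rng.normal(size=4)
print(message_passing_step(A, H, W_k, b_k, Q_k, p_k))
```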
Graph Convolutional Network (GCN)
GCNs implement a specific form of message passing: \[ H^{l+1} = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^l W^l) \] where \(\hat{A} = A + I\) (adjacency matrix with self-loops) and \(\hat{D}\) is the diagonal degree matrix of \(\hat{A}\).
Transformation Depth vs. Propagation Depth:
- Propagation depth: Number of hops considered for information aggregation.
- Transformation depth: Number of stacked layers, which can involve more complex neural networks beyond linear transformations.
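A compact NumPy sketch of the GCN propagation rule above, with random placeholder weights.

```python
# GCN layer: H' = ReLU(D^{-1/2} (A+I) D^{-1/2} H W) with self-loops added.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # normalized adjacency
    return np.maximum(0.0, A_norm @ H @ W)                       # ReLU as the nonlinearity

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))     # H^l
W = rng.normal(size=(4, 2))     # W^l
print(gcn_layer(A, H, W))
```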
Spectral Graph Convolution
Given the normalized Laplacian \(L = I - D^{-1/2} A D^{-1/2}\) with eigendecomposition \(L = U \Lambda U^T\):
- Graph Fourier transform: \(\hat{x} = U^T x\)
- Inverse graph Fourier transform: \(x = U \hat{x}\)
Spectral filtering on graphs is defined as: \[ y = U g(\Lambda) U^T x \] where \(g\) is a spectral filter. For example, choosing \(g(\Lambda) = \Lambda\) results in 1-hop neighborhood aggregation, while \(g(\Lambda) = \Lambda^k\) results in \(k\)-hop aggregation.
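A NumPy sketch of spectral filtering \(y = U g(\Lambda) U^T x\) on a toy graph, using the polynomial filter \(g(\lambda) = \lambda^2\) as an example.

```python
# Spectral filtering y = U g(Lambda) U^T x on the normalized Laplacian.
import numpy as np

def spectral_filter(A, x, g):
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(A.shape[0]) - A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)                 # L = U Lambda U^T
    return U @ (g(lam) * (U.T @ x))            # filter applied in the spectral domain

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0])                  # a signal on the nodes
print(spectral_filter(A, x, lambda lam: lam ** 2))   # polynomial filter g(lambda) = lambda^2
```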
GCNs can be seen as a first-order approximation of localized spectral filters.
Transferability:
- Transferability to new graphs (inductive learning) is challenging due to the global nature of spectral properties: adding a single node already alters the spectrum.
Semi-Supervised Node Classification
In semi-supervised node classification, the final layer outputs logits, with softmax and cross-entropy loss used only on labeled nodes. The entire graph is used to propagate embeddings.
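A hedged PyTorch sketch of this setup: logits are computed for every node, but the cross-entropy loss is evaluated only on the labeled ones. The tensors are placeholders standing in for the output of a GNN.

```python
# Semi-supervised node classification loss: cross-entropy only on labeled nodes.
import torch
import torch.nn.functional as F

logits = torch.randn(6, 3, requires_grad=True)    # final-layer logits for all 6 nodes
labels = torch.tensor([0, 2, 1, 0, 0, 1])         # labels (only some are observed)
labeled_mask = torch.tensor([True, True, False, False, True, False])

loss = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
loss.backward()       # in a real GNN, gradients flow back through the full propagation
print(loss.item())
```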
Graph classification:
- Aggregate all node embeddings and use a Multi-Layer Perceptron (MLP) for classification.
Equivariant Machine Learning
For GNNs, the node order should not affect the representation or the classification. If two graphs \(G_1 = (A_1, X_1)\) and \(G_2 = (A_2, X_2)\) are isomorphic, i.e., related by a permutation matrix \(P\), then:
- Invariant GNN: \(f(PX, PAP^T) = f(X, A)\)
- Equivariant GNN: \(f(PX, PAP^T) = Pf(X, A)\)
In the message passing framework, this is achieved with aggregation functions that operate on sets, such as sum or max, which are invariant to the ordering of the neighbors.
Spatial Graphs
Spatial graphs have nodes and edges corresponding to geometric objects in vector spaces with metrics (e.g., molecules with atoms as nodes and bonds as edges). For these graphs, invariance to rotations and translations is necessary, leading to \(E(n)\)-equivariant GNNs, where \(E(n) = \{x \mapsto Rx + t : R \in O(n), t \in \mathbb{R}^n\}\) is the Euclidean group.
Limitations of Graph Neural Networks
- Expressiveness:
Message-passing GNNs are at most as expressive as the Weisfeiler-Lehman (WL) test for graph isomorphism, and often weaker. Exact graph isomorphism testing itself is hard: no polynomial-time algorithm is known, and it is unclear whether the problem is in P or NP-complete.
WL Test:
- Label all nodes with 1.
- Collect neighboring labels in a labeled multiset.
- Relabel each node by hashing its current label concatenated with the multiset (using some hash function).
- Compare the sorted node labels or plot the labels in a histogram.
- If the histograms are different, the graphs are not isomorphic. If they are the same, repeat the process. Once the labeling converges without revealing a difference, we stop without a definitive answer: the WL test is only a heuristic.
Standard GNNs are usually weaker than the WL test; max aggregation, for example, loses information about the neighbor multiset. Expressiveness can be improved with carefully designed (injective) aggregation functions or by using higher-order structures such as triangles.
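A minimal Python sketch of the one-dimensional WL color refinement described above; the two toy graphs are made up, and the number of refinement iterations is fixed for simplicity.

```python
# One-dimensional WL color refinement as a (heuristic) isomorphism test.
from collections import Counter

def wl_histogram(adj, iterations=3):
    # adj: dict mapping each node to a list of its neighbors
    labels = {v: 1 for v in adj}                       # start: every node labeled 1
    for _ in range(iterations):
        labels = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in adj}                        # relabel with hashed multiset
    return Counter(labels.values())                    # histogram of node labels

G1 = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}      # triangle with a pendant node
G2 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}            # path on four nodes
if wl_histogram(G1) != wl_histogram(G2):
    print("not isomorphic")                            # differing histograms are conclusive
else:
    print("possibly isomorphic (the heuristic cannot decide)")
```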
- Oversmoothing:
- GCN-like models tend to oversmooth predictions with increasing depth. In the limit of infinitely many layers, a node's influence becomes proportional to the stationary distribution of a random walk on the graph (PageRank). The influence is then no longer local: very deep GCNs cannot represent local neighborhoods.
- Oversquashing:
- GNNs often fail to propagate messages over long distances due to bottleneck behavior. An exponential amount of information is squashed into a fixed-size vector. This can result in low performance on long-range dependency tasks.
- Oversmoothing vs. Oversquashing:
- Fully connected graphs: oversmoothing but not oversquashing.
- Directed tree graphs: oversquashing but not oversmoothing (subtrees are independent).
PageRank in the Message Passing Framework
\[ H^{l+1} = D^{-1/2} A D^{-1/2} \sigma(H^l W^l) \] In this standard formulation, each layer combines transformation and propagation. Replacing the propagation with personalized PageRank prevents oversmoothing by keeping each node's influence focused on its local neighborhood.
Personalized Propagation of Neural Predictions (PPNP)
PPNP separates transformation and propagation:
Transformation: \[ H_0 = f(X) \] where \(f\) is modeled by a neural network.
Personalized Propagation: \[ H^{l+1} = (1-\alpha)D^{-1/2} A D^{-1/2} H^l + \alpha H_0 \]
Personalized Teleportation: \[ r^{t+1} = (1-\alpha)AD^{-1}r^t + \alpha p_i \] where \(p_i\) is the \(i\)-th unit vector.
Final prediction using personalized PageRank: \[ z = \text{softmax}\left(\alpha(I - (1-\alpha)D^{-1/2} A D^{-1/2})^{-1} H_0\right) \] Here, \(H_0\) is the output of the neural network, and personalized PageRank models the influence of nodes for each class.
This approach improves scalability because \(H_0\) is modeled with a normal neural network and can be distributed. The propagation then operates on a low-dimensional latent space. It prevents oversmoothing and provides a clear separation of transformation and propagation (infinite propagation depth using the stationary distribution).
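A NumPy sketch of the separation described above: a placeholder network produces \(H_0\), and personalized-PageRank propagation is approximated by a few power-iteration steps of \(H^{l+1} = (1-\alpha)\,D^{-1/2} A D^{-1/2} H^l + \alpha H_0\).

```python
# PPNP/APPNP sketch: neural prediction H0, then personalized PageRank propagation.
import numpy as np

def appnp_propagate(A, H0, alpha=0.2, num_steps=10):
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    H = H0
    for _ in range(num_steps):                               # approximates the PPR limit
        H = (1 - alpha) * A_norm @ H + alpha * H0
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 5))
W1 = rng.normal(size=(5, 3))
H0 = np.maximum(0.0, X @ W1)       # stand-in for f(X), the neural network output
Z = appnp_propagate(A, H0)
print(Z.argmax(axis=1))            # predicted class per node (argmax of the softmax)
```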
Connection to Label Propagation
PPNP is related to label propagation, where \(H_0\) plays the role of the label matrix \(Y\). Here, however, we do not propagate observed labels; the labels are predicted by the network instead.
Scalability and Reliability
- GNNs require extensive communication and synchronization, making scalability to large graphs challenging.
- Susceptible to adversarial attacks.
- Adding nodes requires significant recomputation due to the small-world phenomenon.
Graph Transformers
Graph Transformers address oversquashing issues by using positional encodings to encode graph structure for attention mechanisms.
Position Encodings
In sequences, sinusoidal position encodings are used: \[ \text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \] where \(d_{\text{model}}\) is the model dimension, \(\text{pos}\) is the position in the sequence, and \(i\) is the dimension index of the encoding. This is inspired by the Discrete Fourier Transform (DFT). For graphs, an analogous choice is to use eigenvectors of the graph Laplacian as positional encodings.
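A NumPy sketch of the sinusoidal encodings above; following the original Transformer, even dimensions use \(\sin\) and odd dimensions the corresponding \(\cos\).

```python
# Sinusoidal position encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)).
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]          # positions in the sequence
    i = np.arange(d_model // 2)[None, :]             # dimension index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                     # PE(pos, 2i+1)
    return pe

print(sinusoidal_pe(num_positions=4, d_model=8).shape)   # (4, 8)
```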
Directed Graphs
For directed graphs, the Laplacian is not symmetric, so a real eigendecomposition with orthogonal eigenvectors is not guaranteed.