Graph Neural Networks
Date: 2024-08-04
Introduction
Graph Neural Networks (GNNs) represent a form of deep learning tailored to exploit the inherent symmetries or geometric properties often encoded in graphs. They have demonstrated exceptional performance in various domains due to their flexibility in considering attributes of both nodes and edges.
Graph Convolutional Networks (GCNs)
GCNs are inspired by convolutions on images, where a sliding window aggregates information from a local neighborhood. In GCNs, information is aggregated from neighboring nodes to update each node’s representation.
Graph Convolution
Graph convolutions need to be permutation equivariant and can be defined in multiple ways:
- Spatial GNNs: Utilize differentiable message passing.
- Spectral GNNs: Employ graph signal processing and spectral filtering.
Both methods can achieve equivalent results.
Differentiable Message Passing (Spatial)
This method relies on the local aggregation of node representations. For a graph \(G\) with adjacency matrix \(A\) and node features \(X\):
- Initialization: \(h_v^0 = X_v\)
- Message Passing: For each node \(v\), \[ m_v^k = \sum_{u \in N(v)} M(h_u^{k-1}, h_v^{k-1}, E_{vu}) \]
- Node Update: \[ h_v^k = U(h_v^{k-1}, m_v^k) \] where \(M\) and \(U\) are differentiable functions.
Example formulation: \[ m_v^k = \sum_{u \in N(v)} \frac{W^k h_u^{k-1} + b^k}{d_v} \] \[ h_v^k = \text{ReLU}(Q^k h_v^{k-1} + m_v^k + p^k) \] where \(d_v\) is the degree of node \(v\). Trainable parameters are \(W^k\), \(Q^k\), \(p^k\), and \(b^k\).
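A minimal NumPy sketch of this example formulation (ReLU and the weight shapes follow the equations above; guarding isolated nodes against division by zero is an extra assumption):

```python
import numpy as np

def message_passing_step(A, H, W, b, Q, p):
    """One step of the example rule: m_v = sum_u (W h_u + b) / d_v, h_v = ReLU(Q h_v + m_v + p)."""
    deg = A.sum(axis=1, keepdims=True)                 # degrees d_v, shape (n, 1)
    M = A @ (H @ W.T + b) / np.maximum(deg, 1.0)       # aggregate messages from neighbors
    return np.maximum(H @ Q.T + M + p, 0.0)            # node update with ReLU

rng = np.random.default_rng(0)
n, d = 5, 8
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                         # undirected graph, no self-loops
H = rng.normal(size=(n, d))
W, Q = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b, p = rng.normal(size=d), rng.normal(size=d)
print(message_passing_step(A, H, W, b, Q, p).shape)    # (5, 8)
```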
Graph Convolutional Network (GCN)
GCNs implement a specific form of message passing: \[ H^{l+1} = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^l W^l) \] where \(\hat{A} = A + I\) (adjacency matrix with self-loops) and \(\hat{D}\) is the diagonal degree matrix of \(\hat{A}\).
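A sketch of a single GCN layer in NumPy, following the formula above (ReLU is assumed for \(\sigma\)):

```python
import numpy as np

def gcn_layer(A, H, W):
    """H^{l+1} = ReLU(D_hat^{-1/2} A_hat D_hat^{-1/2} H^l W^l)."""
    A_hat = A + np.eye(A.shape[0])                     # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))      # D_hat^{-1/2} as a vector
    A_norm = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)
```

Stacking \(k\) such layers gives a propagation depth of \(k\) hops.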
Transformation Depth vs. Propagation Depth:
- Propagation Depth: Number of hops considered for information aggregation.
- Transformation Depth: Number of layers stacked, which can involve more complex neural networks beyond linear transformations.
Spectral Graph Convolution
Given the normalized Laplacian \(L = I - D^{-1/2} A D^{-1/2}\), with eigen-decomposition \(L = U \Lambda U^T\):
- Graph Fourier Transform: \(\hat{x} = U^T x\)
- Inverse Graph Fourier Transform: \(x = U \hat{x}\)
Spectral filtering on graphs is defined as: \[ y = U g(\Lambda) U^T x \] where \(g\) is a spectral filter applied to the eigenvalues. For example, choosing \(g(\lambda) = \lambda\) corresponds to applying \(L\) once (1-hop neighborhood aggregation), while \(g(\lambda) = \lambda^k\) corresponds to \(L^k\) (k-hop aggregation).
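A sketch of spectral filtering with NumPy (dense eigendecomposition; assumes no isolated nodes so that \(D^{-1/2}\) is defined):

```python
import numpy as np

def spectral_filter(A, x, g):
    """Compute y = U g(Lambda) U^T x with the normalized Laplacian of A."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(A.shape[0]) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)          # L = U diag(lam) U^T
    x_hat = U.T @ x                     # graph Fourier transform
    return U @ (g(lam) * x_hat)         # filter in the spectral domain, then invert

# g(lam) = lam applies L once (1-hop); g(lam) = lam**k applies L^k (k-hop):
# y = spectral_filter(A, x, lambda lam: lam**3)
```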
GCNs can be seen as a first-order approximation of localized spectral filters.
Transferability:
- Transferability to new graphs (inductive learning) is challenging due to the global nature of spectral properties. Adding a single node alters the spectrum, limiting transferability.
Semi-Supervised Node Classification
In semi-supervised node classification, the final layer outputs logits, with softmax and cross-entropy loss used only on labeled nodes. The entire graph is used to propagate embeddings.
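A sketch of this loss computation, where `logits` are the final-layer outputs for all nodes, `labels` holds the class indices, and `labeled_idx` indexes the labeled nodes (all three names are illustrative):

```python
import numpy as np

def masked_cross_entropy(logits, labels, labeled_idx):
    """Softmax cross-entropy evaluated only on the labeled nodes."""
    z = logits[labeled_idx]                                # (m, C) rows of labeled nodes
    z = z - z.max(axis=1, keepdims=True)                   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labeled_idx)), labels[labeled_idx]].mean()
```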
Graph Classification:
- Aggregate all node embeddings and use a Multi-Layer Perceptron (MLP) for classification.
Equivariant Machine Learning
For GNNs, node order should not affect the representation or classification. If two graphs \(G_1 = (A_1, X_1)\) and \(G_2 = (A_2, X_2)\) are isomorphic, i.e. \(X_2 = PX_1\) and \(A_2 = PA_1P^T\) for some permutation matrix \(P\), then:
- Invariant GNN: \(f(PX, PAP^T) = f(X, A)\)
- Equivariant GNN: \(f(PX, PAP^T) = Pf(X, A)\)
In the message passing framework, permutation invariance is preserved by aggregating messages with functions that operate on sets (i.e., are independent of ordering), such as sum or max.
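These properties can be checked numerically; a small sketch using the GCN layer from above (redefined here so the snippet is self-contained) and a random permutation matrix:

```python
import numpy as np

def gcn_layer(A, H, W):                                    # same sketch as above
    A_hat = A + np.eye(A.shape[0])
    d = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = (A_hat * d[:, None]) * d[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
n, d = 6, 4
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T
X, W = rng.normal(size=(n, d)), rng.normal(size=(d, d))
P = np.eye(n)[rng.permutation(n)]                          # random permutation matrix

out = gcn_layer(A, X, W)
out_perm = gcn_layer(P @ A @ P.T, P @ X, W)
print(np.allclose(out_perm, P @ out))                      # equivariance of the layer
print(np.allclose(out_perm.sum(axis=0), out.sum(axis=0)))  # invariance after sum pooling
```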
Spatial Graphs
Spatial graphs have nodes and edges corresponding to geometric objects in vector spaces with metrics (e.g., molecules with atoms as nodes and bonds as edges). For these graphs, invariance to rotations and translations is necessary, leading to \(E(n)\)-equivariant GNNs, where \(E(n) = \{x \mapsto Rx + t : R \in O(n), t \in \mathbb{R}^n\}\) is the Euclidean group.
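One common way to obtain such invariance is to build messages only from quantities that are themselves \(E(n)\)-invariant, e.g. pairwise distances; a quick numerical check (random rotation via QR, random translation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))                     # node coordinates in R^3

def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)        # E(n)-invariant edge features

R, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal matrix (R in O(3))
t = rng.normal(size=3)                          # random translation
print(np.allclose(pairwise_distances(X), pairwise_distances(X @ R.T + t)))  # True
```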
Limitations of Graph Neural Networks
- Expressiveness:
Standard message-passing GNNs are at most as powerful as the Weisfeiler-Lehman (WL) test at distinguishing non-isomorphic graphs, and are often strictly weaker. Deciding graph isomorphism exactly is itself hard: the problem is in NP, but it is not known whether it is in P or NP-complete, and the best known algorithms have super-polynomial worst-case runtime.
WL Test:
- Label all nodes with 1.
- For each node, collect the labels of its neighbors into a multiset.
- Relabel each node by hashing its current label together with the (sorted) multiset of neighbor labels.
- Compare the sorted node labels, or plot the labels in a histogram.
- If the histograms differ, the graphs are not isomorphic. If they are the same, repeat the process. Once the labeling has converged without revealing a difference, we stop, but this does not prove isomorphism: the test is only a heuristic.
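A minimal sketch of this color-refinement procedure; a fixed number of iterations is assumed here (normally one iterates until the label partition stops changing), and the raw Python hashes are only comparable within a single run:

```python
import numpy as np

def wl_labels(A, iterations=3):
    """1-WL color refinement: returns the sorted multiset of node labels."""
    n = A.shape[0]
    labels = [1] * n                                              # step 1: label all nodes with 1
    for _ in range(iterations):
        new_labels = []
        for v in range(n):
            neigh = sorted(labels[u] for u in range(n) if A[v, u])  # neighbor label multiset
            new_labels.append(hash((labels[v], tuple(neigh))))      # relabel by hashing
        labels = new_labels
    return sorted(labels)

# Different label histograms disprove isomorphism; equal ones do not prove it:
# maybe_isomorphic = wl_labels(A1) == wl_labels(A2)
```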
In practice, GNNs are usually weaker than the WL test: max aggregation, for example, discards the multiplicities that the WL multisets retain. Expressiveness can be improved with higher-order structures such as triangles, or with carefully designed (injective) aggregation functions.
- Oversmoothing:
GCN-like models tend to oversmooth predictions with increasing depth. In the limit of infinitely many layers, a node’s influence becomes proportional to the stationary distribution of the graph viewed as a Markov chain (as in PageRank). The influence is then no longer local, so deep GCNs cannot represent local neighborhoods.
- Oversquashing:
- GNNs often fail to propagate messages over long distances due to bottleneck behavior: the receptive field grows exponentially with depth, so an exponential amount of information must be squashed into a fixed-size vector. This can result in low performance on tasks with long-range dependencies.
- Oversmoothing vs. Oversquashing:
- Fully connected graphs: oversmoothing but not oversquashing.
- Directed tree graphs: oversquashing but not oversmoothing (subtrees are independent).
PageRank in the Message Passing Framework
\[ H^{l+1} = D^{-1/2} A D^{-1/2} \sigma(H^l W^l) \] Written this way, each layer interleaves transformation (\(W^l\), \(\sigma\)) with a PageRank-style propagation step. Using personalized PageRank for the propagation prevents oversmoothing, because the teleport term keeps the influence focused on the local neighborhood of each node.
Personalized Propagation of Neural Predictions (PPNP)
PPNP separates transformation and propagation:
Transformation: \[ H_0 = f(X) \] where \(f\) is modeled by a neural network.
Personalized Propagation: \[ H^{l+1} = (1-\alpha)D^{-1/2} A D^{-1/2} H^l + \alpha H_0 \]
Personalized Teleportation: \[ r^{t+1} = (1-\alpha)AD^{-1}r^t + \alpha p_i \] where \(p_i\) is the \(i\)-th unit vector, i.e. the random walk teleports back to the root node \(i\) with probability \(\alpha\).
Final prediction using personalized PageRank: \[ z = \text{softmax}\left(\alpha(I - (1-\alpha)D^{-1/2} A D^{-1/2})^{-1} H_0\right) \] Here, \(H_0\) is the output of the neural network, and personalized PageRank models the influence of nodes for each class.
This approach improves scalability because \(H_0\) is modeled with a normal neural network and can be distributed. The propagation then operates on a low-dimensional latent space. It prevents oversmoothing and provides a clear separation of transformation and propagation (infinite propagation depth using the stationary distribution).
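A sketch of both forms of PPNP in NumPy, following the formulas above: the closed-form prediction and the propagation recursion, which converges to it (here `H0` stands in for the neural-network output \(f(X)\); no isolated nodes are assumed):

```python
import numpy as np

def _normalized_adjacency(A):
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def ppnp_exact(A, H0, alpha=0.1):
    """z = softmax(alpha * (I - (1 - alpha) * A_norm)^{-1} H0)."""
    A_norm = _normalized_adjacency(A)
    n = A.shape[0]
    return softmax(alpha * np.linalg.solve(np.eye(n) - (1 - alpha) * A_norm, H0))

def ppnp_iterative(A, H0, alpha=0.1, steps=50):
    """H^{l+1} = (1 - alpha) * A_norm @ H^l + alpha * H0, started at H^0 = H0."""
    A_norm = _normalized_adjacency(A)
    H = H0
    for _ in range(steps):
        H = (1 - alpha) * A_norm @ H + alpha * H0
    return softmax(H)
```

The iterative form is the one that scales, since it only needs sparse matrix products, whereas the closed form requires solving a dense linear system.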
Connection to Label Propagation
PPNP is related to label propagation, where \(H_0\) would be replaced by the label matrix \(Y\). Since most labels are not available, PPNP propagates the network’s predictions instead.
Scalability and Reliability
- GNNs require extensive communication and synchronization, making scalability to large graphs challenging.
- Susceptible to adversarial attacks.
- Adding nodes requires significant recomputation due to the small-world phenomenon.
Graph Transformers
Graph Transformers address oversquashing by letting every node attend to every other node; the graph structure is then injected into the attention mechanism through positional encodings.
Position Encodings
In sequences, sinusoidal position encodings are used: \[ \text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \] where \(d_{\text{model}}\) is the dimension of the model, \(\text{pos}\) is the position in the sequence, and \(i\) is the dimension index of the encoding (odd dimensions use the corresponding cosine). This is inspired by the Discrete Fourier Transform (DFT).
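A sketch of these encodings (even dimensions use the sine above, odd dimensions the corresponding cosine, as in the original Transformer; an even \(d_{\text{model}}\) is assumed):

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(num_positions)[:, None]             # (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model // 2)
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```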
Directed Graphs
For directed graphs, the Laplacian is not symmetric, so a full set of real, orthogonal eigenvectors is not guaranteed.