Graph Classification - Semi-Supervised Learning
Tasks
- Classifying a protein as an enzyme or a non-enzyme
- Detecting fraud in a financial transaction network
- Predicting users’ preferences in a social network
Collective Classification
- A graph represents a social network
- Labels are known for some nodes (labeled nodes)
- Goal: Classify the remaining nodes
- Standard assumption: Nodes with similar labels are likely to be connected (smoothness assumption or homophily)
Label Propagation
- \(V = S \cup U\)
- \(S\): Seed nodes
- \(U\): Unlabeled nodes
- \(W\): Symmetric weighted adjacency matrix
- \(y_i \in \{0,1\}\): Binary label of node \(i\)
- Idea: Energy minimization
- \[E(y) = \sum_{i,j} W_{ij}(y_i - y_j)^2\]
- Smoothness: Adjacent nodes should have the same class
- Subject to \(y_i = \hat{y}_i\) for \(i \in S\)
- Constrained integer optimization problem
- Rewritten as \(\min y^T L y\)
- Relaxation: \(y_i \in [0,1]\)
- \[L = \begin{bmatrix} L_{ss} & L_{su} \\ L_{us} & L_{uu} \end{bmatrix}\]
- \[y = \begin{bmatrix} y_s \\ y_u \end{bmatrix} = \begin{bmatrix} \hat{y}_s \\ y_u \end{bmatrix}\]
- \(y_u = -L_{uu}^{-1} L_{us} y_s\) (closed-form solution)
- After fixing \(y_s = \hat{y}_s\), the optimization over \(y_u\) is unconstrained (see the sketch below)
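A minimal NumPy sketch of the closed-form solution above; the toy graph, the seed set, and the seed labels are made up for illustration:

```python
import numpy as np

# Toy symmetric weighted adjacency matrix (5 nodes, illustrative values)
W = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))      # degree matrix
L = D - W                       # unnormalized graph Laplacian

S = [0, 4]                      # seed (labeled) node indices
U = [1, 2, 3]                   # unlabeled node indices
y_s = np.array([0.0, 1.0])      # observed labels of the seeds

# Closed-form solution: y_u = -L_uu^{-1} L_us y_s
L_uu = L[np.ix_(U, U)]
L_us = L[np.ix_(U, S)]
y_u = -np.linalg.solve(L_uu, L_us @ y_s)

print(y_u)  # continuous scores in [0, 1]; threshold at 0.5 to classify
```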
Generalization to \(k\) Labels
- \(y_{ik} = 1\) if node \(i\) is of class \(k\)
- \[E(y) = \sum_{i,j} W_{ij}\|y_i - y_j\|^2\]
- Compatibility matrix \(H\)
- Models the compatibility between labels
- \[E(y) = \sum_{i,j} W_{ij}(y_i - y_j)^\top H (y_i - y_j)\]
- Example: \[H = \begin{bmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{bmatrix}\]
- Using heterophily: Nodes with different labels are likely to be connected (leaders and followers)
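A quick numeric check of how \(H\) steers the energy, assuming one-hot label vectors and the example \(H\) above (values purely illustrative):

```python
import numpy as np

# Example compatibility matrix from above: high off-diagonal -> heterophily
H = np.array([[0.2, 0.8],
              [0.8, 0.2]])

same = np.array([1, 0]) - np.array([1, 0])   # y_i == y_j
diff = np.array([1, 0]) - np.array([0, 1])   # y_i != y_j

# Per-edge energy contribution (y_i - y_j)^T H (y_i - y_j)
print(same @ H @ same)   # 0.0  -> no penalty for equal labels
print(diff @ H @ diff)   # -1.2 -> negative, so differing labels lower the energy
```

With \(H = I\) the same check gives \(+2\) for differing labels, recovering the homophily case.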
Alternative Variant
- Soft constraint on \(y_i = \hat{y}_i\)
- \[\min_{F \in \mathbb{R}^{n \times k}} \sum_{i,j} W_{ij}\|f_i - f_j\|^2 + \mu \sum_i \|f_i - y_i\|^2\]
- \(f_i\): \(i\)-th row of \(F\)
- \(y_i\): \(i\)-th row of \(Y\) (one-hot encoded)
- Normalize \(f_i\) and \(y_i\) by the degree of the node
- \(\mu\): Regularization parameter (softens the constraint)
- Solution (see the sketch below): \[F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y\]
- \(\alpha = \frac{2}{2 + \mu}\)
- Personalized PageRank where \(Y\) encodes the seed labels
- Compute a PageRank score for each class and assign each node the class with the highest score
- \(W D^{-1}\) is the normalized random walk matrix \(M\)
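A minimal sketch of the closed-form solution above, assuming a toy graph, \(\mu = 1\), and the random-walk normalization \(W D^{-1}\); the scores in \(F\) are read per class:

```python
import numpy as np

# Toy symmetric weighted adjacency matrix (illustrative)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / W.sum(axis=1))

# One-hot seed label matrix Y (rows of unlabeled nodes are all zero)
Y = np.array([[1, 0],     # node 0: class 0
              [0, 0],     # node 1: unlabeled
              [0, 0],     # node 2: unlabeled
              [0, 1]],    # node 3: class 1
             dtype=float)

mu = 1.0
alpha = 2.0 / (2.0 + mu)

# F = (1 - alpha) (I - alpha W D^{-1})^{-1} Y,
# i.e. personalized PageRank with Y as the per-class teleport distribution
n = W.shape[0]
F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * W @ D_inv, Y)

pred = F.argmax(axis=1)   # assign each node the class with the highest score
print(F)
print(pred)
```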
Label Propagation vs. Stochastic Block Models (SBM)
- Labels \(y_i\) look like communities \(z_i\)
- Compatibility matrix \(H\) resembles the SBM matrix
- SBM:
- Generative model: Models \(p(W|z)\) and \(p(z)\)
- Can also incorporate observed labels
- Estimates the parameters that generated a given graph (unsupervised)
- Can be used to generate new graphs (unlike LP)
- Used for link prediction, etc.
- LP:
- Discriminative model: Models \(p(y|W)\)
- Predicts labels of \(U\) given observed labels and \(W\) (supervised)
- Solves a different problem with a different modeling approach
Transductive Learning
- Label propagation is a special case of transductive learning
- Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
- May include additional structural information like \(A\)
- Goal: Predict labels of only the unlabeled set
- Easier task: Not solving the general problem for any new graph and any new labels
- Vapnik's principle: When solving a problem, do not solve a more general problem as an intermediate step
Transductive Learning vs. Semi-Supervised Learning
- LP is also called graph-based semi-supervised learning (more precisely: semi-supervised transductive learning)
- Semi-supervised learning: Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
- Goal: Use both \(T\) and \(U\) to learn a mapping \(f\)
- \(f\) can be transductive or inductive
- Transductive learning is almost always semi-supervised because both \(T\) and \(U\) are used
Why Does Semi-Supervised Learning Work?
- Unlabeled data helps to better model the data distribution \(p(x)\)
- Assumptions about the data and labels provide a good learning signal
- More unlabeled data available
Self-Learning
- Learn embeddings first, then the classifier
Attributed Graphs
- Each node has a feature vector \(x_i \in \mathbb{R}^n\)
- Traditional methods: Markov random fields (similar to HMMs)
- Latent variables are not sequential but graph-structured
- Parts of the latent variables are observed
- Modern methods: Graph Neural Networks (GNNs)
Graph Construction
- Handle vector data when no structure is available
- Construct a graph connecting similar data points (similarity function)
- Apply label propagation
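A minimal sketch of graph construction from plain vector data, assuming scikit-learn is available and using a k-nearest-neighbour connectivity graph (the data, the choice of \(k\), and the symmetrization step are illustrative):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Vector data without any given graph structure (illustrative random points)
X = np.random.rand(100, 5)

# Build a k-nearest-neighbour graph based on a distance/similarity function
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
W = ((A + A.T) > 0).astype(float).toarray()   # symmetrize the adjacency matrix

# W can now be fed to label propagation exactly as in the sections above
```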
Graph Classification vs. Node Classification
- Node classification: Classify individual nodes
- Graph classification: Classify the entire graph as a single instance (standard iid problem)
Graph Kernel Methods
- Shortest path kernel
- Random walk kernel
- Use an SVM or GNNs
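A simplified sketch of a shortest-path kernel, assuming the kernel value is taken as the inner product of path-length histograms (one of several possible variants); the resulting Gram matrix could be plugged into an SVM:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def sp_histogram(A, max_len=10):
    """Histogram of finite shortest-path lengths between all node pairs."""
    D = shortest_path(A, unweighted=True, directed=False)
    lengths = D[np.isfinite(D) & (D > 0)].astype(int)
    return np.bincount(lengths, minlength=max_len + 1)[: max_len + 1]

def sp_kernel(A1, A2):
    # Kernel value = inner product of the two path-length histograms
    return float(sp_histogram(A1) @ sp_histogram(A2))

# Two tiny example graphs: a triangle and a path with three nodes
A_tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(sp_kernel(A_tri, A_path))
```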
Node Embeddings
- Alternative to specialized ML methods for graphs
- Map nodes to vectors, then apply traditional ML methods (e.g., k-means, SVM) as if the data were iid
Encoding Graph Structure in a Vector
- Naive approaches:
- Adjacency matrix as an image: Not permutation invariant (depends on the node ordering)
- Concatenating features of neighbors: Variable length, \(O(n^2)\) scaling
- Learning node representations (like word embeddings)
- \(\phi: V \rightarrow \mathbb{R}^d\)
- Similar to graph clustering with \(d\) smallest eigenvectors of the Laplacian
- Similarity in the embedding space should reflect similarity in the graph
Applications of Node Embeddings
- Clustering: k-means
- Semi-supervised classification: Classify remaining nodes based on learned embeddings
- Link prediction: Measure likelihood of a link between nodes in embedding space
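A minimal sketch of these applications, assuming the node embeddings \(Z\) are already learned; the inner-product link score is one common choice, not the only one:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume Z holds learned node embeddings, one row per node (illustrative values)
Z = np.random.randn(50, 16)

# Clustering: run k-means directly on the embedding vectors
labels = KMeans(n_clusters=4, n_init=10).fit_predict(Z)

# Link prediction: score a candidate edge (i, j) by embedding similarity,
# e.g. the inner product (higher score = link more likely)
def link_score(Z, i, j):
    return float(Z[i] @ Z[j])

print(labels[:10], link_score(Z, 0, 1))
```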
Types of Embeddings
- Role-based embeddings: Nodes with similar roles are close, even if far in the graph
- Spectral embeddings: Nodes have similar embeddings if in the same cluster
- Deep embeddings: Similarity depends on the loss function and architecture
Spectral Embeddings
- \(L \in \mathbb{R}^{|V| \times |V|}\) projected to \(\mathbb{R}^{|V| \times k}\) using the smallest \(k\) eigenvectors of \(L\)
- Linear transformation, similar to PCA
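A minimal sketch of the \(|V| \times k\) spectral embedding, assuming the unnormalized Laplacian of a toy graph:

```python
import numpy as np

# Toy symmetric adjacency matrix (illustrative)
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W

# Eigenvectors of the Laplacian, sorted by eigenvalue (ascending)
eigvals, eigvecs = np.linalg.eigh(L)

k = 2
Z = eigvecs[:, :k]   # |V| x k spectral embedding: the k smallest eigenvectors
print(Z)
```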
Deep Embeddings
- Inspired by NLP methods (e.g., word2vec)
- DeepWalk: Transform a graph into random walks and apply word2vec
- For every node \(v_i\), sample multiple random walks
- Random walks correspond to sentences (sequences of discrete tokens)
- Result: Nodes close in the graph are close in the embedding space
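A minimal sketch of the walk-sampling step of DeepWalk, assuming uniform, unweighted walks; the commented gensim call is one possible word2vec backend:

```python
import numpy as np

def sample_random_walks(W, num_walks=10, walk_length=20, seed=0):
    """Sample uniform random walks starting from every node.
    Each walk is a sequence of node-id tokens, treated as a 'sentence'."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    walks = []
    for _ in range(num_walks):
        for start in range(n):
            walk = [start]
            for _ in range(walk_length - 1):
                neighbors = np.flatnonzero(W[walk[-1]])
                if len(neighbors) == 0:
                    break
                walk.append(int(rng.choice(neighbors)))
            walks.append([str(v) for v in walk])
    return walks

# The walks can then be fed to any word2vec implementation, e.g. (assuming gensim):
# from gensim.models import Word2Vec
# model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
# model.wv["3"]  -> embedding of node 3
```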
Graph2Gauss
- Map each node \(v_i\) to a Gaussian distribution in embedding space
- Neighbors are closer than second neighbors, preserving ranking
- Learn mapping: \(f: v_i \mapsto \mathcal{N}(\mu(x_i), \sigma(x_i))\)
- \(x_i\): Feature vector of \(v_i\)
- Define loss for each node: Ranking loss (triplet \(u, v, w\))
- \(u\) is closer to \(v\) than to \(w\)
- \[L(u) = \sum_{v, w} \left( E_{uv}^2 + \exp(-E_{uw}) \right)\]
- \[E_{uv} = \mathrm{KL}\!\left(f(u) \,\|\, f(v)\right)\]
- Result: Distances in graph correlate with distances in embedding space, variances encode uncertainty
- Inductive learning: Get embedding for new nodes (unlike deep walk, spectral)
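A minimal sketch of the ranking loss above, assuming diagonal Gaussian embeddings; the means and variances of the triplet \((u, v, w)\) are invented for illustration:

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) for diagonal Gaussians."""
    d = mu0.shape[0]
    return 0.5 * (np.sum(var0 / var1)
                  + np.sum((mu1 - mu0) ** 2 / var1)
                  - d
                  + np.sum(np.log(var1) - np.log(var0)))

def triplet_loss(E_uv, E_uw):
    """Square-exponential ranking term E_uv^2 + exp(-E_uw):
    small when u is closer (in KL) to v than to w."""
    return E_uv ** 2 + np.exp(-E_uw)

# Illustrative embeddings for a triplet (u, v, w)
mu_u, var_u = np.zeros(4), np.ones(4)
mu_v, var_v = 0.1 * np.ones(4), np.ones(4)     # v: a nearby 1-hop neighbour
mu_w, var_w = 2.0 * np.ones(4), np.ones(4)     # w: a distant node

E_uv = kl_diag_gauss(mu_u, var_u, mu_v, var_v)
E_uw = kl_diag_gauss(mu_u, var_u, mu_w, var_w)
print(triplet_loss(E_uv, E_uw))
```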
Node Embeddings vs. Graph Embeddings
- Leverage node embeddings to get graph embeddings
- Graphs have different numbers of nodes: Aggregate node embeddings with a permutation-invariant function (mean, sum, max); see the sketch below
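A minimal sketch of turning node embeddings into a fixed-size graph embedding via aggregation; the embedding matrices are illustrative:

```python
import numpy as np

# Node embeddings for two graphs with different numbers of nodes (illustrative)
Z_graph_a = np.random.randn(12, 16)
Z_graph_b = np.random.randn(30, 16)

def graph_embedding(Z, agg="mean"):
    """Aggregate node embeddings into one fixed-size graph embedding."""
    if agg == "mean":
        return Z.mean(axis=0)
    if agg == "sum":
        return Z.sum(axis=0)
    if agg == "max":
        return Z.max(axis=0)
    raise ValueError(agg)

# Both graphs now map to vectors of the same dimensionality (16 here)
print(graph_embedding(Z_graph_a).shape, graph_embedding(Z_graph_b).shape)
```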