Published

Friday, August 2, 2024

Graph Classification - Semi-Supervised Learning

Tasks

  • Classifying a protein as an enzyme or a non-enzyme
  • Detecting fraud in a financial transaction
  • Predicting users’ preferences in a social network

Collective Classification

  • A graph represents a social network
  • Labels are known for some nodes (labeled nodes)
  • Goal: Classify the remaining nodes
  • Standard assumption: Nodes with similar labels are likely to be connected (smoothness assumption or homophily)

Label Propagation

  • \(V = S \cup U\)
    • \(S\): Seed nodes
    • \(U\): Unlabeled nodes
    • \(W\): Symmetric weight adjacency matrix
    • \(y_i \in \{0, 1\}\): Node labels
  • Idea: Energy minimization
    • \[E = \sum_{i,j} W_{ij}(y_i - y_j)^2\]
    • Smoothness: Adjacent nodes should have the same class
    • Subject to \(y_i = \hat{y}_i\) for \(i \in S\)
  • Constrained integer optimization problem
    • Rewritten as \(\min y^T L y\)
    • Relaxation: \(y_i \in [0,1]\) (continuous relaxation of the integer constraint)
    • \[L = \begin{bmatrix} L_{ss} & L_{su} \\ L_{us} & L_{uu} \end{bmatrix}\]
    • \[y = \begin{bmatrix} y_s \\ y_u \end{bmatrix} = \begin{bmatrix} \hat{y}_s \\ y_u \end{bmatrix}\]
    • \(y_u = -L_{uu}^{-1} L_{us} y_s\) (closed-form solution)
    • Unconstrained optimization problem
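
A minimal numpy sketch of this closed form (function and variable names are illustrative): setting the gradient of \(E\) with respect to \(y_u\) to zero gives \(L_{uu} y_u = -L_{us} y_s\), which is solved directly below.

```python
import numpy as np

def label_propagation_closed_form(W, seed_idx, seed_labels):
    """Harmonic solution y_u = -L_uu^{-1} L_us y_s for binary labels in {0, 1}."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                      # unnormalized graph Laplacian
    unlabeled_idx = np.setdiff1d(np.arange(n), seed_idx)

    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_us = L[np.ix_(unlabeled_idx, seed_idx)]

    # Solve L_uu y_u = -L_us y_s instead of forming the inverse explicitly.
    y_u = np.linalg.solve(L_uu, -L_us @ seed_labels)

    y = np.zeros(n)
    y[seed_idx] = seed_labels
    y[unlabeled_idx] = y_u
    return y                                            # threshold at 0.5 for hard labels

# Toy example: path graph 0-1-2-3 with both endpoints labeled.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(label_propagation_closed_form(W, np.array([0, 3]), np.array([0.0, 1.0])))  # [0, 1/3, 2/3, 1]
```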

Generalization to \(k\) Labels

  • \(y_{ik} = 1\) if node \(i\) is of class \(k\)
  • \[E(y) = \sum_{i,j} W_{ij}\,\|y_i - y_j\|^2\]
  • Compatibility matrix \(H\)
    • Models the compatibility between labels
    • \[E(y) = \sum_{i,j} W_{ij}\,(y_i - y_j)^T H\,(y_i - y_j)\]
    • Example: \[H = \begin{bmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{bmatrix}\]
      • Using heterophily: Nodes with different labels are likely to be connected (leaders and followers)
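
For \(H = I\) the energy decouples per class, so the same closed form as above can be applied column-wise to a one-hot label matrix; below is a sketch under that simplification (a general \(H\) couples the classes and is not handled here), with illustrative names:

```python
import numpy as np

def label_propagation_multiclass(W, seed_idx, seed_classes, k):
    """Apply the harmonic closed form column-wise to one-hot labels, then take the argmax."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    unlabeled_idx = np.setdiff1d(np.arange(n), seed_idx)

    Y_s = np.eye(k)[seed_classes]                       # one-hot rows for the seed nodes
    Y_u = np.linalg.solve(L[np.ix_(unlabeled_idx, unlabeled_idx)],
                          -L[np.ix_(unlabeled_idx, seed_idx)] @ Y_s)

    pred = np.empty(n, dtype=int)
    pred[seed_idx] = seed_classes
    pred[unlabeled_idx] = Y_u.argmax(axis=1)            # class with the largest score
    return pred
```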

Alternative Variant

  • Soft constraint on \(y_i = \hat{y}_i\)
  • \[\min_{F \in \mathbb{R}^{n \times k}} \sum_{i,j} W_{ij}\,\|f_i - f_j\|^2 + \mu \sum_i \|f_i - y_i\|^2\]
    • \(f_i\): \(i\)-th row of \(F\)
    • \(y_i\): \(i\)-th row of \(Y\) (one-hot encoded)
    • Normalize \(f_i\) and \(y_i\) by the degree of the node
    • \(\mu\): Regularization parameter (softens the constraint)
    • Solution: \[F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y\]
      • \(\alpha = \frac{2}{2 + \mu}\)
      • Personalized PageRank where \(Y\) encodes the seed labels
      • Compute a PageRank-style score for each class and assign each node the class with the highest score
      • \(W D^{-1}\) is the normalized random walk matrix \(M\)
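
A sketch of this soft-constraint variant via the closed form \(F = (1 - \alpha)(I - \alpha W D^{-1})^{-1} Y\); the function name and setup are illustrative, and isolated nodes are assumed not to occur:

```python
import numpy as np

def soft_label_propagation(W, Y, mu):
    """Closed form F = (1 - alpha) (I - alpha W D^{-1})^{-1} Y with alpha = 2 / (2 + mu).

    Y is an n x k matrix with one-hot rows for seed nodes and zero rows otherwise.
    """
    alpha = 2.0 / (2.0 + mu)
    M = W / W.sum(axis=0, keepdims=True)        # random walk matrix W D^{-1} (column-normalized)
    n = W.shape[0]
    F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * M, Y)
    return F.argmax(axis=1)                     # class with the highest PageRank-style score
```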

Label Propagation vs. Stochastic Block Models (SBM)

  • Labels \(y_i\) look like communities \(z_i\)
  • Compatibility matrix \(H\) resembles the SBM matrix
  • SBM:
    • Can also incorporate observed labels
    • Generative model: Models \(p(W|z)\) and \(p(z)\) (in contrast, LP is closer to a discriminative model of \(p(y|W)\))
    • Used to generate new graphs (unlike LP)
    • Estimates parameters that generated a given graph (unsupervised)
    • Used for link prediction, etc.
  • LP:
    • Predicts labels of \(U\) given observed labels and \(W\) (supervised)
    • Different problem-solving approach

Transductive Learning

  • Label propagation is a special case of transductive learning
  • Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
    • May include additional structural information like \(A\)
    • Goal: Predict labels of only the unlabeled set
    • Easier task: We do not solve the general problem of predicting labels for arbitrary new graphs or nodes
  • Vapnik Principle: Solve only the specific problem, not a more difficult intermediate step

Transductive Learning vs. Semi-Supervised Learning

  • LP is also graph-based semi-supervised learning (more precisely: semi-supervised transductive learning)
  • Semi-supervised learning: Labeled set \(T = \{(x_i, y_i)\}\) and an unlabeled set \(U = \{x_i\}\)
    • Goal: Use both \(T\) and \(U\) to learn a mapping \(f\)
    • \(f\) can be transductive or inductive
  • Transductive learning is almost always semi-supervised because both \(T\) and \(U\) are used

Why Does Semi-Supervised Learning Work?

  • Unlabeled data helps to better model the underlying data distribution
  • Assumptions about the data and labels (e.g., smoothness) provide an additional learning signal
  • Unlabeled data is typically available in much larger quantities than labeled data

Self-Learning

  • Learn embeddings first, then the classifier

Attributed Graphs

  • Each node \(i\) has a feature vector \(x_i \in \mathbb{R}^n\)
  • Traditional methods: Markov random fields (similar to HMMs)
    • Latent variables are not sequential but graph-structured
    • Parts of the latent variables are observed
  • Modern methods: Graph Neural Networks (GNNs)

Graph Construction

  • Handle vector data when no structure is available
    • Construct a graph connecting similar data points (similarity function)
    • Apply label propagation
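
A minimal sketch of the construction step, using a symmetric k-nearest-neighbor graph with RBF weights (the choices of \(k\) and the bandwidth \(\gamma\) are illustrative); any of the label propagation methods above can then be run on the resulting \(W\):

```python
import numpy as np

def knn_rbf_graph(X, k=10, gamma=1.0):
    """Symmetric kNN graph with RBF weights w_ij = exp(-gamma * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)                               # no self-loops
    # Keep only the k largest similarities per row, then symmetrize with a logical OR.
    keep = np.argsort(W, axis=1)[:, -k:]
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return np.where(mask | mask.T, W, 0.0)
```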

Graph Classification vs. Node Classification

  • Node classification: Classify individual nodes
  • Graph classification: Classify the entire graph as a single instance (standard iid problem)

Graph Kernel Methods

  • Shortest path kernel
  • Random walk kernel
  • Use an SVM or GNNs
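
A hedged sketch of a shortest-path kernel in its simplest form: summarize each graph by a histogram of shortest-path lengths (Floyd-Warshall) and take the inner product of the histograms as the kernel value. This simplified variant ignores node labels and is meant only to illustrate the idea; the resulting kernel matrix could then be passed to an SVM.

```python
import numpy as np

def shortest_path_histogram(A, max_len=10):
    """Histogram of finite shortest-path lengths in an unweighted graph (Floyd-Warshall)."""
    n = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(n):                                   # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    lengths = D[np.isfinite(D) & (D > 0)].astype(int)
    return np.bincount(lengths, minlength=max_len + 1)[: max_len + 1]

def sp_kernel(A1, A2):
    """Kernel value = inner product of the two shortest-path-length histograms."""
    return float(shortest_path_histogram(A1) @ shortest_path_histogram(A2))
```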

Node Embeddings

  • Specialized ML methods for graphs map each node to a vector
  • Traditional ML methods (e.g., k-means, SVM) can then be applied to the embeddings as if they were iid data

Encoding Graph Structure in a Vector

  • Naive approaches:
    • Adjacency matrix as an image: Depends on the node ordering (not permutation invariant)
    • Concatenating features of neighbors: Variable length, \(O(n^2)\) scaling
  • Learning node representations (like word embeddings)
    • \(\phi: V \rightarrow \mathbb{R}^d\)
    • Similar to graph clustering with \(d\) smallest eigenvectors of the Laplacian
    • Close in embedding space \(\Rightarrow\) Close in graph space

Applications of Node Embeddings

  • Clustering: k-means
  • Semi-supervised classification: Classify remaining nodes based on learned embeddings
  • Link prediction: Measure likelihood of a link between nodes in embedding space
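
For link prediction, one common (illustrative) choice is to score a candidate edge by the dot product of its endpoint embeddings:

```python
import numpy as np

def link_score(Z, u, v):
    """Score a candidate edge (u, v) by the sigmoid of the embedding dot product."""
    return 1.0 / (1.0 + np.exp(-Z[u] @ Z[v]))
```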

Types of Embeddings

  • Role-based embeddings: Nodes with similar roles are close, even if far in the graph
  • Spectral embeddings: Nodes have similar embeddings if in the same cluster
  • Deep embeddings: Similarity depends on the loss function and architecture

Spectral Embeddings

  • \(L \in \mathbb{R}^{|V| \times |V|}\) projected to \(\mathbb{R}^{|V| \times k}\) using the \(k\) smallest eigenvectors of \(L\)
  • Linear transformation, similar to PCA
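
A minimal sketch: take the eigenvectors belonging to the \(k\) smallest eigenvalues of the (unnormalized) Laplacian as the embedding matrix.

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the returned matrix are k-dimensional spectral node embeddings."""
    L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)    # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]                   # |V| x k matrix of the k smallest eigenvectors
```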

Deep Embeddings

  • Inspired by NLP methods (e.g., word2vec)
  • DeepWalk: Transform the graph into random walks and apply word2vec
    • For every node \(v_i\), sample multiple random walks
    • Random walks correspond to sentences (sequences of discrete tokens)
    • Result: Nodes close in the graph are close in the embedding space
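
A sketch of the random-walk sampling behind DeepWalk (assuming a graph with no isolated nodes); the resulting walks, treated as sentences of node-id tokens, would then be fed to a word2vec implementation such as gensim's, which is not shown here.

```python
import numpy as np

def sample_walks(W, walks_per_node=10, walk_length=40, rng=None):
    """Sample fixed-length random walks from every node, proportional to edge weights."""
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    walks = []
    for _ in range(walks_per_node):
        for start in range(n):
            walk = [start]
            for _ in range(walk_length - 1):
                walk.append(rng.choice(n, p=P[walk[-1]]))
            walks.append([str(v) for v in walk])   # word2vec expects sequences of tokens
    return walks
```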

Graph2Gauss

  • Map each node \(v_i\) to a Gaussian distribution in embedding space
    • Neighbors are closer than second neighbors, preserving ranking
    • Learn mapping: \(f: v_i \rightarrow N(\mu(x_i), \sigma(x_i))\)
      • \(x_i\): Feature vector of \(v_i\)
    • Define loss for each node: Ranking loss (triplet \(u, v, w\))
      • \(u\) is closer to \(v\) than to \(w\)
      • \[L(u) = \sum_{v, w} \big( E_{uv}^2 + \exp(-E_{uw}) \big)\]
      • \[E_{uv} = \mathrm{KL}\big(f(u) \,\|\, f(v)\big)\]
    • Result: Distances in graph correlate with distances in embedding space, variances encode uncertainty
    • Inductive learning: Embeddings can be computed for new nodes from their features (unlike DeepWalk or spectral embeddings)
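
A sketch of the two ingredients of the loss, assuming diagonal Gaussian embeddings (the encoder producing \(\mu\) and \(\sigma\) from the node features is omitted):

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for diagonal Gaussians."""
    return 0.5 * np.sum(var_p / var_q + (mu_q - mu_p) ** 2 / var_q
                        - 1.0 + np.log(var_q / var_p))

def triplet_loss(mu, var, u, v, w):
    """Square-exponential ranking loss: u should be closer to v (1-hop) than to w (2-hop)."""
    E_uv = kl_diag_gauss(mu[u], var[u], mu[v], var[v])
    E_uw = kl_diag_gauss(mu[u], var[u], mu[w], var[w])
    return E_uv ** 2 + np.exp(-E_uw)
```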

Node Embeddings vs. Graph Embeddings

  • Leverage node embeddings to get graph embeddings
  • Graphs have different numbers of nodes: Use an aggregation function (mean, sum, max) over the node embeddings
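
A minimal sketch of turning node embeddings into a fixed-size graph embedding by pooling (the choice of aggregation is a hyperparameter):

```python
import numpy as np

def graph_embedding(Z, agg="mean"):
    """Aggregate an (n_nodes x d) node-embedding matrix into a single d-dimensional vector."""
    if agg == "mean":
        return Z.mean(axis=0)
    if agg == "sum":
        return Z.sum(axis=0)
    if agg == "max":
        return Z.max(axis=0)
    raise ValueError(f"unknown aggregation: {agg}")
```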