Published: Tuesday, July 30, 2024

Clustering

We can use the graph structure to extract clusters without needing any node features; the resulting clusters can later be interpreted using prior knowledge.

Main Idea: There are many edges within clusters and few edges between clusters, i.e., high intra-cluster density and low inter-cluster density.

Categories of Clustering Algorithms

  1. Partitioning Methods: Divide the graph into non-overlapping clusters.
  2. Non-Partitioning Methods: Allow clusters to overlap.

Partitioning Methods

Basic Idea: Formulate clustering as a constrained optimization problem.

k-Means with Different Distance Metrics

Given a graph \(G = (V, E)\) and an objective:

  • Define a set of clusters \(C = \{C_1, C_2, \ldots, C_k\}\).
  • Optimize an objective \(F(C)\) subject to \(C_1 \cup C_2 \cup \ldots \cup C_k = V\) with \(C_1, C_2, \ldots, C_k\) pairwise disjoint.

This problem is NP-hard in general due to its discrete nature; for \(k = 2\) with the minimum-cut objective below, polynomial-time algorithms exist.

Global Minimum Cut

Minimize the number of edges between clusters:

  • Objective: Partition the graph into two clusters such that the cut \(\text{cut}(C_1, C_2) = \sum_{i \in C_1, j \in C_2} W_{ij}\), the total weight of edges between \(C_1\) and \(C_2\), is minimized.
  • A source-sink (s-t) minimum cut can be computed in polynomial time using the Ford-Fulkerson max-flow algorithm.
  • In graph clustering, however, no specific source and sink are given, and more than two clusters may be needed.
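
As a concrete illustration, here is a minimal Python sketch (assuming `networkx` and a toy two-cluster graph of my own construction) contrasting the global minimum cut with the source-sink cut:

```python
import networkx as nx

# Toy weighted graph: two dense triangles joined by one weak bridge edge.
G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 3.0), (1, 2, 3.0), (0, 2, 3.0),  # triangle 1
    (3, 4, 3.0), (4, 5, 3.0), (3, 5, 3.0),  # triangle 2
    (2, 3, 0.5),                            # weak inter-cluster bridge
])

# Global minimum cut (no source/sink required): Stoer-Wagner runs in
# polynomial time and should sever the weak bridge.
cut_value, (part_a, part_b) = nx.stoer_wagner(G, weight="weight")
print(cut_value, part_a, part_b)  # e.g. 0.5, [0, 1, 2], [3, 4, 5]

# Source-sink minimum cut via max-flow duality (Ford-Fulkerson is one
# such algorithm); this variant needs a designated source and sink.
cut_value_st, (side_s, side_t) = nx.minimum_cut(G, 0, 5, capacity="weight")
print(cut_value_st)  # 0.5
```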

Normalized Cut

Minimum cuts tend to favor splitting off small clusters, which motivates balanced objectives.

Ratio Cut:

  • Minimize \(\text{cut}(C_1, C_2) / |C_1| + \text{cut}(C_1, C_2) / |C_2|\).
  • Does not consider intra-cluster density.

Normalized Cut:

  • Minimize \(\text{cut}(C_1, C_2) / \text{vol}(C_1) + \text{cut}(C_1, C_2) / \text{vol}(C_2)\), where \(\text{vol}(C)\) is the sum of the degrees of all nodes in \(C\).
  • For more than two clusters, minimize \(\sum_{i} \text{cut}(C_i, V \setminus C_i) / \text{vol}(C_i)\).
  • This is NP-hard; approximate solutions are used in practice (e.g., the spectral relaxation below). Both objectives are evaluated numerically in the sketch that follows.
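
A small numpy sketch (function names and the toy adjacency matrix are my own) that evaluates these objectives for a given two-way partition:

```python
import numpy as np

def cut_value(W, c1, c2):
    """Total weight of edges crossing between node sets c1 and c2."""
    return W[np.ix_(c1, c2)].sum()

def ratio_cut(W, c1, c2):
    c = cut_value(W, c1, c2)
    return c / len(c1) + c / len(c2)

def normalized_cut(W, c1, c2):
    c = cut_value(W, c1, c2)
    deg = W.sum(axis=1)                           # node degrees
    return c / deg[c1].sum() + c / deg[c2].sum()  # vol(C) = sum of degrees

# Two triangles joined by one weak edge (same toy graph as above).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 3), (1, 2, 3), (0, 2, 3),
                (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 0.5)]:
    W[i, j] = W[j, i] = w

print(normalized_cut(W, [0, 1, 2], [3, 4, 5]))   # small: balanced split
print(normalized_cut(W, [0], [1, 2, 3, 4, 5]))   # large: tiny cluster penalized
```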

Graph Laplacian - Background

The graph Laplacian is \(L = D - W\), where:

  • \(D\) is the diagonal degree matrix.
  • \(W\) is the weighted adjacency matrix.
  • For undirected graphs, \(L\) is symmetric and positive semi-definite.

Key Properties:

  • \(f^T L f = \sum_{(i,j) \in E} W_{ij} (f_i - f_j)^2 = \frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2\).
  • The smallest eigenvalue is 0, with the constant vector as the corresponding eigenvector.
  • The multiplicity of the eigenvalue 0 equals the number of connected components.
  • If the graph is connected, the second smallest eigenvalue \(\lambda_2\) is greater than 0.
  • \(\lambda_2\), the Fiedler value, measures the graph's algebraic connectivity.
  • The Laplacian is additive under graph union: \(L(G_1 \cup G_2) = L(G_1) + L(G_2)\) when \(G_1\) and \(G_2\) are edge-disjoint graphs on the same vertex set.
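
These properties are straightforward to verify numerically; a minimal sketch (numpy, with a toy two-component graph of my own choosing):

```python
import numpy as np

# Two disjoint triangles: a graph with exactly 2 connected components.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))  # degree matrix
L = D - W                   # unnormalized graph Laplacian

# Quadratic form identity: f^T L f = (1/2) * sum_{i,j} W_ij (f_i - f_j)^2.
f = np.random.randn(6)
lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                for i in range(6) for j in range(6))
assert np.isclose(lhs, rhs)

# Eigenvalues are nonnegative (positive semi-definite), and the multiplicity
# of the eigenvalue 0 equals the number of connected components (here 2).
eigvals = np.linalg.eigvalsh(L)
print(np.round(eigvals, 6))                 # [0. 0. 3. 3. 3. 3.]
print(int(np.sum(np.isclose(eigvals, 0))))  # 2
```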

Normalized/Symmetric Laplacian: \(L_{\text{sym}} = D^{-1/2} L D^{-1/2}\).

Computational Methods: Spectral clustering solves a relaxed eigenvalue problem.

  • For \(k = 2\):
    • Define \(f\) such that \(f_i = +1\) if \(i \in C_1\) and \(f_i = -1\) if \(i \in C_2\).
    • Then \(f^T L f = 4 \cdot \text{cut}(C_1, C_2)\).
    • Relax \(f\) to real values and minimize \(f^T L f\) subject to \(\|f\| = \sqrt{n}\) (equivalently \(f^T f = n\)) and \(f^T \mathbf{1} = 0\).
    • The solution is the eigenvector of \(L\) corresponding to the second smallest eigenvalue (the Fiedler vector).

  • For \(k > 2\):
    • Indicator matrix \(H = [h_1, h_2, \ldots, h_k]\) with entries \(H_{ij} = 1/\sqrt{|C_j|}\) if \(i \in C_j\) and 0 otherwise (columns are scaled indicator vectors).
    • \(H^T H = I\).
    • Minimize \(\text{trace}(H^T L H)\).
    • The (relaxed) solution is given by the eigenvectors of \(L\) corresponding to the \(k\) smallest eigenvalues.
  • The spectral embedding of node \(i\) is the \(i\)-th row of \(H\), i.e., the \(i\)-th entry of each of the \(k\) eigenvectors.
  • Perform k-means on the spectral embeddings to obtain cluster assignments (see the sketch below).
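
Putting the pieces together, a minimal spectral clustering sketch (dense numpy eigendecomposition plus scikit-learn's k-means; the helper name and toy graph are my own):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster the nodes of weighted adjacency matrix W into k groups."""
    D = np.diag(W.sum(axis=1))
    L = D - W                              # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh sorts eigenvalues ascending
    H = eigvecs[:, :k]                     # k smallest eigenvectors
    # Row i of H is the spectral embedding of node i; cluster the rows.
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)

# Two triangles joined by a weak bridge (toy graph from earlier sketches).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 3), (1, 2, 3), (0, 2, 3),
                (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 0.5)]:
    W[i, j] = W[j, i] = w

print(spectral_clustering(W, k=2))  # e.g. [0 0 0 1 1 1]
```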

Because \(f^T L f\) penalizes each edge by \(W_{ij} (f_i - f_j)^2\), increasing the weight between nodes in different clusters pulls their spectral embeddings toward each other, making the clusters harder to separate.

Drawbacks: There is no guarantee that the spectral relaxation approximates the optimal cut well in all cases, but it works well in practice.

Spectral Clustering for Numerical Data:

  • Requires a similarity function, e.g., the Gaussian kernel \(\text{sim}(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)\).
  • Create an edge between \(x\) and \(y\) if \(\text{sim}(x, y) > t\) or if \(x\) and \(y\) are mutual \(k\)-nearest neighbors.
  • Apply spectral clustering to the resulting graph.
  • Advantages: works well for non-convex clusters and complex shapes, and is more effective than applying k-means directly, as the sketch below illustrates.
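
For instance, a brief scikit-learn sketch on the classic two-moons data (dataset and parameter choices are my own) showing where plain k-means fails but spectral clustering on a nearest-neighbor graph succeeds:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

def accuracy(labels, y):
    a = (labels == y).mean()
    return max(a, 1 - a)  # account for arbitrary label permutation

# k-means cuts straight through the moons (it assumes convex clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering on a k-nearest-neighbor similarity graph recovers
# the non-convex moon shapes.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print(accuracy(km, y_true))  # noticeably below 1.0
print(accuracy(sc, y_true))  # close to 1.0
```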

Probabilistic Community Detection

Optimization Methods: Find a clustering that minimizes an objective function, such as the minimum cut.

Generative Models:

  • Consider the observed graph as a realization from a generative model controlled by parameters.
  • Communities are explicitly modeled within the generative model (e.g., PPM, SBM).
  • Finding a good clustering is then equivalent to parameter inference.

Example: PPM (Planted Partition Model):

  • Two communities encoded by a latent variable \(z\), with indicators \(z_i \in \{+1, -1\}\).
  • Edge probability \(P(A_{ij} = 1 \mid z_i, z_j) = p\) if \(z_i = z_j\) and \(q\) otherwise, with \(p > q\) (edges are more likely within communities).
  • Maximize the posterior estimate \(z^* = \arg\max_z P(Z \mid A)\).
  • Assume balanced communities, \(|C_1| = |C_2| = n/2\), so that \(\sum_i z_i = 0\) (a sampling sketch follows).
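
To make the generative view concrete, here is a small sampling sketch (parameter values of my own choosing) that draws an adjacency matrix from a planted partition model:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, q = 100, 0.3, 0.05           # within-community p > across-community q
z = np.repeat([1, -1], n // 2)     # balanced planted communities, sum(z) = 0

# P(A_ij = 1 | z_i, z_j) = p if z_i == z_j else q (independently per pair).
probs = np.where(np.equal.outer(z, z), p, q)
A = (rng.random((n, n)) < probs).astype(int)
A = np.triu(A, k=1)                # keep each pair i < j once, no self-loops
A = A + A.T                        # symmetrize: undirected graph

# Within-community edges should far outnumber edges across the cut.
same = np.equal.outer(z, z)
print(A[same].sum() // 2, A[~same].sum() // 2)  # e.g. ~735 vs. ~125
```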

We can show that \(P(Z \mid A) \propto P(A \mid Z) \propto \left( \frac{(1-p)q}{(1-q)p} \right)^{E_{\text{cut}}(z)}\), where \(E_{\text{cut}}(z) = \sum_{i<j} A_{ij} \, \mathbb{1}(z_i \ne z_j)\). Since \(p > q\), the base is less than 1, so we minimize the cut to maximize the posterior; this specific case can be solved with spectral clustering. The derivation is sketched below.
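
The proportionality can be derived in a few lines; a sketch (with \(E_{\text{in}}\) the number of within-community edges, \(m\) the total number of edges, and \(N_{\text{in}}, N_{\text{cut}}\) the fixed pair counts for balanced \(z\)):

```latex
\begin{align*}
P(A \mid Z)
  &= \prod_{i<j,\, z_i = z_j} p^{A_{ij}} (1-p)^{1-A_{ij}}
     \prod_{i<j,\, z_i \ne z_j} q^{A_{ij}} (1-q)^{1-A_{ij}} \\
  &= p^{E_{\text{in}}} (1-p)^{N_{\text{in}} - E_{\text{in}}}
     \, q^{E_{\text{cut}}} (1-q)^{N_{\text{cut}} - E_{\text{cut}}} \\
  &\propto \Big( \tfrac{p}{1-p} \Big)^{E_{\text{in}}}
           \Big( \tfrac{q}{1-q} \Big)^{E_{\text{cut}}}
     \quad \text{(dropping factors independent of } z\text{)} \\
  &\propto \Big( \tfrac{(1-p)\,q}{(1-q)\,p} \Big)^{E_{\text{cut}}(z)}
     \quad \text{(substituting } E_{\text{in}} = m - E_{\text{cut}}\text{)}.
\end{align*}
```

Since \(p > q\), the base is less than 1, so maximizing the likelihood (and, with a uniform prior, the posterior) means minimizing \(E_{\text{cut}}(z)\).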

Example: SBM (Stochastic Block Model):

  • Detailed explanation skipped.
  • The exact posterior is intractable; variational inference is used to approximate it and infer the model parameters.