Published: Tuesday, July 30, 2024

Clustering

We can use the graph structure to extract clusters without needing any node features; the resulting clusters can later be interpreted using prior knowledge.

Main Idea: There are many edges within clusters and few edges between clusters, i.e., high intra-cluster density and low inter-cluster density.

Categories of Clustering Algorithms

  1. Partitioning Methods: Divide the graph into non-overlapping clusters.
  2. Non-Partitioning Methods: Allow clusters to overlap.

Partitioning Methods

Basic Idea: Formulate clustering as a constrained optimization problem.

k-Means with Different Distance Metrics

Given a graph \(G = (V, E)\) and an objective:

  • Define a set of clusters \(C = \{C_1, C_2, \ldots, C_k\}\).
  • Optimize an objective \(F(C)\) subject to \(C_1 \cup C_2 \cup \ldots \cup C_k = V\) with \(C_1, C_2, \ldots, C_k\) pairwise disjoint.

This problem is NP-hard in general due to its discrete nature; for \(k = 2\) with the minimum-cut objective below, polynomial-time algorithms exist.

Global Minimum Cut

Minimize the number of edges between clusters:

  • Objective: Partition the graph into two clusters such that the cut \(\text{cut}(C_1, C_2) = \sum_{i \in C_1, j \in C_2} W_{ij}\), the total weight of edges between \(C_1\) and \(C_2\), is minimized.
  • A source-sink (s-t) minimum cut can be computed in polynomial time using the Ford-Fulkerson max-flow algorithm.
  • In graph clustering, however, no specific source and sink are given, and more than two clusters may be needed.
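
As a concrete illustration, here is a minimal Python sketch (assuming `networkx` and a toy two-cluster graph of my own construction) contrasting the global minimum cut with the source-sink cut:

```python
import networkx as nx

# Toy weighted graph: two dense triangles joined by one weak bridge edge.
G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 3.0), (1, 2, 3.0), (0, 2, 3.0),  # triangle 1
    (3, 4, 3.0), (4, 5, 3.0), (3, 5, 3.0),  # triangle 2
    (2, 3, 0.5),                            # weak inter-cluster bridge
])

# Global minimum cut (no source/sink required): Stoer-Wagner runs in
# polynomial time and should sever the weak bridge.
cut_value, (part_a, part_b) = nx.stoer_wagner(G, weight="weight")
print(cut_value, part_a, part_b)  # e.g. 0.5, [0, 1, 2], [3, 4, 5]

# Source-sink minimum cut via max-flow duality (Ford-Fulkerson is one
# such algorithm); this variant needs a designated source and sink.
cut_value_st, (side_s, side_t) = nx.minimum_cut(G, 0, 5, capacity="weight")
print(cut_value_st)  # 0.5
```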

Normalized Cut

Minimum cuts tend to favor splitting off small clusters, which motivates balanced objectives.

Ratio Cut:

  • Minimize \(\text{cut}(C_1, C_2) / |C_1| + \text{cut}(C_1, C_2) / |C_2|\).
  • Does not consider intra-cluster density.

Normalized Cut:

  • Minimize \(\text{cut}(C_1, C_2) / \text{vol}(C_1) + \text{cut}(C_1, C_2) / \text{vol}(C_2)\), where \(\text{vol}(C)\) is the sum of the degrees of all nodes in \(C\).
  • For more than two clusters, minimize \(\sum_{i} \text{cut}(C_i, V \setminus C_i) / \text{vol}(C_i)\).
  • This is NP-hard; approximate solutions are used in practice (e.g., the spectral relaxation below). Both objectives are evaluated numerically in the sketch that follows.
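
A small numpy sketch (function names and the toy adjacency matrix are my own) that evaluates these objectives for a given two-way partition:

```python
import numpy as np

def cut_value(W, c1, c2):
    """Total weight of edges crossing between node sets c1 and c2."""
    return W[np.ix_(c1, c2)].sum()

def ratio_cut(W, c1, c2):
    c = cut_value(W, c1, c2)
    return c / len(c1) + c / len(c2)

def normalized_cut(W, c1, c2):
    c = cut_value(W, c1, c2)
    deg = W.sum(axis=1)                           # node degrees
    return c / deg[c1].sum() + c / deg[c2].sum()  # vol(C) = sum of degrees

# Two triangles joined by one weak edge (same toy graph as above).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 3), (1, 2, 3), (0, 2, 3),
                (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 0.5)]:
    W[i, j] = W[j, i] = w

print(normalized_cut(W, [0, 1, 2], [3, 4, 5]))   # small: balanced split
print(normalized_cut(W, [0], [1, 2, 3, 4, 5]))   # large: tiny cluster penalized
```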

Graph Laplacian - Background

The graph Laplacian is \(L = D - W\), where:

  • \(D\) is the diagonal degree matrix.
  • \(W\) is the weighted adjacency matrix.
  • For undirected graphs, \(L\) is symmetric and positive semi-definite.

Key Properties:

  • \(f^T L f = \sum_{(i,j) \in E} W_{ij} (f_i - f_j)^2 = \frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2\).
  • The smallest eigenvalue is 0, with the constant vector as the corresponding eigenvector.
  • The multiplicity of the eigenvalue 0 equals the number of connected components.
  • If the graph is connected, the second smallest eigenvalue \(\lambda_2\) is greater than 0.
  • \(\lambda_2\), the Fiedler value, measures the graph's algebraic connectivity.
  • The Laplacian is additive under graph union: \(L(G_1 \cup G_2) = L(G_1) + L(G_2)\) when \(G_1\) and \(G_2\) are edge-disjoint graphs on the same vertex set.
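
These properties are straightforward to verify numerically; a minimal sketch (numpy, with a toy two-component graph of my own choosing):

```python
import numpy as np

# Two disjoint triangles: a graph with exactly 2 connected components.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))  # degree matrix
L = D - W                   # unnormalized graph Laplacian

# Quadratic form identity: f^T L f = (1/2) * sum_{i,j} W_ij (f_i - f_j)^2.
f = np.random.randn(6)
lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                for i in range(6) for j in range(6))
assert np.isclose(lhs, rhs)

# Eigenvalues are nonnegative (positive semi-definite), and the multiplicity
# of the eigenvalue 0 equals the number of connected components (here 2).
eigvals = np.linalg.eigvalsh(L)
print(np.round(eigvals, 6))                 # [0. 0. 3. 3. 3. 3.]
print(int(np.sum(np.isclose(eigvals, 0))))  # 2
```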

Normalized/Symmetric Laplacian: \(L_{\text{sym}} = D^{-1/2} L D^{-1/2}\).

Computational Methods: Spectral clustering solves a relaxed eigenvalue problem.

  • For \(k = 2\):
    • Define \(f\) such that \(f_i = +1\) if \(i \in C_1\) and \(f_i = -1\) if \(i \in C_2\).
    • Then \(f^T L f = 4 \cdot \text{cut}(C_1, C_2)\).
    • Relax \(f\) to real values and minimize \(f^T L f\) subject to \(\|f\| = \sqrt{n}\) (equivalently \(f^T f = n\)) and \(f^T \mathbf{1} = 0\).
    • The solution is the eigenvector of \(L\) corresponding to the second smallest eigenvalue (the Fiedler vector).

  • For \(k > 2\):
    • Indicator matrix \(H = [h_1, h_2, \ldots, h_k]\) with entries \(H_{ij} = 1/\sqrt{|C_j|}\) if \(i \in C_j\) and 0 otherwise (columns are scaled indicator vectors).
    • \(H^T H = I\).
    • Minimize \(\text{trace}(H^T L H)\).
    • The (relaxed) solution is given by the eigenvectors of \(L\) corresponding to the \(k\) smallest eigenvalues.
  • The spectral embedding of node \(i\) is the \(i\)-th row of \(H\), i.e., the \(i\)-th entry of each of the \(k\) eigenvectors.
  • Perform k-means on the spectral embeddings to obtain cluster assignments (see the sketch below).
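
Putting the pieces together, a minimal spectral clustering sketch (dense numpy eigendecomposition plus scikit-learn's k-means; the helper name and toy graph are my own):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster the nodes of weighted adjacency matrix W into k groups."""
    D = np.diag(W.sum(axis=1))
    L = D - W                              # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh sorts eigenvalues ascending
    H = eigvecs[:, :k]                     # k smallest eigenvectors
    # Row i of H is the spectral embedding of node i; cluster the rows.
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)

# Two triangles joined by a weak bridge (toy graph from earlier sketches).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 3), (1, 2, 3), (0, 2, 3),
                (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 0.5)]:
    W[i, j] = W[j, i] = w

print(spectral_clustering(W, k=2))  # e.g. [0 0 0 1 1 1]
```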

Because \(f^T L f\) penalizes each edge by \(W_{ij} (f_i - f_j)^2\), increasing the weight between nodes in different clusters pulls their spectral embeddings toward each other, making the clusters harder to separate.

Drawbacks: There is no guarantee that the spectral relaxation approximates the optimal cut well in all cases, but it works well in practice.

Spectral Clustering for Numerical Data:

  • Requires a similarity function, e.g., the Gaussian kernel \(\text{sim}(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)\).
  • Create an edge between \(x\) and \(y\) if \(\text{sim}(x, y) > t\) or if \(x\) and \(y\) are mutual \(k\)-nearest neighbors.
  • Apply spectral clustering to the resulting graph.
  • Advantages: works well for non-convex clusters and complex shapes, and is more effective than applying k-means directly, as the sketch below illustrates.
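
For instance, a brief scikit-learn sketch on the classic two-moons data (dataset and parameter choices are my own) showing where plain k-means fails but spectral clustering on a nearest-neighbor graph succeeds:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

def accuracy(labels, y):
    a = (labels == y).mean()
    return max(a, 1 - a)  # account for arbitrary label permutation

# k-means cuts straight through the moons (it assumes convex clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering on a k-nearest-neighbor similarity graph recovers
# the non-convex moon shapes.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print(accuracy(km, y_true))  # noticeably below 1.0
print(accuracy(sc, y_true))  # close to 1.0
```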

Probabilistic Community Detection

Optimization Methods: Find a clustering that minimizes an objective function, such as the minimum cut.

Generative Models:

  • Consider the observed graph as a realization from a generative model controlled by parameters.
  • Communities are explicitly modeled within the generative model (e.g., PPM, SBM).
  • Finding a good clustering is then equivalent to parameter inference.

Example: PPM (Planted Partition Model):

  • Two communities encoded by a latent variable \(z\), with indicators \(z_i \in \{+1, -1\}\).
  • Edge probability \(P(A_{ij} = 1 \mid z_i, z_j) = p\) if \(z_i = z_j\) and \(q\) otherwise, with \(p > q\) (edges are more likely within communities).
  • Maximize the posterior estimate \(z^* = \arg\max_z P(Z \mid A)\).
  • Assume balanced communities, \(|C_1| = |C_2| = n/2\), so that \(\sum_i z_i = 0\) (a sampling sketch follows).
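
To make the generative view concrete, here is a small sampling sketch (parameter values of my own choosing) that draws an adjacency matrix from a planted partition model:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, q = 100, 0.3, 0.05           # within-community p > across-community q
z = np.repeat([1, -1], n // 2)     # balanced planted communities, sum(z) = 0

# P(A_ij = 1 | z_i, z_j) = p if z_i == z_j else q (independently per pair).
probs = np.where(np.equal.outer(z, z), p, q)
A = (rng.random((n, n)) < probs).astype(int)
A = np.triu(A, k=1)                # keep each pair i < j once, no self-loops
A = A + A.T                        # symmetrize: undirected graph

# Within-community edges should far outnumber edges across the cut.
same = np.equal.outer(z, z)
print(A[same].sum() // 2, A[~same].sum() // 2)  # e.g. ~735 vs. ~125
```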

We can show that \(P(Z \mid A) \propto P(A \mid Z) \propto \left( \frac{(1-p)q}{(1-q)p} \right)^{E_{\text{cut}}(z)}\), where \(E_{\text{cut}}(z) = \sum_{i<j} A_{ij} \, \mathbb{1}(z_i \ne z_j)\). Since \(p > q\), the base is less than 1, so we minimize the cut to maximize the posterior; this specific case can be solved with spectral clustering. The derivation is sketched below.
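
The proportionality can be derived in a few lines; a sketch (with \(E_{\text{in}}\) the number of within-community edges, \(m\) the total number of edges, and \(N_{\text{in}}, N_{\text{cut}}\) the fixed pair counts for balanced \(z\)):

```latex
\begin{align*}
P(A \mid Z)
  &= \prod_{i<j,\, z_i = z_j} p^{A_{ij}} (1-p)^{1-A_{ij}}
     \prod_{i<j,\, z_i \ne z_j} q^{A_{ij}} (1-q)^{1-A_{ij}} \\
  &= p^{E_{\text{in}}} (1-p)^{N_{\text{in}} - E_{\text{in}}}
     \, q^{E_{\text{cut}}} (1-q)^{N_{\text{cut}} - E_{\text{cut}}} \\
  &\propto \Big( \tfrac{p}{1-p} \Big)^{E_{\text{in}}}
           \Big( \tfrac{q}{1-q} \Big)^{E_{\text{cut}}}
     \quad \text{(dropping factors independent of } z\text{)} \\
  &\propto \Big( \tfrac{(1-p)\,q}{(1-q)\,p} \Big)^{E_{\text{cut}}(z)}
     \quad \text{(substituting } E_{\text{in}} = m - E_{\text{cut}}\text{)}.
\end{align*}
```

Since \(p > q\), the base is less than 1, so maximizing the likelihood (and, with a uniform prior, the posterior) means minimizing \(E_{\text{cut}}(z)\).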

Example: SBM (Stochastic Block Model):

  • Detailed explanation skipped.
  • The exact posterior is intractable; variational inference is used to approximate it and infer the model parameters.