Ranking
PageRank
Core Idea: A page is important if many important pages link to it, based on a recursive formulation.
Voting Principle: - Each page votes on the importance of pages it links to. - A link’s vote is proportional to the importance of its source page. - If page \(j\) with importance \(r_j\) has \(n\) out-links, then each link gets \(\frac{r_j}{n}\) votes. - A page’s own importance is the sum of the votes on its in-links: \[ r_j = \sum_{i} \frac{r_i}{d_i} \] where \(i\) links to \(j\) and \(d_i\) is the out-degree of \(i\).
This is a system of linear equations with an infinite number of solutions. We normalize the sum of \(r_j\) to 1. For large systems, iterative methods are used.
Stochastic Adjacency Matrix: - \(M_{ji}\): If \(i\) links to \(j\) then \(\frac{1}{d_i}\), else 0. - This is a column stochastic matrix, which does not have negative eigenvalues.
Rank Vector: - The rank vector \(r\) sums to one and is \(\geq 0\): \[ r = Mr \] \[ r = (I - M)^{-1} \cdot e \] - This is an eigenvector problem with eigenvalue 1: \[ ||Mr|| = 1 \] - We find the largest eigenvector of \(M\) using power iterations, which converge quickly.
Interpretations: - System of linear equations. - Eigenvector problem. - Fixed point problem. - Random walk.
Random Walk Interpretation
Forms a Markov chain: - At time \(t\), the walker is on a random page \(i\). - At time \(t+1\), the surfer follows an out-link \(j\) with probability \(\frac{1}{d_i}\).
Transition Matrix: \(B = M^T\) - The stationary distribution of \(B\) is the PageRank vector: \[ r = rB \] - The initial distribution does not matter: \[ P(X_t = i) = \pi_i(t) \]
Limiting Distribution: - \(\pi\) is the limiting distribution if we do infinite steps: - Irreducible: We can get from every state to every other state. - Aperiodic: No cycles (state \(i\) is aperiodic if the GCD of all cycle lengths is 1; if one node is aperiodic and irreducible, all nodes are aperiodic).
PageRank score indicates the probability of being on a page after infinite steps—visiting more often means greater importance.
Problems: - Dead ends: No out-links, causing importance to leak. - Spider traps: A set of pages that link to each other but not to the outside. - Periodic states: Cycles.
Solutions: - Random teleportation: At each step, the surfer has two options: - Follow a link with probability \(\beta\). - Teleport to a random page with probability \(1 - \beta\): \[ r_j = \beta M_{ji} + \frac{1 - \beta}{N} \] - This forms a fully connected matrix: \[ A = \beta M + (1 - \beta) ee^T \frac{1}{N} \] - Instead of full matrix multiplication, we multiply by \(M\) and then add the teleportation part, keeping only the vector: \[ r = \beta Mr + \frac{1 - \beta}{N} e \]
Vertex-Oriented Computation: - Each vertex computes a local operation. - Special systems: GraphLab, Giraph, GraphX. - GAS principle: Gather, Apply, Scatter (for PageRank, only Gather and Apply are needed).
This does not solve the issue of dead ends, where the jumping probability becomes 1.
Other Problems: - Measures generic popularity: Biased against topic-specific authorities. - Topic-sensitive PageRank: Evaluates how important a page is for a specific topic, allowing search queries to be answered based on user interests. This involves: - Biasing the random walk. - Picking a set of seed pages \(S\) (topic-relevant pages) and teleporting only to them. - Defining \(v\) with \(\frac{1}{|S|}\) on the columns of the seed pages: \[ r = \beta Mr + (1 - \beta) v \] \[ r = (1 - \beta) (I - \beta M)^{-1} v \] - Personalized PageRank: \(v\) has a 1 on your page and 0 on the rest.
Susceptible to Link Spam: - Artificial links increase importance. - TrustRank: Use a set of trusted pages and teleport to them to avoid farmed links.
Uses a Single Measure of Importance: - HITS (Hyperlink Induced Topic Search): - Authority: Pages that are linked to by many hubs. - Hub: Pages that link to many authorities.