Measuring the importance of a node or centrality of a node is one of the most important tasks in Network Analysis domain. Centrality measure can help us to find the most influential person on a social network like Twitter, most important pages on the web, hubs in the transportation network, etc. There are various algorithms that are used for measuring the centrality of a node, but in this article, we will be looking at the PageRank algorithm.
Table of Contents
- PageRank Algorithm.
- Iterating over a small network and finding the most important node in the network.
- Interpretation of PageRank, Problems associated with PageRank and Solution to it.
- Use Case ( using python library networkx ).
Note: This article expects all the readers to have some familiarity with python programming language and graph theory. If you are not familiar with graph theory , read this blog Introduction to Graph Theory which also introduces the networkx library, that we will be using towards the end of the article.
Page Rank Algorithm
Starting with a brief history, the PageRank algorithm was developed by Google founders when they were thinking about how to measure the importance of web pages using the hyperlink network structure of the web. The basic idea behind the PageRank algorithm is that a score of importance is assigned to every node in a network. One assumption that it makes, is that important nodes are those who have many in-links from important nodes (or pages). PageRank can be used for any type of network but it is mainly useful for the directed network. Now that we have a basic idea of the working of the PageRank Algorithm, in the following section, we will look at the step by step execution of the algorithm to find the final score of importance.
Note: Sum of the PageRank of all the nodes is always going to be a constant, it’s always going to be 1.
From above image it can be seen that ‘ n ‘ is the number of nodes in the network and ‘ k ‘ is the number of steps. Below are the steps taken to implement PageRank Algorithm :
- Assign all nodes a PageRank of 1/n.
- Perform the Basic PageRank Update Rule (mentioned in the image above ) k times.
Note: The new PageRank of each node is the sum of all the PageRank it received from other nodes.
Iterating over Small Network and Finding most important node
Consider Following Graph with n=5 . We will calculate PageRank for 2 ( k=2 ) steps.
Created with : draw.io
- Initially all the nodes will be given PageRank ⅕ as mentioned above.
- Performing PageRank Update Rule for the first time.
Calculating New value for A: A will receive PageRank from D and E ( nodes connected to it ). D will contribute ⅓ times its current PageRank to A since D is connected to 3 nodes and it will give an equal contribution to the nodes it is connected to. E will contribute its actual PageRank to A since it has only one outgoing link.
New value of A = ⅓ * ⅕ (contribution from D) + ⅕ (contribution from E)
Calculating New value for B: B will receive PageRank from C and A ( nodes connected to it ). C will contribute its entire PageRank to B since it has only one outgoing link. A will also contribute its entire PageRank. New value of B = ⅕ (contribution from C) + ⅕ (contribution from A)
Calculating New value for C: C will receive PageRank from B and D. New value of C = ½ * ⅕ (contribution from B) + ⅓ * ⅕ (contribution from D)
Calculating New value for D: D will receive PageRank from B New value of D = ½ * ⅕ (contribution from B)
Calculating New value for E: E will receive PageRank from D. New value of E = ⅓ * ⅕ (contribution from D)
- Now taking New values as Old , we will again follow PageRank Update Rule for second time.
Calculating New value for A: A will receive PageRank from E and D. New value of C = 1/15 (contribution from E) + 1/3 * 1/10 (contribution from D)
Calculating New value for B: B will receive PageRank from C and A New value of B = 1/6 (contribution from C) + 4/15 (contribution from A)
Calculating New value for C: C will receive PageRank from B and D. New value of E = 1/2 * 2/5 (contribution from B) + 1/3 * 1/10 (contribution from D)
Calculating New value for D: D will receive PageRank from B. New value of C = 1/2 * 2/5 (contribution from B)
Calculating New value for E: E will receive PageRank from D. New value of D = 1/3 * 1/10 (contribution from D)
Now a question arises: What if we continue with k=3,4,5,6……?
It is seen that as we continue for many steps, the PageRank starts changing by a minimal amount and they converge to a unique value ( for most networks ).
Final PageRank is maximum for B indicating that it is the most important node in the network.
Interpretation of PageRank, Problems associated and solution
- The basic PageRank of a node can be interpreted as the probability that a random walk lands on the node after k random steps.
- Basic PageRank has the problem that,in some networks, a few nodes can “ suck “ up all the PageRank from the network.
- To fix this problem, Scaled PageRank introduces a parameter ⍺ (alpha) known as damping parameter , such that the random walker chooses a random node to jump to with probability 1 - ⍺ .
Use Case (Using python library networkx)
We will use a dataset containing a network of political blogs. Social Network Analysis (SNA) is used as a research vehicle to investigate the structural patterns of blogging communities. We will try to explore one small part of the SNA i.e Centrality measurement of a node by using PageRank Algorithm to find the most important blog of the community of political blogs.
Shown below is a very small portion of the entire network of political blogs.
Code with detailed explanation
We will be using networkx library, if you are not at all familiar with networkx, you can read about it using the link provided. The provided link will have installation steps for the library as well , follow it to get it installed.
#Importing networkx library import networkx as nx import numpy as np import matplotlib.pyplot as plt
Now we will read the file ( blogs.gml ) which can be accessed using the dataset link provided above. We will be using networkx read_gml() function and we will name this graph as blog_net.
blog_net=nx.read_gml( ' blogs.gml ' ) print ( list( blog_net.edges ) [:5] ) #print some edges from the network
[('tsrightdominion.blogspot.com', 'democraticunderground.com'), ('tsrightdominion.blogspot.com', 'conservativepunk.com'), ('tsrightdominion.blogspot.com', 'anncoulter.org'), ('tsrightdominion.blogspot.com', 'aldaynet.org'), ('tsrightdominion.blogspot.com', 'gevkaffeegal.typepad.com/the_alliance')]
We will now find the PageRank for one particular blog in the network . Let that blog be realclearpolitics.com .
def findPageRank(blog): pr = nx.pagerank(blog_net, alpha=0.85) return pr[blog] findPageRank('realclearpolitics.com')
We will now Apply the Scaled PageRank Algorithm to this network with damping value 0.85.We will find the 5 nodes with highest PageRank.
The function returns a list of the top 5 blogs in descending (top to bottom) order of PageRank.
def topFiveBlogs(): pr = nx.pagerank(blog_net, alpha=0.85) return sorted(pr, key=lambda key:pr[key], reverse=True)[:5] topFiveBlogs()
['dailykos.com', 'atrios.blogspot.com', 'instapundit.com', 'blogsforbush.com', 'talkingpointsmemo.com']
This article introduced readers to the basics of PageRank algorithm and also showed the importance of the algorithm in real life use cases apart from the application it was originally designed for. This knowledge of centrality of nodes and other important concepts in network analysis opens the door for the analysis of large networks like the spread of disease on the global transportation network.
Please leave a comment if you would like to share some other applications related to PageRank Algorithms.