Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save neilk/4dcee2a9039ea6d69456c32ab27e7ed8 to your computer and use it in GitHub Desktop.
Save neilk/4dcee2a9039ea6d69456c32ab27e7ed8 to your computer and use it in GitHub Desktop.
IPython notebook to illustrate an algorithm to compute PageRanks.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:5d172e02eeffb5de9592061ec75a01f2a8d71652e030e5c6a4e934474635c2f2"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This page demonstrates the use of a short Python implementation of the PageRank algorithm on the link structure contained in the graph on the [PageRank Wikipedia](http://en.wikipedia.org/wiki/PageRank) page:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.display import Image\n",
"Image(url='http://upload.wikimedia.org/wikipedia/commons/f/fb/PageRanks-Example.svg')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<img src=\"http://upload.wikimedia.org/wikipedia/commons/f/fb/PageRanks-Example.svg\"/>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"<IPython.core.display.Image at 0x7fba5dfb0208>"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will encode the links present on this graph as a count matrix `M_counts`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"n_pages = 11 # numbering pages A through K as 0 to 10\n",
"M_counts = np.zeros((n_pages, n_pages)) # will hold the number of link counts (assumed 0 or 1)\n",
"# columns = starting page, row = destination page, ie M_ij = whether or not there is a link from j to i\n",
"\n",
"M_counts[:,0] = 1 # page 0 (A in the graphic) is a sink because it has no outgoing links at all; \n",
"# however, M cannot contain an all-zero column, so do as if A was linking to all other pages (ie put 1's everywhere)\n",
"M_counts[2,1] = 1 # B->C\n",
"M_counts[1,2] = 1 # C->B\n",
"M_counts[0,3] = 1 # D->A\n",
"M_counts[1,3] = 1 # D->B\n",
"M_counts[1,4] = 1 # E->B\n",
"M_counts[3,4] = 1 # E->D\n",
"M_counts[5,4] = 1 # E->F\n",
"M_counts[1,5] = 1 # F->B\n",
"M_counts[4,5] = 1 # F->E\n",
"M_counts[1,6] = 1 # G,H,I->B,E\n",
"M_counts[4,6] = 1\n",
"M_counts[1,7] = 1\n",
"M_counts[4,7] = 1\n",
"M_counts[1,8] = 1\n",
"M_counts[4,8] = 1\n",
"M_counts[4,9] = 1 # J,K->E\n",
"M_counts[4,10] = 1\n",
"print(M_counts)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0.]\n",
" [ 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]\n",
" [ 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can make an adjacency matrix `M` out of `M_counts`, by dividing each column by its sum, ie we are making sure columns sum to 1 :"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"M = np.empty((n_pages, n_pages))\n",
"for j in range(n_pages):\n",
" M[:,j] = M_counts[:,j] / M_counts[:,j].sum()\n",
"np.set_printoptions(precision=3)\n",
"print(M)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 0.091 0. 0. 0.5 0. 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 1. 0.5 0.333 0.5 0.5 0.5 0.5 0. 0. ]\n",
" [ 0.091 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0.333 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0. 0.5 0.5 0.5 0.5 1. 1. ]\n",
" [ 0.091 0. 0. 0. 0.333 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n",
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check that all the conditions on M are fulfilled."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy\n",
"def check_M(M):\n",
" \"\"\"\n",
" check that M has the right format to be used by pagerank function\n",
" \"\"\"\n",
" n_pages = M.shape[0] # n_pages is the number of rows of M\n",
" np.testing.assert_equal(M.shape[0], M.shape[1], err_msg = 'M should be square')\n",
" np.testing.assert_array_almost_equal(M.sum(axis=0), np.ones((n_pages)), \n",
" err_msg = 'assert each column sums to one (M is assumed column-stochastic)')\n",
" for j in range(n_pages):\n",
" M_column = M[:,j]\n",
" n_nonzero = np.count_nonzero(M[:,j])\n",
" np.testing.assert_array_almost_equal(M_column[M_column.nonzero()], np.ones((n_nonzero)) / n_nonzero,\n",
" err_msg = 'in column %g, all non-zero entries should be equal (and equal to 1 divided by their number)' % j)\n",
"\n",
"check_M(M) # will produce error if M does not have the right format"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we are now ready to apply the `pagerank` function, which will iteratively apply page transitions to an randomly initialized distribution over the pages, until convergence."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"def pagerank(M, d=0.85, square_error=1e-6):\n",
" \"\"\"\n",
" M : the adjacency matrix of the pages. It is assumed to be column-stochastic (ie column sum to 1); all links have equal weight.\n",
" A page with no outgoing links (sink) is represented as a page with outgoing links to each other page (ie restart page).\n",
" d: damping factor\n",
" square_error : the algorithm iterates until the difference between two successive PageRank vectors v is less than this (in squared norm)\n",
" returns the PageRanks of all pages\n",
" \"\"\"\n",
" n_pages = M.shape[0] # n_pages is the number of rows of M\n",
" v = np.random.rand(n_pages) # initialize to random vector\n",
" v = v / v.sum() # make v sum to 1\n",
" last_v = np.ones((n_pages)) # will contain the previous v\n",
" M_hat = d * M + (1-d)/n_pages * np.ones((n_pages, n_pages)) # equation (***) in Wikipedia page\n",
" while np.square(v - last_v).sum() > square_error:\n",
" last_v = v\n",
" v = M_hat.dot(v) # at each iteration, progress one timestep\n",
" return v\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pagerank(M)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"array([ 0.033, 0.385, 0.343, 0.039, 0.081, 0.039, 0.016, 0.016,\n",
" 0.016, 0.016, 0.016])"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are the numbers (within the allowed error) displayed on the graph (the numbers on the graph are rounded exact values)."
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment