Skip to content

Instantly share code, notes, and snippets.

@cloudaice
Created October 16, 2012 01:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save cloudaice/3896861 to your computer and use it in GitHub Desktop.
Save cloudaice/3896861 to your computer and use it in GitHub Desktop.
about pagerank
最近我开始学习 Hadoop,本来以为课程应该会更多的侧重如何管理 Hadoop 集群,没想到开始阶段,老师为了让我们更好的理解 Hadoop 的 MapReduce 机制,让我们自己先来实现一个谷歌的 PageRank 算法,本来我想打算使用 Java 来实现的,因为毕竟过段时间,我需要在 Hadoop 集群上部署 Java 代码从而实现数据分析,但我从毕业后就再没用过 Java 写过一行代码,所以我真是写不出来啊,尤其是 PageRank 基本就是矩阵和向量的迭代运算,用 Java 的话一定用到二维数组,我上学的时候学的就不太好。我考虑再三还是决定用 Python 来实现,毕竟上半年的时候自学了一些 Python 语言,而且我知道 Python 有一个第三方模块叫 python-graph,用它来做图论方面的编程容易很多。我是在 Linode VPS 上搭建的 Python 编程环境。相关的模块安装过程如下:
[root@chenjunlu ~]# yum install graphviz*
[root@chenjunlu ~]# yum install vsftpd
[root@chenjunlu ~]# wget http://python-graph.googlecode.com/files/python-graph-core-1.8.2.tar.gz
[root@chenjunlu ~]# tar -zxvf python-graph-core-1.8.2.tar.gz
[root@chenjunlu ~]# cd python-graph-core-1.8.2
[root@chenjunlu ~]# python setup.py install
[root@chenjunlu ~]# wget http://python-graph.googlecode.com/files/python-graph-dot-1.8.2.tar.gz
[root@chenjunlu ~]# tar -zxvf python-graph-dot-1.8.2.tar.gz
[root@chenjunlu ~]# cd python-graph-dot-1.8.2
[root@chenjunlu ~]# python setup.py install
PageRank 是一种由搜索引擎根据网页之间相互的超链接计算的技术,而作为网页排名的要素之一,此技术通常和搜索引擎优化有关,Google 用它来体现网页的相关性和重要性。Google 的创始人拉里·佩奇和谢尔盖·布林于1998年在斯坦福大学发明了这项技术。基本上,PageRank 类似一个社交网络,如果你被某些重要的人提到,那么你的重要性就增加,同样被你提到的人,其重要性也将同样上涨。一个页面的“得票数”由所有链向它的页面的重要性来决定,到一个页面的超链接相当于对该页投一票。一个页面的 PageRank 是由所有链向它的页面(“链入页面”)的重要性经过递归算法得到的。一个有较多链入的页面会有较高的等级,相反如果一个页面没有任何链入页面,那么它没有等级。
完整的 PageRank 算法公式如下:
这个方程式引入了随机浏览的概念,即有人上网无聊随机打开一些页面,点一些链接。一个页面的 PageRank 值也影响了它被随机浏览的概率。为了便于理解,这里假设上网者不断点网页上的链接,最终到了一个没有任何链出页面的网页,这时候上网者会随机到另外的网页开始浏览。为了对那些有链出的页面公平,q = 0.85(这里的 q 被称为阻尼系数(damping factor),其意义是,在任意时刻,用户到达某页面后并继续向后浏览的概率。1 – q = 0.85 就是用户停止点击,随机跳到新 URL 的概率)的算法被用到了所有页面上,估算页面可能被上网者放入书签的概率。如果你想详细的了解 PageRank 算法的原理,请参考这里。
Hadoop 课程的作业是计算一个4网页模型各个节点的 PageRank 值。该模型如下:
Python 实现具体的计算过程如下:
# Import graphviz
import gv
# Import pygraph
from pygraph.classes.digraph import digraph
from pygraph.readwrite.dot import write
# Define pagerank function
def pagerank(graph, damping_factor=0.85, max_iterations=100, \
min_delta=0.00001):
"""
Compute and return the PageRank in an directed graph.
@type graph: digraph
@param graph: Digraph.
@type damping_factor: number
@param damping_factor: PageRank dumping factor.
@type max_iterations: number
@param max_iterations: Maximum number of iterations.
@type min_delta: number
@param min_delta: Smallest variation required for a new iteration.
@rtype: Dict
@return: Dict containing all the nodes PageRank.
"""
nodes = graph.nodes()
graph_size = len(nodes)
if graph_size == 0:
return {}
# value for nodes without inbound links
min_value = (1.0-damping_factor)/graph_size
# itialize the page rank dict with 1/N for all nodes
#pagerank = dict.fromkeys(nodes, 1.0/graph_size)
pagerank = dict.fromkeys(nodes, 1.0)
for i in range(max_iterations):
diff = 0 #total difference compared to last iteraction
# computes each node PageRank based on inbound links
for node in nodes:
rank = min_value
for referring_page in graph.incidents(node):
rank += damping_factor * pagerank[referring_page] / \
len(graph.neighbors(referring_page))
diff += abs(pagerank[node] - rank)
pagerank[node] = rank
print 'This is NO.%s iteration' % (i+1)
print pagerank
print ''
#stop if PageRank has converged
if diff < min_delta:
break
return pagerank
# Graph creation
gr = digraph()
# Add nodes and edges
gr.add_nodes(["1","2","3","4"])
gr.add_edge(("1","2"))
gr.add_edge(("1","3"))
gr.add_edge(("1","4"))
gr.add_edge(("2","3"))
gr.add_edge(("2","4"))
gr.add_edge(("3","4"))
gr.add_edge(("4","2"))
# Draw as PNG
dot = write(gr)
gvv = gv.readstring(dot)
gv.layout(gvv,'dot')
gv.render(gvv,'png','Model.png')
pagerank(gr)
运算的结果如下:
[root@chenjunlu ~]# python pagerank.py
This is NO.1 iteration result:
{'1': 0.037500000000000006, '3': 0.47312500000000002, '2': 0.89812499999999995, '4': 0.831984375}
This is NO.2 iteration result:
{'1': 0.037500000000000006, '3': 0.42982812500000001, '2': 0.75531171874999992, '4': 0.73448638671874988}
This is NO.3 iteration result:
{'1': 0.037500000000000006, '3': 0.36913248046874997, '2': 0.6724384287109374, '4': 0.64767394060058581}
This is NO.4 iteration result:
{'1': 0.037500000000000006, '3': 0.33391133220214841, '2': 0.59864784951049788, '4': 0.58637496841378778}
This is NO.5 iteration result:
{'1': 0.037500000000000006, '3': 0.30255033604196163, '2': 0.54654372315171962, '4': 0.53757386797514828}
This is NO.6 iteration result:
{'1': 0.037500000000000006, '3': 0.28040608233948083, '2': 0.50506278777887603, '4': 0.50112185479458105}
This is NO.7 iteration result:
{'1': 0.037500000000000006, '3': 0.26277668480602234, '2': 0.47407857657539393, '4': 0.47296857712966139}
This is NO.8 iteration result:
{'1': 0.037500000000000006, '3': 0.24960839504454241, '2': 0.45014829056021222, '4': 0.45160515927595124}
This is NO.9 iteration result:
{'1': 0.037500000000000006, '3': 0.23943802348809018, '2': 0.43198938538455856, '4': 0.435242808753314}
This is NO.10 iteration result:
{'1': 0.037500000000000006, '3': 0.23172048878843737, '2': 0.41808138744031692, '4': 0.42277200513230645}
This is NO.11 iteration result:
{'1': 0.037500000000000006, '3': 0.22580958966213469, '2': 0.40748120436246049, '4': 0.41324266306686019}
This is NO.12 iteration result:
{'1': 0.037500000000000006, '3': 0.2213045118540457, '2': 0.39938126360683118, '4': 0.40597087210884208}
This is NO.13 iteration result:
{'1': 0.037500000000000006, '3': 0.21786203703290324, '2': 0.3932002412925158, '4': 0.40041783402728698}
This is NO.14 iteration result:
{'1': 0.037500000000000006, '3': 0.21523510254931921, '2': 0.38848015892319393, '4': 0.39617890470927875}
This is NO.15 iteration result:
{'1': 0.037500000000000006, '3': 0.2132290675423574, '2': 0.38487706900288693, '4': 0.39294246173723074}
This is NO.16 iteration result:
{'1': 0.037500000000000006, '3': 0.21169775432622695, '2': 0.38212609247664614, '4': 0.39047168047986747}
This is NO.17 iteration result:
{'1': 0.037500000000000006, '3': 0.21052858930257459, '2': 0.38002592840788735, '4': 0.38858532048054051}
This is NO.18 iteration result:
{'1': 0.037500000000000006, '3': 0.20963601957335212, '2': 0.37842252240845947, '4': 0.38714518866094461}
This is NO.19 iteration result:
{'1': 0.037500000000000006, '3': 0.20895457202359527, '2': 0.37719841036180296, '4': 0.3860457106238222}
This is NO.20 iteration result:
{'1': 0.037500000000000006, '3': 0.20843432440376625, '2': 0.37626385403024887, '4': 0.38520631370605707}
This is NO.21 iteration result:
{'1': 0.037500000000000006, '3': 0.20803713796285578, '2': 0.37555036665014851, '4': 0.38456547309474054}
This is NO.22 iteration result:
{'1': 0.037500000000000006, '3': 0.20773390582631313, '2': 0.3750056521305295, '4': 0.3840762221078412}
This is NO.23 iteration result:
{'1': 0.037500000000000006, '3': 0.20750240215547502, '2': 0.37458978879166505, '4': 0.38370270206861146}
This is NO.24 iteration result:
{'1': 0.037500000000000006, '3': 0.20732566023645765, '2': 0.37427229675831974, '4': 0.38341753732327488}
This is NO.25 iteration result:
{'1': 0.037500000000000006, '3': 0.2071907261222859, '2': 0.37402990672478365, '4': 0.38319982756197607}
This is NO.26 iteration result:
{'1': 0.037500000000000006, '3': 0.20708771035803306, '2': 0.3738448534276797, '4': 0.38303361651109197}
This is NO.27 iteration result:
{'1': 0.037500000000000006, '3': 0.20700906270676386, '2': 0.37370357403442817, '4': 0.38290672226538125}
This is NO.28 iteration result:
{'1': 0.037500000000000006, '3': 0.20694901896463197, '2': 0.37359571392557406, '4': 0.38280984453830613}
This is NO.29 iteration result:
{'1': 0.037500000000000006, '3': 0.20690317841836897, '2': 0.37351336785756023, '4': 0.38273588299507677}
This is NO.30 iteration result:
{'1': 0.037500000000000006, '3': 0.20686818133946311, '2': 0.3734505005458153, '4': 0.3826794168705151}
This is NO.31 iteration result:
{'1': 0.037500000000000006, '3': 0.20684146273197149, '2': 0.37340250433993788, '4': 0.38263630766664936}
This is NO.32 iteration result:
{'1': 0.037500000000000006, '3': 0.20682106434447359, '2': 0.37336586151665196, '4': 0.38260339583737962}
This is NO.33 iteration result:
{'1': 0.037500000000000006, '3': 0.20680549114457708, '2': 0.37333788646177268, '4': 0.3825782692191439}
This is NO.34 iteration result:
{'1': 0.037500000000000006, '3': 0.20679360174625339, '2': 0.37331652883627231, '4': 0.38255908623973112}
This is NO.35 iteration result:
{'1': 0.037500000000000006, '3': 0.20678452475541573, '2': 0.37330022330377149, '4': 0.38254444094620621}
This is NO.36 iteration result:
{'1': 0.037500000000000006, '3': 0.20677759490410288, '2': 0.37328777480427527, '4': 0.38253325996030441}
This is NO.37 iteration result:
{'1': 0.037500000000000006, '3': 0.20677230429181698, '2': 0.37327827096625876, '4': 0.38252472380870439}
This is NO.38 iteration result:
{'1': 0.037500000000000006, '3': 0.20676826516065996, '2': 0.37327101523739875, '4': 0.38251820686245541}
This is NO.39 iteration result:
{'1': 0.037500000000000006, '3': 0.20676518147589446, '2': 0.37326547583308711, '4': 0.3825132314835723}
This is NO.40 iteration result:
{'1': 0.037500000000000006, '3': 0.20676282722906203, '2': 0.3732612467610365, '4': 0.38250943301814322}
This is NO.41 iteration result:
{'1': 0.037500000000000006, '3': 0.2067610298734405, '2': 0.37325801806542175, '4': 0.38250653307022864}
[root@chenjunlu ~]#
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment