Created
October 16, 2012 01:56
-
-
Save cloudaice/3896861 to your computer and use it in GitHub Desktop.
about pagerank
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
最近我开始学习 Hadoop,本来以为课程应该会更多的侧重如何管理 Hadoop 集群,没想到开始阶段,老师为了让我们更好的理解 Hadoop 的 MapReduce 机制,让我们自己先来实现一个谷歌的 PageRank 算法,本来我想打算使用 Java 来实现的,因为毕竟过段时间,我需要在 Hadoop 集群上部署 Java 代码从而实现数据分析,但我从毕业后就再没用过 Java 写过一行代码,所以我真是写不出来啊,尤其是 PageRank 基本就是矩阵和向量的迭代运算,用 Java 的话一定用到二维数组,我上学的时候学的就不太好。我考虑再三还是决定用 Python 来实现,毕竟上半年的时候自学了一些 Python 语言,而且我知道 Python 有一个第三方模块叫 python-graph,用它来做图论方面的编程容易很多。我是在 Linode VPS 上搭建的 Python 编程环境。相关的模块安装过程如下: | |
[root@chenjunlu ~]# yum install graphviz* | |
[root@chenjunlu ~]# yum install vsftpd | |
[root@chenjunlu ~]# wget http://python-graph.googlecode.com/files/python-graph-core-1.8.2.tar.gz | |
[root@chenjunlu ~]# tar -zxvf python-graph-core-1.8.2.tar.gz | |
[root@chenjunlu ~]# cd python-graph-core-1.8.2 | |
[root@chenjunlu ~]# python setup.py install | |
[root@chenjunlu ~]# wget http://python-graph.googlecode.com/files/python-graph-dot-1.8.2.tar.gz | |
[root@chenjunlu ~]# tar -zxvf python-graph-dot-1.8.2.tar.gz | |
[root@chenjunlu ~]# cd python-graph-dot-1.8.2 | |
[root@chenjunlu ~]# python setup.py install | |
PageRank 是一种由搜索引擎根据网页之间相互的超链接计算的技术,而作为网页排名的要素之一,此技术通常和搜索引擎优化有关,Google 用它来体现网页的相关性和重要性。Google 的创始人拉里·佩奇和谢尔盖·布林于1998年在斯坦福大学发明了这项技术。基本上,PageRank 类似一个社交网络,如果你被某些重要的人提到,那么你的重要性就增加,同样被你提到的人,其重要性也将同样上涨。一个页面的“得票数”由所有链向它的页面的重要性来决定,到一个页面的超链接相当于对该页投一票。一个页面的 PageRank 是由所有链向它的页面(“链入页面”)的重要性经过递归算法得到的。一个有较多链入的页面会有较高的等级,相反如果一个页面没有任何链入页面,那么它没有等级。 | |
完整的 PageRank 算法公式如下: | |
这个方程式引入了随机浏览的概念,即有人上网无聊随机打开一些页面,点一些链接。一个页面的 PageRank 值也影响了它被随机浏览的概率。为了便于理解,这里假设上网者不断点网页上的链接,最终到了一个没有任何链出页面的网页,这时候上网者会随机到另外的网页开始浏览。为了对那些有链出的页面公平,q = 0.85(这里的 q 被称为阻尼系数(damping factor),其意义是,在任意时刻,用户到达某页面后并继续向后浏览的概率。1 – q = 0.85 就是用户停止点击,随机跳到新 URL 的概率)的算法被用到了所有页面上,估算页面可能被上网者放入书签的概率。如果你想详细的了解 PageRank 算法的原理,请参考这里。 | |
Hadoop 课程的作业是计算一个4网页模型各个节点的 PageRank 值。该模型如下: | |
Python 实现具体的计算过程如下: | |
# Import graphviz | |
import gv | |
# Import pygraph | |
from pygraph.classes.digraph import digraph | |
from pygraph.readwrite.dot import write | |
# Define pagerank function | |
def pagerank(graph, damping_factor=0.85, max_iterations=100, \ | |
min_delta=0.00001): | |
""" | |
Compute and return the PageRank in an directed graph. | |
@type graph: digraph | |
@param graph: Digraph. | |
@type damping_factor: number | |
@param damping_factor: PageRank dumping factor. | |
@type max_iterations: number | |
@param max_iterations: Maximum number of iterations. | |
@type min_delta: number | |
@param min_delta: Smallest variation required for a new iteration. | |
@rtype: Dict | |
@return: Dict containing all the nodes PageRank. | |
""" | |
nodes = graph.nodes() | |
graph_size = len(nodes) | |
if graph_size == 0: | |
return {} | |
# value for nodes without inbound links | |
min_value = (1.0-damping_factor)/graph_size | |
# itialize the page rank dict with 1/N for all nodes | |
#pagerank = dict.fromkeys(nodes, 1.0/graph_size) | |
pagerank = dict.fromkeys(nodes, 1.0) | |
for i in range(max_iterations): | |
diff = 0 #total difference compared to last iteraction | |
# computes each node PageRank based on inbound links | |
for node in nodes: | |
rank = min_value | |
for referring_page in graph.incidents(node): | |
rank += damping_factor * pagerank[referring_page] / \ | |
len(graph.neighbors(referring_page)) | |
diff += abs(pagerank[node] - rank) | |
pagerank[node] = rank | |
print 'This is NO.%s iteration' % (i+1) | |
print pagerank | |
print '' | |
#stop if PageRank has converged | |
if diff < min_delta: | |
break | |
return pagerank | |
# Graph creation | |
gr = digraph() | |
# Add nodes and edges | |
gr.add_nodes(["1","2","3","4"]) | |
gr.add_edge(("1","2")) | |
gr.add_edge(("1","3")) | |
gr.add_edge(("1","4")) | |
gr.add_edge(("2","3")) | |
gr.add_edge(("2","4")) | |
gr.add_edge(("3","4")) | |
gr.add_edge(("4","2")) | |
# Draw as PNG | |
dot = write(gr) | |
gvv = gv.readstring(dot) | |
gv.layout(gvv,'dot') | |
gv.render(gvv,'png','Model.png') | |
pagerank(gr) | |
运算的结果如下: | |
[root@chenjunlu ~]# python pagerank.py | |
This is NO.1 iteration result: | |
{'1': 0.037500000000000006, '3': 0.47312500000000002, '2': 0.89812499999999995, '4': 0.831984375} | |
This is NO.2 iteration result: | |
{'1': 0.037500000000000006, '3': 0.42982812500000001, '2': 0.75531171874999992, '4': 0.73448638671874988} | |
This is NO.3 iteration result: | |
{'1': 0.037500000000000006, '3': 0.36913248046874997, '2': 0.6724384287109374, '4': 0.64767394060058581} | |
This is NO.4 iteration result: | |
{'1': 0.037500000000000006, '3': 0.33391133220214841, '2': 0.59864784951049788, '4': 0.58637496841378778} | |
This is NO.5 iteration result: | |
{'1': 0.037500000000000006, '3': 0.30255033604196163, '2': 0.54654372315171962, '4': 0.53757386797514828} | |
This is NO.6 iteration result: | |
{'1': 0.037500000000000006, '3': 0.28040608233948083, '2': 0.50506278777887603, '4': 0.50112185479458105} | |
This is NO.7 iteration result: | |
{'1': 0.037500000000000006, '3': 0.26277668480602234, '2': 0.47407857657539393, '4': 0.47296857712966139} | |
This is NO.8 iteration result: | |
{'1': 0.037500000000000006, '3': 0.24960839504454241, '2': 0.45014829056021222, '4': 0.45160515927595124} | |
This is NO.9 iteration result: | |
{'1': 0.037500000000000006, '3': 0.23943802348809018, '2': 0.43198938538455856, '4': 0.435242808753314} | |
This is NO.10 iteration result: | |
{'1': 0.037500000000000006, '3': 0.23172048878843737, '2': 0.41808138744031692, '4': 0.42277200513230645} | |
This is NO.11 iteration result: | |
{'1': 0.037500000000000006, '3': 0.22580958966213469, '2': 0.40748120436246049, '4': 0.41324266306686019} | |
This is NO.12 iteration result: | |
{'1': 0.037500000000000006, '3': 0.2213045118540457, '2': 0.39938126360683118, '4': 0.40597087210884208} | |
This is NO.13 iteration result: | |
{'1': 0.037500000000000006, '3': 0.21786203703290324, '2': 0.3932002412925158, '4': 0.40041783402728698} | |
This is NO.14 iteration result: | |
{'1': 0.037500000000000006, '3': 0.21523510254931921, '2': 0.38848015892319393, '4': 0.39617890470927875} | |
This is NO.15 iteration result: | |
{'1': 0.037500000000000006, '3': 0.2132290675423574, '2': 0.38487706900288693, '4': 0.39294246173723074} | |
This is NO.16 iteration result: | |
{'1': 0.037500000000000006, '3': 0.21169775432622695, '2': 0.38212609247664614, '4': 0.39047168047986747} | |
This is NO.17 iteration result: | |
{'1': 0.037500000000000006, '3': 0.21052858930257459, '2': 0.38002592840788735, '4': 0.38858532048054051} | |
This is NO.18 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20963601957335212, '2': 0.37842252240845947, '4': 0.38714518866094461} | |
This is NO.19 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20895457202359527, '2': 0.37719841036180296, '4': 0.3860457106238222} | |
This is NO.20 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20843432440376625, '2': 0.37626385403024887, '4': 0.38520631370605707} | |
This is NO.21 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20803713796285578, '2': 0.37555036665014851, '4': 0.38456547309474054} | |
This is NO.22 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20773390582631313, '2': 0.3750056521305295, '4': 0.3840762221078412} | |
This is NO.23 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20750240215547502, '2': 0.37458978879166505, '4': 0.38370270206861146} | |
This is NO.24 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20732566023645765, '2': 0.37427229675831974, '4': 0.38341753732327488} | |
This is NO.25 iteration result: | |
{'1': 0.037500000000000006, '3': 0.2071907261222859, '2': 0.37402990672478365, '4': 0.38319982756197607} | |
This is NO.26 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20708771035803306, '2': 0.3738448534276797, '4': 0.38303361651109197} | |
This is NO.27 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20700906270676386, '2': 0.37370357403442817, '4': 0.38290672226538125} | |
This is NO.28 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20694901896463197, '2': 0.37359571392557406, '4': 0.38280984453830613} | |
This is NO.29 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20690317841836897, '2': 0.37351336785756023, '4': 0.38273588299507677} | |
This is NO.30 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20686818133946311, '2': 0.3734505005458153, '4': 0.3826794168705151} | |
This is NO.31 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20684146273197149, '2': 0.37340250433993788, '4': 0.38263630766664936} | |
This is NO.32 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20682106434447359, '2': 0.37336586151665196, '4': 0.38260339583737962} | |
This is NO.33 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20680549114457708, '2': 0.37333788646177268, '4': 0.3825782692191439} | |
This is NO.34 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20679360174625339, '2': 0.37331652883627231, '4': 0.38255908623973112} | |
This is NO.35 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20678452475541573, '2': 0.37330022330377149, '4': 0.38254444094620621} | |
This is NO.36 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20677759490410288, '2': 0.37328777480427527, '4': 0.38253325996030441} | |
This is NO.37 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20677230429181698, '2': 0.37327827096625876, '4': 0.38252472380870439} | |
This is NO.38 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20676826516065996, '2': 0.37327101523739875, '4': 0.38251820686245541} | |
This is NO.39 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20676518147589446, '2': 0.37326547583308711, '4': 0.3825132314835723} | |
This is NO.40 iteration result: | |
{'1': 0.037500000000000006, '3': 0.20676282722906203, '2': 0.3732612467610365, '4': 0.38250943301814322} | |
This is NO.41 iteration result: | |
{'1': 0.037500000000000006, '3': 0.2067610298734405, '2': 0.37325801806542175, '4': 0.38250653307022864} | |
[root@chenjunlu ~]# |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment