@dtmrc
Forked from jspacker/gist:5287444
Created September 15, 2021
pagerank blogpost super-rough draft

What's Important In My Data? Measuring Influence Using PageRank And Mortar

Networks are everywhere in a data-driven world: social networks, product purchasing networks, supply chain networks, or even biological networks. If your company sells anything to anyone, you will have data that can be modelled as a network, the mathematical term for which is a "graph". Analyzing these graphs can explain how fundamental social, commercial, and physical systems behave, and consequently, how to make money from them (Google revenue in 2012: $50 billion).

The problem is, there is often so much data that it can be hard to tell what one should even try to analyze. One of the first questions to ask, then, is "which parts of my graph dataset are the most important?" For example, before one can investigate how Twitter users become influential, one has to find who the most influential Twitter users are in the first place.

A well-known algorithm for finding the most important nodes in a graph is called Pagerank. Originally used by Google to rank webpages in search results, the Pagerank algorithm calculates the importance of each node in a graph as a function of the importance of the nodes which link to it; so if Barack Obama follows you on Twitter, it's more significant than if you are followed by a hundred non-influential users.
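To make the Obama example concrete, here's a back-of-the-envelope comparison in Python. The ranks and follow counts are invented purely for illustration, and damping is ignored:

```python
# Simplified: the rank a node receives from a follower is
# (follower's rank / follower's out-degree). One influential
# follower can outweigh many non-influential ones.
obama_contribution = 0.01 / 500        # high-rank user who follows 500 accounts
fan_contributions = 100 * (1e-6 / 50)  # 100 low-rank users, each following 50

print(obama_contribution > fan_contributions)  # True
```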

We've made a project using Apache Pig and the Mortar Development Framework to calculate Pageranks for any graph--including a graph generated from your own data. This post will have three parts:

  • First, we'll show how the project works by running it on a subset of the Twitter social graph.
  • Next, we'll change just a few configuration settings to run the code on a graph of US Patent citations.
  • Finally, we'll show how to configure the project to run Pagerank on your own data.

To follow along with the examples we'll be showing, you'll first have to do the following steps:

  1. Sign up for a free Mortar account

  2. Install the Mortar Development Framework with the command gem install mortar (click here for full instructions)

  3. Clone the mortar-pagerank git repository to your computer and register it as a project with Mortar:

     git clone git@github.com:mortardata/mortar-pagerank.git
     cd mortar-pagerank
     mortar register mortar-pagerank
    

Pagerank #1: Who's the most influential on Twitter?

The first example runs Pagerank on a subset of the Twitter social graph (for brevity, we only consider users who are already in the top 25k most followed). There is an edge from node A to node B if user A follows user B. For simplicity, we'll refer to nodes as users, but there is nothing Twitter-specific about the code other than the input data.

Pagerank is an iterative algorithm, which calculates new, more accurate pageranks at each iteration based on the results of the previous iteration. Iteration and control flow are not supported natively in Apache Pig scripts, so we use an API called "Embedded Pig" to call and configure pigscripts from within the control flow of a Jython "control script" (Jython is simply a Java-based implementation of the Python programming language). In a loop, the controlscript repeatedly calls a pigscript which does each iteration's Pagerank calculation, dynamically passing it parameters so it receives input and sends output to the proper locations.
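As a mental model, the rank update each iteration performs can be sketched in plain Python. This is a simplified, single-machine model; the damping factor of 0.85 and the weighted-edge handling are assumptions, since the exact formula lives in the pigscript:

```python
def pagerank_iteration(edges, ranks, d=0.85):
    """One iteration: each node distributes its rank along its
    outgoing edges in proportion to edge weight."""
    # Total outgoing weight per source node
    out_weight = {}
    for src, _dst, w in edges:
        out_weight[src] = out_weight.get(src, 0.0) + w

    n = len(ranks)
    new_ranks = {node: (1.0 - d) / n for node in ranks}
    for src, dst, w in edges:
        new_ranks[dst] += d * ranks[src] * (w / out_weight[src])
    return new_ranks

# Tiny example: A follows B and C, B follows C, C follows A
edges = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0), ("C", "A", 1.0)]
ranks = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
ranks = pagerank_iteration(edges, ranks)
```

After one iteration, C (followed by both A and B) already outranks B, and the ranks still sum to 1.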

The example uses two controlscripts. The script you run is controlscripts/twitter-pagerank.py. It sets configuration parameters and then calls a shared library, controlscripts/pagerank_lib.py, which runs the Pagerank algorithm. The following links are to GitHub Gists which contain the relevant sections of the code.

Setting Parameters

This gist shows the twitter-pagerank.py controlscript. It simply sets parameters and passes them to code from pagerank_lib.py. This library code goes through three phases: first it preprocesses the graph data; then it iteratively calculates Pageranks in a loop; and finally it postprocesses the iteration output and stores it to S3.

A Preprocessing Step

The first thing we do in the preprocessing phase of pagerank_lib.py is compile the preprocessing pigscript (pigscripts/pagerank_preprocess.pig). This script calculates an initial pagerank value for each user. Then we bind parameters to configure the script--in this case, specifying where input should come from and where output should go. Finally, we run the script using Embedded Pig. This operation returns a results object, which we use to access the output data.
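In plain Python, the preprocessing step amounts to something like this (a sketch, assuming the conventional uniform initialization):

```python
def initial_ranks(edges):
    """Assign every node in the graph a uniform starting pagerank."""
    nodes = set()
    for src, dst, _weight in edges:
        nodes.add(src)
        nodes.add(dst)
    return {node: 1.0 / len(nodes) for node in nodes}

edges = [("alice", "bob", 1.0), ("bob", "carol", 2.0)]
print(initial_ranks(edges))  # each of the 3 nodes starts at 1/3
```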

The Pagerank Loop

The iteration phase of pagerank_lib.py is a loop where we calculate new pageranks based on the old pageranks. We do the same process of compiling a pigscript which does the pagerank calculations, binding parameters to it, and running it.

After many iterations, the pageranks converge and will change very little from one iteration to the next. The iteration pigscript calculates a metric "aggregate_rank_change" which describes this behaviour (see the project README for details on this). When this metric falls beneath a convergence threshold, we break out of the loop.
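Here's a toy model of that loop structure. The real aggregate_rank_change metric is defined in the project README; here it's simply taken to be the summed absolute per-node change:

```python
def aggregate_rank_change(old, new):
    """Sum of absolute per-node rank changes between two iterations."""
    return sum(abs(new[n] - old[n]) for n in old)

def run_until_converged(step, ranks, threshold=1e-3, max_iters=20):
    """Repeat `step` until the ranks stop changing appreciably."""
    for i in range(1, max_iters + 1):
        new_ranks = step(ranks)
        if aggregate_rank_change(ranks, new_ranks) < threshold:
            return new_ranks, i
        ranks = new_ranks
    return ranks, max_iters

# Toy step: each rank moves halfway toward the average, so it
# converges geometrically
def toy_step(ranks):
    avg = sum(ranks.values()) / len(ranks)
    return {n: (r + avg) / 2 for n, r in ranks.items()}

ranks, iters = run_until_converged(toy_step, {"A": 0.9, "B": 0.1})
```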

A Postprocessing Step

The last phase of pagerank_lib.py is postprocessing. It sorts the final pageranks from greatest to least and outputs them to S3.


To run the Twitter example on a 4-node cluster, run the command:

mortar run twitter-pagerank -s 4 -1

It will finish in about an hour and a half. The "-1" means to stop the cluster after your job completes. Here's a selection of the output. Note that the data was collected in 2010, which is why Justin Bieber is conspicuously missing.

Twitter Pagerank Output

Pagerank #2: Which organizations produce the most influential patents?

How much code needs to be changed to get the controlscripts/twitter-pagerank script to use a different dataset? Just 4 lines of configuration settings. Here's a diff between controlscripts/patents-pagerank.py, which runs Pagerank on a graph of US patent citations (2007-2012), and controlscripts/twitter-pagerank.py to prove it.

Of course, this assumes you already have a suitable dataset in the form of a directed graph. Generating a usable graph from the patent citation data required some manipulation of the data. Patents, like academic papers, are required to cite previous patents which have related innovations. Because these citations are time-constrained--you can't cite patents from the future--the citation graph is acyclic and not suitable for Pagerank.

To solve this problem, we used a "graph reduction". Instead of nodes being patents, we made nodes be the organizations which patents are assigned to (such as IBM or Microsoft). An edge from organization A to organization B has a weight equal to the number of patents assigned to A which cite patents assigned to B. This reduction greatly reduces the sparsity of the graph and permits cycles, making it suitable for Pagerank. The script pigscripts/generate_patent_citation_graph.pig implements this graph reduction in Pig.
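Here's what that reduction looks like in plain Python (hypothetical data; the real implementation is the Pig script named above):

```python
from collections import Counter

def reduce_to_orgs(citations, patent_to_org):
    """Collapse a patent-citation graph into a weighted org-level graph.
    Edge weight A->B = number of citations from A's patents to B's."""
    edge_weights = Counter()
    for citing, cited in citations:
        a = patent_to_org.get(citing)
        b = patent_to_org.get(cited)
        if a and b:
            edge_weights[(a, b)] += 1
    return [(a, b, w) for (a, b), w in edge_weights.items()]

patent_to_org = {"p1": "IBM", "p2": "IBM", "p3": "Microsoft"}
citations = [("p1", "p3"), ("p2", "p3"), ("p3", "p1")]
org_edges = reduce_to_orgs(citations, patent_to_org)
print(org_edges)  # [('IBM', 'Microsoft', 2), ('Microsoft', 'IBM', 1)]
```

Unlike the patent-level graph, the org-level graph can contain cycles (IBM cites Microsoft and vice versa), which is what Pagerank needs.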


To run the patents example on a 2-node cluster, run the command:

mortar run patents-pagerank -s 2 -1

It will finish in a little under an hour. Here's a comparison of how Pagerank ranks organizations versus simply ranking them by number of citations.

Patents Pagerank Output

You can see that Boehringer Ingelheim is 2nd by number of citations, but not even in the top 15 by Pagerank. Looking at the underlying graph data, we found that this was because almost all of Boehringer Ingelheim's citations came from a single company, Cypress Semiconductor, and no patents in the dataset's timeframe (2007-2012) cited Cypress Semiconductor.

This shows how Pagerank can help avoid a "fanboy effect", where one or a few non-influential nodes link unusually frequently to a node, making the latter seem more influential than it ought to be. In the context of web links, the "fanboy effect" is the equivalent of link spam or link farming for SEO.

Pagerank #3: Now with your data

You can configure the scripts to use your own data with just two steps.

  1. Generate a directed graph from your data.
    • There is a template script pigscripts/generate_my_graph.pig which has two examples of generating graphs to help you get started.
    • Your graph must have the schema "from, to, weight", where "from" and "to" are node-ids (either numbers or strings) and "weight" is a number describing the strength of the link between them.
    • It must be tab delimited.
    • If your graph is undirected, you need to make two records for each link A-B: one for A->B, and another for B->A.
    • If your graph is unweighted, simply set each weight value to 1.0.
  2. Set configuration values in controlscripts/my-pagerank.py.
    • These consist of input and output paths, as well as parameters to the Pagerank algorithm. See the comments in the file for explanations of each.
    • If your dataset is small (< 250 MB or so), you can follow the instructions in the file to configure the script to run in "local mode". Mortar's local mode installs Pig and Jython onto your local machine, so you can run your script on small data without the overhead of scheduling Hadoop jobs on a cluster. Running locally also has the advantage of being free no matter how much you use it.
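As an example of step 1, here's a small Python helper that emits a graph in the required format (a sketch; note the mirrored records for the undirected case):

```python
def write_graph_tsv(path, edges, undirected=False):
    """Write (from, to, weight) edges as tab-delimited lines."""
    with open(path, "w") as f:
        for src, dst, weight in edges:
            f.write(f"{src}\t{dst}\t{weight}\n")
            if undirected:
                # Undirected link A-B becomes two records: A->B and B->A
                f.write(f"{dst}\t{src}\t{weight}\n")

write_graph_tsv("my_graph.tsv", [("A", "B", 1.0)], undirected=True)
# my_graph.tsv now contains:
# A	B	1.0
# B	A	1.0
```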

Once you have configured your controlscript, you run it on a cluster or on your local machine. To run it on a cluster, use the command:

mortar run my-pagerank -s N

where N is a number of computing-nodes appropriate to the size of your data (try 3 per GB of uncompressed input data).

To run it on your local machine, use the command:

mortar local:run my-pagerank

Most graphs should produce good results in 10-20 iterations. If you find that you cannot get results for your data in a timely manner on a reasonably-sized cluster, the bottom section of this reference page has tips on how to improve Pagerank's performance.
