Jonathan Packer jspacker

  • Foresite Labs
  • Boston, MA
@jspacker
jspacker / gist:5286642
Last active December 15, 2015 16:09
twitter-pagerank controlscript: setting parameters
from pagerank_lib import Pagerank
# A directed graph with the schema "from, to, weight" and a tab delimiter.
EDGES_INPUT = "s3n://mortar-example-data/twitter-pagerank/influential_user_graph.gz"
# Iteration Parameters -- see README.md for more information
DAMPING_FACTOR = 0.85
CONVERGENCE_THRESHOLD = 0.0015 # set higher than usual to speed up the example
MAX_NUM_ITERATIONS = 20
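How the three iteration parameters interact can be sketched in plain Python, independent of Pig. This is a minimal sketch under assumptions, not the code in pagerank_lib: the function name `pagerank`, the toy graph, the even redistribution of rank held by nodes with no outgoing edges, and the use of the largest per-node rank change as the convergence measure are all assumptions made for illustration.

```python
def pagerank(edges, damping_factor=0.85, convergence_threshold=0.0015,
             max_num_iterations=20):
    """Weighted PageRank by power iteration over (from, to, weight) edges.

    Returns (ranks, iterations_run). Illustrative sketch only.
    """
    nodes = sorted(set(n for src, dst, _ in edges for n in (src, dst)))
    num_nodes = len(nodes)

    # Total outgoing weight per node, used to normalize each edge's share.
    out_weight = dict.fromkeys(nodes, 0.0)
    for src, _, weight in edges:
        out_weight[src] += weight

    ranks = dict.fromkeys(nodes, 1.0 / num_nodes)
    iterations_run = 0
    for _ in range(max_num_iterations):
        # Assumption: rank held by nodes without outgoing weight is spread evenly.
        dangling = sum(ranks[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping_factor + damping_factor * dangling) / num_nodes
        new_ranks = dict.fromkeys(nodes, base)
        for src, dst, weight in edges:
            if out_weight[src] > 0.0:
                new_ranks[dst] += damping_factor * ranks[src] * weight / out_weight[src]

        # Assumption: convergence is judged by the largest single-node change.
        max_change = max(abs(new_ranks[n] - ranks[n]) for n in nodes)
        ranks = new_ranks
        iterations_run += 1
        if max_change < convergence_threshold:
            break
    return ranks, iterations_run


# Toy graph: a -> b -> c -> a, plus d -> a. Node d has no incoming edges.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "a", 1.0), ("d", "a", 1.0)]
ranks, iterations_run = pagerank(edges)
```

On this toy graph, `d` has no incoming edges and so ends with the lowest rank. A looser `convergence_threshold` can stop the loop earlier at the cost of precision, while `max_num_iterations` caps the work when the threshold is never reached.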
jspacker / gist:5286649
Last active December 15, 2015 16:09
twitter-pagerank controlscript: a preprocessing step
print "Starting preprocessing step."
preprocess = Pig.compileFromFile(self.preprocessing_script)
preprocess_params = {
    "INPUT_PATH": self.edges_input,
    "PAGERANKS_OUTPUT_PATH": self.preprocess_pageranks,
    "NUM_NODES_OUTPUT_PATH": self.preprocess_num_nodes
}
preprocess_bound = preprocess.bind(preprocess_params)
preprocess_stats = preprocess_bound.runSingle()
jspacker / gist:5286656
Last active December 15, 2015 16:08
twitter-pagerank controlscript: the pagerank loop
iteration = Pig.compileFromFile(self.iteration_script)
for i in range(self.max_num_iterations):
    print "Starting iteration step: %s" % str(i + 1)
    # Append the iteration number to the input/output stems
    iteration_input = self.preprocess_pageranks if i == 0 else (self.iteration_pageranks_prefix + str(i-1))
    iteration_pageranks_output = self.iteration_pageranks_prefix + str(i)
    iteration_rank_changes_output = self.iteration_rank_changes_prefix + str(i)
    iteration_bound = iteration.bind({
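The input/output chaining in the loop above can be sketched in isolation. This is a minimal sketch, not the controlscript's code: `iteration_paths` and the example path strings are hypothetical names introduced for illustration. Only the chaining rule itself comes from the snippet: iteration 0 reads the preprocessing output, and iteration i reads the pageranks written by iteration i-1.

```python
def iteration_paths(i, preprocess_pageranks, pageranks_prefix, rank_changes_prefix):
    """Return (input, pageranks_output, rank_changes_output) paths for iteration i."""
    if i == 0:
        # The first iteration starts from the preprocessing step's output.
        iteration_input = preprocess_pageranks
    else:
        # Later iterations read what the previous iteration wrote.
        iteration_input = pageranks_prefix + str(i - 1)
    return iteration_input, pageranks_prefix + str(i), rank_changes_prefix + str(i)


# Hypothetical paths, for illustration only.
paths = [
    iteration_paths(i, "tmp/preprocess/pageranks",
                    "tmp/iteration/pageranks_", "tmp/iteration/rank_changes_")
    for i in range(3)
]
```

Each tuple's first element equals the second element of the tuple before it, which is what lets the loop run the same compiled Pig script repeatedly with different bound parameters.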
jspacker / gist:5286657
Last active December 15, 2015 16:09
twitter-pagerank controlscript: a choice of two postprocessing steps
print "Starting postprocessing step."
postprocess = Pig.compileFromFile(self.postprocessing_script)
postprocess_params = {
    "PAGERANKS_INPUT_PATH": iteration_pagerank_result,
    "OUTPUT_PATH": self.output_path
}
postprocess_bound = postprocess.bind(postprocess_params)
postprocess_stats = postprocess_bound.runSingle()
jspacker / gist:5286682
Last active December 15, 2015 16:09
diff between patents-pagerank and twitter-pagerank controlscripts
$ diff controlscripts/patents-pagerank.py controlscripts/twitter-pagerank.py
4c4
< EDGES_INPUT = "s3n://mortar-example-data/patents-pagerank/patent_organization_citation_graph"
---
> EDGES_INPUT = "s3n://mortar-example-data/twitter-pagerank/influential_user_graph.gz"
7,8c7,8
< DAMPING_FACTOR = 0.7
< CONVERGENCE_THRESHOLD = 0.0001
---
> DAMPING_FACTOR = 0.85
jspacker / gist:5286897
Last active December 15, 2015 16:09
twitter-pagerank output
BarackObama 0.0011093453259027883
kevinrose 9.50505168542551E-4
aplusk 7.931083722441003E-4
cnnbrk 7.842462485854226E-4
THE_REAL_SHAQ 7.586056702621888E-4
wefollow 7.163209390371575E-4
iamdiddy 6.888448387130898E-4
johncmayer 6.65759272468103E-4
Oprah 6.065305665210466E-4
jimmyfallon 5.92066251161108E-4
jspacker / gist:5286900
Created April 1, 2013 19:00
patents-pagerank output: pagerank rankings vs simple rankings by # citations
    By # citations                                 By Pagerank
-----------------------------------------------------------------------------------------------------------
1   International Business Machines Corporation    Samsung Electronics Co.
2   Boehringer Ingelheim International GmbH        International Business Machines Corporation
3   Sony Corporation                               Matsushita Electric Industrial Co.
4   Samsung Electronics Co.                        Panasonic Corporation
5   Microsoft Corporation                          Hitachi
6   Panasonic Corporation                          Canon Kabushiki Kaisha
7   Infineon Technologies AG                       Kabushiki Kaisha Toshiba
8   Hitachi                                        Ricoh Company
jspacker / gist:5287444
Last active September 15, 2021 21:32
pagerank blogpost super-rough draft

What's Important In My Data? Measuring Influence Using PageRank And Mortar

Networks are everywhere in a data-driven world: social networks, product purchasing networks, supply chain networks, and even biological networks. If your company sells anything to anyone, you have data that can be modelled as a network, or in mathematical terms, a "graph". Analyzing these graphs can explain how fundamental social, commercial, and physical systems behave, and consequently, how to make money from them (Google revenue in 2012: $50 billion).

The problem is that there is often so much data that it is hard to tell what to analyze first. One of the first questions to ask, then, is "which parts of my graph dataset are the most important?" For example, before one can investigate how Twitter users become influential, one has to find out who the most influential Twitter users are in the first place.

A well-known algorithm for finding the most important nodes in a graph is called [Pagerank](http://en.wi

jspacker / my-pagerank.py
Last active September 15, 2021 21:34
Template script to use Pagerank with your data
from pagerank_lib import Pagerank
# Your input data should be a directed graph with the schema "from, to, weight" and a tab delimiter.
#
# "from" and "to" can be of datatypes int, long, chararray, and bytearray.
# "weight" can be of datatypes int, long, float, and double.
#
# If your graph is undirected, you must add two edges for each connection, e.g. A to B and B to A.
# If your graph is unweighted, simply set each weight to 1.0
#
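The two data-preparation rules in the comments above can be sketched as a small helper. This is a minimal sketch, not part of the template: `to_directed_edge_lines` is a hypothetical name, and it assumes the undirected connections arrive as (node, node) pairs.

```python
def to_directed_edge_lines(connections, weight=1.0):
    """Expand undirected (a, b) pairs into tab-delimited "from, to, weight" lines."""
    lines = []
    for a, b in connections:
        # One directed edge per direction, both carrying the same weight.
        lines.append("%s\t%s\t%s" % (a, b, weight))
        lines.append("%s\t%s\t%s" % (b, a, weight))
    return lines


# Two undirected, unweighted connections become four directed edges of weight 1.0.
edge_lines = to_directed_edge_lines([("A", "B"), ("B", "C")])
```

Writing `edge_lines` to a file, one line each, gives input in the "from, to, weight" schema the template describes.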
jspacker / pagerank_lib.py
Created April 8, 2013 20:09
Pagerank library code
from org.apache.pig.scripting import Pig
class Pagerank:
    def __init__(self, edges_input,
                 damping_factor=0.85, convergence_threshold=0.0001, max_num_iterations=20,
                 temporary_output_prefix="hdfs://pig-pagerank",
                 output_path=None,
                 preprocessing_script="../pigscripts/pagerank_preprocess.pig",
                 iteration_script="../pigscripts/pagerank_iterate.pig",
                 postprocessing_script="../pigscripts/pagerank_postprocess.pig"):