@rjurney
Created May 13, 2011 01:34
Summarizing the Enron Data
#!/usr/bin/env jruby
#
# The purpose of this script is to experiment with Pacer/Tinkerpop stack for
# graph transformation. In it we will summarize a much larger graph to produce
# a new, smaller graph that can fit into RAM via TinkerGraph for more rapid,
# real-time analysis.
#
require 'rubygems'
require 'pacer'
require 'pacer-neo4j'
graph = Pacer.neo4j "/tmp/neo4j"
# Summarize relationships by computing raw, non-normalized weights between them
# - the number of emails sent and successfully received between each pair of egos.
senders = graph.v.filter {|v| v[:type] == 'Email Address'}
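# For each sender, follow SENT edges to messages, then RECEIVED_BY edges to
# the recipients' addresses. Property filters are passed as hashes.
groupings = senders.group.
  key_route { |sender| sender[:address] }.
  values_route(:sender) { |sender| sender.out_e('SENT').in_v(:type => 'Message').
    out_e('RECEIVED_BY').in_v(:type => 'Email Address')[:address] }
# Count how many times each recipient address appears per sender.
result = groupings.reduce(proc { Hash.new(0) }, :sender) { |h, e| h[e] += 1; h }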
puts "Summary computed..."
# Now create a new in-RAM TinkerGraph containing these summaries.
summary_graph = Pacer.tg
# Create vertices for senders
vertices = Hash.new
result.keys.each do |sender|
  vertices[sender] = summary_graph.create_vertex :type => 'email', :address => sender
end
puts "Summary Graph: vertices created..."
# Create outbound edges between them, weighted by volume of emails sent
puts "Starting to create edges..."
i = 0
vertices.keys.each do |sender|
  result[sender].each do |recipient, volume|
    summary_graph.create_edge nil, vertices[sender], vertices[recipient], :sent, {:volume => volume}
    # Print a progress dot every 1000 edges
    i += 1
    $stdout.write "." if i % 1000 == 0
  end
end
puts "Summary Graph: edges created..."
summary_graph.export("/tmp/enron_summary.xml")
graph.shutdown
summary_graph.shutdown
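
The counting step in the script (the reduce over a `Hash.new(0)`) can be tried out in plain Ruby, without Pacer or Neo4j. The following is a minimal sketch under stated assumptions: the `deliveries` list and the `summarize` helper are hypothetical stand-ins for the SENT/RECEIVED_BY traversal, and the addresses are made up rather than taken from the Enron data.

```ruby
# Hypothetical (sender, recipient) pairs standing in for the
# SENT -> Message -> RECEIVED_BY traversal; the addresses are invented.
deliveries = [
  ['alice@example.com', 'bob@example.com'],
  ['alice@example.com', 'bob@example.com'],
  ['alice@example.com', 'carol@example.com'],
  ['bob@example.com',   'alice@example.com']
]

# Count deliveries per sender and recipient, mirroring the Hash.new(0)
# accumulator used in the script's reduce step.
def summarize(pairs)
  summary = Hash.new { |hash, sender| hash[sender] = Hash.new(0) }
  pairs.each { |sender, recipient| summary[sender][recipient] += 1 }
  summary
end

summary = summarize(deliveries)

# Each (sender, recipient, volume) triple corresponds to one weighted edge
# in the summary graph.
summary.each do |sender, recipients|
  recipients.each do |recipient, volume|
    puts "#{sender} -> #{recipient}: #{volume}"
  end
end
```

The nested hash has the same shape the edge-creation loop expects: one entry per sender, each mapping recipient addresses to a volume.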