Created
June 5, 2011 09:25
-
-
Save rjurney/1008821 to your computer and use it in GitHub Desktop.
Solution to slow subgraph problem :)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require 'graph_client' | |
require 'pacer' | |
client = GraphClient.new | |
k = 16 | |
graph = client.get "louise.kitchen@enron.com" | |
#k_nodes = graph.v.filter{|v| v.out_e.count > k}.result | |
#subgraph = k_nodes.out_e.in_v.only(k_nodes).subgraph | |
puts graph | |
d = [] | |
graph.v.each do |v| | |
if v.out_e.count < k | |
d << v | |
end | |
end | |
d.each do |v| | |
graph.remove_vertex v | |
end | |
puts graph |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
As of Pacer 0.8.1 (to be released shortly) you could replace lines 12 to 21 with the following which would be significantly faster.
v.out_edges is much faster than v.out_e because out_edges is the native method for an individual vertex which returns an Iterable that then simply gets counted. On the other hand, v.out_e wraps the vertex in a 4 step route and turns that to a pipeline (IdentityPipe -> OutEdgesPipe -> CountSideEffectPipe -> SideEffectCapPipe) that then must be fully iterated. That construction of the route and pipeline isn't terribly heavyweight but inside a loop it is best to avoid it if possible.
Using .delete! on a route is much faster than looping through each of its vertices and calling remove_vertex on each one because it wraps the job using bulk_job so that the entire process can be done in a minimal number of transactions rather than a separate transaction for every vertex plus (potentially) each connected edge for each deleted vertex.
Finally, doing the entire job within the context of a route is much more memory efficient because each vertex is processed as it is emitted from the iterator, so no more than 1 vertex actually needs to be in memory at a time. Much better than collecting everything into an array in advance of processing!
Cheers,
Darrick