@rhunter
Created September 13, 2014 10:35
ruby-rdf performance
require 'rdf'
require 'rdf/do'
require 'rdf/turtle'
require 'do_sqlite3'
require 'sparql'
require 'benchmark'
sqlite_repo = RDF::DataObjects::Repository.new 'sqlite3::memory:'
mem_graph = RDF::Graph.new
sse = SPARQL.parse %q{
  SELECT ?time
  WHERE {
    ?obs1 a <http://example.com/sometype> .
    ?obs2 a <http://example.com/sometype> .
    ?obs1 <http://example.com/end> ?time .
    ?obs2 <http://example.com/start> ?time .
    ?obs1 <http://example.com/someprop> ?fromProp .
    ?obs2 <http://example.com/someprop> ?toProp .
    ?obs2 <http://example.com/someprop> "interesting" .
    FILTER ( ?fromProp != ?toProp )
  }
}
##
#repo << RDF::Turtle::Reader.new(<<-TURTLE)
#PREFIX ex: <http://example.com/>
#
# _:b1 a ex:sometype .
# _:b1 ex:start 0 .
# _:b1 ex:end 10 .
# _:b1 ex:someprop "boring" .
# _:b2 a ex:sometype .
# _:b2 ex:start 10 .
# _:b2 ex:end 20 .
# _:b3 ex:someprop "interesting" .
# _:b3 a ex:sometype .
# _:b3 ex:start 20 .
# _:b3 ex:end 30 .
# _:b4 ex:someprop "boring" .
# _:b4 a ex:sometype .
# _:b4 ex:start 30 .
# _:b4 ex:end 40 .
#
# Adds four statements (type, start, end, someprop) per subject to repo.
# Subjects are named observation1..observationN, so calling this again with a
# larger N only adds the new subjects; existing statements are not duplicated.
def populate_with_statements(desired_subjects, repo)
  ex = RDF::Vocabulary.new('http://example.com/')
  (1..desired_subjects).each do |i|
    subject = ex["observation#{i}"]
    repo << RDF::Statement.new(subject, RDF::type, ex.sometype)
    repo << RDF::Statement.new(subject, ex.start, RDF::Literal.new(i * 10))
    repo << RDF::Statement.new(subject, ex.end, RDF::Literal.new((i * 10)+10))
    repo << RDF::Statement.new(subject, ex.someprop, RDF::Literal.new(i % 3 == 0 ? 'interesting' : 'boring'))
  end
end
puts ""
puts "RDF::Graph query times"
puts ""
puts "subjects statements time "
puts "----------------------------------------"
[5, 8, 10, 12, 15, 30, 60, 100, 200, 300, 400, 500, 600, 700, 800, 1600].each do |number_of_subjects|
  populate_with_statements(number_of_subjects, mem_graph)
  printf "%10d %10d ", number_of_subjects, mem_graph.count
  time = Benchmark.realtime { sse.execute(mem_graph) }
  printf "%18.5f\n", time
  break if time > 30 # stop once a single query takes longer than 30 seconds
end
puts "========================================"
puts ""
# to see the SQL statements being executed:
# ::DataObjects.logger.set_log(STDERR, :debug)
puts ""
puts "DO+SQLite query times"
puts ""
puts "subjects statements time "
puts "----------------------------------------"
[5, 8, 10, 12, 15, 30, 60, 100, 200, 300, 400, 500, 600, 700, 800, 1600].each do |number_of_subjects|
  populate_with_statements(number_of_subjects, sqlite_repo)
  printf "%10d %10d ", number_of_subjects, sqlite_repo.count
  time = Benchmark.realtime { sse.execute(sqlite_repo) }
  printf "%18.5f\n", time
  break if time > 30 # stop once a single query takes longer than 30 seconds
end
puts "========================================"
puts ""

I have some data I'm playing with that seemed like a good fit for RDF-style interaction, so I thought I'd give ruby-rdf a whirl.

I'm using ruby-rdf, sparql, and rdf-do (backed by do_sqlite3).

I've found that even on a small dataset (a couple of hundred statements), a SPARQL query like the following takes minutes to execute against the SQLite-backed repository (over a minute at 120 statements, about ten minutes at 240):

  SELECT ?time
  WHERE {
    ?obs1 a <http://example.com/sometype> .
    ?obs2 a <http://example.com/sometype> .
    ?obs1 <http://example.com/end> ?time .
    ?obs2 <http://example.com/start> ?time .
    ?obs1 <http://example.com/someprop> ?fromProp .
    ?obs2 <http://example.com/someprop> ?toProp .
    ?obs2 <http://example.com/someprop> "interesting" .
    FILTER ( ?fromProp != ?toProp )
  }

By contrast, an in-memory RDF::Graph loaded with the same statements executed the same query in under a second.
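
To make the intent of the query concrete, here is a minimal plain-Ruby sketch of what it asks for, run over the same synthetic observations that populate_with_statements generates. It uses no RDF library at all; Observation, build_observations and matching_times are names made up for this illustration, and it says nothing about how either repository actually evaluates the query.

```ruby
# Plain-Ruby model of the query; no RDF libraries involved.
# Observation, build_observations and matching_times are hypothetical names.
Observation = Struct.new(:id, :start_time, :end_time, :someprop)

# Mirrors populate_with_statements: start = i * 10, end = i * 10 + 10,
# and every third observation is "interesting".
def build_observations(count)
  (1..count).map do |i|
    Observation.new(i, i * 10, (i * 10) + 10, i % 3 == 0 ? 'interesting' : 'boring')
  end
end

# Naive nested-loop join over all (obs1, obs2) pairs. A pair matches when
# obs2 starts exactly when obs1 ends, obs2 is "interesting", and the two
# someprop values differ. The shared time is collected for each match.
def matching_times(observations)
  observations.flat_map do |obs1|
    observations
      .select { |obs2| obs2.start_time == obs1.end_time }
      .select { |obs2| obs2.someprop == 'interesting' }
      .select { |obs2| obs1.someprop != obs2.someprop }
      .map(&:start_time)
  end
end

times = matching_times(build_observations(15))
puts times.inspect # prints [30, 60, 90, 120, 150]
```

With 15 subjects only five pairs match, so the result set is tiny; the cost is in the joins, not in the output. Even this naive version only compares count × count pairs.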

This surprised me a little, so I'm wondering:

  • Is this difference something I should expect? (Perhaps RDF::Graph is heavily optimised, but RDF::DataObjects::Repository is not yet.)

  • Am I doing something wrong? (maybe the query is too complex, or perhaps SQLite is an inappropriate store for hundreds of statements)

Any guidance would be appreciated.

Comparison (manually generated; times in seconds):

Statements   RDF::Graph     DO/SQLite
----------------------------------------
        20     0.007547      0.419351
        32     0.012365      1.587681
        40     0.01531       2.966287
        48     0.026306      5.217077
        60     0.038906     10.024653
       120     0.142947     77.654245   (over a minute)
       240     0.555783    620.746061   (10 minutes)
       400     1.750105   3007.14895    (about an hour)

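
A quick way to read the comparison table above is to compute the slowdown factor per row and an empirical growth exponent between two rows. The sketch below does that with the numbers copied from the table. The power-law model (time proportional to statements**k) is an assumption made only for this estimate, and growth_exponent is a helper name invented here.

```ruby
# Rows copied from the comparison table:
# [statements, RDF::Graph seconds, DO/SQLite seconds].
ROWS = [
  [20, 0.007547, 0.419351],
  [32, 0.012365, 1.587681],
  [40, 0.01531, 2.966287],
  [48, 0.026306, 5.217077],
  [60, 0.038906, 10.024653],
  [120, 0.142947, 77.654245],
  [240, 0.555783, 620.746061],
  [400, 1.750105, 3007.14895]
].freeze

# Slowdown factor of DO/SQLite relative to RDF::Graph for each row.
ROWS.each do |statements, graph_s, sqlite_s|
  printf "%10d %12.1fx\n", statements, sqlite_s / graph_s
end

# Empirical exponent k between two rows, assuming time ~ statements**k.
# column is 1 for RDF::Graph and 2 for DO/SQLite.
def growth_exponent(row_a, row_b, column)
  Math.log(row_b[column] / row_a[column]) / Math.log(row_b[0].to_f / row_a[0])
end

graph_k  = growth_exponent(ROWS[5], ROWS[6], 1)
sqlite_k = growth_exponent(ROWS[5], ROWS[6], 2)
printf "RDF::Graph exponent (120 -> 240 statements): %.2f\n", graph_k
printf "DO/SQLite exponent  (120 -> 240 statements): %.2f\n", sqlite_k
```

On these numbers the gap widens from roughly 56x at 20 statements to roughly 1700x at 400. Between 120 and 240 statements the RDF::Graph time grows with an exponent near 2, while the DO/SQLite time grows with an exponent near 3. That suggests the two stores differ in how the joins scale, not just by a constant per-query overhead, though two data points are thin evidence.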
Data (generated by the benchmark script; times in seconds):

RDF::Graph query times

  subjects statements               time
----------------------------------------
         5         20            0.00772
         8         32            0.01013
        10         40            0.01912
        12         48            0.02472
        15         60            0.03587
        30        120            0.17516
        60        240            0.57042
       100        400            1.61644
       200        800            7.28083
       300       1200           15.95894
       400       1600           29.17357
       500       2000           48.55133
========================================

DO+SQLite query times

  subjects statements               time
----------------------------------------
         5         20            0.78602
         8         32            1.85390
        10         40            3.39295
        12         48            7.50410
        15         60           12.74590
        30        120           96.94181
========================================