johnulist/SPARQL-tests.md

SPARQL CONSTRUCT comparison

I had some days left on a physical machine we used for an EU FP7 research project, so I took the chance to compare 3 triplestores that my colleagues or I have worked with in the past months. I do not want to imply anything with this test; it's just me playing around and having fun with RDF. If you have any comments, add them here.

Hardware

The test platform comprises a dedicated server, not a virtual machine, with the following specification:

2 x Intel Xeon E5-2620 v2, 2 x (6 x 2.10 GHz) (appears as 24 cores in htop, as each core runs two hardware threads)
128 GB buffered ECC RAM
1000 GB SSD (Samsung 840 EVO)
Ubuntu 14.04

Dataset

The dataset contains 5 million triples (including some ill-typed literals, since "NA" is declared as xsd:int). Each record describes a transport between two entities on a given date. To optimize query execution time for our particular use case, we want to materialize some triples up front so that we don't have to go through all the data on every query.
Source: http://ktk.netlabs.org/misc/bfs/blv.nt (622 MB)
A sample record, in Turtle:
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pobo: <http://purl.obolibrary.org/obo/> .


<http://foodsafety.data.admin.ch/move/0> a schema:TransferAction ;
  schema:fromLocation <http://foodsafety.data.admin.ch/business/50454> ;
  schema:toLocation <http://foodsafety.data.admin.ch/business/50415> ;
  dc:date "2012-01-01"^^xsd:date ;
  pobo:UO_0000189 "1"^^xsd:int .
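The ill-typed "NA" literals mentioned above can be located before loading. The following is a minimal sketch, assuming the dump is line-based N-Triples (as the .nt extension suggests) with the typed literal at the end of each statement. The helper names and the second sample line are hypothetical and only serve as illustration.

```python
import re

XSD_INT = "http://www.w3.org/2001/XMLSchema#int"

# Matches a typed literal at the end of an N-Triples statement:
#   "lexical form"^^<datatype IRI> .
TYPED_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"\^\^<([^>]+)>\s*\.\s*$')

# Lexical form of an integer: optional sign followed by digits.
INT_LEXICAL = re.compile(r"[+-]?[0-9]+")


def is_valid_xsd_int(lexical):
    """Return True if the lexical form is a valid xsd:int (32-bit signed)."""
    if INT_LEXICAL.fullmatch(lexical) is None:
        return False
    return -2**31 <= int(lexical) <= 2**31 - 1


def invalid_int_lines(lines):
    """Yield (line_number, lexical_form) for every ill-typed xsd:int literal."""
    for number, line in enumerate(lines, start=1):
        match = TYPED_LITERAL.search(line)
        if match and match.group(2) == XSD_INT and not is_valid_xsd_int(match.group(1)):
            yield number, match.group(1)


# Two sample statements; the second one (with "NA") is made up for illustration.
sample = [
    '<http://foodsafety.data.admin.ch/move/0> <http://purl.obolibrary.org/obo/UO_0000189> "1"^^<http://www.w3.org/2001/XMLSchema#int> .',
    '<http://foodsafety.data.admin.ch/move/1> <http://purl.obolibrary.org/obo/UO_0000189> "NA"^^<http://www.w3.org/2001/XMLSchema#int> .',
]
print(list(invalid_int_lines(sample)))  # prints [(2, 'NA')]
```

To scan the real dump, pass an open file object instead of the list; the generator reads it line by line, so the 622 MB file does not have to fit in memory.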
There are around 900,000 TransferActions in there. We torture the server with the following CONSTRUCT (well, INSERT) query:
PREFIX blv: <http://blv.ch/>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

INSERT {
    ?othermove blv:notBefore ?move .
}
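WHERE {
    # ?move arrives at ?toFarm on ?date
    ?move a schema:TransferAction ;
        dc:date ?date ;
        schema:toLocation ?toFarm .

    # ?othermove leaves the same farm on ?otherdate
    ?othermove a schema:TransferAction ;
        dc:date ?otherdate ;
        schema:fromLocation ?toFarm .

    # keep only pairs where ?move happened on or before ?othermove
    FILTER (?date <= ?otherdate)
}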
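To make the join in the WHERE clause concrete, here is a small sketch that mimics it in plain Python on three made-up moves. It is not how any of the triplestores evaluate the query; the function name, the move identifiers and the locations are hypothetical.

```python
from collections import defaultdict
from datetime import date


def not_before_pairs(moves):
    """Return (othermove, move) pairs matching the WHERE clause.

    moves maps a move id to (from_location, to_location, date).
    """
    # Index moves by the location they arrive at (schema:toLocation).
    arrivals = defaultdict(list)
    for move_id, (_, to_loc, day) in moves.items():
        arrivals[to_loc].append((move_id, day))

    pairs = []
    for other_id, (from_loc, _, other_day) in moves.items():
        # ?othermove leaves from the farm that ?move arrived at ...
        for move_id, day in arrivals.get(from_loc, []):
            # ... and ?move happened on or before ?othermove.
            if day <= other_day:
                pairs.append((other_id, move_id))
    return pairs


# Made-up sample data: move/0 arrives at B, move/1 and move/2 leave from B.
moves = {
    "move/0": ("business/A", "business/B", date(2012, 1, 1)),
    "move/1": ("business/B", "business/C", date(2012, 1, 5)),
    "move/2": ("business/B", "business/D", date(2011, 12, 30)),
}
print(not_before_pairs(moves))  # prints [('move/1', 'move/0')]
```

Each pair corresponds to one generated blv:notBefore triple. For a single farm the number of pairs can approach arrivals times departures, which is why the result is much larger than the number of moves.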
After successful execution, I check how many triples were generated:
SELECT  (COUNT(*) AS ?c) WHERE {?s <http://blv.ch/notBefore> ?o}
This should return around 30 million triples.
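The count check can also be scripted. The sketch below sends the query according to the SPARQL 1.1 Protocol (query via GET) and reads the count from the SPARQL JSON results format. It assumes the store exposes a standard query endpoint that returns JSON when asked for it; the endpoint URL is deployment-specific, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

COUNT_QUERY = "SELECT (COUNT(*) AS ?c) WHERE {?s <http://blv.ch/notBefore> ?o}"


def build_query_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol query request (query via GET)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def extract_count(results_json, variable="c"):
    """Read the single COUNT binding from a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return int(bindings[0][variable]["value"])


def count_not_before(endpoint):
    """Run the count query against a live endpoint (needs a running store)."""
    request = build_query_request(endpoint, COUNT_QUERY)
    with urllib.request.urlopen(request) as response:
        return extract_count(response.read().decode("utf-8"))
```

Calling count_not_before with the store's query endpoint URL returns the number of blv:notBefore triples as an integer.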

Results

Note that I did not do any optimization on the configurations. My idea was to take what vendors ship by default and see how long it takes. Because that's what users usually do ;)
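The text does not say how the execution times below were measured. One possible way, sketched here under the assumption that the store accepts SPARQL 1.1 Protocol updates as a URL-encoded POST, is to time the HTTP request from the client side. The endpoint URL and the helper names are hypothetical.

```python
import time
import urllib.parse
import urllib.request


def build_update_request(endpoint, update):
    """Build a SPARQL 1.1 Protocol update request (URL-encoded POST)."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def timed(action):
    """Call action() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def run_update(endpoint, update):
    """Send the update to a live endpoint and return the elapsed seconds."""
    request = build_update_request(endpoint, update)

    def send():
        # Blocks until the store has answered the update request.
        with urllib.request.urlopen(request) as response:
            return response.status

    _, seconds = timed(send)
    return seconds
```

Client-side timing includes network and HTTP overhead, which is negligible next to runtimes measured in minutes.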

Virtuoso


Homepage: http://virtuoso.openlinksw.com/
Version: Virtuoso version 07.20.3215 on Linux (x86_64-unknown-linux-gnu), Single Server Edition
Host: docker, image tenforce/virtuoso
Query execution time: 23 minutes

Remarks

Loading RDF was fast; I did it with iSQL, following the documentation of the Docker image. Virtuoso does not seem to use more than one core: during the whole execution I had 100% load on one of the 24 cores while the rest did nothing.

Stardog


Homepage: http://stardog.com/
Version: 4.0.5, Enterprise license (1 month trial key)
Host: docker, image java:latest as there is no public docker image available.
Run: Default configuration started with stardog-admin server start
Query execution time: 4.00 minutes

Remarks

Loading was fast; I used stardog data add on the command line. I had the impression that some query optimization was going on: in the beginning there was little activity on the cores, but after a while the box became busier and I saw considerable load on all of them. By far the fastest query execution time.

Blazegraph


Homepage: https://www.blazegraph.com/
Version: 2.1.0
Host: docker, image java:latest as there is no public docker image available.
Run: java -server -Xmx8g -jar blazegraph.jar
Query execution time: 33 minutes

Remarks

I first used a Docker image but didn't notice that it shipped the old 1.x version. With it I ran into a bug while executing the query on the 24-core machine, and I was asked to retry with 2.x. Make sure you use 2.x as well, since all Docker images seem to be 1.x. Loading was fast; I loaded the data from a URI through the SPARQL UPDATE web interface. Blazegraph was the most active on all cores, with considerable load on them for the whole run. I also tried with 64 GB of memory allocated to the JVM, but memory was apparently not the bottleneck.

Jena Fuseki


Homepage: https://jena.apache.org/documentation/serving_data/
Version: 2.0.1-SNAPSHOT
Host: docker, image stain/jena-fuseki
Query execution time: TODO (the run aborted, see remarks)

Remarks

I started the Docker image and loaded the data with tdbloader into /fuseki/databases/blv. After that I created a new database in the web interface, which apparently did not overwrite the TDB store. Loading was fast. While executing the query there was high load on all cores. After roughly 14 minutes I ran into GC issues and the request failed:
[2016-04-26 21:41:47] Fuseki     INFO  [26] 500 GC overhead limit exceeded (837.944 s)

I am not sure there is much I can do here; I will post to the Jena mailing list.

Ontotext GraphDB


Homepage: http://ontotext.com/products/graphdb/
Version: GraphDB Free 7.0
Host: docker, image java:latest as there is no public docker image available.
Run: ~/graphdb-free-7.0.0/bin# ./graphdb
Query execution time: 16 minutes

Remarks

I created a new store with the default configuration and did not change any default settings such as cache size. I loaded the data via URL, which was fast. I saw load on only one core.