The first goal is to describe a bit of what could be called misperimenting with graphDB, RDF, and all that fun stuff, without being too formal, and (I am very sorry for that) with some misenglish everywhere.
And of course, if it could help or inspire anyone to run more useful tests, that would be great (I did this a few months ago but have not had time to pursue it).
There are not so many graphDB implementations, which is quite odd considering the hype around the NoSQL idea and current social graph applications.
Another great 'not so fast to grasp' concept of the last few years is this Semantic Web thing.
Honestly, representation aside, I admit to liking the RDF concept without knowing a lot about the subject. It is a simple enough brick to be a solid foundation: the kind of generic/universal representation you work your way towards, only to discover it already exists.
OK, I want to graphDB, and I want to RDF, so let's try both at once. RDF is just a graph representation (like XML), so it is certainly easy to put some into a graphDB.
So, looking at graphDBs with some commercial usage and recent activity, I quickly ended up looking at Neo4j: certainly not my first choice (the JVM and I regularly exchange some rough words) but certainly my most mainstream choice (oh, there is some Scala in the source of its query language, so it shouldn't be that bad after all), and the doc is good.
RDF is a standard with quite a lot of tools and some really nice initiatives. Although I am not a user, I'm jealous of the KDE way of metadata-ing everything through an RDF backend. So what better choice, to get a rough idea of how my tests perform, than using the Virtuoso quad store as a comparison (when I say compare, I mean get a first look, not benchmark).
So between Neo4j and Virtuoso there are two different query languages: a standard for RDF (SPARQL) and the Neo4j graph query language, Cypher. SPARQL has an SQL-like taste which makes it really easy, and Cypher is quite intuitive for directed graphs (I love using arrows in my languages).
Yes, putting the conclusion here is obviously wrong, but I need to spoil a little, otherwise nobody will read everything.
- I like neo4j, and Cypher is really intuitive.
- For pure RDF I should use virtuoso.
- Translating SPARQL to Cypher is quite easy in my simple tests.
- Graphs still miss a point for RDF (otherwise what use would we have for RDF stores?): the qualifier of a relation between two nodes is not another node. There is a way to represent RDF more faithfully (with the description between two resources being a third resource), but it would obviously be less efficient than my naïve method.
- Being able to describe your descriptions is what I like in RDF, but in real life it is far from being the common use case. All this to say I still think I need my graphDB where every edge is natively usable as a node, and machine compiled (yes, I know that is not really a graphDB).
For the dataset and use case, I chose an RDF benchmark, so the data is here and the SPARQL queries are here too (I made some minor changes to the SPARQL queries): Berlin SPARQL Benchmark.
This is a design I do not like, but it should be more efficient than the others :
+----------+       description       +----------+
|Resource 1|------------------------>|Resource 2|
+----------+                         +----------+

to

+-----+            relation            +-----+
|node1|----------------------------->|Node2|
+-----+                                +-----+
The problem is that a description in RDF can be used as a resource, whereas a relation in a graph cannot be used as a node. In fact, using a relation as a node could still be done programmatically by comparing a relation name with a node name. Indexes (see next part) might be added.
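To make the naïve mapping concrete, here is a minimal Python sketch (not my Haskell import code): every distinct subject or object becomes one node, and every triple becomes one edge typed by its predicate. The function name and the integer node ids are only illustrative assumptions.

```python
def triples_to_graph(triples):
    """Naive mapping: one node per distinct subject/object, one typed edge per triple."""
    node_ids = {}  # resource or literal value -> node id (hypothetical ids)
    edges = []     # (from node id, relationship type, to node id)
    for s, p, o in triples:
        for val in (s, o):
            # Reuse the node if this value was already seen.
            node_ids.setdefault(val, len(node_ids))
        edges.append((node_ids[s], p, node_ids[o]))
    return node_ids, edges


nodes, edges = triples_to_graph([
    ("<a>", "<knows>", "<b>"),
    ("<b>", "<knows>", "<c>"),
])
```

Note that the predicate only lives in the edge type here, which is exactly why it cannot be described in turn like an RDF resource.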
Yet a correct (and highly inefficient) graph representation could have been :
                     +--------+
                   _>|relation|_
                 _/  +--------+  \_
"from" ++ uid __/                  \_ "to" ++ uid
           __/                       \_
+-----+  _/                            \  +-----+
|node1|_/                               \>|node2|
+-----+                                   +-----+
It is obvious that the need to compare edge uids to resolve a relation is way too costly.
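For comparison, a small Python sketch of this second modeling, which I did not test: the predicate becomes a node of its own, and each statement contributes a "from" edge and a "to" edge sharing the same uid. Using the statement index as uid is an assumption made for the example.

```python
def triples_to_reified_graph(triples):
    """Second modeling: the predicate is a node, linked by 'from<uid>' and 'to<uid>' edges."""
    node_ids = {}  # value -> node id (subjects, predicates and objects alike)
    edges = []     # (from node id, edge name, to node id)
    for uid, (s, p, o) in enumerate(triples):
        for val in (s, p, o):
            node_ids.setdefault(val, len(node_ids))
        # The shared uid is what ties the two halves of one statement together.
        edges.append((node_ids[s], f"from{uid}", node_ids[p]))
        edges.append((node_ids[p], f"to{uid}", node_ids[o]))
    return node_ids, edges


nodes, edges = triples_to_reified_graph([("<a>", "<knows>", "<b>")])
```

Resolving one statement now means matching the uid of a "from" edge against the uid of a "to" edge, which is the costly comparison mentioned above.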
To keep it simple I do not care about namespaces, and I use a dirty character escaping for graph relation names (not all URI characters are allowed; true, I should have used a metadata description instead. I remember seeing in the doc the right way to store those escaped characters, but I did these tests in December and do not remember it).
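The escaping can be sketched in a few lines of Python. The table below is inferred from the relationship names that appear in the queries of this post, not copied from the import program, so treat it as an assumption; the function name is hypothetical.

```python
# Escape table inferred from the relationship names used in the queries.
# '/' and '>' both map to '_s', so this encoding cannot be reversed reliably.
ESCAPES = {
    "<": "_i",
    ">": "_s",
    "/": "_s",
    ":": "_c",
    ".": "_p",
    "-": "_m",
    "#": "_d",
}


def escape_uri(uri):
    """Turn a bracketed URI into a name usable as a relationship type."""
    return "".join(ESCAPES.get(ch, ch) for ch in uri)


print(escape_uri("<http://www.w3.org/2000/01/rdf-schema#label>"))
```

The collision on '_s' is one more reason why keeping a separate mapping to the original URI would have been the cleaner choice.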
The tricky part here was finding which part of Virtuoso deals with the RDF quad store; for my needs I would have liked a separate, quad-store-only distribution of Virtuoso.
After finding the right part in the doc, it was simple and fast:
- Install virtuoso package for archlinux (6.1.6-1 at the time)
- Start it dirty :
touch virtuo.ini
sudo virtuoso-t +foreground +configfile /var/lib/virtuoso/db/virtuoso.ini
and the server is online on port 1111 (when using the Haskell API you need to use this port)
- Then access it through the webadmin
- Init some data (don't delete it after the import, you will need it to feed Neo4j): see the data generator, with options
-s nt -pc 1000 -fc -dir . -fn out
aka the size is only 1000 (I use N-Triples because it can be used by both Swish and Virtuoso).
- Then import it via the quad store upload in the previous Conductor webapp: 90 MB injected in about 2-3 minutes, in /dav/test1.
Here I made THE wrong choice: I should have read the documentation about mass import of data and generated an input file with a short Haskell program; instead I used the only Haskell-related Neo4j library on Hackage.
The cypher Haskell library is designed for sending Cypher requests to Neo4j through its REST API. So in my design I just create node by node, relation by relation, with a REST HTTP connection (and a transaction) for each one -> indeed very slow (it was still interesting to use Swish in conjunction with cypher). TODO link to source. Anyway, this code is a mess, but it shows how simple it is to transcribe RDF to a graph.
So my Haskell program reads the previously generated dataset (through the Swish parser) and creates all of that very, very slowly.
Please note that, to keep it simple, I typed nothing in Neo4j (everything is a string), so the import is very generic but very wrong too (I could at least have done it for ints).
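Typing at least the integers would not have cost much. Here is a rough Python sketch of the idea (my import is in Haskell and does not do this): a deliberately simplified N-Triples line parser that converts literals tagged with the xsd:integer datatype. It assumes the dataset marks integers that way and ignores blank nodes, language tags and escaped quotes; it also strips the angle brackets, unlike my stored values.

```python
import re

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Simplified N-Triples line: <s> <p> (<o> | "literal"[^^<datatype>]) .
LINE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)"(?:\^\^<([^>]*)>)?)\s*\.\s*$'
)


def parse_line(line):
    """Return (subject, predicate, object), with integer literals converted to int."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unsupported N-Triples line: {line!r}")
    s, p, o_uri, literal, datatype = m.groups()
    if o_uri is not None:
        return s, p, o_uri
    if datatype == XSD_INTEGER:
        return s, p, int(literal)
    return s, p, literal
```

With typed values, a filter such as a rating equal to 1 could compare numbers instead of strings.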
The Berlin SPARQL use cases lack deep graph crawling (they are more real-world and sometimes have a disturbing SQL flavor), so I just chose some easy cases, and my first query is the most basic one: 'Query 1 from case 1'.
Comparing the query in SPARQL and in Cypher (SPARQL first), you can see how close they are (it will be more obvious in the next queries).
PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?product ?label ?value1
WHERE {
?product rdfs:label ?label .
?product a bsbm-inst:ProductType58 .
?product bsbm:productFeature bsbm-inst:ProductFeature182 .
?product bsbm:productFeature bsbm-inst:ProductFeature179 .
?product bsbm:productPropertyNumeric6 ?value1 .
}
ORDER BY ?label
LIMIT 10
start type=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductType58>')
match type <-[:_ihttp_c_s_swww_pw3_porg_s1999_s02_s22_mrdf_msyntax_mns_dtype_s]- product -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> label
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]-> feature1
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]-> feature2
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyNumeric6_s]-> num1
where feature1.val='<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductFeature182>'
and feature2.val='<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductFeature179>'
return product.val,label.val,num1.val
order by label.val
limit 10;
An interesting thing is that Cypher is graph oriented and needs a start node: in this case the product type. Also nice is the arrow syntax, which stands for a directed edge. So obviously, to select the start node we need an index (the Haskell import program uses requests which rely on the auto-index). That is Lucene indexing, and I think it may explain the lower performance of every first query (after restarting the Neo4j service).
Queries on Virtuoso were run with the isql command (something like isql-vt 1111 errors=stdout <virtq1 >test.out).
Queries on Neo4j were run through the webadmin console.
I got similar results :
+------------------------------------------------------------------------------------------------------------------------------------------+
==> | "<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer19/Product897>" | "nonradical warehousing" | "1068" |
==> | "<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer21/Product996>" | "skivvies opportunism knavishly" | "831" |
284 ms the first time, then more or less 40 ms for Neo4j (maybe 20 ms), versus some 10 ms on Virtuoso. But looking at the query, this is something a relational database would handle just fine.
More interesting (some 90 ms on virtuoso) :
SPARQL PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?label ?comment ?producer ?productFeature ?propertyTextual1 ?propertyTextual2 ?propertyTextual3
?propertyNumeric1 ?propertyNumeric2 ?propertyTextual4 ?propertyTextual5 ?propertyNumeric4
WHERE {
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> rdfs:label ?label .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> rdfs:comment ?comment .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:producer ?p .
?p rdfs:label ?producer .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> dc:publisher ?p .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productFeature ?f .
?f rdfs:label ?productFeature .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyTextual1 ?propertyTextual1 .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyTextual2 ?propertyTextual2 .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyTextual3 ?propertyTextual3 .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyNumeric1 ?propertyNumeric1 .
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyNumeric2 ?propertyNumeric2 .
OPTIONAL { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyTextual4 ?propertyTextual4 }
OPTIONAL { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyTextual5 ?propertyTextual5 }
OPTIONAL { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productPropertyNumeric4 ?propertyNumeric4 }
};
And on Neo4j, from 1620 down to 353 ms at start, then 300 ms most of the time (some odd 26 or 40 ms runs may be related to cache):
start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> label
, product -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dcomment_s]-> comment
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproducer_s]-> producer -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> producerlab
, product -[:_ihttp_c_s_spurl_porg_sdc_selements_s1_p1_spublisher_s]-> publisher
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> featurelabels
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual1_s]-> ptext1
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual2_s]-> ptext2
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual3_s]-> ptext3
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyNumeric1_s]-> pnum1
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyNumeric2_s]-> pnum2
, product -[?:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual4_s]-> ptext4
, product -[?:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual5_s]-> ptext5
return label.val,comment.val,producerlab.val,featurelabels.val,ptext1.val,ptext2.val,ptext3.val,pnum1.val,pnum2.val,ptext4.val,ptext5.val
So nothing more to say than before: the test conditions are not optimal for comparing both products. Yet something interesting was the impact of the 'optional' conditions ('?' in Cypher): without them the Neo4j query took between 22 and 43 ms, which suggests that optional matches have a big cost.
Something more modern (and not from the standard benchmark): for a given product I try to find similar products based on offers (features).
SPARQL PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT count(?features2)as ?nb, ?labCand
WHERE {
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productFeature ?features .
?prodCand bsbm:productFeature ?features .
?prodCand bsbm:productFeature ?features2 .
?prodCand2 bsbm:productFeature ?features2 .
?prodCand2 rdfs:label ?labCand
} ORDER BY ?nb;
start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand
, prodcand -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features2 <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand2 -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> labelcand
return count( features2) as nb, labelcand.val
order by nb;
Those queries are two levels deep. With only one level of features we got some 25 ms in SPARQL and 50 ms with Neo4j, with the same set of results.
With two levels, it is really interesting to note that the results differ between the two queries. On Neo4j, it took from 20 seconds down to 15 or 11 seconds, with a best count of 2290 and only 20 more results than for one level. With SPARQL, some 5.8 seconds but a count of 3145 for the best result.
The bias seems to come from the fact that, with SPARQL, the products found during the first-level inspection are counted again in the second inspection. With Neo4j, nodes from the first-level inspection are not inspected again at the second level.
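One plausible way to reproduce this kind of gap on a toy graph, assuming that Cypher matches each relationship at most once within a pattern while SPARQL puts no such restriction on its triple patterns (I did not check this on the real dataset). The Python sketch below brute-forces the two-level pattern on four made-up productFeature edges and counts the matches under both rules.

```python
from itertools import product

# Made-up productFeature edges: (product, feature).
EDGES = {("P", "f1"), ("A", "f1"), ("A", "f2"), ("B", "f2")}


def two_level_matches(start, edges, unique_edges):
    """Matches of: start -> feat <- cand -> feat2 <- cand2."""
    nodes = {n for edge in edges for n in edge}
    found = []
    for feat, cand, feat2, cand2 in product(nodes, repeat=4):
        path = [(start, feat), (cand, feat), (cand, feat2), (cand2, feat2)]
        if not all(edge in edges for edge in path):
            continue
        # Assumed Cypher-like rule: the same edge may not be used twice.
        if unique_edges and len(set(path)) != len(path):
            continue
        found.append((feat, cand, feat2, cand2))
    return found


print(len(two_level_matches("P", EDGES, unique_edges=False)))  # 6
print(len(two_level_matches("P", EDGES, unique_edges=True)))   # 1
```

Without the uniqueness rule the start product and the first-level candidates come back as their own neighbours, which inflates the counts in the same direction as what I observed with SPARQL.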
This is very important to note: Neo4j is closest to what I was looking for. With my SPARQL query, the most relevant results are those from the first level of inspection (those I got when running the query on only one level of features).
Whether we can derive the Neo4j results from the SPARQL results, and the other way around, is a not-so-easy question. I think we could at least approximate (which is bad in case of a deeper inspection): a nice question, but a little too mathematical for me (at this hour for sure).
Last attempt to do something with this dataset: same as the third query, but based on offers, similar features and number of reviews (it is simpler to read the query than to try to understand my sentence).
SPARQL PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT count(?reviews) as ?nb, ?labCand
WHERE {
<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343> bsbm:productFeature ?features .
?prodCand bsbm:productFeature ?features .
?reviews bsbm:reviewFor ?prodCand .
?reviews bsbm:rating3 1 .
?prodCand rdfs:label ?labCand .
} ORDER BY ?nb;
start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> labelcand
, prodcand <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sreviewFor_s]- reviews -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_srating3_s]-> ratings
where ratings.val='1'
return count(ratings) as nb, labelcand.val
order by nb;
On SPARQL (rating is an integer): 50 ms on average. On Neo4j, some 6 s at start, then some 100 ms. Let's test it with any rating (here you can see that using strings for everything was stupid):
start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> labelcand
, prodcand <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sreviewFor_s]- reviews -[rel]-> ratings
where ratings.val='1'
and type(rel)=~'_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_srating._s'
return count(ratings) as nb, labelcand.val
order by nb;
A regexp, nice; there is a lot more in the doc, and certainly other nice extensions in Virtuoso. Obviously the regexp has a cost: now 500 ms (but more results too).
- A native RDF database seems faster for RDF than a graphDB (I still think it should be pretty close with my naïve modeling, but the second modeling, which I did not test, should be a mess). And yet, considering that I am only misperimenting, the results are highly subjective.
- Queries 3 and 4 work by counting relations; no coefficients/weights are applied. It is a kind of map/reduce, yet very basic. An advantage for the graphDB could have been found with social-graph-like queries: similarity given any relations, with a distance limit and any similarity measure (plus some shortest paths...)! That means using a flexible/evolutive modeling (some metadata still seems required: weights...). But flexible and evolutive should mean using the second modeling for unrestricted qualifying. Problem: the Cypher query language is limited for unqualified double relations (the language is designed to manage direct relations). So like I said in part 1, I still miss a hybrid between a graphDB and an RDF store.
- The difference in query 3 might be the only thing that is not a misperiment...
Very interesting post. You missed one aspect of the modeling though.
Even if it is all triples in RDF, in a property graph you have both properties and relationships. So those values that are actually properties would be put into node properties in the graph database, and not attached via relationships to other nodes.
I would also use easier-to-read rel-types, as they make the queries actually readable, and keep a URI-to-rel-type mapping somewhere.
Would love to see another version of this that uses a more friendly modeling in the graph database, and then put it onto the new neo4j.org/develop/linked_data page.
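As a rough illustration of the modeling suggested in this comment, here is a Python sketch, not a tested Neo4j import: literal values become node properties, URI objects become relationships with a readable type, and a side table keeps the type-to-URI mapping. The function names and the choice of the URI local name as rel-type are assumptions.

```python
import re


def local_name(uri):
    """Readable name: the fragment or last path segment of the URI."""
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


def to_property_graph(triples):
    """triples: iterable of (subject_uri, predicate_uri, obj, obj_is_uri)."""
    nodes = {}      # uri -> dict of literal properties
    rels = []       # (subject_uri, rel_type, object_uri)
    rel_types = {}  # rel_type -> full predicate URI
    for s, p, o, o_is_uri in triples:
        nodes.setdefault(s, {})
        name = local_name(p)
        if o_is_uri:
            nodes.setdefault(o, {})
            rels.append((s, name, o))
            rel_types[name] = p
        else:
            # A repeated literal predicate overwrites the previous value here.
            nodes[s][name] = o
    return nodes, rels, rel_types
```

Two predicates from different namespaces with the same local name would collide in this sketch, so a real mapping would need a prefix or an explicit table.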