Notes on testing Neo4j and Virtuoso

MisperimentOfGraphDBRDF.md
Markdown

Misgoals of this misperiment

The first goal is to describe a little what could be called misperimenting with graphDB, RDF, and all the fun stuff, without being too formal, and (I am very sorry for that) with some misenglish everywhere.

And of course, if it could help or inspire anyone to run more useful tests, that would be great (I did this a few months ago but have had no time to pursue it).

A graphDB for everything

There are not so many graphDB implementations, which is quite odd considering the hype around the NoSQL idea and the current social graph applications.

Yet another great 'not so fast to grasp' concept of these last years is this Semantic Web thing.

Honestly, representation apart, I admit liking the RDF concept without knowing a lot about the subject. It is a simple enough brick to be a solid foundation: the kind of generic/universal representation you work your way towards, just to see it already exists.

OK, I want to graphDB and I want to RDF, so let's try both at once. RDF is just a graph representation (like XML), so it is certainly easy to put some in a graphDB.

So, looking at graphDBs with some commercial usage and recent activity, I quickly went to look at Neo4j: certainly not my first choice (the JVM and I regularly exchange some rough words) but certainly my most mainstream choice (oh, there is some Scala in the source of its query language, it shouldn't be that bad after all), and the doc is good.

A rough idea of performance

RDF is a standard with quite a lot of tools and some really nice initiatives. Although I am not a user, I'm jealous of the KDE way of metadataing everything through some RDF backend. So what better choice, to get a rough idea of how my tests perform, than the quad store of Virtuoso as a point of comparison (when I say compare, I mean get a first look, not benchmark).

Getting a grasp of the query languages

So between Neo4j and Virtuoso there are two different query languages: a standard one for RDF (SPARQL) and the Neo4j graph query language, Cypher. SPARQL has an SQL-like taste which makes it really easy, and Cypher is quite intuitive for directed graphs (I love using arrows in my languages).

Conclusion part 1

Yes, putting a conclusion here is obviously wrong, but I need to spoil a little, otherwise nobody will read everything.

  • I like Neo4j, and Cypher is really intuitive.
  • For pure RDF I should use Virtuoso.
  • Translating SPARQL to Cypher is quite easy in my simple tests.
  • Graphs still miss a point for RDF (otherwise what use would we have for RDF stores): the qualifier of a relation between two nodes is not another node. There is a way to represent RDF more truthfully (with the description between two resources being a third resource), but it would obviously be less efficient than my naïve method.
  • Being able to describe your description is what I like in RDF, but in real life it is far from being the common use case. Just to say, I still think I need my graphDB where every edge is natively usable as a node, and machine compiled (yes, I know it is not really a graphDB).

The misperiment

For the dataset and use case, I chose an RDF benchmark, so the data and the SPARQL queries are already there (I made some minor changes to the SPARQL queries): the Berlin SPARQL Benchmark.

RDF to Graph design

This is a design I do not like, but it should be more efficient than the others :


   +----------+        description      +----------+
   |Resource 1|------------------------>|Resource 2|
   +----------+                         +----------+

to

   +-----+             relation         +-----+
   |node1|----------------------------->|Node2|
   +-----+                              +-----+

The problem is that a description in RDF can be used as a resource, while a relation of a graph cannot be used as a node. In fact, using a relation as a node could still be done programmatically by comparing a relation name with a node name. Indexes (see next part) might be added.

Yet a correct (and highly inefficient) graph representation could have been :

                       +--------+
                     _>|relation|_         
                   _/  +--------+ \_       
  "from" ++ uid __/                 \_  "to" ++ uid
             __/                      \_   
  +-----+  _/                           \  +-----+
  |node1|_/                              \>|node2|
  +-----+                                  +-----+

It is obvious that the need to compare edge uids to resolve a relation is way too costly.
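To make the difference between the two designs concrete, here is a minimal Haskell sketch. The types Triple and Edge and the functions direct and reified are made up for illustration and belong to no library; the edge labels follow the "from" ++ uid and "to" ++ uid convention of the drawing above.

```haskell
-- Hypothetical types, for illustration only.
type Triple = (String, String, String)   -- (subject, predicate, object)

data Edge = Edge
  { edgeFrom  :: String
  , edgeLabel :: String
  , edgeTo    :: String
  } deriving (Eq, Show)

-- Naive design: the predicate becomes the name of the relation.
direct :: Triple -> [Edge]
direct (s, p, o) = [Edge s p o]

-- "Correct" design: the predicate is a node of its own, reached through
-- two edges that share a uid so that the pair can be matched up again.
reified :: Int -> Triple -> [Edge]
reified uid (s, p, o) =
  [ Edge s ("from" ++ show uid) p
  , Edge p ("to" ++ show uid) o
  ]
```

With the second design every triple costs two edges, and reading a triple back means finding the "to" edge whose uid matches the "from" edge, which is the comparison cost mentioned above.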

To keep it simple I do not care about namespaces, and I use a dirty character escaping for graph relations (not all URI characters are allowed, but it is true I should have used a metadata description; I remember seeing in the doc the right way to store those escaped characters, but I did those tests in December and do not remember it).
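As a sketch of that dirty escaping, the function below rebuilds the relation names that appear in the Cypher queries of this document. The escape codes are the ones used by the strip helper in NeoSimple.hs; the names escapeChar and escapeRelType are mine, and the version here works on plain String instead of Text so that it needs nothing beyond the Prelude.

```haskell
-- Escape one character. Codes taken from the strip helper in NeoSimple.hs.
escapeChar :: Char -> String
escapeChar c = case c of
  '_' -> "__"
  ':' -> "_c"
  '/' -> "_s"
  '>' -> "_s"   -- same code as '/', so the escaping cannot be reversed
  '#' -> "_d"
  '<' -> "_i"
  '.' -> "_p"
  '-' -> "_m"
  _   -> [c]

-- Turn a URI, as printed in N-Triples, into a relation name.
escapeRelType :: String -> String
escapeRelType = concatMap escapeChar
```

For example, escapeRelType "<http://www.w3.org/2000/01/rdf-schema#label>" gives _ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s. Because '/' and '>' share one code, a mapping table from URI to relation name would be needed to go back the other way.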

Feeding virtuoso

The tricky part here was to find which part of Virtuoso deals with the RDF quad store; for my needs I would have liked a separate, quad-store-only distribution of Virtuoso.

After finding the right part in the doc, it was simple and fast:

  • Install the virtuoso package for Arch Linux (6.1.6-1 at the time).
  • Start it dirty: touch virtuo.ini, then sudo virtuoso-t +foreground +configfile /var/lib/virtuoso/db/virtuoso.ini, and the server is online on port 1111 (when using the Haskell API you need to use this port).
  • Then access it through the webadmin.
  • Init some data (don't delete it after import, you will need it to feed Neo4j): see the data generator with options -s nt -pc 1000 -fc -dir . -fn out, aka size is only 1000 (I use N-Triples because it can be used by Swish and Virtuoso too).
  • Then import it via the quad store upload in the previous Conductor webapp: 90 megs injected in about 2-3 minutes, in /dav/test1.

Feeding neo4j

Here I made THE wrong choice: I should have read the documentation about mass import of data and generated an input with a short Haskell program; instead I used the only Haskell-related Neo4j library on Hackage.

The cypher Haskell library is designed for sending Cypher queries to Neo4j through its REST API. So in my design I just create node by node, relation by relation, with a REST HTTP connection (and transaction) for each one -> indeed very slow (it was still interesting to use Swish in conjunction with cypher). TODO link to source. Anyway this code is a mess, but it shows how simple it is to transcribe RDF to a graph.

So my Haskell program reads the previously generated dataset (through the Swish parser) and creates all that very, very slowly.
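For comparison, a mass import would first turn the whole dataset into a node table and a relation table, and only then hand them to the database in one go. The sketch below shows that preparation step only. The Triple type, the function names and the layout of the two tables are hypothetical, and they are not the input format of any particular Neo4j import tool. Note that it merges identical literals into one node, which the REST-based program does not do.

```haskell
import Data.List (elemIndex, nub)
import Data.Maybe (mapMaybe)

-- Hypothetical type: (subject, predicate, object), all as strings.
type Triple = (String, String, String)

-- Every distinct subject or object becomes one node; its position is its id.
-- nub is quadratic, good enough for a sketch.
nodeTable :: [Triple] -> [String]
nodeTable ts = nub (concat [[s, o] | (s, _, o) <- ts])

-- One (fromId, relation, toId) row per triple.
relTable :: [Triple] -> [(Int, String, Int)]
relTable ts = mapMaybe row ts
  where
    nodes = nodeTable ts
    row (s, p, o) = do
      i <- elemIndex s nodes
      j <- elemIndex o nodes
      return (i, p, j)
```

The point is that node lookups happen in memory, so there is no "select, then maybe create" round trip per resource.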

Please note that to keep it simple I typed nothing in Neo4j (everything is a string), so the import is very generic but very wrong too (I could have done it for ints at least).
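Typing the integers at least could look like the sketch below. The Value type and the typed function are made up for illustration; they are not part of the import program, which stores every literal as a string.

```haskell
-- Hypothetical value type: an integer when the literal parses as one,
-- otherwise the original string.
data Value = IntVal Int | StrVal String
  deriving (Eq, Show)

-- reads succeeds with an empty remainder only when the whole
-- literal is an integer.
typed :: String -> Value
typed s = case reads s of
  [(n, "")] -> IntVal n
  _         -> StrVal s
```

A literal such as "1068" would then be stored as a number, and a label such as "nonradical warehousing" would stay a string.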

First query

The Berlin SPARQL use cases lack some deep graph crawling (they are more real world, and sometimes they have a disturbing SQL flavor), so I just chose some easy cases, and my first query is the most basic one: 'Query 1 from case 1'.

By comparing the query in SPARQL and in Cypher (the first one is SPARQL), you can see how close they are (it will be more obvious in the next queries).

PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?product ?label ?value1
WHERE { 
 ?product rdfs:label ?label .
 ?product a bsbm-inst:ProductType58 .
 ?product bsbm:productFeature bsbm-inst:ProductFeature182 . 
 ?product bsbm:productFeature bsbm-inst:ProductFeature179 . 
?product bsbm:productPropertyNumeric6 ?value1 .  
    }
ORDER BY ?label
LIMIT 10

And in Cypher:

start type=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductType58>')
match type <-[:_ihttp_c_s_swww_pw3_porg_s1999_s02_s22_mrdf_msyntax_mns_dtype_s]- product -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> label
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]-> feature1
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]-> feature2
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyNumeric6_s]-> num1
where feature1.val='<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductFeature182>'
and feature2.val='<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/ProductFeature179>'
return product.val,label.val,num1.val
order by label.val 
limit 10;

An interesting thing is that Cypher is graph related and needs a start node: in this case the product type. Also nice is this arrow syntax, which stands for a directed edge. So obviously, to select the start node we need an index (the Haskell import program uses queries which rely on the auto-index). That is Lucene indexing, and I think it may explain the lower performance of every first query (after restarting the Neo4j service).

Queries on Virtuoso were run with the isql command (something like isql-vt 1111 errors=stdout <virtq1 >test.out). Queries on Neo4j were run through the webadmin console.

I got similar results :

+------------------------------------------------------------------------------------------------------------------------------------------+
==> | "<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer19/Product897>" | "nonradical warehousing"         | "1068"   |
==> | "<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer21/Product996>" | "skivvies opportunism knavishly" | "831"    |

284 ms the first time, then more or less 40 ms for Neo4j (maybe 20 ms), versus some 10 ms on Virtuoso. But looking at the query, this is something which would do fine on a relational database.

Query 2

More interesting (some 90 ms on virtuoso) :

SPARQL PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?label ?comment ?producer ?productFeature ?propertyTextual1 ?propertyTextual2 ?propertyTextual3
 ?propertyNumeric1 ?propertyNumeric2 ?propertyTextual4 ?propertyTextual5 ?propertyNumeric4 
WHERE {
 <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  rdfs:label ?label .
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  rdfs:comment ?comment .
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:producer ?p .
    ?p rdfs:label ?producer .
 <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  dc:publisher ?p . 
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productFeature ?f .
    ?f rdfs:label ?productFeature .
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyTextual1 ?propertyTextual1 .
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyTextual2 ?propertyTextual2 .
 <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyTextual3 ?propertyTextual3 .
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyNumeric1 ?propertyNumeric1 .
    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyNumeric2 ?propertyNumeric2 .
    OPTIONAL { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyTextual4 ?propertyTextual4 }
 OPTIONAL { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyTextual5 ?propertyTextual5 }
 OPTIONAL { <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productPropertyNumeric4 ?propertyNumeric4 }
};

And on Neo4j, from 1620 to 353 ms at start, then 300 ms most of the time (some odd 26 or 40 ms may be related to cache):

start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> label
, product -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dcomment_s]-> comment
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproducer_s]-> producer -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> producerlab
, product -[:_ihttp_c_s_spurl_porg_sdc_selements_s1_p1_spublisher_s]-> publisher
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> featurelabels
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual1_s]-> ptext1
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual2_s]-> ptext2
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual3_s]-> ptext3
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyNumeric1_s]-> pnum1
, product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyNumeric2_s]-> pnum2
, product -[?:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual4_s]-> ptext4
, product -[?:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductPropertyTextual5_s]-> ptext5
return label.val,comment.val,producerlab.val,featurelabels.val,ptext1.val,ptext2.val,ptext3.val,pnum1.val,pnum2.val,ptext4.val,ptext5.val

So nothing more to say than before: test conditions are not optimal to compare both products. Yet something interesting was the impact of the 'optional' conditions ('?' in Cypher): without them, the Neo4j queries took between 22 and 43 ms, which suggests that optional matches have a big cost.

Query 3

Something more modern (and not from the standard benchmark): for a given product I try to find similar products depending on offers (features).

SPARQL PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT count(?features2)as ?nb, ?labCand 
WHERE {
 <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productFeature ?features .
 ?prodCand  bsbm:productFeature ?features .
 ?prodCand  bsbm:productFeature ?features2 .
 ?prodCand2  bsbm:productFeature ?features2 .
 ?prodCand2 rdfs:label ?labCand
} ORDER BY ?nb;

And in Cypher:

start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand 
, prodcand -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features2 <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand2 -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> labelcand
return count( features2) as nb, labelcand.val 
order by nb;

Those queries are two levels deep. With only one level of features we get some 25 ms in SPARQL, 50 ms with Neo4j, and the same set of results.

With two levels, it is really interesting to note that the results differ between the two queries. On Neo4j, it goes from 20 seconds down to 15 or 11 seconds, with a count of 2290 and only 20 more results than for one level. With SPARQL, some 5.8 seconds but a count of 3145 for the best result.
The bias seems to come from the fact that products found during the first-level inspection are counted again in the second inspection when using SPARQL. When using Neo4j, nodes from the first-level inspection are not inspected again at the second level.
This is very important to note: Neo4j is closest to what I was looking for. With my SPARQL query, the most relevant results are the results of the first level of inspection (those I got when running the query on only one level of features).
Knowing whether we can derive the Neo4j results from the SPARQL results, and the other way around, is a not so easy question; I think we could at least approximate (which is bad in case of a deeper inspection). A nice question, but a little too mathematical for me (at this hour for sure).
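One possible reading of this difference, given as an assumption rather than a checked fact: Cypher does not reuse the same relationship twice inside one match, while SPARQL happily binds the same triple to several patterns. The sketch below replays the two-level pattern on a tiny made-up product/feature graph and counts the bindings under both rules. The data, the Edge type and the function names are all hypothetical.

```haskell
import Data.List (nub)

-- Hypothetical edge: (product, feature).
type Edge = (String, String)

edges :: [Edge]
edges = [("p0", "f1"), ("p1", "f1"), ("p1", "f2"), ("p2", "f2")]

-- All bindings of the pattern: start -> f <- c -> f2 <- c2,
-- each binding being the list of the four edges it uses.
matches :: String -> [[Edge]]
matches start =
  [ [e1, e2, e3, e4]
  | e1@(p, f)   <- edges, p == start
  , e2@(c, f')  <- edges, f' == f
  , e3@(c', f2) <- edges, c' == c
  , e4@(_, f2') <- edges, f2' == f2
  ]

-- SPARQL-like rule: every binding counts, even if it walks an edge twice.
sparqlLike :: String -> Int
sparqlLike = length . matches

-- Cypher-like rule (assumed): the four edges must be pairwise distinct.
cypherLike :: String -> Int
cypherLike = length . filter allDistinct . matches
  where
    allDistinct es = length (nub es) == length es
```

On this toy graph sparqlLike "p0" is 6 and cypherLike "p0" is 1: the five extra SPARQL-like bindings all walk back over an edge already used, which is the kind of double counting described above.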

Query 4

Last attempt to do something with this dataset: same as the third query, but depending on offers, similar features, and number of reviews (it is simpler to read the query than to try to understand my sentence).

SPARQL PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT count(?reviews) as ?nb, ?labCand 
WHERE {
 <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>  bsbm:productFeature ?features .
 ?prodCand  bsbm:productFeature ?features .
 ?reviews bsbm:reviewFor ?prodCand .
 ?reviews  bsbm:rating3 1 .
 ?prodCand rdfs:label ?labCand .
} ORDER BY ?nb;

And in Cypher:

start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> labelcand
, prodcand <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sreviewFor_s]- reviews -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_srating3_s]-> ratings
where ratings.val='1'
return count(ratings) as nb, labelcand.val
order by nb;

On SPARQL (rating is an integer): 50 ms on average. On Neo4j, some 6 s at start, then some 100 ms. Let's test it with any rating (here you can see that using strings for everything was stupid):

start product=node:node_auto_index(val = '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer8/Product343>')
match product -[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s] -> features <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sproductFeature_s]- prodcand -[:_ihttp_c_s_swww_pw3_porg_s2000_s01_srdf_mschema_dlabel_s]-> labelcand
, prodcand <-[:_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_sreviewFor_s]- reviews -[rel]-> ratings
where ratings.val='1'
and type(rel)=~'_ihttp_c_s_swww4_pwiwiss_pfu_mberlin_pde_sbizer_sbsbm_sv01_svocabulary_srating._s'
return count(ratings) as nb, labelcand.val 
order by nb;

A regexp, nice; there is a lot more in the doc, and certainly other nice extensions in Virtuoso. Obviously the regexp has a cost: now 500 ms (but more results too).

Conclusion part 2

  • A native RDF database seems faster for RDF than a graphDB (I still think it should be pretty close with my naïve modeling, but the second modeling, which I did not test, should be a mess). And yet, considering that I am only misperimenting, the results are highly subjective.
  • Queries 3 and 4 are done by counting relations; no coefficient/weight is applied. It is a kind of map/reduce, yet a very basic one. Finding an advantage for graphDB usage could have been achieved with social-graph-like queries: similarity given any relation, with a limit of distance and any similarity (plus some shortest paths...)! That means using a flexible/evolutive modeling (some metadata still seems required: weights...). But flexible and evolutive should mean using the second modeling for unrestricted qualifying. The problem there is the limitation of the Cypher query language for unqualified double relations (the language is designed to manage direct relations). So, like I said in part 1, I still miss a hybrid between a graphDB and an RDF store.
  • The difference in query 3 might be the only thing here which is not a misperiment...
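
The counting done by queries 3 and 4 can be pictured as a tiny map/reduce. The sketch below is a one-level version over an in-memory edge list. The data layout and the name similar are made up for illustration, and the start product itself is left out of the result, which the queries above do not do.

```haskell
import Data.List (group, sort, sortOn)

-- Hypothetical edge: (product, feature).
type Edge = (String, String)

-- For each other product, count the features it shares with the start
-- product, least similar first (like ORDER BY ?nb).
similar :: [Edge] -> String -> [(String, Int)]
similar es start =
  sortOn snd [ (c, length g) | g@(c : _) <- group (sort candidates) ]
  where
    -- "map" step: one candidate occurrence per shared feature
    feats      = [f | (p, f) <- es, p == start]
    candidates = [c | (c, f) <- es, c /= start, f `elem` feats]
```

Adding weights, as suggested above, would mean replacing length g by a sum over per-relation coefficients.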
NeoSimple.hs
Haskell
{-# LANGUAGE OverloadedStrings #-}
 
module NeoSimple(
storeNode,
storeLit,
storeArc,
storeArcLit
) where
 
import Database.Cypher
import Network.HTTP.Conduit
import Control.Exception(bracket)
import Data.Aeson.Types
import Data.Text(pack,dropAround)
import Data.HashMap.Strict(empty)
import Data.Text(Text,replace)
import Control.Monad.IO.Class(liftIO)
import Control.Arrow(first)
import Control.Monad(liftM)
 
 
 
selectNode = "START ret=node:node_auto_index(val = '#') WHERE ret.type! = 'res' RETURN ret"
 
createLit = "CREATE n = {val: '#',type: 'lit'} return n"
createNode = "CREATE n = {val: '#',type: 'res'} return n"
createArc = "START a = node(#n1#), b = node(#n2#) CREATE a-[r:#rname# {val: '#rname#',type: 'prop'}]-> b return r"
 
 
 
 
 
toCypher :: Text -> Cypher (CypherVals (Entity (Object)))
toCypher t = cypher t $ Object empty
myCypher :: Cypher (CypherVals (Entity (Object)))
myCypher = cypher "start n=node(0) return n" $ Object empty
--myCypher = cypher "\"query\":\"start n=node(0) return n\"" $ Null
 
-- TODO use lookup and maybe, and replace acc by a takeWhile over a reverse
getIndex :: CypherVals (Entity Object) -> Text
getIndex (CypherVals (ent:[])) = pack $ reverse $ takeWhile (/= '/') $ reverse $ entity_id ent
getIndex (CypherVals (ent:_)) = "error more than one node"
getIndex (CypherVals _) = "error getting index"
 
storeNode :: Text -> Cypher Text
storeNode t = do
    -- Look the resource up first; create it only when the index has no match.
    res <- toCypher $ replace "#" t selectNode
    --liftIO $ print res
    ret <- create t res
    --liftIO $ print $ replace "#" t createNode
    return $ getIndex ret
  where
    create t (CypherVals []) = toCypher $ replace "#" t createNode
    create t r = return r
 
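storeLit :: Text -> Cypher Text
--storeLit t = (liftIO $ print $ replace "#" t createNode) >> return "todo"
-- NOTE: this uses createNode, so literals are stored with type 'res';
-- createLit above is never used.
storeLit t = liftM getIndex $ toCypher $ replace "#" t createNode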
 
storeArc :: Text -> Text -> Text -> Cypher Text
storeArc s p o =
    --liftIO $ print $ replace "#n1#" s $ replace "#n2#" o $ replace "#rname#" (strip p) createArc
    liftM getIndex $ toCypher $ replace "#n1#" s $ replace "#n2#" o $ replace "#rname#" (strip p) createArc
  where
    -- Escape the URI characters that are not allowed in a relation name.
    strip t = replace "-" "_m" $ replace "." "_p" $ replace "<" "_i" $ replace "#" "_d" $ replace ">" "_s" $ replace "/" "_s" $ replace ":" "_c" $ replace "_" "__" t
-- TODO fmap or foldr over table of pair for strip
storeArcLit :: Text-> Text -> Text -> Cypher Text
storeArcLit = storeArc
neoinject.hs
Haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE OverloadedStrings #-}
import Swish.RDF.Parser.NTriples
import Swish.RDF.Graph(NSGraph(..), Arc(..), RDFLabel(..), emptyFormulaMap)
import Data.Text.Lazy.IO
import System.IO(IOMode(..),withFile)
import Control.Monad(liftM)
import qualified Data.Text.Lazy as T
import qualified Data.Map as M
import qualified Data.Set as S
import Data.Foldable(traverse_)
import NeoSimple
import Network.HTTP.Conduit
import Data.Text(Text,pack)
import Database.Cypher
import Control.Exception(Exception, bracket, catch, throw, throwIO)
import Control.Monad.IO.Class(liftIO)
import System.Directory(copyFile, getDirectoryContents, removeFile, setCurrentDirectory)
import Data.Typeable(Typeable)
 
-- Note: System.Directory does not expand "~"; put absolute paths here.
main :: IO ()
main = parseDir "~/bsbmtools-0.2/out3/" "~/bsbmtools-0.2/out2/"
 
dbInfo = DBInfo "127.0.0.1" 7474
 
disp :: ParseResult -> IO()
disp (Left t) = print t
disp (Right g) = print g
 
parseDir :: String -> String -> IO ()
parseDir dir mvdir = setCurrentDirectory dir >> getDirectoryContents "." >>= mapM_ (parseFile mvdir) . filter (`notElem` [".",".."])
 
parseFile :: String -> String -> IO () -- TODO use readFile
parseFile moveLocation fileName = print ("Importing file" ++ fileName) >> withFile fileName ReadMode (\h -> hGetContents h >>= (return . parseNT) >>= actions) >> copyFile fileName (moveLocation ++ fileName) >> removeFile fileName
 
actions :: ParseResult -> IO()
actions (Left t) = print $ "Error parsing input file : " ++ t
actions (Right (NSGraph ns fm st)) | M.null fm = bracket (newManager def) (\man -> traverse_ (\s -> runECypher (actionArc s) dbInfo man) st >> return ()) (closeManager) -- return () to force runCypher need a synchro to fork -> do not use forkCypher : reimplement
actions (Right (NSGraph ns fm st)) = print "no support for graph statement"
 
toText = pack . show
newtype CpherException = CpherException String
  deriving (Show, Typeable)
 
instance Exception CpherException
 
runECypher a b c = catch (runCypher a b c) (\ e -> print $ "Error : " ++ (show (e::CpherException)))
 
--actions t = seq (parseNT t) (return())
-- TODO foldr over the actions, returning a Maybe (see cypher error)
actionArc :: Arc RDFLabel -> Cypher()
actionArc a@(Arc s p o) = do
    -- Store (or find) both ends first, then link them with the predicate.
    ids <- actionRes s
    ido <- actionRes o
    arc <- doArc ids p ido
    --liftIO $ print arc
    return ()
  where
    doArc ids (Res rp) ido = storeArc ids (toText rp) ido
    doArc s p o = return $ pack $ "Unsupported arc : " ++ show a
 
actionRes :: RDFLabel -> Cypher Text
--actionRes r = (liftIO $ print r) >> actionRess r
actionRes r = actionRess r
actionRess :: RDFLabel -> Cypher Text
actionRess (Blank _) = storeLit ""
actionRess NoNode = storeLit ""
actionRess (Var text) = storeLit $ pack text
actionRess (TypedLit text _) = storeLit text
actionRess (Lit text) = storeLit text
actionRess (LangLit text _) = storeLit text
actionRess (Res scName) = storeNode $ toText scName
--actionRess n = (liftIO $ print $ "Unsupported node : " ++ show n) >> (throw $ CpherException "Error Node")

Very interesting post. You missed one aspect of the modeling though.

Even if it is all triples in RDF, in a property graph you have both properties and relationships. So those values that are actually properties would be put into node properties in the graph database, and not via relationships into other nodes.

I would also use easier-to-read rel-types, as they make the queries actually readable, and keep a URI-rel-type mapping somewhere.

Would love to see another version of this that uses a more friendly modeling in the graph database, and then put it onto the new neo4j.org/develop/linked_data page.

Very interesting. I use Virtuoso for Linked Data projects and I found some kinds of queries (e.g. top-k using 3-4 levels of relationships and complex rating) where it does not work very well (or which are not possible at all). I would like to better understand the limits and when one is better than the other.
A good start could be http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark
