@tomzeppenfeldt
Last active August 29, 2015 14:01
Hierarchical facets
= Using hierarchical facets
We have a use case with documents that are tagged with keywords from a thesaurus. This gist explains the model and is at the same time an invitation to suggest improvements, because it would be nice to have something that performs better.
NOTE: In this example we have a quite regular thesaurus. In real life the thesaurus has branches of varying depth, and docs are tagged with both leaf and non-leaf nodes (i.e. nodes without and with children, respectively).
== The model
//setup
[source,cypher]
----
CREATE (doc1:`doc` {`name`:"doc 1"})
CREATE (doc2:`doc` {`name`:"doc 2"})
CREATE (doc3:`doc` {`name`:"doc 3"})
CREATE (doc4:`doc` {`name`:"doc 4"})
CREATE (doc5:`doc` {`name`:"doc 5"})
CREATE (root:`term` {`name`:"root"})
CREATE (term2:`term` {`name`:"term 2"})
CREATE (term3:`term` {`name`:"term 3"})
CREATE (term4:`term` {`name`:"term 4"})
CREATE (term5:`term` {`name`:"term 5"})
CREATE (term6:`term` {`name`:"term 6"})
CREATE (term7:`term` {`name`:"term 7"})
CREATE (term2)-[:BT]->(root)
CREATE (term3)-[:BT]->(root)
CREATE (term4)-[:BT]->(term2)
CREATE (term5)-[:BT]->(term2)
CREATE (term6)-[:BT]->(term3)
CREATE (term7)-[:BT]->(term3)
CREATE (doc1)-[:HAS_TERM]->(term2)
CREATE (doc1)-[:HAS_TERM]->(term4)
CREATE (doc1)-[:HAS_TERM]->(term5)
CREATE (doc2)-[:HAS_TERM]->(root)
CREATE (doc2)-[:HAS_TERM]->(term3)
CREATE (doc3)-[:HAS_TERM]->(term6)
CREATE (doc3)-[:HAS_TERM]->(term4)
CREATE (doc3)-[:HAS_TERM]->(term2)
CREATE (doc4)-[:HAS_TERM]->(term7)
CREATE (doc5)-[:HAS_TERM]->(term2)
CREATE (doc5)-[:HAS_TERM]->(term3)
----
//graph
== Get docs for each term
=== DIRECT
This query returns the docs to which each term is directly linked (the doc has a HAS_TERM relationship to the term itself).
[source,cypher]
----
MATCH (d:doc)-[:HAS_TERM]->(t:term)
RETURN t.name AS term,collect(DISTINCT d.name) AS docs ORDER BY term
----
//table
=== DIRECT and INDIRECT
This query returns the docs to which each term is linked, both directly (the doc is tagged with the term itself) and indirectly (the doc is tagged with one of the term's more detailed terms, at any depth).
[source,cypher]
----
MATCH (d:doc)-[:HAS_TERM|BT*0..]->(t:term)
RETURN t.name AS term,collect(DISTINCT d.name) AS docs ORDER BY term
----
//table
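The combined pattern `[:HAS_TERM|BT*0..]` relies on the fact that, in this model, a path from a doc can only consist of one HAS_TERM step followed by zero or more BT steps. A minimal sketch that spells this out explicitly is shown below; it assumes the model above (HAS_TERM only from docs to terms, BT only between terms) and should return the same rows on this data set. Whether it performs any differently has not been measured.
[source,cypher]
----
// One HAS_TERM step to the tagged term, then zero or more BT steps up to t
MATCH (d:doc)-[:HAS_TERM]->(:term)-[:BT*0..]->(t:term)
RETURN t.name AS term, collect(DISTINCT d.name) AS docs ORDER BY term
----
//table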
== Get count for root and its children
[source,cypher]
----
MATCH (d:doc)-[:HAS_TERM|BT*0..]->(t:term)-[:BT*0..1]->(t2:term {name:"root"})
RETURN t.name AS term, count(DISTINCT d) AS docs ORDER BY docs DESC
----
//table
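A possible variant of the query above anchors the pattern on the facet term first and then walks down to the docs, so that the traversal can start from the single "root" node instead of from every doc. This is only a sketch: the planner is free to choose its own starting point, and no timing has been done. On the sample data it should give the same counts as the original query (5 for root, 4 for term 3, 3 for term 2).
[source,cypher]
----
// Step 1: the facet term itself plus its direct children
MATCH (:term {name:"root"})<-[:BT*0..1]-(t:term)
// Step 2: for each of those, all docs tagged with t or any of its descendants
MATCH (t)<-[:BT*0..]-(:term)<-[:HAS_TERM]-(d:doc)
RETURN t.name AS term, count(DISTINCT d) AS docs ORDER BY docs DESC
----
//table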
== Get count for "term 3" and its children
[source,cypher]
----
MATCH (d:doc)-[:HAS_TERM|BT*0..]->(t:term)-[:BT*0..1]->(t2:term {name:"term 3"})
RETURN t.name AS term, count(DISTINCT d) AS docs ORDER BY docs DESC
----
//table
== Question
The main question: is there a faster way based on pure Neo4j? Trying this approach with 10k docs, a thesaurus with 12k terms and 1.2M [:HAS_TERM] relationships takes over 20 seconds when getting counts for "root" and its direct children.
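One direction that stays within pure Neo4j is to trade write-time work for read-time speed by materializing the indirect links once. The sketch below is an assumption, not a measured solution: the relationship type `IN_FACET` is a hypothetical name, and the extra relationships would have to be maintained whenever docs are retagged or the thesaurus changes.
[source,cypher]
----
// Hypothetical denormalization: link each doc to every term it falls under,
// i.e. the tagged term itself and all of its ancestors via BT
MATCH (d:doc)-[:HAS_TERM]->(:term)-[:BT*0..]->(t:term)
MERGE (d)-[:IN_FACET]->(t)
----
With those relationships in place, the facet counts no longer need a variable-length traversal on the doc side:
[source,cypher]
----
// Counts for "root" and its direct children using the precomputed links
MATCH (:term {name:"root"})<-[:BT*0..1]-(t:term)<-[:IN_FACET]-(d:doc)
RETURN t.name AS term, count(DISTINCT d) AS docs ORDER BY docs DESC
----
//table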