rvanbruggen/1-browser_guide-contacttracing_with_relationship_indexes.mdx

## 1-browser_guide-contacttracing_with_relationship_indexes.mdx

      
Revisiting contact tracing with Neo4j 4.3's relationship indexes

Neo4j 4.3 has just been released. One of its key features is relationship property indexes - a really interesting addition.
Two main points of attention:

Performance improvements: the Neo4j Cypher query planner can now use a lot more information, provided by these relationship indexes. The planner becomes smarter - and therefore queries that filter on relationship properties can become faster. We will explore this below.
Modelling implications: the introduction of these indexes will have far-reaching implications with regard to how we model certain things. More options are good, of course!

Create a synthetic contact tracing graph - size of Antwerp

Similar to the work I did last year on contact tracing. Take a look at http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html to see how that went.

Using the faker plugin
Download it from the GitHub page. Installing is super easy: just make sure the config is updated too, so that fkr.* is whitelisted just like we do with gds.* and apoc.*.
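A quick way to check that the plugin is loaded and whitelisted is to call one of its functions on its own. This is only a minimal sketch: it reuses fkr.stringElement exactly as it is used further down, and the alias sampleStatus is just an illustrative name. If the config is not right, the call should fail instead of returning a value.

```cypher
// Sanity check: should return either "Sick" or "Healthy"
// when the faker plugin is installed and whitelisted.
RETURN fkr.stringElement("Sick,Healthy") AS sampleStatus;
```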

The only difference with last year's work: pushing the scale up to the size of my home city of Antwerp, Belgium.
Create 500000 persons

You need to have enough memory - but it should be possible to do this in one transaction.
foreach (i in range(1,500000) |
    create (p:Person { id : i })
    set p += fkr.person('1940-01-01','2021-06-01')
    set p.healthstatus = fkr.stringElement("Sick,Healthy")
    set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    set p.birthDate = datetime(p.birthDate)
    set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    set p.name = p.fullName
    remove p.fullName
);
Create 10000 places

Adding the places is instantaneous:
foreach (i in range(1,10000) |
    create (p:Place { id: i, name: "Place nr "+i})
    set p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
    set p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
);
Put some indexes in place on the NODES

Don't really need them for the queries in this demo - but they could be useful for other queries, and the id indexes should help the id lookups when loading the visits.
CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
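Indexes are populated in the background, so it can be worth checking that they are online before moving on. A small sketch, assuming the db.awaitIndexes procedure and the SHOW INDEXES command behave in your version as they do in recent 4.x releases (the set of columns you can YIELD may differ between versions):

```cypher
// Wait until all indexes have finished populating.
CALL db.awaitIndexes();
// List the indexes and their state; every row should report ONLINE.
SHOW INDEXES YIELD name, entityType, labelsOrTypes, properties, state;
```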

Add 1500000 random visits to places

Using periodic committing of transactions. 89 seconds is not bad!
CALL apoc.periodic.iterate(
    'with range(1,1500000) as range
        unwind range as iteration return iteration', 
    'match (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id:toInteger(rand()*10000)+1 })
        create (p)-[:PERFORMS_VISIT]->(v:Visit { id: iteration})-[:LOCATED_AT]->(pl)
        create (p)-[virel:VISITS]->(pl)
        set v.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
        set virel.starttime = v.starttime
        set v.endtime = v.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
        set virel.endtime = v.endtime
        set v.visittime = duration.between(v.starttime,v.endtime)
        set v.visittimeinseconds = v.visittime.seconds
        set virel.visittime = v.visittime
        set virel.visittimeinseconds = v.visittimeinseconds', 
    {batchSize:25000, parallel:false});
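Since every iteration creates both a (:Visit) node and a [:VISITS] relationship, the two models should hold the same number of visits. A simple check, as a sketch (the aliases are illustrative names):

```cypher
// Both counts should be equal: one Visit node per VISITS relationship.
MATCH (v:Visit)
WITH count(v) AS visitNodes
MATCH ()-[r:VISITS]->()
RETURN visitNodes, count(r) AS visitRelationships;
```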
Some people will be unconnected

The randomisation means that some people stay unconnected. Not a problem - in real life that would also be the case, right? Some people just don't go out :) ...
match (p:Person)
where not ((p)--())
return count(p);
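You can estimate up front what that count should be. Assuming every one of the 1500000 visits picks its person uniformly at random out of 500000 (which is what the rand() expression in the load step does), the chance that a given person is never picked is roughly exp(-1500000/500000) = exp(-3), or about 5%. A back-of-the-envelope sketch in Cypher, using only the built-in exp() and round() functions:

```cypher
// Rough estimate of the number of persons without any visit,
// assuming uniform random picks (approximation: (1 - 1/n)^m ~ exp(-m/n)).
WITH 500000 AS persons, 1500000 AS visits
RETURN round(persons * exp(-1.0 * visits / persons)) AS expectedUnconnected;
```

That works out to a bit under 25,000 persons, so the count returned by the query above should be in that neighbourhood.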

Querying for starttimes using OLD model / node indexes

Index the visit nodes

CREATE INDEX visitnodestarttime FOR (v:Visit) ON (v.starttime);
Query on visit nodes

profile match (p:Person)-[:PERFORMS_VISIT]->(v:Visit)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;
The profile / query plan before and after adding the index is very different:

Before the index, using NodeByLabelScan: lots of db hits.
After the index, using NodeIndexSeekByRange: making the performance fly, from 4403ms to 7ms.
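The WHERE clause in the query above selects a seven-hour window that ends 20 days and 10 hours ago. To see the actual boundaries of that window at the moment you run it, a small sketch using only built-in temporal functions (windowStart and windowEnd are illustrative aliases):

```cypher
// Show the boundaries of the time window used in the WHERE clause.
WITH datetime() AS now
RETURN now - duration("P20DT17H") AS windowStart,
       now - duration("P20DT10H") AS windowEnd;
```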


Querying for starttimes using NEW model

Now we can actually forget about the intermediate (:Visit) nodes, and just use the [:VISITS] relationships.
Index the VISITS relationships

Adding the index to the relationship property is very similar:
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
Now we can run the equivalent query on the new model.
Query on VISITS relationships

This is what that query looks like:
profile match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

Without the relationship index: using NodeByLabelScan, causing lots of db hits and 6 seconds of waiting.
With the relationship index: using DirectedRelationshipIndexSeekByRange - dropping the db hits and cutting the wait time to less than 8 milliseconds.
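If you want to reproduce the before / after comparison yourself, you can drop and recreate the index around the profiled query. This is a sketch of the sequence, not a benchmark recipe: timings will vary with memory and caching, so it is worth running each profile a couple of times. It assumes the db.awaitIndexes procedure is available in your version.

```cypher
// 1. Drop the relationship index to get the baseline plan.
DROP INDEX visitrelstarttime;
// 2. Run the PROFILE query above and note the db hits and timing.
// 3. Recreate the index and wait until it is populated.
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
CALL db.awaitIndexes();
// 4. Run the same PROFILE query again and compare the plans.
```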


Conclusion:

Great performance, and a simpler model.
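To give an idea of what the simpler model could look like in an actual contact tracing question, here is a hedged sketch that looks for healthy persons whose visit to a place overlapped with the visit of a sick person in the same time window. It uses only the properties created above; the aliases are illustrative, and I have not measured how the planner handles this particular query.

```cypher
// Visits overlap when each one starts before the other one ends.
MATCH (sick:Person)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(other:Person)
WHERE v1.starttime > datetime() - duration("P20DT17H")
  AND v1.starttime < datetime() - duration("P20DT10H")
  AND sick.healthstatus = "Sick"
  AND other.healthstatus = "Healthy"
  AND v2.starttime < v1.endtime
  AND v1.starttime < v2.endtime
RETURN sick.name AS sickPerson, other.name AS contact, pl.name AS place
LIMIT 25;
```

Prefixing it with profile, like the queries above, shows whether the range predicate on v1.starttime is served by the relationship index.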
Rik Van Bruggen

Twitter
Blog
LinkedIn

