A Neo4j Browser Guide to explore a Contact Tracing database
Background
I wrote a series of 4 blogposts about this topic. For more detail please refer to:
-
Part 1: how I go about creating a synthetic dataset, and import that into Neo4j
-
Part 2: how I can start running some interesting queries on the dataset, making me understand some of the interesting data points in there and questions that one might ask
-
Part 3: how I can use graph data science on this dataset, and understand some of the predictive metrics like pagerank, betweenness and use community detection to direct policies
-
Part 4: a number of loose ends that I touched on during my exploration - but surely did not exhaust.
There’s so much potential in this dataset, and in this problem domain in general. I feel like I have gone into the rabbit hole and have just resurfaced for some air. But who knows, maybe I will dive back in and do some more digging - after all, this is interesting stuff, and I love working on interesting topics.
In this guide I will show you the statements in an orderly sequence. Let’s start.
Creating the data: in a google sheet.
Here’s the full spreadsheet with synthetic data. The csv files are to be found here:
Now we can use these scripts to load the data.
All of these scripts are also on also on github: github
Importing the google sheet data into Neo4j
System requirements: these queries have been tested on Neo4j Enterprise 3.5.17 and 4.0.3, apoc 3.5.0.9 and 4.0.0.6 respectively, and gds 1.1.
Import the persons from the Person sheet:
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=0" as csv
create (p:Person {id: csv.PersonId, name:csv.PersonName, healthstatus:csv.Healthstatus, confirmedtime:datetime(csv.ConfirmedTime), addresslocation:point({x: toFloat(csv.AddressLat), y: toFloat(csv.AddressLong)})});
Import the places from the places worksheet:
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=205425553" as csv
create (p:Place {id: csv.PlaceId, name:csv.PlaceName, type:csv.PlaceType, location:point({x: toFloat(csv.Lat), y: toFloat(csv.Long)})});
Create a couple of indexes to make easier/faster to create the Visit nodes and relationships:
create index on :Place(id);
create index on :Place(location);
create index on :Place(name);
create index on :Person(id);
create index on :Person(name);
create index on :Person(healthstatus);
create index on :Person(confirmedtime);
Import the VISITS from the Visits sheet. Note that we are loading duplicate info here: both with a VISIT node, and a VISITS relationship. They can both be useful.
load csv with headers from
"https://docs.google.com/spreadsheets/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=1261126668" as csv
match (p:Person {id:csv.PersonId}), (pl:Place {id:csv.PlaceId})
create (p)-[:PERFORMS_VISIT]->(v:Visit {id:csv.VisitId, starttime:datetime(csv.StartTime), endtime:datetime(csv.EndTime)})-[:LOCATED_AT]->(pl)
create (p)-[vi:VISITS {id:csv.VisitId, starttime:datetime(csv.StartTime), endtime:datetime(csv.EndTime)}]->(pl)
set v.duration=duration.inSeconds(v.starttime,v.endtime)
set vi.duration=duration.inSeconds(vi.starttime,vi.endtime);
OPTIONAL: connect places to a Region, Country, Continent
create (r:Region {name:"Antwerp"})-[:PART_OF]->(c:Country {name:"Belgium"})-[:PART_OF]->(co:Continent {name:"Europe"});
match (r:Region {name:"Antwerp"}), (pl:Place)
create (pl)-[:PART_OF]->(r);
Querying Data
System requirements: these queries have been tested on Neo4j Enterprise 3.5.17 and 4.0.3, apoc 3.5.0.9 and 4.0.0.6 respectively, and gds 1.1.
The queries are also on github.
Who has a sick person potentially infected
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match (p)--(v1:Visit)--(pl:Place)--(v2:Visit)--(p2:Person {healthstatus:"Healthy"})
return p.name as Spreader, v1.starttime as SpreaderStarttime, v2.endtime as SpreaderEndtime, pl.name as PlaceVisited, p2.name as Target, v2.starttime as TargetStarttime, v2.endtime as TargetEndttime;
Who has a sick person potentially infected - VISUAL
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match path = (p)-->(v1:Visit)-->(pl:Place)<--(v2:Visit)<--(p2:Person {healthstatus:"Healthy"})
return path;
Simplifying the query by using the VISITS relationship
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match path = (p)-[:VISITS]->(pl:Place)<-[:VISITS]-(p2:Person {healthstatus:"Healthy"})
return path;
Who has a sick person infected - with time overlap
The latest of start times must occur before (or at the same time) as the earliest of the end times for the ranges to overlap.
Note that at the time or writing, apoc.coll.min and apoc.coll.max do not work on apoc 4.0.0.7 or later. Please use apoc version 4.0.0.6 which you can find over here
match (p:Person {healthstatus:"Sick"})-->(v1:Visit)-->(pl:Place)
with p,v1,pl
limit 10
match path = (p)-->(v1)-->(pl)<--(v2:Visit)<--(p2:Person {healthstatus:"Healthy"})
WITH path, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return path;
Who has a sick person infected - with time overlap AND SIMPLIFIED with the VISITS relationship. The latest of start times must occur before (or at the same time) as the earliest of the end times for the ranges to overlap.
match (p:Person {healthstatus:"Sick"})-[v1:VISITS]->(pl:Place)
with p,v1,pl
limit 10
match path = (p)-[v1]->(pl)<-[v2:VISITS]-(p2:Person {healthstatus:"Healthy"})
WITH path, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return path;
Who has a sick person infected - with time overlap +/- 2hrs. The latest of start times must occur before (or at the same time) as the earliest of the end times for the ranges to overlap.
match (p:Person {healthstatus:"Sick"})-->(s1:Stay)-->(pl:Place)
with p,s1,pl
limit 10
match path = (p)-->(s1)-->(pl)<--(s2:Stay)<--(p2:Person {healthstatus:"Healthy"})
WITH path, apoc.coll.max([s1.starttime.epochMillis, s2.starttime.epochMillis]) as maxStart,
apoc.coll.min([s1.endtime.epochMillis, s2.endtime.epochMillis]) as minEnd
where maxStart-720000 <= minEnd+720000
return path;
Find sick person that has visited places since being infected
match (p:Person {healthstatus:"Sick"})-[visited]->(pl:Place)
where p.confirmedtime < visited.starttime
return p, visited, pl
limit 10;
Find connections between sick people
match (p1:Person {healthstatus:"Sick"}),(p2:Person {healthstatus:"Sick"})
where id(p1)<id(p2)
with p1, p2
match path = allshortestpaths ((p1)-[*]-(p2))
return path
limit 10;
How many sick and healthy people
match (p:Person)
return distinct p.healthstatus, count(*);
Which healthy person has the highest risk - based on amount over overlaptime with sick people
match (hp:Person {healthstatus:"Healthy"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(sp:Person {healthstatus:"Sick"})
with hp, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return hp.name, hp.healthstatus, sum(minEnd-maxStart) as overlaptime
order by overlaptime desc;
Which healthy person has the highest risk - based on amount over overlaptime with sick people - VISUAL
match (hp:Person {healthstatus:"Healthy"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(sp:Person {healthstatus:"Sick"})
with hp, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with hp, sum(minEnd-maxStart) as overlaptime
order by overlaptime desc
limit 10
match (hp)-[v]-(pl:Place)
return hp,v,pl;
Places with most sick visits
match (p:Person {healthstatus:"Sick"})-[v:VISITS]->(pl:Place)
with distinct pl.name as placename, count(v) as nrofsickvisits, apoc.node.degree.in(pl,'VISITS') as totalnrofvisits
order by nrofsickvisits desc
limit 10
return placename, nrofsickvisits, totalnrofvisits, round(toFloat(nrofsickvisits)/toFloat(totalnrofvisits)*10000)/100 as percentageofsickvisits;
Places with most sick visits - VISUAL
match (p:Person {healthstatus:"Sick"})-[v:VISITS]->(pl:Place)
with distinct pl.name as placename, count(v) as nrofsickvisits, pl
order by nrofsickvisits desc
limit 10
match (pl)<-[v]-(p:Person)
return pl,p,v;
Graph Data Science on the Contact Tracing Graph
Note that at the time of writing, these queries have been tested on Neo4j 3.5.17. Neo4j 4.0.3 currently not yet supports the GDS plugin.
All the scripts are of course also on github
REQUIREMENT: create the MEETS relationship based on OVERLAPTIME
This is a relationship between two PERSON nodes that we will need for our graph data science exercises.
match (p1:Person)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(p2:Person)
where id(p1)<id(p2)
with p1, p2, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with p1, p2, sum(minEnd-maxStart) as meetTime
create (p1)-[:MEETS {meettime: duration({seconds: meetTime/1000})}]->(p2);
Graph Algo nr 1: calculating pagerank of Persons
:param limit => (10);
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
relType: {
type: 'MEETS',
orientation: 'NATURAL',
properties: {}
}
},
relationshipWeightProperty: null,
dampingFactor: 0.85,
maxIterations: 20,
writeProperty: 'pagerank'
});
CALL gds.pageRank.write($config);
Look at the Person pagerank table results:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
RETURN node.name as name, node[$config.writeProperty] AS pagerank, node.betweenness as betweenness
ORDER BY pagerank DESC
LIMIT 10;
Look at the Person pagerank graph results VISUALLY:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
with node, node[$config.writeProperty] AS score
ORDER BY score DESC
LIMIT 10
match (node)-[r]-(conn)
return node, r, conn
Graph Algo nr 2: calculating BETWEENNESS of Person nodes
:param limit => (20);
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
relType: {
type: 'MEETS',
orientation: 'NATURAL',
properties: {}
}
},
writeProperty: 'betweenness'
});
CALL gds.alpha.betweenness.write($config);
Look at the Person betweenness results table:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
RETURN node.name as name, node.pagerank as pagerank, node[$config.writeProperty] AS betweenness
ORDER BY betweenness DESC
LIMIT 10;
Look at the Person betweenness results VISUALLY:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
with node, node[$config.writeProperty] AS score
ORDER BY score DESC
LIMIT 10
match (node)-[r]-(conn)
return node, r, conn;
Graph Algo nr 3: LOUVAIN Community detection
Preparation for relationship weight property: needs integer, is currently set up as a duration!
MATCH p=()-[r:MEETS]->()
set r.meettimeinseconds=r.meettime.seconds;
Now we can calculate communities using Louvain:
:param limit => ( 50);
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
relType: {
type: 'MEETS',
orientation: 'NATURAL',
properties: {
meettimeinseconds: {
property: 'meettimeinseconds',
defaultValue: 1
}
}
}
},
relationshipWeightProperty: 'meettimeinseconds',
includeIntermediateCommunities: false,
seedProperty: '',
writeProperty: 'louvain'
});
CALL gds.louvain.write($config);
What are the different communities?
match (p:Person)
return distinct p.louvain, count(p)
order by count(p) desc;
Explore community 489:
match (p1:Person {louvain: 489})-[v:VISITS]->(pl:Place), (p1)-[m:MEETS]->(p2:Person)
return p1, p2, pl, v, m;