Yes! It's been a few months, but Saint Nicholas just brought us a brand new and shiny release of Neo4j 4.4 to play with. One of the key features is a generic transaction batching capability, similar to what we have been using in apoc.periodic.iterate, but now built right into the core of the database. It is referred to as the CALL {...} IN TRANSACTIONS capability - and of course it is a really interesting feature.
So in this article I will be revisiting an earlier blogpost, but without the need for APOC's apoc.periodic.iterate procedure. Let's see how that goes.
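Before scaling up, here is a minimal sketch of the bare pattern, meant for Neo4j Browser. The BatchDemo label and the tiny batch size are made up purely for illustration. The `:auto` prefix asks Browser to run the statement as an auto-commit transaction, which this kind of batched subquery requires.

```cypher
:auto UNWIND range(1, 10) AS id
CALL {
  // Each incoming row is handled inside the subquery;
  // the importing WITH makes the outer variable available.
  WITH id
  // BatchDemo is a hypothetical label used only for this sketch.
  CREATE (:BatchDemo {id: id})
} IN TRANSACTIONS OF 5 ROWS;
```

With 10 rows and batches of 5, this should commit in two separate transactions. If you leave out the `OF ... ROWS` part, the default batch size of 1000 rows applies. The demo nodes can be cleaned up afterwards with `MATCH (n:BatchDemo) DELETE n;`.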
The first step of course is going to be similar to, if not exactly the same as, the work I did in 2020 on contact tracing. Take a look at the original post (http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html) to see how that went. The key thing to recall there is that I was using the fantastic faker plugin. You can download it yourself from its GitHub page. Installation is super easy: you just need to make sure the config is updated too, and that you have whitelisted fkr.* just like you do with gds.* and apoc.*.
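A quick way to check that the plugin was picked up after the restart is to list the functions under the fkr namespace. This is only a sketch, and it assumes a Neo4j version that supports the SHOW FUNCTIONS command (4.3 or later, as far as I know).

```cypher
// List every function whose name starts with fkr.
// An empty result suggests the plugin jar or the config entry is missing.
SHOW FUNCTIONS
YIELD name, signature
WHERE name STARTS WITH 'fkr.'
RETURN name, signature
ORDER BY name;
```

The functions used in the rest of this post, fkr.person and fkr.stringElement, should appear in that list.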
As with the previous post, I will be pushing the scale up to the size of my home city of Antwerp, Belgium. And critically, we will not use APOC at all - we will use the new transaction batching instead.
Previously we did this in one transaction - which is probably at the limits of what I would normally do. But since we now have this transaction batching mechanism in Cypher, let's use it:
:auto UNWIND range(1,500000) as id
CALL {
WITH id
CREATE (p:Person {id: id})
SET p += fkr.person('1950-01-01','2021-12-01')
SET p.healthstatus = fkr.stringElement("Sick,Healthy")
SET p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
SET p.birthDate = datetime(p.birthDate)
SET p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
SET p.name = p.fullName
REMOVE p.fullName
} IN transactions of 25000 ROWS;
This returns a little more slowly than a single-shot transaction would, but that is to be expected.
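With 500,000 rows and batches of 25,000, the import above should have been committed in 20 separate transactions. A simple sanity check, assuming the database contained no Person nodes beforehand, is to count the nodes and look at the range of ids:

```cypher
// If every batch committed, this should report 500000 persons
// with ids running from 1 to 500000.
MATCH (p:Person)
RETURN count(p) AS persons,
       min(p.id) AS lowestId,
       max(p.id) AS highestId;
```

If a batch had failed halfway, the earlier batches would remain committed, so a lower count here would be the first hint that something went wrong.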
Then, we will create the (Place) nodes.
Adding the places is instantaneous, even with two batches of 5000:
:auto UNWIND range (1,10000) as id
CALL {
WITH id
CREATE (p:Place { id: id, name: "Place nr "+id})
SET p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
SET p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
} IN transactions of 5000 rows;
Next come the indexes. We don't really need them for this demo - but they could be useful for other queries. Note that we are using the relationship-centric model here - as we showed in the last blogpost that it is at least as capable as, and much simpler than, the reified model that used (Visit) nodes.
So here we add the node indexes:
CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
And we also add the index on the -[:VISITS]-> relationship property:
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
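Indexes are populated in the background, so it can be worth checking that they are all online before loading the relationships. A small sketch of how that could be done; the 300 second timeout is an arbitrary choice of mine:

```cypher
// Block until all indexes are online, or give up after 300 seconds.
CALL db.awaitIndexes(300);

// Then list the indexes with their state and population progress.
SHOW INDEXES
YIELD name, state, populationPercent, entityType, labelsOrTypes, properties
RETURN name, state, populationPercent, entityType, labelsOrTypes, properties
ORDER BY name;
```

All the indexes created above should show up with state ONLINE, and the visitrelstarttime index should be listed with entityType RELATIONSHIP.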
Now we can add the 1.5M relationships - the real test of the new transaction batching functionality.
It's pretty straightforward and similar to the previous examples, so let's just dive in:
:auto UNWIND range(1,1500000) as iteration
CALL {
WITH iteration
MATCH (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id:toInteger(rand()*10000)+1 })
create (p)-[virel:VISITS]->(pl)
set virel.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
set virel.endtime = virel.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
set virel.visittime = duration.between(virel.starttime,virel.endtime)
set virel.visittimeinseconds = virel.visittime.seconds
} IN TRANSACTIONS of 25000 rows;
The result came back pretty quickly: in under 75 seconds!
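To confirm that the relationships actually landed, a count over the VISITS relationships does the job. The sketch below also pulls the shortest and longest visit, just to get a feel for the generated data:

```cypher
// Every iteration picks a Person id between 1 and 500000 and a Place id
// between 1 and 10000, so each of the 1,500,000 rows should find a match
// as long as both imports above completed.
MATCH ()-[v:VISITS]->()
RETURN count(v) AS visits,
       min(v.visittimeinseconds) AS shortestVisitSeconds,
       max(v.visittimeinseconds) AS longestVisitSeconds;
```

Because the person and the place are picked at random for every row, the same pair can be connected more than once, which is fine for this demo.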
Just for completeness, I will revisit the main query that we explored in the previous blogpost here as well. This is what that query looks like:
match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;
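If you want to see whether the planner makes use of the relationship index on starttime for this query, you can prefix it with EXPLAIN and inspect the plan without actually running it. This is only a sketch: which operators appear depends on your Neo4j version and on the database statistics, so I would not assume a specific plan.

```cypher
// Same query as above, but only planned, not executed.
EXPLAIN
MATCH (p:Person)-[v:VISITS]->(pl:Place)
WHERE v.starttime > datetime() - duration("P20DT17H")
  AND v.starttime < datetime() - duration("P20DT10H")
RETURN p.name,
       sum(v.visittime) AS totalvisittime,
       sum(v.visittimeinseconds) AS totalvisittimeinseconds
ORDER BY totalvisittime DESC
LIMIT 10;
```

Look for a mention of the visitrelstarttime index in the plan. Swapping EXPLAIN for PROFILE runs the query as well and adds the actual row counts and database hits per operator.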
The new transaction batching functionality makes for a great addition to our toolbox - and clearly performs quite well. I am already looking forward to using it in other use cases!
Cheers
Rik Van Bruggen