Skip to content

Instantly share code, notes, and snippets.

@jruts
Last active September 28, 2023 14:22
Show Gist options
  • Save jruts/fe782ff2531d509784a24b655ad8ae76 to your computer and use it in GitHub Desktop.
Save jruts/fe782ff2531d509784a24b655ad8ae76 to your computer and use it in GitHub Desktop.
How to delete duplicate nodes and their relationships in neo4j with cypher?

How to delete duplicate nodes and their relationships in neo4j with cypher based on a property of that node?

The problem is easy to understand. We have 'duplicate' nodes in our database based on the 'id' field on the node properties.

Well this sounds easy enough, until you have to actually do it.

First step

My first attempt was to try and figure out which nodes are actualy duplicate (based on a property on the node). This seems to be pretty straightforward.

Cypher:

MATCH (g:geo) 
WITH g.id as id, collect(g) AS nodes 
WHERE size(nodes) >  1
RETURN nodes

This query finds all the nodes with label geo (change this to your own labels) and creates a list of all the nodes where the id property occurs more than once.
This results into a list of all the geo nodes that ar considered duplicate.

Example Result

nodes
geo:123, geo:123
geo:578, geo:578, geo:578

Second step

Now we want to try and get rid of the duplicate ones, but of course keep 1 remaining.

Cypher:

MATCH (g:geo) 
WITH g.id as id, collect(g) AS nodes 
WHERE size(nodes) >  1
FOREACH (g in tail(nodes) | DELETE g)

This results in this error: org.neo4j.kernel.api.exceptions.ConstraintViolationTransactionFailureException: Cannot delete node<866>, because it still has relationships. To delete this node, you must first delete its relationships.

Seems like we are close. But lets explain what we did in this query first.
We used the same query as before, but instead of returning the nodes we will loop over tail(nodes) and delete the nodes we find in there.
What tail does is getting all the nodes in the list, except the first one. So we ensure that 1 always remains.

Third step

As the previous error described, we need to remove the relationships of the duplicated first.

Cypher:

MATCH (g:geo) 
WITH g.id as id, collect(g) AS nodes 
WHERE size(nodes) >  1
UNWIND tail(nodes) as tails
MATCH (tails)-[r]-()
DELETE r

And poof, magic happened. This was successful and the relationships of the duplicates are now removed.
This means that we can run the query of step 2 again to delete the actual duplicates!

Solution

Delete duplicate relations of the duplicate nodes first

MATCH (g:geo) 
WITH g.id as id, collect(g) AS nodes 
WHERE size(nodes) >  1
UNWIND tail(nodes) as tails
MATCH (tails)-[r]-()
DELETE r

Delete actual duplicate nodes

MATCH (g:geo) 
WITH g.id as id, collect(g) AS nodes 
WHERE size(nodes) >  1
FOREACH (g in tail(nodes) | DELETE g)

I hope it can help some of you out!

@mading0817
Copy link

@Well5a nice one!

@jay-xzj
Copy link

jay-xzj commented Jul 19, 2020

You can also use the "DETACH DELETE" clause which deletes a node with all its relations.
This is faster because you have only one query to execute:

MATCH (g:geo)
WITH g.id as id, collect(g) AS nodes 
WHERE size(nodes) >  1
FOREACH (g in tail(nodes) | DETACH DELETE g)

hi, I have a database with over 4 million nodes with the same label.
how can I delete duplicated nodes with DETACH DELETE on this?

I tried your Cypher, got this.
image

@goodhamgupta
Copy link

Thank you this was super helpful! 😄

@okaysidd
Copy link

okaysidd commented Sep 3, 2020

De-duplication would, in most cases, deal with not deleting the relationships, but transferring the relationships to one node that remains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment