public
Created

neo4j query to find thoughtworks colleagues of colleagues

  • Download Gist
gistfile1.txt
1 2 3 4 5
START n = node(*)
MATCH n-[r:colleagues*1..2]->c, n-[r2:member_of]->office
WHERE n.type? = 'person' and has(n.name) and (not(has(r2.end_date))) and office.name = 'London - UK South' and (not(has(c.thoughtquitter)))
RETURN n.name, count(distinct(c)) AS connections, office.name
ORDER BY connections DESC

Hmm this is a graph global query, can you transform it to an index lookup for the office:
n.type='person' should be implied or?
You can try to see how many nodes this touches by leaving off the where clause and just returning count(c)
Ideally we are pulling the predicates into the match, I'm not 100% sure if that happens here.

START office=node:offices(name='London - UK South')
MATCH office<-[r2:member_of]-n-[r:colleagues*1..2]->c 
WHERE n.type? = 'person' and has(n.name) and (not(has(r2.end_date))) and (not(has(c.thoughtquitter)))
RETURN n.name, count(distinct(c)) AS connections, office.name 
ORDER BY connections DESC

That worked much faster - took 60 seconds to work out the result that time. Obviously the query is faster but I think my result yesterday might have been a bit influenced by my machine being generally crap! I ended up restarting it in the evening to get it at least be a little bit responsive.

Oh that's sad, get yourself a new retina MBP or just rent a AWS high memory instance :)

And I'd be interested in the numbers anyway. Shouldn't take that long after all. If you can share the dataset privately I'd love to do some profiling. Btw. are you using neography against neo4j server? Might be interesting do do a raw query in the server console and/or the streaming option for comparison.

While we're here, this query isn't working as I expect it to:

START office=node:offices(name='London - UK South')
MATCH office<-[r:member_of]-person, person-[r2?:member_of]->other
WHERE (NOT(HAS(r.end_date)))
RETURN person, COUNT(DISTINCT(other))

Basically people who have only worked in London have a count of 0 where I would expect that it should come back as 1? Am I doing something wrong or is it working incorrectly?

And yeh maybe I should just run it on an AWS instance and see how it works - just a bit wary of putting the data on there cause normally it's behind CAS...although you could easily argue that having it on my hard drive is probably equally risky!

Those numbers that I gave you were from running the query through the web console actually. I haven't tried the streaming option - will read about it and give it a try

other is then null and null values are not counted by count()

I meant the numbers of nodes touched :) ah, interesting, so it should be pretty fast, did you run the query once or multiple times (so you get rid of the parsing costs (and loading scala for the first query) and also have hotter caches.

Ah ok - in which case I change my question - why is other null? I would expect that it should come up with the London office?

I tried that query a couple more times back to back, 44.7 seconds, 44.1 seconds, 44.9 seconds

Other is null becase, you have the london office already bound to the office node and there is an optional relationship which means if it cannot be satisified all the nodes dangling from that will be null as well (like an outer JOIN in SQL).

Ahhh ok. What I actually want to do is find out how many offices people who are currently working in the London office have worked at.

If I add a relationship called 'current_office' and then have the query:

START office=node:offices(name='London - UK South')
MATCH office<-[r:current_office]-person, person-[r2?:member_of]->other
WHERE (NOT(HAS(r.end_date)))
RETURN person, COUNT(DISTINCT(other))

Will that return a different result or will it be the same because the office & person nodes are still bound?

If that's the case is there any way you can think for me to do this query?

Please sign in to comment on this gist.

Something went wrong with that request. Please try again.