Skip to content

Instantly share code, notes, and snippets.

@jexp
Last active August 29, 2015 13:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jexp/9273836 to your computer and use it in GitHub Desktop.
Save jexp/9273836 to your computer and use it in GitHub Desktop.

A complex query result projection in cypher

Introduction

Sometimes it is better to request a complete object-tree from the (graph-)database at once instead sending more than one requests to the database and merging the results in program code.

However, since Neo4j release 2.0, cypher made this job easy, but you have to write a monster query.

One of these monsters are described in this Gist with the purpose to let you learn a bit more about cypher.

The model

Asuming a customer relation model, drawn in a graph that is stored in neo4j with following nodes:

  • a Customer node, representing the basic customer data

  • an OrgUnit node, representing the team of our company

  • an Employee node, representing an Employee of our company

  • a Name node, representing any name

  • an Address node, representing any address

and the corresponding releations between them.

crm graph model
Figure 1. A simple crm model

For simplicity reasons, names and addresses are not detailed as they would in productive models.

The task

What do we want to achieve?

Let’s assume, the graph database is backed behind a customer service portal, which is a webpage, where the customer can do some self service operations.

The portal will show the responsible persons in charge for the currently logged-in customer in a "Request Support" site.

To fetch the informations for this site, we have to query the database and returning a json, that contains all addresses (telephone numbers, email addresses, geographic addresses a.s.o.) of all employees, that supports the current logged-in customer.

The solution - step by step

The Basic Path

First, in respect to our model, we need to fetch all employees, that are members of an orgunit that acts as a support-team for the given customer.

With cypher there is no problem finding the correct path, starting from the customer and moving hand over hand to the employees:

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
RETURN
  c, o, e

This will give us a result set containing customer, orgunit and employee nodes. Good, but there are no names!

Fetching the Names

Instead of enlarging our basic path, we divide the names-fetching in different match parts:

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee),
  (c)-[:NAMED_BY]->(cn:Name),
  (o)-[:NAMED_BY]->(on:Name),
  (e)-[:NAMED_BY]->(en:Name)
RETURN
  c, cn, o, on, e, en

As a result we get more result rows with all the named nodes. Simple! Let’s fetching addresses too.

Fetching the addresses

As with names, addresses are related to orgunit and employee nodes, so we have to provide another match clause.

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee),
  (c)-[:NAMED_BY]->(cn:Name),
  (o)-[:NAMED_BY]->(on:Name),
  (e)-[:NAMED_BY]->(en:Name),
  (o)-[:LOCATED_AT]->(oa:Address),
  (e)-[:LOCATED_AT]->(ea:Address)
RETURN
  c, cn, o, on, oa, e, en, ea

No wonder - the nodes are returned.

But wait! Assuming an employee, that "support’s" the customer does not have an address…​ The result set will be missing this employee!

Why that?

In the cypher above, the MATCH -clause states that there is an existing relation between an employee (labeled: e) and an address (labeled: ea) of a relationship-type :LOCATED_AT. Neo4j will return only those nodes and relations that matches.

Optional Address and Names fetching

But no problem: The guys from neotechnology built in a cypher statement, named OPTIONAL MATCH. This will do the trick!

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
RETURN
  c, cn, o, on, oa, e, en, ea

Now, when there is an employee without an address, the ea node for this employee e is NULL in the result set.

Projection

The task, described above, was, that the cypher is returning a json object tree directly. Ok, so we cannot achieve this, when we return all the nodes without any structure.

With Neo4j 2.0 there is a very cool feature: You can create (and return) maps within cypher. This is a kind of projecting the resultset in a form, we’re focused on.

So what we like to have is something like displayed in the following diagram:

query result
Figure 2. Expected Result of the query

Here’s an example of the expected query result:

{
  "custNo"            : "1234567890",                  (1)
  "name"              : "ACME Productions Co.",
  "supportteams"      : [                              (2)
    {
      "shortName"     : "snwt",
      "name"          : "South-North-West-Team",
      "addresses"     : [                              (3)
        {
          "type"      : "LANDLINE",
          "data"      : "+49 1234/567689"
        },
        {
          "type"      : "EMAIL",
          "data"      : "snwt@BestCompany.org"
        }
      ],
      "members"       : [                              (4)
        {
          "empNo"     : "A-001",
          "name"      : "Max Mustermann"
          "addresses" : [                              (5)
            {
              "type"  : "LANDLINE",
              "data"  : "+49 1234/56789-123"
            }
          ]
        },
        {
          "empNo"     : "A-002",
          "name"      : "Manuela Mustermann"
          "addresses" : [
            {
              "type"  : "LANDLINE",
              "data"  : "+49 1234/56789-124"
            }
          ]
        },
        {
          "empNo"     : "A-003",
          "name"      : "Jon Doe"
          "addresses" : [
            {
              "type"  : "LANDLINE",
              "data"  : "+49 1234/56789-125"
            }
          ]
        }
      ]
    }
  ]
}
  1. the data of the customer (c)

  2. the support-teams of the customer, which are orgunits (o) related to the customer via the :SUPPORTED_BY relationship

  3. the addresses of the orgunit (oa)

  4. the members (e) of the orgunit, which are employees related via :MEMBER_OF

  5. the addresses of the members (ea)

First Trial:

Ok. Stretching Fingers! Writing the query’s RETURN clause:

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
RETURN
  {
    custNo: c.custNo,
    name: cn.name,
    supportTeams: COLLECT(
      {
        shortName: o.shortName,
        name: on.name,
        addresses: COLLECT(oa),
        members: COLLECT(
          {
            empNo: e.empNo,
            name: en.name,
            addresses: COLLECT(ea)
          }
        )
      }
    )
  }

NOPE!

There’s an error with this query: "Can’t use aggregate functions inside of aggregate functions."

Which simply means: "No COLLECTs inside COLLECTs!"

Second Trial:

Luckily cypher provides us a possibility to separate the things: The secret is the WITH clause. There we can built up the inner collections first, and with these prepared collections, we can build the whole projection in the RETURN clause.

Let’s look what it is in a correct statement:

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
WITH
  c AS c, // (1)
  cn AS cn,
  o AS o,
  on AS on,
  COLLECT(oa) AS oas, // (2)
  e AS e,
  en AS en,
  COLLECT(ea) AS eas // (3)
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  oas AS oas, // (4)
  COLLECT(
    {
      empNo: e.empNo,
      name: en.name,
      addresses: eas
    }
  ) AS es // (5)
RETURN
  {
    custNo: c.custNo,
    name: cn.name,
    supportTeams: COLLECT( // (6)
      {
        shortName: o.shortName,
        name: on.name,
        addresses: oas,
        members: es
      }
    )
  }
  1. we need to let the hierarchically upper nodes fall through the WITH steps

  2. collect all the orgunit addresses and name them as oas

  3. collect all the employee addresses and name them as eas

  4. let the previously collected orgunit addresses fall through

  5. because we have to collect all the members of an orgunit first, named as es

  6. finally we build the root map with only one inner collection

Not quite correct

While the last cypher is syntactically correct - it isn’t semantically correct in that it does contain duplicate addresses for the employees.

What?

Scrolling up to the fetching query that returns all nodes and look at the result table given for this query! I just modified the return-statement to:

...
RETURN ID(c), ID(cn), ID(o), ID(on), ID(oa), ID(e), ID(en), ID(ea)

so that the result will show, what’s actually happend to the rows in the result:

+--------------------------------------------------------------------+
| ID(c) | ID(cn) | ID(o) | ID(on) | ID(oa) | ID(e) | ID(en) | ID(ea) |
+--------------------------------------------------------------------+
| 638   | 639    | 163   | 164    | 167    | 907   | 908    | 910    |
| 638   | 639    | 163   | 164    | 167    | 907   | 908    | 909    |
| 638   | 639    | 163   | 164    | 167    | 903   | 904    | 906    |
| 638   | 639    | 163   | 164    | 167    | 903   | 904    | 905    |
| 638   | 639    | 163   | 164    | 166    | 907   | 908    | 910    |
| 638   | 639    | 163   | 164    | 166    | 907   | 908    | 909    |
| 638   | 639    | 163   | 164    | 166    | 903   | 904    | 906    |
| 638   | 639    | 163   | 164    | 166    | 903   | 904    | 905    |
| 638   | 639    | 163   | 164    | 165    | 907   | 908    | 910    |
| 638   | 639    | 163   | 164    | 165    | 907   | 908    | 909    |
| 638   | 639    | 163   | 164    | 165    | 903   | 904    | 906    |
| 638   | 639    | 163   | 164    | 165    | 903   | 904    | 905    |
+--------------------------------------------------------------------+
12 rows
138 ms

So, there are:

  • one customer node c with id: 638

  • one customer name node cn with id: 639

  • one orgunit node o with id: 163

  • one orgunit name node on with id: 164

  • three orgunit address nodes oa with ids: 165, 166 and 167

  • two employee nodes ea with ids: 903 and 907

  • two employee name nodes en with ids: 904 and 908

  • four employee address nodes ea with ids: 905, 906, 909 and 910

The resultset, cypher returns is the union of all matching nodes.

A COLLECT clause in the first WITH section of our "Second Trial" query collects all eas for all the other results.

Visualizing this by using the following return statement:

...
RETURN ID(c), ID(cn), ID(o), ID(on), collect(ID(oa)), ID(e), ID(en), collect(ID(ea))

will result in:

+----------------------------------------------------------------------------------------------------------+
| ID(c) | ID(cn) | ID(t) | ID(on) | collect(ID(oa))           | ID(e) | ID(en) | collect(ID(ea))           |
+----------------------------------------------------------------------------------------------------------+
| 638   | 639    | 163   | 164    | [167,167,166,166,165,165] | 903   | 904    | [906,905,906,905,906,905] |
| 638   | 639    | 163   | 164    | [167,167,166,166,165,165] | 907   | 908    | [910,909,910,909,910,909] |
+----------------------------------------------------------------------------------------------------------+
2 rows
121 ms

Look at the collect(ID(ea)) result! There’s the secret reason for the duplicates in the result.

Cypher’s documentation chapter 10.5.5. is the trick: There’s a DISTINCT statement. Though it is not documented, the DISTINCT clause also work inside the COLLECT aggregation. Let’s try this:

...
RETURN ID(c), ID(cn), ID(o), ID(on), collect(distinct ID(oa)), ID(e), ID(en), collect(distinct ID(ea))

leads to:

+--------------------------------------------------------------------------------------------------------+
| ID(c) | ID(cn) | ID(t) | ID(tn) | collect(distinct ID(ta)) | ID(e) | ID(en) | collect(distinct ID(ea)) |
+--------------------------------------------------------------------------------------------------------+
| 638   | 639    | 163   | 164    | [167,166,165]            | 903   | 904    | [906,905]                |
| 638   | 639    | 163   | 164    | [167,166,165]            | 907   | 908    | [910,909]                |
+--------------------------------------------------------------------------------------------------------+
2 rows
132 ms

Yeah! No more duplicates. With this in account, we can summarize our cypher query:

MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  COLLECT(DISTINCT oa) AS oas,
  e AS e,
  en AS en,
  COLLECT(DISTINCT ea) AS eas
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  oas AS oas,
  COLLECT( DISTINCT
    {
      empNo: e.empNo,
      name: en.name,
      addresses: eas
    }
  ) AS es
RETURN
  {
    custNo: c.custNo,
    name: cn.name,
    supportTeams: COLLECT( DISTINCT
      {
        shortName: o.shortName,
        name: on.name,
        addresses: oas,
        members: es
      }
    )
  }

Fine tuning

Checking validity of relations

Remembering the model of our graph, shown above? Each relation got two properties: validFrom and validTo used to define, when an address or a name is valid for the customer/orgunit or when the orgunit really supports the customer.

Now, there can be more relations to addresses, names a.s.o. but for the moment they are not valid, because they are ancient/historic addresses/names of the customer/orgunit or will become valid in the future.

Values of these properties are Java-Epoche Timestamps, namely long values, counting milliseconds since midnight 1970-01-01 UTC. The same is for the cypher function timestamp() which returns the current timestamp.

So there can be a WHERE clause that checks for validity of a relation like in the next example:

MATCH
  (c:Customer {custNo: '1234567890'})
OPTIONAL MATCH
  (c)-[rn:NAMED_BY]->(cn:Name)
WHERE
  rn.validFrom >= timestamp()
  AND rn.validTo <= timestamp()
RETURN
  c, cn, rn.validFrom, rn.validTo, timestamp()

Well. This is only true, in the following cases:

  • There is an existing relationship :NAMED_BY between these nodes

  • and validFrom of this relationship is specified

  • and validTo of this relationship is specified

But if we like to use NULL for validTo to state, that this relationship does not loose it’s validity in the future, we’re stopped, because null is not less than or equals to any long value.

Instead of extending the WHERE clause with null checks, cypher provides us a better function, named: coalesce. Chapter 12.2.4 of Neo4j documentation explains, that the first "non null" value is returned. Using coalesce in our where clause is shown next:

MATCH
  (c:Customer {custNo: '1234567890'})
OPTIONAL MATCH
  (c)-[rn:NAMED_BY]->(cn:Name)
WHERE
  COALESCE(rn.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rn.validTo, timestamp())
RETURN
  c, cn, rn.validFrom, rn.validTo, timestamp()

Checking null values

Assuming that there is an orgunit without an employee node, cypher states, that each property-request of a NULL node will return NULL too. With this in account, the result of:

...
RETURN
  COLLECT( DISTINCT
    {
      empNo: e.empNo,
      name: en.name,
      addresses: eas
    }
  ) AS es

will return a json like this:

[
  {
    "empNo" : null,
    "name"  : null,
    "addresses" : []
  }
]

If we do not want these null-objects, we must check the existence of the employee node e first. There is a CASE WHEN …​ THEN …​ ELSE …​ END clause defined in cypher that is usefull for this check:

...
RETURN
  CASE WHEN e IS NOT NULL THEN (1)
    COLLECT( DISTINCT
      {
        empNo: e.empNo,
        name: en.name,
        addresses: eas
      }
    )
  ELSE
    [] (2)
  END
  AS es
  1. when there are employees in the result set, create the collection of the employees, named es

  2. otherwise, return an empty collection as es

Summarize

Cypher is a great syntax to develop queries! Next it is a great syntax to develop complex queries with complex results projected to any kind of hierarchical maps too.

Surely, the given query is an example that could even improved a lot. But it demonstrates that - with a little knowledge of how cypher works - there are tremendous possibilities for using cypher in every kind of your project’s requirement.

Given a concrete model, similiar to the above mentioned "Simple CRM model" with about 1M Customers, 1K OrgUnits, 1M Employees and ~ 3 Addresses for each of them, the complete query is executed in less then 200ms (with default memory settings) and returning a JSON object that can directly be processed in the Javascript of the web site for example. This simplifies the development of such aspects very much.

Finally, the complete cypher query

MATCH
  (c:Customer {custNo: '1234567890'})-[rco:SUPPORTED_BY]->(o:OrgUnit)<-[roe:MEMBER_OF]-(e:Employee)
WHERE
  COALESCE(rco.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rco.validTo, timestamp())
  AND COALESCE(roe.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(roe.validTo, timestamp())
OPTIONAL MATCH
  (c)-[rcn:NAMED_BY]->(cn:Name)
WHERE
  COALESCE(rcn.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rcn.validTo, timestamp())
OPTIONAL MATCH
  (o)-[ron:NAMED_BY]->(on:Name)
WHERE
  COALESCE(ron.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(ron.validTo, timestamp())
OPTIONAL MATCH
  (e)-[ren:NAMED_BY]->(en:Name)
WHERE
  COALESCE(ren.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(ren.validTo, timestamp())
OPTIONAL MATCH
  (o)-[roa:LOCATED_AT]->(oa:Address)
WHERE
  COALESCE(roa.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(roa.validTo, timestamp())
OPTIONAL MATCH
  (e)-[rea:LOCATED_AT]->(ea:Address)
WHERE
  COALESCE(rea.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rea.validTo, timestamp())
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  CASE WHEN oa IS NOT NULL THEN
    COLLECT(DISTINCT oa)
  ELSE
    []
  END AS oas,
  e AS e,
  en AS en,
  CASE WHEN ea IS NOT NULL THEN
    COLLECT(DISTINCT ea)
  ELSE
    []
  END AS eas
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  oas AS oas,
  CASE WHEN e IS NOT NULL THEN
    COLLECT( DISTINCT
      {
        empNo: e.empNo,
        name:
          CASE WHEN HAS(en.name) THEN
            en.name
          ELSE
            ''
          END,
        addresses: eas
      }
    )
  ELSE
    []
  END AS es
RETURN
  {
    custNo: c.custNo,
    name:
      CASE WHEN HAS(cn.name) THEN
        cn.name
      ELSE
        ''
      END,
    supportTeams:
      CASE WHEN o IS NOT NULL THEN
        COLLECT( DISTINCT
          {
            shortName: o.shortName,
            name:
              CASE WHEN HAS(on.name) THEN
                on.name
              ELSE
                ''
              END,
            addresses: oas,
            members: es
          }
        )
      ELSE
        []
      END
  } AS `customer with support-team-contacts`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment