jexp/monster-query.adoc

## monster-query.adoc

      
    Raw
  

              monster-query.adoc
            
          
    A complex query result projection in cypher


Table of Contents

Introduction
The model
The task
The solution - step by step

The Basic Path
Fetching the Names
Fetching the addresses
Optional Address and Names fetching
Projection
Fine tuning


Summarize
Finally, the complete cypher query


Introduction


Sometimes it is better to request a complete object-tree from the (graph-)database at once instead sending more than one requests to the database and merging the results in program code.


However, since Neo4j release 2.0, cypher made this job easy, but you have to write a monster query.


One of these monsters are described in this Gist with the purpose to let you learn a bit more about cypher.


The model


Asuming a customer relation model, drawn in a graph that is stored in neo4j with following nodes:


a Customer node, representing the basic customer data


an OrgUnit node, representing the team of our company


an Employee node, representing an Employee of our company


a Name node, representing any name


an Address node, representing any address


and the corresponding releations between them.


Figure 1. A simple crm model


For simplicity reasons, names and addresses are not detailed as they would in productive models.


The task


What do we want to achieve?


Let’s assume, the graph database is backed behind a customer service portal, which is a webpage, where the customer can do some self service operations.


The portal will show the responsible persons in charge for the currently logged-in customer in a "Request Support" site.


To fetch the informations for this site, we have to query the database and returning a json, that contains all addresses (telephone numbers, email addresses, geographic addresses a.s.o.) of all employees, that supports the current logged-in customer.


The solution - step by step


The Basic Path


First, in respect to our model, we need to fetch all employees, that are members of an orgunit that acts as a support-team for the given customer.


With cypher there is no problem finding the correct path, starting from the customer and moving hand over hand to the employees:


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
RETURN
  c, o, e


This will give us a result set containing customer, orgunit and employee nodes. Good, but there are no names!


Fetching the Names


Instead of enlarging our basic path, we divide the names-fetching in different match parts:


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee),
  (c)-[:NAMED_BY]->(cn:Name),
  (o)-[:NAMED_BY]->(on:Name),
  (e)-[:NAMED_BY]->(en:Name)
RETURN
  c, cn, o, on, e, en


As a result we get more result rows with all the named nodes. Simple! Let’s fetching addresses too.


Fetching the addresses


As with names, addresses are related to orgunit and employee nodes, so we have to provide another match clause.


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee),
  (c)-[:NAMED_BY]->(cn:Name),
  (o)-[:NAMED_BY]->(on:Name),
  (e)-[:NAMED_BY]->(en:Name),
  (o)-[:LOCATED_AT]->(oa:Address),
  (e)-[:LOCATED_AT]->(ea:Address)
RETURN
  c, cn, o, on, oa, e, en, ea


No wonder - the nodes are returned.


But wait! Assuming an employee, that "support’s" the customer does not have an address… The result set will be missing this employee!


Why that?


In the cypher above, the MATCH -clause states that there is an existing relation between an employee (labeled: e) and an address (labeled: ea) of a relationship-type :LOCATED_AT. Neo4j will return only those nodes and relations that matches.


Optional Address and Names fetching


But no problem: The guys from neotechnology built in a cypher statement, named OPTIONAL MATCH. This will do the trick!


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
RETURN
  c, cn, o, on, oa, e, en, ea


Now, when there is an employee without an address, the ea node for this employee e is NULL in the result set.


Projection


The task, described above, was, that the cypher is returning a json object tree directly. Ok, so we cannot achieve this, when we return all the nodes without any structure.


With Neo4j 2.0 there is a very cool feature: You can create (and return) maps within cypher. This is a kind of projecting the resultset in a form, we’re focused on.


So what we like to have is something like displayed in the following diagram:


Figure 2. Expected Result of the query


Here’s an example of the expected query result:


{
  "custNo"            : "1234567890",                  (1)
  "name"              : "ACME Productions Co.",
  "supportteams"      : [                              (2)
    {
      "shortName"     : "snwt",
      "name"          : "South-North-West-Team",
      "addresses"     : [                              (3)
        {
          "type"      : "LANDLINE",
          "data"      : "+49 1234/567689"
        },
        {
          "type"      : "EMAIL",
          "data"      : "snwt@BestCompany.org"
        }
      ],
      "members"       : [                              (4)
        {
          "empNo"     : "A-001",
          "name"      : "Max Mustermann"
          "addresses" : [                              (5)
            {
              "type"  : "LANDLINE",
              "data"  : "+49 1234/56789-123"
            }
          ]
        },
        {
          "empNo"     : "A-002",
          "name"      : "Manuela Mustermann"
          "addresses" : [
            {
              "type"  : "LANDLINE",
              "data"  : "+49 1234/56789-124"
            }
          ]
        },
        {
          "empNo"     : "A-003",
          "name"      : "Jon Doe"
          "addresses" : [
            {
              "type"  : "LANDLINE",
              "data"  : "+49 1234/56789-125"
            }
          ]
        }
      ]
    }
  ]
}


the data of the customer (c)


the support-teams of the customer, which are orgunits (o) related to the customer via the :SUPPORTED_BY relationship


the addresses of the orgunit (oa)


the members (e) of the orgunit, which are employees related via :MEMBER_OF


the addresses of the members (ea)


First Trial:


Ok. Stretching Fingers! Writing the query’s RETURN clause:


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
RETURN
  {
    custNo: c.custNo,
    name: cn.name,
    supportTeams: COLLECT(
      {
        shortName: o.shortName,
        name: on.name,
        addresses: COLLECT(oa),
        members: COLLECT(
          {
            empNo: e.empNo,
            name: en.name,
            addresses: COLLECT(ea)
          }
        )
      }
    )
  }


NOPE!


There’s an error with this query: "Can’t use aggregate functions inside of aggregate functions."


Which simply means: "No COLLECTs inside COLLECTs!"


Second Trial:


Luckily cypher provides us a possibility to separate the things: The secret is the WITH clause. There we can built up the inner collections first, and with these prepared collections, we can build the whole projection in the RETURN clause.


Let’s look what it is in a correct statement:


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
WITH
  c AS c, // (1)
  cn AS cn,
  o AS o,
  on AS on,
  COLLECT(oa) AS oas, // (2)
  e AS e,
  en AS en,
  COLLECT(ea) AS eas // (3)
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  oas AS oas, // (4)
  COLLECT(
    {
      empNo: e.empNo,
      name: en.name,
      addresses: eas
    }
  ) AS es // (5)
RETURN
  {
    custNo: c.custNo,
    name: cn.name,
    supportTeams: COLLECT( // (6)
      {
        shortName: o.shortName,
        name: on.name,
        addresses: oas,
        members: es
      }
    )
  }


we need to let the hierarchically upper nodes fall through the WITH steps


collect all the orgunit addresses and name them as oas


collect all the employee addresses and name them as eas


let the previously collected orgunit addresses fall through


because we have to collect all the members of an orgunit first, named as es


finally we build the root map with only one inner collection


Not quite correct


While the last cypher is syntactically correct - it isn’t semantically correct in that it does contain duplicate addresses for the employees.


What?


Scrolling up to the fetching query that returns all nodes and look at the result table given for this query! I just modified the return-statement to:


...
RETURN ID(c), ID(cn), ID(o), ID(on), ID(oa), ID(e), ID(en), ID(ea)


so that the result will show, what’s actually happend to the rows in the result:


+--------------------------------------------------------------------+
| ID(c) | ID(cn) | ID(o) | ID(on) | ID(oa) | ID(e) | ID(en) | ID(ea) |
+--------------------------------------------------------------------+
| 638   | 639    | 163   | 164    | 167    | 907   | 908    | 910    |
| 638   | 639    | 163   | 164    | 167    | 907   | 908    | 909    |
| 638   | 639    | 163   | 164    | 167    | 903   | 904    | 906    |
| 638   | 639    | 163   | 164    | 167    | 903   | 904    | 905    |
| 638   | 639    | 163   | 164    | 166    | 907   | 908    | 910    |
| 638   | 639    | 163   | 164    | 166    | 907   | 908    | 909    |
| 638   | 639    | 163   | 164    | 166    | 903   | 904    | 906    |
| 638   | 639    | 163   | 164    | 166    | 903   | 904    | 905    |
| 638   | 639    | 163   | 164    | 165    | 907   | 908    | 910    |
| 638   | 639    | 163   | 164    | 165    | 907   | 908    | 909    |
| 638   | 639    | 163   | 164    | 165    | 903   | 904    | 906    |
| 638   | 639    | 163   | 164    | 165    | 903   | 904    | 905    |
+--------------------------------------------------------------------+
12 rows
138 ms


So, there are:


one customer node c with id: 638


one customer name node cn with id: 639


one orgunit node o with id: 163


one orgunit name node on with id: 164


three orgunit address nodes oa with ids: 165, 166 and 167


two employee nodes ea with ids: 903 and 907


two employee name nodes en with ids: 904 and 908


four employee address nodes ea with ids: 905, 906, 909 and 910


The resultset, cypher returns is the union of all matching nodes.


A COLLECT clause in the first WITH section of our "Second Trial" query collects all eas for all the other results.


Visualizing this by using the following return statement:


...
RETURN ID(c), ID(cn), ID(o), ID(on), collect(ID(oa)), ID(e), ID(en), collect(ID(ea))


will result in:


+----------------------------------------------------------------------------------------------------------+
| ID(c) | ID(cn) | ID(t) | ID(on) | collect(ID(oa))           | ID(e) | ID(en) | collect(ID(ea))           |
+----------------------------------------------------------------------------------------------------------+
| 638   | 639    | 163   | 164    | [167,167,166,166,165,165] | 903   | 904    | [906,905,906,905,906,905] |
| 638   | 639    | 163   | 164    | [167,167,166,166,165,165] | 907   | 908    | [910,909,910,909,910,909] |
+----------------------------------------------------------------------------------------------------------+
2 rows
121 ms


Look at the collect(ID(ea)) result! There’s the secret reason for the duplicates in the result.


Cypher’s documentation chapter 10.5.5. is the trick: There’s a DISTINCT statement. Though it is not documented, the DISTINCT clause also work inside the COLLECT aggregation. Let’s try this:


...
RETURN ID(c), ID(cn), ID(o), ID(on), collect(distinct ID(oa)), ID(e), ID(en), collect(distinct ID(ea))


leads to:


+--------------------------------------------------------------------------------------------------------+
| ID(c) | ID(cn) | ID(t) | ID(tn) | collect(distinct ID(ta)) | ID(e) | ID(en) | collect(distinct ID(ea)) |
+--------------------------------------------------------------------------------------------------------+
| 638   | 639    | 163   | 164    | [167,166,165]            | 903   | 904    | [906,905]                |
| 638   | 639    | 163   | 164    | [167,166,165]            | 907   | 908    | [910,909]                |
+--------------------------------------------------------------------------------------------------------+
2 rows
132 ms


Yeah! No more duplicates. With this in account, we can summarize our cypher query:


MATCH
  (c:Customer {custNo: '1234567890'})-[:SUPPORTED_BY]->(o:OrgUnit)<-[:MEMBER_OF]-(e:Employee)
OPTIONAL MATCH
  (c)-[:NAMED_BY]->(cn:Name)
OPTIONAL MATCH
  (o)-[:NAMED_BY]->(on:Name)
OPTIONAL MATCH
  (e)-[:NAMED_BY]->(en:Name)
OPTIONAL MATCH
  (o)-[:LOCATED_AT]->(oa:Address)
OPTIONAL MATCH
  (e)-[:LOCATED_AT]->(ea:Address)
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  COLLECT(DISTINCT oa) AS oas,
  e AS e,
  en AS en,
  COLLECT(DISTINCT ea) AS eas
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  oas AS oas,
  COLLECT( DISTINCT
    {
      empNo: e.empNo,
      name: en.name,
      addresses: eas
    }
  ) AS es
RETURN
  {
    custNo: c.custNo,
    name: cn.name,
    supportTeams: COLLECT( DISTINCT
      {
        shortName: o.shortName,
        name: on.name,
        addresses: oas,
        members: es
      }
    )
  }


Fine tuning


Checking validity of relations


Remembering the model of our graph, shown above? Each relation got two properties: validFrom and validTo used to define, when an address or a name is valid for the customer/orgunit or when the orgunit really supports the customer.


Now, there can be more relations to addresses, names a.s.o. but for the moment they are not valid, because they are ancient/historic addresses/names of the customer/orgunit or will become valid in the future.


Values of these properties are Java-Epoche Timestamps, namely long values, counting milliseconds since midnight 1970-01-01 UTC. The same is for the cypher function timestamp() which returns the current timestamp.


So there can be a WHERE clause that checks for validity of a relation like in the next example:


MATCH
  (c:Customer {custNo: '1234567890'})
OPTIONAL MATCH
  (c)-[rn:NAMED_BY]->(cn:Name)
WHERE
  rn.validFrom >= timestamp()
  AND rn.validTo <= timestamp()
RETURN
  c, cn, rn.validFrom, rn.validTo, timestamp()


Well. This is only true, in the following cases:


There is an existing relationship :NAMED_BY between these nodes


and validFrom of this relationship is specified


and validTo of this relationship is specified


But if we like to use NULL for validTo to state, that this relationship does not loose it’s validity in the future, we’re stopped, because null is not less than or equals to any long value.


Instead of extending the WHERE clause with null checks, cypher provides us a better function, named: coalesce. Chapter 12.2.4 of Neo4j documentation explains, that the first "non null" value is returned. Using coalesce in our where clause is shown next:


MATCH
  (c:Customer {custNo: '1234567890'})
OPTIONAL MATCH
  (c)-[rn:NAMED_BY]->(cn:Name)
WHERE
  COALESCE(rn.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rn.validTo, timestamp())
RETURN
  c, cn, rn.validFrom, rn.validTo, timestamp()


Checking null values


Assuming that there is an orgunit without an employee node, cypher states, that each property-request of a NULL node will return NULL too. With this in account, the result of:


...
RETURN
  COLLECT( DISTINCT
    {
      empNo: e.empNo,
      name: en.name,
      addresses: eas
    }
  ) AS es


will return a json like this:


[
  {
    "empNo" : null,
    "name"  : null,
    "addresses" : []
  }
]


If we do not want these null-objects, we must check the existence of the employee node e first. There is a CASE WHEN … THEN … ELSE … END clause defined in cypher that is usefull for this check:


...
RETURN
  CASE WHEN e IS NOT NULL THEN (1)
    COLLECT( DISTINCT
      {
        empNo: e.empNo,
        name: en.name,
        addresses: eas
      }
    )
  ELSE
    [] (2)
  END
  AS es


when there are employees in the result set, create the collection of the employees, named es


otherwise, return an empty collection as es


Summarize


Cypher is a great syntax to develop queries! Next it is a great syntax to develop complex queries with complex results projected to any kind of hierarchical maps too.


Surely, the given query is an example that could even improved a lot. But it demonstrates that - with a little knowledge of how cypher works - there are tremendous possibilities for using cypher in every kind of your project’s requirement.


Given a concrete model, similiar to the above mentioned "Simple CRM model" with about 1M Customers, 1K OrgUnits, 1M Employees and ~ 3 Addresses for each of them, the complete query is executed in less then 200ms (with default memory settings) and returning a JSON object that can directly be processed in the Javascript of the web site for example. This simplifies the development of such aspects very much.


Finally, the complete cypher query


MATCH
  (c:Customer {custNo: '1234567890'})-[rco:SUPPORTED_BY]->(o:OrgUnit)<-[roe:MEMBER_OF]-(e:Employee)
WHERE
  COALESCE(rco.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rco.validTo, timestamp())
  AND COALESCE(roe.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(roe.validTo, timestamp())
OPTIONAL MATCH
  (c)-[rcn:NAMED_BY]->(cn:Name)
WHERE
  COALESCE(rcn.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rcn.validTo, timestamp())
OPTIONAL MATCH
  (o)-[ron:NAMED_BY]->(on:Name)
WHERE
  COALESCE(ron.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(ron.validTo, timestamp())
OPTIONAL MATCH
  (e)-[ren:NAMED_BY]->(en:Name)
WHERE
  COALESCE(ren.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(ren.validTo, timestamp())
OPTIONAL MATCH
  (o)-[roa:LOCATED_AT]->(oa:Address)
WHERE
  COALESCE(roa.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(roa.validTo, timestamp())
OPTIONAL MATCH
  (e)-[rea:LOCATED_AT]->(ea:Address)
WHERE
  COALESCE(rea.validFrom, 0) <= timestamp()
  AND timestamp() <= COALESCE(rea.validTo, timestamp())
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  CASE WHEN oa IS NOT NULL THEN
    COLLECT(DISTINCT oa)
  ELSE
    []
  END AS oas,
  e AS e,
  en AS en,
  CASE WHEN ea IS NOT NULL THEN
    COLLECT(DISTINCT ea)
  ELSE
    []
  END AS eas
WITH
  c AS c,
  cn AS cn,
  o AS o,
  on AS on,
  oas AS oas,
  CASE WHEN e IS NOT NULL THEN
    COLLECT( DISTINCT
      {
        empNo: e.empNo,
        name:
          CASE WHEN HAS(en.name) THEN
            en.name
          ELSE
            ''
          END,
        addresses: eas
      }
    )
  ELSE
    []
  END AS es
RETURN
  {
    custNo: c.custNo,
    name:
      CASE WHEN HAS(cn.name) THEN
        cn.name
      ELSE
        ''
      END,
    supportTeams:
      CASE WHEN o IS NOT NULL THEN
        COLLECT( DISTINCT
          {
            shortName: o.shortName,
            name:
              CASE WHEN HAS(on.name) THEN
                on.name
              ELSE
                ''
              END,
            addresses: oas,
            members: es
          }
        )
      ELSE
        []
      END
  } AS `customer with support-team-contacts`