@patrickkwang
Created July 22, 2021 00:36

Transpiling TRAPI into Cypher

https://github.com/ranking-agent/reasoner-transpiler/

The reasoner-transpiler Python library provides tools for converting a TRAPI query graph into a Cypher (Neo4j) query that generates the corresponding knowledge graph and results. In other words, it performs the "lookup" operation.

Features

  • generates a TRAPI-compliant knowledge graph and results directly from Neo4j!
  • handles semantic operations with the biolink-model
    • sub-categories
    • sub-predicates
    • symmetric and inverse predicates
  • handles qnodes with "is_set": true
  • handles arbitrary query graphs: n-hops, branches, loops, etc.

Example

Let's say we want to find all phenotypes associated with type-2 diabetes.

from reasoner_transpiler.cypher import get_query

qgraph = {
    "nodes": {
        "diabetes": {
            "ids": ["MONDO:0005148"],
        },
        "phenotype": {
            "categories": ["biolink:PhenotypicFeature"],
        },
    },
    "edges": {
        "has phenotype": {
            "subject": "diabetes",
            "predicates": ["biolink:has_phenotype"],
            "object": "phenotype",
        },
    },
}

print(get_query(qgraph))

The reasoner transpiler generates the following Cypher query.

MATCH (`diabetes` {`id`: "MONDO:0005148"})-[`has phenotype`:`biolink:has_phenotype`|`biolink:phenotype_of`]-(`phenotype`:`biolink:PhenotypicFeature`)
WHERE (
    (
        type(`has phenotype`) in ["biolink:has_phenotype"]
        AND startNode(`has phenotype`) = `diabetes`
    )
    OR (
        type(`has phenotype`) in ["biolink:phenotype_of"]
        AND startNode(`has phenotype`) = `phenotype`
    )
)
WITH 
    {
        node_bindings: {
            `diabetes`: (CASE WHEN `diabetes` IS NOT NULL THEN [{id: `diabetes`.id}] ELSE [] END),
            `phenotype`: (CASE WHEN `phenotype` IS NOT NULL THEN [{id: `phenotype`.id}] ELSE [] END)
        },
        edge_bindings: {
            `has phenotype`: [ei IN collect(DISTINCT `has phenotype`.id) WHERE ei IS NOT null | {id: ei}]
        }
    } AS result,
    {
        nodes: collect(DISTINCT `diabetes`) + collect(DISTINCT `phenotype`),
        edges: collect(DISTINCT `has phenotype`)
    } AS knowledge_graph
UNWIND knowledge_graph.nodes AS knode UNWIND knowledge_graph.edges AS kedge
WITH
    collect(DISTINCT result) AS results,
    {
        nodes: apoc.map.fromLists([n IN collect(DISTINCT knode) | n.id], [n IN collect(DISTINCT knode) | {
            categories: labels(n),
            name: n.name,
            attributes: [key in apoc.coll.subtract(keys(n), ["id", "category"]) | {
                original_attribute_name: key,
                attribute_type_id: COALESCE({publications: "EDAM:data_0971"}[key], "NA"),
                value: n[key]
            }]
        }]),
        edges: apoc.map.fromLists(
            [e IN collect(DISTINCT kedge) | e.id],
            [e IN collect(DISTINCT kedge) | {
                predicate: type(e),
                subject: startNode(e).id,
                object: endNode(e).id,
                attributes: [key in apoc.coll.subtract(keys(e), ["id", "predicate"]) | {
                    original_attribute_name: key,
                    attribute_type_id: COALESCE({publications: "EDAM:data_0971"}[key], "NA"),
                    value: e[key]
                }]
            }]
        )
    } AS knowledge_graph
RETURN results, knowledge_graph

There's a lot going on. Let's break this down.

MATCH (`diabetes` {`id`: "MONDO:0005148"})-[`has phenotype`:`biolink:has_phenotype`|`biolink:phenotype_of`]-(`phenotype`:`biolink:PhenotypicFeature`)

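The MATCH clause finds every node-relationship-node triple that connects MONDO:0005148 to a node labeled biolink:PhenotypicFeature through a relationship of type biolink:has_phenotype or biolink:phenotype_of. The pattern is undirected, so at this point a relationship matches regardless of which node it starts from.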
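To make the relationship part of the pattern concrete, here is a minimal Python sketch of how a predicate and its inverse could be joined into one undirected pattern. The `INVERSES` table and the `relationship_pattern` helper are hypothetical names for illustration only; they are not the library's API, and the real transpiler takes inverses from the biolink-model rather than a hard-coded table.

```python
# Hypothetical lookup table for illustration; the real library consults the biolink-model.
INVERSES = {
    "biolink:has_phenotype": "biolink:phenotype_of",
    "biolink:phenotype_of": "biolink:has_phenotype",
}


def relationship_pattern(variable, predicate):
    """Build an undirected Cypher relationship pattern for a predicate and its inverse."""
    types = [predicate]
    inverse = INVERSES.get(predicate)
    if inverse is not None:
        types.append(inverse)
    # Backticks are needed because the variable and the types contain spaces or colons.
    type_list = "|".join(f"`{t}`" for t in types)
    # No arrowhead on either side: direction is checked later, in the WHERE clause.
    return f"-[`{variable}`:{type_list}]-"


pattern = relationship_pattern("has phenotype", "biolink:has_phenotype")
print(pattern)
```

The printed pattern has the same shape as the relationship segment of the MATCH clause above.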

WHERE (
    (
        type(`has phenotype`) in ["biolink:has_phenotype"]
        AND startNode(`has phenotype`) = `diabetes`
    )
    OR (
        type(`has phenotype`) in ["biolink:phenotype_of"]
        AND startNode(`has phenotype`) = `phenotype`
    )
)

The WHERE clause retains only the triples of the form "diabetes has phenotype..." and "...phenotype of diabetes". Thus we properly capture both the requested predicate and its inverse.
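The effect of the direction check can be simulated in plain Python. This is only a sketch on made-up data: the `PHENO:*` identifiers are placeholders, and the `keep` function is a hypothetical stand-in for the WHERE clause, not code from the library.

```python
DIABETES = "MONDO:0005148"

# Toy relationships as (start node id, type, end node id).
# The PHENO:* identifiers are placeholders, not real CURIEs.
edges = [
    (DIABETES, "biolink:has_phenotype", "PHENO:1"),  # diabetes has phenotype ...
    ("PHENO:2", "biolink:phenotype_of", DIABETES),   # ... phenotype of diabetes
    ("PHENO:3", "biolink:has_phenotype", DIABETES),  # right type, wrong direction
    (DIABETES, "biolink:phenotype_of", "PHENO:4"),   # inverse type, wrong direction
]


def keep(edge):
    """Mirror the two branches of the WHERE clause for one matched relationship."""
    start, rel_type, end = edge
    # The undirected MATCH binds `phenotype` to whichever endpoint is not diabetes.
    phenotype = end if start == DIABETES else start
    forward = rel_type == "biolink:has_phenotype" and start == DIABETES
    inverse = rel_type == "biolink:phenotype_of" and start == phenotype
    return forward or inverse


kept = [edge for edge in edges if keep(edge)]
print(kept)
```

Only the first two relationships survive: the undirected MATCH would have accepted all four, and the WHERE clause discards the two whose direction contradicts their type.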

We have the right stuff now. Everything from here on is reshaping it into the TRAPI format.

WITH 
    {
        node_bindings: {
            `diabetes`: (CASE WHEN `diabetes` IS NOT NULL THEN [{id: `diabetes`.id}] ELSE [] END),
            `phenotype`: (CASE WHEN `phenotype` IS NOT NULL THEN [{id: `phenotype`.id}] ELSE [] END)
        },
        edge_bindings: {
            `has phenotype`: [ei IN collect(DISTINCT `has phenotype`.id) WHERE ei IS NOT null | {id: ei}]
        }
    } AS result,
    {
        nodes: collect(DISTINCT `diabetes`) + collect(DISTINCT `phenotype`),
        edges: collect(DISTINCT `has phenotype`)
    } AS knowledge_graph

The first WITH clause builds the individual results and compiles nodes and relationships into little per-result proto-knowledge graphs. The knowledge graph nodes and edges still need a lot of reformatting.

UNWIND knowledge_graph.nodes AS knode UNWIND knowledge_graph.edges AS kedge

The UNWIND clauses break all of the individual knodes and kedges out of each result so that we can later combine them into a single big knowledge graph. If this collect-unwind-collect procedure seems overcomplicated for this query, that's because it is. The complexity becomes necessary when a qnode has "is_set": true, because a single result can then bind multiple knowledge-graph nodes to one query-graph node.
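The collect-unwind-collect idea can be sketched in Python as merging per-result proto-knowledge graphs into one deduplicated graph. Everything here is an assumption made for illustration: the row layout, the `merge_knowledge_graphs` helper, and the placeholder `PHENO:*` and `e*` identifiers. The first row imitates an "is_set": true result that contributes several phenotype nodes at once. The sketch also ignores the row cross product that two consecutive UNWINDs create in Cypher, since DISTINCT collapses it anyway.

```python
# One entry per result row, each carrying its own small proto-knowledge graph.
# Identifiers other than MONDO:0005148 are placeholders.
rows = [
    {"nodes": ["MONDO:0005148", "PHENO:1", "PHENO:2"], "edges": ["e1", "e2"]},
    {"nodes": ["MONDO:0005148", "PHENO:2", "PHENO:3"], "edges": ["e2", "e3"]},
]


def merge_knowledge_graphs(rows):
    """Flatten per-row node and edge lists, keeping each element once."""
    nodes, edges = [], []
    for row in rows:
        for knode in row["nodes"]:   # UNWIND knowledge_graph.nodes AS knode
            if knode not in nodes:   # collect(DISTINCT knode)
                nodes.append(knode)
        for kedge in row["edges"]:   # UNWIND knowledge_graph.edges AS kedge
            if kedge not in edges:   # collect(DISTINCT kedge)
                edges.append(kedge)
    return {"nodes": nodes, "edges": edges}


merged = merge_knowledge_graphs(rows)
print(merged)
```

Nodes and edges shared between rows, such as the diabetes node, appear only once in the merged graph.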

WITH
    collect(DISTINCT result) AS results,
    {
        nodes: apoc.map.fromLists([n IN collect(DISTINCT knode) | n.id], [n IN collect(DISTINCT knode) | {
            categories: labels(n),
            name: n.name,
            attributes: [key in apoc.coll.subtract(keys(n), ["id", "category"]) | {
                original_attribute_name: key,
                attribute_type_id: COALESCE({publications: "EDAM:data_0971"}[key], "NA"),
                value: n[key]
            }]
        }]),
        edges: apoc.map.fromLists(
            [e IN collect(DISTINCT kedge) | e.id],
            [e IN collect(DISTINCT kedge) | {
                predicate: type(e),
                subject: startNode(e).id,
                object: endNode(e).id,
                attributes: [key in apoc.coll.subtract(keys(e), ["id", "predicate"]) | {
                    original_attribute_name: key,
                    attribute_type_id: COALESCE({publications: "EDAM:data_0971"}[key], "NA"),
                    value: e[key]
                }]
            }]
        )
    } AS knowledge_graph

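The final WITH clause collects everything into a single results list and a single knowledge graph, reformatting the nodes and edges as maps keyed by id and turning the remaining properties into TRAPI attributes. This is one place where the APOC library helps with list and map manipulation: apoc.map.fromLists builds a map from a list of keys and a list of values, and apoc.coll.subtract removes the already-handled property names. Some more complicated situations use APOC in other ways.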
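For readers more comfortable with Python than with APOC, the node half of this reshaping corresponds roughly to the sketch below. The input layout (a dict with `labels` and `properties`), the helper names, and the sample property values are assumptions for illustration; only the attribute keys and the `EDAM:data_0971` mapping come from the query above.

```python
ATTRIBUTE_TYPES = {"publications": "EDAM:data_0971"}


def to_attributes(properties, excluded):
    """Turn leftover properties into attribute maps, like the inner list comprehension."""
    return [
        {
            "original_attribute_name": key,
            # COALESCE(map[key], "NA") becomes dict.get with a default.
            "attribute_type_id": ATTRIBUTE_TYPES.get(key, "NA"),
            "value": value,
        }
        for key, value in properties.items()
        if key not in excluded  # plays the role of apoc.coll.subtract
    ]


def format_nodes(nodes):
    """Key the reformatted nodes by id, as apoc.map.fromLists(keys, values) does."""
    ids = [node["properties"]["id"] for node in nodes]
    values = [
        {
            "categories": node["labels"],
            "name": node["properties"].get("name"),
            "attributes": to_attributes(node["properties"], ("id", "category")),
        }
        for node in nodes
    ]
    return dict(zip(ids, values))


# Sample node with illustrative property values.
sample = {
    "labels": ["biolink:Disease"],
    "properties": {
        "id": "MONDO:0005148",
        "name": "type 2 diabetes mellitus",
        "category": ["biolink:Disease"],
        "publications": ["PMID:placeholder"],
    },
}
formatted = format_nodes([sample])
print(formatted)
```

As in the Cypher, only `id` and `category` are subtracted, so `name` shows up both as a top-level field and as an attribute with type "NA".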

What comes out of Neo4j is a TRAPI-compliant knowledge graph and results, so there is no need to reformat the output in Python.

Caveats

The reasoner transpiler makes assumptions about how the data are represented in Neo4j. Its assumptions are largely consistent with KGX standards for Neo4j, but are tuned specifically for the data structure used by Plater KPs.
