@wagenrace
Created April 8, 2022 06:47
Loading the GI50 of NCI60 into Neo4j
// CSV file can be downloaded here:
// https://wiki.nci.nih.gov/download/attachments/147193864/GI50.zip?version=2&modificationDate=1649214698000&api=v2
LOAD CSV WITH HEADERS FROM 'file:///GI50.csv' AS row
MERGE (chem:Chemical {nsc: toInteger(row.NSC)})
MERGE (cell:CellLine {name: row.CELL_NAME})
MERGE (dis:Disease {name: row.PANEL_NAME})
WITH chem, cell, dis, row
MERGE (chem)-[:GI50 {concentration: row.AVERAGE, research: "NCI60", unit: row.CONCENTRATION_UNIT, experiment_id: row.EXPID, count: row.COUNT}]->(cell)
MERGE (cell)-[:CELL_LINE_OF]->(dis);

jexp commented Apr 8, 2022

Neo4j Loading Improvements

What was missing:

  • constraints on the key properties
  • USING PERIODIC COMMIT to batch updates
  • the relationship MERGE matches on too many non-id properties
  • MERGE used instead of CREATE for data that is already unique
  • supernodes (dense nodes)
  • an Eager operator in the single-pass statement -> split it up into two or more import statements
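
The Eager operator can be spotted before running a big import by prefixing the statement with EXPLAIN (a sketch; the query below is the original single-pass statement from this gist):

```cypher
// EXPLAIN compiles the query and prints the plan without touching data.
// Look for an Eager operator in the output: it forces full materialization
// of intermediate rows and defeats USING PERIODIC COMMIT batching.
EXPLAIN
LOAD CSV WITH HEADERS FROM 'file:///GI50.csv' AS row
MERGE (chem:Chemical {nsc: toInteger(row.NSC)})
MERGE (cell:CellLine {name: row.CELL_NAME})
MERGE (dis:Disease {name: row.PANEL_NAME})
MERGE (chem)-[:GI50 {experiment_id: row.EXPID}]->(cell)
MERGE (cell)-[:CELL_LINE_OF]->(dis);
```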

The model is really suboptimal: it creates a lot of dense nodes with huge relationship counts.
It might be better to model Experiment as a node and connect it to Chemical and CellLine.
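
A rough sketch of that Experiment-node model (the :TESTS and :MEASURED_ON relationship types and the id property are illustrative assumptions, not taken from the dataset):

```cypher
// Hypothetical refactoring: one Experiment node per measurement instead of
// a property-heavy GI50 relationship between two dense nodes.
// Assumes EXPID identifies a single measurement; if one EXPID covers
// several cell lines, a composite key would be needed instead.
LOAD CSV WITH HEADERS FROM 'file:///GI50.csv' AS row
MERGE (chem:Chemical {nsc: toInteger(row.NSC)})
MERGE (cell:CellLine {name: row.CELL_NAME})
MERGE (exp:Experiment {id: row.EXPID})
  ON CREATE SET exp.concentration = row.AVERAGE,
                exp.unit = row.CONCENTRATION_UNIT,
                exp.research = "NCI60"
MERGE (exp)-[:TESTS]->(chem)
MERGE (exp)-[:MEASURED_ON]->(cell);
```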

Config

  • neo4j 4.4
  • 2G heap, 1G pagecache (is probably too much)

Fastest approach (80 seconds)


drop constraint chemical_nsc if exists;
drop constraint cell_line_name if exists;
drop constraint disease_name if exists;

return datetime();


LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
RETURN count(*), count(distinct row.EXPID);

LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
WITH distinct row.EXPID as experiment_id, row.NSC as nsc, row.CELL_NAME as cellName
RETURN count(*);

// fastest
USING PERIODIC COMMIT 100000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
WITH distinct row.CELL_NAME as cellName
CREATE (cell:CellLine {name: cellName});

USING PERIODIC COMMIT 100000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
WITH distinct row.PANEL_NAME as panelName
CREATE (dis:Disease {name: panelName});

USING PERIODIC COMMIT 100000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
WITH distinct toInteger(row.NSC) as nsc
CREATE (chem:Chemical {nsc: nsc});

create constraint chemical_nsc if not exists for (c:Chemical) require c.nsc is unique;
create constraint cell_line_name if not exists for (cl:CellLine) require cl.name is unique;
create constraint disease_name if not exists for (d:Disease) require d.name is unique;

USING PERIODIC COMMIT 100000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
WITH distinct row.CELL_NAME as cellName, row.PANEL_NAME as panelName
MATCH (cell:CellLine {name: cellName})
MATCH (dis:Disease {name: panelName})
CREATE (cell)-[:CELL_LINE_OF]->(dis);


USING PERIODIC COMMIT 100000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
MATCH (chem:Chemical {nsc: toInteger(row.NSC)})
MATCH (cell:CellLine {name: row.CELL_NAME})
CREATE (chem)-[r:GI50 {experiment_id: row.EXPID, concentration: row.AVERAGE, research: "NCI60", unit: row.CONCENTRATION_UNIT, count: row.COUNT}]->(cell)
;

return datetime();

Medium Fast Approach (30 mins)

create constraint chemical_nsc if not exists for (c:Chemical) require c.nsc is unique;
create constraint cell_line_name if not exists for (cl:CellLine) require cl.name is unique;
create constraint disease_name if not exists for (d:Disease) require d.name is unique;

// medium fast
USING PERIODIC COMMIT 50000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
MERGE (chem:Chemical {nsc: toInteger(row.NSC)})
MERGE (cell:CellLine {name: row.CELL_NAME})
MERGE (chem)-[r:GI50 {experiment_id: row.EXPID}]->(cell)
ON CREATE SET r += {concentration: row.AVERAGE, research: "NCI60", unit: row.CONCENTRATION_UNIT, count: row.COUNT}
;


// medium fast
USING PERIODIC COMMIT 50000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
WITH distinct row.CELL_NAME as cellName, row.PANEL_NAME as panelName
MATCH (cell:CellLine {name: cellName})
MERGE (dis:Disease {name: panelName})
MERGE (cell)-[:CELL_LINE_OF]->(dis);

Original Approach


/* the original was missing constraints and periodic commit, and has an Eager operation
explain
USING PERIODIC COMMIT 10000
LOAD CSV  WITH HEADERS FROM 'file:///GI50.csv' AS row 
MERGE (chem:Chemical {nsc: toInteger(row.NSC)})
MERGE (cell:CellLine {name: row.CELL_NAME})
MERGE (dis:Disease {name: row.PANEL_NAME})
WITH chem, cell, dis, row
MERGE (chem)-[r:GI50 {experiment_id: row.EXPID}]->(cell)
ON CREATE SET r += {concentration: row.AVERAGE, research: "NCI60", unit: row.CONCENTRATION_UNIT, count: row.COUNT}
MERGE (cell)-[:CELL_LINE_OF]->(dis);
*/

@wagenrace (Author)

Thank you, I will look into it. I clearly have something to learn about Cypher and Neo4j.


jexp commented Apr 8, 2022

Sorry, it should be better out of the box. IMHO the Browser shows warnings for missing constraints/indexes.

The modeling question still stands though, i.e. whether it would make sense to model Experiment as a node.

@wagenrace (Author)

@jexp Thank you for your help. I got it to work (you need :auto before USING PERIODIC COMMIT in the Desktop version); this is extremely fast compared to what I was doing.
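
For reference, the :auto prefix in Browser/Desktop looks like this (a sketch using the cell-line statement from above):

```cypher
// In Neo4j Browser/Desktop, :auto runs the statement as an auto-commit
// transaction, which USING PERIODIC COMMIT requires.
:auto USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM 'file:///GI50.csv' AS row
WITH distinct row.CELL_NAME as cellName
CREATE (cell:CellLine {name: cellName});
```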

I was busy creating a benchmark between Redis and Neo4j, where Neo4j is 15x slower than Redis with my naive query. Is that noteworthy?
With normal MATCH queries they perform the same, though.

https://github.com/wagenrace/medical_data_blog/tree/bench_mark/Adding_NCI60/bench_mark

@wagenrace (Author)

I think creating Experiment nodes would make sense. I stayed away from these questions until I have more experience to set it up more clearly. I want to add graph data science (similarity and community detection) first and then revisit the problem with more experience.


jexp commented Apr 8, 2022

I don't think it makes sense to create that benchmark, except perhaps as a description of a learning experience, but I guess you have better things to spend your time on. (We have docs/courses on GraphAcademy that explain these things.)

You might have even used our data import tool http://data-importer.graphapp.io/

(Screenshot of the Neo4j Data Importer, 2022-04-08)


jexp commented Apr 8, 2022

(Screenshot of the import model, 2022-04-08)

If you want to try it, here is a model + CSV file that can be loaded from the "..." menu at the top right.
https://drive.google.com/file/d/1EPFRGPhvDSoE9hKGa05-iPHq2fztBQxd/view?usp=sharing
