LOAD CSV into Neo4j quickly and successfully

Since version 2.1, Neo4j provides out-of-the-box support for CSV ingestion. The LOAD CSV command that was added to the Cypher query language is a versatile and powerful ETL tool. It allows you to ingest CSV data from any URL into a friendly parameter stream for your simple or complex graph update operation.

But hear my words of advice before you jump directly into using it. There are some tweaks and configuration aspects that you should know to be successful on the first run.

Data volume: LOAD CSV was built to support around 1M rows per import; it still works with 10M rows, but you’ll have to wait a bit, and at 100M it’ll try your patience. Except for tiny datasets, never run it without the safeguard of periodic commits, which prevent large transactions from overflowing your available database memory (JVM heap).

The CSV used in this example is pretty basic, but enough to show some issues and make a point: it contains people and the companies they work(ed) for.

PersonName,"Company Name",year
"Kenny Bastani","Neo Technology",2013
"Michael Hunger","Neo Technology",2010
"James Ward","Heroku",2011
"Someone",,
"John","Doe.com","ninetynine"

OS, RAM and DISK

First make sure the machine you want to import on is fit to import the data. That means you should have some RAM available: if your 8GB machine only sports 1GB of free RAM and already uses 3GB of swap, this is a sure sign you shouldn’t try. The more RAM the better; I’d recommend reserving at least 4GB to import your data.

If you run on SSDs you’re lucky; on spinning disks it will definitely take longer.

Note
If you are on Windows take extra care. On Windows the database memory (JVM heap) is used for BOTH Neo4j’s objects & caches AND the memory-mapping of the storage files. On Unix and OSX the memory-mapping resides outside of the heap. So make sure that on Windows you have enough Java memory allocated to cater for both aspects.

Memory Config

For your memory configuration, a sane starting point is to give Neo4j 3-4GB RAM as Java heap (including the memory mapping size on Windows), and a sensible amount (e.g. 1GB+) as memory mapping. The more memory you have the faster it will import your data.

So make sure to edit conf/neo4j-wrapper.conf and set:

# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

In conf/neo4j.properties set:

# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

Data Loading

If you load your CSV file over the network, make sure it is fast enough to sustain the ingestion rate you’d like to have. Otherwise, if possible, download it and use a file:// URL.

Note
File URLs are tricky. On OSX and Unix use file:///path/to/data.csv; on Windows, please use file:c:/path/to/data.csv. Beware of spaces in file names and of relative paths. HTTP URLs are much easier. As far as I know, LOAD CSV supports cookies, redirects and https.
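For example, a quick sanity check on a downloaded file (a minimal sketch; the path /tmp/data.csv is a hypothetical stand-in for wherever you saved the download):

LOAD CSV FROM "file:///tmp/data.csv"
  AS line
WITH line LIMIT 5
RETURN line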

Data Quality

There are some challenges when loading CSV data from "organic" files.

  1. You might not load the right file. On a local filesystem Cypher will complain if it can’t find it, but it will happily ingest your latest movie trilogy while trying to find the first comma. For remote URLs it will fail if the server returns a 404, but otherwise it will try to load HTML error pages, which might be fun.

  2. If you have the right file, make sure that it loads correctly, i.e. all the columns and lines are separated by the delimiters you expect. You can easily check that by sampling the first few lines and seeing that they are returned correctly.

Initial check of the first few lines

Does everything look ok, anything that looks dubious? Fix it in your input data!

LOAD CSV FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
  AS line
WITH line LIMIT 4
RETURN line

Caveats are: a leading blank line, wrong or doubled line breaks.

Check for correct columns

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
  AS line
WITH line LIMIT 5
RETURN line.PersonName, line.Company, line.company, line.`Company Name`

A leading empty line will break WITH HEADERS.

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line LIMIT 5
RETURN line.PersonName, line.Company, line.company, line.`Company Name`

This is better, but there are still some misspelled column names.

Note
Misspelled column names (they are case sensitive) and empty columns, which are treated as null

You can skip null values by adding WHERE line.value IS NOT NULL after the LOAD CSV .. WITH line.

If you don’t, you might end up with errors in Neo4j’s indexing subsystem trying to index null values.

If you want, you can alternatively provide defaults with coalesce(value_or_null, value2_or_null, ..., default-value).

Filter out NULL values

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line

WHERE line.PersonName IS NOT NULL

RETURN line.PersonName, coalesce(line.Company,"None")

Data Conversion

Cypher reads all CSV columns as strings by default. You can use conversion functions like toInt(str) and toFloat(str), boolean expressions, split(str, " ") and substring(str,start,count) (e.g. for extracting day, month and year from a date-string). Note that the conversion functions return null if they can’t convert, so use the null handling shown above to handle unclean data.

Note: Make sure to use the same conversion for the same value in all places, both when creating nodes and relationships and when using the CSV data to look them up again. A helpful tip is to do the conversion upfront with WITH toInt(line.number) as number.

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line

WITH line, toInt(line.year) as year
RETURN line.PersonName, year
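The same upfront-conversion pattern works for date-strings too. A minimal sketch, assuming a hypothetical CSV with a date column in YYYY-MM-DD format (the sample data above only has a year):

LOAD CSV WITH HEADERS FROM "file:///tmp/dated.csv"
  AS line
WITH line, split(line.date,"-") as parts
RETURN toInt(parts[0]) as year, toInt(parts[1]) as month, toInt(parts[2]) as day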

Field Separation

You can specify alternative field separators for your data. For a tab-separated file (.tsv) you can use ... AS line FIELDTERMINATOR '\t', for semicolons ... AS line FIELDTERMINATOR ';'.
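For example, a quick sample check on the tab-separated variant of our file (this is the people.tsv used again in the final import below):

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/1bd8c19bf8b49d9eb7149918cc11a34faf996dd8/people.tsv"
  AS line FIELDTERMINATOR '\t'
WITH line LIMIT 5
RETURN line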

Batch Your Transactions

This is really important

If you import more than 100k elements in a single transaction, it is very likely (depending on your available memory) that you’ll fail. And it might not be a quick death of your operation; it can drag on for quite a while, desperately trying to recover memory to store its intermediate transaction data.

So make sure to ALWAYS prefix your LOAD CSV with USING PERIODIC COMMIT 1000. The number given is the number of input rows after which a commit of the imported data happens. Depending on the complexity of your import operation, you might create from 100 elements per 1000 rows (if you have a lot of duplicates) up to 100,000 elements, when complex operations generate up to 100 nodes and relationships per row of input. That’s why a commit size of 1000 might be a safe bet (or not).

There is also an issue within Neo4j’s index operations that makes it work better with smaller commit sizes. If you use LOAD CSV without any create or update operation, you cannot use PERIODIC COMMIT. If you use it from within an embedded Java application, make sure to not start a manual transaction beforehand.

Batch Transactions after every 1000 Rows

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
CREATE (:Person {name:line.PersonName})

Indexing

Usually you import your nodes and relationships in one step, creating connections as you process each line. As most CSV files represent denormalized tables, you’ll get a lot of duplication in them, especially for the joined entity tables (which will probably be converted to nodes). So you want to use either MERGE or MATCH to look up nodes in your graph database, to connect or update them in a later step.

Make sure you create the necessary indexes and constraints upfront, so that both operations can utilize them and look up values really quickly.

Creating Indexes and Constraints Upfront

This example shows a multi-step import where people are created upfront from one source, and then later on only looked up to connect them to merged Companies.

CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
CREATE (p:Person {name:line.PersonName});

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
MATCH (p:Person {name:line.PersonName})
MERGE (c:Company {name:line.`Company Name`})
CREATE (p)-[:WORKS_FOR]->(c);

Test First

Use the same sampling approach as before, but now take only the first 10k or 100k lines and import them, as sketched below. Try importing into a throwaway test database with the neo4j-shell (see below). If that goes well, remove the added data or clean the database (by deleting the db directory).
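Such a sampled test run could look like this (a sketch reusing the people CSV from above; 10k rows fit comfortably into a single transaction, so no periodic commit is needed for the sample):

LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/c5f98f269b4a219ebb88bd3eb83d331f439a8201/people.csv"
  AS line
WITH line LIMIT 10000
CREATE (:Person {name:line.PersonName})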

Use the Neo4j-Shell for larger Imports

The Neo4j Browser is great for quick queries and for playing around with your import data, but if you really want to import millions of rows, go with the neo4j-shell.

If you downloaded the zip distribution of Neo4j, the shell can be found in bin/neo4j-shell (bin\Neo4jShell.bat on Windows). By default it connects to a running server but you can also specify a dedicated database directory with the -path people.db parameter. With -config conf/neo4j.properties you’d use the same config as the Neo4j server for that db-directory.

For importing lots of data you should probably edit the shell script, setting the line EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G" to provide sensible Java heap settings. You can add your import statements (including index creation) to a Cypher script and execute it with -file import.cql. Don’t forget the semicolons at the end of your statements.
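A minimal import.cql sketch (the file name and the local CSV path are hypothetical):

CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///tmp/people.csv"
  AS line
CREATE (:Person {name:line.PersonName});

You would then run it with something like bin/neo4j-shell -path test.db -config conf/neo4j.properties -file import.cql (compare the real-world example at the end).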

If you run into errors, you might try export STACKTRACES=true and re-run the command to have a good error message to share when asking on StackOverflow.

The Import Query in all its Beauty

Clean out the database for my final import. (The LIMIT batches the delete into manageable transactions; run it repeatedly until no nodes are left.)

MATCH (n)
WITH n LIMIT 10000
OPTIONAL MATCH (n)-[r]->()
DELETE n,r;

CREATE CONSTRAINT ON (c:Company) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/1bd8c19bf8b49d9eb7149918cc11a34faf996dd8/people.tsv"
  AS line FIELDTERMINATOR '\t'

WITH line, toInt(line.year) as year, coalesce(line.`Company Name`,"None") as company

WHERE year IS NOT NULL

MERGE (p:Person {name:line.PersonName})
MERGE (c:Company {name:company})
CREATE (p)-[r:WORKS_FOR {since:year}]->(c)
RETURN p,r,c

Ready to go?

That’s it. If you hit any issue I haven’t covered here, please don’t hesitate to reach out to me, either by commenting below or by dropping me an email at michael at neo4j.org.

Real World Example

I just went to http://www.mockaroo.com to generate a CSV sample: a last_name and country tuple. I generated 1M lines of them (18MB of data) and put them in my public Dropbox folder.

The data looks like this.

last_name,country
Fuller,Saint Vincent and the Grenadines
Lynch,Israel
Crawford,Iceland
Fowler,Belgium
... 1M lines ...
Walker,Mali
Wilson,Turkey
Hart,Saint Vincent and the Grenadines
Fowler,Nigeria

Then I checked the CSV as outlined above, and used bin/neo4j-shell -path people.db -config conf/neo4j.properties -file import-names.cql

CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
CREATE INDEX ON :Person(name);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/name_country.csv"
  AS line

WITH line
WHERE line.last_name IS NOT NULL and line.country IS NOT NULL

MERGE (p:Person {name:line.last_name})
MERGE (c:Country {name:line.country})
CREATE (p)-[r:LIVES_IN]->(c);

The output is shown here; you can also look at the full log below.

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 499
Relationships created: 1000000
Properties set: 499
Labels added: 499
119200 ms

The Full Log

$ bin/neo4j-shell -path people.db -config conf/neo4j.properties
NOTE: Local Neo4j graph database service at 'people.db'
Welcome to the Neo4j Shell! Enter 'help' for a list of commands
neo4j-sh (?)$ LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/name_country.csv"
> AS line
> return count(*);
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row
10320 ms
neo4j-sh (?)$ LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/name_country.csv" as line with line limit 10 return line;
+------------------------------------------------------------------------+
| line |
+------------------------------------------------------------------------+
| {last_name -> "Fuller", country -> "Saint Vincent and the Grenadines"} |
| {last_name -> "Lynch", country -> "Israel"} |
| {last_name -> "Crawford", country -> "Iceland"} |
| {last_name -> "Fowler", country -> "Belgium"} |
| {last_name -> "Larson", country -> "Panama"} |
| {last_name -> "Jenkins", country -> "Singapore"} |
| {last_name -> "Burke", country -> "Netherlands Antilles"} |
| {last_name -> "Simmons", country -> "Isle of Man"} |
| {last_name -> "Martinez", country -> "Bahamas"} |
| {last_name -> "Berry", country -> "Guadeloupe"} |
+------------------------------------------------------------------------+
10 rows
830 ms
neo4j-sh (?)$ CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
+-------------------+
| No data returned. |
+-------------------+
Constraints added: 1
98 ms
neo4j-sh (?)$ CREATE INDEX ON :Person(name);
+-------------------+
| No data returned. |
+-------------------+
Indexes added: 1
8 ms
neo4j-sh (?)$ schema await
neo4j-sh (?)$ schema
Indexes
ON :Country(name) ONLINE (for uniqueness constraint)
ON :Person(name) ONLINE
Constraints
ON (country:Country) ASSERT country.name IS UNIQUE
neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/14493611/name_country.csv"
> AS line
>
> WITH line
> WHERE line.last_name IS NOT NULL and line.country IS NOT NULL
>
> MERGE (p:Person {name:line.last_name})
> MERGE (c:Country {name:line.country})
> CREATE (p)-[r:LIVES_IN]->(c);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 499
Relationships created: 1000000
Properties set: 499
Labels added: 499
119200 ms