Skip to content

Instantly share code, notes, and snippets.

@pbamotra
Forked from jexp/product_hierarchies.adoc
Created June 27, 2017 22:22
Show Gist options
  • Save pbamotra/b2798fa0a62a71e90ab6d5d8dacff870 to your computer and use it in GitHub Desktop.
Save pbamotra/b2798fa0a62a71e90ab6d5d8dacff870 to your computer and use it in GitHub Desktop.
Google Merchant Product Hierarchies Import

Importing a CSV of Categories in one Pass

I enjoyed reading it and then wondered if we can’t import it in a single run, when we treat the CSV file just as collection of columns which we can access using Cypher’s collection functions.

Having downloaded and installed and started, Neo4j, let’s download the CSV data, so we have it available locally.

curl https://gist.githubusercontent.com/rvanbruggen/adaac7415eb42912e950/raw/taxonomy.csv -o /tmp/taxonomy.csv

Then we can use the neo4j-shell to import the data. Just start it with bin/neo4j-shell and paste the statements as you go.

Data Checking

How many rows?
load csv with headers from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
return count(*);
Check the Data
load csv with headers from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
return row limit 3;
Look at the bare rows
load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
// skip header
with row skip 1
return row limit 3;

Convert CSV into simpler format

We treat the row as collection of strings this time. First col is id, then names up to first null

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
// skip header
with row skip 1
return toInt(row[0]) as id,
// keep non null columns
     filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
limit 3;

Duplicate Checking

Let’s check for duplicate names by taking the last non-null category-name, aggregating on it, counting the occurrences and reporting those that have more than one.

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
with row skip 1
with filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
// split into two WITHs for readability
with names[-1] as name, count(*) as c
where c > 1
return name,c;

Import

Now let’s set up a constraint on :Cat(id) and an index on :Cat(name) and create the "Root" category.

create constraint on (c:Cat) assert c.id is unique;
create index on :Cat(name);
create (:Cat {id:0,name:"Root"});

My colleague Mark Needham had the idea to convert he full denormalized tree into the minimal data we need to create the hiearchy.

We’re only interested in the leaf which we have to create with id and name. And then connect it to its direct parent, the rest of the hierarchy was already created in previous rows.

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
with toInt(row[0]) as id,
     filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
skip 1 {import-limit}
WITH id,
     coalesce(names[-2],"Root") as parentName,
	 names[-1] as childName
MERGE (child:Cat {id:id}) ON CREATE SET child.name = childName
WITH child,parentName
MATCH (parent:Cat {name:parentName})
CREATE (parent)<-[:PART_OF]-(child);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment