@nrouyer
Last active April 16, 2016 14:10
= Open Food Facts
:neo4j-version: 2.3.2
:author: Nicolas Rouyer
:toc: right
:twitter: @rrrouyer
:description: Open Food Facts
:tags: domain:open data, use-case:open food facts
This interactive Neo4j graph tutorial shows how to work with Open Food Facts data, for the good of your health!
'''
:toc: left
'''
[[introduction]]
== Open food facts
image::http://static.openfoodfacts.org/images/misc/openfoodfacts-logo-en-178x150.png[Open Food Facts]
Open Food Facts is a free, open database of food products.
It gathers information and data on food products from around the world.
The database is built by individual contributors worldwide, who scan product barcodes and upload pictures of product labels.
http://fr.openfoodfacts.org/
[[graph_creation]]
=== Creating the Open Food Facts graph
[source,cypher]
----
// OPEN FOOD FACTS - CREATE INDEX ON PRODUCT CODE
CREATE INDEX ON :Product(code);
// OPEN FOOD FACTS - CREATE INDEX ON INGREDIENT FOOD
CREATE INDEX ON :Ingredient(food);
// OPEN FOOD FACTS - LOAD PRODUCT NODES
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/nrouyer/fdcea6bbb5ea8e3377fb2ec3139b0c17/raw/f93de61779cccd74e9eb94566a6efc3358b00db1/off_products_163.csv" AS csvLine
FIELDTERMINATOR ";"
CREATE (p:Product {
  code: csvLine.code,
  name: coalesce(csvLine.name,"NA"),
  sodiumPer100g: coalesce(csvLine.sodiumPer100g,"NA"),
  fatPer100g: coalesce(csvLine.fatPer100g,"NA"),
  proteinsPer100g: coalesce(csvLine.proteinsPer100g,"NA"),
  nutritionScoreFrPer100g: coalesce(csvLine.nutritionScoreFrPer100g,"NA"),
  energyPer100g: coalesce(csvLine.energyPer100g,"NA"),
  fiberPer100g: coalesce(csvLine.fiberPer100g,"NA"),
  sugarsPer100g: coalesce(csvLine.sugarsPer100g,"NA"),
  saltPer100g: coalesce(csvLine.saltPer100g,"NA"),
  nutritionScoreUkPer100g: coalesce(csvLine.nutritionScoreUkPer100g,"NA")
});
// LOAD INGREDIENTS
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/nrouyer/40f6b8d87f7f239f5a0f62e7756f8879/raw/1cc542d70a1bc1829d2643eb02d046f733545bb8/off_ingredients_163.csv" AS csvLine
FIELDTERMINATOR ';'
MERGE (i:Ingredient { food: csvLine.Ingredient });
// LOAD COMPOSITION RELATIONSHIPS
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/nrouyer/8cc54359a569d5df445f8fa1066f2daa/raw/ecc608e6b59c971db448a4fd59c62e14c21dd0cc/off_composition_163.csv" AS csvLine
FIELDTERMINATOR ';'
MATCH (p:Product { code: csvLine.code })
MATCH (i:Ingredient { food: csvLine.food })
MERGE (p)-[:CONTAINS { rank: coalesce(csvLine.rank,"NA") }]->(i);
----
The graph data is now loaded!
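
As a quick sanity check, the query below counts what the load script created. It is only a sketch: it assumes the three `LOAD CSV` statements above ran without error, and the column aliases `products` and `compositionLinks` are made up for illustration. A product that matched no ingredient still counts as a product, thanks to `OPTIONAL MATCH`.
[source,cypher]
----
// OPEN FOOD FACTS - SANITY CHECK (illustrative, not part of the original load script)
MATCH (p:Product)
// OPTIONAL MATCH keeps products that have no CONTAINS relationship
OPTIONAL MATCH (p)-[c:CONTAINS]->(:Ingredient)
RETURN count(DISTINCT p) AS products, count(c) AS compositionLinks
----
If `compositionLinks` comes back as 0, the `code` or `food` values in the composition file probably did not match the nodes created earlier.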
'''
[[graph_consultation]]
=== Soda ingredient war: Pepsi vs 7Up
As a warm-up, let us compare the composition of Pepsi and 7Up, two sodas whose tastes are radically different.
[source,cypher]
----
// OPEN FOOD FACTS - GET 7UP INGREDIENTS SHORT NAME
MATCH (p:Product {name:'7Up'})-[:CONTAINS]->(i:Ingredient)
WITH i, SPLIT(i.food, '/') AS Ingredients
RETURN Ingredients[4] AS Ingredient
----
[source,cypher]
----
// OPEN FOOD FACTS - GET PEPSI INGREDIENTS SHORT NAME
MATCH (p:Product {name:'Pepsi, Nouveau goût !'})-[:CONTAINS]->(i:Ingredient)
WITH i, SPLIT(i.food, '/') AS Ingredients
RETURN Ingredients[4] AS Ingredient
----
[source,cypher]
----
// OPEN FOOD FACTS - GET INGREDIENTS COMMON TO PEPSI & 7UP
MATCH (p1:Product {name:'7Up'})-[:CONTAINS]->(i:Ingredient)
MATCH (p2:Product {name:'Pepsi, Nouveau goût !'})-[:CONTAINS]->(i)
RETURN i.food AS Ingredient
----
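The complementary question, which ingredients appear in 7Up but not in Pepsi, can be sketched with a negated pattern in `WHERE`. This assumes the same two product names used above exist in the loaded data; swapping `p1` and `p2` in the last two clauses gives the ingredients found only in Pepsi.
[source,cypher]
----
// OPEN FOOD FACTS - INGREDIENTS IN 7UP BUT NOT IN PEPSI (illustrative sketch)
MATCH (p1:Product {name:'7Up'}), (p2:Product {name:'Pepsi, Nouveau goût !'})
MATCH (p1)-[:CONTAINS]->(i:Ingredient)
// keep only the ingredients that Pepsi does not contain
WHERE NOT (p2)-[:CONTAINS]->(i)
RETURN i.food AS Ingredient
----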
'''
[[graph_food_neighbours]]
=== My neighbourfood
With Cypher we can easily query the food data model and find the closest neighbours of any given product, that is, the products that share the most ingredients with it.
[source,cypher]
----
// OPEN FOOD FACTS - CLOSEST NEIGHBOURS (2)
MATCH (p1:Product {name: 'Chair à saucisse'} )-[c1:CONTAINS]->(i:Ingredient)<-[c2:CONTAINS]-(p2:Product)
RETURN p2.name AS Neighbour, collect(i.food) AS Ingredients_In_Common, count(i.food) AS STRENGTH
ORDER BY STRENGTH DESC
----
[[graph_refactoring]]
=== Refactoring OFF graph
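Let us make a cosmetic change to our Open Food Facts graph by storing each ingredient's short name (the segment at index 4 of its `food` value once split on `/`) in a new `shortname` property: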
[source,cypher]
----
MATCH (i:Ingredient)
WITH i, SPLIT(i.food, '/') AS Ingredients
SET i.shortname = Ingredients[4]
----
Then we query the closest neighbours again, with a better-formatted result.
[source,cypher]
----
MATCH (p1:Product {name: 'Chair à saucisse'} )-[c1:CONTAINS]->(i:Ingredient)<-[c2:CONTAINS]-(p2:Product)
RETURN p2.name AS Neighbour, collect(i.shortname) AS Ingredients_In_Common, count(i.food) AS STRENGTH
ORDER BY STRENGTH DESC
----
[[shortest_food_path]]
=== Find shortest path between products
Let us pick two food products at random. Can the shortest path between them reveal anything?
[source,cypher]
----
// OPEN FOOD FACTS - SHORTEST PATH
MATCH (rollmops:Product {name:"Rollmop Herrings"}),
(macncheese:Product {code:"00036559"}),
p =(rollmops)-[:CONTAINS*1..6]-(macncheese)
WHERE ANY(x IN NODES(p) WHERE x:Ingredient)
WITH p ORDER BY LENGTH(p) LIMIT 1
RETURN p
----
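The query above enumerates every path of up to 6 hops and keeps the shortest one. A possible alternative, sketched here, uses Cypher's built-in `shortestPath()` function, which stops at the first shortest path it finds. It assumes that `CONTAINS` relationships only ever link a product to an ingredient, as in the load script above, so any path between two products necessarily passes through an ingredient and the `ANY(...)` filter is no longer needed. The `hops` alias is illustrative.
[source,cypher]
----
// OPEN FOOD FACTS - SHORTEST PATH WITH shortestPath() (illustrative alternative)
MATCH (rollmops:Product {name:"Rollmop Herrings"}),
      (macncheese:Product {code:"00036559"}),
      p = shortestPath((rollmops)-[:CONTAINS*..6]-(macncheese))
RETURN p, length(p) AS hops
----
If the two products are more than 6 hops apart, or not connected at all, this returns no rows, just like the original query.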
'''
[[conclusion]]
=== Let's feed the food graph...
This great open database helps us gain insight into what we eat every day. It was created to bring more transparency and to share knowledge universally. +
image::http://static.openfoodfacts.org/images/svg/crowdsourcing-icon.svg[Yes we scan !!!]
Excellent analyses of the whole database are available on https://www.kaggle.com/[Kaggle] ("The Home of Data Science"). +
Please enjoy and send your remarks to: +
mailto:rouyer.nicolas@gmail.com[Nicolas ROUYER]