Skip to content

Instantly share code, notes, and snippets.

@JavierCane
Last active August 29, 2015 14:01
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save JavierCane/30efe33818fc70276ad2 to your computer and use it in GitHub Desktop.
Save JavierCane/30efe33818fc70276ad2 to your computer and use it in GitHub Desktop.
CBDE - Neo4j for the TPC-H structure
= CBDE - Pràctica Neo4j
:neo4j-version: 2.0.3
:author: Martí Zamora Casals & Javier Ferrer González
:twitter: @JavierCane
== Disclaimer
Pràctica feta per en Martí Zamora Casals i en https://twitter.com/JavierCane[Javier Ferrer González]. +
Hem utilitzat l'eina http://gist.neo4j.org/?github-neo4j-contrib%2Fgists%2F%2Fmeta%2FHowTo.adoc[GraphGist] per comoditat i prestacions. Concretament, el GraphGist d'aquest Gist http://gist.neo4j.org/?30efe33818fc70276ad2[es pot veure aquí]. +
Hem fet recerca d'operacions al web de la documentació de Neo4j. Per exemple pel cas del http://docs.neo4j.org/chunked/stable/query-with.html[_WITH_].
Per últim, hem pogut donar estil aquest document gràcies a http://asciidoctor.org/docs/asciidoc-syntax-quick-reference/[AsciiDoc].
== Disseny de l'estructura
El disseny l'hem fet tenint en compte que els joins a les bases de dades de grafs són barats, però s'ha d'anar alerta amb el nombre de nodes que s'està tractant a cada moment. +
Per tant, les queries intenten ser el màxim de selectives ja des del començament.
Per fer-ho, hem aplicat les següents mesures:
* Hem mantingut els nodes de regions que van bé per evitar comparacions a les queries 2 i 4.
* També hem mantingut els nodes nation que són útils per la restricció sobre nacions de la query 4.
* Hem afegit un node Marketsegment que permet fer la selecció de la query 3 molt menys costosa.
* Finalment hem posat la informació de part-supplier en una aresta.
== Queries
=== Inicialització del sistema
// setup
// hide
[source,cypher]
----
// Regions.
CREATE ( Europe: Region { r_name: "Europe" } )
CREATE ( Asia: Region { r_name: "Asia" } )
// Nations.
CREATE ( Spain: Nation { n_name: "Spain" } ) <-[:CONTAINS]- ( Europe )
CREATE ( France: Nation { n_name: "France" } ) <-[:CONTAINS]- ( Europe )
CREATE ( China: Nation { n_name: "China" } ) <-[:CONTAINS]- ( Asia )
// Suppliers.
CREATE ( MediaMarkt: Supplier { s_acctbal: 10.1, s_name: "Media Markt", s_address: "C/ Fals 1234", s_phone: "+34 93 123 45 67", s_comment: "Spanish supplier"} ) <-[:HAS_SUP]- ( Spain )
CREATE ( Fnac: Supplier { s_acctbal: 20.2, s_name: "Fnac", s_address: "C/ False 4321", s_phone: "+33 5 12 34 56 78", s_comment: "French supplier" } ) <-[:HAS_SUP]- ( France )
CREATE ( Rakuten: Supplier { s_acctbal: 30.3, s_name: "Rakuten", s_address: "C/ Chiguon 9876", s_phone: "+86 123 4567 8901", s_comment: "Chinese supplier" } ) <-[:HAS_SUP]- ( China )
// Parts.
CREATE ( MacBookAir: Part { p_partKey: 1, p_mfgr: "AAA", p_size: 10, p_type: "Laptop" } )
CREATE ( MacBookPro: Part { p_partKey: 2, p_mfgr: "AAA", p_size: 999999, p_type: "Laptop Multipurpose" } )
CREATE ( DellInspiron: Part { p_partKey: 3, p_mfgr: "AAA", p_size: 10, p_type: "Standard Laptop" } )
CREATE ( DellXps: Part { p_partKey: 4, p_mfgr: "AAA", p_size: 10, p_type: "Laptop" } )
// Line Items.
// Line Items Order 1.
CREATE ( MediaMarktMacBookAirO1: LineItem { l_returnflag: "A", l_linestatus: "A", l_quantity: 1, l_extendedprice: 1.1, l_discount: 0.1, l_tax: 0.21, l_shipdate: 1333333333335 } )
CREATE ( MediaMarktMacBookProO1: LineItem { l_returnflag: "A", l_linestatus: "A", l_quantity: 10, l_extendedprice: 10.1, l_discount: 0.1, l_tax: 0.21, l_shipdate: 1333333333332 } )
CREATE ( FnacDellInspironO1: LineItem { l_returnflag: "A", l_linestatus: "B", l_quantity: 100, l_extendedprice: 100.1, l_discount: 0.1, l_tax: 0.21, l_shipdate: 1333333333333 } )
// Line Items Order 2.
CREATE ( FnacDellXpsO2: LineItem { l_returnflag: "B", l_linestatus: "A", l_quantity: 2, l_extendedprice: 2.2, l_discount: 0.2, l_tax: 0.21, l_shipdate: 1333333333334 } )
// Line Items Order 3.
CREATE ( FnacMacBookAirO3: LineItem { l_returnflag: "B", l_linestatus: "B", l_quantity: 30, l_extendedprice: 30.3, l_discount: 0.3, l_tax: 0.21, l_shipdate: 1333333333335 } )
CREATE ( FnacMacBookProO3: LineItem { l_returnflag: "B", l_linestatus: "B", l_quantity: 300, l_extendedprice: 300.3, l_discount: 0.3, l_tax: 0.21, l_shipdate: 1333333333336 } )
// Line Items Order 4.
CREATE ( RakutenDellInspironO4: LineItem { l_returnflag: "C", l_linestatus: "C", l_quantity: 4, l_extendedprice: 400.4, l_discount: 0.4, l_tax: 0.21, l_shipdate: 1444444444446 } )
// Orders.
CREATE ( O1: Order { o_orderkey: 1, o_orderdate: 1222222222221, o_shippriority: 1 } ) <-[:HAS_ORDER]- ( Spain )
CREATE ( O2: Order { o_orderkey: 2, o_orderdate: 1222222222222, o_shippriority: 2 } ) <-[:HAS_ORDER]- ( France )
CREATE ( O3: Order { o_orderkey: 3, o_orderdate: 1999999999991, o_shippriority: 4 } ) <-[:HAS_ORDER]- ( France )
CREATE ( O4: Order { o_orderkey: 4, o_orderdate: 1222222222221, o_shippriority: 4 } ) <-[:HAS_ORDER]- ( China )
// Marketing Segments.
CREATE ( OrganicSeo: MktSegment { m_name: "organic_seo" } )
CREATE ( Mailing: MktSegment { m_name: "mailing" } )
// Suppliers -SUPPLY-> Parts relations.
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookAir )
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookPro )
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellInspiron )
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellXps )
CREATE ( Fnac ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellInspiron )
CREATE ( Fnac ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookAir )
CREATE ( Fnac ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookPro )
CREATE ( Rakuten ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellInspiron )
// Suppliers -SUPPLIES-> LineItem relations.
CREATE ( MediaMarkt ) -[:SUPPLIES]-> ( MediaMarktMacBookAirO1 )
CREATE ( MediaMarkt ) -[:SUPPLIES]-> ( MediaMarktMacBookProO1 )
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacDellInspironO1 )
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacDellXpsO2 )
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacMacBookAirO3 )
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacMacBookProO3 )
CREATE ( Rakuten ) -[:SUPPLIES]-> ( RakutenDellInspironO4 )
// Orders -HAS-> LineItem relations.
CREATE ( O1 ) -[:HAS]-> ( MediaMarktMacBookAirO1 )
CREATE ( O1 ) -[:HAS]-> ( MediaMarktMacBookProO1 )
CREATE ( O1 ) -[:HAS]-> ( FnacDellInspironO1 )
CREATE ( O2 ) -[:HAS]-> ( FnacDellXpsO2 )
CREATE ( O3 ) -[:HAS]-> ( FnacMacBookAirO3 )
CREATE ( O3 ) -[:HAS]-> ( FnacMacBookProO3 )
CREATE ( O4 ) -[:HAS]-> ( RakutenDellInspironO4 )
// Marketing Segments -DOES-> Orders relations.
CREATE ( OrganicSeo ) -[:DOES]-> ( O1 )
CREATE ( OrganicSeo ) -[:DOES]-> ( O3 )
CREATE ( OrganicSeo ) -[:DOES]-> ( O4 )
CREATE ( Mailing ) -[:DOES]-> ( O2 )
----
Després de fer la inicialització del sistema, el graph resultant es aquest:
// graph
=== Query 1
L'únic a destacar d'aquesta query seria el fet de com Neo4j fa l'equivalent al +GROUP BY+ de SQL. +
Bàsicament tenim que especificar les funcions d'agregació dins de la clàusula addicional +WITH+ per a què, automàticament, faci l'agrupació.
==== Query SQL:
[source]
----
SELECT
l_returnflag,
l_linestatus,
sum( l_quantity ) as sum_qty,
sum( l_extendedprice ) as sum_base_price,
sum( l_extendedprice * ( 1 - l_discount ) ) as sum_disc_price,
sum( l_extendedprice * ( 1 - l_discount ) * ( 1 + l_tax ) ) as sum_charge,
avg( l_quantity ) as avg_qty,
avg( l_extendedprice ) as avg_price,
avg( l_discount ) as avg_disc,
count( * ) as count_order
FROM
lineitem
WHERE
l_shipdate <= '[date]'
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
----
==== Query Cypher:
[source,cypher]
----
MATCH
( li:LineItem )
WHERE
li.l_shipdate <= 1333333333336
WITH
li.l_returnflag AS l_returnflag,
li.l_linestatus AS l_linestatus,
SUM( li.l_quantity ) AS sum_qty,
SUM( li.l_extendedprice ) AS sum_base_price,
SUM( li.l_extendedprice * ( 1 - li.l_discount ) ) AS sum_disc_price,
SUM( li.l_extendedprice * ( 1 - li.l_discount ) * ( 1 + li.l_tax ) ) AS sum_charge,
AVG( li.l_quantity ) AS avg_qty,
AVG( li.l_extendedprice ) AS avg_price,
AVG( li.l_discount ) AS avg_disc,
COUNT( * ) AS count_order
RETURN
l_returnflag,
l_linestatus,
sum_qty,
sum_base_price,
sum_disc_price,
sum_charge,
avg_qty,
avg_price,
avg_disc,
count_order
ORDER BY
l_returnflag,
l_linestatus;
----
Resultats:
// table
=== Query 2
El que podríem destacar d'aquesta segona query, seria el fet que a Neo4j, es pot ficar una expressió regular a la clàusula +WHERE+ gràcies a l'operand +=~+. +
A més, aquí també s'introdueix el concepte de múltiples clàusules +MATCH+-+WHERE+ anidades a una mateixa consulta. +
També caldria destacar el fet que al primer +MATCH+ estem fent el filtratge de la +Region+ d'inici que ens interessa, d'aquesta manera suposem que serà més òptim.
* p_partKey 2: Es queda fora per +p_type+ = "Laptop Multipurpose"
* +RakutenDellInspironO4+ es queda fora perquè la relació +Suppliers -SUPPLY-> Parts+ es amb el +Supplier+ "Rakuten", que és de la +Region+ != "Europe"
==== Query SQL:
[source]
----
SELECT
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
FROM
part,
JOIN partsupp ON p_partkey = ps_partkey,
JOIN supplier ON s_suppkey = ps_suppkey,
JOIN nation ON s_nationkey = n_nationkey,
JOIN region ON n_regionkey = r_regionkey
WHERE
p_size = [SIZE]
AND p_type LIKE '%[TYPE]'
AND r_name = '[REGION]'
AND ps_supplycost = (
SELECT
MIN( ps_supplycost )
FROM
partsupp,
JOIN supplier ON s_suppkey = ps_suppkey,
JOIN nation ON s_nationkey = n_nationkey,
JOIN region ON n_regionkey = r_regionkey
WHERE
p_partkey = ps_partkey
AND r_name = '[REGION]'
)
ORDER BY
s_acctbal DESC,
n_name,
s_name,
p_partkey;
----
==== Query Cypher:
[source,cypher]
----
MATCH
( p:Part { p_size : 10 } ) <-[ps:SUPPLY]- ( :Supplier ) <-[:HAS_SUP]- (n) <-[:CONTAINS]- ( r:Region { r_name: "Europe" } )
WHERE
p.p_type =~ ".*Laptop$"
WITH
p,
MIN( ps.ps_supplycost ) AS minps
MATCH
(p) <-[ps:SUPPLY { ps_supplycost : minps }]- ( s:Supplier ) <-[:HAS_SUP]- (n) <-[:CONTAINS]- ( r:Region { r_name:"Europe" } )
RETURN
s.s_acctbal,
s.s_name,
n.n_name,
p.p_partKey,
p.p_mfgr,
s.s_address,
s.s_phone,
s.s_comment
ORDER BY
s.s_acctbal DESC,
n.n_name,
s.s_name,
p.p_partkey;
----
Resultats:
// table
=== Query 3
En aquesta query primer accedim als nodes de +MktSegment+, per tant el nombre de comparacions serà baix perquè no esperem que hi hagi gaires +MktSegment+. +
Llavors obtenim les +Order+ i les filtrem per +o_orderdate+. +
Després obtenim els +LineItem+ i també els filtrem per +l_shipdate+. +
Al final fem l'agrupació i retornem els valors ordenats. +
* o_orderkey 1: Entra
* o_orderkey 2: Es queda fora per +c_mktsegment+ = "mailing"
* o_orderkey 3: Es queda fora per +o_orderdate+ = 1999999999991
* o_orderkey 4: Entra
==== Query SQL:
[source]
----
SELECT
l_orderkey,
SUM( l_extendedprice * ( 1 - l_discount ) ) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
JOIN orders ON c_custkey = o_custkey,
JOIN lineitem ON l_orderkey = o_orderkey
WHERE
c_mktsegment = '[SEGMENT]'
AND o_orderdate < '[DATE1]'
AND l_shipdate > '[DATE2]'
GROUP BY
l_orderkey,
o_orderdate,
o_shippriority
ORDER BY
revenue DESC,
o_orderdate;
----
==== Query Cypher:
[source,cypher]
----
MATCH
( m:MktSegment { m_name:"organic_seo" } ) - [:DOES] -> ( o:Order )
WHERE
o.o_orderdate < 1222222222229
MATCH
(o) - [:HAS] -> ( li:LineItem )
WHERE
li.l_shipdate > 1333333333330
WITH
o,
SUM( li.l_extendedprice * ( 1 - li.l_discount ) ) AS revenue
RETURN
o.o_orderkey,
revenue,
o.o_orderdate,
o.o_shippriority
ORDER BY
revenue DESC,
o.o_orderdate;
----
Resultats:
// table
=== Query 4
En aquest cas entrem per +Region+ i obtenim les +Nation+ d'aquesta. +
Per a cada +Nation+ 'n', n'obtenim les +Order+, després filtrem les +Order+ per +o_orderdate+. +
Obtenim els +LineItem+ de les +Order+ filtrades i llavors obtenim els +Supplier+ d'aquests +LineItem+ assegurant-nos que pertanyen a la +Nation+ 'n'. +
Finalment fem l'agrupació i retornem els resultats ordenats. +
* o_orderkey 1: Entra
* o_orderkey 2: Entra
* o_orderkey 3: Es queda fora per +o_orderdate+ = 1999999999991
* o_orderkey 4: Es queda fora per +r_name+ ( +Order <-[:HAS_ORDER]- Region+ ) != "Europe"
==== Query SQL:
[source]
----
SELECT
n_name,
SUM( l_extendedprice * ( 1 - l_discount ) ) AS revenue
FROM
customer,
JOIN orders ON c_custkey = o_custkey,
JOIN lineitem ON l_orderkey = o_orderkey,
JOIN supplier ON l_suppkey = s_suppkey,
JOIN nation ON s_nationkey = n_nationkey,
JOIN region ON n_regionkey = r_regionkey
WHERE
c_nationkey = s_nationkey
AND r_name = '[REGION]'
AND o_orderdate >= date '[DATE]'
AND o_orderdate < date '[DATE]' + interval '1' year
GROUP BY
n_name
ORDER BY
revenue DESC;
----
==== Query Cypher:
[source,cypher]
----
MATCH
( r:Region { r_name:"Europe" } ) -[:CONTAINS]-> ( n:Nation ) -[:HAS_ORDER]-> ( o:Order )
WHERE
o.o_orderdate >= 1222222222220
AND o.o_orderdate < 1253758222000
MATCH
(o) -[:HAS]-> ( l:LineItem ) <-[:SUPPLIES]- ( s:Supplier ) <-[:HAS_SUP]- ( n )
WITH
n,
SUM( l.l_extendedprice * ( 1 - l.l_discount ) ) AS revenue
RETURN
n.n_name,
revenue
ORDER BY
revenue DESC;
----
Resultats:
// table
== Consola interactiva
Aquesta és una utilitat que proporciona l'eina GraphGist i que pot estar bé si es vol continuar jugant amb aquest joc de proves: +
//console
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment