Last active
August 29, 2015 14:01
-
-
Save JavierCane/30efe33818fc70276ad2 to your computer and use it in GitHub Desktop.
CBDE - Neo4j for the TPC-H structure
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
= CBDE - Pràctica Neo4j | |
:neo4j-version: 2.0.3 | |
:author: Martí Zamora Casals & Javier Ferrer González | |
:twitter: @JavierCane | |
== Disclaimer | |
Pràctica feta per en Martí Zamora Casals i en https://twitter.com/JavierCane[Javier Ferrer González]. + | |
Hem utilitzat l'eina http://gist.neo4j.org/?github-neo4j-contrib%2Fgists%2F%2Fmeta%2FHowTo.adoc[GraphGist] per comoditat i prestacions. Concretament, el GraphGist d'aquest Gist http://gist.neo4j.org/?30efe33818fc70276ad2[es pot veure aquí]. + | |
Hem fet recerca d'operacions al web de la documentació de Neo4j. Per exemple pel cas del http://docs.neo4j.org/chunked/stable/query-with.html[_WITH_]. | |
Per últim, hem pogut donar estil aquest document gràcies a http://asciidoctor.org/docs/asciidoc-syntax-quick-reference/[AsciiDoc]. | |
== Disseny de l'estructura | |
El disseny l'hem fet tenint en compte que els joins a les bases de dades de grafs són barats, però s'ha d'anar alerta amb el nombre de nodes que s'està tractant a cada moment. + | |
Per tant, les queries intenten ser el màxim de selectives ja des del començament. | |
Per fer-ho, hem aplicat les següents mesures: | |
* Hem mantingut els nodes de regions que van bé per evitar comparacions a les queries 2 i 4. | |
* També hem mantingut els nodes nation que són útils per la restricció sobre nacions de la query 4. | |
* Hem afegit un node Marketsegment que permet fer la selecció de la query 3 molt menys costosa. | |
* Finalment hem posat la informació de part-supplier en una aresta. | |
== Queries | |
=== Inicialització del sistema | |
// setup | |
// hide | |
[source,cypher] | |
---- | |
// Regions. | |
CREATE ( Europe: Region { r_name: "Europe" } ) | |
CREATE ( Asia: Region { r_name: "Asia" } ) | |
// Nations. | |
CREATE ( Spain: Nation { n_name: "Spain" } ) <-[:CONTAINS]- ( Europe ) | |
CREATE ( France: Nation { n_name: "France" } ) <-[:CONTAINS]- ( Europe ) | |
CREATE ( China: Nation { n_name: "China" } ) <-[:CONTAINS]- ( Asia ) | |
// Suppliers. | |
CREATE ( MediaMarkt: Supplier { s_acctbal: 10.1, s_name: "Media Markt", s_address: "C/ Fals 1234", s_phone: "+34 93 123 45 67", s_comment: "Spanish supplier"} ) <-[:HAS_SUP]- ( Spain ) | |
CREATE ( Fnac: Supplier { s_acctbal: 20.2, s_name: "Fnac", s_address: "C/ False 4321", s_phone: "+33 5 12 34 56 78", s_comment: "French supplier" } ) <-[:HAS_SUP]- ( France ) | |
CREATE ( Rakuten: Supplier { s_acctbal: 30.3, s_name: "Rakuten", s_address: "C/ Chiguon 9876", s_phone: "+86 123 4567 8901", s_comment: "Chinese supplier" } ) <-[:HAS_SUP]- ( China ) | |
// Parts. | |
CREATE ( MacBookAir: Part { p_partKey: 1, p_mfgr: "AAA", p_size: 10, p_type: "Laptop" } ) | |
CREATE ( MacBookPro: Part { p_partKey: 2, p_mfgr: "AAA", p_size: 999999, p_type: "Laptop Multipurpose" } ) | |
CREATE ( DellInspiron: Part { p_partKey: 3, p_mfgr: "AAA", p_size: 10, p_type: "Standard Laptop" } ) | |
CREATE ( DellXps: Part { p_partKey: 4, p_mfgr: "AAA", p_size: 10, p_type: "Laptop" } ) | |
// Line Items. | |
// Line Items Order 1. | |
CREATE ( MediaMarktMacBookAirO1: LineItem { l_returnflag: "A", l_linestatus: "A", l_quantity: 1, l_extendedprice: 1.1, l_discount: 0.1, l_tax: 0.21, l_shipdate: 1333333333335 } ) | |
CREATE ( MediaMarktMacBookProO1: LineItem { l_returnflag: "A", l_linestatus: "A", l_quantity: 10, l_extendedprice: 10.1, l_discount: 0.1, l_tax: 0.21, l_shipdate: 1333333333332 } ) | |
CREATE ( FnacDellInspironO1: LineItem { l_returnflag: "A", l_linestatus: "B", l_quantity: 100, l_extendedprice: 100.1, l_discount: 0.1, l_tax: 0.21, l_shipdate: 1333333333333 } ) | |
// Line Items Order 2. | |
CREATE ( FnacDellXpsO2: LineItem { l_returnflag: "B", l_linestatus: "A", l_quantity: 2, l_extendedprice: 2.2, l_discount: 0.2, l_tax: 0.21, l_shipdate: 1333333333334 } ) | |
// Line Items Order 3. | |
CREATE ( FnacMacBookAirO3: LineItem { l_returnflag: "B", l_linestatus: "B", l_quantity: 30, l_extendedprice: 30.3, l_discount: 0.3, l_tax: 0.21, l_shipdate: 1333333333335 } ) | |
CREATE ( FnacMacBookProO3: LineItem { l_returnflag: "B", l_linestatus: "B", l_quantity: 300, l_extendedprice: 300.3, l_discount: 0.3, l_tax: 0.21, l_shipdate: 1333333333336 } ) | |
// Line Items Order 4. | |
CREATE ( RakutenDellInspironO4: LineItem { l_returnflag: "C", l_linestatus: "C", l_quantity: 4, l_extendedprice: 400.4, l_discount: 0.4, l_tax: 0.21, l_shipdate: 1444444444446 } ) | |
// Orders. | |
CREATE ( O1: Order { o_orderkey: 1, o_orderdate: 1222222222221, o_shippriority: 1 } ) <-[:HAS_ORDER]- ( Spain ) | |
CREATE ( O2: Order { o_orderkey: 2, o_orderdate: 1222222222222, o_shippriority: 2 } ) <-[:HAS_ORDER]- ( France ) | |
CREATE ( O3: Order { o_orderkey: 3, o_orderdate: 1999999999991, o_shippriority: 4 } ) <-[:HAS_ORDER]- ( France ) | |
CREATE ( O4: Order { o_orderkey: 4, o_orderdate: 1222222222221, o_shippriority: 4 } ) <-[:HAS_ORDER]- ( China ) | |
// Marketing Segments. | |
CREATE ( OrganicSeo: MktSegment { m_name: "organic_seo" } ) | |
CREATE ( Mailing: MktSegment { m_name: "mailing" } ) | |
// Suppliers -SUPPLY-> Parts relations. | |
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookAir ) | |
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookPro ) | |
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellInspiron ) | |
CREATE ( MediaMarkt ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellXps ) | |
CREATE ( Fnac ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellInspiron ) | |
CREATE ( Fnac ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookAir ) | |
CREATE ( Fnac ) -[:SUPPLY { ps_supplycost: 2 }]-> ( MacBookPro ) | |
CREATE ( Rakuten ) -[:SUPPLY { ps_supplycost: 2 }]-> ( DellInspiron ) | |
// Suppliers -SUPPLIES-> LineItem relations. | |
CREATE ( MediaMarkt ) -[:SUPPLIES]-> ( MediaMarktMacBookAirO1 ) | |
CREATE ( MediaMarkt ) -[:SUPPLIES]-> ( MediaMarktMacBookProO1 ) | |
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacDellInspironO1 ) | |
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacDellXpsO2 ) | |
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacMacBookAirO3 ) | |
CREATE ( Fnac ) -[:SUPPLIES]-> ( FnacMacBookProO3 ) | |
CREATE ( Rakuten ) -[:SUPPLIES]-> ( RakutenDellInspironO4 ) | |
// Orders -HAS-> LineItem relations. | |
CREATE ( O1 ) -[:HAS]-> ( MediaMarktMacBookAirO1 ) | |
CREATE ( O1 ) -[:HAS]-> ( MediaMarktMacBookProO1 ) | |
CREATE ( O1 ) -[:HAS]-> ( FnacDellInspironO1 ) | |
CREATE ( O2 ) -[:HAS]-> ( FnacDellXpsO2 ) | |
CREATE ( O3 ) -[:HAS]-> ( FnacMacBookAirO3 ) | |
CREATE ( O3 ) -[:HAS]-> ( FnacMacBookProO3 ) | |
CREATE ( O4 ) -[:HAS]-> ( RakutenDellInspironO4 ) | |
// Marketing Segments -DOES-> Orders relations. | |
CREATE ( OrganicSeo ) -[:DOES]-> ( O1 ) | |
CREATE ( OrganicSeo ) -[:DOES]-> ( O3 ) | |
CREATE ( OrganicSeo ) -[:DOES]-> ( O4 ) | |
CREATE ( Mailing ) -[:DOES]-> ( O2 ) | |
---- | |
Després de fer la inicialització del sistema, el graph resultant es aquest: | |
// graph | |
=== Query 1 | |
L'únic a destacar d'aquesta query seria el fet de com Neo4j fa l'equivalent al +GROUP BY+ de SQL. + | |
Bàsicament tenim que especificar les funcions d'agregació dins de la clàusula addicional +WITH+ per a què, automàticament, faci l'agrupació. | |
==== Query SQL: | |
[source] | |
---- | |
SELECT | |
l_returnflag, | |
l_linestatus, | |
sum( l_quantity ) as sum_qty, | |
sum( l_extendedprice ) as sum_base_price, | |
sum( l_extendedprice * ( 1 - l_discount ) ) as sum_disc_price, | |
sum( l_extendedprice * ( 1 - l_discount ) * ( 1 + l_tax ) ) as sum_charge, | |
avg( l_quantity ) as avg_qty, | |
avg( l_extendedprice ) as avg_price, | |
avg( l_discount ) as avg_disc, | |
count( * ) as count_order | |
FROM | |
lineitem | |
WHERE | |
l_shipdate <= '[date]' | |
GROUP BY | |
l_returnflag, | |
l_linestatus | |
ORDER BY | |
l_returnflag, | |
l_linestatus; | |
---- | |
==== Query Cypher: | |
[source,cypher] | |
---- | |
MATCH | |
( li:LineItem ) | |
WHERE | |
li.l_shipdate <= 1333333333336 | |
WITH | |
li.l_returnflag AS l_returnflag, | |
li.l_linestatus AS l_linestatus, | |
SUM( li.l_quantity ) AS sum_qty, | |
SUM( li.l_extendedprice ) AS sum_base_price, | |
SUM( li.l_extendedprice * ( 1 - li.l_discount ) ) AS sum_disc_price, | |
SUM( li.l_extendedprice * ( 1 - li.l_discount ) * ( 1 + li.l_tax ) ) AS sum_charge, | |
AVG( li.l_quantity ) AS avg_qty, | |
AVG( li.l_extendedprice ) AS avg_price, | |
AVG( li.l_discount ) AS avg_disc, | |
COUNT( * ) AS count_order | |
RETURN | |
l_returnflag, | |
l_linestatus, | |
sum_qty, | |
sum_base_price, | |
sum_disc_price, | |
sum_charge, | |
avg_qty, | |
avg_price, | |
avg_disc, | |
count_order | |
ORDER BY | |
l_returnflag, | |
l_linestatus; | |
---- | |
Resultats: | |
// table | |
=== Query 2 | |
El que podríem destacar d'aquesta segona query, seria el fet que a Neo4j, es pot ficar una expressió regular a la clàusula +WHERE+ gràcies a l'operand +=~+. + | |
A més, aquí també s'introdueix el concepte de múltiples clàusules +MATCH+-+WHERE+ anidades a una mateixa consulta. + | |
També caldria destacar el fet que al primer +MATCH+ estem fent el filtratge de la +Region+ d'inici que ens interessa, d'aquesta manera suposem que serà més òptim. | |
* p_partKey 2: Es queda fora per +p_type+ = "Laptop Multipurpose" | |
* +RakutenDellInspironO4+ es queda fora perquè la relació +Suppliers -SUPPLY-> Parts+ es amb el +Supplier+ "Rakuten", que és de la +Region+ != "Europe" | |
==== Query SQL: | |
[source] | |
---- | |
SELECT | |
s_acctbal, | |
s_name, | |
n_name, | |
p_partkey, | |
p_mfgr, | |
s_address, | |
s_phone, | |
s_comment | |
FROM | |
part, | |
JOIN partsupp ON p_partkey = ps_partkey, | |
JOIN supplier ON s_suppkey = ps_suppkey, | |
JOIN nation ON s_nationkey = n_nationkey, | |
JOIN region ON n_regionkey = r_regionkey | |
WHERE | |
p_size = [SIZE] | |
AND p_type LIKE '%[TYPE]' | |
AND r_name = '[REGION]' | |
AND ps_supplycost = ( | |
SELECT | |
MIN( ps_supplycost ) | |
FROM | |
partsupp, | |
JOIN supplier ON s_suppkey = ps_suppkey, | |
JOIN nation ON s_nationkey = n_nationkey, | |
JOIN region ON n_regionkey = r_regionkey | |
WHERE | |
p_partkey = ps_partkey | |
AND r_name = '[REGION]' | |
) | |
ORDER BY | |
s_acctbal DESC, | |
n_name, | |
s_name, | |
p_partkey; | |
---- | |
==== Query Cypher: | |
[source,cypher] | |
---- | |
MATCH | |
( p:Part { p_size : 10 } ) <-[ps:SUPPLY]- ( :Supplier ) <-[:HAS_SUP]- (n) <-[:CONTAINS]- ( r:Region { r_name: "Europe" } ) | |
WHERE | |
p.p_type =~ ".*Laptop$" | |
WITH | |
p, | |
MIN( ps.ps_supplycost ) AS minps | |
MATCH | |
(p) <-[ps:SUPPLY { ps_supplycost : minps }]- ( s:Supplier ) <-[:HAS_SUP]- (n) <-[:CONTAINS]- ( r:Region { r_name:"Europe" } ) | |
RETURN | |
s.s_acctbal, | |
s.s_name, | |
n.n_name, | |
p.p_partKey, | |
p.p_mfgr, | |
s.s_address, | |
s.s_phone, | |
s.s_comment | |
ORDER BY | |
s.s_acctbal DESC, | |
n.n_name, | |
s.s_name, | |
p.p_partkey; | |
---- | |
Resultats: | |
// table | |
=== Query 3 | |
En aquesta query primer accedim als nodes de +MktSegment+, per tant el nombre de comparacions serà baix perquè no esperem que hi hagi gaires +MktSegment+. + | |
Llavors obtenim les +Order+ i les filtrem per +o_orderdate+. + | |
Després obtenim els +LineItem+ i també els filtrem per +l_shipdate+. + | |
Al final fem l'agrupació i retornem els valors ordenats. + | |
* o_orderkey 1: Entra | |
* o_orderkey 2: Es queda fora per +c_mktsegment+ = "mailing" | |
* o_orderkey 3: Es queda fora per +o_orderdate+ = 1999999999991 | |
* o_orderkey 4: Entra | |
==== Query SQL: | |
[source] | |
---- | |
SELECT | |
l_orderkey, | |
SUM( l_extendedprice * ( 1 - l_discount ) ) AS revenue, | |
o_orderdate, | |
o_shippriority | |
FROM | |
customer, | |
JOIN orders ON c_custkey = o_custkey, | |
JOIN lineitem ON l_orderkey = o_orderkey | |
WHERE | |
c_mktsegment = '[SEGMENT]' | |
AND o_orderdate < '[DATE1]' | |
AND l_shipdate > '[DATE2]' | |
GROUP BY | |
l_orderkey, | |
o_orderdate, | |
o_shippriority | |
ORDER BY | |
revenue DESC, | |
o_orderdate; | |
---- | |
==== Query Cypher: | |
[source,cypher] | |
---- | |
MATCH | |
( m:MktSegment { m_name:"organic_seo" } ) - [:DOES] -> ( o:Order ) | |
WHERE | |
o.o_orderdate < 1222222222229 | |
MATCH | |
(o) - [:HAS] -> ( li:LineItem ) | |
WHERE | |
li.l_shipdate > 1333333333330 | |
WITH | |
o, | |
SUM( li.l_extendedprice * ( 1 - li.l_discount ) ) AS revenue | |
RETURN | |
o.o_orderkey, | |
revenue, | |
o.o_orderdate, | |
o.o_shippriority | |
ORDER BY | |
revenue DESC, | |
o.o_orderdate; | |
---- | |
Resultats: | |
// table | |
=== Query 4 | |
En aquest cas entrem per +Region+ i obtenim les +Nation+ d'aquesta. + | |
Per a cada +Nation+ 'n', n'obtenim les +Order+, després filtrem les +Order+ per +o_orderdate+. + | |
Obtenim els +LineItem+ de les +Order+ filtrades i llavors obtenim els +Supplier+ d'aquests +LineItem+ assegurant-nos que pertanyen a la +Nation+ 'n'. + | |
Finalment fem l'agrupació i retornem els resultats ordenats. + | |
* o_orderkey 1: Entra | |
* o_orderkey 2: Entra | |
* o_orderkey 3: Es queda fora per +o_orderdate+ = 1999999999991 | |
* o_orderkey 4: Es queda fora per +r_name+ ( +Order <-[:HAS_ORDER]- Region+ ) != "Europe" | |
==== Query SQL: | |
[source] | |
---- | |
SELECT | |
n_name, | |
SUM( l_extendedprice * ( 1 - l_discount ) ) AS revenue | |
FROM | |
customer, | |
JOIN orders ON c_custkey = o_custkey, | |
JOIN lineitem ON l_orderkey = o_orderkey, | |
JOIN supplier ON l_suppkey = s_suppkey, | |
JOIN nation ON s_nationkey = n_nationkey, | |
JOIN region ON n_regionkey = r_regionkey | |
WHERE | |
c_nationkey = s_nationkey | |
AND r_name = '[REGION]' | |
AND o_orderdate >= date '[DATE]' | |
AND o_orderdate < date '[DATE]' + interval '1' year | |
GROUP BY | |
n_name | |
ORDER BY | |
revenue DESC; | |
---- | |
==== Query Cypher: | |
[source,cypher] | |
---- | |
MATCH | |
( r:Region { r_name:"Europe" } ) -[:CONTAINS]-> ( n:Nation ) -[:HAS_ORDER]-> ( o:Order ) | |
WHERE | |
o.o_orderdate >= 1222222222220 | |
AND o.o_orderdate < 1253758222000 | |
MATCH | |
(o) -[:HAS]-> ( l:LineItem ) <-[:SUPPLIES]- ( s:Supplier ) <-[:HAS_SUP]- ( n ) | |
WITH | |
n, | |
SUM( l.l_extendedprice * ( 1 - l.l_discount ) ) AS revenue | |
RETURN | |
n.n_name, | |
revenue | |
ORDER BY | |
revenue DESC; | |
---- | |
Resultats: | |
// table | |
== Consola interactiva | |
Aquesta és una utilitat que proporciona l'eina GraphGist i que pot estar bé si es vol continuar jugant amb aquest joc de proves: + | |
//console |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment