kissrobber/graph_database_memo.md

## graph_database_memo.md

      
## What is so good about graph databases?


### Isn't an RDBMS good enough?

Read at least one book before deciding.

https://neo4j.com/book-graph-databases/

There are many factors, but the key decision points are roughly these:

http://www.allthingsdistributed.com/2015/08/titan-graphdb-integration-in-dynamodb.html

In this way, graphs can scale to billions of vertices and edges, while allowing efficient queries and traversal of any subset of the graph with consistent low latency that doesn’t grow proportionally to the overall graph size. This is an important benefit for many use cases that involve accessing and traversing small subsets of a large graph. A concrete example is generating a product recommendation based on purchase interests of a user’s friends, where the relevant social connections are a small subset of the total network. Another example is for tracking inventory in a vast logistics system,  where only a subset of its locations is relevant for a specific item.

https://www.sitepoint.com/why-you-should-use-neo4j-in-your-next-ruby-app/#comment-2689399402

Why is this great? Imagine a world with no foreign keys. Each entity in your database can have many relationships referring directly to other entities. If you want to explore the relationships there are no table or index scans, just a few connections to follow. This matches up well with the typical object model. It is more powerful, though, because Neo4j, while providing a lot of the database functionality that we expect, gives us tools to query for complex patterns in our data.


My typical answer is that something like a database where you're doing logging of lots of repetitive data is usually a better fit for on RDMS (or even Mongo since you're wouldn't generally have foreign keys there). For a graph database obviously things that are already graphy are on the other side of the spectrum (e.g. social networks or hierarchical structures)


Another typical answer I've seen is that you'd want Neo4j for when your data has a lot of relationships. I've found, though, that relationships start coming from places that you don't expect when you have the ability to create them easily ;)
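The quotes above describe queries that follow a few direct connections instead of scanning tables. A minimal sketch of that idea, assuming a toy in-memory graph (the names `FRIENDS`, `PURCHASES`, and `recommend` are hypothetical, and Python dict lookups merely stand in for the direct references a graph store would keep):

```python
from collections import Counter

# Hypothetical toy graph: each user keeps a direct list of neighbours.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice", "dave"],
    "dave": ["carol"],
}
PURCHASES = {
    "alice": ["book"],
    "bob": ["book", "lamp"],
    "carol": ["lamp", "tent"],
    "dave": ["kayak"],
}

def recommend(user):
    """Rank products bought by the user's friends that the user does not own."""
    owned = set(PURCHASES.get(user, []))
    counts = Counter()
    # Only the user's own neighbourhood is visited, never the whole graph.
    for friend in FRIENDS.get(user, []):
        for product in PURCHASES.get(friend, []):
            if product not in owned:
                counts[product] += 1
    return [product for product, _ in counts.most_common()]

print(recommend("alice"))
```

The work done by `recommend` depends on how many friends and purchases the user has, not on how many users exist in total, which is the property the first quote emphasizes.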

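## Among the many graph databases, Neo4j looked best to me

I briefly looked into both Neo4j and Titan, another graph database, and concluded that Neo4j is the better choice.
(Of course, the right choice depends on the use case.)

### Neo4j

Setting the hard parts aside, just install it first: it comes with an extremely well-made tutorial, so try it out for yourself.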

https://neo4j.com/download/

After that, reading the following should give you a rough overall picture:

https://neo4j.com/developer/get-started/

### Titan

http://titan.thinkaurelius.com/

What attracted me most is that it seems to be officially supported on AWS (?).

See the following for details:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.html

The following article may also work well as an introduction to graph databases in general, not just to Titan:

https://blogs.aws.amazon.com/bigdata/post/Tx12NN92B1F5K0C/Building-a-Graph-Database-on-AWS-Using-Amazon-DynamoDB-and-Titan

### Still, I think Neo4j is the better choice

The main reason is the data structure.

Titan's data structure:

http://s3.thinkaurelius.com/docs/titan/current/data-model.html

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.BestPractices.html

Neo4j's data structure:

https://neo4j.com/book-graph-databases/

Chapter 6 of this book, "Native Graph Storage", covers it in detail.

Read both, and (the rest is left unsaid).

### Other notes

How capable is Enterprise Clustering?

https://neo4j.com/customers/

LinkedIn, among others, uses it.

Neo4j hosting services:

https://neo4j.com/developer/guide-cloud-deployment/

## Trying Neo4j with Rails

http://neo4jrb.readthedocs.io/en/7.0.x/

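If you have built an app with Rails before, this should be easy to follow, but I will jot down a few things that caught my attention.


Once you have created the Rails app, I recommend enabling pretty_logged_cypher_queries first, since it also helps you learn Cypher.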

http://neo4jrb.readthedocs.io/en/7.0.x/Configuration.html


In a graph database, unlike an RDBMS, relationships are also first-class objects,
so in contrast to ActiveRecord
there are both ActiveNode and ActiveRel (for relationships).


http://neo4jrb.readthedocs.io/en/7.0.x/ActiveNode.html

I see: inheritance is represented in Neo4j by attaching multiple labels.
(For labels, see http://neo4jrb.readthedocs.io/en/7.0.x/Introduction.html#terminology)
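As a rough illustration of "inheritance as multiple labels", the sketch below derives a label list from a class hierarchy and builds a Cypher `CREATE` pattern from it. This is not how neo4jrb is implemented; the classes `Node`, `Person`, and `Student` and both helper functions are hypothetical, and the label order is an arbitrary choice.

```python
class Node:
    """Hypothetical base class standing in for a mapped node model."""

class Person(Node):
    pass

class Student(Person):
    pass

def labels_for(cls):
    """Collect the class and its model ancestors as labels (Node/object excluded)."""
    return [c.__name__ for c in cls.__mro__ if c not in (Node, object)]

def create_statement(cls):
    """Build a Cypher CREATE pattern that carries every label."""
    return "CREATE (n:" + ":".join(labels_for(cls)) + ")"

print(create_statement(Student))
```

Because a node created this way carries both labels, a query such as `MATCH (n:Person)` would also match students, which is what makes the label set behave like a class hierarchy.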


http://neo4jrb.readthedocs.io/en/7.0.x/ActiveNode.html#eager-loading

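with_associations is the counterpart of eager_load in ActiveRecord.

Without it, associations are loaded automatically in a preload-like way. (This seems nice, but it also feels a bit presumptuous.)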


http://neo4jrb.readthedocs.io/en/7.0.x/Querying.html

See the "Chaining associations" section.

has_many associations can be chained, as in student.lessons.teachers. (Naturally, this runs as a single query.)
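To show why a chain of associations can end up as a single query, here is a minimal sketch of a lazy query proxy that only records hops and emits one Cypher `MATCH` at the end. The `Chain` class, the relationship types `ENROLLED_IN` and `TAUGHT_BY`, and the generated query text are all assumptions for illustration; the Cypher that neo4jrb actually generates will differ.

```python
class Chain:
    """Hypothetical lazy proxy: records hops and emits one Cypher query."""

    def __init__(self, label, hops=()):
        self.label = label
        self.hops = tuple(hops)

    def hop(self, rel_type, target_label):
        # Return a new proxy instead of running a query for this step.
        return Chain(self.label, self.hops + ((rel_type, target_label),))

    def to_cypher(self):
        pattern = f"(n0:{self.label})"
        for i, (rel_type, target) in enumerate(self.hops, start=1):
            pattern += f"-[:{rel_type}]->(n{i}:{target})"
        return f"MATCH {pattern} RETURN n{len(self.hops)}"

query = Chain("Student").hop("ENROLLED_IN", "Lesson").hop("TAUGHT_BY", "Teacher")
print(query.to_cypher())
```

Each `hop` just extends the path pattern, so two chained associations become one two-step pattern rather than two round trips.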


proxy_as


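ActiveNode generates a UUID as the id,
but you will notice that the Cypher queries actually executed often use a different ID.

This is the ID that Neo4j generates internally, which can be obtained with Neo4j's ID function, and it directly represents the position in the data structure.

In other words, internally there is not even a need for the B-tree lookup familiar from RDBMSs.

(For details, see Chapter 6, "Native Graph Storage", of https://neo4j.com/book-graph-databases/)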
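The idea that an internal ID "directly represents the position" can be sketched with fixed-size records, where the byte offset is simply the ID multiplied by the record size. The record layout below (two 64-bit integers) and the function names are assumptions for illustration only, not Neo4j's actual store format.

```python
import struct

# Hypothetical fixed-size record: (first_relationship_id, first_property_id).
RECORD = struct.Struct("<qq")

def write_node(store, node_id, first_rel_id, first_prop_id):
    """Write a record at the offset implied by the node ID."""
    offset = node_id * RECORD.size
    if len(store) < offset + RECORD.size:
        # Grow the store with zero bytes so the record fits.
        store.extend(bytes(offset + RECORD.size - len(store)))
    RECORD.pack_into(store, offset, first_rel_id, first_prop_id)

def read_node(store, node_id):
    """Locate a record by arithmetic alone: offset = id * record size."""
    return RECORD.unpack_from(store, node_id * RECORD.size)

store = bytearray()
write_node(store, 0, 10, 100)
write_node(store, 3, 42, 7)
print(read_node(store, 3))
```

`read_node` performs no search at all, which is the contrast with a B-tree lookup on a primary key. It also hints at why such an ID is tied to the physical layout and is a poor choice for a long-lived identifier.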

Note that the value of this ID may change, for example after a version upgrade,
so it apparently should only be used for temporary query generation.
https://github.com/neo4jrb/neo4j/blob/master/docs/UniqueIDs.rst


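## Cypher query

Neo4j's counterpart to SQL.

After working through the tutorial,
browse https://neo4j.com/graphgists/ with
https://neo4j.com/docs/cypher-refcard/current/ at hand,

and you will start to feel that you can use it.

Label = table (a node can carry several; to express inheritance, attach the parent label as well)

OPTIONAL MATCH (similar to a LEFT JOIN)

WITH (used somewhat like a subquery)

The following article is about post-processing UNION results, but the same approach also works for SQL-subquery-style uses: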
https://neo4j.com/blog/cypher-union-query-using-collect-clause/
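The comparison of OPTIONAL MATCH with a LEFT JOIN can be made concrete with a small sketch. It only mimics the row semantics in plain Python; the data and the functions `match` and `optional_match` are hypothetical and do not call any Neo4j API.

```python
PEOPLE = ["ann", "ben"]
OWNS = {"ann": ["cat", "dog"]}  # ben owns nothing

def match(people, owns):
    """Like MATCH: rows without a matching relationship disappear."""
    return [(p, pet) for p in people for pet in owns.get(p, [])]

def optional_match(people, owns):
    """Like OPTIONAL MATCH: unmatched rows survive with None (null)."""
    return [(p, pet) for p in people for pet in (owns.get(p) or [None])]

print(match(PEOPLE, OWNS))
print(optional_match(PEOPLE, OWNS))
```

With `match`, `ben` drops out of the result entirely; with `optional_match`, he stays in as a row whose pet is `None`, just as a LEFT JOIN keeps the left-hand row and fills the right-hand columns with NULL.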

## Tuning Cypher queries

https://neo4j.com/blog/tuning-cypher-queries/

The following passage matters less as tuning advice than as a key to understanding graph databases:

With the first create index on, we are setting an index on the title of a :movie; with the second, we are setting an index on the name of a :person, both of which allow us to create unique indexes. In graph databases, indexes are only used to find the starting point for queries, while in relational databases indexes are used to return a set of rows that are then used for JOINS.

Then look at how the results differ. I think this is a good example.

Again, an index is used to find the starting point. We have now directed our query to zoom in on Tom Hanks and Meg Ryan and find the connections between them. This gives the query plan a very different shape:


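Indeed, given how a graph database stores its data, it is clear that the plan on the right is far more efficient.

In MySQL, by contrast, this right-hand pattern looks just like the one you often see when monitoring slow queries:

when temporary tables are joined to each other, indexes cannot be used, and the query suddenly becomes slow once the data grows past a certain size.

For this reason too, it is important for application engineers to properly understand how both kinds of database work internally.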
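The quoted point that an index is used only to find the starting point can be sketched as follows: names are resolved to node IDs through an index, and everything after that follows relationships stored with the nodes. The data set and the names `NODES`, `ACTED_IN`, `NAME_INDEX`, and `shared_movies` are toy assumptions, not the blog post's data or Neo4j's query planner.

```python
# Toy data for illustration only.
NODES = ["Tom Hanks", "Meg Ryan", "Kevin Bacon",
         "Sleepless in Seattle", "You've Got Mail", "Apollo 13"]

# Relationships kept with each node: person id -> movie ids.
ACTED_IN = {0: [3, 4, 5], 1: [3, 4], 2: [5]}

# The index only turns a name into a starting node ID.
NAME_INDEX = {name: node_id for node_id, name in enumerate(NODES)}

def shared_movies(name_a, name_b):
    """Anchor both ends via the index, then follow relationships only."""
    start_a = NAME_INDEX[name_a]  # index lookup 1
    start_b = NAME_INDEX[name_b]  # index lookup 2
    movies_b = set(ACTED_IN[start_b])
    # From here on no index is consulted; only adjacency lists are read.
    return [NODES[m] for m in ACTED_IN[start_a] if m in movies_b]

print(shared_movies("Tom Hanks", "Meg Ryan"))
```

The two index lookups cost the same however many movies exist, and the rest of the work is bounded by the two actors' own relationship lists. A relational plan that joins two intermediate result sets has no such adjacency to lean on.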


With > Split MATCH Clauses to Reduce Cardinality


Using Size with Relationships

Judging from gists and similar examples, I get the impression that people do not pay much attention to this.


Using SCAN


## GraphGist (something like Neo4j's official collection of examples)


https://neo4j.com/graphgist/ac0b2c27-2a5f-4943-8b4b-100273cb285e

Entertaining as a novelty.

Cypher examples


http://portal.graphgist.org/graph_gists/zombie-apocalypse

Entertaining as a novelty.


http://portal.graphgist.org/graph_gists/bank-fraud-detection

I doubt this model is usable for ordinary purposes. It was probably designed specifically for detection.


http://portal.graphgist.org/graph_gists/project-management

Even the critical path can be obtained with one simple query.

Setting Earliest Start/Finish in a query brings back memories.
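For readers unfamiliar with Earliest Start/Finish, the sketch below computes them for a small task graph and walks back along the critical path. It is plain Python rather than the Cypher used in the GraphGist, and the task names and durations are made up.

```python
from functools import lru_cache

# Hypothetical project: task -> (duration in days, prerequisite tasks).
TASKS = {
    "design": (3, []),
    "build": (5, ["design"]),
    "docs": (2, ["design"]),
    "test": (4, ["build"]),
    "release": (1, ["test", "docs"]),
}

@lru_cache(maxsize=None)
def earliest_finish(task):
    """Earliest finish = latest prerequisite finish + own duration."""
    duration, prereqs = TASKS[task]
    earliest_start = max((earliest_finish(p) for p in prereqs), default=0)
    return earliest_start + duration

def critical_path(task):
    """Walk back from the final task through the latest-finishing prerequisite."""
    path = [task]
    while TASKS[task][1]:
        task = max(TASKS[task][1], key=earliest_finish)
        path.append(task)
    return path[::-1]

print(earliest_finish("release"), critical_path("release"))
```

The critical path is the longest chain of dependent tasks, so it is a path-shaped question, which is why it maps naturally onto a graph query.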


http://portal.graphgist.org/graph_gists/network-dependency-graph


http://portal.graphgist.org/graph_gists/credit-card-fraud-detection
http://linkurio.us/


http://portal.graphgist.org/graph_gists/competency-management-a-matter-of-filtering-and-recommendation-engines

I have not fully understood it, but it was instructive. This one looks usable as-is in real work.


However, it may be critical to only select those candidates who have skills, required by activities, within a certain competency area. Therefore, not to filter through node properties, but through their links or data relationships. Furthermore, it is also essential to expand the search to (and eventually beyond) 3rd degree connections between skills, activities and areas. In other words, we are looking for how potential candidates are connected to competency areas, within a depth of 3. A SQL database will need to execute more JOIN operations to provide the answer – a task that is difficult to code and creates a time-consuming query. As the depth of connections queried expands, this search will become increasingly difficult with an RDBMS and will result in incredibly poor performance.


After 1 year of operations, these parameters result in a graph of approximately 1M nodes. For a graph of this size, the query traversing paths of depth 3 (see above) requires over 30 seconds for a RDBMS to perform, but will only take less than 0.2 seconds with Neo4j [23]. The difference can be critical, whenever querying the database is part of an online tool.
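The depth-3 search described in the quotes can be sketched as a breadth-first traversal with a depth limit. The graph contents and the function `within_depth` are hypothetical; in SQL the same question would typically need one additional JOIN per level of depth.

```python
from collections import deque

# Hypothetical graph linking a candidate, skills, activities and areas.
EDGES = {
    "candidate:kim": ["skill:sql"],
    "skill:sql": ["activity:reporting"],
    "activity:reporting": ["area:analytics"],
    "area:analytics": ["activity:forecasting"],
}

def within_depth(start, max_depth):
    """Return every node reachable within max_depth hops, with its depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in EDGES.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen

print(within_depth("candidate:kim", 3))
```

Raising `max_depth` changes only one argument here, while the traversal still touches just the nodes actually connected to the starting candidate.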


http://portal.graphgist.org/graph_gists/finding-influencers-in-a-social-network

Detects influencers on Twitter. So this is how retweets end up being modeled.


http://portal.graphgist.org/graph_gists/piping-water


http://portal.graphgist.org/graph_gists/organization-learning