@kissrobber
Last active April 16, 2018 02:36
Graph Database memo

What is so good about graph databases?

Isn't an RDBMS good enough?

Please read at least one book before deciding.
https://neo4j.com/book-graph-databases/

There are many considerations, but I think these are the main decision points.

http://www.allthingsdistributed.com/2015/08/titan-graphdb-integration-in-dynamodb.html

In this way, graphs can scale to billions of vertices and edges, while allowing efficient queries and traversal of any subset of the graph with consistent low latency that doesn’t grow proportionally to the overall graph size. This is an important benefit for many use cases that involve accessing and traversing small subsets of a large graph. A concrete example is generating a product recommendation based on purchase interests of a user’s friends, where the relevant social connections are a small subset of the total network. Another example is for tracking inventory in a vast logistics system, where only a subset of its locations is relevant for a specific item.
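
To make the "small subset of a large graph" point concrete, here is a minimal Python sketch of the friend-based recommendation mentioned in the quote. It is not Titan or DynamoDB code: the names `friends`, `purchases`, and `recommend` are hypothetical, and plain dictionaries stand in for whatever storage a real graph database uses. The only thing it tries to show is that the work depends on the user's neighbourhood, not on the total number of users.

```python
from collections import Counter

# Hypothetical in-memory graph: each user points directly at their neighbours.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice"],
    "dave": [],  # unrelated user, never visited when recommending for alice
}
purchases = {
    "alice": ["book"],
    "bob": ["book", "lamp"],
    "carol": ["lamp", "desk"],
    "dave": ["sofa"],
}

def recommend(user):
    """Rank items bought by the user's friends that the user does not own yet."""
    owned = set(purchases.get(user, []))
    counts = Counter()
    for friend in friends.get(user, []):          # follow FRIEND relationships
        for item in purchases.get(friend, []):    # follow BOUGHT relationships
            if item not in owned:
                counts[item] += 1
    # Most frequently bought first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked]

print(recommend("alice"))
```

Adding a million more users like `dave` would not change the number of steps taken for `alice`, which is the property the quote describes.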

https://www.sitepoint.com/why-you-should-use-neo4j-in-your-next-ruby-app/#comment-2689399402

Why is this great? Imagine a world with no foreign keys. Each entity in your database can have many relationships referring directly to other entities. If you want to explore the relationships there are no table or index scans, just a few connections to follow. This matches up well with the typical object model. It is more powerful, though, because Neo4j, while providing a lot of the database functionality that we expect, gives us tools to query for complex patterns in our data.

My typical answer is that something like a database where you're doing logging of lots of repetitive data is usually a better fit for on RDMS (or even Mongo since you're wouldn't generally have foreign keys there). For a graph database obviously things that are already graphy are on the other side of the spectrum (e.g. social networks or hierarchical structures)

Another typical answer I've seen is that you'd want Neo4j for when your data has a lot of relationships. I've found, though, that relationships start coming from places that you don't expect when you have the ability to create them easily ;)

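There are many graph databases, but I think Neo4j is the one to pick

I briefly looked into both Neo4j and another graph database called Titan, and concluded that Neo4j is the better choice. (Of course, the right choice depends on the use case.)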

Neo4j

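Setting the hard parts aside, just install it. It ships with an extremely well-made tutorial, so try it out on your own.
https://neo4j.com/download/
After that, reading the following should give you a good overall picture.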
https://neo4j.com/developer/get-started/

Titan

http://titan.thinkaurelius.com/
What appealed to me most was that it appears to be officially supported on AWS(?).
Details are here:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.html

This may also work well as an introductory article on graph databases in general, rather than on Titan specifically.
https://blogs.aws.amazon.com/bigdata/post/Tx12NN92B1F5K0C/Building-a-Graph-Database-on-AWS-Using-Amazon-DynamoDB-and-Titan

But I still think Neo4j is better

The number one reason is the data structure

Titan's data structure

http://s3.thinkaurelius.com/docs/titan/current/data-model.html
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.BestPractices.html

Neo4j's data structure

https://neo4j.com/book-graph-databases/
This is described in detail under Native Graph Storage in chapter 6.

Read both, and (the rest is omitted)

Miscellaneous

How capable is Enterprise Clustering?
https://neo4j.com/customers/
LinkedIn and others use it, after all.

Neo4j hosting services
https://neo4j.com/developer/guide-cloud-deployment/

Trying Neo4j with Rails

http://neo4jrb.readthedocs.io/en/7.0.x/
If you have built an app with Rails you should pick this up easily, but I'll jot down a few things that caught my attention.

Cypher query

Neo4j's equivalent of SQL
After doing the tutorial, browse https://neo4j.com/graphgists/ with https://neo4j.com/docs/cypher-refcard/current/ open alongside,
and you will start to feel like you can use it.

Label = table (a node can carry several; to express inheritance, attach the parent label as well)
OPTIONAL MATCH (something like a LEFT JOIN)
WITH (used like a subquery)
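
The LEFT JOIN analogy for OPTIONAL MATCH can be sketched in plain Python. This is an illustration of the semantics only, not Neo4j driver code: the data and the function names `match` and `optional_match` are hypothetical, and the Cypher in the comments is the rough equivalent I have in mind.

```python
# Hypothetical data: people and the movies they acted in.
people = ["Tom Hanks", "Meg Ryan", "Newcomer"]
acted_in = {
    "Tom Hanks": ["Sleepless in Seattle", "Cast Away"],
    "Meg Ryan": ["Sleepless in Seattle"],
}

def match(people, acted_in):
    # Roughly: MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
    #          RETURN p.name, m.title
    # People without any ACTED_IN relationship produce no row at all.
    return [(p, m) for p in people for m in acted_in.get(p, [])]

def optional_match(people, acted_in):
    # Roughly: MATCH (p:Person)
    #          OPTIONAL MATCH (p)-[:ACTED_IN]->(m:Movie)
    #          RETURN p.name, m.title
    rows = []
    for p in people:
        movies = acted_in.get(p, [])
        if movies:
            rows.extend((p, m) for m in movies)
        else:
            rows.append((p, None))  # the unmatched part becomes null, as in a LEFT JOIN
    return rows

for row in optional_match(people, acted_in):
    print(row)
```

With `match`, "Newcomer" disappears from the result. With `optional_match`, the row is kept with `None` in place of the movie.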

This article is about post-processing UNION results, but the same approach can also be used like a SQL subquery: https://neo4j.com/blog/cypher-union-query-using-collect-clause/

Tuning Cypher queries

https://neo4j.com/blog/tuning-cypher-queries/

The explanation in this part matters less as tuning advice than as a key to understanding graph databases.

With the first create index on, we are setting an index on the title of a :movie; with the second, we are setting an index on the name of a :person, both of which allow us to create unique indexes. In graph databases, indexes are only used to find the starting point for queries, while in relational databases indexes are used to return a set of rows that are then used for JOINS.

Then look at how the results differ. I think it is a good example.

Again, an index is used to find the starting point. We have now directed our query to zoom in on Tom Hanks and Meg Ryan and find the connections between them. This gives the query plan a very different shape:

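Indeed, once you consider how a graph database stores its data, it is clear that the plan on the right is far more efficient.
If anything, this right-hand pattern looks the same as the one you often see when monitoring slow queries in MySQL:
when you join temporary tables to each other, indexes cannot be used, and the query suddenly becomes slow once the data exceeds a certain size.
For this reason too, it is important for application engineers to properly understand how both kinds of database work internally.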
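
The difference between "use the index once to find a starting point, then follow relationships" and "build two result sets and compare them pairwise" can be sketched in Python. This is a toy model under my own assumptions, not the query plans from the linked article: `person_index`, `shared_movies_by_expansion`, and `shared_movies_by_pairing` are hypothetical names, and the `steps` counters only make the amount of work visible.

```python
# Hypothetical graph: person -> movies, plus the reverse direction movie -> cast.
acted_in = {
    "Tom Hanks": ["Sleepless in Seattle", "Cast Away"],
    "Meg Ryan": ["Sleepless in Seattle", "When Harry Met Sally"],
    "Billy Crystal": ["When Harry Met Sally"],
}
cast = {}
for person, movies in acted_in.items():
    for movie in movies:
        cast.setdefault(movie, []).append(person)

# The "index": consulted once, only to locate the starting node.
person_index = {name: name for name in acted_in}

def shared_movies_by_expansion(a, b):
    """Find a via the index, then walk stored relationships outward."""
    start = person_index[a]
    steps, found = 0, []
    for movie in acted_in[start]:
        for person in cast[movie]:
            steps += 1
            if person == b:
                found.append(movie)
    return found, steps

def shared_movies_by_pairing(a, b):
    """Build both movie lists independently, then compare every pair."""
    steps, found = 0, []
    for m1 in acted_in[a]:
        for m2 in acted_in[b]:
            steps += 1
            if m1 == m2:
                found.append(m1)
    return found, steps

print(shared_movies_by_expansion("Tom Hanks", "Meg Ryan"))
print(shared_movies_by_pairing("Tom Hanks", "Meg Ryan"))
```

Both functions return the same movie. The pairing version does work proportional to the product of the two intermediate lists, which is the shape of the unindexed temporary-table join described above, while the expansion version only touches relationships reachable from the starting node.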

  • With > Split MATCH Clauses to Reduce Cardinality

  • Using Size with Relationships
    Looking at gists and the like, I get the feeling people rarely pay attention to this one

  • Using SCAN
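
As a rough illustration of what "reduce cardinality" means in the first item of the list above, here is a Python sketch. It is not taken from the linked article: the data and the function names are hypothetical, and it only models the general idea that collapsing duplicate intermediate rows (the role WITH DISTINCT plays in Cypher) before the next expansion reduces the number of relationships followed.

```python
# Hypothetical data: movie -> actors and actor -> movies.
movie_actors = {"m1": ["a1", "a2"], "m2": ["a1", "a2"], "m3": ["a1"]}
actor_movies = {"a1": ["m1", "m2", "m3"], "a2": ["m1", "m2"]}

def expand_without_dedup():
    """Expand again from every (movie, actor) row, duplicates included."""
    rows = [(m, a) for m, actors in movie_actors.items() for a in actors]
    expansions = sum(len(actor_movies[a]) for _, a in rows)
    return len(rows), expansions

def expand_with_dedup():
    """Collapse to distinct actors first, then expand once per actor."""
    actors = {a for actors in movie_actors.values() for a in actors}
    expansions = sum(len(actor_movies[a]) for a in actors)
    return len(actors), expansions

print(expand_without_dedup())  # (intermediate rows, relationships followed)
print(expand_with_dedup())
```

In this toy data the same two actors are reached either way, but the deduplicated version carries 2 intermediate rows instead of 5 into the second expansion.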

GraphGist (something like Neo4j's official collection of example use cases)

However, it may be critical to only select those candidates who have skills, required by activities, within a certain competency area. Therefore, not to filter through node properties, but through their links or data relationships. Furthermore, it is also essential to expand the search to (and eventually beyond) 3rd degree connections between skills, activities and areas. In other words, we are looking for how potential candidates are connected to competency areas, within a depth of 3. A SQL database will need to execute more JOIN operations to provide the answer – a task that is difficult to code and creates a time-consuming query. As the depth of connections queried expands, this search will become increasingly difficult with an RDBMS and will result in incredibly poor performance.

After 1 year of operations, these parameters result in a graph of approximately 1M nodes. For a graph of this size, the query traversing paths of depth 3 (see above) requires over 30 seconds for a RDBMS to perform, but will only take less than 0.2 seconds with Neo4j [23]. The difference can be critical, whenever querying the database is part of an online tool.
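
The "within a depth of 3" search in the quotes above can be sketched as a depth-limited breadth-first traversal. This is a minimal Python illustration under my own assumptions: the node names, the `links` dictionary, and `within_depth` are hypothetical and are not taken from the GraphGist. It says nothing about the timings quoted above.

```python
from collections import deque

# Hypothetical graph linking candidates, skills, activities and competency areas.
links = {
    "candidate:ann": ["skill:python"],
    "skill:python": ["activity:data-pipeline"],
    "activity:data-pipeline": ["area:analytics"],
    "candidate:bob": ["skill:welding"],
    "skill:welding": [],
}

def within_depth(start, max_depth):
    """Return {node: hops} for every node reachable in at most max_depth hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]
    return seen

print(within_depth("candidate:ann", 3))
```

Each extra level of depth is one more round of following relationships from the current frontier, whereas the quoted text notes that a SQL formulation needs additional JOIN operations as the depth grows.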
