@kamath
Last active December 23, 2020 22:57

Serverless Graph DB

In an increasingly serverless tech industry, graph databases like Neo4j still require infrastructure provisioning. When you start a Neo4j instance, you pay for uptime on a server sized for load you expect but have no guarantee of seeing, whether or not that server is ever used.

AWS Glue and Athena - serverless ETLs and databases

AWS Glue has "crawlers" that infer a schema from JSON, text, and CSV files stored in S3 (regular cloud storage for files) and register it as a table in the Glue Data Catalog. Amazon Athena, a serverless SQL query service, reads those tables directly from S3. Glue also runs serverless Spark jobs that can relationalize crawled data, flattening nested records into tables and typically writing the result back to S3 as Parquet, so semi-structured data becomes structured data that can be queried with SQL in Athena. Because Parquet stores a typed schema alongside the data, it enforces data types that plain CSV and JSON files do not. Parquet is also columnar and compressed, so the data typically takes less space than the raw text/CSV files, and plausibly less than the same data held in Neo4j.
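To make "relationalize" concrete, here is a minimal pure-Python sketch of the flattening step. It is not the Glue API: the `flatten` helper and the sample record are hypothetical, and Glue's own Relationalize transform additionally pivots arrays out into separate tables, which this sketch does not attempt.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, using dotted column names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


# Hypothetical nested record, as it might appear in a JSON file on S3.
raw = {"id": 1, "name": "Ada", "address": {"city": "London", "geo": {"lat": 51.5}}}
row = flatten(raw)
# row now has the flat columns: id, name, address.city, address.geo.lat
```

Each flat key maps naturally to one table column, which is what makes the result queryable with plain SQL.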

Neo4j also has the disadvantage that the properties stored in its nodes are schema-free. This makes querying on the actual data within the nodes rather complex - Neo4j is best used to analyze relationships between nodes. The nice thing about Glue crawlers is that if a new field appears in the source data, re-running the crawler picks it up and adds a new column to query on. So even though Neo4j allows unstructured data, Glue keeps your data easily queryable with much stronger data integrity.
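The "new field becomes a new column" behavior can be sketched in a few lines of Python. This only illustrates the idea and is not how a Glue crawler is implemented; `infer_columns`, `to_rows`, and the sample records are hypothetical.

```python
def infer_columns(records):
    """Return every field name seen across the records, in first-seen order."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def to_rows(records):
    """Project records onto the inferred columns; missing fields become None."""
    columns = infer_columns(records)
    rows = [tuple(record.get(column) for column in columns) for record in records]
    return columns, rows


# The second record introduces a field ("employer") the first one lacks.
records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace", "employer": "Navy"},
]
columns, rows = to_rows(records)
# columns gains "employer"; the older row simply holds None (NULL) for it.
```

Older rows stay valid after the schema grows, which is why adding a field does not require rewriting existing data.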

Athena charges only per query, at $5 per TB scanned - a very modest price for that scale. Storage is charged via S3, which costs $0.023 per GB per month (S3 Standard) for the first 50 TB, after which the per-GB rate drops slightly.
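As a rough worked example, the two rates quoted above can be combined into a monthly estimate. The function names and the workload (200 GB stored, 100 queries each scanning 0.2 TB) are made up for illustration, and the sketch assumes the flat S3 rate, so it only applies below 50 TB of storage.

```python
ATHENA_USD_PER_TB_SCANNED = 5.00   # rate quoted in the text
S3_USD_PER_GB_MONTH = 0.023        # rate quoted in the text, first 50 TB only


def athena_query_cost(tb_scanned: float) -> float:
    """Cost in USD of one query that scans `tb_scanned` terabytes."""
    return tb_scanned * ATHENA_USD_PER_TB_SCANNED


def s3_monthly_cost(gb_stored: float) -> float:
    """Monthly storage cost in USD, assuming the flat first-tier rate."""
    return gb_stored * S3_USD_PER_GB_MONTH


def monthly_total(gb_stored: float, tb_per_query: float, queries: int) -> float:
    """Storage plus query cost for one month of the hypothetical workload."""
    return s3_monthly_cost(gb_stored) + queries * athena_query_cost(tb_per_query)


# Hypothetical workload: 200 GB stored, 100 queries that each scan 0.2 TB.
estimate = round(monthly_total(200, 0.2, 100), 2)  # 4.60 storage + 100.00 queries
```

In this example query scanning dominates the bill, so partitioning and columnar formats that reduce the bytes scanned matter more than storage size.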

How do we implement the graph database features?

Because Glue detects new fields and adds them as columns, we can create columns that serve as edges, along with columns that describe those edges, much as Neo4j lets you attach properties to relationships. With Spark and data partitioning, you choose how to join, select, and filter your data, so you don't necessarily have to run large JOIN queries as you would in a traditional RDBMS.
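One possible reading of "columns that serve as edges" is an edge table whose `src` and `dst` columns point at nodes and whose remaining columns describe the relationship. The sketch below uses Python's built-in `sqlite3` purely as a local stand-in for Athena, so the SQL can be run anywhere; the table names, column names, and data are all hypothetical, and Athena's SQL dialect differs in places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    -- src/dst act as the edge; relation/since describe it.
    CREATE TABLE edges (src INTEGER, dst INTEGER, relation TEXT, since INTEGER);
    INSERT INTO nodes VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO edges VALUES (1, 2, 'KNOWS', 2015), (2, 3, 'KNOWS', 2018);
""")


def neighbors(conn, name, relation):
    """One-hop traversal: names reachable from `name` via `relation`."""
    return conn.execute(
        """
        SELECT d.name, e.since
        FROM nodes AS s
        JOIN edges AS e ON e.src = s.id
        JOIN nodes AS d ON d.id = e.dst
        WHERE s.name = ? AND e.relation = ?
        ORDER BY d.name
        """,
        (name, relation),
    ).fetchall()


def two_hops(conn, name, relation):
    """Two-hop traversal: follow `relation` twice by joining edges to itself."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.name
        FROM nodes AS s
        JOIN edges AS e1 ON e1.src = s.id AND e1.relation = ?
        JOIN edges AS e2 ON e2.src = e1.dst AND e2.relation = ?
        JOIN nodes AS d ON d.id = e2.dst
        WHERE s.name = ?
        ORDER BY d.name
        """,
        (relation, relation, name),
    ).fetchall()
    return [row[0] for row in rows]


direct = neighbors(conn, "Ada", "KNOWS")
indirect = two_hops(conn, "Ada", "KNOWS")
```

Each extra hop costs one more self-join on the edge table, which is the trade-off against a native graph engine: short, fixed-depth traversals are simple, while deep or variable-depth ones get expensive.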

Value Proposition: An extremely cheap, scalable database with the advantages of structured data from a relational DB and the ease of use of a graph database.
