AWS Athena

Concepts

AWS Glue

Overview

AWS Glue helps big data systems like Hadoop or AWS Athena query data files stored in datastores like S3. Those files are usually missing the metadata descriptions (e.g., a table schema) that would make them queryable. Supplying that metadata is what AWS Glue does.

Crawler

An AWS Glue Crawler helps build an AWS Athena table's schema. Point the crawler at an S3 bucket, and it will generate a schema. If the data structure never changes, you only need to run the crawler once. If the data structure changes over time, you can schedule the crawler to run periodically using a cron expression in the crawler's configuration.

Once the table is created, you can use AWS Athena to query the S3 bucket using SQL-like syntax.
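
A crawler can also be created and started programmatically. Below is a minimal sketch using the AWS SDK for JavaScript (v2); the crawler name, IAM role, Glue database, bucket path, and schedule are all hypothetical placeholders.

// Minimal sketch: create and start a Glue crawler with the AWS SDK for
// JavaScript (v2). All names below are hypothetical placeholders.
const AWS = require('aws-sdk')
const glue = new AWS.Glue({ region: 'us-east-1' })

const params = {
  Name: 'castle-inventory-crawler',
  Role: 'arn:aws:iam::123456789012:role/MyGlueRole',
  DatabaseName: 'castle_db',
  Targets: { S3Targets: [{ Path: 's3://my-bucket/castle-inventory/' }] },
  // Optional: a cron expression to re-crawl periodically (here, hourly).
  Schedule: 'cron(0 * * * ? *)'
}

glue.createCrawler(params).promise()
  .then(() => glue.startCrawler({ Name: params.Name }).promise())
  .then(() => console.log('Crawler created and started.'))
  .catch(console.error)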

CTAS - Create Table As Select

This is the equivalent of a materialized view in PostgreSQL. A physical location on S3 will be created to store the result of a SELECT.
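
A minimal sketch of a CTAS statement, reusing the castle_inventory table from the examples further below; the new table name and S3 location are hypothetical:

-- Materialize the result of a SELECT into a new table backed by S3.
CREATE TABLE castle_inventory_summary
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/castle-inventory-summary/'
) AS
SELECT listingid, listingsummary.status AS status
FROM castle_inventory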

Bucketing

Only for CTAS

This is one way to index a CTAS table (the other way being partitioning). Bucket on table properties with high cardinality. If the property you wish to index has low cardinality or is sometimes not defined, use partitioning instead.

To learn more about how to set this up, see https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
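
A minimal sketch, assuming listingid is a high-cardinality, always-defined property; the table name, S3 location, and bucket count are hypothetical:

-- Bucket the CTAS result on the high-cardinality listingid property.
CREATE TABLE castle_inventory_bucketed
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/castle-inventory-bucketed/',
  bucketed_by = ARRAY['listingid'],
  bucket_count = 10
) AS
SELECT listingid, listingsummary.status AS status
FROM castle_inventory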

Partitioning

Only for CTAS

This is one way to index a CTAS table (the other way being bucketing). Partition on table properties with low cardinality. If the property you wish to index has high cardinality and is always defined, use bucketing instead.

To learn more about how to set this up, see https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
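
A minimal sketch, assuming status is a low-cardinality property; the table name and S3 location are hypothetical. Note that in Athena, partition columns must be listed last in the SELECT:

-- Partition the CTAS result on the low-cardinality status property.
CREATE TABLE castle_inventory_partitioned
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/castle-inventory-partitioned/',
  partitioned_by = ARRAY['status']
) AS
SELECT listingid, listingsummary.status AS status
FROM castle_inventory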

Querying Athena

SELECT

Syntax

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, ...] ]
[ LIMIT [ count | ALL ] ]

Examples

SELECT listingid, listingsummary.status, element_at(description, 1)
FROM castle_inventory 
LIMIT 2

Notice:

  • listingsummary.status is the way to access a nested JSON property.
  • element_at(description, 1) is the way to access an element in an array. The element_at function uses one-based indexing. If you pass a negative number, it returns an element counting from the end of the array. For more info about arrays, go to https://docs.aws.amazon.com/athena/latest/ug/querying-arrays.html.
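
Both indexing directions can be illustrated with a literal array:

SELECT
  element_at(ARRAY['a', 'b', 'c'], 1)  AS first_element, -- 'a' (one-based)
  element_at(ARRAY['a', 'b', 'c'], -1) AS last_element   -- 'c' (from the end)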

Athena JS SDK APIs

Official doc: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Athena.html
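
Below is a minimal sketch of running a query with the SDK (v2), reusing the castle_inventory table from the examples above; the database name and output location are hypothetical. Athena queries are asynchronous, so the execution must be polled until it completes.

// Minimal sketch: run an Athena query with the AWS SDK for JavaScript (v2).
// The database name and output location are hypothetical placeholders.
const AWS = require('aws-sdk')
const athena = new AWS.Athena({ region: 'us-east-1' })

const run = async () => {
  const { QueryExecutionId } = await athena.startQueryExecution({
    QueryString: 'SELECT listingid FROM castle_inventory LIMIT 2',
    QueryExecutionContext: { Database: 'castle_db' },
    ResultConfiguration: { OutputLocation: 's3://my-bucket/athena-results/' }
  }).promise()

  // Poll until the query finishes (Athena queries run asynchronously).
  let state = 'QUEUED'
  while (state === 'QUEUED' || state === 'RUNNING') {
    await new Promise(resolve => setTimeout(resolve, 1000))
    const { QueryExecution } = await athena.getQueryExecution({ QueryExecutionId }).promise()
    state = QueryExecution.Status.State
  }

  if (state !== 'SUCCEEDED') throw new Error(`Query ended in state ${state}`)
  return athena.getQueryResults({ QueryExecutionId }).promise()
}

run().then(res => console.log(res.ResultSet.Rows)).catch(console.error)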

Use Cases

Combining Athena With S3 and Kinesis

S3

Storing your data files into S3 is the usual use case.

To Use Kinesis Or Not To Use It?

In short: use Kinesis Firehose to store your data into S3 if your write throughput exceeds ~3,000 operations per second. Otherwise, write straight into S3.

S3 already automatically scales to offer a huge read/write throughput (~3,500 writes per second per prefix as of Oct. 2018). As for reads, the throughput is theoretically "infinite", as AWS parallelizes your queries across prefixes. If you spread 1,000,000 reads over 1,000,000 different prefixes, you could theoretically reach 1,000,000 reads per second. If you only read from the same prefix, you will get ~5,000 reads per second. In any case, for high-frequency reads, it is usually not recommended to read straight from S3. Instead, the best practice is to use a CDN (usually AWS CloudFront), in which case your read throughput becomes even higher.

So what's the use case for using Kinesis Firehose?

Well, that's simple: if you can predict that your system needs to persist data at a higher frequency than ~3,000 writes per second, then you should write to Kinesis Firehose to offload the excess pressure from S3 (a sketch follows the list below). However, there are a few limitations:

  1. The data you send to AWS Kinesis Firehose cannot exceed 1,000 KiB per record before base64 encoding.
  2. You will lose control of how the data are partitioned in S3. At the time of writing this article (Oct. 2018), Kinesis Firehose stores your data in batches using the following S3 prefix: s3://<your-bucket>/<year>/<month>/<date>/<hour>. The 4 partitions associated with your table will therefore be: year, month, date, and hour.
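
Below is a minimal sketch of writing a record to Kinesis Firehose with the AWS SDK for JavaScript (v2); the delivery stream name and record shape are hypothetical placeholders.

// Minimal sketch: write one record to a Firehose delivery stream that
// batches into S3. The stream name and record shape are hypothetical.
const AWS = require('aws-sdk')
const firehose = new AWS.Firehose({ region: 'us-east-1' })

const record = { listingid: 123, status: 'active' }

firehose.putRecord({
  DeliveryStreamName: 'castle-inventory-stream',
  // A trailing newline keeps records on separate lines in the batched S3 objects.
  Record: { Data: JSON.stringify(record) + '\n' }
}).promise()
  .then(() => console.log('Record sent.'))
  .catch(console.error)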