Overview
AWS Glue helps Big Data systems such as Hadoop or AWS Athena query data files in datastores like S3. Those files are usually missing the metadata descriptions that would make them queryable; providing that metadata is what AWS Glue does.
Crawler
An AWS Glue Crawler helps build an AWS Athena table's schema. Point the crawler at an S3 bucket, and it will generate a schema. If the data structure never changes, you can run the crawler only once. If the data structure changes over time, you can schedule the crawler to run periodically using a CRON expression inside the crawler's configuration.
Once the table is created, the user can use AWS Athena to query the S3 bucket using SQL-like syntax.
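To make this concrete, the table a crawler generates is roughly equivalent to a DDL statement like the following sketch (the bucket path and columns are hypothetical, chosen to match the castle_inventory example used later; the OpenX JSON SerDe is one Athena commonly uses for JSON files):

```sql
-- Hypothetical DDL a crawler could generate for JSON files in S3.
CREATE EXTERNAL TABLE castle_inventory (
  listingid string,
  listingsummary struct<status:string>,
  description array<string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/castle-inventory/'
```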
CTAS
This is equivalent to a materialized view in PostgreSQL: a physical location on S3 is created to store the result of a SELECT.
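A minimal CTAS statement might look like the following sketch (the table name, bucket path, and filter are hypothetical):

```sql
-- Materialize a filtered view of castle_inventory into its own S3 location.
CREATE TABLE active_listings
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/active-listings/'
) AS
SELECT listingid, listingsummary.status
FROM castle_inventory
WHERE listingsummary.status = 'active'
```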
Bucketing
Only for CTAS
This is one way to index a CTAS table (the other being partitioning). Bucket on table properties with high cardinality. If the property you wish to index has low cardinality or is sometimes undefined, use partitioning instead.
To learn more about how to set this up, see https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
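A bucketed CTAS statement might look like this sketch (table name, bucket path, bucketing column, and bucket count are hypothetical):

```sql
-- Bucket on listingid, a high-cardinality column that is always defined.
CREATE TABLE listings_by_id
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/listings-by-id/',
  bucketed_by = ARRAY['listingid'],
  bucket_count = 10
) AS
SELECT listingid, listingsummary.status
FROM castle_inventory
```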
Partitioning
Only for CTAS
This is one way to index a CTAS table (the other being bucketing). Partition on table properties with low cardinality. If the property you wish to index has high cardinality and is always defined, use bucketing instead.
To learn more about how to set this up, see https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
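A partitioned CTAS statement might look like this sketch (table name, bucket path, and partition column are hypothetical). Note that in Athena, partition columns must come last in the SELECT list:

```sql
-- Partition on status, a low-cardinality column.
CREATE TABLE listings_by_status
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/listings-by-status/',
  partitioned_by = ARRAY['status']
) AS
SELECT listingid, listingsummary.status AS status
FROM castle_inventory
```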
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
SELECT listingid, listingsummary.status, element_at(description, 1)
FROM castle_inventory
LIMIT 2
Notice:
- listingsummary.status is the way to access a JSON property.
- element_at(description, 1) is the way to access an element in an array. The element_at function uses one-based indexing; if you pass a negative number, it returns an element counting from the end of the array. For more info about arrays, see https://docs.aws.amazon.com/athena/latest/ug/querying-arrays.html.
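For example, positive and negative indexes can be combined in one query (column aliases here are hypothetical):

```sql
-- element_at uses one-based indexing; -1 counts back from the end of the array.
SELECT
  element_at(description, 1)  AS first_paragraph,
  element_at(description, -1) AS last_paragraph
FROM castle_inventory
LIMIT 2
```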
Official doc: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Athena.html
Storing your data files in S3 is the usual use case.
In short: use Kinesis Firehose to store your data into S3 if your write throughput exceeds ~3,000 operations per second; otherwise, write straight to S3.
S3 already scales automatically to offer a huge read/write throughput (~3,500 writes per second as of Oct 2018). As for reads, throughput is theoretically "infinite", as AWS parallelizes your queries per prefix. If you issue 1,000,000 reads across 1,000,000 different prefixes, you could theoretically reach 1,000,000 reads per second. If you only read the same prefix, you'll get ~5,000 reads per second. In any case, for high-frequency reads it is usually not recommended to read straight from S3; the best practice is to use a CDN (usually AWS CloudFront), in which case your read throughput becomes even higher.
So what's the use case for Kinesis Firehose?
Simple: if you can predict that your system needs to persist data at a higher frequency than ~3,000 writes per second, then you should write to Kinesis Firehose to offload the excess pressure from S3. However, there are a few limitations:
- The data you send to AWS Kinesis Firehose cannot exceed 1,000 KiB before base64 encoding.
- You lose control of how the data is partitioned in S3. At the time of writing this article (Oct. 2018), Kinesis Firehose will store your data in batches using the following S3 prefix:
s3://<your-bucket>/<year>/<month>/<date>/<hour>
The four partitions associated with your table will be: year, month, date, and hour.
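Athena does not discover those Firehose-written partitions automatically; one way to register them is with ALTER TABLE ADD PARTITION, as in this sketch (the table name, bucket, and partition values are hypothetical):

```sql
-- Register one hour of Firehose output as a partition of an existing table.
ALTER TABLE firehose_events ADD IF NOT EXISTS
PARTITION (year = '2018', month = '10', date = '21', hour = '14')
LOCATION 's3://my-bucket/2018/10/21/14/'
```

In practice this statement would be run on a schedule (or replaced by a scheduled Glue crawler) so new hourly prefixes become queryable.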