Overview
AWS Glue helps Big Data systems such as Hadoop or AWS Athena query data files in datastores like S3. Those files are usually missing the metadata descriptions that would make them queryable; providing that metadata is what AWS Glue does.
Crawler
An AWS Glue Crawler helps build an AWS Athena table's schema. Point the crawler at an S3 bucket, and it will generate a schema. If the data structure never changes, you can run the crawler only once. If the data structure changes over time, you can schedule the crawler to run periodically using a CRON expression inside the crawler's configuration.
Once the table is created, the user can use AWS Athena to query the S3 bucket using SQL-like syntax.
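To make this concrete, the table a crawler generates is roughly equivalent to a DDL statement like the following sketch (the bucket path and columns are hypothetical, chosen to match the castle_inventory example used later; the OpenX JSON SerDe is one Athena commonly uses for JSON files):

```sql
-- Hypothetical DDL a crawler could generate for JSON files in S3.
CREATE EXTERNAL TABLE castle_inventory (
  listingid string,
  listingsummary struct<status:string>,
  description array<string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/castle-inventory/'
```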
CTAS
This is equivalent to a materialized view in PostgreSQL: a physical location on S3 is created to store the result of a SELECT.
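A minimal CTAS statement might look like the following sketch (the table name, bucket path, and filter are hypothetical):

```sql
-- Materialize a filtered view of castle_inventory into its own S3 location.
CREATE TABLE active_listings
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/active-listings/'
) AS
SELECT listingid, listingsummary.status
FROM castle_inventory
WHERE listingsummary.status = 'active'
```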
Bucketing
Only for CTAS
This is one way to index a CTAS table (the other being partitioning). Bucket on table properties with high cardinality. If the property you wish to index has low cardinality or is sometimes undefined, use partitioning instead.
To learn more about how to set this up, see https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
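A bucketed CTAS statement might look like this sketch (table name, bucket path, bucketing column, and bucket count are hypothetical):

```sql
-- Bucket on listingid, a high-cardinality column that is always defined.
CREATE TABLE listings_by_id
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/listings-by-id/',
  bucketed_by = ARRAY['listingid'],
  bucket_count = 10
) AS
SELECT listingid, listingsummary.status
FROM castle_inventory
```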
Partitioning
Only for CTAS
This is one way to index a CTAS table (the other being bucketing). Partition on table properties with low cardinality. If the property you wish to index has high cardinality and is always defined, use bucketing instead.
To learn more about how to set this up, see https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
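A partitioned CTAS statement might look like this sketch (table name, bucket path, and partition column are hypothetical). Note that in Athena, partition columns must come last in the SELECT list:

```sql
-- Partition on status, a low-cardinality column.
CREATE TABLE listings_by_status
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/listings-by-status/',
  partitioned_by = ARRAY['status']
) AS
SELECT listingid, listingsummary.status AS status
FROM castle_inventory
```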
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
SELECT listingid, listingsummary.status, element_at(description, 1)
FROM castle_inventory
LIMIT 2
Notice:
- listingsummary.status is the way to access a JSON property.
- element_at(description, 1) is the way to access an element in an array. The element_at function uses one-based indexing; if you pass a negative number, it returns an element counting from the end of the array. For more info about arrays, see https://docs.aws.amazon.com/athena/latest/ug/querying-arrays.html.
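For example, positive and negative indexes can be combined in one query (column aliases here are hypothetical):

```sql
-- element_at uses one-based indexing; -1 counts back from the end of the array.
SELECT
  element_at(description, 1)  AS first_paragraph,
  element_at(description, -1) AS last_paragraph
FROM castle_inventory
LIMIT 2
```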
Official doc: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Athena.html
Storing your data files in S3 is the usual use case.
In short: use Kinesis Firehose to store your data into S3 if your write throughput exceeds ~3,000 operations per second; otherwise, write straight to S3.
S3 already scales automatically to offer a huge read/write throughput (~3,500 writes per second as of Oct 2018). As for reads, throughput is theoretically "infinite", as AWS parallelizes your queries per prefix. If you issue 1,000,000 reads across 1,000,000 different prefixes, you could theoretically reach 1,000,000 reads per second. If you only read the same prefix, you'll get ~5,000 reads per second. In any case, for high-frequency reads it is usually not recommended to read straight from S3; the best practice is to use a CDN (usually AWS CloudFront), in which case your read throughput becomes even higher.
So what's the use case for Kinesis Firehose?
Simple: if you can predict that your system needs to persist data at a higher frequency than ~3,000 writes per second, then you should write to Kinesis Firehose to offload the excess pressure from S3. However, there are a few limitations:
- The data you send to AWS Kinesis Firehose cannot exceed 1,000 KiB before base64 encoding.
- You lose control of how the data is partitioned in S3. At the time of writing this article (Oct. 2018), Kinesis Firehose will store your data in batches using the following S3 prefix:
s3://<your-bucket>/<year>/<month>/<date>/<hour>
The four partitions associated with your table will be: year, month, date, and hour.
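Athena does not discover those Firehose-written partitions automatically; one way to register them is with ALTER TABLE ADD PARTITION, as in this sketch (the table name, bucket, and partition values are hypothetical):

```sql
-- Register one hour of Firehose output as a partition of an existing table.
ALTER TABLE firehose_events ADD IF NOT EXISTS
PARTITION (year = '2018', month = '10', date = '21', hour = '14')
LOCATION 's3://my-bucket/2018/10/21/14/'
```

In practice this statement would be run on a schedule (or replaced by a scheduled Glue crawler) so new hourly prefixes become queryable.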