Skip to content

Instantly share code, notes, and snippets.

@farski
Last active November 21, 2021 21:49
Show Gist options
  • Save farski/a00fe39bf1eff0a9131d685560e620d0 to your computer and use it in GitHub Desktop.
Save farski/a00fe39bf1eff0a9131d685560e620d0 to your computer and use it in GitHub Desktop.
Athena snippets

Snippets

These snippets are based on various various AWS documents, blogs, etc, but have been revised for clarity and consistency, and to incorprate modern best practices.

Resources

Patition Projection

Partition projection is a feature of Athena that allows for querying of large databases (i.e., data in S3) more efficiently without having to create partitions ahead of time. You can think of them as just-in-time ephemeral partitions that are created based on SELECT criteria at the time of the query.

Example:

There's an S3 bucket called planet-data that contains data files for different planets. The files are all prefixed with the name of the planet (e.g, s3://planet-data/Earth/… and s3://planet-data/Mars/…). If no partitions are created for an Athena table that queries this bucket, all files for all planets would be scanned on every query, even when the query is only looking at a single planet. So when running a query like SELECT * WHERE planet_id = 'Earth', all the files, including those with the Mars/, Saturn/, etc prefixes, will be loaded and scanned, even though the result would be only records that pertain to Earth which all came out of the Earth/ prefix. This leads to a lot of unnecessary data being scanned, which increases operational costs when using Athena.

To reduce the amount of data being scanned, Athena can use partitions. A partition allows Athena to scan a subset of objects in S3. In this example, creating partitions for each planet would allow you to query for reports for a specific planet, and only the objects with the matching prefix would be scanned during the query.

Partitions can be defined explicitly, by adding metadata to the table which, in this case, would say: create partitions for each planet name, and the possible names are Earth, Mars, etc. Creating partitions this way can be done manually, or it can be automated using Glue Crawlers or other methods, but the partitions must exist before queries can find the related data in S3. So if a new planet called HD110082b was discovered and reports started getting added to the S3 bucket for it with a new prefix, but no partition was created for that planet, Athena wouldn't be able to scan those reports using a partition.

Alternatively, Athena supports partition projection, which allows for scanning partitioned S3 data without needing to define those partitions ahead of time. Instead, a projection is defined, which maps a field in a SELECT query to a prefix (or partial prefix) in the data being scanned in S3. If a projection is defined called PlanetName, it can be used in a template that describes objects in S3 (like s3://planet-data/${PlanetName}/). Now, when a query includes a WHERE clause that specifies one or more PlanetName values (e.g., SELECT * WHERE PlanetName = 'Earth'), Athena will scan for objects objects in S3 using the template (e.g., s3://planet-data/Earth/). This matches the structure of the data in the bucket from the beginning of the example, without having explicitly defined any partitions.

Projection partitions are commonly used when querying time-based files, such as log files. If log files are being automatically delivered to a bucket with a naming structure like myBucket/2021/05/03/log1.csv, a partition projection can be defined to automatically partition the data by year, month, and day, so that when running a SELECT for a specific set of days, only the log files for those days are scanned.

You can see what's happening under the covers in a case like this if you turn on access logging for the bucket underlying the Athena table.

Example:

Given a date-based daily partition projection called log_date, and a projection template of s3://myBucket/myLogs/${log_date}, imagine a query like SELECT * from myLogsTable WHERE log_date >= date('2021-05-05') AND log_date < date('2021-05-07'). This query is selecting all logs for May 5th and 6th of 2021. The table's partition projection has a daily resolution, so only the partitions for those two days will be needed by the query, meaning only the log files for those two days will be scanned from S3.

Examining the S3 access logs for the bucket after running the query would reveal that Athena is making a ListObjectsV2 API call on the bucket for each partition, and then a GetObject call for each object that was returned in the list.

For example:

  • "GET /?list-type=2&prefix=myLogs%2F2021%205%2F05%2F&fetch-owner=false HTTP/1.1"
  • "GET /myLogs/2021/05/05/log1.csv HTTP/1.1"
  • "GET /myLogs/2021/05/05/log2.csv HTTP/1.1"
  • "GET /myLogs/2021/05/05/log3.csv HTTP/1.1"
  • "GET /?list-type=2&prefix=myLogs%2F2021%205%2F06%2F&fetch-owner=false HTTP/1.1"
  • "GET /myLogs/2021/05/06/log1.csv HTTP/1.1"
  • "GET /myLogs/2021/05/06/log2.csv HTTP/1.1"

Hopefully that example demonstrates how a partition projection defined on a table can be used in a SELECT query to limit which objects in S3 are actually accessed by Athena.

Creating partition projections often leads to having several similar fields in the query results: the fields from the data itself, and the fields created as part of the projection. In the previous example, this would be the log_date field and perhaps a timestamp value included in each log line. It's important to remember that Athena will only utilize partitions when the partition projection field is used. If log_date were omitted from a SELECT that was filtering using timestamp, the results would only include the rows from the filter, but Athena would have scanned the entire, unpartitioned set of data from S3.

To filter a query using partitioning and with a resolution that is greater than the pertions, both the data fields and partition projection fields would be used in the same query.

Note: There is currently a limitation with partion projections, where a / is always included at the end of the prefix parameter of the ListObjectsV2 request, even if one was not included in the projection template. This makes it impossible to use projection partitions for any S3 objects that don't actually have a / in the key at that particular spot (which is the case for a number of AWS service log files).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment