@yammik
Last active October 11, 2022 20:25

Chosen approach

AWS Glue

AWS Glue is a serverless ETL service for data analysis:

With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

Since our non-partitioned data are already in S3, we can set up Glue to read directly from the bucket with a predefined schema.

We can use AWS Glue to repartition:

Extract data from S3

- unsure if we can filter on the first partition key, i.e. can we extract only data that have YEAR as the first partition?
- if so, will the repartitioned outputs overwrite the contents of partitioned data that share the same output path?

Transform

- extract year, month, and day from the timestamp and add them as columns

Load

- load into the target S3 bucket
- re-partition using the extracted values
- can provision multiple workers and set a timeout
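The transform step above can be sketched in plain Python (a real Glue job would be PySpark, but the partition-column logic is the same). The `timestamp` field name and ISO-8601 format are assumptions, not confirmed from the actual schema:

```python
from datetime import datetime


def add_partition_columns(record: dict) -> dict:
    """Derive year/month/day columns from the record's timestamp.

    Assumes a hypothetical ISO-8601 "timestamp" field; the real
    schema may differ.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        **record,
        "year": f"{ts.year:04d}",
        "month": f"{ts.month:02d}",
        "day": f"{ts.day:02d}",
    }


def hive_partition_path(record: dict) -> str:
    """Build the Hive-style S3 key prefix the load step would write to."""
    return f"year={record['year']}/month={record['month']}/day={record['day']}"


rec = add_partition_columns({"timestamp": "2022-10-11T20:25:00", "event": "save"})
print(hive_partition_path(rec))  # year=2022/month=10/day=11
```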

Rationale

Glue outputs work with Athena and Amplitude, and possibly Totango. We can also add steps for detecting sensitive data, filtering values, etc. These functions are standard Glue features and don't require hacky workarounds.
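The value-filtering step mentioned above could be a simple per-record scrub applied before the load. A minimal sketch — the `user_email` field is a hypothetical stand-in for whatever sensitive data detection flags:

```python
# Hypothetical set of fields to strip; a real job might populate this
# from Glue's sensitive-data detection step instead of hardcoding it.
SENSITIVE_KEYS = frozenset({"user_email"})


def scrub(record: dict) -> dict:
    """Drop sensitive fields from a record before writing it to S3."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_KEYS}


print(scrub({"event": "save", "user_email": "a@b.c"}))  # {'event': 'save'}
```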

Pitfalls

Alternative approaches

Write Go scripts and run them on Lambda. (How long does a run take? What would it cost?)

Next steps

  1. where do we want these jobs to live? can that be tied to Terraform at all?
     • test creating Glue jobs in Terraform
     • ideally, created via Terraform, but visible in the editor
  2. confirm Athena dynamic partitioning works with Hive-style partitioning
  3. do this today
  4. would we have to repartition the existing non-Hive data?
  5. do the existing consumers of these S3 buckets have to be updated? (they have to be updated anyway to consume the new partitions)
  6. confirm the format of the JSON in the source file (might be a Go-side conversion)
  7. Also consider running
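On repartitioning the existing non-Hive data: if the current keys use bare date prefixes, the rename could look like this sketch. The `<prefix>/YYYY/MM/DD/<file>` layout here is an assumption, not confirmed from the actual buckets:

```python
import re

# Assumed current key layout: <prefix>/YYYY/MM/DD/<file> (not confirmed).
NON_HIVE = re.compile(
    r"^(?P<prefix>.*?)(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/(?P<rest>.+)$"
)


def to_hive_key(key: str) -> str:
    """Rewrite a bare-date S3 key into a Hive-style partitioned key."""
    m = NON_HIVE.match(key)
    if m is None:
        raise ValueError(f"key does not match the assumed layout: {key}")
    return f"{m['prefix']}year={m['y']}/month={m['m']}/day={m['d']}/{m['rest']}"


print(to_hive_key("events/2022/10/11/part-0001.json"))
# events/year=2022/month=10/day=11/part-0001.json
```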

yammik commented Oct 11, 2022

  1. update the datetime partition style to just be "yyyy/mm/dd"
  2. create Athena table
  3. add to discussion repo
  4. how to do this in Terraform
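The simplified "yyyy/mm/dd" partition style can come straight from `strftime`. A minimal sketch, again assuming an ISO-8601 timestamp string:

```python
from datetime import datetime


def date_partition(ts: str) -> str:
    """Format an ISO-8601 timestamp as a yyyy/mm/dd key prefix."""
    return datetime.fromisoformat(ts).strftime("%Y/%m/%d")


print(date_partition("2022-10-11T20:25:00"))  # 2022/10/11
```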
