AWS Glue is a serverless ETL service for data analysis:
> With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
Since our non-partitioned data is already in S3, we can set up Glue to read directly from the bucket with a predefined schema.
We can use AWS Glue to repartition:
- open question: can we filter on the first partition key, i.e. extract only data whose first partition level is YEAR?
- if so, will the repartitioned output replace/overwrite existing partitioned data written to the same output path?
- extract year, month, day from timestamp and add as columns
- load into the target S3 bucket
- re-partition using the extracted values
- can provision multiple workers, set timeout
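The extract/load/re-partition steps above boil down to deriving year/month/day from each record's timestamp and writing under a hive-style prefix. A minimal pure-Python sketch of that key derivation (function, field, and bucket names are all hypothetical, not our actual schema):

```python
from datetime import datetime, timezone

def hive_partition_key(prefix: str, event: dict) -> str:
    """Derive year/month/day from the record's epoch timestamp
    and build a hive-style S3 key prefix.
    The "timestamp" field name is an assumption."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return f"{prefix}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"

# Example: an event at 2024-03-07 12:00:00 UTC
event = {"timestamp": 1709812800}
print(hive_partition_key("s3://target-bucket/events", event))
# → s3://target-bucket/events/year=2024/month=03/day=07/
```

In an actual Glue job this would instead be three derived columns plus `partitionKeys` on the sink, but the path layout is the same.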
Glue outputs work with Athena and Amplitude (perhaps Totango too?). We can also add steps for detecting sensitive data, filtering values, etc. These are standard built-in transforms, so no hacks required.
Write Go scripts and run them on Lambda. (How long does a run take? What does it cost?)
- where do we want these jobs to live? can that be tied to terraform at all?
- test creating glue jobs on tf
- ideally, created via tf, but visible in the editor
- confirm athena dynamic partitioning works with hive style partitioning
- do this today
- would we have to repartition the existing non-hive data?
- do the existing consumers of these S3 buckets have to be updated? (they have to be updated anyway to consume the new partition scheme)
- confirm the format of the JSON in the source file (might be golang side conversion)
- Also consider running
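On the Athena dynamic-partitioning question above: `INSERT INTO` a partitioned table does write hive-style `key=value` prefixes. A hedged sketch, assuming a JSON source and year/month/day partition columns (all table, column, and bucket names are hypothetical):

```sql
-- Target table over the repartitioned S3 prefix (names are assumptions)
CREATE EXTERNAL TABLE events_partitioned (
  id      string,
  payload string,
  ts      bigint
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://target-bucket/events/';

-- Dynamic partitioning: Athena derives the year=/month=/day= prefixes
-- from the trailing SELECT columns
INSERT INTO events_partitioned
SELECT id, payload, ts,
       year(from_unixtime(ts))  AS year,
       month(from_unixtime(ts)) AS month,
       day(from_unixtime(ts))   AS day
FROM events_raw;
```

Note Athena caps the number of partitions a single `INSERT INTO` can create, so backfilling the existing data may need to be chunked.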
- let's see if we can define a glue job in terraform
- define an athena table that looks like this
- what to do with existing data that's not hive-partitioned
  - if the formats match, there may be duplicate data
- add these notes to discussion repo
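On defining the Glue job in Terraform: a minimal hedged sketch using the AWS provider's `aws_glue_job` resource (job name, role, and script path are assumptions). A job created this way shows up in the Glue console, though Glue Studio's visual editor generally only renders jobs authored there:

```hcl
resource "aws_glue_job" "repartition" {
  name     = "repartition-events"      # hypothetical name
  role_arn = aws_iam_role.glue_job.arn # assumes an existing IAM role

  command {
    name            = "glueetl"
    script_location = "s3://scripts-bucket/repartition.py" # assumed path
    python_version  = "3"
  }

  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 10 # "provision multiple workers"
  timeout           = 60 # minutes
}
```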