AWS Glue is a serverless ETL service for data analysis:
> With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
Since our non-partitioned data is already in S3, we can set up Glue to read directly from the bucket with a predefined schema.
We can use AWS Glue to repartition:
- open question: can we filter on the first partition key, i.e. extract only data whose first partition level is YEAR?
- if so, will the repartitioned output replace/overwrite existing partitioned data written to the same output path?
- extract year, month, day from timestamp and add as columns
- load into the target S3 bucket
- re-partition using the extracted values
- can provision multiple workers, set timeout
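The extract/load/re-partition steps above boil down to deriving year/month/day from each record's timestamp and writing under a hive-style prefix. A minimal pure-Python sketch of that key derivation (function, field, and bucket names are all hypothetical, not our actual schema):

```python
from datetime import datetime, timezone

def hive_partition_key(prefix: str, event: dict) -> str:
    """Derive year/month/day from the record's epoch timestamp
    and build a hive-style S3 key prefix.
    The "timestamp" field name is an assumption."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return f"{prefix}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"

# Example: an event at 2024-03-07 12:00:00 UTC
event = {"timestamp": 1709812800}
print(hive_partition_key("s3://target-bucket/events", event))
# → s3://target-bucket/events/year=2024/month=03/day=07/
```

In an actual Glue job this would instead be three derived columns plus `partitionKeys` on the sink, but the path layout is the same.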
Glue outputs work with Athena and Amplitude (perhaps Totango too?). We can also add steps for detecting sensitive data, filtering values, etc. These are standard built-in transforms, so no hacks required.
Write Go scripts and run them on Lambda. (How long does a run take? What does it cost?)
- where do we want these jobs to live? can that be tied to terraform at all?
- test creating glue jobs on tf
- ideally, created via tf, but visible in the editor
- confirm athena dynamic partitioning works with hive style partitioning
- do this today
- would we have to repartition the existing non-hive data?
- do the existing consumers of these S3 buckets have to be updated? (they have to be updated anyway to consume the new partition scheme)
- confirm the format of the JSON in the source file (might be golang side conversion)
- Also consider running
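On the Athena dynamic-partitioning question above: `INSERT INTO` a partitioned table does write hive-style `key=value` prefixes. A hedged sketch, assuming a JSON source and year/month/day partition columns (all table, column, and bucket names are hypothetical):

```sql
-- Target table over the repartitioned S3 prefix (names are assumptions)
CREATE EXTERNAL TABLE events_partitioned (
  id      string,
  payload string,
  ts      bigint
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://target-bucket/events/';

-- Dynamic partitioning: Athena derives the year=/month=/day= prefixes
-- from the trailing SELECT columns
INSERT INTO events_partitioned
SELECT id, payload, ts,
       year(from_unixtime(ts))  AS year,
       month(from_unixtime(ts)) AS month,
       day(from_unixtime(ts))   AS day
FROM events_raw;
```

Note Athena caps the number of partitions a single `INSERT INTO` can create, so backfilling the existing data may need to be chunked.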
- let's see if we can define a glue job in terraform
- define an athena table that looks like this
- what to do with existing data that's not hive-partitioned
  - if the formats match, there may be duplicate data
- add these notes to discussion repo
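On defining the Glue job in Terraform: a minimal hedged sketch using the AWS provider's `aws_glue_job` resource (job name, role, and script path are assumptions). A job created this way shows up in the Glue console, though Glue Studio's visual editor generally only renders jobs authored there:

```hcl
resource "aws_glue_job" "repartition" {
  name     = "repartition-events"      # hypothetical name
  role_arn = aws_iam_role.glue_job.arn # assumes an existing IAM role

  command {
    name            = "glueetl"
    script_location = "s3://scripts-bucket/repartition.py" # assumed path
    python_version  = "3"
  }

  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 10 # "provision multiple workers"
  timeout           = 60 # minutes
}
```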