Skip to content

Instantly share code, notes, and snippets.

@sahlone
Last active June 19, 2018 13:29
Show Gist options
  • Save sahlone/1836f87514fe42c014bbd0f5521b0602 to your computer and use it in GitHub Desktop.
Save sahlone/1836f87514fe42c014bbd0f5521b0602 to your computer and use it in GitHub Desktop.
Click/Conversions ETL to S3 and Athena

The scope of the gist is to define the process of loading of Clicks/Conversion data from Tracker to S3/Athena for Business processes

The process is described as :

  1. The tracker receives the click/conversions data from outside sources and pushes to Kafka topic

  2. The Matcher reads the Kafka topic produced by the tracker and matches the clicks and conversions data to produce the Matched conversion data

  3. Now the Job handles the data from Kafka topic produced by the Matcher and upload the data to s3. From s3, we can define the schema in Athena and use the Athena to run the SQL queries on top of the data

System Diagram

Athena Formats

@sahlone
Copy link
Author

sahlone commented Jun 8, 2018

@vincentdaniel

  1. Whats is a data scan : What it means is when you fire a query, the data is scanned and results are produced from that data. SO the amount of data scanned by athena is included in costing. That's where partitioning can help but after talking to Gair I came to know there s no perfect criteria for queries. But Gair was ok with it as finally, we will move data to Bigquery from s3. For now partitioning will help in uploading data to s3 as we will do it in batches.
    Note : Athena doesnt actually store data. Its exactly like Hive, its just store metadata and gets data on demand
  2. Is Athena supported in Terraform: Yes there is a support for that as well
  3. What about Glue: Glue is actually used for ETL jobs, they are actually Spark jobs running on schedule so I don't think we need that as we will manage them ourselves. The only thing they give us is automatic schema defining, we are not that lazy to not take care of that and yes you pay for every crawler operation.

@gaiar
Copy link

gaiar commented Jun 19, 2018

BigQuerry import currently not required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment