Skip to content

Instantly share code, notes, and snippets.

@sahlone
Last active June 19, 2018 13:29
Show Gist options
  • Save sahlone/1836f87514fe42c014bbd0f5521b0602 to your computer and use it in GitHub Desktop.
Save sahlone/1836f87514fe42c014bbd0f5521b0602 to your computer and use it in GitHub Desktop.
Click/Conversions ETL to S3 and Athena

The scope of the gist is to define the process of loading of Clicks/Conversion data from Tracker to S3/Athena for Business processes

The process is described as :

  1. The tracker receives the click/conversions data from outside sources and pushes to Kafka topic

  2. The Matcher reads the Kafka topic produced by the tracker and matches the clicks and conversions data to produce the Matched conversion data

  3. Now the Job handles the data from Kafka topic produced by the Matcher and upload the data to s3. From s3, we can define the schema in Athena and use the Athena to run the SQL queries on top of the data

System Diagram

Athena Formats

@sahlone
Copy link
Author

sahlone commented Jun 8, 2018

The ETL job to push to big query will not be a part of System for now as discussed with Gaiar.
We need some insights here for the type of format we should use and how about defining the process
I have also added the screenshot of formats that we can use and what flexibility we get with the type of format

@vincentdaniel
Copy link

One comment about the goal:

  • I think we will need to send clicks and conversion to S3, not only "matched conversions" from matcher (need to confirm with Gaiar). That would mean you would have to consume 3 different topics: clicks, conversions, matched conversions

Some questions:

  • What does it mean to "scan data" on every query? Does it mean it will see if there is any new file or will it read all of the files every time?
  • Gaiar mentioned glue as well. Isn't it needed at all?
  • Will we need to define table schemas in Athena? If so, is there any terraform plugin to help us manage that?

@sahlone
Copy link
Author

sahlone commented Jun 8, 2018

@vincentdaniel

  1. Whats is a data scan : What it means is when you fire a query, the data is scanned and results are produced from that data. SO the amount of data scanned by athena is included in costing. That's where partitioning can help but after talking to Gair I came to know there s no perfect criteria for queries. But Gair was ok with it as finally, we will move data to Bigquery from s3. For now partitioning will help in uploading data to s3 as we will do it in batches.
    Note : Athena doesnt actually store data. Its exactly like Hive, its just store metadata and gets data on demand
  2. Is Athena supported in Terraform: Yes there is a support for that as well
  3. What about Glue: Glue is actually used for ETL jobs, they are actually Spark jobs running on schedule so I don't think we need that as we will manage them ourselves. The only thing they give us is automatic schema defining, we are not that lazy to not take care of that and yes you pay for every crawler operation.

@gaiar
Copy link

gaiar commented Jun 19, 2018

BigQuerry import currently not required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment