@samklr
Created November 26, 2020 21:00
Data Engineering assignment

Context

The Integration team has deployed a cron job that dumps a CSV file containing all the new Shopify configurations daily at 2 AM UTC. The task is to build a daily pipeline that will:

- download the CSV file from https://alg-data-public.s3.amazonaws.com/[YYYY-MM-DD].csv,
- filter out each row with an empty application_id,
- add a has_specific_prefix column set to true if the value of index_prefix differs from shopify_, and to false otherwise,
- load the valid rows into a PostgreSQL instance.

The pipeline should process files from 2019-04-01 to 2019-04-07.
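As an illustration only, the URL pattern and the two row-level rules above could be sketched in plain Python as follows. The helper names (`daily_urls`, `parse_csv`, `transform`) are hypothetical, the sketch assumes the CSV has a header row with `application_id` and `index_prefix` columns, and it treats a missing or whitespace-only `application_id` as empty. The download and the PostgreSQL load are deliberately left out.

```python
import csv
import io
from datetime import date, timedelta
from typing import Iterable, Iterator

BASE_URL = "https://alg-data-public.s3.amazonaws.com"
DEFAULT_PREFIX = "shopify_"


def daily_urls(start: date, end: date) -> Iterator[str]:
    """Yield one CSV URL per day from start to end, both inclusive."""
    day = start
    while day <= end:
        yield f"{BASE_URL}/{day.isoformat()}.csv"
        day += timedelta(days=1)


def parse_csv(text: str) -> Iterator[dict]:
    """Parse CSV text with a header row into one dict per data row."""
    return csv.DictReader(io.StringIO(text))


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Drop rows without an application_id and add has_specific_prefix."""
    for row in rows:
        # Assumption: a missing or whitespace-only value counts as empty.
        if not (row.get("application_id") or "").strip():
            continue
        # Exact comparison against the default prefix, as the brief states.
        specific = row.get("index_prefix") != DEFAULT_PREFIX
        yield {**row, "has_specific_prefix": specific}
```

Keeping the transformation a pure function over dictionaries, separate from the network and database code, makes it straightforward to unit test and to call from whichever orchestrator is chosen. For example, `list(daily_urls(date(2019, 4, 1), date(2019, 4, 7)))` produces the seven URLs the pipeline has to process.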

The candidate can choose any orchestration tool they are most comfortable with. Using Apache Airflow is appreciated as it is the one we are using at the moment.

This pipeline should be easy to run using Docker and docker-compose.

This pipeline is relatively simple on purpose because we want you to concentrate on delivering an assignment as close as possible to something we could put in production.

The candidate will of course be evaluated on the implementation of the solution and whether it works, but also, and strongly, on whether the code is production-ready.

Hence, keep the following in mind:

- The code must be available in a GitHub repo.
- Python is not optional.
- Writing documentation and a detailed README is not optional.
- Writing Python unit tests is not optional.
- The quality of the code will be assessed.
- Using Git and its best practices is strongly encouraged, as this is a tool we use daily at Algolia.

Finally, we prefer the candidate to take a few more days to polish the code rather than rushing to ship it.
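To make the unit-test requirement concrete, here is a minimal sketch using the standard library's `unittest`. The `has_specific_prefix` helper is hypothetical and is defined inline only so the example is self-contained; it assumes the rule is an exact comparison against `shopify_`.

```python
import unittest


def has_specific_prefix(index_prefix: str) -> bool:
    """Hypothetical helper: True when the prefix is not the default one."""
    return index_prefix != "shopify_"


class HasSpecificPrefixTest(unittest.TestCase):
    def test_default_prefix_is_not_specific(self):
        self.assertFalse(has_specific_prefix("shopify_"))

    def test_custom_prefix_is_specific(self):
        self.assertTrue(has_specific_prefix("custom_"))

    def test_comparison_is_exact(self):
        # Assumption: the whole value is compared, so a longer prefix that
        # merely starts with "shopify_" still counts as specific.
        self.assertTrue(has_specific_prefix("shopify_staging_"))
```

Saved as a test module, this can be run with `python -m unittest`. The third case pins down an interpretation the brief leaves open, which is the kind of decision worth recording in the README as well.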
