samklr/gist:743d927dd0a5f5671c64b1d346e7b318

## gistfile1.md

      
### Context
The Integration team has deployed a cron job to dump a CSV file containing all the new Shopify configurations daily at 2 AM UTC.
The task is to build a daily pipeline that will:

- download the CSV file from https://alg-data-public.s3.amazonaws.com/[YYYY-MM-DD].csv,
- filter out each row with an empty `application_id`,
- add a `has_specific_prefix` column set to true if the value of `index_prefix` differs from `shopify_`, and to false otherwise,
- load the valid rows into a PostgreSQL instance.

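
The filter and enrichment steps above could be sketched as follows. This is only an illustration, not a reference solution: the function names `transform_rows` and `transform_csv` are hypothetical, and the sketch assumes that a missing or whitespace-only `application_id` counts as empty and that a missing `index_prefix` counts as differing from `shopify_`.

```python
import csv
import io

# Default prefix named in the brief; rows using it get has_specific_prefix = False.
DEFAULT_PREFIX = "shopify_"


def transform_rows(rows):
    """Yield rows that have an application_id, adding has_specific_prefix."""
    for row in rows:
        # Assumption: missing, empty, or whitespace-only IDs are all "empty".
        if not (row.get("application_id") or "").strip():
            continue
        enriched = dict(row)  # copy so the input row is left untouched
        enriched["has_specific_prefix"] = row.get("index_prefix") != DEFAULT_PREFIX
        yield enriched


def transform_csv(text):
    """Parse CSV text (with a header line) and return the transformed rows."""
    return list(transform_rows(csv.DictReader(io.StringIO(text))))
```

Keeping the transformation a pure function of its input rows, separate from the download and the database load, makes it straightforward to unit test without network or database access.
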
The pipeline should process files from 2019-04-01 to 2019-04-07.
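
As a rough sketch, the one-week range can be expanded into one file URL per day. The helper name `daily_urls` is hypothetical, and the code assumes the square brackets in the URL pattern are placeholder notation rather than literal characters in the file name.

```python
from datetime import date, timedelta

BASE_URL = "https://alg-data-public.s3.amazonaws.com"


def daily_urls(start, end):
    """Return (day, url) pairs for every day from start to end, inclusive."""
    pairs = []
    day = start
    while day <= end:
        # isoformat() renders a date as YYYY-MM-DD, matching the file naming.
        pairs.append((day, f"{BASE_URL}/{day.isoformat()}.csv"))
        day += timedelta(days=1)
    return pairs


if __name__ == "__main__":
    for day, url in daily_urls(date(2019, 4, 1), date(2019, 4, 7)):
        print(day, url)
```

With Apache Airflow, the same range would more typically be expressed through the DAG's `start_date`, `end_date`, and catchup settings, with each run deriving its file name from its own execution date.
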
The candidate can choose whichever orchestration tool they are most comfortable with. Using Apache Airflow is appreciated, as it is the tool we currently use.
The pipeline should be easy to run using Docker and docker-compose.
This pipeline is relatively simple on purpose because we want you to concentrate on delivering an assignment as close as possible to something we could put in production.
The candidate will of course be evaluated on the implementation of the solution and whether it works, but also, and strongly, on whether the code is production-ready.
Hence, keep the following in mind:

- The code must be available in a GitHub repository.
- Python is not optional.
- Writing documentation and a detailed README is not optional.
- Writing Python unit tests is not optional.
- The quality of the code will be assessed.
- Following Git best practices is strongly encouraged, as Git is a tool we use daily at Algolia.

Finally, we would rather the candidate take a few more days to polish the code than rush to ship it.