seahyc/data_team_take_home.md

## data_team_take_home.md

      
## General Brief

This assignment has two parts: a main task and a bonus section. The bonus section is optional, but a well-executed attempt will add to your final score.

## Main Task

Given a set of keywords, find the top k most similar documents among a set of N documents from the dataset (click this link to download all-the-news.zip). Implement this part of the task with Apache Spark running in Docker.
Your Apache Spark cluster should consist of more than one worker (slave) node.
Your submission should include:

- Any necessary Dockerfile, image, or scripts
- A Python CLI program written in Python 3.x that follows PEP 8
- A README that consists of the following:
  - Instructions on how to set up your submission
  - Instructions on how to run your submission
  - A description of the algorithm that you have implemented and how it retrieves the top-k similar documents
  - An explanation of your choice of algorithm or heuristic
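
To make "top k most similar documents" concrete, here is a minimal sketch of one possible baseline: TF-IDF weighting with cosine similarity. It is written in plain Python rather than Spark so that it runs anywhere. The function names `tokenize` and `top_k_similar`, the whitespace tokenizer, and the smoothed IDF formula are all illustrative assumptions, not requirements of this brief; other algorithms are equally acceptable.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def top_k_similar(keywords, documents, k):
    """Return up to k (doc_index, score) pairs ranked by TF-IDF cosine similarity."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed IDF (an assumption) so unseen terms never divide by zero.
        return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0

    def vectorize(tokens):
        # Sparse TF-IDF vector stored as {term: weight}.
        counts = Counter(tokens)
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize([t for kw in keywords for t in tokenize(kw)])
    scored = [
        (index, cosine(query_vec, vectorize(tokens)))
        for index, tokens in enumerate(tokenized)
    ]
    # Highest similarity first; keep only the first k results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In a Spark implementation, the same steps (tokenize, compute document frequencies, weight, score, take the top k) would be distributed across the workers instead of running in a single process.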


## Bonus Section

Solve the same problem as in the main task, but with Elasticsearch and Docker instead of Apache Spark.
Your submission should include:

- Any necessary Dockerfile, image, or scripts
- A Python CLI program written in Python 3.x that follows PEP 8
- A README detailing:
  - How to set up your submission
  - How to run your submission
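
As a rough illustration of how a top-k keyword search might be expressed for Elasticsearch, the sketch below builds a search request body as a plain Python dictionary. The helper name `build_search_body` and the field name `content` are assumptions made for this example; the actual field depends on how you index the dataset. No client library is used here, so sending the body to a cluster is left out.

```python
import json


def build_search_body(keywords, k, field="content"):
    """Build a search request body that returns the k best matches for the keywords."""
    return {
        # Number of hits to return, i.e. the "k" in top-k.
        "size": k,
        "query": {
            # A `match` query analyzes the text and, by default, matches
            # documents containing any of the resulting terms.
            "match": {field: " ".join(keywords)}
        },
    }


if __name__ == "__main__":
    # Hypothetical keywords, printed as the JSON that would be sent.
    print(json.dumps(build_search_body(["election", "senate"], k=5), indent=2))
```

Elasticsearch returns hits sorted by relevance score in descending order, so the first `size` hits are the top-k documents under its default scoring.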