This assignment has 2 parts to it. You may attempt the bonus section, which will definitely add to the final score if well-executed upon.
Implement this part of the task with Apache Spark using Docker. Given a set of keywords, find the top k most similar documents among a set of N documents from the dataset - (click this link to download all-the-news.zip), using Apache Spark and Docker. Your Apache Spark setup should consist of more than 1 slave.
Your submission:
- Any necessary Dockerfile, image or scripts
- A Python-CLI program with Python 3.x, PEP8
- README that consists of the following:
- Instructions on how to setup your submission
- Instructions on how to run your submission
- Description of the algorithm that you have implemented and how it retrieves the top-k similar documents
- Explanation of your choice of algorithm/ heuristic
For the same task above, instead of using Apache Spark, implement a solution for the same problem with Elasticsearch and Docker.
Your submission:
- Any necessary Dockerfile, image or scripts
- A Python-CLI program with Python 3.x, PEP8
- README detailing:
- How to setup your submission
- How to run your submission