@seahyc
Last active October 29, 2019 06:13
Added general brief

General Brief

This assignment has two parts: a main task and an optional bonus section. A well-executed bonus section will add to your final score.

Main Task

Implement this part of the task with Apache Spark running in Docker. Given a set of keywords, find the top k most similar documents among the N documents in the dataset (download: all-the-news.zip). Your Apache Spark setup should consist of more than one slave (worker) node.

Your submission:

  • Any necessary Dockerfile, image or scripts
  • A Python CLI program written in Python 3.x that follows PEP 8
  • README that consists of the following:
    • Instructions on how to set up your submission
    • Instructions on how to run your submission
    • Description of the algorithm that you have implemented and how it retrieves the top-k similar documents
    • Explanation of your choice of algorithm/heuristic
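
To make the retrieval problem concrete, here is a minimal plain-Python sketch of one possible approach: TF-IDF weighting with cosine similarity. The brief does not prescribe this algorithm; it is an assumption chosen for illustration, and the function names (`tokenize`, `top_k_similar`) are hypothetical. A Spark solution would distribute the same steps across the workers rather than run them in a single process.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_similar(keywords, documents, k):
    """Return up to k (doc_index, score) pairs, best score first."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))

    def tfidf(tokens):
        # Smoothed IDF, so unseen or very common terms keep a positive
        # weight instead of dividing by zero or dropping to zero.
        counts = Counter(tokens)
        return {
            term: count
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }

    def cosine(a, b):
        dot = sum(w * b.get(term, 0.0) for term, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = tfidf(tokenize(" ".join(keywords)))
    scored = (
        (index, cosine(query_vec, tfidf(tokens)))
        for index, tokens in enumerate(doc_tokens)
    )
    # nlargest keeps only k candidates in memory while scanning.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Calling `top_k_similar(["spark", "docker"], documents, 5)` returns the indices and scores of the five best matches. Other choices, such as BM25 or embedding-based similarity, would fit the same interface.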

Bonus Section

Solve the same problem as in the main task, but with Elasticsearch and Docker instead of Apache Spark.

Your submission:

  • Any necessary Dockerfile, image or scripts
  • A Python CLI program written in Python 3.x that follows PEP 8
  • README detailing:
    • How to set up your submission
    • How to run your submission
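
As a rough illustration of how the bonus task maps onto Elasticsearch, the sketch below builds the JSON body for a full-text `match` query limited to the top k hits. It only constructs the request body and does not talk to a cluster. The helper name `build_search_body` is hypothetical, and the field name `content` is an assumption about how the articles would be indexed, not something the brief specifies.

```python
import json


def build_search_body(keywords, k, field="content"):
    """Build an Elasticsearch search body for a top-k keyword query.

    `field` is an assumed name for the indexed article text.
    """
    return {
        "size": k,  # return only the k best-scoring hits
        "query": {
            "match": {
                # A match query analyzes the text and, by default,
                # matches documents containing any of the terms.
                field: {"query": " ".join(keywords)}
            }
        },
    }


if __name__ == "__main__":
    body = build_search_body(["spark", "docker"], 5)
    print(json.dumps(body, indent=2))
```

Elasticsearch scores hits by relevance and returns them in descending score order, so the first `size` hits are the top k documents for the query.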