How to quickly set up a BigDL/PySpark local cluster using Docker to mess around

First, make sure you've installed Docker on your system.

  1. Create a folder to hold all files related to your notebook (data files, log files, etc.), and navigate to it

  2. Run this in your terminal:

    $ docker run --name pyspark --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook

    The Docker image will download, and you'll eventually see the output of Jupyter starting. At the end you'll see something like:

    http://(120e4fd32df5 or 127.0.0.1):8888/?token=17729f5aa6dd4f54dd8d16d029a39b85396427543161cc78
    

    This means you should paste http://localhost:8888/?token=<your token> (the token is unique to your session) into your browser's address bar.
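
    If you close that terminal or lose the URL, you can print the container's startup log again (this uses the `pyspark` container name set above):

    $ docker logs pyspark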

  3. Now open a new terminal and type this:

    $ docker exec -ti pyspark bash

    This opens a bash session inside that running container.

  4. In this session you can install Python dependencies, such as BigDL:

    $ pip install BigDL==0.6.0 pylib

    Most other libraries we need are already installed.
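
    As a quick sanity check, you can try importing the package from this same shell. This assumes BigDL 0.6's module layout, where the Python code lives under the top-level `bigdl` package:

    $ python -c "import bigdl; print('BigDL is importable')"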

Back in Jupyter, you can now create a new Python 3 notebook and use pyspark, bigdl, and so on. Every file you put in your initial folder will be visible from the notebook interface and will persist across Docker sessions.
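
As a starting point, here's a minimal sketch of a first notebook cell, assuming the BigDL 0.6 Python API (`create_spark_conf` and `init_engine` live in `bigdl.util.common`; the app name and core count here are arbitrary):

    from pyspark import SparkContext
    from bigdl.util.common import create_spark_conf, init_engine

    # Build a Spark config carrying BigDL's required settings, running locally on 4 cores
    conf = create_spark_conf().setMaster("local[4]").setAppName("bigdl-sandbox")
    sc = SparkContext.getOrCreate(conf=conf)

    # The BigDL engine must be initialized before building models or training
    init_engine()

    # Files from your host folder are mounted at /home/jovyan/work, e.g.:
    # rdd = sc.textFile("/home/jovyan/work/some-data.csv")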
