@joshuacook
Last active July 15, 2021 17:57

Local Spark Development

Infrastructure

Launch a Jupyter Notebook server using Docker and the jupyter/pyspark-notebook image on your local machine.

Copy and paste the following command into your terminal.

docker run -d -v `pwd`:/home/jovyan -p 80:8888 jupyter/pyspark-notebook

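This launches a Jupyter Notebook server in the background, mounts the current directory at /home/jovyan inside the container, and maps host port 80 to the container's port 8888, so the server is available at http://localhost/. By default the server requires token authentication.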
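The backtick `pwd` substitution in the command above is shell-specific. As a minimal sketch, assuming Python is available on the host, the same argument list can be assembled in a shell-independent way. The function name docker_run_command and its parameters are hypothetical, introduced only for illustration; the flags and image name are the ones from the command above.

```python
import os


def docker_run_command(host_port=80, workdir=None, image="jupyter/pyspark-notebook"):
    """Build the argument list for the docker run command shown above.

    host_port and workdir are illustrative parameters; the defaults reproduce
    the original command (host port 80, current working directory).
    """
    workdir = workdir or os.getcwd()
    return [
        "docker", "run", "-d",
        "-v", f"{workdir}:/home/jovyan",   # mount the working directory into the container
        "-p", f"{host_port}:8888",         # host port : container port
        image,
    ]


# Print the command instead of running it, so it can be inspected first.
print(" ".join(docker_run_command()))
```

To actually start the container, the list could be passed to subprocess.run(docker_run_command(), check=True). Whether the volume path is accepted as written depends on the Docker installation, so treat this as a starting point rather than a tested recipe.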

Accessing Jupyter

  1. Retrieve the container id of the Jupyter Notebook Server

    docker ps
    

    This command displays currently running Docker containers. Look for the container using the image jupyter/pyspark-notebook.

    Copy the CONTAINER ID.

  2. Retrieve the token. Run the following command, replacing CONTAINERID with the value copied in the previous step.

    docker exec CONTAINERID jupyter notebook list
    

    You should see output like the following:

    Currently running servers:
    http://0.0.0.0:8888/?token=b362ef9ea151f45b29cdcf9e9c39e9c914ef2d93478bce17 :: /home/jovyan
    

    Copy the value after token=. This is the authentication token.

  3. Access the server at http://localhost/ and use the authentication token to sign in.
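Copying the token by hand is easy to get wrong. The following is a minimal sketch of pulling the token out of the listing with the Python standard library, assuming the output has the same shape as the sample in step 2. The function name extract_tokens is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_tokens(listing):
    """Return the token of every server URL found in `jupyter notebook list` output."""
    tokens = []
    for line in listing.splitlines():
        line = line.strip()
        if not line.startswith("http"):
            continue  # skip the "Currently running servers:" header and blank lines
        url = line.split(" :: ")[0].strip()    # drop the trailing notebook directory
        query = parse_qs(urlparse(url).query)  # e.g. {"token": ["b362..."]}
        tokens.extend(query.get("token", []))
    return tokens


# Sample output copied from step 2 above.
sample = """Currently running servers:
http://0.0.0.0:8888/?token=b362ef9ea151f45b29cdcf9e9c39e9c914ef2d93478bce17 :: /home/jovyan
"""
print(extract_tokens(sample))
```

The listing text could come from running the docker exec command in step 2 through subprocess.run with capture_output=True and text=True.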

Create a New Spark Session in Jupyter Notebook

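from os import environ

# Set this before the SparkSession is created; it is read when the JVM starts.
# The _2.11 suffix is the Scala version, which must match the image's Spark build.
environ['PYSPARK_SUBMIT_ARGS'] = '--packages "io.delta:delta-core_2.11:0.5.0" pyspark-shell'

from pyspark import sql

# Run Spark locally with 8 worker threads.
spark = sql.SparkSession.builder \
        .master("local[8]") \
        .getOrCreate()


def display(dataframe):
    # show() prints the first 20 rows by default and returns None.
    return dataframe.show()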
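If more than one package is needed, the --packages option takes a comma-separated list of Maven coordinates. A small sketch of building the PYSPARK_SUBMIT_ARGS value from a list follows; the helper name build_submit_args is hypothetical and not part of PySpark.

```python
def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value from a list of Maven coordinates."""
    if not packages:
        return "pyspark-shell"
    # --packages expects a single comma-separated string of coordinates.
    return '--packages "{}" pyspark-shell'.format(",".join(packages))


# Reproduces the value used in the session setup above.
print(build_submit_args(["io.delta:delta-core_2.11:0.5.0"]))
```

The result would be assigned to environ['PYSPARK_SUBMIT_ARGS'] before the session is created, exactly as in the snippet above.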

Troubleshooting

If you see an error like this:

docker: Error response from daemon: driver failed programming external connectivity on endpoint nifty_solomon (8645fa398b2b8e8a9ec19c8c41aebcfc734b5fa4979721f7cda08d51e8fc17cd): Bind for 0.0.0.0:8888 failed: port is already allocated.

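it means another process, often an earlier Jupyter container, is already bound to that host port. The port in the message is the host port, the left-hand number in -p; with the command above that is 80. Stop the other container (docker ps, then docker stop CONTAINERID), or choose a different host port, for example -p 8889:8888, and open http://localhost:8889/ instead.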
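To check a host port before launching the container, one option is to try binding to it. This is a rough sketch using only the standard library; the function name port_is_free is hypothetical, and the check reflects the machine where the script runs, not Docker's own view of its published ports.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Covers "address already in use", and also permission errors for
            # ports below 1024 when not running with elevated privileges.
            return False
        return True


if __name__ == "__main__":
    for candidate in (80, 8888, 8889):
        print(candidate, "free" if port_is_free(candidate) else "unavailable")
```

A result of False for port 80 does not necessarily mean the port is taken: on many systems an unprivileged process cannot bind ports below 1024 at all.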

@vfortierdatabricks

We have to figure out how this could be made easy for Windows users too. Most banks use Windows as the compute environment.
