@rberenguel · Last active November 14, 2019 09:22
Quick writeup of the requirements for my PySpark workshop at PyDay 2019, Barcelona (https://pybcn.org/pyday-bcn-2019/)

To take full advantage of the workshop you'll need:

  • PySpark installed (anything more recent than 2.3 should be fine)
  • Jupyter installed
  • Pandas and Arrow installed
  • All of them able to talk to each other (see the sanity check after this list)
  • One or more datasets
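
Here is a minimal sanity-check sketch for the first four points, assuming Spark 2.x (the Arrow configuration key was renamed in Spark 3): if it runs cleanly in a notebook, everything can talk to everything.

```python
# Minimal sanity check: if this runs without errors, PySpark, Pandas
# and Arrow are installed and can talk to each other.
import pandas as pd
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("pyday-sanity-check")
    .config("spark.sql.execution.arrow.enabled", "true")  # Spark 2.3/2.4 key
    .getOrCreate()
)

pdf = pd.DataFrame({"x": [1, 2, 3]})
sdf = spark.createDataFrame(pdf)  # Pandas -> Spark (uses Arrow when enabled)
print(sdf.toPandas())             # Spark -> Pandas (uses Arrow when enabled)
spark.stop()
```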

You can clone this repository to get the notebook and slides (some things may still change until Saturday, like uploading and updating the compiled slides, but the notebook is essentially finished).

Everything should work in Binder, but in case the network connection doesn't work as expected, please clone the repository and install the requirements. Otherwise you will only be able to watch!


You can install PySpark with pip install pyspark; doing it in the same environment where you have Jupyter should make them talk to each other just fine. You should also run pip install pyarrow, although it is not a big problem if this one fails for some reason. To make the analysis more entertaining, also run pip install pandas, again in the same environment. You can do the same with conda, via conda install -c conda-forge pyspark, although pip might be more convenient (PySpark can get easily confused when there are many Python environments around).
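
A quick way to confirm that everything ended up in the same environment is to import the three packages from a notebook and print their versions (a sketch, nothing workshop-specific):

```python
# If any of these imports fails, that package is missing from the
# environment your Jupyter kernel runs in.
import pandas
import pyarrow
import pyspark

print("pyspark:", pyspark.__version__)
print("pandas: ", pandas.__version__)
print("pyarrow:", pyarrow.__version__)
```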


If you are familiar enough with Docker, I recommend using a Docker container instead.

Run this before the workshop:

docker pull rberenguel/pyspark_workshop

During the workshop (or before) you can use this docker container with

docker run --name pyspark_workshop -d -p 8888:8888 -p 4040:4040 -p 4041:4041 -v "$PWD":/home/jovyan/work rberenguel/pyspark_workshop

in the folder where you want to create your notebook (port 8888 is Jupyter; 4040 and 4041 are for the Spark UI). To open your notebook, run

docker logs pyspark_workshop 

and open the URL provided in the logs (it should look like http://127.0.0.1:8888/?token=36a20c93f0ee8cab4699e2460261e3b16787a68fbb034aee)

This container installs Arrow on top of the usual jupyter/pyspark image, to allow for some additional optimisations in Spark.
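
As an illustration of what those optimisations look like, a scalar pandas UDF (available since Spark 2.3) processes whole Pandas Series shipped over Arrow instead of going row by row. This is a sketch, assuming a running SparkSession named spark as in the notebook; times_two is a made-up example, not workshop code:

```python
# A scalar pandas UDF: Arrow moves column batches between the JVM and
# Python, so times_two receives a whole pandas Series per batch.
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())  # scalar is the default UDF type in Spark 2.x
def times_two(v):
    return v * 2.0

spark.range(10).withColumn("doubled", times_two("id")).show()
```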


You should also download a dataset. One easy dataset to play with is this Kaggle dataset of NBA shots. It is just a 16 MB CSV, but that is enough to learn how to use the basic pieces of Spark locally. Another interesting dataset is this European soccer dataset with scores and predictions.
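
Once you have a CSV, loading it into Spark looks roughly like this; the file name is a placeholder for whatever you downloaded (in the Docker setup, put the file in the folder you mounted, which appears as work/ in the container):

```python
# Read a CSV with a header row, letting Spark guess the column types.
# "shot_logs.csv" stands in for whichever dataset you grabbed.
shots = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("shot_logs.csv")
)
shots.printSchema()
shots.show(5)
```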
