To take full advantage of the workshop you'll need:
- PySpark installed (anything more recent than 2.3 should be fine)
- Jupyter installed
- Pandas and Arrow installed
- All able to talk to each other
- One or more datasets
You can clone this repository to get the notebook and slides (some things may still change before Saturday, like updating the compiled slides, but the notebook is essentially finished).
Everything should work in Binder, but in case the network connection doesn't work as expected, please clone the repository and install the requirements. Otherwise you will only be able to watch!
You can install PySpark with `pip install pyspark`; doing it in the same environment where you have Jupyter should make them talk to each other just fine. You should also run `pip install pyarrow`, although it's not a big problem if this one fails for some reason. To make the analysis more entertaining, also run `pip install pandas`, again all in the same environment. You can also install everything in conda, with `conda install -c conda-forge pyspark`, although pip may be more convenient (PySpark can easily get confused when there are many Python environments around). A quick way to check that everything can talk to each other is sketched right after this.
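Here is a minimal smoke test, assuming the installation above went through; note that `spark.sql.execution.arrow.enabled` is the Spark 2.x name of the Arrow flag (Spark 3 renamed it to `spark.sql.execution.arrow.pyspark.enabled`, but still honours the old one):

```python
# Minimal smoke test: run it in the same environment where you installed
# everything, to confirm PySpark, pandas and Arrow can talk to each other.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # run Spark locally, using all available cores
    .appName("setup-check")
    # Arrow-backed pandas conversions; Spark 2.x name of the flag
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

pdf = spark.range(1000).toPandas()  # goes through Arrow when enabled
print(type(pdf), len(pdf))          # a pandas DataFrame with 1000 rows

spark.stop()
```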
If you are familiar enough with Docker, I recommend using a Docker container instead.
Run this before the workshop:

```sh
docker pull rberenguel/pyspark_workshop
```
During the workshop (or before) you can start this container with

```sh
docker run --name pyspark_workshop -d -p 8888:8888 -p 4040:4040 -p 4041:4041 -v "$PWD":/home/jovyan/work rberenguel/pyspark_workshop
```

in the folder where you want to create your notebook. To open your notebook, run

```sh
docker logs pyspark_workshop
```

and open the URL provided in the logs (it should look like `http://127.0.0.1:8888/?token=36a20c93f0ee8cab4699e2460261e3b16787a68fbb034aee`).
This container installs Arrow on top of the usual `jupyter/pyspark` image, to allow for some additional optimisations in Spark.
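As a taste of what those optimisations buy you, here is a small illustrative sketch (not part of the workshop materials) of a vectorised pandas UDF, the main Arrow-powered feature available since Spark 2.3; it assumes a running `spark` session like the one built in the smoke test above:

```python
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")  # a scalar pandas UDF, the default kind in Spark 2.x
def double_it(col):
    # Receives a whole pandas Series at once, shipped via Arrow, instead
    # of one row at a time as a plain Python UDF would.
    return col * 2.0

spark.range(5).select(double_it("id").alias("doubled")).show()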
You should also download a dataset. One easy dataset to play with is this Kaggle dataset of NBA shots. It is just a 16 MB CSV, but this is enough to learn how to use the basic pieces of Spark locally. Another interesting dataset is this European soccer dataset with scores and predictions.
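Once you've downloaded one of them, loading it is a few lines of PySpark; here is a sketch where the path `data/shot_logs.csv` is a placeholder for wherever you unzip the Kaggle download:

```python
# Load the NBA shots CSV; header/inferSchema tell Spark to pick up column
# names from the first row and guess column types from the data.
shots = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/shot_logs.csv")  # placeholder path, adjust to your download
)
shots.printSchema()
print(shots.count())
```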