@afranzi
Last active May 16, 2019 08:47
Setup PySpark with Jupyter notebooks

Start by downloading the latest stable Apache Spark release (2.4.3 at the time of writing), then unpack it and symlink it into your home directory:

cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark
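Before going further, it is worth confirming that the symlink points at a working Spark layout. A minimal check, assuming the ~/spark link created above:

```shell
# Confirm the ~/spark symlink resolves to a usable Spark install
if [ -x "$HOME/spark/bin/spark-submit" ]; then
  "$HOME/spark/bin/spark-submit" --version
  SPARK_OK=yes
else
  echo "Spark launcher not found under ~/spark"
  SPARK_OK=no
fi
```

If the launcher is missing, re-check the tarball name and the mv/ln -s paths above.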

Install PySpark and Jupyter in your virtualenv:

pip install pyspark jupyter
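A quick sanity check (a sketch, assuming the virtualenv is active) verifies that the pyspark package is importable by the same interpreter the virtualenv exposes:

```shell
# Check that pyspark was installed into the active virtualenv
if python -c "import pyspark" >/dev/null 2>&1; then
  PYSPARK_OK=yes
else
  PYSPARK_OK=no
fi
echo "pyspark importable: $PYSPARK_OK"
```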

Set up the environment variables:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
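These exports only last for the current session. To persist them, one option (assuming bash; zsh users would use ~/.zshrc instead) is to append them to the shell profile:

```shell
# Optional: persist the variables across sessions in the bash profile
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
EOF
```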

By setting --notebook-dir, Jupyter stores and persists our notebooks in the chosen folder. With the variables above exported, running pyspark now launches a Jupyter notebook server instead of the plain Python shell.
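For a one-off headless check that Spark itself works, you can temporarily fall back to the plain Python driver so pyspark does not open Jupyter (a sketch, assuming Spark and PySpark are installed as above):

```shell
# Headless sanity check: override the Jupyter driver for a single run
if command -v pyspark >/dev/null 2>&1; then
  PYSPARK_DRIVER_PYTHON=python PYSPARK_DRIVER_PYTHON_OPTS="" pyspark <<'EOF'
print(spark.range(5).count())
EOF
else
  echo "pyspark is not on PATH yet"
fi
SANITY_DONE=yes
```

For everyday use, just run pyspark from a terminal and create notebooks from the Jupyter UI.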
