Start by downloading the latest stable Apache Spark release (currently 2.4.3).
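For example, from the command line (the archive URL below follows the usual Apache archive pattern and is an assumption; verify it against the official downloads page):
# Download the Spark 2.4.3 binary package into ~/Downloads
wget -P ~/Downloads https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz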
cd ~/Downloads/
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/
ln -s ~/spark-2.4.3-bin-hadoop2.7 ~/spark
Install PySpark and Jupyter in our virtualenv
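If the virtualenv does not exist yet, one way to create and activate it is sketched below (the ~/venvs/spark path is only an illustrative choice):
# Create and activate a virtualenv for the Spark tooling
python3 -m venv ~/venvs/spark
source ~/venvs/spark/bin/activate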
pip install pyspark jupyter
Set up the environment variables
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
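Note that these exports only last for the current shell session. To persist them, one option (assuming bash; adjust for your shell) is to append them to ~/.bashrc and reload it:
# Persist the Spark/Jupyter variables across sessions
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
EOF
source ~/.bashrc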
By defining notebook-dir, we will be able to store and persist our notebooks in the desired folder.
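To check that everything is wired up, one quick test (assuming the notebook directory above) is to create that folder and launch pyspark, which should now open a Jupyter notebook server pointed at it:
# Create the notebook folder and start PySpark, which launches Jupyter
mkdir -p ~/Projects/notebooks
pyspark
In a new notebook opened from that server, a pre-created spark session should be available for running Spark jobs.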