Set up the environment variables.
export SPARK_VERSION=2.4.0
export SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop2.7
export SPARK_HOME=$HOME/spark-${SPARK_VERSION}
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=${HOME}/Projects/notebooks"
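With these exported, the archive URL used by the download step below can be derived and sanity-checked. A minimal sketch (re-declaring the same values so it runs standalone):

```shell
# Re-declare the variables from above so this snippet runs on its own.
export SPARK_VERSION=2.4.0
export SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop2.7
export SPARK_HOME=$HOME/spark-${SPARK_VERSION}

# Print the archive URL the download step will request, as a sanity check.
echo "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz"
```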
Download the desired Spark version.
curl -sL --retry 3 \
"https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
| gunzip \
| tar x -C /tmp/ \
&& mv /tmp/$SPARK_PACKAGE $SPARK_HOME
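Apache publishes a .sha512 checksum file next to each release tarball, so the download can be verified before use. A sketch of the comparison, assuming GNU coreutils' sha512sum; it uses a small stand-in file so the commands run anywhere, and in real use EXPECTED would come from the published ${SPARK_PACKAGE}.tgz.sha512 and ARCHIVE would be the downloaded tarball:

```shell
# Stand-in for the downloaded ${SPARK_PACKAGE}.tgz archive.
ARCHIVE=/tmp/spark-demo.tgz
printf 'stand-in archive' > "$ARCHIVE"

# In real use, EXPECTED is the digest read from the published .sha512 file.
EXPECTED=$(sha512sum "$ARCHIVE" | awk '{print $1}')

# Digest of the local file, compared against the expected one.
ACTUAL=$(sha512sum "$ARCHIVE" | awk '{print $1}')
if [ "$EXPECTED" = "$ACTUAL" ]; then
  echo "checksum OK: $ARCHIVE"
else
  echo "checksum MISMATCH: $ARCHIVE" >&2
  exit 1
fi
```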
Install PySpark and Jupyter in our virtualenv.
pip install \
sparkmagic==0.12.6 \
prompt-toolkit==1.0.15 \
pyspark==${SPARK_VERSION} \
jupyter==1.0.0
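A quick smoke test after the install (a sketch; the fallback message is ours, and the interpreter may be named python3 on some systems):

```shell
# Verify that the pinned pyspark package resolved and imports cleanly.
# Falls back to a message instead of failing when run outside the venv.
python -c "import pyspark; print(pyspark.__version__)" 2>/dev/null \
  || echo "pyspark not importable - activate the virtualenv first"
```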
By defining notebook-dir, we can store and persist our notebooks in the desired folder.
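Putting it together: with the variables above exported, running pyspark now starts a Jupyter notebook server rooted at that folder. A sketch, creating the directory first so Jupyter does not complain about a missing path (the launch line is commented out because it blocks the terminal until interrupted):

```shell
# Ensure the notebook directory named in PYSPARK_DRIVER_PYTHON_OPTS exists.
mkdir -p "${HOME}/Projects/notebooks"

# Launching pyspark now opens a Jupyter notebook server in that folder:
# pyspark
```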