
@jayant91089
Last active December 15, 2018 12:54
Describes how to set up Spark 2.1.0 for use with Python (pyspark) via a Jupyter notebook. Does not assume root access. Uses virtualenv.

Get prebuilt spark

mkdir spark_install && cd spark_install
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.1.0-bin-hadoop2.7.tgz 
cd spark-2.1.0-bin-hadoop2.7/

Test the prebuilt spark (this should open a Spark shell; use Ctrl+C to exit)

./bin/spark-shell

Get virtualenv. We assume your python is installed under your home dir, so no sudo is needed.

If you want to install python under your home dir, download the Python source tarball and run ./configure --prefix=any/dir/of/your/choice/where/you/have/write/access. Then run make install and add python's bin directory to the $PATH environment variable.

To install virtualenv

pip install virtualenv
cd ~

Start new virtualenv

virtualenv jupyter_pyspark
source jupyter_pyspark/bin/activate 

Get jupyter and the necessary scientific python packages (jupyter must be installed inside the virtualenv, since the PYSPARK_DRIVER_PYTHON setting below points at the virtualenv's jupyter binary)

pip install jupyter
pip install numpy
pip install scipy
pip install scikit-learn
pip install pandas
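
Optionally, verify the installs from inside the activated virtualenv (a minimal sketch; run it with the virtualenv's python interpreter):

# Confirms the packages above import cleanly and shows their versions
import numpy, scipy, sklearn, pandas
print("numpy", numpy.__version__)
print("scipy", scipy.__version__)
print("scikit-learn", sklearn.__version__)
print("pandas", pandas.__version__)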

Edit ~/.bashrc or spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh

nano ~/.bashrc

Paste the following into ~/.bashrc, or into spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh (this file doesn't originally exist, you have to create it):

export SPARK_HOME=/path/to/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
# Point these at the virtualenv created above (~/jupyter_pyspark)
export PYSPARK_DRIVER_PYTHON=$HOME/jupyter_pyspark/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
export PYSPARK_PYTHON=$HOME/jupyter_pyspark/bin/python

Start a jupyter notebook with pyspark (adjust the number of local worker threads [4] as appropriate)

cd spark_install/spark-2.1.0-bin-hadoop2.7
./bin/pyspark --master local[4]
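
Once the notebook is up, a minimal first-cell sanity check (a sketch; it assumes the notebook was launched through ./bin/pyspark as above, so pyspark's startup script has already created the SparkContext `sc` in the kernel):

import sys
# The kernel should be running the virtualenv's python
print(sys.executable)
# `sc` is the SparkContext created by pyspark's startup script
rdd = sc.parallelize(range(1000))
print(rdd.sum())  # should print 499500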

If you executed all of the above on a remote machine from a local Linux box via ssh:

You can open an ssh tunnel as follows. This way, you can open the jupyter notebook in your local browser instead of having to use the browser on the remote machine via ssh -X. With the tunnel below, open your local browser at http://localhost:8889 and enter the token printed in your terminal in the previous step (the remote notebook listens on port 8880, as configured above).

ssh -N -f -L localhost:8889:localhost:8880 yourusername@remotehost

(The above gist has been successfully tested on Ubuntu 14.04 LTS with Intel Xeon E5-2620 and Intel Celeron N3160)

@Bonnners

Hi, thanks for this. Please could you tell me how to do:
spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh (this file doesn't originally exist, you have to create it)
?
