Link Apache Spark with IPython Notebook

How to link Apache Spark 2.1.0 with IPython notebook (Ubuntu)

Tested with

Python 2.7, Ubuntu 16.04 LTS, Apache Spark 2.1.0 & Hadoop 2.7

Download Apache Spark & Build it

Download the Apache Spark source and build it yourself, or download a pre-built version.

I suggest downloading the pre-built version for Hadoop 2.7.

cd /opt
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar xvzf spark-2.1.0-bin-hadoop2.7.tgz
rm -f spark-2.1.0-bin-hadoop2.7.tgz
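
To confirm the archive unpacked correctly, you can print the Spark version with the bundled spark-submit (an optional quick check):

/opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --version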

Install Anaconda

Download and install Anaconda.
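
For example, you can fetch the installer from the Anaconda archive and run it (the version in the URL below is only an example from the Spark 2.1.0 era; substitute a current Python 2.7 build):

wget https://repo.continuum.io/archive/Anaconda2-4.3.0-Linux-x86_64.sh
bash Anaconda2-4.3.0-Linux-x86_64.sh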

Install Jupyter

Once you have installed Anaconda, open your terminal and type:

conda install jupyter
conda update jupyter
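
You can verify that Jupyter installed correctly by printing its version:

jupyter --version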

Link Spark with IPython Notebook

Open a terminal and append the following to your ~/.bashrc (the single quotes keep $SPARK_HOME from expanding until the file is sourced):

echo "export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7" >> ~/.bashrc
echo "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip" >> ~/.bashrc

Now source it to make the changes available in the current terminal:

source ~/.bashrc
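
If the linking worked, a plain Python interpreter can now import pyspark. A quick way to check (pyspark exposes a version string in Spark 2.x):

python -c "import pyspark; print(pyspark.__version__)"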

Run Jupyter Notebook

jupyter notebook --ip=0.0.0.0 --NotebookApp.token=''

The `--ip=0.0.0.0` flag makes the server listen on all network interfaces, and `--NotebookApp.token=''` disables token authentication, which is convenient on a trusted development machine. The Jupyter notebook should now open in your browser.

To check whether Spark is correctly linked, create a new Python 2 notebook inside Jupyter and run the following. You should see something like this:

In [1]: import pyspark
        from pyspark.sql import SQLContext
        sc = pyspark.SparkContext('local[*]')
        sqlContext = SQLContext(sc)
        sc
Out[1]: <pyspark.context.SparkContext at 0x1049bdf90>
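
As a further sanity check, you can run a small job in the next cell (a minimal sketch; any RDD action would do):

In [2]: sc.parallelize(range(1000)).map(lambda x: x * 2).sum()
Out[2]: 999000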