Apache Spark installation + ipython/jupyter notebook integration guide for macOS

Tested with Apache Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112

For older versions of Spark and IPython, please see the previous revision of this text.

Install Java Development Kit

Download and install it from oracle.com

Add the following code to your shell profile, e.g. ~/.bash_profile:

# For Apache Spark
if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

Install Apache Spark

You can use the macOS package manager Homebrew (http://brew.sh/):

brew update
brew install scala
brew install apache-spark

Set up env variables

Add the following code to your shell profile, e.g. ~/.bash_profile:

# For ipython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec/"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
  export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
fi
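
To confirm the variables are picked up, you can try importing pyspark from a plain Python shell. This is a minimal sketch; it assumes a new shell in which the ~/.bash_profile above has been sourced:

# Quick sanity check of the PYTHONPATH wiring
import pyspark
print(pyspark.__version__)  # should print 2.1.0 for the setup above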

You can check the SPARK_HOME path using the following brew command:

$ brew info apache-spark
apache-spark: stable 2.1.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/2.1.0 (1,312 files, 213.9M) *
  Built from source on 2017-02-13 at 00:58:12
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb

Also check the py4j version and subpath, as it may differ from version to version.
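
If you prefer not to hardcode the py4j version at all, a small sketch like the following can locate the zip under SPARK_HOME at runtime (the glob pattern is an assumption about the standard Spark layout):

import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]
# Pick up whatever py4j-*-src.zip ships with this Spark build
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])
else:
    raise RuntimeError("py4j zip not found under %s" % spark_home)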

Ipython profile

Profiles are no longer supported in Jupyter, so you will now see the following deprecation warning:

$ ipython notebook --profile=pyspark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[W 01:45:07.821 NotebookApp] Unrecognized alias: '--profile=pyspark', it will probably have no effect.

It seems that it is no longer possible to run custom startup files as it was with IPython profiles. Thus, the easiest way is to run the pyspark init script manually at the beginning of your notebook, or to follow the alternative way described below.

Run ipython

$ jupyter-notebook

Initialize pyspark

In [1]: import os
        execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
Out[1]: <pyspark.context.SparkContext at 0x10a982b10>

The sc variable should now be available:

In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x10a982b10>
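
As a quick smoke test (a minimal example, any small job will do), you can run a simple computation on the context:

In [3]: rdd = sc.parallelize(range(10))  # distribute a small list
        rdd.sum()                        # 0 + 1 + ... + 9
Out[3]: 45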

Alternatively

You can also force the pyspark shell command to run an ipython web notebook instead of the command-line interactive interpreter. To do so, add the following env variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

and then simply run

$ pyspark

which will open a web notebook with sc available automatically.
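
In Spark 2.x the same shell initialization also creates a SparkSession, exposed as spark, so a quick DataFrame check is possible as well (a minimal sketch):

In [1]: df = spark.range(5)  # DataFrame with a single 'id' column, values 0..4
        df.count()
Out[1]: 5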


klinkin commented May 20, 2015

The trailing slash is not needed:

export SPARK_HOME="/usr/local/Cellar/apache-spark/1.3.1_1/libexec/"

enahwe commented Jul 1, 2015

For Spark 1.4.x we have to add 'pyspark-shell' at the end of the environment variable "PYSPARK_SUBMIT_ARGS". So I adapted the script '00-pyspark-setup.py' for Spark 1.3.x and Spark 1.4.x as follows, detecting the version of Spark from the RELEASE file.

Here is the code:

# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))
Owner

ololobus commented Dec 4, 2015

Thank you @enahwe! I've added it to the text, hope you've tested it :)
Unfortunately, I missed your comment in July, and now Spark 1.5.x has already been released...

Hi, thanks for your suggestions. I did exactly what you suggested, but when I open a notebook and type sc, I do not get the expected output. I just get ' '. Something is not configured properly. I have a Mac with El Capitan, Spark 1.5.2, and I'm running Jupyter. Any help?

Owner

ololobus commented Dec 22, 2015

@sri-srinivas I'm not a Spark user currently, so I can't test it, but you can try the following 00-pyspark-setup.py startup file with Spark 1.5.*

# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark
exec(open(os.path.join(spark_home, "python/pyspark/shell.py")).read())

@sri-srinivas I got the same problem. Did you solve it?

Hello. How can I set the PYSPARK_SUBMIT_ARGS environment variable on Windows, and what should I put in its path?
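
For reference, one portable option (just a sketch, not specific to Windows; the master and option values below are examples) is to set the variable from Python itself, before the Spark shell script is executed:

import os

# Example values; adjust the master and options to your setup.
# This must run before python/pyspark/shell.py is executed.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"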

hlin117 commented Aug 9, 2016

Nowadays, brew install apache-spark installs Spark 2.0.0. (Possibly a higher version in the future.)

The bundled py4j library is also updated, and it's not 0.8.2.1 anymore.

@sri-srinivas , @arendale
Check this line:
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
The py4j version in the path should match the one you actually have installed.
Change it if needed and rerun.

Owner

ololobus commented Feb 13, 2017

@enahwe @sri-srinivas @arendale @Nomii5007 @hlin117 @sanjitroy
I've updated the text to match the latest versions of Spark, Java and Python 2 (tested with Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112).

worked like a charm!!

acmiyaguchi commented Jul 30, 2017

It might be useful to avoid hardcoding the py4j library version by using the following command:

export PYTHONPATH="${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-*-src.zip:${PYTHONPATH}"

Where should I write that? I have the same problems, but on an Ubuntu machine working with an AWS cluster.
