Apache Spark installation + ipython/jupyter notebook integration guide for macOS

Tested with Apache Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112

For older versions of Spark and ipython, please see the previous version of this text.

Install Java Development Kit

Download and install it from oracle.com

Add the following code to your shell profile, e.g. .bash_profile:

# For Apache Spark
if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

Install Apache Spark

You can use the macOS package manager Homebrew (http://brew.sh/):

brew update
brew install scala
brew install apache-spark

Set up env variables

Add the following code to your shell profile, e.g. .bash_profile:

# For ipython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec/"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
  export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
fi
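After reloading your profile you can sanity-check that every PYTHONPATH entry actually exists on disk; a minimal sketch (the helper name is illustrative, not part of Spark):

```python
import os

def missing_pythonpath_entries(pythonpath):
    """Return the PYTHONPATH entries that do not exist on disk."""
    return [p for p in pythonpath.split(os.pathsep) if p and not os.path.exists(p)]

# Report anything that looks wrong, e.g. a py4j zip with the wrong version number.
for entry in missing_pythonpath_entries(os.environ.get("PYTHONPATH", "")):
    print("Missing:", entry)
```

A missing entry usually means the Spark version in SPARK_HOME or the py4j version in the zip name is out of date.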

You can check the SPARK_HOME path using the following brew command:

$ brew info apache-spark
apache-spark: stable 2.1.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/2.1.0 (1,312 files, 213.9M) *
  Built from source on 2017-02-13 at 00:58:12
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb

Also check the py4j version and subpath, since it may differ from version to version.
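Instead of hardcoding the py4j version, you can locate the bundled zip with a glob; a small sketch (find_py4j_zip is a hypothetical helper, not a Spark API):

```python
import glob
import os

def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip shipped under SPARK_HOME, or None."""
    matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    return matches[0] if matches else None
```

The returned path can then be appended to PYTHONPATH (or sys.path) without editing it on every Spark upgrade.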

Ipython profile

Profiles are no longer supported in Jupyter, so now you will see the following deprecation warning:

$ ipython notebook --profile=pyspark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[W 01:45:07.821 NotebookApp] Unrecognized alias: '--profile=pyspark', it will probably have no effect.

It seems that it is no longer possible to run custom startup files as it was with ipython profiles. Thus, the easiest way is to run the pyspark init script manually at the beginning of your notebook, or to follow the alternative way below.

Run jupyter notebook

$ jupyter-notebook

Initialize pyspark

In [1]: import os
        execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
Out[1]: <pyspark.context.SparkContext at 0x10a982b10>

The sc variable should now be available:

In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x10a982b10>

Alternatively

You can also force the pyspark shell command to run the ipython web notebook instead of the command-line interactive interpreter. To do so, add the following env variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

and then simply run

$ pyspark

which will open a web notebook with sc available automatically.
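The same two variables can also be set from a small Python wrapper before launching pyspark; a sketch, assuming pyspark and jupyter are on your PATH:

```python
import os
import subprocess

# Build an environment that tells pyspark to use jupyter notebook as its driver.
env = dict(os.environ)
env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"

# subprocess.run(["pyspark"], env=env)  # uncomment to launch the notebook
```

This keeps the variables scoped to the launched process instead of your whole shell session.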

klinkin commented May 20, 2015

The trailing slash is not needed:

export SPARK_HOME="/usr/local/Cellar/apache-spark/1.3.1_1/libexec/"
enahwe commented Jul 1, 2015

For Spark 1.4.x we have to add 'pyspark-shell' at the end of the environment variable "PYSPARK_SUBMIT_ARGS". So I adapted the script '00-pyspark-setup.py' for Spark 1.3.x and Spark 1.4.x as follows, by detecting the version of Spark from the RELEASE file.

Here is the code:

# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))
ololobus commented Dec 4, 2015

Thank you @enahwe! I've added it to the text, hope you've tested it :)
Nevertheless, I missed your comment in July, and now Spark 1.5.x is already released...

sri-srinivas commented Dec 17, 2015

Hi, thanks for your suggestions. I did exactly what you suggested, but when I open a notebook and type sc, I do not get the expected output; I just get ' '. Something is not configured properly. I have a Mac (El Capitan), Spark 1.5.2, and am running Jupyter. Any help?

ololobus commented Dec 22, 2015

@sri-srinivas I'm not a Spark user currently, so I can't test it, but you can try the following 00-pyspark-setup.py startup file with Spark 1.5.*

# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark
exec(open(os.path.join(spark_home, "python/pyspark/shell.py")).read())
arendale commented May 26, 2016

@sri-srinivas I got the same problem. Did you solve it?

Nomii5007 commented Jun 16, 2016

Hello, how can I set the PYSPARK_SUBMIT_ARGS environment variable on Windows, and what value should I give it?

hlin117 commented Aug 9, 2016

Nowadays, brew install apache-spark installs spark 2.0.0. (Possibly a higher version in the future.)

The python library is also updated, and it's not 0.8.2.1 anymore.

sanjitroy commented Aug 14, 2016

@sri-srinivas, @arendale
Check this line:
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
The py4j version should match the one you have installed. Change it if needed and run again.

ololobus commented Feb 13, 2017

@enahwe @sri-srinivas @arendale @Nomii5007 @hlin117 @sanjitroy
I've updated the text to fit the latest versions of Spark, Java and Python 2 (tested with Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112).

aarshayj commented May 27, 2017

Worked like a charm!

acmiyaguchi commented Jul 30, 2017

It might be useful to avoid hardcoding the py4j library using the following command:

export PYTHONPATH="${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-*-src.zip:${PYTHONPATH}"

avocado2016 commented Aug 15, 2017

Where should I write that? I have those problems, but on an Ubuntu machine working with an AWS cluster.

airwindow commented Feb 23, 2018

In case you are using Python 3.x, you may run into the following error:
NameError: name 'execfile' is not defined
Simply replace
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
with
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())

mhasse7441 commented Feb 26, 2018

Hello all. I am experiencing some issues executing a simple Python program:

from pyspark import SparkConf, SparkContext

sc = SparkContext(master="local", appName="Spark Demo")
print(sc.textFile("/Users/mhasse/Desktop/deckofcards.txt").first())

with the following errors:

/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/python /Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py
Traceback (most recent call last):
File "/Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py", line 3, in
sc = SparkContext(master="local", appName="Spark Demo")
File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 112, in init
SparkContext._ensure_initialized(self, gateway=gateway)
File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/java_gateway.py", line 48, in launch_gateway
SPARK_HOME = os.environ["SPARK_HOME"]
File "/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/../lib/python2.7/UserDict.py", line 40, in getitem
raise KeyError(key)
KeyError: 'SPARK_HOME'

When I open .bash_profile in nano, my Spark settings are as below:

export SPARK_HOME=/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:SPARK_HOME/python/lib/py4j-VERSION-src.zip:$PYTHONPATH

# Setting PATH for Python 2.7
# The original version is saved in .bash_profile.pysave

PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

I am new to macOS and am struggling to get this to work; any comments or feedback greatly appreciated.
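The traceback above shows SPARK_HOME missing from the process environment: PyCharm run configurations do not source .bash_profile. A minimal workaround is to set it in-process before importing pyspark (the path below is the one from the question; adjust it to your install):

```python
import os

# PyCharm does not read .bash_profile, so export SPARK_HOME for this process
# before pyspark is imported. The path is taken from the question above.
os.environ.setdefault("SPARK_HOME", "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6")

# from pyspark import SparkConf, SparkContext  # safe to import after this point
```

Alternatively, set SPARK_HOME in the PyCharm run configuration's environment variables, which achieves the same thing without touching the script.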

ghost commented Jun 9, 2018

Works like a charm...

suhas22 commented Jul 17, 2018

Works amazingly well! Thanks a ton for this!
