@ololobus
Last active September 26, 2024 08:50
Apache Spark installation + ipython/jupyter notebook integration guide for macOS

Tested with Apache Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112

For older versions of Spark and IPython, see the previous version of this text.

Install Java Development Kit

Download and install it from oracle.com

Add the following to your shell startup file, e.g. .bash_profile:

# For Apache Spark
if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

Install Apache Spark

You can use the macOS package manager Homebrew (http://brew.sh/):

brew update
brew install scala
brew install apache-spark

Set up env variables

Add the following to your shell startup file, e.g. .bash_profile:

# For ipython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec/"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
  export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
fi

You can check the SPARK_HOME path using the following brew command:

$ brew info apache-spark
apache-spark: stable 2.1.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/2.1.0 (1,312 files, 213.9M) *
  Built from source on 2017-02-13 at 00:58:12
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb

Also check the py4j version and subpath; both may differ from one Spark version to another.
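
As a rough illustration of that check, the sketch below locates the bundled py4j archive with a glob instead of a hardcoded version. The helper names `spark_python_paths` and `add_spark_to_sys_path` are hypothetical, and the code assumes the directory layout used in this guide ($SPARK_HOME/python plus a py4j-*-src.zip under python/lib).

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the entries PYTHONPATH needs for pyspark.

    Hypothetical helper: assumes $SPARK_HOME/python exists and that a
    py4j-*-src.zip archive sits under $SPARK_HOME/python/lib.
    """
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")
    # Match whatever py4j version ships with this Spark build.
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips:
        raise IOError("no py4j-*-src.zip found under %s" % lib_dir)
    # If several archives are present, take the lexicographically last one.
    return [python_dir, py4j_zips[-1]]


def add_spark_to_sys_path(spark_home):
    """Prepend the pyspark entries to sys.path, skipping ones already there."""
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
```

Calling `add_spark_to_sys_path(os.environ["SPARK_HOME"])` at the top of a script would be one way to get the same effect as the two PYTHONPATH exports without pinning the py4j version.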

IPython profile

Profiles are not supported in Jupyter, so the old command now prints the following deprecation warning:

$ ipython notebook --profile=pyspark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[W 01:45:07.821 NotebookApp] Unrecognized alias: '--profile=pyspark', it will probably have no effect.

Jupyter does not appear to support custom startup files the way ipython profiles did. The easiest option is therefore to run the pyspark init script manually at the beginning of your notebook, or to follow the approach described under "Alternatively".

Run Jupyter

$ jupyter-notebook

Initialize pyspark

In [1]: import os
        execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
Out[1]: <pyspark.context.SparkContext at 0x10a982b10>

The sc variable should now be available:

In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x10a982b10>
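
If the init cell fails instead, a small pre-flight check can narrow down why. The sketch below is only an illustration: `check_spark_env` is a hypothetical helper, and it checks just the two variables this guide exports plus the location of shell.py used in the init cell.

```python
import os


def check_spark_env(environ=None):
    """Return a list of problems; an empty list means the setup looks usable.

    Hypothetical helper: it only checks what this guide sets up.
    """
    environ = os.environ if environ is None else environ
    problems = []

    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        # This is the script the init cell executes.
        shell_py = os.path.join(spark_home, "python", "pyspark", "shell.py")
        if not os.path.isfile(shell_py):
            problems.append("shell.py not found at %s" % shell_py)

    # Spark can also find java on PATH; this guide exports JAVA_HOME explicitly.
    if not environ.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")

    return problems
```

Running `check_spark_env()` in a notebook cell before the init cell reports a missing SPARK_HOME as a readable message; without it, the lookup `os.environ["SPARK_HOME"]` raises a KeyError.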

Alternatively

You can also make the pyspark shell command launch the Jupyter web notebook instead of the command-line interactive interpreter. To do so, add the following env variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

and then simply run

$ pyspark

which will open a web notebook with sc available automatically.
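
If you would rather not set these variables globally, they can be set for a single launch only. The following is a minimal sketch of that idea; `notebook_env` and `launch_pyspark_notebook` are hypothetical names, and the code assumes the `pyspark` launcher is on PATH (as it is after the brew install above).

```python
import os
import subprocess


def notebook_env(base_env=None):
    """Copy an environment and add the two driver variables shown above."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def launch_pyspark_notebook():
    """Start `pyspark` with the notebook driver; blocks until it exits."""
    # Assumption: `pyspark` resolves via PATH in the current environment.
    return subprocess.call(["pyspark"], env=notebook_env())
```

Because `notebook_env` works on a copy, a plain `pyspark` run from the same shell afterwards still opens the command-line interpreter.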

@ololobus

@enahwe @sri-srinivas @arendale @Nomii5007 @hlin117 @sanjitroy
I've updated the text to match the latest versions of Spark, Java and Python 2 (tested with Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112).

@thismlguy

worked like a charm!!

@acmiyaguchi

acmiyaguchi commented Jul 30, 2017

It might be useful to avoid hardcoding the py4j library version by using the following command:

export PYTHONPATH="${SPARK_HOME}/python/:$(echo "${SPARK_HOME}"/python/lib/py4j-*-src.zip):${PYTHONPATH}"

@vanessaescalante

Where should I write that? I have the same problems, but on an Ubuntu machine working with an AWS cluster.

@airwindow

If you are using Python 3.x, you may run into the following error:
NameError: name 'execfile' is not defined
Simply replace
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
with
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())
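
The same replacement can be wrapped in a small function. This is only a sketch of the idea and not part of the comment above; `run_file` is a hypothetical name. It closes the file after reading and passes the real path to compile(), so tracebacks point at the executed file.

```python
def run_file(path, namespace):
    """Execute a Python source file inside `namespace` (a dict), like execfile did."""
    # Read raw bytes so any coding declaration in the file is honoured by compile().
    with open(path, "rb") as handle:
        source = handle.read()
    # Compiling with the real path keeps file names in tracebacks accurate.
    exec(compile(source, path, "exec"), namespace)
    return namespace
```

In a notebook, this would be called as `run_file(os.path.join(os.environ["SPARK_HOME"], "python/pyspark/shell.py"), globals())` so that names defined by the script, such as sc, end up in the notebook namespace.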

@mhasse7441

Hello all. I am experiencing some issues executing a simple Python program:

from pyspark import SparkConf, SparkContext

sc = SparkContext(master="local", appName="Spark Demo")
print(sc.textFile("/Users/mhasse/Desktop/deckofcards.txt").first())

with the following errors:

/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/python /Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py
Traceback (most recent call last):
  File "/Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py", line 3, in <module>
    sc = SparkContext(master="local", appName="Spark Demo")
  File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 112, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/java_gateway.py", line 48, in launch_gateway
    SPARK_HOME = os.environ["SPARK_HOME"]
  File "/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/../lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'SPARK_HOME'

When I open .bash_profile in nano, my Spark settings are as below:

export SPARK_HOME=/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:SPARK_HOME/python/lib/py4j-VERSION-src.zip:$PYTHONPATH

# Setting PATH for Python 2.7
# The original version is saved in .bash_profile.pysave
PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

I am new to macOS and am struggling to get this to work. Any comments or feedback would be greatly appreciated.

ghost commented Jun 9, 2018

Works like a charm...

@suhas22

suhas22 commented Jul 17, 2018

Works amazingly well, thanks a ton for this!

@soheilesm

soheilesm commented May 21, 2021

Hi there,

I followed the installation guide, but for my own job I face a problem similar to the one described here.

Any idea how I can fix it? I have tried the solutions that commonly attribute the problem to wrong path specifications, but my paths seem to be fine, and the individual components (python, pyspark, py4j) each seem to work fine standalone.
