@ololobus
Last active September 11, 2024 07:00
Apache Spark installation + ipython/jupyter notebook integration guide for macOS

Tested with Apache Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112

For older versions of Spark and IPython, see the previous version of this text.

Install Java Development Kit

Download and install it from oracle.com

Add the following code to your shell startup file, e.g. .bash_profile

# For Apache Spark
if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi

Install Apache Spark

You can use the macOS package manager Homebrew (http://brew.sh/)

brew update
brew install scala
brew install apache-spark

Set up env variables

Add the following code to your shell startup file, e.g. .bash_profile

# For IPython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
  export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
fi

You can check the SPARK_HOME path with the following brew command

$ brew info apache-spark
apache-spark: stable 2.1.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/2.1.0 (1,312 files, 213.9M) *
  Built from source on 2017-02-13 at 00:58:12
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb

Also check the py4j version and subpath; they may differ between Spark versions.
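
One way to avoid guessing the py4j file name is to look it up under SPARK_HOME. The following is a minimal sketch, not part of the original setup: the helper name find_py4j_zip is hypothetical, and it assumes the archive lives in python/lib and follows the py4j-<version>-src.zip naming used in the PYTHONPATH line above.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip bundled with Spark.

    Hypothetical helper: assumes the archive is stored in
    <spark_home>/python/lib and is named py4j-<version>-src.zip.
    """
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise IOError("no py4j archive matches " + pattern)
    # If several archives are present, take the last one in sorted order.
    return matches[-1]


# Usage (assumes SPARK_HOME is set in the environment):
#   print(find_py4j_zip(os.environ["SPARK_HOME"]))
```

The printed path can then be copied into the PYTHONPATH export in your .bash_profile.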

IPython profile

Profiles are not supported in Jupyter, and you will now see the following deprecation warning

$ ipython notebook --profile=pyspark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[W 01:45:07.821 NotebookApp] Unrecognized alias: '--profile=pyspark', it will probably have no effect.

It seems that custom startup files can no longer be run the way they were with IPython profiles. The easiest approach is therefore to run the pyspark init script manually at the beginning of your notebook, or to follow the alternative way described below.

Run Jupyter

$ jupyter-notebook

Initialize pyspark

In [1]: import os
        execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
Out[1]: <pyspark.context.SparkContext at 0x10a982b10>

The sc variable should now be available

In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x10a982b10>

Alternatively

You can also make the pyspark shell command start a Jupyter web notebook instead of the command-line interactive interpreter. To do so, add the following environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

and then simply run

$ pyspark

which will open a web notebook with sc available automatically.

@klinkin
klinkin commented May 20, 2015

The trailing slash is not needed:

export SPARK_HOME="/usr/local/Cellar/apache-spark/1.3.1_1/libexec/"

@enahwe
enahwe commented Jul 1, 2015

For Spark 1.4.x we have to add 'pyspark-shell' at the end of the environment variable "PYSPARK_SUBMIT_ARGS". So I adapted the script '00-pyspark-setup.py' to work with both Spark 1.3.x and Spark 1.4.x by detecting the Spark version from the RELEASE file.

Here is the code:

# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = os.path.join(spark_home, "RELEASE")
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))

@ololobus
Author

ololobus commented Dec 4, 2015

Thank you @enahwe! I've added it to the text; I hope you've tested it :)
However, I missed your comment in July, and now Spark 1.5.x is already released...

@sri-srinivas

Hi, thanks for your suggestions. I did exactly what you suggested, but when I open a notebook and type sc, I do not get the expected output. I just get ' '. Something is not configured properly. I have a Mac with El Capitan and Spark 1.5.2, and I am running Jupyter. Any help?

@ololobus
Author

@sri-srinivas I'm not a Spark user currently, so I can't test it, but you can try the following 00-pyspark-setup.py startup file with Spark 1.5.*

# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark
exec(open(os.path.join(spark_home, "python/pyspark/shell.py")).read())

@arendale

@sri-srinivas I got the same problem. Did you solve it?

@Nomii5007

Hello. How can I set the PYSPARK_SUBMIT_ARGS environment variable on Windows, and what value should I give it?

@hlin117
hlin117 commented Aug 9, 2016

Nowadays, brew install apache-spark installs spark 2.0.0. (Possibly a higher version in the future.)

The bundled py4j library has also been updated; it is no longer 0.8.2.1.

@sanjitroy

@sri-srinivas , @arendale
Check this line
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
The py4j version in this path must match the one in your installation.
If it does not, change it and run again.

@ololobus
Author

@enahwe @sri-srinivas @arendale @Nomii5007 @hlin117 @sanjitroy
I've updated the text to match the latest versions of Spark, Java and Python 2 (tested with Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112)

@thismlguy

worked like a charm!!

@acmiyaguchi
acmiyaguchi commented Jul 30, 2017

It might be useful to avoid hardcoding the py4j library version by using the following command:

export PYTHONPATH="${SPARK_HOME}/python/:$(echo "${SPARK_HOME}"/python/lib/py4j-*-src.zip):${PYTHONPATH}"

@vanessaescalante

Where should I write that? I have the same problems, but on an Ubuntu machine working with an AWS cluster.

@airwindow

If you are using Python 3.x, you may run into the following error:
NameError: name 'execfile' is not defined
Simply replace
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))
with
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())
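
If the same notebook has to run under both Python 2 and Python 3, the replacement can be wrapped in a small helper. This is only a sketch: run_file is a hypothetical name, not something provided by Spark or IPython, and it assumes that executing shell.py in the notebook's namespace is what defines sc, as in the guide above.

```python
def run_file(path, namespace):
    """Execute a Python source file inside `namespace` (a dict).

    Works on Python 2 and 3; on Python 2 it behaves like
    execfile(path, namespace).
    """
    with open(path) as source:
        # Passing the path to compile() keeps real file names in tracebacks.
        code = compile(source.read(), path, "exec")
    exec(code, namespace)
    return namespace


# In a notebook cell (assumes SPARK_HOME is set):
#   import os
#   run_file(os.path.join(os.environ["SPARK_HOME"], "python/pyspark/shell.py"), globals())
```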

@mhasse7441

Hello all. I am experiencing some issues executing a simple Python program:

from pyspark import SparkConf, SparkContext

sc = SparkContext(master="local", appName="Spark Demo")
print(sc.textFile("/Users/mhasse/Desktop/deckofcards.txt").first())

with the following errors:

/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/python /Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py
Traceback (most recent call last):
  File "/Users/mhasse/PycharmProjects/gettingstarted/sparkdemo.py", line 3, in <module>
    sc = SparkContext(master="local", appName="Spark Demo")
  File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 112, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6/python/pyspark/java_gateway.py", line 48, in launch_gateway
    SPARK_HOME = os.environ["SPARK_HOME"]
  File "/Users/mhasse/PycharmProjects/gettingstarted/venv/bin/../lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'SPARK_HOME'

When I open .bash_profile in nano, my Spark settings are as follows:

export SPARK_HOME=/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:SPARK_HOME/python/lib/py4j-VERSION-src.zip:$PYTHONPATH

# Setting PATH for Python 2.7
# The original version is saved in .bash_profile.pysave
PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

I am new to macOS and am struggling to get this to work. Any comments or feedback would be greatly appreciated.
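
A KeyError: 'SPARK_HOME' like the one above means the variable is absent from the environment of the Python process itself. One possible cause, stated here as an assumption rather than a diagnosis, is that PyCharm was started from the Dock or Finder and so never read .bash_profile. (Separately, the second PYTHONPATH entry quoted above is missing the $ before SPARK_HOME.) A minimal sketch of a workaround is to set the variable in the script before the SparkContext is created; ensure_spark_home is a hypothetical helper, and the fallback path in the usage comment is the one from the report above.

```python
import os


def ensure_spark_home(fallback, environ=os.environ):
    """Make sure SPARK_HOME is set, using `fallback` if it is missing.

    Hypothetical helper: an existing value is left untouched. The
    resolved path must be an existing directory, otherwise a
    RuntimeError is raised.
    """
    spark_home = environ.get("SPARK_HOME") or fallback
    if not os.path.isdir(spark_home):
        raise RuntimeError("SPARK_HOME does not exist: " + spark_home)
    environ["SPARK_HOME"] = spark_home
    return spark_home


# Usage, before creating the SparkContext:
#   ensure_spark_home("/Users/mhasse/Documents/spark-1.6.3-bin-hadoop2.6")
#   sc = SparkContext(master="local", appName="Spark Demo")
```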

ghost commented Jun 9, 2018

Works like a charm...

@suhas22
suhas22 commented Jul 17, 2018

Works amazingly well!, thanks a ton for this!

@soheilesm
soheilesm commented May 21, 2021

Hi there,

I followed the installation guide, but with my own job I face a problem similar to the one described here.

Any idea how I can fix it? I have tried the commonly suggested solutions, which say the problem could be wrong path specifications, but my paths seem to be fine, and the individual components (Python, pyspark, py4j) seem to work fine on their own.
