Apache Spark + IPython Notebook Guide for Mac OS X


Tested with Apache Spark 1.5.1, Python 3.3 and Java 1.8.0_60

Install Java Development Kit

Download and install the JDK from oracle.com

Add the following code to your .bash_profile

# For Apache Spark
if which java > /dev/null; then export JAVA_HOME=$(/usr/libexec/java_home); fi
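
To confirm the JDK is visible before moving on, you can run a quick check from Python (a minimal sketch; the exact version string depends on your install):

# Sketch: verify that java_home resolves and that java is runnable
import os
import subprocess

java_home = subprocess.check_output(["/usr/libexec/java_home"]).decode().strip()
print("java_home reports:", java_home)
print("JAVA_HOME is set to:", os.environ.get("JAVA_HOME", "<not set>"))

# Prints something like 'java version "1.8.0_60"' (to stderr)
subprocess.call(["java", "-version"])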

Install Apache Spark

Install Spark and Scala with Homebrew (http://brew.sh/)

brew update
brew install scala
brew install apache-spark
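
You can confirm the installation before continuing (a sketch; both commands print their version banners to stderr):

# Sketch: confirm the Homebrew-installed binaries are on the PATH
import subprocess

# Prints the Spark version banner, e.g. "version 1.5.1"
subprocess.call(["spark-submit", "--version"])

# Prints the Scala version, e.g. "Scala code runner version 2.11.7"
subprocess.call(["scala", "-version"])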

Set up env variables

Add the following code to your shell profile, e.g. .bash_profile

# For IPython notebook and PySpark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/1.5.1/libexec"
  export PYSPARK_SUBMIT_ARGS="--master local[2]"
fi

If Homebrew installed a different Spark version, adjust the version number in SPARK_HOME above accordingly. Change the 2 in local[2] to the number of cores you want Spark to use; the sketch below shows one way to derive it from the machine.
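
A minimal sketch for picking that number automatically (the export still belongs in .bash_profile; this only shows how the value could be derived):

# Sketch: build a --master argument from the number of available cores
import multiprocessing

cores = multiprocessing.cpu_count()
print("--master local[{}]".format(cores))   # e.g. --master local[8]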

Create an ipython profile

Run

ipython profile create pyspark
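
This creates a profile directory under ~/.ipython. A small sketch to confirm that it, and its startup directory, exist before adding the startup file:

# Sketch: check that the pyspark profile and its startup directory were created
import os

profile_dir = os.path.expanduser("~/.ipython/profile_pyspark")
startup_dir = os.path.join(profile_dir, "startup")
print(profile_dir, "exists:", os.path.isdir(profile_dir))
print(startup_dir, "exists:", os.path.isdir(startup_dir))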

Create a startup file

$ vim ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
# Configure the necessary Spark environment
import os
import sys

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

spark_home = os.environ.get("SPARK_HOME")
if not spark_home:
    raise ValueError("SPARK_HOME is not set; check your .bash_profile")
sys.path.insert(0, os.path.join(spark_home, "python"))

# Add py4j to the path.
# You may need to change the version number to match your install
# (see the sketch after this file for locating it automatically).
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark
exec(open(os.path.join(spark_home, "python/pyspark/shell.py")).read())
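
The hard-coded py4j version above breaks whenever Spark is upgraded. An alternative sketch that locates whatever py4j zip ships with your install (assuming a single py4j-*-src.zip under python/lib):

# Sketch: locate the bundled py4j zip instead of hard-coding its version
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "")
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])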

Run ipython

ipython notebook --profile=pyspark

The sc variable should be available:

In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x105a35350>
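
A few quick sanity checks on the context (the values shown are examples; yours will reflect your install and PYSPARK_SUBMIT_ARGS):

In [2]: sc.version
Out[2]: '1.5.1'

In [3]: sc.master
Out[3]: 'local[2]'

In [4]: sc.parallelize(range(100)).sum()
Out[4]: 4950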

Test your setup

# Spark does not expand ~, so use an absolute path to your test file
textFile = sc.textFile("/Users/<your-user>/data/test.txt")

counts = textFile.flatMap(lambda x: x.split()) \
  .map(lambda x: (x, 1)) \
  .reduceByKey(lambda x, y: x + y)
  
output = counts.collect()
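
collect() returns the (word, count) pairs as a plain Python list on the driver, so you can inspect them directly. For example, to print the ten most frequent words (a sketch, assuming test.txt is non-empty):

# Sketch: print the ten most frequent words
for word, count in sorted(output, key=lambda pair: pair[1], reverse=True)[:10]:
    print(word, count)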