@brianspiering
Created February 24, 2022 20:58
Installation guide to pyspark on M1 Mac

Install Spark

Run all of these commands at the command line (not in a Jupyter Notebook). The command line gives more informative error messages, and if additional steps are needed, those messages will tell you.

Spark is a framework written in the Scala programming language. Scala runs on the JVM (Java Virtual Machine), so you'll need to install Java.

If you use homebrew:

brew install java scala apache-spark 
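To confirm those installs landed on your PATH, you can check from Python's standard library. This is a minimal sketch; `check_tools` is a hypothetical helper, and the tool names assume the brew formulas above put `java`, `scala`, and `spark-shell` on your PATH:

```python
import shutil

def check_tools(names):
    """Map each tool name to its resolved path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}

# These executables should exist after the brew install above.
for tool, path in check_tools(["java", "scala", "spark-shell"]).items():
    print(f"{tool}: {path or 'NOT FOUND -- check your PATH'}")
```

If anything prints NOT FOUND, fix your PATH before moving on to pyspark.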

Installing the Python API for Spark

Let's follow the directions from the documentation: https://spark.apache.org/docs/latest/api/python/getting_started/install.html

pip install pyspark

It is your choice whether to install pyspark in the base/root environment or in the metis conda environment. Either way, the most common incompatibility issues come from pyspark not finding Java, or finding an incompatible Java version.

See if it works

Open a new terminal and try:

pyspark

If that is working, open ipython in a terminal and try:

import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

If that is working, open a Jupyter Notebook and try:

import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

Troubleshooting

Issues with installation on local machines are often path problems. You have to explicitly tell your computer where software is located. The location on a local machine varies widely based on the hardware, the operating system (OS), and the installation method. Thus, specific advice is difficult to give.

General advice:

  • Back up the current PATH (in case you break it).
  • Have a hypothesis and understand the goal of each command (do not randomly copy-and-paste commands from the Internet).
  • Frequently open a new terminal window to make sure your state is current.
  • Take frequent walks to clear your mind.
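The first two bullets can be sketched in a few lines of standard-library Python; `backup_path` is a hypothetical helper for illustration:

```python
import os
from pathlib import Path

def backup_path(dest="path_backup.txt"):
    """Save the current PATH to a file so it can be restored if it breaks."""
    Path(dest).write_text(os.environ["PATH"])
    return dest

backup_path()

# Inspect PATH one entry per line to spot missing or duplicate directories.
for entry in os.environ["PATH"].split(os.pathsep):
    print(entry)
```

To restore, copy the saved value back into your shell profile.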

M1 chip advice

By default, homebrew installs a recent version of Java (something like Java 17). That might cause errors. Try an older version of Java:

brew install --cask homebrew/cask-versions/adoptopenjdk8

export JAVA_HOME='/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/'

If that works, make sure to add the JAVA_HOME export to your shell profile so it persists across sessions.
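For example, you could append the export to your profile like this (a sketch assuming zsh, the default shell on recent macOS, and the adoptopenjdk8 path from the command above; bash users would use ~/.bash_profile instead):

```shell
# Persist JAVA_HOME so new terminal windows pick it up.
echo "export JAVA_HOME='/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/'" >> ~/.zshrc

# Reload the profile in the current session.
source ~/.zshrc
```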

The cloud is always a fallback option

You may want to use one of these cloud options:

  • Google's Colab
  • Deepnote


oonisim commented Feb 8, 2023

You may see the error message Could not find valid SPARK_HOME in a Jupyter Notebook if the SPARK_HOME environment variable is not set.

From terminal

export SPARK_HOME="/opt/homebrew/Cellar/apache-spark/3.3.1/libexec"      # Must end with 'libexec', NOT '3.3.1'

In Jupyter notebook

import os
import sys

SPARK_HOME = "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec"
JAVA_HOME = '/opt/homebrew/opt/openjdk'

os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['JAVA_HOME'] = JAVA_HOME
sys.path.extend([
    f"{SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip",
    f"{SPARK_HOME}/python/lib/pyspark.zip",
])


from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .master('local[*]') \
    .getOrCreate()

Tested on Ventura 13.0.1 (22A400) M2 chip.
