@JoshCrosby
Created June 6, 2022 20:02
Spark boilerplate for Google Colab.
import os
# Find the latest Spark 3.x release at http://www.apache.org/dist/spark/ and enter it below as the Spark version
# For example:
# spark_version = 'spark-3.0.3'
spark_version = 'spark-3.<enter version>'
os.environ['SPARK_VERSION']=spark_version
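# Exporting SPARK_VERSION lets the ! shell commands below expand $SPARK_VERSION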
# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
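# The archive unpacks into /content/<spark_version>-bin-hadoop2.7, which SPARK_HOME points to below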
!pip install -q findspark
# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"
# Start a SparkSession (findspark uses JAVA_HOME and SPARK_HOME to locate the install)
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("colab").getOrCreate()  # any app name works here
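A quick sanity check, as a minimal sketch: it assumes the session variable is named spark as created above, and just confirms the runtime can reach Spark.
# Print the Spark version and run a tiny DataFrame through the session
print(spark.version)
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()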