Using Spark locally

Based on https://medium.com/sicara/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f

1. Requirements

  • A Java JDK (e.g., Oracle JDK 22)
  • Apache Spark prebuilt for Hadoop (e.g., spark-3.5.1-bin-hadoop3)
  • A Python environment with the pyspark package installed (e.g., an Anaconda environment)

2. Add the Java bin and Spark bin paths to the PATH variable (or set them per session, as sketched after this list)

  • C:\Users\matthew\repos\Oracle_JDK-22\bin
  • C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3\bin
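If you would rather not edit the system-wide PATH, the same directories can be prepended for the current Python process instead. A minimal sketch, using the paths above:

import os

# Prepend the Java and Spark bin directories to PATH for this process only.
os.environ["PATH"] = (
    r"C:\Users\matthew\repos\Oracle_JDK-22\bin;"
    r"C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3\bin;"
    + os.environ["PATH"]
)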

3. Add environment variables pointing to Spark/Hadoop, Java, and Python

  • HADOOP_HOME = C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3
  • SPARK_HOME = C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3
  • JAVA_HOME = C:\Users\matthew\repos\Oracle_JDK-22
  • PYSPARK_DRIVER_PYTHON = C:\Users\matthew\Anaconda3\envs\main\python.exe
  • PYSPARK_PYTHON = C:\Users\matthew\Anaconda3\envs\main\python.exe
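As with PATH, these can also be set per session from Python, provided it happens before pyspark is imported; a minimal sketch, assuming the same install locations:

import os

# Per-session equivalents of the environment variables above.
# Set these before importing pyspark.
os.environ["HADOOP_HOME"] = r"C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3"
os.environ["SPARK_HOME"] = r"C:\Users\matthew\repos\spark-3.5.1-bin-hadoop3"
os.environ["JAVA_HOME"] = r"C:\Users\matthew\repos\Oracle_JDK-22"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\matthew\Anaconda3\envs\main\python.exe"
os.environ["PYSPARK_PYTHON"] = r"C:\Users\matthew\Anaconda3\envs\main\python.exe"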

4. Run some test Python code: a Monte Carlo estimate of pi

import random

import pyspark

# Monte Carlo estimate of pi: sample points uniformly in the unit square
# and count how many land inside the quarter circle of radius 1.
sc = pyspark.SparkContext(appName="Pi")

num_samples = 100_000_000

def inside(_):
    # The argument (the sample index) is unused; each call draws a fresh point.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(num_samples)).filter(inside).count()

# The quarter circle has area pi/4, so the hit ratio approximates pi/4.
pi = 4 * count / num_samples
print(pi)

sc.stop()

With 100,000,000 samples the printed value should land close to 3.1416; reduce num_samples for a quicker smoke test.