@jonashaag
Created February 16, 2022 15:07
PySpark Continuous Integration setup
from pyspark.sql import SparkSession


def local_pyspark_cluster(n_cpus=1, memory_mb=512) -> SparkSession:
    """Start a local PySpark cluster with the given resources.

    Returns a SparkSession connected to that cluster.
    """
    return (
        SparkSession.builder.master(f"local[{n_cpus}]")
        .config("spark.driver.memory", f"{memory_mb}m")
        .config("spark.sql.warehouse.dir", "/tmp/")
        .getOrCreate()
    )
# Source this file using ". windows-setup-pyspark.sh"
export HADOOP_HOME=$(mktemp -d)
git clone https://github.com/cdarlint/winutils --depth 1
cp -r winutils/hadoop-3.2.2/* "$HADOOP_HOME"
export PATH="$PATH:$HADOOP_HOME/bin"
export PYSPARK_PYTHON=$(which python)
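After sourcing the script, a quick sanity check can confirm everything it is supposed to provide is in place. This is a sketch, not part of the gist: the function name is hypothetical, and it only verifies the variables and the winutils.exe file the script sets up.

```shell
check_pyspark_setup() {
    # Fail with a message if any piece of the setup is missing.
    [ -n "${HADOOP_HOME:-}" ] || { echo "HADOOP_HOME not set"; return 1; }
    [ -e "$HADOOP_HOME/bin/winutils.exe" ] || { echo "winutils.exe missing"; return 1; }
    [ -n "${PYSPARK_PYTHON:-}" ] || { echo "PYSPARK_PYTHON not set"; return 1; }
    echo "setup looks good"
}
```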