Last active July 2, 2024 19:39
Setup PySpark and its S3 connection

  • Create an environment with your favorite venv manager, e.g.: conda create --name etl_spark_standalone python=3.8 && conda activate etl_spark_standalone
  • Run pip install -r requirements.txt. Note: this installs pyspark.

Make a choice

  • Standalone Spark/Hadoop (any version you want): download Spark and set SPARK_HOME accordingly. I picked Spark 3.0.1 with Hadoop 3.2.0.

  • Built-in Spark that comes with the pip install, e.g. pip install pyspark==3.0.1: don't set anything, or set SPARK_HOME to the pyspark directory.
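
The two options above differ only in where SPARK_HOME comes from. A minimal sketch of that lookup, assuming an explicit SPARK_HOME wins and the installed pyspark package directory is the fallback; the function name `resolve_spark_home` is hypothetical:

```python
import importlib.util
import os


def resolve_spark_home(environ=None, package="pyspark"):
    """Return SPARK_HOME if set, else the directory of the installed package, else None."""
    environ = os.environ if environ is None else environ
    explicit = environ.get("SPARK_HOME")
    if explicit:
        # Standalone option: the user pointed SPARK_HOME at a downloaded Spark.
        return explicit
    # Built-in option: look for the pip-installed package without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


if __name__ == "__main__":
    print(resolve_spark_home())
```

If this prints None, neither option is in place yet.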


The examples below follow the standalone Hadoop option:

  • On Windows, git clone the winutils repository into c:\dev\hadoop
  • Set envvars accordingly, e.g. set HADOOP_HOME=C:\dev\hadoop\winutils\hadoop-3.2.0 and set SPARK_HOME=C:\dev\spark-3.0.1-bin-hadoop3.2, as two separate commands.
  • Add %HADOOP_HOME%\bin to PATH
  • Restart your editor/session

In PyCharm, set the following envvars in the run configuration, similar to the above: PYTHONUNBUFFERED=1;HADOOP_HOME=C:\dev\hadoop\winutils\hadoop-3.2.0;SPARK_HOME=C:\dev\spark-3.0.1-bin-hadoop3.2
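
To confirm that the variables are actually visible to Python after restarting, something like the following can help. It is only a sketch: the function name `check_spark_env` is hypothetical, and the `winutils.exe` check assumes the Windows layout described above.

```python
import os
from pathlib import Path


def check_spark_env(environ=None, windows=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    environ = os.environ if environ is None else environ
    windows = (os.name == "nt") if windows is None else windows
    problems = []
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    hadoop_home = environ.get("HADOOP_HOME")
    if windows and hadoop_home and Path(hadoop_home).is_dir():
        # On Windows, Hadoop expects winutils.exe in the bin folder.
        if not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
            problems.append("winutils.exe not found under HADOOP_HOME\\bin")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

Running it from the same PyCharm run configuration shows whether the envvars set there are picked up.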

Add AWS jars to your Spark jars folder to access S3

Per 3.2.0 hadoop-aws jar dependencies:

Note: if you pick another version of Hadoop, check the link above for the files you need to download.

In a linux shell emulator:

  • cd $SPARK_HOME/jars
  • wget
  • wget
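
After downloading, a small check can confirm the jars landed in the right folder. This is a sketch under assumptions: the jar name prefixes `hadoop-aws-` and `aws-java-sdk-bundle-` are what the hadoop-aws dependency list usually names, so confirm them for your version, and `missing_jars` is a hypothetical helper.

```python
import os
from pathlib import Path

# Assumed jar name prefixes; confirm them against the hadoop-aws dependency list.
REQUIRED_PREFIXES = ("hadoop-aws-", "aws-java-sdk-bundle-")


def missing_jars(jars_dir, prefixes=REQUIRED_PREFIXES):
    """Return the prefixes for which no .jar file exists in jars_dir."""
    names = [p.name for p in Path(jars_dir).glob("*.jar")]
    return [prefix for prefix in prefixes
            if not any(name.startswith(prefix) for name in names)]


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        print(missing_jars(Path(spark_home) / "jars") or "all AWS jars found")
    else:
        print("SPARK_HOME is not set")
```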



Create the Spark session with the S3A credentials:

  import boto3
  from pyspark.sql import SparkSession

  # Read the access key pair from a named AWS profile.
  session = boto3.session.Session(profile_name='hierarchy_playground')
  credentials = session.get_credentials().get_frozen_credentials()
  # todo: use STS
  # note: Hive catalog is supported with Delta only on Spark 3.0
  spark = SparkSession \
      .builder \
      .appName("Demo") \
      .config('spark.hadoop.fs.s3a.access.key', credentials.access_key) \
      .config('spark.hadoop.fs.s3a.secret.key', credentials.secret_key) \
      .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
      .enableHiveSupport() \
      .getOrCreate()
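
Regarding the "use STS" todo: temporary credentials also carry a session token, which S3A needs alongside the key pair. A minimal sketch of building the settings as a dict, assuming the Hadoop S3A properties `fs.s3a.session.token` and `fs.s3a.aws.credentials.provider`; the helper name `s3a_conf` is hypothetical and not part of the original setup.

```python
def s3a_conf(access_key, secret_key, token=None):
    """Build Spark config entries for S3A from an AWS key pair and optional session token."""
    conf = {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
    if token:
        # Temporary (STS) credentials: pass the token and switch the provider.
        conf["spark.hadoop.fs.s3a.session.token"] = token
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
        )
    return conf
```

With boto3's frozen credentials this could be called as s3a_conf(credentials.access_key, credentials.secret_key, credentials.token), and each entry passed to the builder with .config(key, value).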

Bonus: add Delta Lake by adding the following to the builder chain

   .config("spark.jars.packages", "") \
   .config("spark.sql.extensions", "") \
   .config("spark.sql.catalog.spark_catalog", "") 
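
The values for these three settings are not shown above. As a hedged sketch, the ones below are taken from the Delta Lake quickstart for Spark 3.0 (Delta 0.7.0 on Scala 2.12), not from this gist, so treat them as assumptions and check them against the Delta version you use. The helper names `delta_conf` and `apply_conf` are hypothetical.

```python
def delta_conf(delta_version="0.7.0", scala_version="2.12"):
    """Return the three Spark settings commonly used to enable Delta Lake (assumed values)."""
    return {
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def apply_conf(builder, conf):
    """Apply each key/value pair to any builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

For example, apply_conf(SparkSession.builder.appName("Demo"), delta_conf()) returns a builder that is ready for .getOrCreate().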
pnijem commented Feb 15, 2021

Helped me with configuring connection from Spark Structured Streaming (Java) to S3. Thanks for sharing!