- Create an environment with your favorite venv manager, e.g.:
  conda create --name etl_spark_standalone python=3.8 && conda activate etl_spark_standalone
  pip install -r requirements.txt
  Note: this installs pyspark.
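To verify the environment, you can import pyspark and print its version (a quick check; assumes requirements.txt pins pyspark, as noted above):

import pyspark

# Prints whatever version requirements.txt installed, e.g. 3.0.1.
print(pyspark.__version__)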
- Decide whether you want standalone Spark/Hadoop (any version you want) or the built-in Spark that comes with the pip install, e.g.:
  pip install pyspark==3.0.1
  - Standalone: download Spark. I picked Spark 3.0.1 with Hadoop 3.2.0. Set SPARK_HOME accordingly.
  - Built-in: don't set anything, or set SPARK_HOME to the pyspark directory.
  Either way, a quick sanity check follows below.
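With either option, a minimal sketch like the one below shows which install a session actually picks up (the app name is arbitrary):

import os
from pyspark.sql import SparkSession

# Spin up a throwaway local session and report where Spark comes from.
spark = SparkSession.builder.master("local[1]").appName("install_check").getOrCreate()
print("Spark version:", spark.version)              # e.g. 3.0.1
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))  # standalone dir, pyspark dir, or unset
spark.stop()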
The examples below follow the standalone Hadoop setup:
- On Windows, clone
  git clone https://github.com/cdarlint/winutils
  into c:\dev\hadoop
- Set envvars accordingly, e.g.:
  set HADOOP_HOME=C:\dev\hadoop\winutils\hadoop-3.2.0
  set SPARK_HOME=C:\dev\spark-3.0.1-bin-hadoop3.2
- Add %HADOOP_HOME%\bin to PATH
- Restart your editor/session
In PyCharm, set the following envvars in the run configuration, similar to the above:
PYTHONUNBUFFERED=1;HADOOP_HOME=C:\dev\hadoop\winutils\hadoop-3.2.0;SPARK_HOME=C:\dev\spark-3.0.1-bin-hadoop3.2
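A small check that the env vars above point at real directories (a sketch, assuming the paths from this guide; adjust for your machine):

import os
from pathlib import Path

# Verify the directories behind HADOOP_HOME/SPARK_HOME actually exist.
for var in ("HADOOP_HOME", "SPARK_HOME"):
    value = os.environ.get(var)
    print(var, "=", value, "| exists:", bool(value) and Path(value).exists())

# On Windows, winutils.exe must be present under %HADOOP_HOME%\bin.
hadoop_home = os.environ.get("HADOOP_HOME")
if hadoop_home:
    print("winutils.exe found:", (Path(hadoop_home) / "bin" / "winutils.exe").exists())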
As per the hadoop-aws 3.2.0 jar dependencies:
Note: if you pick another version of Hadoop, make sure you check the files needed for download in the link above.
In a Linux shell emulator:
cd $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
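To confirm the downloads landed where Spark looks for them, a quick check of $SPARK_HOME/jars (a sketch; jar names as above, adjust if you picked other versions):

import os
from pathlib import Path

# Both AWS-related jars should sit next to the other Spark jars.
jars_dir = Path(os.environ["SPARK_HOME"]) / "jars"
for jar in ("hadoop-aws-3.2.0.jar", "aws-java-sdk-bundle-1.11.375.jar"):
    print(jar, "present:", (jars_dir / jar).exists())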
Example source for the Spark session:
import boto3
from boto3.session import Session
from pyspark.sql import SparkSession

session: Session = boto3.session.Session(profile_name='hierarchy_playground')
credentials = session.get_credentials().get_frozen_credentials()
# todo: use STS https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html#STS.Client.get_session_token
# note: Hive catalog is supported with Delta only on Spark 3.0 https://github.com/delta-io/delta/releases
# s3a options: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
spark = SparkSession \
    .builder \
    .appName("Demo") \
    .config('spark.hadoop.fs.s3a.access.key', credentials.access_key) \
    .config('spark.hadoop.fs.s3a.secret.key', credentials.secret_key) \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .enableHiveSupport() \
    .getOrCreate()