@atharvai
Last active November 1, 2023 10:44
PySpark Localstack S3 read CSV example
#!/usr/bin/env bash
# bootstrap-s3.sh: runs inside the Localstack container at startup
# (mounted into /docker-entrypoint-initaws.d/ by docker-compose below)
set -x
# create a test bucket
awslocal s3 mb s3://my-bucket
# write a small CSV with one quoted value per line, then upload it
cat > requests.csv <<EOF
"email1"
"email2"
"email3"
EOF
awslocal s3 cp requests.csv s3://my-bucket/
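If the read later fails with a missing-object error, a quick sanity check is to list the bucket from the host. A minimal sketch with boto3 (assumptions: boto3 is installed on the host, and verify_upload.py is a hypothetical helper, not part of the gist), reusing the same dummy credentials and endpoint as the docker-compose file below:

# verify_upload.py (hypothetical helper): confirm bootstrap-s3.sh uploaded the file
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",  # Localstack edge port from docker-compose
    aws_access_key_id="foo",               # dummy credentials; Localstack accepts any non-empty value
    aws_secret_access_key="foo",
    region_name="eu-west-1",
)
for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])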
# docker-compose.yml
version: '3.7'
services:
  localstack:
    image: localstack/localstack
    environment:
      - SERVICES=s3
      - DEFAULT_REGION=eu-west-1
      - AWS_ACCESS_KEY_ID=foo
      - AWS_SECRET_ACCESS_KEY=foo
    ports:
      - 4566:4566
    volumes:
      - ./bootstrap-s3.sh:/docker-entrypoint-initaws.d/bootstrap-s3.sh
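docker-compose up -d returns before the bootstrap script has finished, so the very first spark-submit can race the bucket creation. A small polling sketch (assumptions: wait_for_localstack.py is a hypothetical helper, and older Localstack releases expose the health check at /health while newer ones use /_localstack/health, so adjust the URL for your version):

# wait_for_localstack.py (hypothetical helper): block until s3 reports ready
import json
import time
import urllib.request

URL = "http://localhost:4566/health"  # or /_localstack/health on newer releases

for _ in range(30):
    try:
        with urllib.request.urlopen(URL) as resp:
            services = json.load(resp).get("services", {})
        if services.get("s3") in ("running", "available"):
            print("s3 is ready")
            break
    except OSError:
        pass  # edge port not accepting connections yet
    time.sleep(1)
else:
    raise SystemExit("Localstack did not become ready in time")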
#!/usr/bin/env bash
# run Localstack
docker-compose up -d
# run Spark in local mode
export SPARK_LOCAL_IP=127.0.0.1
spark-submit \
  --packages software.amazon.awssdk:s3:2.17.52,org.apache.hadoop:hadoop-aws:3.1.2 \
  --conf spark.hadoop.fs.s3a.endpoint=http://localhost:4566 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.access.key=foo \
  --conf spark.hadoop.fs.s3a.secret.key=foo \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  spark-s3-test.py
# don't forget to run docker-compose down when you are finished
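If you prefer not to pass every setting on the command line, the same fs.s3a.* values can go through the SparkSession builder instead. A sketch of that variant (an assumption: the hadoop-aws dependency is still supplied via --packages, since spark.jars.packages is only honoured before the JVM starts):

# inline-conf variant (hypothetical): same settings as the spark-submit flags above
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark_localstack_demo")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:4566")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.access.key", "foo")
    .config("spark.hadoop.fs.s3a.secret.key", "foo")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

spark.read.csv("s3a://my-bucket/requests.csv").show()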
# spark-s3-test.py: run via spark-submit above, or paste into a pyspark shell
# started with the same --packages and --conf options
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spark_localstack_demo") \
    .getOrCreate()

# On EMR you can use `s3` instead of `s3a`
spark.read.csv('s3a://my-bucket/requests.csv').show()

# expected output
# +------+
# |   _c0|
# +------+
# |email1|
# |email2|
# |email3|
# +------+
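Because the file has no header row, Spark falls back to the default column name _c0. To get a named, typed column instead, either rename it or read with an explicit schema; a short sketch against the same file:

# rename the single default column
df = spark.read.csv("s3a://my-bucket/requests.csv").toDF("email")

# or declare the schema up front
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("email", StringType(), nullable=False)])
df = spark.read.csv("s3a://my-bucket/requests.csv", schema=schema)
df.show()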