@tamsanh
Last active February 15, 2018 23:55
Notes from Installing Spark on AWS Linux

Using PySpark (Spark 2.2.1)

Transform a PySparkSQL Column from String to Timestamp

from pyspark.sql.types import TimestampType

# Cast the string column to a proper timestamp type.
data = data.withColumn('status_updated', data['status_updated'].cast(TimestampType()))

Installing Spark on AWS Linux

Errors

No FileSystem for scheme: s3n

Py4JJavaError: An error occurred while calling o30.parquet.
: java.io.IOException: No FileSystem for scheme: s3n

When run as a single-node cluster (master and worker on the same node), this worked fine; the issue only arose with remote workers.

The error is misleading: it may not be a problem with hdfs-site.xml or core-site.xml at all, but rather a missing hadoop-aws jar on the worker's classpath.
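One quick sanity check (a hedged sketch; the `hadoop-aws-*.jar` name pattern and the helper name are assumptions, not anything Spark ships) is to look for a hadoop-aws jar in the Spark jars folder on each worker:

```shell
# Check whether a hadoop-aws jar is present in a given Spark jars directory.
# Usage: has_hadoop_aws_jar /path/to/spark/jars
has_hadoop_aws_jar() {
    # If the glob matches nothing it stays literal, so -e fails correctly.
    for jar in "$1"/hadoop-aws-*.jar; do
        [ -e "$jar" ] && return 0
    done
    return 1
}

# Example (assumed variable, as elsewhere in these notes):
# has_hadoop_aws_jar "$SPARK_HOME/jars" || echo "hadoop-aws jar missing"
```

Running this on each worker makes the real cause visible before touching any XML configuration.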

Fix

The remedy for me was to make the missing classes available on the workers. I did this by symlinking my target jars into the jars folder under the Spark home directory.

cd "$SPARK_HOME/jars"
for jar in "$TARGET_JAR_FOLDER"/*
do
    ln -s "$jar"
done
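The loop above can be made a little more defensive (a sketch; `link_jars` is a hypothetical helper, and `SPARK_HOME`/`TARGET_JAR_FOLDER` are the same assumed variables as above) by matching only jars, skipping names that already exist at the destination, and quoting every path:

```shell
# Symlink every jar from a source folder into a destination folder,
# skipping entries that already exist at the destination.
link_jars() {
    src=$1
    dest=$2
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue          # no jars matched the glob
        name=$(basename "$jar")
        [ -e "$dest/$name" ] || ln -s "$jar" "$dest/$name"
    done
}

# Example (assumed variables, as in the loop above):
# link_jars "$TARGET_JAR_FOLDER" "$SPARK_HOME/jars"
```

Because existing names are skipped, the function is safe to re-run after adding new jars to the target folder.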

Didn't Work

  1. Setting the jars in spark-defaults.conf