from pyspark.sql.types import TimestampType

# Cast the status_updated column to a proper timestamp type
data = data.withColumn('status_updated', data['status_updated'].cast(TimestampType()))
Py4JJavaError: An error occurred while calling o30.parquet.
: java.io.IOException: No FileSystem for scheme: s3n
This worked fine on a single-node cluster (master and slave on the same node); the issue only arose once remote workers were involved.
This is misleading, as the problem might not be with the hdfs-site.xml or the core-site.xml, but with a missing hadoop-aws.jar.
The remedy for me was to make the class available on the worker. I did this by symlinking my target jars into the jars folder under the Spark home directory:
cd "$SPARK_HOME/jars"
for jar in "$TARGET_JAR_FOLDER"/*
do
    # with no link name given, ln -s creates the symlink in the
    # current directory ($SPARK_HOME/jars) under the jar's basename
    ln -s "$jar"
done
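The same loop can be tried out in a sandbox before touching a real installation; the directories and jar name below are placeholders standing in for $TARGET_JAR_FOLDER and $SPARK_HOME/jars:

```shell
# Sandbox demo of the symlink approach; paths and jar name are illustrative.
TARGET_JAR_FOLDER=$(mktemp -d)   # stands in for the folder holding the downloaded jars
SPARK_JARS=$(mktemp -d)          # stands in for $SPARK_HOME/jars
touch "$TARGET_JAR_FOLDER/hadoop-aws-2.7.3.jar"

cd "$SPARK_JARS"
for jar in "$TARGET_JAR_FOLDER"/*
do
    ln -s "$jar"
done

# The jars folder now holds a symlink pointing back at the original jar.
ls -l "$SPARK_JARS"
```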
An alternative approach:
- Setting the jars in the spark-defaults.conf
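A minimal sketch of that alternative, assuming the jars live under an illustrative /opt/jars path (spark.driver.extraClassPath and spark.executor.extraClassPath are standard Spark properties; the jar versions shown are only examples):

```
# $SPARK_HOME/conf/spark-defaults.conf -- paths and versions are illustrative
spark.driver.extraClassPath    /opt/jars/hadoop-aws-2.7.3.jar:/opt/jars/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath  /opt/jars/hadoop-aws-2.7.3.jar:/opt/jars/aws-java-sdk-1.7.4.jar
```

This puts the jars on the classpath of both the driver and every executor without modifying the Spark installation's jars folder.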