@tamsanh
Last active February 15, 2018 23:55
Notes from Installing Spark on AWS Linux

Using PySpark (Spark 2.2.1)

Transform a PySparkSQL Column from String to Timestamp

from pyspark.sql.types import TimestampType

# Cast the string column to a proper timestamp type.
data = data.withColumn('status_updated', data['status_updated'].cast(TimestampType()))

Installing Spark on AWS Linux

Errors

No FileSystem for scheme: s3n

Py4JJavaError: An error occurred while calling o30.parquet.
: java.io.IOException: No FileSystem for scheme: s3n

When run as a single-node cluster (master and worker on the same node), this worked fine; the issue only arose with remote workers.

The error is misleading: it may not be a problem with hdfs-site.xml or core-site.xml at all, but rather a missing hadoop-aws jar on the worker's classpath.
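One quick sanity check (a hedged sketch; the `hadoop-aws-*.jar` name pattern and the helper name are assumptions, not anything Spark ships) is to look for a hadoop-aws jar in the Spark jars folder on each worker:

```shell
# Check whether a hadoop-aws jar is present in a given Spark jars directory.
# Usage: has_hadoop_aws_jar /path/to/spark/jars
has_hadoop_aws_jar() {
    # If the glob matches nothing it stays literal, so -e fails correctly.
    for jar in "$1"/hadoop-aws-*.jar; do
        [ -e "$jar" ] && return 0
    done
    return 1
}

# Example (assumed variable, as elsewhere in these notes):
# has_hadoop_aws_jar "$SPARK_HOME/jars" || echo "hadoop-aws jar missing"
```

Running this on each worker makes the real cause visible before touching any XML configuration.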

Fix

The remedy for me was to make the missing classes available on the workers. I did this by symlinking my target jars into the jars folder under the Spark home directory.

cd "$SPARK_HOME/jars"
for jar in "$TARGET_JAR_FOLDER"/*
do
    ln -s "$jar"
done
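The loop above can be made a little more defensive (a sketch; `link_jars` is a hypothetical helper, and `SPARK_HOME`/`TARGET_JAR_FOLDER` are the same assumed variables as above) by matching only jars, skipping names that already exist at the destination, and quoting every path:

```shell
# Symlink every jar from a source folder into a destination folder,
# skipping entries that already exist at the destination.
link_jars() {
    src=$1
    dest=$2
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue          # no jars matched the glob
        name=$(basename "$jar")
        [ -e "$dest/$name" ] || ln -s "$jar" "$dest/$name"
    done
}

# Example (assumed variables, as in the loop above):
# link_jars "$TARGET_JAR_FOLDER" "$SPARK_HOME/jars"
```

Because existing names are skipped, the function is safe to re-run after adding new jars to the target folder.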

Didn't Work

  1. Setting the jars in spark-defaults.conf