@nfarah86
Created October 21, 2021 23:08
from pyspark.sql.functions import col


def read_data(spark):
    """Read movies.csv from S3 over s3a and cast vote_count to an integer."""
    sc = spark.sparkContext
    # configure the s3a connector on the underlying Hadoop configuration
    hadoop_configuration = sc._jsc.hadoopConfiguration()
    hadoop_configuration.set("fs.s3a.access.key", "your access key")
    hadoop_configuration.set("fs.s3a.secret.key", "your secret key")
    hadoop_configuration.set("fs.s3a.endpoint", "s3.amazonaws.com")
    hadoop_configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # read the CSV, treating the first row as the header
    rdata = spark.read.options(header="True", delimiter=",").csv("s3a://spark-rockset-public-nadine/movies.csv")
    # see the data
    rdata.show()
    # check the schema (every column is a string, since no schema is inferred)
    rdata.printSchema()
    # do some transformations
    # simple example of a transformation: cast vote_count from string to int
    rdata = rdata.withColumn("vote_count", col("vote_count").cast("int"))
    rdata.printSchema()
    return rdata
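
read_data above hard-codes placeholder credentials in the source. One possible alternative, sketched here as an assumption and not as part of the original gist, is to build the same four s3a settings from environment variables and apply them in a loop. The helper names s3a_settings and apply_settings and the S3_ENDPOINT variable are hypothetical; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the conventional AWS variable names.

```python
import os


def s3a_settings(env=os.environ):
    """Return the s3a settings used by read_data, with credentials taken from env.

    Hypothetical helper: raises KeyError if a credential variable is missing.
    """
    return {
        "fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        # S3_ENDPOINT is an assumed, optional override; the default matches the gist
        "fs.s3a.endpoint": env.get("S3_ENDPOINT", "s3.amazonaws.com"),
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def apply_settings(hadoop_configuration, settings):
    """Copy each key/value pair onto an object exposing set(key, value)."""
    for key, value in settings.items():
        hadoop_configuration.set(key, value)
```

Inside read_data, the four hadoop_configuration.set(...) calls could then be replaced by apply_settings(hadoop_configuration, s3a_settings()), which keeps the secret key out of the file.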