@vikas-gonti
Last active June 27, 2018 01:45
Convert nyse data to parquet
// Convert NYSE data (multiple input files) to Parquet (4 output files)
/* Data is available in the local file system under /data/nyse (ls -ltr /data/nyse)
Fields: (stockticker:string, transactiondate:string, openprice:float, highprice:float, lowprice:float, closeprice:float, volume:bigint)
Convert the file format to Parquet
Save it to /user/<YOUR_USER_ID>/nyse_parquet */
/* Launch the shell:
spark-shell --master yarn \
  --conf spark.ui.port=12456 \
  --num-executors 4 */
// Data is on the local server under /data/nyse/
/* Move it to HDFS (note the leading slash on the HDFS paths):
hadoop fs -mkdir -p /user/gontiv/nysedata/
hadoop fs -put /data/nyse/ /user/gontiv/nysedata/ */
// Read the CSV files, coalesce to 4 partitions, and parse each record;
// volume is a bigint, so parse it with toLong rather than toInt
val nyseRDD = sc.textFile("/user/gontiv/nysedata/").coalesce(4).
  map(rec => {
    val t = rec.split(",")
    (t(0), t(1), t(2).toFloat, t(3).toFloat, t(4).toFloat, t(5).toFloat, t(6).toLong)
  })
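// Standalone sketch of the per-record parsing done in the map() above,
// runnable in any Scala REPL; the sample line is illustrative, not real NYSE data
val sampleRec = "AAPL,2009-12-31,30.10,30.48,30.08,30.10,88102700"
val sampleT = sampleRec.split(",")
val sampleParsed = (sampleT(0), sampleT(1), sampleT(2).toFloat, sampleT(3).toFloat,
  sampleT(4).toFloat, sampleT(5).toFloat, sampleT(6).toLong)
// e.g. sampleParsed._1 is "AAPL" and sampleParsed._7 is 88102700L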
// toDF works here because spark-shell pre-imports sqlContext.implicits._
val nyseDF = nyseRDD.toDF("stockticker","transactiondate","openprice","highprice","lowprice","closeprice","volume")
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
nyseDF.write.parquet("/user/gontiv/nyse_parquet")
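// Optional sanity check (a sketch, run in the same spark-shell session):
// read the Parquet output back and confirm the schema and a few rows;
// the path below assumes the write target used above
val checkDF = sqlContext.read.parquet("/user/gontiv/nyse_parquet")
checkDF.printSchema()
checkDF.show(5)
println(checkDF.count())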