@bzamecnik · created September 29, 2016
Converting CSV to Parquet in Spark 2.0
// run e.g. in spark-shell...
// uncompressed CSV without a header
val df = spark.read.csv("input.csv")
df.write.format("parquet").save("output.parquet")
// it produces a directory output.parquet/ with the following content:
//
// ls -l output.parquet/
// -rw-r--r-- 1 bza bza 146346 Sep 29 16:06 part-r-00000-8a89541d-9071-4246-b525-22e894ef3e0b.snappy.parquet
// -rw-r--r-- 1 bza bza 0 Sep 29 16:06 _SUCCESS
//
// the parquet file is snappy-compressed
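// snappy is the default Parquet codec in Spark 2.0; a different codec can
// be requested via the "compression" write option. A sketch, assuming the
// standard DataFrameWriter option; "output_gzip.parquet" is a made-up path:
df.write.option("compression", "gzip").parquet("output_gzip.parquet")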
// a gzip-compressed CSV works the same way; Spark decompresses it transparently:
val df_gz = spark.read.csv("input.csv.gz")
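// when the CSV does contain a header row, it can be consumed at read time
// instead. A sketch, not from the original gist; "input_with_header.csv"
// is a made-up path:
val df_header = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input_with_header.csv")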
// the parquet file can be read again:
val df_parquet = spark.read.parquet("output.parquet")
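// a quick sanity check on the round-tripped DataFrame (added here as a
// suggestion, not part of the original run):
df_parquet.printSchema()
df_parquet.show(5)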
// add column names when the CSV header is missing
// (spark.read.csv assigns default names _c0, _c1, ...):
val columns = Seq("foo", "bar", "baz")
val df_with_columns = df.toDF(columns: _*)
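// an alternative sketch (not in the original gist): supply an explicit
// schema at read time instead of renaming afterwards; the column names
// and StringType here are assumptions:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(Seq(
  StructField("foo", StringType),
  StructField("bar", StringType),
  StructField("baz", StringType)))
val df_typed = spark.read.schema(schema).csv("input.csv")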