Converting CSV to Parquet in Spark 2.0
// run e.g. in spark-shell...
// uncompressed CSV without a header
val df = spark.read.csv("input.csv")
df.write.format("parquet").save("output.parquet")
// it produces a directory output.parquet/ with the following content:
//
// ls -l output.parquet/
// -rw-r--r-- 1 bza bza 146346 Sep 29 16:06 part-r-00000-8a89541d-9071-4246-b525-22e894ef3e0b.snappy.parquet
// -rw-r--r-- 1 bza bza      0 Sep 29 16:06 _SUCCESS
//
// by default the parquet file is snappy-compressed
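// Snappy is just Spark's default; another codec can be chosen per write
// via the parquet "compression" option, or globally via the SQL config.
// A minimal sketch - the "output_gzip.parquet" path is only illustrative:
df.write.option("compression", "gzip").parquet("output_gzip.parquet")
// or globally, affecting all subsequent parquet writes:
// spark.conf.set("spark.sql.parquet.compression.codec", "gzip")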
// a gzip-compressed CSV is no problem either - it is decompressed transparently
val df_gz = spark.read.csv("input.csv.gz")
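// the resulting DataFrame can be written out the same way; a sketch with
// an illustrative output path:
df_gz.write.format("parquet").save("output_gz.parquet")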
// the parquet file can be read back again:
val df_parquet = spark.read.parquet("output.parquet")
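// a quick sanity check that the round trip preserved the data:
df_parquet.printSchema()
df_parquet.count()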
// add column names when the CSV header is missing
// (without a header Spark assigns generic names: _c0, _c1, ...):
val columns = Seq("foo", "bar", "baz")
val df_with_columns = df.toDF(columns: _*)
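// alternatively, when the CSV does have a header row, Spark can pick the
// column names up directly (and optionally infer column types); a sketch
// assuming an illustrative "input_with_header.csv":
val df_header = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input_with_header.csv")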