Convert open-food-facts csv to gzipped Parquet
// Accompanies the blog post at http://loganakamatsu.com/#blog
// The data comes from http://openfoodfacts.org/
/* In Zeppelin, add spark-csv:
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")
*/
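/* Outside Zeppelin, the same dependency can be pulled in at launch
   (assuming Spark 1.x): spark-shell --packages com.databricks:spark-csv_2.10:1.2.0 */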
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Build an all-string schema from the header row of the tab-separated export
val header = sc.textFile("/tmp/en.openfoodfacts.org.products.csv").first
val schema = StructType(
  header.split("\t").map(fieldName => StructField(fieldName, StringType, nullable = true))
)
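// Optional sanity check: confirm how many columns were parsed from the header
println(s"schema has ${schema.fields.length} fields")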
// Read the TSV with spark-csv, applying the explicit schema instead of inferring one
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .schema(schema)
  .load("/tmp/en.openfoodfacts.org.products.csv")
val coalesced = df.coalesce(1) // The dataframe is pretty small, so we collapse to one partition for writing
// Write once with Snappy compression...
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
coalesced.write.parquet("/tmp/foods.snappy")
// ...and once with gzip, so the output sizes can be compared
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
coalesced.write.parquet("/tmp/foods.gz")
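// Quick read-back check (a sketch; assumes the writes above succeeded):
// load one of the Parquet outputs and confirm its schema and row count survive the round trip.
val check = sqlContext.read.parquet("/tmp/foods.snappy")
check.printSchema()
println(s"rows: ${check.count()}")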