@davidallsopp
Created November 14, 2016 08:53
Map over a Spark DataFrame via its RDD, then convert the result back to a DataFrame so that the Databricks spark-avro API can write it to an Avro file.
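// Run in spark-shell, where sc and sqlContext are predefined; spark-avro must be on the classpath.
//import sqlContext.implicits._
import com.databricks.spark.avro._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Schema the input rows would have as a DataFrame (unused here: the input below is a plain RDD[Row]).
//val inschema = StructType(List(StructField("name", StringType, true), StructField("age", IntegerType, true)))
// Schema of the mapped output: a single nullable string column.
val outschema = StructType(List(StructField("summary", StringType, true)))
// Input rows of (name, age).
val input = sc.parallelize(List(Row("fred", 34), Row("wilma", 33)))
// Map each input Row to a one-field Row that matches outschema.
val out = input.map{ case Row(name, age) => Row(s"$name is $age") }
// Attach the schema to turn the RDD[Row] back into a DataFrame.
val df = sqlContext.createDataFrame(out, outschema)
// write.avro is provided by the com.databricks.spark.avro._ implicits.
df.write.avro("df-mapping.avro")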
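
The snippet above starts from an RDD of Rows rather than from a DataFrame. The following is a minimal sketch of the full DataFrame -> RDD -> DataFrame round trip that the description mentions. It assumes a Spark 1.x-style spark-shell where `sc` and `sqlContext` are predefined and the spark-avro package is on the classpath. The names `indf`, `mapped` and `outdf` and the output path are illustrative only and not part of the original gist.

```scala
import com.databricks.spark.avro._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Assumption: running in spark-shell, so sc and sqlContext already exist.
val inschema = StructType(List(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))
val outschema = StructType(List(StructField("summary", StringType, true)))

// Build an actual input DataFrame from the same sample rows.
val indf = sqlContext.createDataFrame(
  sc.parallelize(List(Row("fred", 34), Row("wilma", 33))), inschema)

// DataFrame -> RDD[Row], map each row, then re-attach a schema.
// The typed pattern throws a MatchError on null fields, so this
// assumes the data contains no nulls.
val mapped = indf.rdd.map { case Row(name: String, age: Int) => Row(s"$name is $age") }
val outdf = sqlContext.createDataFrame(mapped, outschema)

// Illustrative output path; by default the write fails if it already exists.
outdf.write.avro("df-mapping-from-df.avro")

// Read the file back to check what was written.
sqlContext.read.avro("df-mapping-from-df.avro").show()
```

Going through `.rdd` keeps the mapping function free to return rows of any shape, at the cost of having to supply the output schema by hand.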