Skip to content

Instantly share code, notes, and snippets.

@geoHeil
Last active August 24, 2019 12:11
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save geoHeil/b1be2ec09f9c5e9a3b887073fe8bf004 to your computer and use it in GitHub Desktop.
Save geoHeil/b1be2ec09f9c5e9a3b887073fe8bf004 to your computer and use it in GitHub Desktop.
azure event hub captured avro file parsing in spark
// using from_json
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType
val schema = spark.read.json(df.select("Body").as[String]).schema
val otherColumns = df.drop("Body").columns.map(col)
val combined = otherColumns :+ from_json(col("Body").cast(StringType), schema).alias("Body_parsed")
val result = df.select(combined:_*)
result.printSchema
// https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro
// using from_avro
import org.apache.spark.sql.avro._
// generate it using: avro-tools -getschema youravrofile
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
df.select(from_avro('Body, jsonFormatSchema) as 'Body_parsed).printSchema
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment