@itaysk
Created January 14, 2017 16:40
How to process Event Hub Archive's files using Spark
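from pyspark.sql import SparkSession

# The config below keeps the Avro reader from skipping input files that
# do not end in .avro.
spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# storage -> avro
# in_path is not defined in this snippet; set it to the location of the archive files.
# The "com.databricks.spark.avro" format requires the spark-avro package.
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json: Body holds the event payload as bytes, so cast it to a string
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in real-world use, it's better to specify a schema for the JSON

# do whatever you want with `data`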
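
To make the avro->json step easier to follow, here is a minimal sketch in plain Python (no Spark) of what happens to each record: the Body bytes are decoded as text and then parsed as JSON. The sample records, the field names inside the payload (`deviceId`, `temperature`), and the helper `body_to_json` are hypothetical and only for illustration; the sketch also assumes the payloads are UTF-8 encoded JSON, which the gist implies but does not state.

```python
import json

# Hypothetical archive records: each one carries the event payload as raw
# bytes in its Body field (the payload values here are made up).
records = [
    {"Body": b'{"deviceId": "sensor-1", "temperature": 21.5}'},
    {"Body": b'{"deviceId": "sensor-2", "temperature": 19.0}'},
]


def body_to_json(record):
    """Decode one record's Body bytes as UTF-8 text and parse it as JSON."""
    return json.loads(record["Body"].decode("utf-8"))


# Per-record equivalent of casting Body to a string and reading it as JSON.
events = [body_to_json(r) for r in records]

for event in events:
    print(event["deviceId"], event["temperature"])
```

In the Spark version the same two steps are spread across `cast("string")` and `spark.read.json`, with the added benefit that Spark infers (or is given) a schema for the whole dataset rather than handling one record at a time.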