@itaysk
Created January 14, 2017 16:40
How to process Event Hub Archive's files using Spark
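from pyspark.sql import SparkSession

# Requires the Databricks spark-avro package on the Spark classpath.
# Setting ignore.inputs.without.extension to 'false' makes the Avro reader
# pick up files that do not end in .avro.
spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# storage -> avro
# `in_path` is not defined in this snippet; set it to the storage path of the archive files.
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json: cast the Body column to a string, then keep only that value from each row
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in practice, specify a schema for the JSON instead of inferring it

# do whatever you want with `data`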
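
For readers who want to see what the avro -> json step does to a single record, here is a minimal sketch in plain Python, without Spark. It assumes each event payload is UTF-8 encoded JSON stored in the `Body` field; the sample records and the `decode_body` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical archive records: the event payload sits in a bytes field
# named Body (the sample values are invented for illustration).
records = [
    {"Body": b'{"deviceId": "dev-1", "temperature": 21.5}'},
    {"Body": b'{"deviceId": "dev-2", "temperature": 19.0}'},
]


def decode_body(record):
    """Decode the Body bytes as UTF-8 text and parse the text as JSON."""
    return json.loads(record["Body"].decode("utf-8"))


# Per-record equivalent of casting Body to a string and parsing it as JSON.
events = [decode_body(r) for r in records]
print(events[0]["deviceId"])
```

In the Spark version the same decoding runs across the whole dataset, and `spark.read.json` additionally infers a common schema over all records unless one is supplied.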