@giefferre
Last active July 10, 2021 18:01
Save the schema of a Spark DataFrame so it can be reused when reading JSON files.
from pyspark.sql.types import StructType

# read a part of the whole data lake just to extract the schema
part = spark.read.json("s3a://path/to/json/part")

# store the schema as a pickle file via a temporary RDD
temp_rdd = sc.parallelize(part.schema)
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")

# from now on the saved schema can be reloaded and passed to the reader,
# which skips schema inference and speeds up reading JSON files
schema_rdd = sc.pickleFile("s3a://path/to/destination_schema.pickle")
reading_schema = StructType(schema_rdd.collect())
your_data_set = spark.read.json("s3a://path/to/entire_data_lake", reading_schema)  # quicker than a plain spark.read.json()
@federicobaiocco

Have you tried it in a glue job? I am getting an error:
An error occurred while calling o75.saveAsObjectFile. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found

@giefferre (Author)
@federicobaiocco unfortunately I haven't, sorry. I ran the commands on an AWS EMR cluster using an Apache Zeppelin notebook.

@NMRobert

Hey @federicobaiocco, if you add this configuration line it should work in Glue:

sc = SparkContext.getOrCreate()  # or however you grab your SparkContext
sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")

@gsunita

gsunita commented Jul 10, 2021

Can we store the schema in JSON or text format and read it back later, instead of using .pickle? I want to edit the schema file, and the pickle format is not readable or editable.
