@mrchristine
Created May 28, 2019 21:12
Read / Write Spark Schema to JSON
##### READ SPARK DATAFRAME
# Read the CSV using the header row in the first file for column names, and infer the column types
df = spark.read.option("header", "true").option("inferSchema", "true").csv(fname)
# Keep a handle on the inferred schema so it can be saved for later jobs
df_schema = df.schema
##### SAVE JSON SCHEMA INTO S3 / BLOB STORAGE
# Save the schema so the streaming job can load it later without re-inferring types
dbutils.fs.rm("/home/mwc/airline_schema.json", True)
with open("/dbfs/home/mwc/airline_schema.json", "w") as f:
    f.write(df.schema.json())
##### LOAD JSON SCHEMA BACK TO DATAFRAME SCHEMA OBJECT
import json
from pyspark.sql.types import StructType

schema_path = '/dbfs/home/mwc/airline_schema.json'
with open(schema_path, 'r') as content_file:
    schema_json = content_file.read()
new_schema = StructType.fromJson(json.loads(schema_json))
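The save/load round trip above can be sketched without a Spark cluster, since `StructType.json()` emits plain JSON. The schema below is a hand-built, illustrative example in that format (the field names are not from the gist's airline data); in a real session you would finish with `StructType.fromJson(...)` and `spark.read.schema(...)` as shown in the trailing comments.

```python
import json

# Illustrative schema in the format StructType.json() emits
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "carrier", "type": "string", "nullable": True, "metadata": {}},
        {"name": "delay", "type": "integer", "nullable": True, "metadata": {}},
    ],
})

# Round-trip through a file, as the gist does with the /dbfs/... path
with open("airline_schema_demo.json", "w") as f:
    f.write(schema_json)

with open("airline_schema_demo.json") as f:
    loaded = json.loads(f.read())

print(loaded["fields"][0]["name"])  # carrier
# In a Spark session:
# new_schema = StructType.fromJson(loaded)
# df = spark.read.option("header", "true").schema(new_schema).csv(fname)
```

Passing the saved schema to `spark.read.schema(...)` skips inference entirely, which is the point of persisting it for the streaming job.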
@prajal55

How do you upload a JSON schema to S3?
How do you load a JSON schema file from S3 and use it to read a CSV file?

@SDogra02

Upload JSON schema to S3

import boto3

s3_client = boto3.client('s3')
schema = df.schema.json()  # already a JSON string, so no extra json.dumps needed
s3_client.put_object(Body=schema, Bucket='S3-BucketName', Key='FileName.json')
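One pitfall worth noting: `df.schema.json()` already returns a JSON string, so passing it through `json.dumps` a second time double-encodes it, and the object lands in S3 as a quoted, escaped string rather than a JSON document. A stdlib-only sketch of the difference (the schema string here is a minimal stand-in for what `df.schema.json()` would return):

```python
import json

# Stand-in for the string df.schema.json() would return
schema_str = '{"type": "struct", "fields": []}'

# Uploading schema_str directly keeps it parseable as an object
direct = json.loads(schema_str)

# Wrapping the string in json.dumps again double-encodes it
double_encoded = json.dumps(schema_str)
reloaded = json.loads(double_encoded)

print(type(direct).__name__)    # dict
print(type(reloaded).__name__)  # str -- needs a second json.loads to recover the object
```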


prajal55 commented Mar 7, 2023

thanks!
