
@prasanthkothuri
Last active April 11, 2019 11:23

Reading JSON into a Spark DataFrame

method 1 (efficient, specify the schema when constructing the dataframe)

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField('aggregated', StringType(), True),
                     StructField('body', StringType(), True),
                     StructField('entity', StringType(), True),
                     StructField('metric_id', StringType(), True),
                     StructField('metric_name', StringType(), True),
                     StructField('producer', StringType(), True),
                     StructField('submitter_environment', StringType(), True),
                     StructField('submitter_host', StringType(), True),
                     StructField('submitter_hostgroup', StringType(), True),
                     StructField('timestamp', StringType(), True),
                     StructField('toplevel_hostgroup', StringType(), True),
                     StructField('type', StringType(), True),
                     StructField('version', StringType(), True)])
df = spark.read.schema(schema).json("/project/itmon/archive/lemon/hadoop_ng/2018-12/")

method 2 (efficient, infer the schema once from a single file, then reuse it for the whole directory)

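import json
from pyspark.sql.types import StructType

# one-time step: infer the schema from a single part file
df = spark.read.json("/project/itmon/archive/lemon/hadoop_ng/2018-12/part-r-00000")
schema_json = df.schema.json()  # JSON string, can be saved and reused

# rebuild the schema from the JSON string and read the whole directory with it
schema = StructType.fromJson(json.loads(schema_json))
df = spark.read.schema(schema).json("/project/itmon/archive/lemon/hadoop_ng/2018-12")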

method 3 (very inefficient, no schema is given, so Spark parses the whole file just to infer one before constructing the dataframe)

df = spark.read.json("/project/itmon/archive/lemon/hadoop_ng/2018-12/part-r-00000")