Reading JSON into a Spark DataFrame
Method 1 (efficient: specify the schema when constructing the DataFrame)
# Method 1: supply an explicit schema so Spark can skip schema inference
# and read the JSON files in a single pass.
from pyspark.sql.types import StructType, StructField, StringType

# Every column in this dataset is a nullable string, so the schema can be
# built from a plain list of column names instead of 13 repeated
# StructField(...) lines.
_COLUMNS = [
    'aggregated',
    'body',
    'entity',
    'metric_id',
    'metric_name',
    'producer',
    'submitter_environment',
    'submitter_host',
    'submitter_hostgroup',
    'timestamp',
    'toplevel_hostgroup',
    'type',
    'version',
]
schema = StructType([StructField(name, StringType(), True) for name in _COLUMNS])

# Read the whole month's data with the predefined schema.
df = spark.read.schema(schema).json("/project/itmon/archive/lemon/hadoop_ng/2018-12/")
Method 2 (efficient: use it for one-time generation of the schema, which can then be reused)
# Method 2: infer the schema once from a single representative part-file,
# then reuse it to load the full dataset without re-inferring.
df = spark.read.json("/project/itmon/archive/lemon/hadoop_ng/2018-12/part-r-00000")

# Serialise the inferred schema to a JSON string; this string can be
# persisted (e.g. to a file) and reloaded in later sessions so inference
# never has to run again.
schema_json = df.schema.json()

from pyspark.sql.types import StructType
import json

# Rebuild a StructType from the serialised schema and apply it when
# reading the whole month's data.
schema = StructType.fromJson(json.loads(schema_json))
df = spark.read.schema(schema).json("/project/itmon/archive/lemon/hadoop_ng/2018-12")
Method 3 (very inefficient: the whole file is scanned to infer the schema before the DataFrame is constructed)
# Method 3: no schema given, so Spark must scan the entire file to infer
# one before it can build the DataFrame — avoid on large datasets.
df = spark.read.json("/project/itmon/archive/lemon/hadoop_ng/2018-12/part-r-00000")