RajaShyam / Spark_streaming
Last active November 24, 2018 08:36
Spark Streaming
Dropwizard metrics:
==================
1. Push metrics into sinks such as Ganglia or Graphite, enabled via the SQL configuration below (a runnable sketch follows these notes):
spark.conf.set("spark.sql.streaming.metricsEnabled","true")
2. Enable INFO or DEBUG logging levels for org.apache.spark.sql.kafka010.KafkaSource to see what happens inside.
Add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG
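A minimal runnable sketch of (1) in PySpark, assuming a local session and a toy rate source; once the flag is set, the streaming metrics flow to whatever sink conf/metrics.properties configures (e.g. Ganglia or Graphite):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metrics-demo").getOrCreate()

# Expose Structured Streaming metrics through the Dropwizard registry
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

# A toy query so there is something to report on
query = (spark.readStream
         .format("rate").option("rowsPerSecond", 10).load()
         .writeStream.format("console").start())

query.awaitTermination(30)   # run for ~30 seconds
query.stop()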
RajaShyam / Spark_measure
Created October 9, 2018 11:01
Spark Measure
Apache Spark Performance Troubleshooting at Scale: Challenges, Tools, and Methodologies - from CERN
sparkMeasure GitHub link - https://github.com/LucaCanali/sparkMeasure
- Can be used to measure metrics of a Spark job
- Can be pulled in simply via --packages: bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13 (usage sketch below)
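A minimal sketch using the Python wrapper (pip install sparkmeasure), assuming the JVM package above is on the classpath and `spark` is an active session; the measured query is just a placeholder:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
# Runs the statement and prints stage-level metrics gathered while it ran
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000)").show()')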
Measuring Spark:
1. Web UI
2. Execution plans and DAGs
3. Web UI event timeline - see what each task is doing
RajaShyam / Spark memory model
Created October 5, 2018 16:02
A developers view into spark memory model
Notes taken from Spark Summit Europe 2018 (by Wenchen Fan, Databricks):
Executor:
=========
1. Each executor contains Memory manager and Thread pool
2. The 5 key areas in the executor's memory model (memory-manager knobs are sketched after this list) are:
   1. Data source - such as JSON, CSV, Parquet
   2. Internal format - data represented in a compact binary format
   3. Operators - such as filter, join, substr, regexp
   4. Memory manager -
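The memory manager's behaviour is driven by a handful of standard Spark configs; a minimal sketch of those knobs (the values are illustrative, not from the talk):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")           # heap per executor
         .config("spark.memory.fraction", "0.6")          # pool shared by execution and storage
         .config("spark.memory.storageFraction", "0.5")   # storage's share of that pool
         .getOrCreate())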
RajaShyam / Ganglia_basics
Created May 30, 2018 02:28
Ganglia and Its basics
Ganglia
- An open-source, scalable cluster performance-monitoring tool
- Available on almost all operating systems
Data flow:
=========
One daemon per node/LPAR (logical partition):
1. On every node runs a daemon named "gmond" - the Ganglia monitoring daemon - which uses the configuration /etc/gmond.conf
2. Say we have 3 nodes; "gmond" runs on each, and the 3 daemons share information such as the following (a polling sketch follows):
   File access
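Because gmond answers any TCP connection on its default port (8649) with an XML dump of cluster state, the shared metrics can be inspected directly; a minimal Python sketch, where the host name is an assumption:

import socket

# Connect to gmond's XML port and read the full metrics dump
with socket.create_connection(("node1", 8649), timeout=5) as s:
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b"".join(chunks).decode()[:500])   # peek at the start of the XML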
Parquet Benefits:
================
- Columnar storage
- Efficient storage
- Efficient data IO and CPU utilisation
- Reads a smaller number of blocks
- Key concepts (illustrated below):
    Block size
    Row group - a horizontal slice of the data, stored column by column
    Page - the unit of encoding and compression within a column chunk
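A minimal PySpark sketch of the columnar benefit: reading back one column touches only that column's chunks inside each row group, not the whole file (the path is an assumption):

df = spark.range(1_000_000).selectExpr("id", "id % 7 as bucket")
df.write.mode("overwrite").parquet("/tmp/demo.parquet")

# Only the 'bucket' column chunks are scanned here
spark.read.parquet("/tmp/demo.parquet").select("bucket").distinct().show()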
RajaShyam / Orc_basics
Created May 27, 2018 08:35
Basics on ORC file format
ORC File Basics:
================
- Columnar format: enables the reader to read & decompress just the bytes (pieces) they need
- Fast
- Indexed - can jump into the middle of the file (see the read sketch below)
- Self-describing - includes all info about types and encodings
- Rich type system - supports complex types such as timestamp, struct, map, list and union
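A minimal PySpark sketch of the index benefit: with a pushed-down filter, stripes whose min/max statistics rule out the predicate are skipped entirely (path and column names are assumptions):

df = spark.range(10_000_000).selectExpr("id", "id % 100 as bucket")
df.write.mode("overwrite").orc("/tmp/demo.orc")

# The min/max indexes let the reader jump past stripes that cannot match
spark.read.orc("/tmp/demo.orc").where("bucket = 7").count()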
File compatibility:
==================
RajaShyam / Different_ways_of_UDF
Last active June 4, 2018 22:07
pyspark exploration
1. Standalone function:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def _add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1

add_one = udf(_add_one, IntegerType())

Importance: This allows full control flow, including exception handling, but it duplicates names: the plain function and the wrapped UDF end up as separate variables.
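A minimal usage sketch (the DataFrame and column name are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (None,)], ["x"])

# None rows come back as null, thanks to the explicit None check
df.select(add_one("x").alias("x_plus_one")).show()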