RajaShyam / Spark_streaming
Last active November 24, 2018 08:36
Spark Streaming
Dropwizard metrics:
==================
1. Push metrics into sinks such as Ganglia or Graphite, enabled via the SQL configuration below (a runnable sketch follows these notes):
spark.conf.set("spark.sql.streaming.metricsEnabled","true")
2. Enable INFO or DEBUG logging levels for org.apache.spark.sql.kafka010.KafkaSource to see what happens inside.
Add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG
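A minimal runnable sketch of (1) in PySpark, assuming a local session and a toy rate source; once the flag is set, the streaming metrics flow to whatever sink conf/metrics.properties configures (e.g. Ganglia or Graphite):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metrics-demo").getOrCreate()

# Expose Structured Streaming metrics through the Dropwizard registry
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

# A toy query so there is something to report on
query = (spark.readStream
         .format("rate").option("rowsPerSecond", 10).load()
         .writeStream.format("console").start())

query.awaitTermination(30)   # run for ~30 seconds
query.stop()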
RajaShyam / Spark_measure
Created October 9, 2018 11:01
Spark Measure
Apache Spark Performance Troubleshooting at Scale: Challenges, Tools, and Methodologies - from CERN
sparkMeasure GitHub link - https://github.com/LucaCanali/sparkMeasure
- Can be used to measure metrics of a Spark job
- Can be pulled in simply via --packages: bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13 (usage sketch below)
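A minimal sketch using the Python wrapper (pip install sparkmeasure), assuming the JVM package above is on the classpath and `spark` is an active session; the measured query is just a placeholder:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
# Runs the statement and prints stage-level metrics gathered while it ran
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000)").show()')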
Measuring Spark:
1. Web UI
2. Execution plans and DAGs
3. Web UI event timeline - see what each task is doing
RajaShyam / Spark memory model
Created October 5, 2018 16:02
A developers view into spark memory model
Notes taken from Spark Summit Europe 2018 (by Wenchen Fan, Databricks):
Executor:
=========
1. Each executor contains Memory manager and Thread pool
2. The 5 key areas in the executor's memory model (memory-manager knobs are sketched after this list) are:
   1. Data source - such as JSON, CSV, Parquet
   2. Internal format - data represented in a compact binary format
   3. Operators - such as filter, join, substr, regexp
   4. Memory manager -
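The memory manager's behaviour is driven by a handful of standard Spark configs; a minimal sketch of those knobs (the values are illustrative, not from the talk):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")           # heap per executor
         .config("spark.memory.fraction", "0.6")          # pool shared by execution and storage
         .config("spark.memory.storageFraction", "0.5")   # storage's share of that pool
         .getOrCreate())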
RajaShyam / Ganglia_basics
Created May 30, 2018 02:28
Ganglia and Its basics
Ganglia
- An open-source, scalable cluster performance-monitoring tool
- Available on almost all operating systems
Data flow:
=========
One daemon per node/LPAR (logical partition):
1. On every node runs a daemon named "gmond" - the Ganglia monitoring daemon - which uses the configuration /etc/gmond.conf
2. Say we have 3 nodes; "gmond" runs on each, and the 3 daemons share information such as the following (a polling sketch follows):
   File access
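Because gmond answers any TCP connection on its default port (8649) with an XML dump of cluster state, the shared metrics can be inspected directly; a minimal Python sketch, where the host name is an assumption:

import socket

# Connect to gmond's XML port and read the full metrics dump
with socket.create_connection(("node1", 8649), timeout=5) as s:
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b"".join(chunks).decode()[:500])   # peek at the start of the XML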
Parquet Benefits:
================
- Columnar storage
- Efficient storage
- Efficient data IO and CPU utilisation
- Reads a smaller number of blocks
- Key concepts (illustrated below):
    Block size
    Row group - a horizontal slice of the data, stored column by column
    Page - the unit of encoding and compression within a column chunk
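A minimal PySpark sketch of the columnar benefit: reading back one column touches only that column's chunks inside each row group, not the whole file (the path is an assumption):

df = spark.range(1_000_000).selectExpr("id", "id % 7 as bucket")
df.write.mode("overwrite").parquet("/tmp/demo.parquet")

# Only the 'bucket' column chunks are scanned here
spark.read.parquet("/tmp/demo.parquet").select("bucket").distinct().show()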
RajaShyam / Orc_basics
Created May 27, 2018 08:35
Basics on ORC file format
ORC File Basics:
================
- Columnar format: enables the reader to read & decompress just the bytes (pieces) they need
- Fast
- Indexed - can jump into the middle of the file (see the read sketch below)
- Self-describing - includes all info about types and encodings
- Rich type system - supports complex types such as timestamp, struct, map, list and union
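A minimal PySpark sketch of the index benefit: with a pushed-down filter, stripes whose min/max statistics rule out the predicate are skipped entirely (path and column names are assumptions):

df = spark.range(10_000_000).selectExpr("id", "id % 100 as bucket")
df.write.mode("overwrite").orc("/tmp/demo.orc")

# The min/max indexes let the reader jump past stripes that cannot match
spark.read.orc("/tmp/demo.orc").where("bucket = 7").count()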
File compatibility:
==================
RajaShyam / Different_ways_of_UDF
Last active June 4, 2018 22:07
pyspark exploration
1. Standalone function:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def _add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1

add_one = udf(_add_one, IntegerType())

Importance: This allows full control flow, including exception handling, but it duplicates names: the plain function and the wrapped UDF end up as separate variables.
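A minimal usage sketch (the DataFrame and column name are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (None,)], ["x"])

# None rows come back as null, thanks to the explicit None check
df.select(add_one("x").alias("x_plus_one")).show()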