1. Standalone function:

   from pyspark.sql.functions import udf
   from pyspark.sql.types import IntegerType

   def _add_one(x):
       """Adds one"""
       if x is not None:
           return x + 1

   add_one = udf(_add_one, IntegerType())

   Importance: This allows for full control flow, including exception handling, but duplicates variables.
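   A minimal usage sketch (hedged: the SparkSession `spark`, the sample data, and the column name below are assumptions, not part of the original note):

       # Apply the UDF to a DataFrame column; rows where x is None come back as null.
       df = spark.createDataFrame([(1,), (2,), (None,)], ["value"])
       df.select(add_one(df["value"]).alias("value_plus_one")).show()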
ORC File Basics:
================
- Columnar format: lets readers read and decompress just the bytes (column pieces) they need
- Fast
- Indexed: readers can jump into the middle of a file
- Self-describing: the file includes all information about types and encodings
- Rich type system: supports complex types such as timestamp, struct, map, list, and union
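A minimal PySpark read/write sketch (hedged: assumes an active SparkSession `spark`, a DataFrame `df`, and an illustrative path and column name):

    # Write the DataFrame as ORC, then read it back; only the requested columns are decompressed.
    df.write.mode("overwrite").orc("/tmp/events_orc")
    orc_df = spark.read.orc("/tmp/events_orc")
    orc_df.select("event_type").show()   # "event_type" is a hypothetical column name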
File compatibility:
==================
Parquet Benefits:
=================
- Columnar storage
- Efficient storage
- Efficient data IO and CPU utilisation
- Reads fewer blocks
- Key concepts (see the write sketch below):
    Block size - target size of a row group on disk, commonly aligned with the HDFS block size
    Row group - a horizontal slice of the data; within it, each column is stored as a column chunk
    Page - the smallest unit of storage and compression inside a column chunk
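A write sketch tying the key concepts together (hedged: `spark` and `df` are assumed, the sizes are only illustrative, and the options are the standard parquet-mr settings, which recent Spark versions forward to the writer):

    # Row group ("block") size and page size are set per write here.
    (df.write
       .option("parquet.block.size", 128 * 1024 * 1024)   # row group size in bytes (~one HDFS block)
       .option("parquet.page.size", 1 * 1024 * 1024)      # page size within each column chunk
       .mode("overwrite")
       .parquet("/tmp/events_parquet"))
    pq_df = spark.read.parquet("/tmp/events_parquet")      # column pruning: only selected columns are read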
Ganglia
- An open-source, scalable cluster performance monitoring tool
- Available on almost all operating systems
Data flow:
=========
Daemon, one per node/LPAR (logical partition):
1. On every node a daemon named "gmond" (the Ganglia monitoring daemon) runs, using the configuration file /etc/gmond.conf
2. Say we have 3 nodes: "gmond" runs on each node, and the 3 of them share information such as
   File access
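A small poll sketch (hedged: gmond typically serves the current cluster state as XML on TCP port 8649, the default tcp_accept_channel in gmond.conf; the host and port here are assumptions):

    import socket

    def read_gmond_xml(host="localhost", port=8649):
        """Connect to a gmond daemon and return its XML metric dump."""
        chunks = []
        with socket.create_connection((host, port), timeout=5) as sock:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace")

    print(read_gmond_xml()[:500])   # start of the <GANGLIA_XML> document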
Notes taken from Spark Summit 2018 Europe (by Wenchen Fan, Databricks)
Executor:
=========
1. Each executor contains a memory manager and a thread pool
2. The 5 key areas in the executor's memory model are:
   1. Data source - such as JSON, CSV, Parquet, etc.
   2. Internal format - data represented in binary format
   3. Operators - such as filter, join, substr, regexp, etc.
   4. Memory manager -
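   An illustrative sketch of areas 1 and 3 (hedged: `spark`, the file path, and the column names are hypothetical):

       df = spark.read.json("/tmp/events.json")                        # 1. data source
       out = df.filter(df["status"] == "OK") \
               .select(df["user"].substr(1, 3).alias("prefix"))        # 3. operators (filter, substr)
       out.explain()   # the physical plan shows these operators running over Spark's internal binary format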
Apache Spark Performance Troubleshooting at Scale: Challenges, Tools, and Methodologies - from CERN
sparkMeasure GitHub link - https://github.com/LucaCanali/sparkMeasure
- Can be used to measure metrics of a Spark job
- Easy to start: just add it to --packages, e.g. bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13 (see the PySpark sketch below)
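A hedged PySpark sketch of driving it (assumes the sparkmeasure Python package is installed and the matching --packages jar is on the classpath; names follow the project's README):

    from sparkmeasure import StageMetrics

    stagemetrics = StageMetrics(spark)
    stagemetrics.begin()
    spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    stagemetrics.end()
    stagemetrics.print_report()   # aggregated stage-level metrics for the workload above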
Measuring Spark:
1. WebUI
2. Execution plans and DAGs
3. WebUI event timeline - see what each task is doing
Dropwizard metrics:
==================
1. Push metrics into Ganglia, Graphite, etc. (can be enabled with a SQL configuration):
   spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
2. Enable INFO or DEBUG logging for org.apache.spark.sql.kafka010.KafkaSource to see what happens inside.
   Add the following line to conf/log4j.properties:
   log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=DEBUG
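A minimal sketch for item 1 (hedged: the rate source and console sink are placeholders just to produce streaming metrics; an actual Ganglia/Graphite sink would still have to be configured in conf/metrics.properties):

    spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
    stream_df = spark.readStream.format("rate").load()          # built-in test source
    query = stream_df.writeStream.format("console").start()     # streaming query whose metrics get reported
    query.awaitTermination(30)
    query.stop()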