@chetkhatri
Last active September 24, 2016 07:43
# Data Science Workflow
● Understand the Business
● Understand the Data
● Cleanse the Data
● Analyze the Data
● Predict from the Data
● Visualize the Data
● Build Insights that help grow Business Revenue
● Explain to Executives (CxO)
● Make Decisions
● Increase Revenue
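The workflow above can be sketched as a chain of stages, each consuming the previous stage's output. Every function and number here is a hypothetical stand-in; the point is only the shape of the pipeline:

```python
# Data-science workflow as a function pipeline. Each stage is a
# hypothetical stand-in for the corresponding step in the list above.

def cleanse(records):
    """Drop rows with missing values (noisy/missing-data removal)."""
    return [r for r in records if None not in r.values()]

def analyze(records):
    """Toy analytics: average the 'revenue' field."""
    return sum(r["revenue"] for r in records) / len(records)

def predict(avg_revenue, growth=1.1):
    """Toy prediction: assume 10% growth (hypothetical model)."""
    return avg_revenue * growth

raw = [
    {"revenue": 100.0},
    {"revenue": None},    # dropped by cleansing
    {"revenue": 300.0},
]

insight = predict(analyze(cleanse(raw)))
print(insight)  # ~220.0
```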
# Machine Learning Workflow
1. Data Quality (removing noise and missing data)
2. Feature Engineering
3. Choosing the Best Model: base the choice on the nature of the data. For example, if the target is continuous, start with Linear Regression; if a categorical (binomial) prediction is required, use Logistic Regression; Random Forest (feature randomization over random samples of the data) gives better generalization performance; Gradient Boosted Trees build an optimal linear combination of trees, predicting with a weighted sum of the individual trees' predictions.
Try everything from Linear Regression to Deep Learning (RNN, CNN).
4. Ensemble Models (Regression + Random Forest + XGBoost)
5. Tune Hyperparameters (e.g. in a Deep Neural Network: mini-batch size, learning rate, epochs, number of hidden layers)
6. Model Compression - Port the model to embedded/mobile devices by compressing its matrices (sparsify, shrink, break, quantize)
7. Run on a smartphone
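Step 4's ensembling can be sketched without any ML library: combine the predictions of several independently trained models by a weighted average. The three toy "models" and the weights below are hypothetical stand-ins for the Regression / Random Forest / XGBoost trio:

```python
# Minimal ensembling sketch: weighted average of several models'
# predictions. The three lambdas are hypothetical stand-ins for a
# linear model, a random forest, and XGBoost.

def ensemble_predict(models, weights, x):
    """Weighted average of each model's prediction for input x."""
    assert len(models) == len(weights)
    total = sum(weights)
    return sum(w * m(x) for m, w in zip(models, weights)) / total

# Toy "models": each maps a feature value to a prediction.
linear_model = lambda x: 2.0 * x + 1.0
forest_model = lambda x: 2.1 * x + 0.8
boosted_model = lambda x: 1.9 * x + 1.1

models = [linear_model, forest_model, boosted_model]
weights = [0.2, 0.4, 0.4]   # trust the tree models a bit more

print(ensemble_predict(models, weights, 3.0))  # 6.96
```

In practice the weights are themselves tuned on a validation set, which is what step 5 (hyperparameter tuning) covers.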
# Big Data Cluster Tuning
TCP TIME_WAIT interval - 4 min
TPS (Transactions Per Second)
Max ports
Max connections
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
Max Threads - sysctl kernel.threads-max
echo 120000 > /proc/sys/kernel/threads-max
echo 600000 > /proc/sys
cat /proc/sys/kernel/threads-max
Number of Threads = Total Virtual Memory / (Stack Size * 1024 * 1024), with Stack Size in MB
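The thread-count formula can be checked with a quick calculation. The RAM and stack-size figures below are hypothetical examples; on a real host read them from `ulimit -v` and `ulimit -s`:

```python
# Max thread estimate: total virtual memory divided by per-thread
# stack size, per the formula above. Figures are hypothetical
# examples, not measured values.

def max_threads(total_virtual_memory_bytes, stack_size_mb):
    """threads = total virtual memory / (stack size * 1024 * 1024)"""
    return total_virtual_memory_bytes // (stack_size_mb * 1024 * 1024)

total_vm = 64 * 1024**3   # 64 GiB of virtual memory (example)
stack_mb = 8              # common default stack of 8 MB (ulimit -s = 8192)

print(max_threads(total_vm, stack_mb))  # 8192
```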
## JVM Heap Memory Setting
List RAM: free -m
Storage: df -h
ulimit -s  # stack size (KB)
ulimit -v  # virtual memory (KB)
echo 120000 > /proc/sys/kernel/threads-max
echo 600000 > /proc/sys/vm/max_map_count
echo 200000 > /proc/sys/kernel/pid_max
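Given the RAM reported by `free -m`, heap sizing can be sketched as below. The 0.75 ratio is a common rule of thumb (an assumption here, not stated in these notes): give the heap most of physical RAM and leave the rest for the OS and page cache.

```python
# JVM heap sizing sketch: heap = fraction of physical RAM (the 0.75
# ratio is an assumed rule of thumb), emitted as -Xms/-Xmx flags.
# The RAM figure is a hypothetical example; `free -m` gives the real one.

def jvm_heap_flags(ram_mb, heap_fraction=0.75):
    heap_mb = int(ram_mb * heap_fraction)
    # Equal -Xms and -Xmx avoids heap resizing pauses.
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"

print(jvm_heap_flags(32768))  # -Xms24576m -Xmx24576m
```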
## Virtual Memory Configuration
#############
Step 1: Swap
#############
sudo fallocate -l 20G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo swapon -s
sudo nano /etc/fstab
/swapfile none swap sw 0 0
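The 20G in `fallocate -l 20G` is a fixed choice for this cluster. A common sizing heuristic (an assumption here, not from these notes) scales swap with RAM up to a cap:

```python
# Swap sizing sketch: swap = RAM for small machines, else half of
# RAM capped at a maximum. This heuristic is an assumption; the
# notes simply use a fixed 20G swapfile.

def swap_size_gb(ram_gb, cap_gb=32):
    if ram_gb <= 2:
        return ram_gb
    return min(ram_gb // 2, cap_gb)

print(swap_size_gb(64))  # 32
```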
################
Step 2: Open file limits
################
ulimit -n
sudo nano /etc/security/limits.conf
* soft nofile 64000
* hard nofile 64000
root soft nofile 64000
root hard nofile 64000
sudo nano /etc/pam.d/common-session
session required pam_limits.so
sudo nano /etc/pam.d/common-session-noninteractive
session required pam_limits.so
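The effect of the nofile settings above can be verified from inside a process with Python's stdlib `resource` module (Unix only):

```python
# Read the open-file limit that `ulimit -n` and limits.conf control.
# RLIMIT_NOFILE returns a (soft, hard) pair: the soft limit is what
# the process can actually use; it can be raised at most to the hard
# limit by an unprivileged process.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
```

After applying the limits.conf and pam_limits changes and re-logging in, this should report 64000 for both values.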
# Tune Kafka Cluster
```
producerConfig:
  buffer.memory: default
  # batch.size: "327679"
  batch.size: "655357"
  linger.ms: "5"
  compression.type: lz4
  retries: default
  send.buffer.bytes: default
  connections.max.idle.ms: default
```
Key producer parameters to tune: bootstrap.servers, batch.size, linger.ms, connections.max.idle.ms (e.g. 10000), compression.type, retries
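The settings above map directly onto the properties passed to a Kafka producer. A sketch using the standard dotted config names; the broker address is a hypothetical example and no broker is contacted:

```python
# Producer properties from the notes as a plain dict. Entries marked
# "default" are dropped so the client/broker defaults apply; only
# explicit overrides are passed. bootstrap.servers is a hypothetical
# example address.

raw = {
    "bootstrap.servers": "broker1:9092",   # hypothetical address
    "buffer.memory": "default",
    "batch.size": "655357",
    "linger.ms": "5",
    "compression.type": "lz4",
    "retries": "default",
    "send.buffer.bytes": "default",
    "connections.max.idle.ms": "default",
}

# Keep only the keys we explicitly override.
producer_config = {k: v for k, v in raw.items() if v != "default"}
print(producer_config)
```

Raising batch.size and setting a small linger.ms trades a few milliseconds of latency for much larger batches, which is where the throughput gain comes from.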
# Spark Cluster Hyperparameter Tuning
1) spark-shell flags:
```
./spark-shell \
  --conf spark.executor.memory=50g \
  --conf spark.driver.memory=150g \
  --conf spark.kryoserializer.buffer.max=256m \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.rpc.askTimeout=300s \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.sql.shuffle.partitions=1024
```
2) Job configuration:
```
spark.master: spark://master:7077
spark.deploy.mode: cluster
hdfsPath: hdfs://master:9000/home/spark/chetan/
spark.app.name: DemoAnalytics
```

spark-defaults.conf:
```
spark.master                   spark://master.prod.chetan.com:7077
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled         true
spark.history.fs.logDirectory  file:/data/tmp/spark-events
# spark.eventLog.dir           hdfs://namenode_host:namenode_port/user/spark/applicationHistory4
spark.eventLog.dir             file:/data/tmp/spark-events
```
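spark.sql.shuffle.partitions=1024 above is a fixed choice. A common sizing heuristic (an assumption here, not from these notes) targets roughly 128 MB of shuffle data per partition, bounded below by the cluster's total executor cores:

```python
# Shuffle partition sizing sketch: aim for ~128 MB per partition
# (an assumed rule of thumb), but never fewer partitions than the
# cluster has executor cores. Figures are hypothetical examples.

def shuffle_partitions(shuffle_data_mb, total_cores, target_mb=128):
    # -(-a // b) is ceiling division in integer arithmetic.
    return max(total_cores, -(-shuffle_data_mb // target_mb))

print(shuffle_partitions(shuffle_data_mb=131072, total_cores=200))  # 1024
```

Too few partitions starves cores and risks per-task OOM; too many adds scheduling overhead, so the floor at the core count matters on small inputs.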