@chetkhatri
Last active September 24, 2016 07:43
# Data Science Workflow
● Understand the Business
● Understand the Data
● Cleanse the Data
● Analyze the Data
● Predict from the Data
● Visualize the Data
● Build Insights that help grow Business Revenue
● Explain to Executives (CxO)
● Make Decisions
● Increase Revenue
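The workflow above can be sketched as a chain of stages, each consuming the previous stage's output. Every function and number here is a hypothetical stand-in; the point is only the shape of the pipeline:

```python
# Data-science workflow as a function pipeline. Each stage is a
# hypothetical stand-in for the corresponding step in the list above.

def cleanse(records):
    """Drop rows with missing values (noisy/missing-data removal)."""
    return [r for r in records if None not in r.values()]

def analyze(records):
    """Toy analytics: average the 'revenue' field."""
    return sum(r["revenue"] for r in records) / len(records)

def predict(avg_revenue, growth=1.1):
    """Toy prediction: assume 10% growth (hypothetical model)."""
    return avg_revenue * growth

raw = [
    {"revenue": 100.0},
    {"revenue": None},    # dropped by cleansing
    {"revenue": 300.0},
]

insight = predict(analyze(cleanse(raw)))
print(insight)  # ~220.0
```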
# Machine Learning Workflow
1. Data Quality (removing noise and missing data)
2. Feature Engineering
3. Choosing the Best Model: base the choice on the nature of the data. For example, if the target is continuous, start with Linear Regression; if a categorical (binomial) prediction is required, use Logistic Regression; Random Forest (feature randomization over random samples of the data) gives better generalization performance; Gradient Boosted Trees build an optimal linear combination of trees, predicting with a weighted sum of the individual trees' predictions.
Try everything from Linear Regression to Deep Learning (RNN, CNN).
4. Ensemble Models (Regression + Random Forest + XGBoost)
5. Tune Hyperparameters (e.g. in a Deep Neural Network: mini-batch size, learning rate, epochs, number of hidden layers)
6. Model Compression - Port the model to embedded/mobile devices by compressing its matrices (sparsify, shrink, break, quantize)
7. Run on a smartphone
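Step 4's ensembling can be sketched without any ML library: combine the predictions of several independently trained models by a weighted average. The three toy "models" and the weights below are hypothetical stand-ins for the Regression / Random Forest / XGBoost trio:

```python
# Minimal ensembling sketch: weighted average of several models'
# predictions. The three lambdas are hypothetical stand-ins for a
# linear model, a random forest, and XGBoost.

def ensemble_predict(models, weights, x):
    """Weighted average of each model's prediction for input x."""
    assert len(models) == len(weights)
    total = sum(weights)
    return sum(w * m(x) for m, w in zip(models, weights)) / total

# Toy "models": each maps a feature value to a prediction.
linear_model = lambda x: 2.0 * x + 1.0
forest_model = lambda x: 2.1 * x + 0.8
boosted_model = lambda x: 1.9 * x + 1.1

models = [linear_model, forest_model, boosted_model]
weights = [0.2, 0.4, 0.4]   # trust the tree models a bit more

print(ensemble_predict(models, weights, 3.0))  # 6.96
```

In practice the weights are themselves tuned on a validation set, which is what step 5 (hyperparameter tuning) covers.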
# Big Data Cluster Tuning
TCP TIME_WAIT interval - 4 min
TPS (Transactions Per Second)
Max ports
Max connections
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
Max Threads - sysctl kernel.threads-max
echo 120000 > /proc/sys/kernel/threads-max
echo 600000 > /proc/sys
cat /proc/sys/kernel/threads-max
Number of Threads = Total Virtual Memory / (Stack Size * 1024 * 1024), with Stack Size in MB
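The thread-count formula can be checked with a quick calculation. The RAM and stack-size figures below are hypothetical examples; on a real host read them from `ulimit -v` and `ulimit -s`:

```python
# Max thread estimate: total virtual memory divided by per-thread
# stack size, per the formula above. Figures are hypothetical
# examples, not measured values.

def max_threads(total_virtual_memory_bytes, stack_size_mb):
    """threads = total virtual memory / (stack size * 1024 * 1024)"""
    return total_virtual_memory_bytes // (stack_size_mb * 1024 * 1024)

total_vm = 64 * 1024**3   # 64 GiB of virtual memory (example)
stack_mb = 8              # common default stack of 8 MB (ulimit -s = 8192)

print(max_threads(total_vm, stack_mb))  # 8192
```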
## JVM Heap Memory Setting
List RAM: free -m
Storage: df -h
ulimit -s  # stack size (KB)
ulimit -v  # virtual memory (KB)
echo 120000 > /proc/sys/kernel/threads-max
echo 600000 > /proc/sys/vm/max_map_count
echo 200000 > /proc/sys/kernel/pid_max
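Given the RAM reported by `free -m`, heap sizing can be sketched as below. The 0.75 ratio is a common rule of thumb (an assumption here, not stated in these notes): give the heap most of physical RAM and leave the rest for the OS and page cache.

```python
# JVM heap sizing sketch: heap = fraction of physical RAM (the 0.75
# ratio is an assumed rule of thumb), emitted as -Xms/-Xmx flags.
# The RAM figure is a hypothetical example; `free -m` gives the real one.

def jvm_heap_flags(ram_mb, heap_fraction=0.75):
    heap_mb = int(ram_mb * heap_fraction)
    # Equal -Xms and -Xmx avoids heap resizing pauses.
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"

print(jvm_heap_flags(32768))  # -Xms24576m -Xmx24576m
```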
## Virtual Memory Configuration
#############
Step 1: Swap
#############
sudo fallocate -l 20G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo swapon -s
sudo nano /etc/fstab
/swapfile none swap sw 0 0
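The 20G in `fallocate -l 20G` is a fixed choice for this cluster. A common sizing heuristic (an assumption here, not from these notes) scales swap with RAM up to a cap:

```python
# Swap sizing sketch: swap = RAM for small machines, else half of
# RAM capped at a maximum. This heuristic is an assumption; the
# notes simply use a fixed 20G swapfile.

def swap_size_gb(ram_gb, cap_gb=32):
    if ram_gb <= 2:
        return ram_gb
    return min(ram_gb // 2, cap_gb)

print(swap_size_gb(64))  # 32
```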
################
Step 2: Open file limits
################
ulimit -n
sudo nano /etc/security/limits.conf
* soft nofile 64000
* hard nofile 64000
root soft nofile 64000
root hard nofile 64000
sudo nano /etc/pam.d/common-session
session required pam_limits.so
sudo nano /etc/pam.d/common-session-noninteractive
session required pam_limits.so
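The effect of the nofile settings above can be verified from inside a process with Python's stdlib `resource` module (Unix only):

```python
# Read the open-file limit that `ulimit -n` and limits.conf control.
# RLIMIT_NOFILE returns a (soft, hard) pair: the soft limit is what
# the process can actually use; it can be raised at most to the hard
# limit by an unprivileged process.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
```

After applying the limits.conf and pam_limits changes and re-logging in, this should report 64000 for both values.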
# Tune Kafka Cluster
```
producerConfig:
  buffer.memory: default
  # batch.size: "327679"
  batch.size: "655357"
  linger.ms: "5"
  compression.type: lz4
  retries: default
  send.buffer.bytes: default
  connections.max.idle.ms: default
```
Key producer parameters to tune: bootstrap.servers, batch.size, linger.ms, connections.max.idle.ms (e.g. 10000), compression.type, retries
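The settings above map directly onto the properties passed to a Kafka producer. A sketch using the standard dotted config names; the broker address is a hypothetical example and no broker is contacted:

```python
# Producer properties from the notes as a plain dict. Entries marked
# "default" are dropped so the client/broker defaults apply; only
# explicit overrides are passed. bootstrap.servers is a hypothetical
# example address.

raw = {
    "bootstrap.servers": "broker1:9092",   # hypothetical address
    "buffer.memory": "default",
    "batch.size": "655357",
    "linger.ms": "5",
    "compression.type": "lz4",
    "retries": "default",
    "send.buffer.bytes": "default",
    "connections.max.idle.ms": "default",
}

# Keep only the keys we explicitly override.
producer_config = {k: v for k, v in raw.items() if v != "default"}
print(producer_config)
```

Raising batch.size and setting a small linger.ms trades a few milliseconds of latency for much larger batches, which is where the throughput gain comes from.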
# Spark Cluster Hyperparameter Tuning
1) spark-shell flags:
```
./spark-shell \
  --conf spark.executor.memory=50g \
  --conf spark.driver.memory=150g \
  --conf spark.kryoserializer.buffer.max=256m \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.rpc.askTimeout=300s \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.sql.shuffle.partitions=1024
```
2) Job configuration:
```
spark.master: spark://master:7077
spark.deploy.mode: cluster
hdfsPath: hdfs://master:9000/home/spark/chetan/
spark.app.name: DemoAnalytics
```

spark-defaults.conf:
```
spark.master                   spark://master.prod.chetan.com:7077
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled         true
spark.history.fs.logDirectory  file:/data/tmp/spark-events
# spark.eventLog.dir           hdfs://namenode_host:namenode_port/user/spark/applicationHistory4
spark.eventLog.dir             file:/data/tmp/spark-events
```
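spark.sql.shuffle.partitions=1024 above is a fixed choice. A common sizing heuristic (an assumption here, not from these notes) targets roughly 128 MB of shuffle data per partition, bounded below by the cluster's total executor cores:

```python
# Shuffle partition sizing sketch: aim for ~128 MB per partition
# (an assumed rule of thumb), but never fewer partitions than the
# cluster has executor cores. Figures are hypothetical examples.

def shuffle_partitions(shuffle_data_mb, total_cores, target_mb=128):
    # -(-a // b) is ceiling division in integer arithmetic.
    return max(total_cores, -(-shuffle_data_mb // target_mb))

print(shuffle_partitions(shuffle_data_mb=131072, total_cores=200))  # 1024
```

Too few partitions starves cores and risks per-task OOM; too many adds scheduling overhead, so the floor at the core count matters on small inputs.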