# Data Science Workflow
● Understand the business
● Understand the data
● Cleanse the data
● Analyze the data
● Predict from the data
● Visualize the data
● Build insights that help grow business revenue
● Explain to executives (CxO)
● Make decisions
● Increase revenue
# Machine Learning Pipeline
1. Data Quality (remove noisy and missing data)
2. Feature Engineering
3. Choose the best model based on the nature of the data. For example: if the target is continuous, start with Linear Regression; if a categorical (binomial) prediction is required, use Logistic Regression; Random Forest (random sampling of the data plus feature randomization) tends to give better generalization; Gradient Boosted Trees build an optimal combination of trees as a weighted sum of the predictions of the individual trees.
   Try everything from Linear Regression up to Deep Learning (RNN, CNN).
4. Ensemble Models (Regression + Random Forest + XGBoost)
5. Tune hyperparameters (for a deep neural network: mini-batch size, learning rate, epochs, hidden layers)
6. Model Compression - port the model to embedded/mobile devices by compressing its matrices (sparsify, shrink, break, quantize)
7. Run on a smartphone
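Step 4's ensembling can be sketched as a weighted average of base-model predictions. The three toy "models" below are hand-rolled stand-ins, not fitted regressors; in practice they would be a trained linear regression, random forest, and XGBoost model.

```python
# Toy ensemble: weighted average of several base models' predictions.
# The base "models" are illustrative stand-ins for fitted regressors.

def linear_model(x):
    # stand-in for a fitted linear regression: y = 2x + 1
    return 2.0 * x + 1.0

def forest_model(x):
    # stand-in for a random forest's averaged-tree prediction
    return 2.1 * x + 0.8

def boosted_model(x):
    # stand-in for a gradient-boosted-trees prediction
    return 1.9 * x + 1.2

def ensemble_predict(x, weights=(1/3, 1/3, 1/3)):
    """Weighted average of the base models' predictions."""
    models = (linear_model, forest_model, boosted_model)
    return sum(w * m(x) for w, m in zip(weights, models))

print(ensemble_predict(10.0))  # averages 21.0, 21.8 and 20.2 -> 21.0
```

Equal weights are the simplest choice; in practice the weights are often tuned on a validation set.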
# Big Data Cluster Tuning
TCP TIME_WAIT interval - 4 min
TPS (Transactions Per Second)
max ports
max connections
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
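These two sysctls bound outbound TPS to a single destination: every closed connection holds an ephemeral port for the TIME_WAIT period, so sustainable new-connection rate is roughly the port range divided by the hold time. A rough back-of-the-envelope sketch (the defaults below are assumptions matching common Linux values):

```python
# Rough upper bound on new outbound connections/sec to one destination:
# each closed connection parks an ephemeral port in TIME_WAIT, so
#   max TPS ~ usable_ports / hold_seconds.

def max_tps(port_range=(32768, 60999), hold_seconds=60):
    """Estimate sustainable connection rate before port exhaustion.

    port_range mirrors net.ipv4.ip_local_port_range (common Linux
    default 32768-60999); hold_seconds approximates how long a closed
    socket keeps its port unavailable.
    """
    usable_ports = port_range[1] - port_range[0] + 1
    return usable_ports // hold_seconds

print(max_tps())                    # 470 with the assumed defaults
print(max_tps((1024, 65535), 30))   # wider range, shorter hold time
```

Widening `ip_local_port_range` and shortening the hold time both raise the ceiling, which is why these two knobs appear together in the notes above.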
Max threads - sysctl -a | grep threads-max
echo 120000 > /proc/sys/kernel/threads-max
cat /proc/sys/kernel/threads-max
Number of Threads = Total Virtual Memory / (Stack Size * 1024 * 1024)
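The thread-count formula can be checked numerically. The sample values below (total virtual memory, and a stack size matching a typical `ulimit -s` of 8192 KB) are illustrative assumptions:

```python
def max_threads(total_virtual_memory_bytes, stack_size_mb):
    """Number of threads = total virtual memory / (stack size * 1024 * 1024).

    Each thread reserves one stack's worth of virtual address space,
    so the address space budget caps how many threads can exist.
    stack_size_mb corresponds to `ulimit -s` expressed in MB.
    """
    return total_virtual_memory_bytes // (stack_size_mb * 1024 * 1024)

# e.g. 64 GiB of virtual memory and the common 8 MB default stack:
print(max_threads(64 * 1024**3, 8))  # 8192 threads
```

Shrinking the per-thread stack (`ulimit -s`) is therefore the usual lever when a service needs more threads than this bound allows.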
## JVM Heap Memory Setting
List RAM: free -m
Storage: df -h
ulimit -s  # stack size
ulimit -v  # virtual memory
echo 120000 > /proc/sys/kernel/threads-max
echo 600000 > /proc/sys/vm/max_map_count
echo 200000 > /proc/sys/kernel/pid_max
## Virtual Memory Configuration
#############
Step 1: Swap
#############
sudo fallocate -l 20G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo swapon -s
sudo nano /etc/fstab
/swapfile none swap sw 0 0
################
Step 2: Open files
################
ulimit -n
sudo nano /etc/security/limits.conf
* soft nofile 64000
* hard nofile 64000
root soft nofile 64000
root hard nofile 64000
sudo nano /etc/pam.d/common-session
session required pam_limits.so
sudo nano /etc/pam.d/common-session-noninteractive
session required pam_limits.so
# Tune Kafka Cluster
```
producerConfig:
  buffer.memory: default
  #batch.size: "327679"
  batch.size: "655357"
  linger.ms: "5"
  compression.type: lz4
  retries: default
  send.buffer.bytes: default
  connections.max.idle.ms: default
```
Key producer settings to tune:
bootstrap.servers
batch.size
linger.ms
connections.max.idle.ms = 10000
compression.type
retries
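The producer settings above can be collected into a single config mapping. A minimal sketch in Python: the keys are the standard Kafka producer property names, the broker address is a placeholder assumption, and entries marked "default" above are simply omitted so the client falls back to its built-in defaults.

```python
# Kafka producer properties mirroring the tuning above.
# Settings left at "default" are omitted; the broker address is a
# placeholder, not a real host.
producer_config = {
    "bootstrap.servers": "broker1:9092",  # placeholder broker
    "batch.size": 655357,                 # bytes buffered per batch
    "linger.ms": 5,                       # wait up to 5 ms to fill a batch
    "compression.type": "lz4",            # cheap CPU cost, good throughput
    "connections.max.idle.ms": 10000,     # close idle connections after 10 s
}

# A client such as confluent_kafka.Producer(producer_config) consumes
# a mapping of these property names directly.
for key, value in sorted(producer_config.items()):
    print(f"{key}={value}")
```

Raising `batch.size` together with a small `linger.ms` trades a few milliseconds of latency for much larger, better-compressed batches.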
# Spark Cluster Hyperparameter Tuning
1) Launch spark-shell with tuned settings:
```
./spark-shell \
  --conf spark.executor.memory=50g \
  --conf spark.driver.memory=150g \
  --conf spark.kryoserializer.buffer.max=256m \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.rpc.askTimeout=300s \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.sql.shuffle.partitions=1024
```
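The long flag list above is easy to drift out of sync across environments; a small sketch that renders the same command line from one mapping (the conf values simply mirror the flags above):

```python
# Assemble the ./spark-shell invocation from a single conf mapping,
# so the tuning values live in one place. Values mirror the flags above.
spark_confs = {
    "spark.executor.memory": "50g",
    "spark.driver.memory": "150g",
    "spark.kryoserializer.buffer.max": "256m",
    "spark.driver.maxResultSize": "1g",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.rpc.askTimeout": "300s",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.sql.shuffle.partitions": "1024",
}

def spark_shell_command(confs):
    """Render a ./spark-shell command line with one --conf per setting."""
    parts = ["./spark-shell"]
    for key, value in confs.items():
        parts.append(f"--conf {key}={value}")
    return " \\\n  ".join(parts)

print(spark_shell_command(spark_confs))
```

The same mapping could instead be written out as `spark-defaults.conf` lines, which is what the next section does by hand.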
2) Cluster deployment settings:
spark.master: spark://master:7077
spark.deploy.mode: cluster
hdfsPath: hdfs://master:9000/home/spark/chetan/
spark.app.name: DemoAnalytics

spark-defaults.conf:
```
spark.master spark://master.prod.chetan.com:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/data/tmp/spark-events
#spark.eventLog.dir=hdfs://namenode_host:namenode_port/user/spark/applicationHistory4
spark.eventLog.dir file:/data/tmp/spark-events
```