Tanzir Musabbir tmusabbir
@tmusabbir
tmusabbir / livy-example.sh
Created March 27, 2018 22:41
Sample commands for spark-submit using Apache Livy
# This is the usual sample spark-submit command to submit the SparkPi sample application
spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar
# Now submit the same job through Livy from the EMR master node (assuming the jar file is in the /test folder):
curl -X POST --data '{"file": "/test/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi"}' -H "Content-Type: application/json" localhost:8998/batches
# The previous example points to localhost because the job was submitted from the same host. To submit the job from a remote location:
curl -X POST --data '{"file": "/test/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi"}' -H "Content-Type: application/json" <<your-emr-master-dns>>:8998/batches
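# The batch submissions above return a JSON document describing the new batch.
# The following is a minimal sketch of building the request body and reading
# the batch state from a response. The helper names livy_batch_payload and
# livy_state, the LIVY_URL variable, and the batch id 0 are hypothetical and
# not part of the original gist; the state endpoint is assumed to be Livy's
# GET /batches/{batchId}/state.

```shell
# Hypothetical helper: build the JSON body for POST /batches
# from a jar path ($1) and a main class name ($2).
livy_batch_payload() {
  printf '{"file": "%s", "className": "%s"}\n' "$1" "$2"
}

# Hypothetical helper: read a Livy JSON response on stdin and
# print the value of its "state" field.
livy_state() {
  sed -n 's/.*"state" *: *"\([^"]*\)".*/\1/p'
}

payload=$(livy_batch_payload /test/spark-examples.jar org.apache.spark.examples.SparkPi)
echo "$payload"

# With a reachable Livy server (LIVY_URL is an assumption), one could then run:
#   curl -s -X POST --data "$payload" -H "Content-Type: application/json" "$LIVY_URL/batches"
#   curl -s "$LIVY_URL/batches/0/state" | livy_state
```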
# Now assume the jar file is in an S3 location. In that case, you can do the following:
tmusabbir / capacity-scheduler.json
Created March 26, 2018 04:19
Sample YARN Capacity Scheduler config
{
  "Classification": "capacity-scheduler",
  "Properties": {
    "yarn.scheduler.capacity.root.queues": "default,dev,qa",
    "yarn.scheduler.capacity.root.default.capacity": "20",
    "yarn.scheduler.capacity.root.default.maximum-capacity": "50",
    "yarn.scheduler.capacity.root.dev.capacity": "40",
    "yarn.scheduler.capacity.root.dev.maximum-capacity": "100",
    "yarn.scheduler.capacity.root.qa.capacity": "40",
    "yarn.scheduler.capacity.root.qa.maximum-capacity": "80"
  }
}
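The capacity values for the default, dev, and qa queues (20, 40, 40) add up to 100, which the Capacity Scheduler generally expects of sibling queues under the same parent. A minimal sketch of that sanity check, assuming the values are copied by hand from the config; the helper name sum_capacities is hypothetical and not part of the original gist:

```shell
# Hypothetical helper: add up the queue capacity percentages
# passed as arguments and print the total.
sum_capacities() {
  total=0
  for c in "$@"; do
    total=$((total + c))
  done
  echo "$total"
}

# Capacities of root.default, root.dev, and root.qa from the config.
sum=$(sum_capacities 20 40 40)
if [ "$sum" -eq 100 ]; then
  echo "OK: queue capacities sum to 100"
else
  echo "ERROR: queue capacities sum to $sum, expected 100" >&2
fi
```

The maximum-capacity values (50, 100, 80) are left out of the sum on purpose: they cap how far each queue may grow beyond its share and do not need to total 100.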
tmusabbir / create-spark-cluster.sh
Created March 26, 2018 03:19
AWS CLI command to create EMR cluster with default auto-scaling task group
aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --termination-protected --applications Name=Hadoop Name=Hive Name=Spark --ebs-root-volume-size 10 --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-xxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxx"}' --service-role EMR_DefaultRole --enable-debugging --release-label emr-5.12.0 --log-uri 's3n://aws-logs-xxxx/elasticmapreduce/' --name 'spark-cluster' --instance-groups '[{"InstanceCount":2,"BidPrice":"0.30","AutoScalingPolicy":{"Constraints":{"MinCapacity":0,"MaxCapacity":20},"Rules":[{"Action":{"SimpleScalingPolicyConfiguration":{"ScalingAdjustment":2,"CoolDown":300,"AdjustmentType":"CHANGE_IN_CAPACITY"}},"Description":"","Trigger":{"CloudWatchAlarmDefinition":{"MetricName":"YARNMemoryAvailablePercentage","ComparisonOperator":"LESS_THAN","Statistic":"AVERAGE","Period":300,"Dimensions":[{"Value":"${emr.clusterId}","Key":"JobFlowId"}],"EvaluationPeriods":1,"Unit":"PERCENT","Na
tmusabbir / Hive archive with Oozie
Created June 25, 2014 17:17
Hive archive with Oozie
Hive Archiving/Maintenance with the help of Oozie
tmusabbir / a#Cassandra Performance Tuning
Last active January 3, 2016 00:19
Cassandra Performance Tuning
tmusabbir / a#Cassandra Stress Test
Last active January 2, 2016 23:49
Cassandra Stress Test
tmusabbir / a:Install Kafka in CentOS
Last active July 8, 2016 06:21
Install Kafka in CentOS
tmusabbir / a:Install Replicated ZooKeeper in CentOS
Last active December 26, 2015 11:59
Install Replicated ZooKeeper in CentOS
tmusabbir / a:Install Opscenter in CentOS environment
Last active December 25, 2015 18:19
Install Opscenter in CentOS environment
tmusabbir / a:Setup a Storm cluster on Amazon EC2
Last active September 11, 2018 08:28
Setup a Storm cluster on Amazon EC2