Last active September 20, 2019 15:37
Run Jupyter Notebook and JupyterHub on Amazon EMR

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>
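In the launch command, this option and its value are passed as two separate strings in the bootstrap action's Args array. A minimal sketch of building that fragment in the shell; the variable names NOTEBOOK_DIR and NOTEBOOK_ARGS and the bucket path are placeholders for illustration, not part of the bootstrap script:

```shell
#!/bin/sh
# Placeholder location: substitute your own S3 bucket/folder or a local directory.
NOTEBOOK_DIR="s3://your-bucket/folder/"

# The option name and its value are separate JSON strings in the Args array.
NOTEBOOK_ARGS=$(printf '"--notebook-dir","%s"' "$NOTEBOOK_DIR")

# Print the fragment so it can be pasted into the Args list.
echo "$NOTEBOOK_ARGS"
```

The printed fragment can then be placed alongside the other entries of Args in the --bootstrap-actions JSON.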

The following example CLI command launches a three-node (m3.xlarge) EMR 5.6.0 cluster, one master and two core nodes, with the bootstrap action (BA). The BA installs all the available kernels. It also installs the pandas and ggplot Python packages and sets:

the Jupyter port to 8880
the password to jupyter
the JupyterHub port to 8001

aws emr create-cluster \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark Name=Ganglia Name=Presto Name=Tez \
  --bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/","Args":["--toree","--ds-packages","--ml-packages","--python-packages","pandas ggplot","--port","8880","--jupyterhub","--jupyterhub-port","8001","--spark-opts","--packages=com.typesafe:config:1.3.1,org.datasyslab:geospark:0.8.0,com.vividsolutions:jts:1.13,com.databricks:spark-avro_2.11:3.0.0,org.elasticsearch:elasticsearch-spark_2.11:2.4.0","--notebook-dir","s3://","--cached-install","--s3fs","--python3"],"Name":"Install Jupyter notebook"}]' \
  --ec2-attributes '{"KeyName":"<your-ec2-key>","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-1b58686f","EmrManagedSlaveSecurityGroup":"sg-2418c05e","EmrManagedMasterSecurityGroup":"sg-79e63e03"}' \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --release-label emr-5.6.0 \
  --log-uri 's3n://aws-logs-452442550777-us-west-2/elasticmapreduce/' \
  --name 'Jupyter Notebook' \
  --instance-groups '[{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"}]' \
  --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
  --region us-west-2

Replace <your-ec2-key> with the name of your EC2 key pair, and complete the s3:// value passed to --notebook-dir with the S3 bucket where you store notebooks. You can also change the instance types to suit your needs and budget.
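To make the instance types easier to change, the --instance-groups JSON can be assembled from shell variables. This is a sketch under stated assumptions: the variable names MASTER_TYPE, CORE_TYPE, CORE_COUNT and INSTANCE_GROUPS are hypothetical, and the default values simply reproduce the example command:

```shell
#!/bin/sh
# Hypothetical variables; adjust to suit your needs and budget.
MASTER_TYPE="m3.xlarge"
CORE_TYPE="m3.xlarge"
CORE_COUNT=2

# Build the same JSON structure that the example passes to --instance-groups.
INSTANCE_GROUPS=$(printf '[{"InstanceCount":%s,"InstanceGroupType":"CORE","InstanceType":"%s","Name":"Core - %s"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"%s","Name":"Master - 1"}]' \
  "$CORE_COUNT" "$CORE_TYPE" "$CORE_COUNT" "$MASTER_TYPE")

# Print the value; nothing is launched here.
echo "$INSTANCE_GROUPS"
```

The result would then be passed as --instance-groups "$INSTANCE_GROUPS" in place of the literal JSON string.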


Hi yuanzhaoYZ, is there a way to connect this to Apache Livy to manage the EMR Spark cluster?

