Install on AWS EMR
1. We can adjust some Hadoop settings with a config.json, for example the HDFS block size (a sketch of passing the file at cluster creation follows the block):
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.blocksize": "67108864"
    }
  }
]
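
If the cluster is created from the AWS CLI rather than the console, the file can be passed with `--configurations`. A minimal sketch, assuming placeholder cluster name, key pair, release label, and instance settings:

```sh
# Launch an EMR cluster with config.json applied; all names and sizes are placeholders.
aws emr create-cluster \
  --name "parallelvid-cluster" \
  --release-label emr-5.5.0 \
  --applications Name=Hadoop Name=HBase Name=Spark \
  --configurations file://config.json \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```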
2. Choose HBase, Spark, Hadoop, etc., and wait until the cluster state changes to Running.
3. If we need to install many dependencies, we have to enlarge the EBS volume, because the default root partition is only 10 GB. After the modification, run `sudo resize2fs /dev/xvda1` to resize the root partition (a resize sketch follows below).

We may attempt to copy only the shared libraries to reduce the installation size. The MapR distribution has a setting to resize the EBS volume, while the Amazon AMI does not.
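
A minimal sketch of the resize, assuming the root device is `/dev/xvda1` as above (it may differ on other instance types):

```sh
# Check sizes, grow the root filesystem, then verify.
lsblk                      # block devices and partition sizes
df -h /                    # root filesystem size before resizing
sudo resize2fs /dev/xvda1  # grow the ext4 filesystem to fill the enlarged volume
df -h /                    # confirm the extra space is available
```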

4. Each node has the AWS CLI built in, so we can use EC2Box to run the install scripts.

EMR steps do not work here, since a step only runs on the master node (e.g., a custom JAR step with s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar and the argument s3://parallelvid/install.sh). A bootstrap action can run a script on all nodes instead, but it may corrupt the subsequent cluster installation, since bootstrap actions run before the cluster software is set up; see the sketch below.
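
A hedged sketch of the two options; the cluster id is a placeholder and `install.sh` is assumed to live at the S3 path above:

```sh
# Option 1: bootstrap action -- add this flag to the create-cluster command above.
# install.sh then runs on every node, but before EMR installs Hadoop/HBase/Spark.
#   --bootstrap-actions Path=s3://parallelvid/install.sh

# Option 2: step via script-runner -- runs after setup, but only on the master node.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=install,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://parallelvid/install.sh]'
```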

5. Then a setup script can be run on the master node to set up HDFS, HBase, etc.

For example, start the HBase Thrift server: `sudo -E /usr/lib/hbase/bin/hbase-daemon.sh start thrift -p 9097 --infoport 9098`. The ports may conflict with other software, so check the logs before proceeding.
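
A quick check after starting the daemon, with the port numbers from above; the exact log file name depends on the user and host, so the wildcard is a guess:

```sh
# Confirm the Thrift server and its info port are listening.
sudo netstat -tlnp | grep -E ':(9097|9098)'

# Look for port conflicts or startup errors in the daemon log.
tail -n 50 /var/log/hbase/hbase-*-thrift-*.log
```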

6. The EMR distribution is based on Apache Bigtop. You can find the related installed libraries under /usr/lib/.
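
For example, to see the installed components and the Bigtop-style packages behind them (package names vary with the EMR release):

```sh
# List the application directories shipped with the EMR release.
ls -d /usr/lib/hadoop* /usr/lib/hbase /usr/lib/spark 2>/dev/null

# Show the corresponding packages (Amazon Linux uses rpm/yum).
rpm -qa | grep -iE 'hadoop|hbase|spark' | sort
```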