@heuermh
Created March 15, 2017 19:56

ADAM on AWS using CGCloud

Create an EC2 instance as a CGCloud gateway

Installing CGCloud on a Mac can be troublesome, so installing it on a Linux VM or an EC2 instance on AWS may be easier. Our client chose the latter, installing it on a t2.small instance that is always available.

$ ssh -A cgcloud.foo.com

From the CGCloud gateway, create spark-box image

After ssh'ing to the CGCloud gateway instance, configure ssh keys and AWS credentials.

$ aws configure
AWS Access Key ID [None]: A...
AWS Secret Access Key [None]: M...

$ cgcloud register-key ~/.ssh/${username}.pub

Then create the spark-box CGCloud image. The zone, VPC, and subnet options were required for the client's VPC and may not be required for other AWS accounts. This step only needs to be done once.

$ cgcloud create \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --create-image \
    --terminate \
    spark-box
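Because the same zone, VPC, and subnet options recur in every cgcloud command in this guide, it may help to define them once. The following is a minimal sketch only: the function name cgcloud_net_opts and the variables ADAM_AWS_ZONE, ADAM_AWS_VPC, and ADAM_AWS_SUBNET are hypothetical names chosen for illustration, not settings that CGCloud itself defines.

```shell
# Hypothetical helper: prints the --zone/--vpc/--subnet options shared by the
# cgcloud commands in this guide. The three variable names are assumptions
# made for this sketch; set them to the values for your own AWS account.
cgcloud_net_opts() {
    if [ -z "${ADAM_AWS_ZONE:-}" ] || [ -z "${ADAM_AWS_VPC:-}" ] || [ -z "${ADAM_AWS_SUBNET:-}" ]; then
        echo "set ADAM_AWS_ZONE, ADAM_AWS_VPC and ADAM_AWS_SUBNET first" >&2
        return 1
    fi
    # Emit the options on one line so they can be spliced into a command.
    printf '%s\n' "--zone $ADAM_AWS_ZONE --vpc $ADAM_AWS_VPC --subnet $ADAM_AWS_SUBNET"
}
```

With the three variables set, the image could then be created with `cgcloud create $(cgcloud_net_opts) --create-image --terminate spark-box`. The substitution is left unquoted so the shell splits it into separate options, which assumes none of the values contain spaces.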

Create an Apache Spark cluster with CGCloud

Use CGCloud to create an Apache Spark cluster from the spark-box image created above. Specify --num-workers and --instance-type as appropriate.

$ cgcloud create-cluster \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --cluster-name cluster1 \
    --num-workers 2 \
    --instance-type m3.large \
    spark

Install additional binaries to the spark-master node as necessary. Note the --admin argument required for sudo access.

$ cgcloud ssh \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --admin \
    spark-master \
    sudo apt-get update

$ cgcloud ssh \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --admin \
    spark-master \
    sudo apt-get install wget git emacs

SSH into the spark-master node to use Apache Spark via spark-submit or spark-shell.

$ cgcloud ssh \
    --zone us-east-1e \
    --vpc ... \
    --subnet ... \
    --cluster-name cluster1 \
    spark-master

Install ADAM on Apache Spark cluster created with CGCloud

After creating the Apache Spark cluster with CGCloud, install a binary distribution of ADAM or the development version of ADAM from git HEAD.

Install a binary distribution of ADAM

Binary distributions of ADAM are available from the Releases page on GitHub

https://github.com/bigdatagenomics/adam/releases

and from the Maven Central repository

http://search.maven.org/#search%7Cga%7C1%7Cadam-distribution

From the spark-master node, download the ADAM binary distribution built for Spark version 1.x and Scala version 2.10 to match the versions deployed by CGCloud (currently Spark version 1.6.2 and Scala version 2.10).

$ wget https://repo1.maven.org/maven2/org/bdgenomics/adam/adam-distribution_2.10/0.21.0/adam-distribution_2.10-0.21.0-bin.tar.gz

Then decompress and extract the ADAM distribution.

$ tar -xvzf adam-distribution_2.10-0.21.0-bin.tar.gz
$ cd adam-distribution_2.10-0.21.0
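The download URL and the extracted directory name above both follow from the Scala and ADAM versions, so they can be derived rather than typed by hand. This is a sketch under the assumption that other releases use the same Maven Central layout as the single URL shown above, which is not verified here; adam_dist_url and adam_dist_dir are hypothetical helper names.

```shell
# Builds a Maven Central URL for an ADAM binary distribution.
# usage: adam_dist_url SCALA_VERSION ADAM_VERSION
# Assumption: the path layout is inferred from the one URL in this guide.
adam_dist_url() {
    scala=$1
    version=$2
    base=https://repo1.maven.org/maven2/org/bdgenomics/adam
    printf '%s\n' "${base}/adam-distribution_${scala}/${version}/adam-distribution_${scala}-${version}-bin.tar.gz"
}

# Prints the directory a distribution tarball is expected to extract to,
# assuming it is the file name without the -bin.tar.gz suffix.
# usage: adam_dist_dir URL_OR_FILE_NAME
adam_dist_dir() {
    name=${1##*/}                      # strip any leading path or URL
    printf '%s\n' "${name%-bin.tar.gz}"
}
```

For example, `wget "$(adam_dist_url 2.10 0.21.0)"` would fetch the same file as the command above.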

The ADAM version should report a release version (currently 0.21.0).

$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
 
       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b
 
ADAM version: 0.21.0
Built for: Scala 2.10.6 and Hadoop 2.7.3
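The check that a release version is reported can also be scripted. The sketch below relies only on the "ADAM version:" line shown in the output above; adam_version and is_release_version are hypothetical helper names, and treating any version without a -SNAPSHOT suffix as a release is an assumption of this example.

```shell
# Reads adam-submit --version output on stdin and prints the version string.
adam_version() {
    sed -n 's/^ADAM version: //p'
}

# Succeeds when the given version is non-empty and has no -SNAPSHOT suffix.
is_release_version() {
    case $1 in
        "")          return 1 ;;
        *-SNAPSHOT)  return 1 ;;
        *)           return 0 ;;
    esac
}
```

A possible use is `is_release_version "$(./bin/adam-submit --version | adam_version)"`, which exits non-zero for a development build.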

Install the development version of ADAM from git HEAD

From the spark-master node, to build the development version of ADAM from git HEAD, first install Apache Maven version 3.3.9.

The most recent version of Apache Maven in the Ubuntu repositories is not recent enough to build ADAM from source.

$ wget http://apache.osuosl.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
$ tar xvfz apache-maven-3.3.9-bin.tar.gz
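Before downloading Maven, it can be worth checking whether an already installed version is recent enough. This guide does not state the minimum version ADAM needs, so the sketch below simply assumes 3.3.9, the version installed above, as the threshold. version_at_least is a hypothetical helper that compares dotted numeric versions of up to three fields.

```shell
# Succeeds when version HAVE is greater than or equal to version WANT.
# usage: version_at_least HAVE WANT   (for example: version_at_least 3.5.0 3.3.9)
# Fields are compared numerically, so 3.10.0 counts as newer than 3.3.9.
version_at_least() {
    have=$1
    want=$2
    old_ifs=$IFS
    IFS=.
    # Left unquoted on purpose so the shell splits the versions on dots.
    set -- $have
    h1=${1:-0}; h2=${2:-0}; h3=${3:-0}
    set -- $want
    w1=${1:-0}; w2=${2:-0}; w3=${3:-0}
    IFS=$old_ifs
    if [ "$h1" -ne "$w1" ]; then [ "$h1" -gt "$w1" ]; return $?; fi
    if [ "$h2" -ne "$w2" ]; then [ "$h2" -gt "$w2" ]; return $?; fi
    [ "$h3" -ge "$w3" ]
}
```

For example, `version_at_least "$installed" 3.3.9 || echo "Maven too old"` would print the warning only when the hypothetical variable `installed` holds an older version.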

Then clone the ADAM git repository and package using Apache Maven.

The ADAM unit tests require a long time to complete, so the -DskipTests=true argument may be useful.

$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam/
$ ../apache-maven-3.3.9/bin/mvn package -DskipTests=true

The ADAM version should report a SNAPSHOT version (currently 0.21.1-SNAPSHOT). Confirm that the "Built for" line reports Apache Spark version 1.6.x.

$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/opt/sparkbox/spark/bin/spark-submit
17/03/08 04:26:20 INFO cli.ADAMMain: ADAM invoked with args: "--version"
 
       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b
 
ADAM version: 0.21.1-SNAPSHOT
Commit: 07c1982fb6fafb959b96d4c3cb83fa7edbe25721 Build: 2017-03-07
Built for: Apache Spark 1.6.3, Scala 2.10.6, and Hadoop 2.7.3
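The Spark version check can be automated in the same spirit. This sketch assumes the "Built for: Apache Spark ..." line has exactly the format shown in the output above; built_for_spark and spark_is_1_6 are hypothetical helper names.

```shell
# Reads adam-submit --version output on stdin and prints the Spark version
# from the "Built for" line, or nothing when that line names no Spark version.
built_for_spark() {
    sed -n 's/^Built for: Apache Spark \([0-9][0-9.]*\),.*/\1/p'
}

# Succeeds when the given version belongs to the Spark 1.6 series.
spark_is_1_6() {
    case $1 in
        1.6|1.6.*) return 0 ;;
        *)         return 1 ;;
    esac
}
```

A possible use is `spark_is_1_6 "$(./bin/adam-submit --version | built_for_spark)"`, which exits non-zero when the build targets a different Spark series.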