heuermh/cgcloud.md

## cgcloud.md

      
ADAM on AWS using CGCloud

Create an EC2 instance as a CGCloud gateway

Installing CGCloud on a Mac can be troublesome, so installing it on a Linux VM or an EC2 instance on AWS may be easier.
Our client chose the latter, installing to a t2.small instance that is always available.
$ ssh -A cgcloud.foo.com
From the CGCloud gateway, create the spark-box image

After ssh'ing to the CGCloud gateway instance, configure ssh keys and AWS credentials.
$ aws configure
AWS Access Key ID [None]: A...
AWS Secret Access Key [None]: M...

$ cgcloud register-key ~/.ssh/${username}.pub
Then create the spark-box CGCloud image.  The zone, VPC, and subnet options were required for the client's VPC
and may not be required for other AWS accounts.  This only needs to be done once.
$ cgcloud create \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --create-image \
    --terminate \
    spark-box
Create an Apache Spark cluster with CGCloud

Use CGCloud to create an Apache Spark cluster from the spark-box image created above.  Specify --num-workers and
--instance-type as appropriate.
$ cgcloud create-cluster \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --cluster-name cluster1 \
    --num-workers 2 \
    --instance-type m3.large \
    spark
Install additional binaries to the spark-master node as necessary.  Note the --admin argument required for sudo access.
$ cgcloud ssh \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --admin \
    spark-master \
    sudo apt-get update

$ cgcloud ssh \
    --zone ... \
    --vpc ... \
    --subnet ... \
    --admin \
    spark-master \
    sudo apt-get install wget git emacs
SSH into the spark-master node to use Apache Spark via spark-submit or spark-shell.
$ cgcloud ssh \
    --zone us-east-1e \
    --vpc ... \
    --subnet ... \
    --cluster-name cluster1 \
    spark-master
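
The cgcloud commands in this section repeat the same --zone, --vpc, and --subnet options. A minimal sketch of one way to keep them in a single place follows. The helper functions and the ZONE, VPC, and SUBNET variable names are hypothetical and not part of CGCloud; the values stay whatever your AWS account requires. The second function only prints the command (a dry run), so nothing is executed against AWS.

```shell
# Hypothetical helper (not part of CGCloud): print the network options
# shared by the cgcloud commands. ZONE, VPC, and SUBNET are placeholder
# variable names that must be set by the caller.
cgcloud_net_opts() {
    printf '%s %s %s %s %s %s\n' --zone "$ZONE" --vpc "$VPC" --subnet "$SUBNET"
}

# Dry run: print the ssh command for the given cluster name instead of
# running it, so the assembled options can be inspected first.
print_ssh_master_cmd() {
    echo "cgcloud ssh $(cgcloud_net_opts) --cluster-name $1 spark-master"
}
```

Once the printed command looks right, the options could be reused directly, for example `cgcloud ssh $(cgcloud_net_opts) --cluster-name cluster1 spark-master`. The unquoted substitution relies on word splitting, so this only works when the values contain no whitespace.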

Install ADAM on Apache Spark cluster created with CGCloud

After creating the Apache Spark cluster with CGCloud, install a binary distribution of ADAM
or the development version of ADAM from git HEAD.
Install a binary distribution of ADAM

Binary distributions of ADAM are available from the Releases page on GitHub
https://github.com/bigdatagenomics/adam/releases
and from the Maven Central repository
http://search.maven.org/#search%7Cga%7C1%7Cadam-distribution
From the spark-master node, download the ADAM binary distribution built for Spark version 1.x and Scala version 2.10 to match the versions deployed by CGCloud (currently Spark version 1.6.2 and Scala version 2.10).
$ wget https://repo1.maven.org/maven2/org/bdgenomics/adam/adam-distribution_2.10/0.21.0/adam-distribution_2.10-0.21.0-bin.tar.gz
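Optionally, the download can be checked against a checksum before it is extracted. The sketch below assumes that sha1sum (GNU coreutils) is installed on the node and that a SHA-1 digest for the tarball is available, for instance from a .sha1 file published next to the artifact on Maven Central; that location is an assumption and should be confirmed for this artifact. verify_sha1 is a hypothetical helper name.

```shell
# Hypothetical helper: compare a file's SHA-1 digest with an expected
# hex string. Returns 0 on a match and 1 otherwise.
# Assumes sha1sum from GNU coreutils is on the PATH.
verify_sha1() {
    file=$1
    expected=$2
    # sha1sum prints "<digest>  <filename>"; keep only the digest
    actual=$(sha1sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Example usage, assuming the digest was saved to a .sha1 file:
#   f=adam-distribution_2.10-0.21.0-bin.tar.gz
#   verify_sha1 "$f" "$(awk '{print $1}' "$f.sha1")" && echo "checksum OK"
```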
Then unzip and extract the ADAM distribution.
$ tar -xvzf adam-distribution_2.10-0.21.0-bin.tar.gz
$ cd adam-distribution_2.10-0.21.0
The ADAM version should report a release version (currently 0.21.0).
$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
 
       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b
 
ADAM version: 0.21.0
Built for: Scala 2.10.6 and Hadoop 2.7.3
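
When this check is scripted, the version can be pulled out of the banner rather than read by eye. The following is a minimal sketch that assumes the output keeps the "ADAM version: ..." line format shown above; adam_version and is_release_version are hypothetical helper names.

```shell
# Read `adam-submit --version` output on stdin and print only the
# version string from the "ADAM version: ..." line.
adam_version() {
    sed -n 's/^ADAM version: //p'
}

# Succeed when the given version is a release, i.e. it is non-empty
# and does not end in -SNAPSHOT.
is_release_version() {
    case "$1" in
        "" | *-SNAPSHOT) return 1 ;;
        *) return 0 ;;
    esac
}

# Example usage:
#   version=$(./bin/adam-submit --version 2>&1 | adam_version)
#   is_release_version "$version" && echo "release build: $version"
```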
Install the development version of ADAM from git HEAD

To build the development version of ADAM from git HEAD on the spark-master node, first install Apache Maven version 3.3.9.
The most recent version of Apache Maven in the Ubuntu repositories is not recent enough to build ADAM from source.
$ wget http://apache.osuosl.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
$ tar xvfz apache-maven-3.3.9-bin.tar.gz
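Before building, it may be worth confirming which Maven version a given mvn binary reports. The sketch below parses the first line of `mvn -version` output and compares it with a minimum version. Treating 3.3.9 as the minimum is an assumption taken from the version installed here, not a documented ADAM requirement; maven_version and version_ge are hypothetical helper names, and `sort -V` requires GNU coreutils.

```shell
# Read `mvn -version` output on stdin and print the version number
# from its "Apache Maven X.Y.Z ..." line.
maven_version() {
    sed -n 's/^Apache Maven \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when version $1 is greater than or equal to version $2.
# Uses version sort: the smaller of the two sorts first, so $1 >= $2
# exactly when the first sorted line equals $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (3.3.9 as the minimum is an assumption):
#   v=$(./apache-maven-3.3.9/bin/mvn -version | maven_version)
#   version_ge "$v" 3.3.9 && echo "Maven $v is recent enough"
```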
Then clone the ADAM git repository and package using Apache Maven.
The ADAM unit tests require a long time to complete, so the -DskipTests=true argument may be useful.
$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam/
$ ../apache-maven-3.3.9/bin/mvn package -DskipTests=true
The ADAM version should report a SNAPSHOT version (currently 0.21.1-SNAPSHOT).  Confirm that the "Built for" line reports Apache Spark version 1.6.x.
$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/opt/sparkbox/spark/bin/spark-submit
17/03/08 04:26:20 INFO cli.ADAMMain: ADAM invoked with args: "--version"
 
       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b
 
ADAM version: 0.21.1-SNAPSHOT
Commit: 07c1982fb6fafb959b96d4c3cb83fa7edbe25721 Build: 2017-03-07
Built for: Apache Spark 1.6.3, Scala 2.10.6, and Hadoop 2.7.3
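
The Spark version check can also be automated. This sketch assumes the "Built for: Apache Spark ..." line format shown in the output above; built_for_spark and is_spark_16 are hypothetical helper names.

```shell
# Read `adam-submit --version` output on stdin and print the Spark
# version from the "Built for: Apache Spark X.Y.Z, ..." line.
# Prints nothing when the line does not mention Apache Spark.
built_for_spark() {
    sed -n 's/^Built for: Apache Spark \([0-9][0-9.]*\),.*/\1/p'
}

# Succeed when the given version belongs to the 1.6.x series.
is_spark_16() {
    case "$1" in
        1.6.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
#   spark=$(./bin/adam-submit --version 2>&1 | built_for_spark)
#   is_spark_16 "$spark" || echo "unexpected Spark version: $spark"
```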