Installing CGCloud on a Mac can be troublesome, so it may be easier to install it on a Linux VM or on an EC2 instance on AWS. Our client chose the latter, installing to a t2.small instance that is always available.
$ ssh -A cgcloud.foo.com
After ssh'ing to the CGCloud gateway instance, configure ssh keys and AWS credentials.
$ aws configure
AWS Access Key ID [None]: A...
AWS Secret Access Key [None]: M...
$ cgcloud register-key ~/.ssh/${username}.pub
Then create the spark-box CGCloud image. This only needs to be done once. The zone, VPC, and subnet options were required for the client's VPC and may not be required for other AWS accounts.
$ cgcloud create \
--zone ... \
--vpc ... \
--subnet ... \
--create-image \
--terminate \
spark-box
Use CGCloud to create an Apache Spark cluster from the spark-box image created above. Specify --num-workers and --instance-type as appropriate.
$ cgcloud create-cluster \
--zone ... \
--vpc ... \
--subnet ... \
--cluster-name cluster1 \
--num-workers 2 \
--instance-type m3.large \
spark
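The --zone, --vpc, and --subnet options recur in the cgcloud commands in this walkthrough. A minimal sketch of a shell wrapper that supplies them once follows. The function name cgcloud_net and the variables CGCLOUD_BIN, CGCLOUD_ZONE, CGCLOUD_VPC, and CGCLOUD_SUBNET are hypothetical names introduced here for illustration; they are not part of CGCloud, and the "..." defaults are the same placeholders used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: none of these variable names are defined by CGCloud.
# Set them to the values required for your own AWS account.
: "${CGCLOUD_BIN:=cgcloud}"   # command to run; set to "echo" for a dry run
: "${CGCLOUD_ZONE:=...}"      # placeholder, as in the commands above
: "${CGCLOUD_VPC:=...}"
: "${CGCLOUD_SUBNET:=...}"

# Usage: cgcloud_net SUBCOMMAND [ARGS...]
# Runs "cgcloud SUBCOMMAND" with the shared network options placed
# directly after the subcommand, followed by the remaining arguments.
cgcloud_net() {
  local subcommand="$1"
  shift
  "$CGCLOUD_BIN" "$subcommand" \
    --zone "$CGCLOUD_ZONE" \
    --vpc "$CGCLOUD_VPC" \
    --subnet "$CGCLOUD_SUBNET" \
    "$@"
}
```

With the variables set, cgcloud_net create-cluster --cluster-name cluster1 --num-workers 2 --instance-type m3.large spark would run the cluster creation command shown above. Setting CGCLOUD_BIN=echo prints the arguments instead of running anything.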
Install additional binaries to the spark-master node as necessary. Note the --admin argument, which is required for sudo access.
$ cgcloud ssh \
--zone ... \
--vpc ... \
--subnet ... \
--admin \
spark-master \
sudo apt-get update
$ cgcloud ssh \
--zone ... \
--vpc ... \
--subnet ... \
--admin \
spark-master \
sudo apt-get install wget git emacs
Ssh into the spark-master node to use Apache Spark via spark-submit or spark-shell.
$ cgcloud ssh \
--zone us-east-1e \
--vpc ... \
--subnet ... \
--cluster-name cluster1 \
spark-master
After creating the Apache Spark cluster with CGCloud, install a binary distribution of ADAM or the development version of ADAM from git HEAD.
Binary distributions of ADAM are available from the Releases page on GitHub (https://github.com/bigdatagenomics/adam/releases) and from the Maven Central repository (http://search.maven.org/#search%7Cga%7C1%7Cadam-distribution).
From the spark-master node, download the ADAM binary distribution built for Spark version 1.x and Scala version 2.10 to match the versions deployed by CGCloud (currently Spark version 1.6.2 and Scala version 2.10).
$ wget https://repo1.maven.org/maven2/org/bdgenomics/adam/adam-distribution_2.10/0.21.0/adam-distribution_2.10-0.21.0-bin.tar.gz
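The download URL above follows a regular pattern of Scala version and ADAM version, so it can be assembled rather than typed. The sketch below does that and adds an optional integrity check. The function names adam_dist_url and sha1_matches are hypothetical, and the commented example assumes that Maven Central serves a .sha1 file next to the artifact and that GNU sha1sum is installed; verify both before relying on it.

```shell
#!/usr/bin/env bash
# Build the Maven Central URL for an ADAM binary distribution.
# Usage: adam_dist_url SCALA_VERSION ADAM_VERSION
adam_dist_url() {
  local scala="$1" adam="$2"
  local base="https://repo1.maven.org/maven2/org/bdgenomics/adam"
  local artifact="adam-distribution_${scala}"
  printf '%s/%s/%s/%s-%s-bin.tar.gz\n' \
    "$base" "$artifact" "$adam" "$artifact" "$adam"
}

# Compare a file's SHA-1 digest with an expected hex digest.
# Usage: sha1_matches FILE EXPECTED_HEX   (exit status 0 on match)
sha1_matches() {
  local actual
  actual="$(sha1sum "$1" | cut -d ' ' -f 1)"
  [ "$actual" = "$2" ]
}

# Example, commented out so that sourcing this file downloads nothing.
# Assumes a "<artifact>.sha1" file is published alongside the artifact.
# url="$(adam_dist_url 2.10 0.21.0)"
# wget "$url" "$url.sha1"
# file="$(basename "$url")"
# sha1_matches "$file" "$(cut -d ' ' -f 1 "$file.sha1")" && echo "checksum OK"
```

Keeping the Scala and ADAM versions as arguments makes it harder to fetch a distribution that does not match the Spark and Scala versions deployed by CGCloud.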
Then decompress and extract the ADAM distribution.
$ tar -xvzf adam-distribution_2.10-0.21.0-bin.tar.gz
$ cd adam-distribution_2.10-0.21.0
Running adam-submit --version should report a release version (currently 0.21.0).
$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
e 888~-_ e e e
d8b 888 \ d8b d8b d8b
/Y88b 888 | /Y88b d888bdY88b
/ Y88b 888 | / Y88b / Y88Y Y888b
/____Y88b 888 / /____Y88b / YY Y888b
/ Y88b 888_-~ / Y88b / Y888b
ADAM version: 0.21.0
Built for: Scala 2.10.6 and Hadoop 2.7.3
To build the development version of ADAM from git HEAD on the spark-master node, first install Apache Maven version 3.3.9.
The version of Apache Maven packaged in the Ubuntu repositories is too old to build ADAM from source.
$ wget http://apache.osuosl.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
$ tar xvfz apache-maven-3.3.9-bin.tar.gz
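To check whether a given Maven installation is recent enough, its reported version can be compared against a minimum. The sketch below is one way to do that. Treating 3.3.9 as the minimum is an assumption taken from the version installed above, not a documented ADAM requirement; the function names version_at_least and maven_version are hypothetical, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Uses GNU sort -V so that, for example, 3.10.0 sorts after 3.3.9.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read `mvn --version` output on standard input and print the version
# number from its first line, e.g. "Apache Maven 3.3.9 (...)" -> "3.3.9".
maven_version() {
  sed -n '1s/^Apache Maven \([0-9][0-9.]*\).*/\1/p'
}

# Example, commented out so that sourcing this file runs nothing:
# have="$(./apache-maven-3.3.9/bin/mvn --version | maven_version)"
# version_at_least "$have" 3.3.9 || echo "Maven $have may be too old" >&2
```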
Then clone the ADAM git repository and package using Apache Maven.
The ADAM unit tests take a long time to complete, so the -DskipTests=true argument may be useful.
$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam/
$ ../apache-maven-3.3.9/bin/mvn package -DskipTests=true
Running adam-submit --version should now report a SNAPSHOT version (currently 0.21.1-SNAPSHOT). Confirm that the "Built for" line reports Apache Spark 1.6.x.
$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/opt/sparkbox/spark/bin/spark-submit
17/03/08 04:26:20 INFO cli.ADAMMain: ADAM invoked with args: "--version"
e 888~-_ e e e
d8b 888 \ d8b d8b d8b
/Y88b 888 | /Y88b d888bdY88b
/ Y88b 888 | / Y88b / Y88Y Y888b
/____Y88b 888 / /____Y88b / YY Y888b
/ Y88b 888_-~ / Y88b / Y888b
ADAM version: 0.21.1-SNAPSHOT
Commit: 07c1982fb6fafb959b96d4c3cb83fa7edbe25721 Build: 2017-03-07
Built for: Apache Spark 1.6.3, Scala 2.10.6, and Hadoop 2.7.3
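The Spark version check can also be scripted rather than read by eye. The sketch below scans version output for a "Built for:" line naming the expected Apache Spark version prefix. The function name built_for_spark is hypothetical, and the sketch assumes the banner is written to standard output in the format shown above.

```shell
#!/usr/bin/env bash
# Read `adam-submit --version` output on standard input and succeed only
# if a "Built for:" line names Apache Spark with the given version prefix.
# Usage: ./bin/adam-submit --version | built_for_spark 1.6
built_for_spark() {
  awk -v want="Apache Spark $1." '
    /^Built for:/ && index($0, want) { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}
```

This check fits the source build output, whose "Built for:" line names Apache Spark. The release banner shown earlier lists only Scala and Hadoop, so the function would report a mismatch for it.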