Computing options and resources for CSPH Biostats

My task for June 15th

  • Itemize resources and use cases for the cloud
  • Consider how all of these can fit together into a unified environment + command-line options

Overview

Key topics

  • Google Datalab/Amazon SageMaker, or a hosted Jupyter Notebook served through a controlled interface
    • e.g. AHA's precision.heart.org
  • Datalakes
    • e.g. AHA's precision.heart.org
  • Slurm/HTCondor quick-start tutorials (better on Google Cloud)
  • Instant bootup of RStudio server instances
  • Databases

We have write-ups from Harry's and my experience with Google Cloud, Maxie's experience with the Open Science Grid, and Harry's overview of the Amazon AWS conference.

Google Cloud Platform

-omics test run findings

  • Extremely versatile and customizable
    • Can pick the number of virtual machines (VMs) you want
      • Currently approved for 24 VM instances
    • Can pick the number of processors you want in a VM
      • Currently approved for 700 cores (up to 64 per VM)
    • Can pick the amount of RAM you want in a VM
    • Can pick the type of storage you want (2 primary types)
      • Persistent Disk (PD)
        • Directly attached to the VM instance (think hard drive in a computer)
        • Faster data transfer, but more expensive
      • Google Cloud Storage (GCS)
        • Exists independent of VM instance (similar to Dropbox or Amazon S3)
        • Slower data transfer, but much cheaper.
  • Did a project test run – RNA-Seq Pre-processing on 10 samples
    • With 20 cores on 1 VM, the run completed in 50 computational hours (this is slightly conservative, since I used index files I had already built. That will still be the case for human projects, but runs may take a bit longer if samples are from a different species, for example, rat).
    • Overall cost for 48 hours of computation and storage using GCS is $106.
      • Note: this was for 10 samples. Projects can vary between 8 samples to 40 samples so the cost will need to be adjusted accordingly.
  • User friendly -- multiple interface options
    • Offers the gcloud command line tool
      • Lets you use your own terminal as an access point to any of your VMs
    • Web browser-based terminal
    • Can set up RStudio Server for browser-based RStudio sessions
  • Details on pricing:
    • Right now, we pay $0.90/month for ~50 GB of storage on GCS (most projects will be a little more than this).
    • Mistakes can be costly though.
      • Initially we had ~500 GB of unused persistent disk space at a cost of ~$60/month; moving data storage to GCS fixed this.
    • Current VM pricing is about $0.0535 per core per hour. This is for a standard machine (n1-standard) with 3.75 GB of RAM per core. (A rough cost sketch follows this list.)
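To make the pricing concrete, here is a back-of-the-envelope cost sketch in R using only the two rates quoted above ($0.0535 per core-hour for n1-standard machines and roughly $0.90 per 50 GB per month on GCS). The function name and defaults are illustrative; a real bill also includes persistent disk, network egress, and any non-standard machine types, which is why the observed $106 for the test run is higher than this estimate.

```r
# Rough GCP cost estimate from the rates quoted in this write-up.
# core_hour_rate: n1-standard pricing, ~$0.0535 per core per hour
# gcs_gb_month_rate: derived from ~$0.90 per 50 GB per month on GCS
estimate_gcp_cost <- function(cores, hours, gcs_gb, months = 1,
                              core_hour_rate = 0.0535,
                              gcs_gb_month_rate = 0.90 / 50) {
  compute <- cores * hours * core_hour_rate
  storage <- gcs_gb * months * gcs_gb_month_rate
  c(compute = compute, storage = storage, total = compute + storage)
}

# RNA-Seq test run: 20 cores for ~50 hours, ~50 GB kept in GCS for a month
estimate_gcp_cost(cores = 20, hours = 50, gcs_gb = 50)
#> compute storage   total
#>    53.5     0.9    54.4
```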

Other findings

Just booting up a big, multi-core machine, installing software, developing code, debugging, and then finally running the computationally intensive analysis or simulation would be very expensive. Some thought must be put into how to use GCP (Google Cloud Platform) efficiently ahead of time.

The approach we used for Harry's omics analysis was to (a sketch of this workflow in R follows the list):

  • Create one of the cheapest single core VMs
  • Install and test all software needed for the analysis
  • Test attaching and accessing data storage
  • Save an 'image' of this machine
  • Create a new, large VM choosing our saved image as the Boot Disk (32 cores, 120 GB RAM, 10 GB persistent disk for the OS and software)
  • Run analysis! (save results to GCS bucket)
  • Manually shut down the large VM as soon as we notice the analysis is complete.
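A minimal sketch of this workflow using the googleComputeEngineR package (described in the next section) is below. The function calls follow that package's documented interface, but the instance names, the image family "omics-pipeline", the bucket name, and the shell commands are all placeholders, and the "save an image" step itself is easiest to do in the Cloud Console or with `gcloud compute images create`.

```r
# Sketch of the small-VM -> image -> big-VM workflow with googleComputeEngineR.
# Assumes GCE_AUTH_FILE / GCE_DEFAULT_PROJECT_ID / GCE_DEFAULT_ZONE are set;
# "omics-dev", "omics-pipeline", and "my-results-bucket" are placeholders.
library(googleComputeEngineR)

## 1. Cheap single-core VM for installing and testing software
dev <- gce_vm(name = "omics-dev", predefined_type = "n1-standard-1")
gce_ssh(dev, "sudo apt-get update && sudo apt-get install -y build-essential")

## 2. After testing, stop the VM and save its boot disk as a custom image
##    (via the Cloud Console or `gcloud compute images create`), assigning it
##    to an image family such as "omics-pipeline".
gce_vm_stop(dev)

## 3. Boot the large analysis VM from the saved image
##    (n1-standard-32: 32 cores, 120 GB RAM, as in the list above)
big <- gce_vm(name = "omics-run",
              predefined_type = "n1-standard-32",
              image_project = Sys.getenv("GCE_DEFAULT_PROJECT_ID"),
              image_family = "omics-pipeline")

## 4. Run the analysis and push results to a GCS bucket
gce_ssh(big, "bash ~/run_pipeline.sh")
gce_ssh(big, "gsutil cp -r ~/results gs://my-results-bucket/")

## 5. Shut the large VM down as soon as the analysis finishes
gce_vm_stop(big)
```

The key cost-control point is the final gce_vm_stop() call: compute charges stop when the VM is stopped, while its persistent disk and anything copied to GCS remain (and continue to be billed at storage rates).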

This approach is relatively financially safe as long as the user pays close attention to shutting down the machine when it is not actively computing.

Even better options are tools and batch managers that handle software installs and VM creation/destruction, and that allow for preemptible VMs, which cost 80% less than standard VMs. These are the least expensive and financially safest options, but they come with more knowledge/learning overhead.

  • R package (beta) - googleComputeEngineR (see the sketch after this list)
  • HTCondor for high-throughput computing (many independent processes)
    • Google has a guide on how to do this.
    • Probably should be managed by 1 person for the department.
    • HTCondor has its own language, and thus a learning curve.
  • Techila Distributed Computing
    • Not tested, but looks promising
    • Has video tutorials, user guides, and an R-package
    • Also works with Matlab and Python
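As a concrete example of the "instant RStudio server" idea above, googleComputeEngineR ships VM templates that launch a Docker-based RStudio Server in a few lines. This is a hedged sketch based on the package's documented template interface; the instance name and credentials are placeholders, and preemptible scheduling is not shown because the exact argument should be checked against the package/API docs (the underlying Compute Engine field is scheduling.preemptible).

```r
# Launching a browser-based RStudio Server VM with googleComputeEngineR's
# built-in "rstudio" template (a Docker image with RStudio pre-installed).
# "biostat-rstudio" and the credentials below are placeholders.
library(googleComputeEngineR)

rstudio <- gce_vm(template = "rstudio",
                  name = "biostat-rstudio",
                  username = "analyst",          # RStudio login
                  password = "change-me-please", # use a real secret
                  predefined_type = "n1-standard-4")

# The returned object includes the external IP; browse to http://<ip>:8787
gce_get_external_ip(rstudio)

# Remember to stop (or delete) the VM when finished to avoid idle charges
gce_vm_stop(rstudio)
```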

Docker and Kubernetes are helpful tools for cloud computing. Docker 'containers' are a method for packaging the software environment of a computational project (Kubernetes orchestrates many such containers), and they are an alternative to the VM image approach we used for Harry's work.

OpenScienceGrid

OpenScienceGrid (OSG) is a free, highly scalable high-throughput computing resource that uses HTCondor for batch management and has a responsive support team, but its use case is more limited than Google Compute Engine or Rosalind.

OSG uses spare computing cycles on participating supercomputers (located internationally) to provide a free resource to non-profit and academic researchers.

OSG is a good fit if your project:

  • Doesn't use PHI or otherwise sensitive data
  • Consists of many independent processes (multithreading supported, but not message passing/MPI)
  • Input/Output data for each job is relatively small (<10GB)
  • Can run on Linux
  • Doesn't require licensed software

Projects run with OSG can add in computing resources from both XSEDE and Amazon AWS (and likely Google Cloud as well). OSG resources primarily rely on preemption, i.e., higher-priority processes can cause jobs to be halted/killed. For projects consisting of many small jobs (hundreds to thousands of 10 min-6 hr jobs), this is fine: preempted jobs are restarted automatically on a new node.

This is a good fit for large methodological simulation studies. My thesis, for example, consisted of estimating hormone pulses on ~4,000 simulated datasets, with each estimation taking 3-10 minutes. Since the data are simulated and each 3-10 minute job is an independent sequential process, this would be a great fit for OSG: I could request several hundred cores and receive my results within an hour to a few hours at no charge. On Amazon AWS, by comparison, I used 32-core VMs for about 10-20 hours at roughly $100 per run, for a total cost of about $300 by the end of my thesis.
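For intuition on the turnaround times quoted above, here is the back-of-the-envelope core-hour arithmetic, with 300 cores standing in for "several hundred" (all other numbers come from the paragraph above):

```r
# Rough core-hour arithmetic for the thesis example above.
n_jobs       <- 4000        # simulated datasets
mins_per_job <- c(3, 10)    # each estimation takes 3-10 minutes
cores        <- 300         # "several hundred" concurrent OSG cores

total_core_hours <- n_jobs * mins_per_job / 60
wall_clock_hours <- total_core_hours / cores

round(total_core_hours)        # ~200-667 core-hours in total
round(wall_clock_hours, 1)     # ~0.7-2.2 hours of wall-clock time on 300 cores
```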

  • Help/Support page
  • HTC Tutorials
