Computing options and resources for CSPH Biostats

My task for June 15th

  • Itemize resources and use cases for the cloud
  • Consider how all of these can fit together into a unified environment + command-line options

Overview

Key topics

  • Google Datalab/Amazon SageMaker, or a hosted Jupyter Notebook served through a controlled interface
    • e.g. AHA's precision.heart.org
  • Datalakes
    • e.g. AHA's precision.heart.org
  • Slurm/HTCondor quick-start tutorials (better on Google Cloud)
  • Instant bootup of RStudio server instances
  • Databases

We have write-ups from Harry's and my experience with Google Cloud, Maxie's experience with the Open Science Grid, and Harry's overview of the Amazon AWS conference.

Google Cloud Platform

-omics test run findings

  • Extremely versatile and customizable
    • Can pick the number of virtual machines (VMs) you want
      • Currently approved for 24 VM instances
    • Can pick the number of processors you want in a VM
      • Currently approved for 700 cores (up to 64 per VM)
    • Can pick the amount of RAM you want in a VM
    • Can pick the type of storage you want (2 primary types)
      • Persistent Disk (PD)
        • Directly attached to the VM instance (think hard drive in a computer)
        • Faster data transfer, but more expensive
      • Google Cloud Storage (GCS)
        • Exists independent of VM instance (similar to Dropbox or Amazon S3)
        • Slower data transfer, but much cheaper.
  • Did a project test run – RNA-Seq Pre-processing on 10 samples
    • With 20 cores on 1 VM, the run completed in 50 computational hours (this is slightly conservative, since I used index files I had already built. That will still be the case for human projects, but runs may take a bit longer if samples are from a different species, for example, rat).
    • Overall cost for 48 hours of computation and storage using GCS is $106.
      • Note: this was for 10 samples. Projects can vary between 8 samples to 40 samples so the cost will need to be adjusted accordingly.
  • User friendly -- multiple interface options
    • Offers the gcloud command line tool
      • Lets you use your own terminal as an access point to any of your VMs
    • Web browser-based terminal
    • Can set up RStudio Server for browser-based RStudio sessions
  • Details on pricing:
    • Right now, we pay $0.90/month for ~50 GB of storage on GCS (most projects will be a little more than this).
    • Mistakes can be costly though.
      • Initially we had ~500 GB of unused persistent disk space at a cost of ~$60/month; moving data storage to GCS fixed this.
    • Current VM pricing is about $0.0535 per core per hour. This is for a standard machine (n1-standard) with 3.75 GB of RAM per core. (A rough cost sketch follows this list.)
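To make the pricing concrete, here is a back-of-the-envelope cost sketch in R using only the two rates quoted above ($0.0535 per core-hour for n1-standard machines and roughly $0.90 per 50 GB per month on GCS). The function name and defaults are illustrative; a real bill also includes persistent disk, network egress, and any non-standard machine types, which is why the observed $106 for the test run is higher than this estimate.

```r
# Rough GCP cost estimate from the rates quoted in this write-up.
# core_hour_rate: n1-standard pricing, ~$0.0535 per core per hour
# gcs_gb_month_rate: derived from ~$0.90 per 50 GB per month on GCS
estimate_gcp_cost <- function(cores, hours, gcs_gb, months = 1,
                              core_hour_rate = 0.0535,
                              gcs_gb_month_rate = 0.90 / 50) {
  compute <- cores * hours * core_hour_rate
  storage <- gcs_gb * months * gcs_gb_month_rate
  c(compute = compute, storage = storage, total = compute + storage)
}

# RNA-Seq test run: 20 cores for ~50 hours, ~50 GB kept in GCS for a month
estimate_gcp_cost(cores = 20, hours = 50, gcs_gb = 50)
#> compute storage   total
#>    53.5     0.9    54.4
```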

Other findings

Just booting up a big, multi-core machine, installing software, developing code, debugging, and then finally running the computationally intensive analysis or simulation would be very expensive. Some thought must be put into how to use GCP (Google Cloud Platform) efficiently ahead of time.

The approach we used for Harry's omics analysis was to (a sketch of this workflow in R follows the list):

  • Create one of the cheapest single core VMs
  • Install and test all software needed for the analysis
  • Test attaching and accessing data storage
  • Save an 'image' of this machine
  • Create a new, large VM choosing our saved image as the Boot Disk (32 cores, 120 GB RAM, 10 GB persistent disk for the OS and software)
  • Run analysis! (save results to GCS bucket)
  • Manually shut down the large VM as soon as we notice the analysis is complete.
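A minimal sketch of this workflow using the googleComputeEngineR package (described in the next section) is below. The function calls follow that package's documented interface, but the instance names, the image family "omics-pipeline", the bucket name, and the shell commands are all placeholders, and the "save an image" step itself is easiest to do in the Cloud Console or with `gcloud compute images create`.

```r
# Sketch of the small-VM -> image -> big-VM workflow with googleComputeEngineR.
# Assumes GCE_AUTH_FILE / GCE_DEFAULT_PROJECT_ID / GCE_DEFAULT_ZONE are set;
# "omics-dev", "omics-pipeline", and "my-results-bucket" are placeholders.
library(googleComputeEngineR)

## 1. Cheap single-core VM for installing and testing software
dev <- gce_vm(name = "omics-dev", predefined_type = "n1-standard-1")
gce_ssh(dev, "sudo apt-get update && sudo apt-get install -y build-essential")

## 2. After testing, stop the VM and save its boot disk as a custom image
##    (via the Cloud Console or `gcloud compute images create`), assigning it
##    to an image family such as "omics-pipeline".
gce_vm_stop(dev)

## 3. Boot the large analysis VM from the saved image
##    (n1-standard-32: 32 cores, 120 GB RAM, as in the list above)
big <- gce_vm(name = "omics-run",
              predefined_type = "n1-standard-32",
              image_project = Sys.getenv("GCE_DEFAULT_PROJECT_ID"),
              image_family = "omics-pipeline")

## 4. Run the analysis and push results to a GCS bucket
gce_ssh(big, "bash ~/run_pipeline.sh")
gce_ssh(big, "gsutil cp -r ~/results gs://my-results-bucket/")

## 5. Shut the large VM down as soon as the analysis finishes
gce_vm_stop(big)
```

The key cost-control point is the final gce_vm_stop() call: compute charges stop when the VM is stopped, while its persistent disk and anything copied to GCS remain (and continue to be billed at storage rates).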

This approach is relatively financially safe as long as the user pays close attention to shutting down the machine when it is not actively computing.

Even better options are tools and batch managers that handle software installs and VM creation/destruction, and that allow for preemptible VMs, which cost 80% less than standard VMs. These are the least expensive and financially safest options, but they come with more knowledge/learning overhead.

  • R package (beta) - googleComputeEngineR (see the sketch after this list)
  • HTCondor for high-throughput computing (many independent processes)
    • Google has a guide on how to do this.
    • Probably should be managed by 1 person for the department.
    • HTCondor has its own language, and thus a learning curve.
  • Techila Distributed Computing
    • Not tested, but looks promising
    • Has video tutorials, user guides, and an R-package
    • Also works with Matlab and Python
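As a concrete example of the "instant RStudio server" idea above, googleComputeEngineR ships VM templates that launch a Docker-based RStudio Server in a few lines. This is a hedged sketch based on the package's documented template interface; the instance name and credentials are placeholders, and preemptible scheduling is not shown because the exact argument should be checked against the package/API docs (the underlying Compute Engine field is scheduling.preemptible).

```r
# Launching a browser-based RStudio Server VM with googleComputeEngineR's
# built-in "rstudio" template (a Docker image with RStudio pre-installed).
# "biostat-rstudio" and the credentials below are placeholders.
library(googleComputeEngineR)

rstudio <- gce_vm(template = "rstudio",
                  name = "biostat-rstudio",
                  username = "analyst",          # RStudio login
                  password = "change-me-please", # use a real secret
                  predefined_type = "n1-standard-4")

# The returned object includes the external IP; browse to http://<ip>:8787
gce_get_external_ip(rstudio)

# Remember to stop (or delete) the VM when finished to avoid idle charges
gce_vm_stop(rstudio)
```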

Docker and Kubernetes are helpful tools for cloud computing. Docker 'containers' are a method for packaging the software environment of a computational project (Kubernetes orchestrates many such containers), and they are an alternative to the VM image approach we used for Harry's work.

OpenScienceGrid

OpenScienceGrid (OSG) is a free, highly scalable high-throughput computing resource that uses HTCondor for batch management and has a responsive support team, but its use case is more limited than Google Compute Engine or Rosalind.

OSG uses spare computing cycles on participating supercomputers (located internationally) to provide a free resource to non-profit and academic researchers.

OSG is a good fit if your project:

  • Doesn't use PHI or otherwise sensitive data
  • Consists of many independent processes (multithreading supported, but not message passing/MPI)
  • Input/Output data for each job is relatively small (<10GB)
  • Can run on Linux
  • Doesn't require licensed software

Projects run with OSG can add in computing resources from both XSEDE and Amazon AWS (and likely Google Cloud as well). OSG resources primarily rely on preemption, i.e., higher-priority processes can cause jobs to be halted/killed. For projects consisting of many small jobs (hundreds to thousands of 10 min-6 hr jobs), this is fine: preempted jobs are restarted automatically on a new node.

This is a good fit for large methodological simulation studies. My thesis, for example, consisted of estimating hormone pulses on ~4,000 simulated datasets, with each estimation taking 3-10 minutes. Since the data are simulated and each 3-10 minute job is an independent sequential process, this would be a great fit for OSG: I could request several hundred cores and receive my results within an hour to a few hours at no charge. On Amazon AWS, by comparison, I used 32-core VMs for about 10-20 hours at roughly $100 per run, for a total cost of about $300 by the end of my thesis.
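For intuition on the turnaround times quoted above, here is the back-of-the-envelope core-hour arithmetic, with 300 cores standing in for "several hundred" (all other numbers come from the paragraph above):

```r
# Rough core-hour arithmetic for the thesis example above.
n_jobs       <- 4000        # simulated datasets
mins_per_job <- c(3, 10)    # each estimation takes 3-10 minutes
cores        <- 300         # "several hundred" concurrent OSG cores

total_core_hours <- n_jobs * mins_per_job / 60
wall_clock_hours <- total_core_hours / cores

round(total_core_hours)        # ~200-667 core-hours in total
round(wall_clock_hours, 1)     # ~0.7-2.2 hours of wall-clock time on 300 cores
```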

  • Help/Support page
  • HTC Tutorials
