mwvaughn/etl-proc-high-level.md

## etl-proc-high-level.md

      
### Packaging Code for SD2E (30,000 ft view)


1. Containerize the application or script codebase using Docker, or work with TACC to install it directly on a targeted HPC system.
2. Provide a small data set and an example invocation of the code so we can validate it.
3. Document at least one protocol (command sequence) for running the application to accomplish a specific set of objectives. The protocol should describe:
    - Input file(s), including any reference data files (genome sequences, structured database files, etc.)
    - Output file(s)
    - Parameters, including types and validation rules
    - An example invocation of the application that combines inputs and parameters and leads to specific outputs
    - Approximate resource requirements for running the code (CPU architecture, CPU cores, RAM, free disk, whether MPI is required, etc.)
4. The information gathered in step 3 is represented in a specialized JSON document and registered with one of TACC's application web services.
5. The application can then be invoked via a call to a TACC task execution web service.
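
As a rough illustration of how the protocol information (inputs, outputs, parameters with types and validation rules, resource requirements) might be captured as a JSON-style document and then used to check an invocation, here is a minimal Python sketch. Every field name, the `example-app-run` identifier, and the helper functions `validate_parameters` and `build_job_request` are assumptions made for this example; they do not reflect the actual schema or API of TACC's web services.

```python
import json

# Hypothetical description of one protocol. All field names and values are
# illustrative; they do not follow the real schema of TACC's web services.
APP_DESCRIPTION = {
    "name": "example-app-run",
    "version": "1.00",
    "inputs": [
        {"id": "reads", "required": True},
        {"id": "reference_genome", "required": False},
    ],
    "outputs": [{"id": "report"}],
    "parameters": [
        {"id": "threads", "type": "integer", "default": 4, "minimum": 1, "maximum": 64},
        {"id": "mode", "type": "string", "default": "fast", "choices": ["fast", "accurate"]},
    ],
    "resources": {"cpu_arch": "x86_64", "cores": 4, "ram_gb": 16, "disk_gb": 50, "mpi": False},
}

PYTHON_TYPES = {"integer": int, "string": str}


def validate_parameters(description, supplied):
    """Merge supplied values with defaults; raise ValueError if a rule is violated."""
    specs = {spec["id"]: spec for spec in description["parameters"]}
    unknown = sorted(set(supplied) - set(specs))
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    resolved = {}
    for param_id, spec in specs.items():
        value = supplied.get(param_id, spec["default"])
        expected = PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{param_id} must be of type {spec['type']}")
        if "minimum" in spec and value < spec["minimum"]:
            raise ValueError(f"{param_id} is below the minimum of {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            raise ValueError(f"{param_id} is above the maximum of {spec['maximum']}")
        if "choices" in spec and value not in spec["choices"]:
            raise ValueError(f"{param_id} must be one of {spec['choices']}")
        resolved[param_id] = value
    return resolved


def build_job_request(description, inputs, parameters):
    """Assemble the body of a hypothetical task execution request."""
    missing = [i["id"] for i in description["inputs"]
               if i["required"] and i["id"] not in inputs]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return {
        "app": f"{description['name']}-{description['version']}",
        "inputs": dict(inputs),
        "parameters": validate_parameters(description, parameters),
    }


if __name__ == "__main__":
    request = build_job_request(APP_DESCRIPTION, {"reads": "data/sample.fastq"}, {"threads": 8})
    print(json.dumps(request, indent=2))
```

Keeping the types and validation rules in the description itself means the same document can drive both registration and pre-submission checks, so a bad parameter is caught before any compute time is requested.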

For interactive applications, we'll support alternate approaches such as hosted Jupyter notebooks. Where third-party services or web applications are involved, we will assist with integration.
### Notes

The general approach is to take code packages that may be quite complex and define repeatable protocols that each do one thing well. For instance, if you have a package TRIATHLON v1.01 that can do run, swim, and bike tasks, we'd represent it as a set of components, TRIATHLON-run-1.00, TRIATHLON-swim-1.00, and TRIATHLON-bike-1.00, each of which can be invoked separately. On the backend, the code assets are shared between the run, swim, and bike components, and we can add variations to each task or even add new tasks.
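
The decomposition described above can be sketched in Python as a set of component records that all point at one shared code asset. The `Component` class, the `example/triathlon:1.01` image name, and the `triathlon <task>` command line are hypothetical and exist only for this illustration; only the component naming pattern comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One separately invokable task backed by a shared code asset."""
    package: str
    task: str
    version: str
    image: str        # shared code asset (hypothetical container image name)
    command: tuple    # hypothetical command line for this task

    @property
    def app_id(self):
        # Follows the naming pattern in the text, e.g. TRIATHLON-run-1.00.
        return f"{self.package}-{self.task}-{self.version}"


def make_components(package, image, tasks, version="1.00"):
    """Create one component per task, all reusing the same image."""
    return {
        task: Component(package, task, version, image, (package.lower(), task))
        for task in tasks
    }


SHARED_IMAGE = "example/triathlon:1.01"  # assumption: placeholder image name
components = make_components("TRIATHLON", SHARED_IMAGE, ["run", "swim", "bike"])

if __name__ == "__main__":
    for component in components.values():
        print(component.app_id, component.image, " ".join(component.command))
```

Adding a new task later only means adding another entry to the task list; the shared image is left untouched.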