@delagoya
Last active February 4, 2020 03:03
A mapping of the GA4GH Task Execution Service API schema to AWS Batch

Map of Task Execution Service (TES) to AWS

This document is an overview of how concepts from TES map to concepts in AWS Batch.

AWS Batch - Basic Concepts

AWS Batch ("Batch") has a few basic concepts that need to be understood before we can make a comparison to concepts in TES. Some relate directly to TES and others do not.

Job : A Job is a unit of work executed by AWS Batch. Jobs can be executed as containerized applications via Amazon ECS in an ECS cluster. Containerized jobs can reference a container image, command, and parameters. The general structure of a Job must be pre-defined via a JobDefinition. You can submit a large number of independent, simple jobs. More information here.

JobDefinition : A JobDefinition specifies how jobs are to be run. While each Job must reference a definition, many of the parameters that are specified in the job definition can be overridden at run time. Some of the attributes specified in a job definition include: Docker image, number of CPU's, memory, the command to run, environment variables, data volumes, and AWS permissions needed (e.g. access to particular private S3 bucket). More information here.

JobQueue : Jobs are submitted to a JobQueue, where they reside until they can be scheduled to run in a compute environment. You can have multiple job queues, for example one using On-Demand instances and one using Spot instances. JobQueues have a priority that the scheduler uses to determine which jobs in which queue should be evaluated for execution first. More information here.

JobState : The current state of a submitted Job. More information here.

ComputeEnvironment : The underlying compute and storage resources to run jobs from a particular JobQueue. More than one JobQueue can be mapped to a given ComputeEnvironment. A ComputeEnvironment can either be managed (Batch will provision and deprovision compute resources automatically) or unmanaged (you control the underlying resources to send Jobs to). More information here.

Mapping TES concepts to AWS Batch

From the above, you can see that the closest analogues from TES are Executor and Task, as they relate (roughly) to JobDefinition and Job. There is not a straight mapping, since a single TES Task is a vector of processes to be executed (Executor[]). In AWS Batch, this naively translates to a set of serially dependent job submissions 1.

At a high level, any TES compliant provider endpoint built on top of AWS Batch has a couple of requirements:

  1. It must already have a configured ComputeEnvironment and JobQueue.
  2. It must have its own registry to track previously submitted tasks, and upon discovery of a new type of Executor, it will likely have to create a new JobDefinition to fulfill the full Task submission request.
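
To make requirement 2 concrete, a Task with N Executors could be translated into N serially dependent SubmitJob requests. The sketch below only builds the request payloads; the function name `build_job_chain` and the queue/definition values are illustrative, not part of any AWS API, and a real service would fill each `dependsOn` with the jobId returned by the previous SubmitJob call.

```python
# Sketch: translate a TES Task's Executor vector into a chain of serially
# dependent AWS Batch SubmitJob request payloads. Hypothetical helper; a
# real service would also register a JobDefinition per Executor type first.
def build_job_chain(task_name, executors, job_queue, job_definition_arn):
    """Return SubmitJob payloads, each depending on the previous job.

    `executors` is a list of dicts with at least a 'cmd' key. Batch assigns
    jobId at submission time, so the dependsOn value here is a placeholder
    to be replaced with the jobId from the previous SubmitJob response.
    """
    requests = []
    for i, executor in enumerate(executors):
        req = {
            "jobName": f"{task_name}-exec-{i}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition_arn,
            "containerOverrides": {"command": list(executor["cmd"])},
        }
        if i > 0:
            # Replaced with the previous SubmitJob response's jobId.
            req["dependsOn"] = [{"jobId": f"<jobId of {task_name}-exec-{i - 1}>"}]
        requests.append(req)
    return requests
```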

There are other requirements specific to a TES API endpoint and we will cover those in turn as we cover the TES ontology tree. We will cover leaf concepts first, moving up the TES ontology to collection types afterwards.

Enumerations - FileType, TaskView, and State

FileType and TaskView have no analogue in AWS Batch. It would be up to a service provider to define how to interpret these concepts.

The State enumeration is used in a Task.state response and represents the current state of a submitted task. Since a single TES Task may contain an Executor vector, there is a disconnect with Batch: a JobState is returned per job in a DescribeJobs API response, within the JSON structure as jobs[].status, so the overall Task state must be computed as a function of that collection.

In addition to the above, AWS Batch has a different set of enumerations, with a clearly defined state transition.

+-----------+     +----------+     +----------+     +-----------+     +--------+
| SUBMITTED | --> | RUNNABLE | --> | STARTING | --> |  RUNNING  | --> | FAILED |
+-----------+     +----------+     +----------+     +-----------+     +--------+
      |               ^                                   |
      |               |                                   |
      v               |                                   v
+---------+           |                               +-----------+
| PENDING | ----------+                               | SUCCEEDED |
+---------+                                           +-----------+

The following table is a rough mapping from TES State to Batch JobState.

| TES State | Batch JobState | Note |
| --- | --- | --- |
| UNKNOWN | (none) | Possible to use for canceled jobs. |
| QUEUED | SUBMITTED, PENDING, or RUNNABLE | Could be any of these. See the state transition diagram. |
| INITIALIZING | STARTING | |
| RUNNING | RUNNING | |
| PAUSED | (none) | Jobs can only be canceled or terminated in Batch. |
| COMPLETE | SUCCEEDED | |
| ERROR | FAILED | |
| SYSTEM_ERROR | FAILED | |
| CANCELED | FAILED | CancelJob only sets this when the job has not progressed to the STARTING or RUNNING state; otherwise the job must be terminated using the TerminateJob API call. The reason for either cancellation or termination will be in the Batch job details under statusReason. |
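
A minimal sketch of computing an overall Task state from the jobs[].status values of a Task's job chain, following the mapping above. The aggregation rules (any FAILED means ERROR, and so on) are one possible convention, not something TES or Batch prescribes.

```python
# Per-job Batch JobState -> TES State, per the mapping table above.
BATCH_TO_TES = {
    "SUBMITTED": "QUEUED",
    "PENDING": "QUEUED",
    "RUNNABLE": "QUEUED",
    "STARTING": "INITIALIZING",
    "RUNNING": "RUNNING",
    "SUCCEEDED": "COMPLETE",
    "FAILED": "ERROR",
}

def task_state(job_statuses):
    """Collapse the jobs[].status values of a Task's jobs into one TES State."""
    if not job_statuses:
        return "UNKNOWN"
    if "FAILED" in job_statuses:
        # Could also be CANCELED; statusReason must be inspected to tell.
        return "ERROR"
    if all(s == "SUCCEEDED" for s in job_statuses):
        return "COMPLETE"
    if "RUNNING" in job_statuses:
        return "RUNNING"
    if "STARTING" in job_statuses:
        return "INITIALIZING"
    return "QUEUED"
```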

Ports

Not applicable to AWS Batch. Amazon ECS, which Batch is built on top of, does support port mappings for a container, but Batch does not expose this service feature.

Logs - TaskLog, ExecutorLog, OutputFileLog

The various *Log types in TES are spread across a few Batch types. The main difference between AWS Batch and TES is that TES is explicit about reporting where the runtime outputs would be using OutputFileLog, while Batch leaves the output handling up to the user to manage. In Batch, the STDOUT and STDERR of a Job are submitted to CloudWatch Logs. It is important to note that CloudWatch Logs have a configurable retention time, and that any TES service built on top of AWS Batch would need to account for how it wants to handle job data over the long term.

OutputFileLog

This is metadata associated with one of the output files produced by the full set of a Task's Executors. Batch leaves handling of output files to the process. For example, if you had a Job that runs BWA, the output would be a BAM file, and it would be up to the launched container's process to move that BAM file to some storage like S3. We will discuss this more when we discuss Task.

ExecutorLog

As the output from an individual Job, the ExecutorLog type has some direct analogues to a Batch Job description from status queries against the API.

| ExecutorLog attribute | Batch Job attribute | Note |
| --- | --- | --- |
| string start_time | jobs[].startedAt | |
| string end_time | jobs[].stoppedAt | |
| string stdout | (none) | Stored in CloudWatch Logs |
| string stderr | (none) | Stored in CloudWatch Logs |
| int32 exit_code | jobs[].exitCode | |
| string host_ip | (none) | Possibly available from containerInstanceArn information |
| repeated Ports ports | (none) | Not applicable for Batch |
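
As a sketch, projecting one jobs[] entry from a DescribeJobs response onto ExecutorLog fields might look like the following. Batch timestamps are epoch milliseconds, while TES expects time strings; this assumes RFC 3339 output is acceptable, and it leaves stdout/stderr unset since a real service would have to fetch them from the job's CloudWatch log stream.

```python
# Sketch: project a Batch DescribeJobs jobs[] entry onto TES ExecutorLog
# fields, per the table above. Hypothetical helper, not an AWS API.
from datetime import datetime, timezone

def to_executor_log(job):
    def ts(ms):
        # Batch reports startedAt/stoppedAt as epoch milliseconds.
        if ms is None:
            return None
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()

    container = job.get("container", {})
    return {
        "start_time": ts(job.get("startedAt")),
        "end_time": ts(job.get("stoppedAt")),
        # stdout/stderr live in CloudWatch Logs; a service would fetch the
        # job's log stream and return its contents or a URL here.
        "stdout": None,
        "stderr": None,
        "exit_code": container.get("exitCode"),
    }
```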

TaskLog

As an aggregator of ExecutorLog and OutputFileLog, this class has no real analogue in a Batch Job and must be computed.

| TaskLog attribute | AWS Batch computed value |
| --- | --- |
| repeated ExecutorLog logs[] | The set of values mapped to ExecutorLog |
| map<string, string> metadata | Useful items not reported elsewhere, such as jobs[].statusReason |
| string start_time | First Executor's jobs[].startedAt |
| string end_time | Last Executor's jobs[].stoppedAt |
| repeated OutputFileLog outputs | Problematic for a lot of reasons |
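
The computed values above can be sketched as a small assembly step over the per-job ExecutorLog dicts and raw jobs[] entries. The helper name and the choice to join statusReason strings are assumptions; outputs is left empty because, as noted, OutputFileLog would have to come from a service-level convention.

```python
# Sketch: assemble a TES TaskLog from per-job ExecutorLog dicts and the raw
# DescribeJobs entries, per the computed-value table above.
def to_task_log(executor_logs, jobs):
    return {
        "logs": executor_logs,
        # The Task spans from the first job's start to the last job's stop.
        "start_time": executor_logs[0]["start_time"] if executor_logs else None,
        "end_time": executor_logs[-1]["end_time"] if executor_logs else None,
        "metadata": {
            "statusReason": "; ".join(
                j["statusReason"] for j in jobs if "statusReason" in j
            )
        },
        # OutputFileLog entries require a service-level convention.
        "outputs": [],
    }
```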

Resources and TaskParameters

A Resources message maps closest to the JobDefinition container properties (which can be mostly overridden at Job runtime). A notable exception here is that volume sizes given to a container are handled at the level of a ComputeEnvironment, not at the individual job level.

| Resources attribute | JobDefinition attribute | Note |
| --- | --- | --- |
| uint32 cpu_cores | jobDefinition.containerProperties.vcpus | These are hyperthreaded cores |
| bool preemptible | (none) | Handled by virtue of which JobQueue the Job was submitted to |
| double ram_gb | jobDefinition.containerProperties.memory | Integer in MiB |
| double size_gb | (none) | A container's volumes and mountPoints properties would need to account for this |
| repeated string zones | (none) | Handled at the level of ComputeEnvironment and JobQueue |

Batch job parameters are simple key-value pairs that represent default values or parameter substitution placeholders, and they are defined within a JobDefinition. Parameters in a job submission request override any corresponding parameter defaults from the job definition. This is a big departure from the TES TaskParameter, which is meant to define file inputs and outputs for a set of operations. Any mapping would be subject to a lot of conventions and be specific to an implementation of TES on top of AWS Batch.

Executors

As mentioned, the simplest mapping of Job to a TES Task::Executor vector would be to encode the Executor vector as a set of serially dependent Jobs, each with a matching JobDefinition.

| Executor attribute | JobDefinition containerProperties | Notes |
| --- | --- | --- |
| string image_name | image | |
| repeated string cmd | command | |
| string workdir | mountPoints | Conventions needed |
| string stdin | mountPoints | Conventions needed |
| string stdout | mountPoints | Conventions needed |
| string stderr | mountPoints | Conventions needed |
| repeated Ports ports | (none) | Not applicable for Batch |
| map<string,string> environ | environment | Also a key-value array |
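
The direct rows of this table can be sketched as follows. The workdir/stdin/stdout/stderr fields need mount-point conventions and are omitted; environ translates into Batch's name/value array form. The vcpus/memory defaults are arbitrary placeholders, since Batch requires them but Executor does not carry them.

```python
# Sketch: map one TES Executor onto Batch containerProperties, per the
# table above. Hypothetical helper; defaults are placeholders.
def executor_to_container_properties(executor, vcpus=1, memory_mib=1024):
    return {
        "image": executor["image_name"],
        "command": list(executor["cmd"]),
        # Batch takes environment as an array of {name, value} objects.
        "environment": [
            {"name": k, "value": v}
            for k, v in sorted(executor.get("environ", {}).items())
        ],
        # Required by Batch but not present on Executor; see Resources above.
        "vcpus": vcpus,
        "memory": memory_mib,
    }
```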

Tasks

Not to beat a dead horse, but have I mentioned that Tasks do not map cleanly to Batch? Below I've made my best attempt to do so, but this is worth a discussion.

| Task attribute | JobDefinition or Job attribute | Notes |
| --- | --- | --- |
| string id | jobs[].jobId | |
| State state | jobs[].status | |
| string name | jobs[].jobName | |
| string project | (none) | Could utilize JobQueue by convention |
| string description | (none) | No use in Batch |
| repeated TaskParameter inputs | jobDefinition.parameters | Conventions needed |
| repeated TaskParameter outputs | jobDefinition.parameters | Conventions needed |
| Resources resources | (none) | See mapping above |
| repeated Executor executors | (none) | See mapping above |
| repeated string volumes | jobDefinition.containerProperties volumes and mountPoints | Conventions needed |
| map<string, string> tags | (none) | Not used |
| repeated TaskLog logs | (none) | See mapping above |

API differences

Querying Tasks

Both of Batch's API requests for job information (DescribeJobs and ListJobs) return an array of results. You can give an array of jobId values as a filtering parameter, but the result is still an array of job information even when only one jobId is given.

Batch also does not support query of jobs by their jobName, while TES allows for defining a prefix to search on. This would have to be handled outside of Batch.
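
Handling this outside of Batch means paging through ListJobs results and filtering on the name prefix client-side. A minimal sketch, where `list_jobs_page` stands in for a boto3 `batch.list_jobs` call returning a dict with `jobSummaryList` and an optional `nextToken`:

```python
# Sketch: client-side jobName prefix filtering over paginated ListJobs
# results, since Batch itself cannot filter by name.
def find_jobs_by_prefix(list_jobs_page, prefix):
    matches, token = [], None
    while True:
        page = list_jobs_page(next_token=token)
        matches += [
            j for j in page["jobSummaryList"] if j["jobName"].startswith(prefix)
        ]
        token = page.get("nextToken")
        if not token:
            return matches
```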

Canceling Tasks

AWS Batch differentiates between canceling and terminating a Job.

A Batch CancelJob request will cancel a Job in the PENDING or RUNNABLE states, but will be a no-op for jobs that have entered the STARTING or RUNNING states. For the latter, Batch requires an explicit TerminateJob request to be issued on a job. In the case of a successful cancel or termination of a job, Batch will set the job detail status to FAILED, and the statusReason to the provided reason given to the CancelJob or TerminateJob API call.
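
A TES CancelTask implementation therefore has to pick the right call per job state. A minimal sketch of that decision, assuming the states in the transition diagram above:

```python
# Sketch: choose the Batch API call that cancels a job in a given state.
# CancelJob is a no-op once a job reaches STARTING or RUNNING.
def cancel_action(job_status):
    if job_status in ("SUBMITTED", "PENDING", "RUNNABLE"):
        return "CancelJob"
    if job_status in ("STARTING", "RUNNING"):
        return "TerminateJob"
    return None  # SUCCEEDED / FAILED: nothing to do
```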

Footnotes

  1. Alternatively, one could implement a system where Batch only receives a request to run a TES Task launcher container with privileged access and enough resource allocation for the serial Executors, but this is getting ahead of ourselves.

@briandoconnor

This is great Angel!!!

@miachamp

Additional information/work-arounds for TES-to-AWS Batch

  1. Is it possible to get port mappings in AWS Batch (also related to 'repeated Ports ports')?

Expose them via the Dockerfile, or retrieve them from ALB logs and then query the dynamic port assignment from the instance itself (over SSH). This is not the same as API-driven assignment and query, but a feature request has been submitted to the AWS Batch team.

  2. Is there a way to force a 'pause' or push to a 'pending' state in AWS Batch?

The dependsOn parameter of the SubmitJob API can be used to specify inter-job dependencies.  This will cause AWS Batch to delay starting the downstream job until the preceding job has completed.

  3. Is the string host_ip available from the containerInstanceArn?

To determine the host_ip, you can use the container instance ARN associated with the job:
 
$ aws batch describe-jobs --jobs 224aa9a9-e426-4a55-b057-529dc9e7139a | grep containerInstanceArn
                "containerInstanceArn": "arn:aws:ecs:us-east-1:493731438004:container-instance/685d4bc8-7c66-4b8c-bff6-3fe59f4bc416",
                        "containerInstanceArn": "arn:aws:ecs:us-east-1:493731438004:container-instance/685d4bc8-7c66-4b8c-bff6-3fe59f4bc416",
 
$ aws ecs describe-container-instances --cluster F1CE_Batch_c3d69b6f-1fe2-38a9-919d-3a24e82baa8b --container-instances 26413633-9f14-4474-94ee-6a4e38a6b219 | grep ec2InstanceId
            "ec2InstanceId": "i-064adceb383673be1",
 
$ aws ec2 describe-instances --instance-ids i-064adceb383673be1 | grep PublicIpAddress
                    "PublicIpAddress": "52.91.79.130",
 
$ aws ec2 describe-instances --instance-ids i-064adceb383673be1 | grep PrivateIpAddress
                    "PrivateIpAddress": "10.0.1.20",
                            "PrivateIpAddresses": [
                                    "PrivateIpAddress": "10.0.1.20"
                            "PrivateIpAddress": "10.0.1.20"

  4. Is there an equivalent or workaround for 'repeated OutputFileLog outputs' and retrieving the following:
    a) string workdir
    b) string stdin
    c) string stdout
    d) string stderr

With the Job Id, Job Name, and Task ARN, you are able to point to the exact CloudWatch log stream for information on output stream(s)/files, the workdir path, stdin, stdout, and any stderr.

  5. Does Batch support querying jobs by their jobName? What about mapping Tasks to Batch (i.e. string project, string description, map<string,string> tags)?

Use DescribeJobs (there are also DescribeComputeEnvironments, DescribeJobDefinitions, and DescribeJobQueues):

[ec2-user@ip-172-31-53-131 ~]$ aws batch describe-jobs --jobs 279ba4c6-393e-4d1b-9e64-d5bada3d6ec9
{
    "jobs": [
        {
            "status": "RUNNABLE", 
            "container": {
                "mountPoints": [], 
                "image": "508922263819.dkr.ecr.us-east-1.amazonaws.com/wgetfetch_and_run", 
                "environment": [
                    {
                        "name": "BATCH_FILE_S3_URL", 
                        "value": "s3://mybatchjobs-scripts-mc/DemoBashCopy.sh"
                    }, 
                    {
                        "name": "BATCH_FILE_TYPE", 
                        "value": "script"
                    }
                ], 
                "vcpus": 8, 
                "jobRoleArn": "arn:aws:iam::508922263819:role/batchJobRole", 
                "volumes": [], 
                "memory": 128, 
                "command": [
                    "DemoBashCopy.sh", 
                    "120"
                ], 
                "ulimits": []
            }, 
            "parameters": {}, 
            "jobDefinition": "arn:aws:batch:us-east-1:508922263819:job-definition/BowtieDemo:5", 
            "jobQueue": "arn:aws:batch:us-east-1:508922263819:job-queue/SimpleGenomicsDemo", 
            "jobId": "279ba4c6-393e-4d1b-9e64-d5bada3d6ec9", 
            "dependsOn": [], 
            "jobName": "myTest001", 
            "createdAt": 1493314175709
        }
    ]
}

@buchanae

buchanae commented Jun 8, 2017

If I understand correctly, when calling batch.ListJobs without a JobStatus filter, the default is to list only jobs in the RUNNING state. Is that right? Or is it both STARTING and RUNNING?

This presents a small hurdle for implementing tes.ListTasks, where I'll need to call batch.ListJobs multiple times (for each status) and manage the pagination somehow. I'm also only focusing on a single batch.JobQueue right now, but if (when) there were multiple, I'd need more batch.ListJobs calls.
