@delagoya
Last active February 4, 2020 03:03
A mapping of the GA4GH Task Execution Service API schema to AWS Batch

Map of Task Execution Service (TES) to AWS

This document is an overview of how concepts from TES map to concepts in AWS Batch.

AWS Batch - Basic Concepts

AWS Batch ("Batch") has a few basic concepts that need to be understood before we can make a comparison to concepts in TES. Some relate directly to TES and others do not.

Job : A Job is a unit of work executed by AWS Batch. Jobs can be executed as containerized applications via Amazon ECS in an ECS cluster. Containerized jobs can reference a container image, command, and parameters. The general structure of a Job must be pre-defined via a JobDefinition. You can submit a large number of independent, simple jobs. More information here.

JobDefinition : A JobDefinition specifies how jobs are to be run. While each Job must reference a definition, many of the parameters that are specified in the job definition can be overridden at run time. Some of the attributes specified in a job definition include: Docker image, number of CPU's, memory, the command to run, environment variables, data volumes, and AWS permissions needed (e.g. access to particular private S3 bucket). More information here.

JobQueue : Jobs are submitted to a JobQueue, where they reside until they can be scheduled to run in a compute environment. You can have multiple job queues, for example one using On-Demand instances and one using Spot instances. JobQueues have a priority that the scheduler uses to determine which jobs in which queue should be evaluated for execution first. More information here.

JobState : The current state of a submitted Job. More information here.

ComputeEnvironment : The underlying compute and storage resources to run jobs from a particular JobQueue. More than one JobQueue can be mapped to a given ComputeEnvironment. A ComputeEnvironment can either be managed (Batch will provision and deprovision compute resources automatically) or unmanaged (you control the underlying resources to send Jobs to). More information here.

Mapping TES concepts to AWS Batch

From the above, you can see that the closest analogues from TES are Executor and Task, as they relate (roughly) to JobDefinition and Job. There is not a straight mapping, since a single TES Task is a vector of processes to be executed (Executor[]). In AWS Batch, this naively translates to a set of serially dependent job submissions 1.

At a high level, any TES compliant provider endpoint built on top of AWS Batch has a couple of requirements:

  1. It must already have a configured ComputeEnvironment and JobQueue.
  2. It must have its own registry to track previously submitted tasks, and upon discovery of a new type of Executor, it will likely have to create a new JobDefinition to fulfill the full Task submission request.
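
To make requirement 2 concrete, a Task with N Executors could be translated into N serially dependent SubmitJob requests. The sketch below only builds the request payloads; the function name `build_job_chain` and the queue/definition values are illustrative, not part of any AWS API, and a real service would fill each `dependsOn` with the jobId returned by the previous SubmitJob call.

```python
# Sketch: translate a TES Task's Executor vector into a chain of serially
# dependent AWS Batch SubmitJob request payloads. Hypothetical helper; a
# real service would also register a JobDefinition per Executor type first.
def build_job_chain(task_name, executors, job_queue, job_definition_arn):
    """Return SubmitJob payloads, each depending on the previous job.

    `executors` is a list of dicts with at least a 'cmd' key. Batch assigns
    jobId at submission time, so the dependsOn value here is a placeholder
    to be replaced with the jobId from the previous SubmitJob response.
    """
    requests = []
    for i, executor in enumerate(executors):
        req = {
            "jobName": f"{task_name}-exec-{i}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition_arn,
            "containerOverrides": {"command": list(executor["cmd"])},
        }
        if i > 0:
            # Replaced with the previous SubmitJob response's jobId.
            req["dependsOn"] = [{"jobId": f"<jobId of {task_name}-exec-{i - 1}>"}]
        requests.append(req)
    return requests
```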

There are other requirements specific to a TES API endpoint and we will cover those in turn as we cover the TES ontology tree. We will cover leaf concepts first, moving up the TES ontology to collection types afterwards.

Enumerations - FileType, TaskView, and State

FileType and TaskView have no analogue in AWS Batch. It would be up to a service provider to define how to interpret these concepts.

The State enumeration is used in a Task.state response and represents the current state of a submitted task. Since a single TES Task may contain an Executor vector, there is a disconnect with Batch: a JobState is returned per job in a DescribeJobs API response, within the JSON structure as jobs[].status, so the overall Task state must be computed as a function of that collection.

In addition to the above, AWS Batch has a different set of enumerations, with a clearly defined state transition.

+-----------+     +----------+     +----------+     +-----------+     +--------+
| SUBMITTED | --> | RUNNABLE | --> | STARTING | --> |  RUNNING  | --> | FAILED |
+-----------+     +----------+     +----------+     +-----------+     +--------+
      |               ^                                   |
      |               |                                   |
      v               |                                   v
+---------+           |                               +-----------+
| PENDING | ----------+                               | SUCCEEDED |
+---------+                                           +-----------+

The following table is a rough mapping from TES State to Batch JobState.

| TES State | Batch JobState | Note |
| --- | --- | --- |
| UNKNOWN | (none) | Possible to use for canceled jobs. |
| QUEUED | SUBMITTED, PENDING, or RUNNABLE | Could be any of these. See the state transition diagram. |
| INITIALIZING | STARTING | |
| RUNNING | RUNNING | |
| PAUSED | (none) | Jobs can only be canceled or terminated in Batch. |
| COMPLETE | SUCCEEDED | |
| ERROR | FAILED | |
| SYSTEM_ERROR | FAILED | |
| CANCELED | FAILED | CancelJob only sets this when the job has not progressed to the STARTING or RUNNING state; otherwise the job must be terminated using the TerminateJob API call. The reason for either cancellation or termination will be in the Batch job details under statusReason. |
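
A minimal sketch of computing an overall Task state from the jobs[].status values of a Task's job chain, following the mapping above. The aggregation rules (any FAILED means ERROR, and so on) are one possible convention, not something TES or Batch prescribes.

```python
# Per-job Batch JobState -> TES State, per the mapping table above.
BATCH_TO_TES = {
    "SUBMITTED": "QUEUED",
    "PENDING": "QUEUED",
    "RUNNABLE": "QUEUED",
    "STARTING": "INITIALIZING",
    "RUNNING": "RUNNING",
    "SUCCEEDED": "COMPLETE",
    "FAILED": "ERROR",
}

def task_state(job_statuses):
    """Collapse the jobs[].status values of a Task's jobs into one TES State."""
    if not job_statuses:
        return "UNKNOWN"
    if "FAILED" in job_statuses:
        # Could also be CANCELED; statusReason must be inspected to tell.
        return "ERROR"
    if all(s == "SUCCEEDED" for s in job_statuses):
        return "COMPLETE"
    if "RUNNING" in job_statuses:
        return "RUNNING"
    if "STARTING" in job_statuses:
        return "INITIALIZING"
    return "QUEUED"
```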

Ports

Not applicable to AWS Batch. Amazon ECS, which Batch is built on top of, does support port mappings for a container, but Batch does not expose this service feature.

Logs - TaskLog, ExecutorLog, OutputFileLog

The various *Log types in TES are spread across a few Batch types. The main difference between AWS Batch and TES is that TES is explicit about reporting where the runtime outputs would be using OutputFileLog, while Batch leaves the output handling up to the user to manage. In Batch, the STDOUT and STDERR of a Job are submitted to CloudWatch Logs. It is important to note that CloudWatch Logs have a configurable retention time, and that any TES service built on top of AWS Batch would need to account for how it wants to handle job data over the long term.

OutputFileLog

This is metadata associated with one of the output files produced by the full set of a Task's Executors. Batch leaves handling of output files to the process. For example, if you had a Job that runs BWA, the output would be a BAM file, and it would be up to the launched container's process to move that BAM file to some storage like S3. We will discuss this more when we discuss Task.

ExecutorLog

As the output from an individual Job, the ExecutorLog type has some direct analogues to a Batch Job description from status queries against the API.

| ExecutorLog attribute | Batch Job attribute | Note |
| --- | --- | --- |
| string start_time | jobs[].startedAt | |
| string end_time | jobs[].stoppedAt | |
| string stdout | (none) | Stored in CloudWatch Logs |
| string stderr | (none) | Stored in CloudWatch Logs |
| int32 exit_code | jobs[].exitCode | |
| string host_ip | (none) | Possibly available from containerInstanceArn information |
| repeated Ports ports | (none) | Not applicable for Batch |
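
As a sketch, projecting one jobs[] entry from a DescribeJobs response onto ExecutorLog fields might look like the following. Batch timestamps are epoch milliseconds, while TES expects time strings; this assumes RFC 3339 output is acceptable, and it leaves stdout/stderr unset since a real service would have to fetch them from the job's CloudWatch log stream.

```python
# Sketch: project a Batch DescribeJobs jobs[] entry onto TES ExecutorLog
# fields, per the table above. Hypothetical helper, not an AWS API.
from datetime import datetime, timezone

def to_executor_log(job):
    def ts(ms):
        # Batch reports startedAt/stoppedAt as epoch milliseconds.
        if ms is None:
            return None
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()

    container = job.get("container", {})
    return {
        "start_time": ts(job.get("startedAt")),
        "end_time": ts(job.get("stoppedAt")),
        # stdout/stderr live in CloudWatch Logs; a service would fetch the
        # job's log stream and return its contents or a URL here.
        "stdout": None,
        "stderr": None,
        "exit_code": container.get("exitCode"),
    }
```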

TaskLog

As an aggregator of ExecutorLog and OutputFileLog, this class has no real analogue in a Batch Job and must be computed.

| TaskLog attribute | AWS Batch computed value |
| --- | --- |
| repeated ExecutorLog logs[] | The set of values mapped to ExecutorLog |
| map<string, string> metadata | Useful items not reported elsewhere, such as jobs[].statusReason |
| string start_time | First Executor's jobs[].startedAt |
| string end_time | Last Executor's jobs[].stoppedAt |
| repeated OutputFileLog outputs | Problematic for a lot of reasons |
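
The computed values above can be sketched as a small assembly step over the per-job ExecutorLog dicts and raw jobs[] entries. The helper name and the choice to join statusReason strings are assumptions; outputs is left empty because, as noted, OutputFileLog would have to come from a service-level convention.

```python
# Sketch: assemble a TES TaskLog from per-job ExecutorLog dicts and the raw
# DescribeJobs entries, per the computed-value table above.
def to_task_log(executor_logs, jobs):
    return {
        "logs": executor_logs,
        # The Task spans from the first job's start to the last job's stop.
        "start_time": executor_logs[0]["start_time"] if executor_logs else None,
        "end_time": executor_logs[-1]["end_time"] if executor_logs else None,
        "metadata": {
            "statusReason": "; ".join(
                j["statusReason"] for j in jobs if "statusReason" in j
            )
        },
        # OutputFileLog entries require a service-level convention.
        "outputs": [],
    }
```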

Resources and TaskParameters

A Resources message maps closest to the JobDefinition container properties (which can be mostly overridden at Job runtime). A notable exception here is that volume sizes given to a container are handled at the level of a ComputeEnvironment, not at the individual job level.

| Resources attribute | JobDefinition attribute | Note |
| --- | --- | --- |
| uint32 cpu_cores | jobDefinition.containerProperties.vcpus | These are hyperthreaded cores |
| bool preemptible | (none) | Handled by virtue of which JobQueue the Job was submitted to |
| double ram_gb | jobDefinition.containerProperties.memory | Integer in MiB |
| double size_gb | (none) | A container's volumes and mountPoints properties would need to account for this |
| repeated string zones | (none) | Handled at the level of ComputeEnvironment and JobQueue |

Batch job parameters are simple key-value pairs that represent default values or parameter substitution placeholders, and they are defined within a JobDefinition. Parameters in a job submission request override any corresponding parameter defaults from the job definition. This is a big departure from the TES TaskParameter, which is meant to define file inputs and outputs for a set of operations. Any mapping would be subject to a lot of conventions and be specific to an implementation of TES on top of AWS Batch.

Executors

As mentioned, the simplest mapping of Job to a TES Task::Executor vector would be to encode the Executor vector as a set of serially dependent Jobs, each with a matching JobDefinition.

| Executor attribute | JobDefinition containerProperties | Notes |
| --- | --- | --- |
| string image_name | image | |
| repeated string cmd | command | |
| string workdir | mountPoints | Conventions needed |
| string stdin | mountPoints | Conventions needed |
| string stdout | mountPoints | Conventions needed |
| string stderr | mountPoints | Conventions needed |
| repeated Ports ports | (none) | Not applicable for Batch |
| map<string,string> environ | environment | Also a key-value array |
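
The direct rows of this table can be sketched as follows. The workdir/stdin/stdout/stderr fields need mount-point conventions and are omitted; environ translates into Batch's name/value array form. The vcpus/memory defaults are arbitrary placeholders, since Batch requires them but Executor does not carry them.

```python
# Sketch: map one TES Executor onto Batch containerProperties, per the
# table above. Hypothetical helper; defaults are placeholders.
def executor_to_container_properties(executor, vcpus=1, memory_mib=1024):
    return {
        "image": executor["image_name"],
        "command": list(executor["cmd"]),
        # Batch takes environment as an array of {name, value} objects.
        "environment": [
            {"name": k, "value": v}
            for k, v in sorted(executor.get("environ", {}).items())
        ],
        # Required by Batch but not present on Executor; see Resources above.
        "vcpus": vcpus,
        "memory": memory_mib,
    }
```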

Tasks

Not to beat a dead horse, but have I mentioned that Tasks do not map cleanly to Batch? Below I've made my best attempt to do so, but this is worth a discussion.

| Task attribute | JobDefinition or Job attribute | Notes |
| --- | --- | --- |
| string id | jobs[].jobId | |
| State state | jobs[].status | |
| string name | jobs[].jobName | |
| string project | (none) | Could utilize JobQueue by convention |
| string description | (none) | No use in Batch |
| repeated TaskParameter inputs | jobDefinition.parameters | Conventions needed |
| repeated TaskParameter outputs | jobDefinition.parameters | Conventions needed |
| Resources resources | (none) | See mapping above |
| repeated Executor executors | (none) | See mapping above |
| repeated string volumes | jobDefinition.containerProperties volumes and mountPoints | Conventions needed |
| map<string, string> tags | (none) | Not used |
| repeated TaskLog logs | (none) | See mapping above |

API differences

Querying Tasks

Both of Batch's API requests for job information (DescribeJobs and ListJobs) return an array of results. You can give an array of jobId values as a filtering parameter, but the result is still an array of job information even when only one jobId is given.

Batch also does not support query of jobs by their jobName, while TES allows for defining a prefix to search on. This would have to be handled outside of Batch.
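
Handling this outside of Batch means paging through ListJobs results and filtering on the name prefix client-side. A minimal sketch, where `list_jobs_page` stands in for a boto3 `batch.list_jobs` call returning a dict with `jobSummaryList` and an optional `nextToken`:

```python
# Sketch: client-side jobName prefix filtering over paginated ListJobs
# results, since Batch itself cannot filter by name.
def find_jobs_by_prefix(list_jobs_page, prefix):
    matches, token = [], None
    while True:
        page = list_jobs_page(next_token=token)
        matches += [
            j for j in page["jobSummaryList"] if j["jobName"].startswith(prefix)
        ]
        token = page.get("nextToken")
        if not token:
            return matches
```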

Canceling Tasks

AWS Batch differentiates between canceling and terminating a Job.

A Batch CancelJob request will cancel a Job in the PENDING or RUNNABLE states, but will be a no-op for jobs that have entered the STARTING or RUNNING states. For the latter, Batch requires an explicit TerminateJob request to be issued on a job. In the case of a successful cancel or termination of a job, Batch will set the job detail status to FAILED, and the statusReason to the provided reason given to the CancelJob or TerminateJob API call.
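
A TES CancelTask implementation therefore has to pick the right call per job state. A minimal sketch of that decision, assuming the states in the transition diagram above:

```python
# Sketch: choose the Batch API call that cancels a job in a given state.
# CancelJob is a no-op once a job reaches STARTING or RUNNING.
def cancel_action(job_status):
    if job_status in ("SUBMITTED", "PENDING", "RUNNABLE"):
        return "CancelJob"
    if job_status in ("STARTING", "RUNNING"):
        return "TerminateJob"
    return None  # SUCCEEDED / FAILED: nothing to do
```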

Footnotes

  1. Alternatively, one could implement a system where Batch only receives a request to run a TES Task launcher container with privileged access and enough resource allocation for the serial Executors, but this is getting ahead of ourselves.

@briandoconnor

This is great Angel!!!

@miachamp

Additional information/work-arounds for TES-to-AWS Batch

  1. Is it possible to get port mappings in AWS Batch (also related to 'repeated Ports ports')?

Expose them via the Dockerfile, or retrieve them from ALB logs and then query the dynamic port assignment from the instance itself (over SSH). This is not the same as API-driven assignment and query, but a feature request has been submitted to the AWS Batch team.

  2. Is there a way to force a 'pause' or push to a 'pending' state in AWS Batch?

The dependsOn parameter of the SubmitJob API can be used to specify inter-job dependencies.  This will cause AWS Batch to delay starting the downstream job until the preceding job has completed.

  3. Is the string host_ip available from the containerInstanceArn?

To determine the host_ip, you can use the container instance ARN associated with the job:
 
$ aws batch describe-jobs --jobs 224aa9a9-e426-4a55-b057-529dc9e7139a | grep containerInstanceArn
                "containerInstanceArn": "arn:aws:ecs:us-east-1:493731438004:container-instance/685d4bc8-7c66-4b8c-bff6-3fe59f4bc416",
                        "containerInstanceArn": "arn:aws:ecs:us-east-1:493731438004:container-instance/685d4bc8-7c66-4b8c-bff6-3fe59f4bc416",
 
$ aws ecs describe-container-instances --cluster F1CE_Batch_c3d69b6f-1fe2-38a9-919d-3a24e82baa8b --container-instances 26413633-9f14-4474-94ee-6a4e38a6b219 | grep ec2InstanceId
            "ec2InstanceId": "i-064adceb383673be1",
 
$ aws ec2 describe-instances --instance-ids i-064adceb383673be1 | grep PublicIpAddress
                    "PublicIpAddress": "52.91.79.130",
 
$ aws ec2 describe-instances --instance-ids i-064adceb383673be1 | grep PrivateIpAddress
                    "PrivateIpAddress": "10.0.1.20",
                            "PrivateIpAddresses": [
                                    "PrivateIpAddress": "10.0.1.20"
                            "PrivateIpAddress": "10.0.1.20"

  4. Is there an equivalent or workaround for 'repeated OutputFileLog outputs' and retrieving the following:
    a) string workdir
    b) string stdin
    c) string stdout
    d) string stderr

With the Job Id, Job Name, and Task ARN, you are able to point to the exact CloudWatch log stream for information on output stream(s)/files, the workdir path, stdin, stdout, and any stderr.

  5. Does Batch support querying jobs by their jobName? What about mapping Tasks to Batch (i.e. string project, string description, map<string,string> tags)?

Use DescribeJobs (there are also DescribeComputeEnvironments, DescribeJobDefinitions, and DescribeJobQueues):

[ec2-user@ip-172-31-53-131 ~]$ aws batch describe-jobs --jobs 279ba4c6-393e-4d1b-9e64-d5bada3d6ec9
{
    "jobs": [
        {
            "status": "RUNNABLE", 
            "container": {
                "mountPoints": [], 
                "image": "508922263819.dkr.ecr.us-east-1.amazonaws.com/wgetfetch_and_run", 
                "environment": [
                    {
                        "name": "BATCH_FILE_S3_URL", 
                        "value": "s3://mybatchjobs-scripts-mc/DemoBashCopy.sh"
                    }, 
                    {
                        "name": "BATCH_FILE_TYPE", 
                        "value": "script"
                    }
                ], 
                "vcpus": 8, 
                "jobRoleArn": "arn:aws:iam::508922263819:role/batchJobRole", 
                "volumes": [], 
                "memory": 128, 
                "command": [
                    "DemoBashCopy.sh", 
                    "120"
                ], 
                "ulimits": []
            }, 
            "parameters": {}, 
            "jobDefinition": "arn:aws:batch:us-east-1:508922263819:job-definition/BowtieDemo:5", 
            "jobQueue": "arn:aws:batch:us-east-1:508922263819:job-queue/SimpleGenomicsDemo", 
            "jobId": "279ba4c6-393e-4d1b-9e64-d5bada3d6ec9", 
            "dependsOn": [], 
            "jobName": "myTest001", 
            "createdAt": 1493314175709
        }
    ]
}

@buchanae

buchanae commented Jun 8, 2017

If I understand correctly, when calling batch.ListJobs without a JobStatus filter, the default is to list only jobs in the RUNNING state. Is that right? Or is it both STARTING and RUNNING?

This presents a small hurdle for implementing tes.ListTasks, where I'll need to call batch.ListJobs multiple times (for each status) and manage the pagination somehow. I'm also only focusing on a single batch.JobQueue right now, but if (when) there were multiple, I'd need more batch.ListJobs calls.
