AWS Batch Tips and Tricks
AWS Batch is organized roughly like this:
Compute Environments > Job Queues ⊥ Job Definitions > Jobs
That is, compute environments and job queues are configured independently of job definitions and jobs. Still, if you're going to create a job, you're going to need a job definition, and if you're going to create a job queue, it's going to need a compute environment to run in.
Job Definitions > Jobs
Job definitions define the parameters for individual jobs (the "batch" in AWS Batch generally consists of multiple jobs). This is where you specify the Docker image that will be run for each job, each job's vCPU and memory requirements, ulimits, parameters and environment variables, and the actual command that the image should run. The command is the same idea as CMD in a Dockerfile; in fact, it overrides any CMD that was specified when the image was built. If the job command takes parameters, each one is written as a Ref::name placeholder, like this:
python down_sample_re_count.py Ref::input Ref::output
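To make the placeholder syntax concrete, here is a minimal Python sketch of the substitution, assuming each Ref::name token in the command is replaced by the value of the parameter with that name when the job is submitted. Batch performs this step itself; resolve_refs is a hypothetical helper written only for illustration, and the parameter values are the ones used in the submit example in this article.

```python
import re

# Hypothetical helper (not part of any AWS SDK): mimics how Ref::name
# placeholders in a job definition's command are filled from job parameters.
_REF = re.compile(r"Ref::(\w+)")


def resolve_refs(command, parameters):
    """Return a copy of command with each Ref::name replaced by parameters[name].

    In this sketch, a placeholder with no matching parameter is left untouched.
    """
    def substitute(token):
        return _REF.sub(lambda m: parameters.get(m.group(1), m.group(0)), token)

    return [substitute(token) for token in command]


command = ["python", "down_sample_re_count.py", "Ref::input", "Ref::output"]
resolved = resolve_refs(command, {"input": "syn8540852", "output": "syn11614902"})
print(" ".join(resolved))
```

Because the placeholders live in the job definition, the same definition can be reused for every input and output pair.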
There are options to specify volumes here, but I managed my volumes when configuring the compute environment instead.
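For anyone who prefers scripting over the web UI, the settings described in this section could be expressed roughly as follows. This is a sketch rather than the setup I actually used: it assumes the boto3 library is installed and credentials are configured, and the image name, vCPU count, and memory size are made-up placeholders. Only build_job_definition runs here; register is defined but not called.

```python
def build_job_definition(name, image, vcpus, memory_mib, command):
    """Build keyword arguments for the Batch register_job_definition call."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            # Resource values are passed as strings in this API.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
            # Overrides any CMD baked into the image.
            "command": command,
        },
    }


def register(job_definition):
    """Send the definition to AWS Batch (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    return boto3.client("batch").register_job_definition(**job_definition)


job_definition = build_job_definition(
    name="test",
    image="example/rnaseq:latest",  # hypothetical image name
    vcpus=4,                        # assumed value
    memory_mib=16384,               # assumed value
    command=["python", "down_sample_re_count.py", "Ref::input", "Ref::output"],
)
print(job_definition)
```

Each call to register with the same name creates a new revision of the job definition, which is where a suffix such as test:6 comes from.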
Jobs are generally submitted via the CLI (aws batch submit-job …) rather than the web UI, since we often want to submit dozens of jobs at once. (Job definitions, job queues, and compute environments are typically one-off things that can be configured within the web UI and left alone afterwards.) The only parameters usually specified when submitting a job are a name, a job definition, a job queue, and whatever the job accepts as input, since the job definition takes care of all the execution details. Job definition parameters can still be overridden when submitting individual jobs if necessary.
With the command from the Job Definitions section above, submitting a job might look like this:
aws batch submit-job --job-name "bigBam" --job-queue "rnaseq" --job-definition test:6 --parameters input=syn8540852,output=syn11614902
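Since the point of the CLI is to submit many jobs at once, the same submission can be looped over a list of inputs. The sketch below assumes boto3 and configured credentials for the actual submission; the second input ID and the numbered job names are invented placeholders. Only the request-building part runs here, and submit_all is defined but not called.

```python
def build_submissions(pairs, job_queue, job_definition):
    """Return one submit_job keyword-argument dict per (input, output) pair."""
    submissions = []
    for index, (input_id, output_id) in enumerate(pairs):
        submissions.append({
            "jobName": f"bigBam-{index}",  # hypothetical naming scheme
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            # These fill the Ref::input and Ref::output placeholders.
            "parameters": {"input": input_id, "output": output_id},
        })
    return submissions


def submit_all(submissions):
    """Submit every job and return the job IDs (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    batch = boto3.client("batch")
    return [batch.submit_job(**kwargs)["jobId"] for kwargs in submissions]


pairs = [
    ("syn8540852", "syn11614902"),
    ("syn0000001", "syn11614902"),  # placeholder ID, not a real dataset
]
submissions = build_submissions(pairs, job_queue="rnaseq", job_definition="test:6")
for kwargs in submissions:
    print(kwargs["jobName"], kwargs["parameters"])
```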
Compute Environments > Job Queues
Setting up a proper compute environment is where things get tricky. The compute environment type (managed), service role, and instance role will work out of the box if left at their defaults. It's usually unnecessary to modify the minimum, desired, and maximum vCPU settings when working in a managed compute environment (AWS will scale these for us when it sees jobs waiting to be run).
Since AWS Batch is built on top of other AWS services, chiefly AWS Elastic Container Service (ECS) and AWS EC2, we have to keep in mind some default constraints imposed by those services when submitting Batch jobs. Getting around these limits often means running commands on the host instance or modifying its hardware configuration before any "Batch" stuff can occur. I will therefore assume that you are creating your own custom AMI, so that the more inconvenient default limits can be changed and the new settings baked into the AMI that all of our jobs will run on.
The recommended base AMI to use when creating your custom AMI is here.
Of course, you will first want to
sudo yum -y update
A "gotcha" to look out for is where Docker stores its volumes. By default, volumes are stored on the root volume (/dev/xvda), which defaults to 8GB in size. You might need to increase this size when launching the instance that your custom AMI will be created from. Images and containers are stored on /dev/xvdcz; a default 22GB /dev/xvdcz is mounted when using one of the recommended base AMIs above.
Docker container volumes are automatically constrained by ECS to be 10GB in size, which is not very helpful when working with 20GB RNAseq data. Instructions for increasing the limit are here, though I had difficulty getting the cloud-boothook command to work and found it easier to manually edit /etc/sysconfig/docker and add dm.basesize=50G (or however much storage you need for each container) to the Docker options set there.
Detailed ECS Docker storage information is here.
ECS has some other parameters, set in /etc/ecs/ecs.config, to be aware of (list here). The ones I found relevant were ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, which by default waits three hours after a task stops before cleaning up its containers; ECS_IMAGE_CLEANUP_INTERVAL, default 30 minutes; ECS_IMAGE_MINIMUM_CLEANUP_AGE, default one hour; and ECS_NUM_IMAGES_DELETE_PER_CYCLE, default 5.
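The four settings above are plain KEY=value lines in ecs.config. As an illustration, the following sketch renders those lines from the defaults listed above plus any overrides. The render_ecs_config helper and the 30-minute override are assumptions chosen for the example, not recommended values.

```python
# Defaults as listed above; durations use the agent's suffix format (m, h).
ECS_CLEANUP_DEFAULTS = {
    "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": "3h",
    "ECS_IMAGE_CLEANUP_INTERVAL": "30m",
    "ECS_IMAGE_MINIMUM_CLEANUP_AGE": "1h",
    "ECS_NUM_IMAGES_DELETE_PER_CYCLE": "5",
}


def render_ecs_config(overrides):
    """Return ecs.config text for the cleanup settings with overrides applied."""
    unknown = set(overrides) - set(ECS_CLEANUP_DEFAULTS)
    if unknown:
        # Guard against typos in setting names.
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    settings = {**ECS_CLEANUP_DEFAULTS, **overrides}
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Example: remove stopped containers after 30 minutes (an assumed value).
config_text = render_ecs_config({"ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": "30m"})
print(config_text, end="")
```

Shortening the container cleanup wait matters most when each job leaves behind tens of gigabytes of container storage on the host.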
Once you have configured your instance, don't forget to stop the ECS agent and remove its saved state file before creating the AMI:
sudo stop ecs
sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json
The documentation for creating a custom AMI for use with Batch is here.
This article was very relevant to the specific task I had to use Batch for. It's written in four parts, but I've linked to the specific part that involves Batch.
Once we've done all that, we can finally check the box "Enable user-specified AMI ID" within the AWS Batch web UI when creating our compute environment and pass it the ID of our custom AMI.
There is nothing special to do with VPC or security groups before clicking "Create".
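The same compute environment could also be created from a script instead of the web UI. The sketch below shows roughly what the request might look like with boto3, assuming the library is installed and credentials are configured. Every identifier in it (environment name, AMI ID, subnet, security group, role names, vCPU ceiling) is a placeholder to be replaced with your own values. Only the request-building part runs here; create is defined but not called.

```python
def build_compute_environment(name, ami_id, subnets, security_group_ids,
                              instance_role, service_role, max_vcpus=256):
    """Build keyword arguments for the Batch create_compute_environment call."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",  # let AWS scale instances as jobs queue up
        "state": "ENABLED",
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],
            "imageId": ami_id,  # the custom AMI built earlier
            "subnets": subnets,
            "securityGroupIds": security_group_ids,
            "instanceRole": instance_role,
        },
        "serviceRole": service_role,
    }


def create(compute_environment):
    """Create the environment in AWS Batch (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    return boto3.client("batch").create_compute_environment(**compute_environment)


compute_environment = build_compute_environment(
    name="rnaseq-env",                           # hypothetical name
    ami_id="ami-0123456789abcdef0",              # placeholder AMI ID
    subnets=["subnet-0123456789abcdef0"],        # placeholder
    security_group_ids=["sg-0123456789abcdef0"],  # placeholder
    instance_role="ecsInstanceRole",             # assumed role name
    service_role="AWSBatchServiceRole",          # assumed role name
)
print(compute_environment["computeResources"]["imageId"])
```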
After all that hard work, it's nice that we now get to do something easy. Job queues just need a name, priority (relative to other job queues), and a compute environment.
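For completeness, here is what those three settings might look like as a scripted request, again assuming boto3 and configured credentials. The queue name matches the one used in the submit example in this article; the compute environment name and the priority value are placeholders. Only the request-building part runs here; create_queue is defined but not called.

```python
def build_job_queue(name, priority, compute_environment):
    """Build keyword arguments for the Batch create_job_queue call."""
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,  # relative to other job queues
        "computeEnvironmentOrder": [
            # A queue can list several environments; this sketch uses one.
            {"order": 1, "computeEnvironment": compute_environment},
        ],
    }


def create_queue(job_queue):
    """Create the queue in AWS Batch (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    return boto3.client("batch").create_job_queue(**job_queue)


job_queue = build_job_queue(
    name="rnaseq",
    priority=1,                        # assumed value
    compute_environment="rnaseq-env",  # hypothetical environment name
)
print(job_queue)
```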