AWS Spot is an AWS service that provides EC2 capacity at significant discounts. Savings of up to 80-90% versus On-Demand pricing can be achieved using Spot.
Spot is an active market: you specify the maximum amount you are willing to pay (your bid), and the price you actually pay is whatever the current market price is. If the market price goes above your bid price, your instance will be terminated. AWS provides a 2-minute warning before terminating your instance. It is therefore important to design your overall architecture around the possibility that instances can terminate at any given time.
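The bid mechanics described above can be illustrated with a small simulation. This is a simplified sketch, not AWS's billing logic: it assumes one market price per hour, and the function name `simulate_spot` and the price series are invented for illustration.

```python
from typing import List, Optional, Tuple


def simulate_spot(bid: float, hourly_prices: List[float]) -> Tuple[float, Optional[int]]:
    """Return (total_cost, hour_terminated) for a bid against a price series.

    Simplified model: the instance is charged the market price (never the
    bid) for each hour it runs, and is terminated in the first hour in which
    the market price exceeds the bid. hour_terminated is None if it survives.
    """
    total = 0.0
    for hour, price in enumerate(hourly_prices):
        if price > bid:
            return total, hour
        total += price
    return total, None


# Hypothetical price series: the spike in hour 2 exceeds the bid of 0.10.
cost, terminated_at = simulate_spot(bid=0.10, hourly_prices=[0.03, 0.04, 0.12, 0.03])
print(f"paid {cost:.2f}, terminated at hour {terminated_at}")
```

Note that a higher bid does not raise the price paid in the quiet hours; it only raises the threshold at which the instance is lost.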
Spot capacity (and therefore its price) varies per instance type and per AZ it operates in. It is therefore important to understand the historic pricing of the instance types you are interested in, in the specific AZ you want to run them in. Useful tools for finding this data include the Spot Bid Advisor, Spot Pricing History, and spot price history data on the command line.
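As a rough sketch of how price history might be compared across pools, the snippet below groups records by instance type and AZ. The record fields (`InstanceType`, `AvailabilityZone`, `SpotPrice` as a string) follow the shape of the EC2 spot price history API response; the sample values and the function name `summarise_spot_history` are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise_spot_history(records):
    """Group price records by (instance type, AZ) and report min/mean/max."""
    prices = defaultdict(list)
    for rec in records:
        pool = (rec["InstanceType"], rec["AvailabilityZone"])
        prices[pool].append(float(rec["SpotPrice"]))  # prices arrive as strings
    return {
        pool: {"min": min(vals), "mean": mean(vals), "max": max(vals)}
        for pool, vals in prices.items()
    }


# Invented sample data: same instance type, two AZs, one of which has spiked.
sample_history = [
    {"InstanceType": "m4.large", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.0300"},
    {"InstanceType": "m4.large", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.2500"},
    {"InstanceType": "m4.large", "AvailabilityZone": "us-east-1b", "SpotPrice": "0.0310"},
    {"InstanceType": "m4.large", "AvailabilityZone": "us-east-1b", "SpotPrice": "0.0330"},
]

summary = summarise_spot_history(sample_history)
for pool, stats in sorted(summary.items()):
    print(pool, stats)
```

Looking at the maximum as well as the mean helps to spot pools that are cheap most of the time but prone to spikes.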
Finally, it is important to understand that Spot prices can (and do) go above On-Demand pricing. The market is not limited in any way by the standard On-Demand prices for EC2 instances, so if supply is short, prices can spike significantly.
AWS Spot provides the ability to bid for one or more instances. Spot Fleet provides the ability to spin up and maintain a fleet of instances across a number of AZs and instance types.
Spot Fleet provides a diversified option, allowing a fleet of EC2 instances to be balanced across AZs and instance families. This greatly reduces the risk of a large portion of your EC2 capacity being terminated at the same time.
Spot Fleet can be autoscaled, similar to standard On-Demand Auto Scaling groups.
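To make the diversified option concrete, here is a hedged sketch that builds a Spot Fleet request configuration with one launch specification per instance type and subnet combination. The field names follow the EC2 Spot Fleet request structure; the helper `build_fleet_config` and every ID, ARN, and price below are placeholders, not values from a real account.

```python
from itertools import product


def build_fleet_config(target_capacity, instance_types, subnet_ids,
                       image_id, fleet_role_arn, max_price):
    """Build a Spot Fleet config that spreads capacity over every
    (instance type, subnet) pair, i.e. one capacity pool per pair."""
    launch_specs = [
        {
            "ImageId": image_id,
            "InstanceType": itype,
            "SubnetId": subnet,      # each subnet lives in a single AZ
            "SpotPrice": max_price,  # bid per instance hour, as a string
        }
        for itype, subnet in product(instance_types, subnet_ids)
    ]
    return {
        "IamFleetRole": fleet_role_arn,
        "TargetCapacity": target_capacity,
        "AllocationStrategy": "diversified",
        "LaunchSpecifications": launch_specs,
    }


# Placeholder values for illustration only.
config = build_fleet_config(
    target_capacity=4,
    instance_types=["m4.large", "c4.large"],
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    image_id="ami-00000000",
    fleet_role_arn="arn:aws:iam::123456789012:role/example-fleet-role",
    max_price="0.10",
)
print(len(config["LaunchSpecifications"]), "capacity pools")
```

A dictionary like this could then be passed as the `SpotFleetRequestConfig` argument of the boto3 EC2 client's `request_spot_fleet` call.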
Spot provides super cheap EC2 resources, with the caveat that your workload must be easy to terminate and move to a different EC2 instance (in a different AZ, or perhaps a different instance family). The workload therefore has to be fairly ephemeral, or at least not maintain state. Ideally it should also not store anything local to the EC2 instance itself.
Containerization (this one, not the freight one) is a pretty good match for the above constraints. A workload in a container is isolated from the underlying EC2 instance, and hence can easily be moved around between instances. Containerized apps typically try to adhere to the twelve-factor principles. Of specific interest is the disposability principle, which mandates that an app's processes should be capable of being started and stopped at a moment's notice. This fits beautifully with Spot and its 2-minute termination window!
Spot Fleet provides the mechanism to distribute capacity across a number of capacity pools.
Spot Fleet provides the mechanism to scale the fleet, by increasing or decreasing the TargetCapacity of the fleet based on any given CloudWatch metric/alarm.
Spot Fleet reacts to terminated capacity by re-bidding for new capacity once an instance has been terminated by Spot.
However, Spot Fleet does not get an advance warning of the EC2 termination. It is therefore better to add a termination hook on the instance itself, so that as soon as the instance is notified of the termination event, a trigger (an SNS topic) can call a Lambda function to automatically increase the TargetCapacity of the fleet.
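A minimal sketch of such a Lambda function is shown below. It raises the fleet's `TargetCapacity` (the Spot Fleet request field that holds the desired fleet size) by one for each notification. The `Records[].Sns.Message` envelope is the standard SNS-to-Lambda event shape, and `describe_spot_fleet_requests` and `modify_spot_fleet_request` are boto3 EC2 client calls; the message payload with a `fleet_id` key is an assumption about what the termination hook publishes, and the optional `ec2` parameter exists only so a stub client can be supplied.

```python
import json


def bump_capacity(ec2, fleet_id, step=1):
    """Read the fleet's current TargetCapacity and raise it by `step`."""
    resp = ec2.describe_spot_fleet_requests(SpotFleetRequestIds=[fleet_id])
    fleet = resp["SpotFleetRequestConfigs"][0]["SpotFleetRequestConfig"]
    new_target = fleet["TargetCapacity"] + step
    ec2.modify_spot_fleet_request(SpotFleetRequestId=fleet_id,
                                  TargetCapacity=new_target)
    return new_target


def handler(event, context, ec2=None):
    """Lambda entry point for SNS notifications sent by the termination hook."""
    if ec2 is None:
        import boto3  # imported lazily; available in the Lambda runtime
        ec2 = boto3.client("ec2")
    new_targets = []
    for record in event["Records"]:
        # Assumed payload: {"fleet_id": "sfr-..."} published by the hook.
        message = json.loads(record["Sns"]["Message"])
        new_targets.append(bump_capacity(ec2, message["fleet_id"]))
    return new_targets
```

Because the replacement is requested as soon as the warning arrives, new capacity can be starting up while the doomed instance is still running.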
The AWS EC2 Container Service (ECS) is a free container management service. It is a managed service in which AWS takes care of task and service management, while a customer-managed fleet of EC2 instances operates as worker nodes, ready to run tasks and services as instructed by ECS.
ECS is feature rich; highlights include:
- Service load balancing with ALB with dynamic port allocation
- Service level autoscaling of containers
- IAM Roles for tasks and containers
- Container Instance Draining
ECS tasks encapsulate the Docker image, parameters, ports, IAM roles, data volumes and resource allocations (CPU and memory).
An ECS Service defines the number of tasks that must run simultaneously within the cluster to form the "service". It also defines the ALB (and its Target Group) to associate the service with.
ECS automatically manages the registration of each task (within a service) with its associated ALB target group.
ECS has an ephemeral port range on each underlying EC2 instance. A random port in this range is allocated to a running task and is automatically mapped to the static port within the container.
The port on the EC2 instance is then registered against the ALB target group. The ALB will distribute traffic to all registered targets.
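The dynamic port mapping described above is requested in the task's container definition by setting the host port to 0. The fragment below is a sketch using the ECS field names (`portMappings`, `containerPort`, `hostPort`, and the service's `loadBalancers` entry); the container name, image, target group ARN, and the helper `service_load_balancer` are placeholders for illustration.

```python
# Container definition fragment: hostPort 0 asks ECS to pick a free port
# from the instance's ephemeral range and map it to containerPort.
container_definition = {
    "name": "web",            # placeholder container name
    "image": "nginx:latest",  # placeholder image
    "memory": 128,
    "portMappings": [
        {"containerPort": 80, "hostPort": 0, "protocol": "tcp"},
    ],
}


def service_load_balancer(target_group_arn, container):
    """Build the loadBalancers entry that ties a service to an ALB target
    group, pointing at the container's static (in-container) port."""
    return {
        "targetGroupArn": target_group_arn,
        "containerName": container["name"],
        "containerPort": container["portMappings"][0]["containerPort"],
    }


lb_entry = service_load_balancer(
    "arn:aws:elasticloadbalancing:region:123456789012:targetgroup/example/0000",
    container_definition,
)
print(lb_entry)
```

Because the host port is chosen per task, several copies of the same task can run on one EC2 instance without port clashes.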
ECS supports the ability for the underlying EC2 instances to be drained and removed from service.
Once a container instance has been set to DRAINING, no new tasks will be started on the underlying instance, and existing running tasks will be moved to new container instances running on alternative EC2 instances.
Instance draining is a superb way of dealing with Spot EC2 termination notifications. It is relatively trivial to wire up a Lambda function that automatically sets the EC2 instance into a draining state as soon as the 2-minute warning is given. This allows ECS to proactively move running tasks to other container instances before the EC2 instance is terminated.
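The draining step such a Lambda function might perform can be sketched as follows. It uses the boto3 ECS client calls `list_container_instances`, `describe_container_instances`, and `update_container_instances_state`; the function names, the way the cluster name and EC2 instance ID reach the function, and the single-page lookup (fine for small clusters, no pagination) are assumptions of this sketch.

```python
def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance ARN backing an EC2 instance,
    or None if the instance is not registered in the cluster."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(cluster=cluster,
                                                 containerInstances=arns)
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance["containerInstanceArn"]
    return None


def drain_instance(ecs, cluster, ec2_instance_id):
    """Set the matching container instance to DRAINING.

    Returns True if a container instance was found and updated.
    """
    arn = find_container_instance(ecs, cluster, ec2_instance_id)
    if arn is None:
        return False
    ecs.update_container_instances_state(cluster=cluster,
                                         containerInstances=[arn],
                                         status="DRAINING")
    return True
```

In a Lambda function, `ecs` would be created with `boto3.client("ecs")` and `drain_instance` called with the ID of the instance that received the termination warning.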
ECS provides a standard task/service scheduler and an open-source task scheduler that gives more control over how tasks are scheduled onto the underlying EC2 worker instances.
Different task placement strategies are provided that allow for varying distributions. In addition, task placement constraints allow specific criteria to be considered before a task is placed on a given EC2 container instance.
The overall ECS container fleet can easily have a mix of instances of different types, cost strategies (On-Demand and Spot) and capabilities. It is then possible to define task scheduling criteria to ensure things like:
- High-value tasks are only placed on On-Demand instances
- Tasks requiring specific hardware (GPUs etc.) are placed only on instances that have it
- Tasks with minimal CPU requirements are placed on cheaper burstable EC2 instance types
ECS provides a large amount of flexibility in its task scheduling. Normal operation does not require in-depth knowledge of this, but it is important to understand what is possible if and when a specific need arises.
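As an illustration, the three scheduling criteria listed above could be expressed as ECS placement constraints along these lines. The `memberOf` constraint type, the built-in `ecs.instance-type` and `ecs.availability-zone` attributes, and the `spread` strategy are part of ECS task placement; the custom `lifecycle` attribute is hypothetical and would have to be set on each container instance by you, and the p2/t2 family patterns are just example choices.

```python
# Placement constraints keyed by the kind of task they are meant for.
PLACEMENT_CONSTRAINTS = {
    # Hypothetical custom attribute "lifecycle" set to "ondemand" or "spot"
    # on each container instance when it registers with the cluster.
    "high_value": {"type": "memberOf",
                   "expression": "attribute:lifecycle == ondemand"},
    # Example GPU family pattern matched against the built-in attribute.
    "gpu": {"type": "memberOf",
            "expression": "attribute:ecs.instance-type =~ p2.*"},
    # Example burstable family pattern.
    "low_cpu": {"type": "memberOf",
                "expression": "attribute:ecs.instance-type =~ t2.*"},
}

# Strategy that spreads tasks evenly across availability zones.
SPREAD_ACROSS_AZS = {"type": "spread",
                     "field": "attribute:ecs.availability-zone"}


def placement_for(kind):
    """Return placement arguments for a task of the given kind."""
    return {
        "placementConstraints": [PLACEMENT_CONSTRAINTS[kind]],
        "placementStrategy": [SPREAD_ACROSS_AZS],
    }


print(placement_for("high_value"))
```

The returned dictionary has the same keys that the ECS `run_task` and `create_service` calls accept for placement, so it could be merged into the arguments of either.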