Adaptive Container Scaling

A description of the ECS scaling solution developed for Hudl.

Problems with AWS's recommended scaling method

  • Based on a fixed percentage of memory and CPU reservation.
  • Scaling on both CPU and memory metrics can cause the two to fight each other: the cluster scales out, then back in, and repeats.
  • If your scaling threshold is 80% memory, current reservation is at 79%, and you deploy a container that requires 25% of an instance's memory, it will fail to launch (see the sketch after this list).
  • ASGs are not container-aware.
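
A tiny numeric sketch of that placement failure, using made-up numbers (a 16 GiB instance; the variable names are ours):

```python
# Hypothetical numbers: one 16 GiB container instance at 79% memory reservation.
instance_memory_mib = 16384
reserved_pct = 79                                  # below the 80% alarm threshold
container_mib = int(0.25 * instance_memory_mib)    # new task wants 25% -> 4096 MiB

remaining_mib = instance_memory_mib * (100 - reserved_pct) // 100
print(remaining_mib, container_mib)  # 3440 < 4096: the task cannot be placed,
                                     # yet the 80% alarm never fired, so no scale-out
```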

Solution

  • Determine the largest capacity needed to launch a container: the 'largest container' that might be launched by ECS service scaling.

    • Based on the most CPU units and the most memory reserved by any single container.
  • Get the number of these 'largest containers', or 'slots', available on each host in the cluster.

    • Use the smaller of the two counts, CPU-based and memory-based.
      • So if an instance can run 5 of the largest containers based on CPU units and 7 based on memory, count that instance as 5.
    • Emit a custom CloudWatch metric, 'available slots': the sum of how many of the largest containers the cluster can still run (sketched in code after this list).
  • Leave room for x 'largest containers'; this should be customizable. We use 3 because most services run 3 copies of each container, which gives enough room to deploy.

    • Scaling policies are used to control when it should scale in or out.
    • The smaller it is, the more frequently you'll need to scale out during a deploy or scaling event, slowing those down.
    • The larger it is, the more money is being spent on wasted resources.
  • (not yet implemented) Scaling in needs to look at the number of available slots on each ECS instance and not scale in if doing so would just trigger a scale-out again; otherwise you can get into a scaling loop. Hosts might have different capacities, so while one host may not be safe to scale in, another might be.

    • Scaling on the 'available slots' metric wouldn't work once this is introduced.
    • For this I was going to send CloudWatch a 1 for steady state, a 0 for 'should scale up', and a 2 for 'should scale down' (see the second sketch below).
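
A minimal sketch of the slot computation and metric emission, assuming boto3; the cluster name, the 'largest container' sizes, and the metric namespace and name are placeholders, and pagination of the instance list is ignored for brevity:

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-cluster"   # placeholder cluster name
LARGEST_CPU = 1024       # CPU units of the biggest task that may launch
LARGEST_MEM = 2048       # memory (MiB) of the biggest task that may launch

def available_slots(cluster):
    """Sum, across all container instances, how many copies of the
    'largest container' each instance could still fit."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return 0
    total = 0
    instances = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns)["containerInstances"]
    for inst in instances:
        remaining = {r["name"]: r.get("integerValue", 0)
                     for r in inst["remainingResources"]}
        # each instance contributes the smaller of its CPU- and memory-based counts
        total += min(remaining.get("CPU", 0) // LARGEST_CPU,
                     remaining.get("MEMORY", 0) // LARGEST_MEM)
    return total

cloudwatch.put_metric_data(
    Namespace="ECS/Capacity",  # placeholder namespace
    MetricData=[{"MetricName": "AvailableSlots",
                 "Dimensions": [{"Name": "ClusterName", "Value": CLUSTER}],
                 "Value": available_slots(CLUSTER),
                 "Unit": "Count"}])
```

The ASG's scaling policies can then alarm on this metric: scale out when it drops below the headroom value, scale in when it sits well above it.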

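And a sketch of the single steady-state metric from the last bullet; instance_is_drainable() is a hypothetical stub standing in for the not-yet-implemented check:

```python
def instance_is_drainable():
    # Hypothetical placeholder: a real check would verify that some
    # instance's containers fit on the remaining hosts before draining it.
    return False

def desired_action(slots, headroom=3):
    """Map cluster state to the metric: 0 = 'should scale up',
    1 = steady state, 2 = 'should scale down'."""
    if slots < headroom:
        return 0
    if slots > headroom and instance_is_drainable():
        return 2
    return 1
```
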
Shortcomings of this implementation

  • ASGs are not ECS-aware. They usually terminate the oldest instance, which typically has the most containers, resulting in several container migrations. If they terminated the host with the fewest containers, that would be handy... but it might make it difficult to replace all containers with a new image.

  • There's no way to detect that we're launching a container larger than the space we have available, i.e. larger than the 'largest container': say we're introducing a new service or increasing a service's reservation. This might cause a container to fail to launch.

  • It'd be nice if, for a container being migrated off a draining instance, the new container were started, allowed to finish starting, and registered with any load balancers before the old one was stopped, so there'd be no reduction in the service's available capacity.

  • The cluster doesn't re-balance. We might be able to re-balance it by moving a container or two to different hosts and then scaling in some hosts.

I was going to implement these changes in ecs-refarch-cloudformation; other fixes I was planning to make were:

  • Switch to CloudWatch Events rather than SNS for the LifecycleHandlerFunction's lifecycle events, invoking the Lambda directly rather than via SNS topics; doing so removes the SNS topic from the stack (see the sketch below).
  • Switch to launch templates rather than LaunchConfigurations.
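
A sketch of the CloudWatch Events wiring, assuming boto3; the rule name and Lambda ARN are placeholders, and the Lambda would also need a resource permission allowing events.amazonaws.com to invoke it:

```python
import json
import boto3

events = boto3.client("events")

# Route ASG termination lifecycle events straight to the handler Lambda
# instead of going through an SNS topic.
events.put_rule(
    Name="ecs-lifecycle-terminate",  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
    }),
)
events.put_targets(
    Rule="ecs-lifecycle-terminate",
    Targets=[{"Id": "lifecycle-handler",  # placeholder target id and ARN
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:LifecycleHandlerFunction"}],
)
```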