@jfryman
Last active November 24, 2015 14:16

Phase 1: Systems Failing

In this phase, systems are not responding at all. Some event has occurred that is preventing normal operations. A few examples:

  • Customers are experiencing slow load times or 500s on the website, resulting in a drop in conversion rate. During this event, the company is losing revenue at a rate of $1/sec.
  • CSRs are experiencing blank pages in the internal claims system and are unable to file repair claims. During this time, the company is losing $4.25/sec to fraud abuse.
  • 911 operators are unable to dispatch ambulances due to slow load times. Potential loss of life.

These three use cases are real. In many cases, we will be going to banks and large multinational companies that have different reasons for needing auto-scaling, and many of them know exactly what the time/money tradeoffs are. As a result, we will be graded on the speed of our response (how fast we can acknowledge the situation and transition to a repair state), the reliability of our response (are we accurately detecting the situation, with no false positives or negatives), and the robustness of our response (our ability to always be around to do what is needed).

During this phase, Operations needs to be concerned about (see the policy sketch after this list):

  • Knowing what service or component is affected
  • Knowing what a bad state means
    • What is WARN?
    • What is CRIT?
  • Knowing whether capacity is available to provision additional resources
    • Physical capacity
    • VM Capacity
    • Cloud Costs
  • Knowing who to notify
  • Knowing at what point to deploy additional resources
  • Knowing how many resources to deploy.
  • Knowing how fast to deploy resources.
  • Knowing how to allocate additional resources
    • Cloud APIs
    • Internal provisioning tools
    • Mesos or other schedulers
  • Knowing when sufficient relief has been provided.
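
To make these concerns concrete, here is a minimal sketch of how such a scaling policy might be captured. Every field name and value is an illustrative assumption, not part of any existing StackStorm pack or product API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScalingPolicy:
    """Illustrative auto-scaling policy; all fields and defaults here are assumptions."""
    service: str                    # what service or component is affected
    warn_threshold: float           # metric value that means WARN (e.g. p95 latency in seconds)
    crit_threshold: float           # metric value that means CRIT
    min_nodes: int                  # the original allocation; never scale below this
    max_nodes: int                  # cap driven by physical/VM capacity or cloud cost
    expand_step: int                # how many nodes to add per scaling decision
    expand_interval_s: int          # how long to wait before reassessing
    notify: List[str] = field(default_factory=list)  # who to notify (chat rooms, pager targets)

web_tier = ScalingPolicy(
    service="web-tier",
    warn_threshold=0.5,
    crit_threshold=2.0,
    min_nodes=4,
    max_nodes=32,
    expand_step=2,
    expand_interval_s=120,
    notify=["#ops", "oncall@example.com"],
)
```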

Operations will need to know (see the notification sketch after this list):

  • That StackStorm is aware there is a problem and has begun fixing it
  • What StackStorm is doing about it
  • Where they can go for additional information about what it is doing, or the status of repairs
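
One lightweight way to surface this is a status message the moment remediation begins. A minimal sketch, assuming a generic chat/incident webhook; the endpoint, message format, and helper name are placeholders rather than a specific StackStorm or ChatOps integration.

```python
import requests

CHAT_WEBHOOK = "https://chat.example.com/hooks/ops"  # placeholder endpoint

def announce_remediation(service: str, execution_id: str, status_url: str) -> None:
    """Tell Operations that the event has been acknowledged and repairs have started."""
    message = (
        f"StackStorm is aware of a problem with '{service}' and has begun remediation. "
        f"Execution {execution_id}; follow progress at {status_url}"
    )
    requests.post(CHAT_WEBHOOK, json={"text": message}, timeout=5)
```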

Phase 2: Monitor

The event may still be ongoing, but there is adequate capacity to manage the load. Operations will need to monitor the situation to be ready to allocate additional resources if the need arises again.

During this phase, Operations needs to be concerned with (see the transition sketch after this list):

  • Detecting additional failure, and reverting to Phase 1
  • Detecting a return to normal, and advancing to Phase 3
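
A minimal sketch of that transition logic, assuming WARN/CRIT thresholds like those in the policy sketch above; the metric and threshold values are placeholders for whatever New Relic signal we end up keying on.

```python
def next_phase(current_load: float, warn: float, crit: float) -> int:
    """Decide whether to fall back to Phase 1, stay in Phase 2, or advance to Phase 3."""
    if current_load >= crit:
        return 1  # additional failure detected: revert to Phase 1 and allocate more resources
    if current_load < warn:
        return 3  # systems look normal again: advance to Phase 3 and stand down
    return 2      # still elevated but manageable: keep monitoring
```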

Operations will need to know that:

  • StackStorm is still tracking the situation.

Phase 3: Stand Down

The event is now over, and systems have returned to normal. Operations needs to clean up (notify that systems are back to normal, de-allocate resources, and return them to the pool).

During this phase, Operations needs to be concerned with:

  • Knowing who to notify
  • Knowing what resources need to be de-allocated.
  • Knowing how to de-allocate resources.
  • Knowing at what speed to de-allocate resources.
  • Knowing what normal looks like

I have seen a few patterns for de-allocating resources:

  • LIFO: All machines are eventually removed as they age out.
  • Return to previous state: only auto-scale nodes are treated as ephemeral... the original allocation stays.

With speed, the strategy can vary in how aggressively we remove unneeded resources, depending on the use case. Some may need a tiered step-down, which absorbs momentary spikes on the return to normal and allows time for old connections to drain.
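
As a sketch of the "return to previous state" pattern combined with a tiered step-down: the helper below removes the newest auto-scale nodes in small batches and pauses between batches. The function signature and the `deallocate` callback are assumptions standing in for whatever Cloud API or internal tooling actually does the work.

```python
import time
from typing import Callable, List

def stand_down(nodes: List[str],
               baseline: int,
               step: int,
               drain_interval_s: int,
               deallocate: Callable[[str], None]) -> None:
    """Remove auto-scale nodes in small batches, newest first, until only the
    original allocation (`baseline`) remains. Pausing between batches absorbs
    momentary spikes on the return to normal and lets old connections drain."""
    while len(nodes) > baseline:
        batch_size = min(step, len(nodes) - baseline)
        for node in nodes[-batch_size:]:
            deallocate(node)           # e.g. Cloud API call or internal provisioning tool
        del nodes[-batch_size:]
        time.sleep(drain_interval_s)   # tiered step-down rather than all at once
```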

Operations will need to know that:

  • StackStorm acknowledges an 'all-clear'
  • StackStorm is de-allocating resources

Questions we need to answer:

Phase 1

  • What are we scaling?
    • What CRIT/WARN/OK thresholds are
    • What infrastructure components make up the server?
    • How do we trigger load to New Relic?
    • What governor strategy do we use?
  • How do we enter this phase? What options do we have to trigger an alert from New Relic?
  • What data does the alert from New Relic provide? How can we relate a New Relic alert to the thing we're scaling?
  • What components of the Rackspace integration do we want to show off?
  • How do we interface with the user? ChatOps / UI?
    • How does a user determine:
      • How to define an autoscaling group
      • The number of nodes in a group
      • Minimum number of nodes in a group
      • Maximum nodes in a group
      • The rate at which to expand.
  • What data comes from New Relic to let us know state has changed.
  • How can we control the rate of expansion? We need to provision N nodes, then poll and reassess; otherwise we might race (see the sketch after this list).
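
A sketch of that provision/poll/reassess loop; `provision`, `current_load`, and all of the parameters are placeholders for whatever Cloud API and New Relic signal we end up using.

```python
import time
from typing import Callable

def expand(provision: Callable[[int], None],
           current_load: Callable[[], float],
           crit: float,
           step: int,
           max_extra: int,
           settle_s: int) -> None:
    """Add `step` nodes at a time, then wait for the new capacity to register
    before reassessing, so overlapping alerts do not race us into over-provisioning."""
    added = 0
    while current_load() >= crit and added < max_extra:
        provision(step)        # hand off to the Cloud API / internal tooling / scheduler
        added += step
        time.sleep(settle_s)   # let the new nodes come online and metrics catch up
```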

Phase 2

  • What data comes from New Relic to let us know state has changed.
  • How can we keep StackStorm engaged in remediation (active/async)? Is that necessary?
  • What the CRIT/WARN/OK thresholds are

Phase 3

  • How do we slowly iterate over de-provisioning?
  • Which nodes do we kill?
  • How can we control the rate of de-provisioning?