@jfryman
Last active November 24, 2015 14:16

Phase 1: Systems Failing

In this phase, systems are not responding at all. Some event has occurred that is preventing normal operations. A few examples:

  • Customers are experiencing slow load times or 500s on the website, resulting in a drop in conversion rate. During this event, the company is losing revenue at a rate of $1/sec.
  • CSRs are experiencing blank pages in the internal claims system and are unable to file repair claims. During this time, the company is losing $4.25/sec to fraud abuse.
  • 911 operators are unable to dispatch ambulances due to slow load times. Potential loss of life.

These three use cases are real. In many cases, we will be going to banks and large multinational companies that have different reasons for needing auto-scaling, and many of them know exactly what the time/money tradeoffs are. As a result, we will be graded on the speed of our response (how fast we can acknowledge the situation and transition to a repair state), the reliability of our response (are we accurately detecting the situation, with no false positives or negatives), and the robustness of our response (our ability to always be around to do what is needed).

During this phase, Operations needs to be concerned about (see the policy sketch after this list):

  • Knowing what service or component is affected
  • Knowing what a bad state means
    • What is WARN?
    • What is CRIT?
  • Knowing whether capacity is available to provision additional resources
    • Physical capacity
    • VM Capacity
    • Cloud Costs
  • Knowing who to notify
  • Knowing at what point to deploy additional resources
  • Knowing how many resources to deploy.
  • Knowing how fast to deploy resources.
  • Knowing how to allocate additional resources
    • Cloud APIs
    • Internal provisioning tools
    • Mesos or other schedulers
  • Knowing when sufficient relief has been provided.
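
To make these concerns concrete, here is a minimal sketch of how such a scaling policy might be captured. Every field name and value is an illustrative assumption, not part of any existing StackStorm pack or product API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScalingPolicy:
    """Illustrative auto-scaling policy; all fields and defaults here are assumptions."""
    service: str                    # what service or component is affected
    warn_threshold: float           # metric value that means WARN (e.g. p95 latency in seconds)
    crit_threshold: float           # metric value that means CRIT
    min_nodes: int                  # the original allocation; never scale below this
    max_nodes: int                  # cap driven by physical/VM capacity or cloud cost
    expand_step: int                # how many nodes to add per scaling decision
    expand_interval_s: int          # how long to wait before reassessing
    notify: List[str] = field(default_factory=list)  # who to notify (chat rooms, pager targets)

web_tier = ScalingPolicy(
    service="web-tier",
    warn_threshold=0.5,
    crit_threshold=2.0,
    min_nodes=4,
    max_nodes=32,
    expand_step=2,
    expand_interval_s=120,
    notify=["#ops", "oncall@example.com"],
)
```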

Operations will need to know (see the notification sketch after this list):

  • That StackStorm is aware there is a problem and has begun fixing it
  • What StackStorm is doing about it
  • Where they can go for additional information about what it is doing, or the status of repairs
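
One lightweight way to surface this is a status message the moment remediation begins. A minimal sketch, assuming a generic chat/incident webhook; the endpoint, message format, and helper name are placeholders rather than a specific StackStorm or ChatOps integration.

```python
import requests

CHAT_WEBHOOK = "https://chat.example.com/hooks/ops"  # placeholder endpoint

def announce_remediation(service: str, execution_id: str, status_url: str) -> None:
    """Tell Operations that the event has been acknowledged and repairs have started."""
    message = (
        f"StackStorm is aware of a problem with '{service}' and has begun remediation. "
        f"Execution {execution_id}; follow progress at {status_url}"
    )
    requests.post(CHAT_WEBHOOK, json={"text": message}, timeout=5)
```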

Phase 2: Monitor

The event may still be ongoing, but there is adequate capacity to manage the load. Operations will need to monitor the situation to be ready to allocate additional resources if the need arises again.

During this phase, Operations needs to be concerned with (see the transition sketch after this list):

  • Detecting additional failure, and reverting to Phase 1
  • Detecting a return to normal, and advancing to Phase 3
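
A minimal sketch of that transition logic, assuming WARN/CRIT thresholds like those in the policy sketch above; the metric and threshold values are placeholders for whatever New Relic signal we end up keying on.

```python
def next_phase(current_load: float, warn: float, crit: float) -> int:
    """Decide whether to fall back to Phase 1, stay in Phase 2, or advance to Phase 3."""
    if current_load >= crit:
        return 1  # additional failure detected: revert to Phase 1 and allocate more resources
    if current_load < warn:
        return 3  # systems look normal again: advance to Phase 3 and stand down
    return 2      # still elevated but manageable: keep monitoring
```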

Operations will need to know that:

  • StackStorm is still tracking the situation.

Phase 3: Stand Down

The event is now over, and systems have returned to normal. Operations needs to clean up (notify that systems are back to normal, de-allocate resources, and return them to the pool).

During this phase, Operations needs to be concerned with:

  • Knowing who to notify
  • Knowing what resources need to be de-allocated.
  • Knowing how to de-allocate resources.
  • Knowing at what speed to de-allocate resources.
  • Knowing what normal looks like

I have seen a few patterns for de-allocating resources:

  • LIFO: All machines are eventually removed as they age out.
  • Return to previous state: only auto-scale nodes are treated as ephemeral... the original allocation stays.

With speed, the strategy can vary in how aggressively we remove unneeded resources, depending on the use case. Some may need a tiered step-down, which absorbs momentary spikes on the return to normal and allows time for old connections to drain.
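
As a sketch of the "return to previous state" pattern combined with a tiered step-down: the helper below removes the newest auto-scale nodes in small batches and pauses between batches. The function signature and the `deallocate` callback are assumptions standing in for whatever Cloud API or internal tooling actually does the work.

```python
import time
from typing import Callable, List

def stand_down(nodes: List[str],
               baseline: int,
               step: int,
               drain_interval_s: int,
               deallocate: Callable[[str], None]) -> None:
    """Remove auto-scale nodes in small batches, newest first, until only the
    original allocation (`baseline`) remains. Pausing between batches absorbs
    momentary spikes on the return to normal and lets old connections drain."""
    while len(nodes) > baseline:
        batch_size = min(step, len(nodes) - baseline)
        for node in nodes[-batch_size:]:
            deallocate(node)           # e.g. Cloud API call or internal provisioning tool
        del nodes[-batch_size:]
        time.sleep(drain_interval_s)   # tiered step-down rather than all at once
```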

Operations will need to know that:

  • StackStorm acknowledges an 'all-clear'
  • StackStorm is de-allocating resources

Questions we need to answer:

Phase 1

  • What are we scaling?
    • What CRIT/WARN/OK thresholds are
    • What infrastructure components make up the server?
    • How do we trigger load to New Relic?
    • What governor strategy do we use?
  • How do we enter this phase? What options do we have to trigger an alert from New Relic?
  • What data does the alert from New Relic provide? How can we relate a New Relic alert to the thing we're scaling?
  • What components of the Rackspace integration do we want to show off?
  • How do we interface with the user? ChatOps / UI?
    • How does a user determine:
      • How to define an autoscaling group
      • The number of nodes in a group
      • Minimum number of nodes in a group
      • Maximum nodes in a group
      • The rate at which to expand.
  • What data comes from New Relic to let us know state has changed.
  • How can we control the rate of expansion? We need to provision N nodes, then poll and reassess; otherwise we might race (see the sketch after this list).
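
A sketch of that provision/poll/reassess loop; `provision`, `current_load`, and all of the parameters are placeholders for whatever Cloud API and New Relic signal we end up using.

```python
import time
from typing import Callable

def expand(provision: Callable[[int], None],
           current_load: Callable[[], float],
           crit: float,
           step: int,
           max_extra: int,
           settle_s: int) -> None:
    """Add `step` nodes at a time, then wait for the new capacity to register
    before reassessing, so overlapping alerts do not race us into over-provisioning."""
    added = 0
    while current_load() >= crit and added < max_extra:
        provision(step)        # hand off to the Cloud API / internal tooling / scheduler
        added += step
        time.sleep(settle_s)   # let the new nodes come online and metrics catch up
```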

Phase 2

  • What data comes from New Relic to let us know state has changed.
  • How can we keep StackStorm engaged in remediation (active/async)? Is that necessary?
  • What the CRIT/WARN/OK thresholds are

Phase 3

  • How do we slowly iterate over de-provisioning?
  • Which nodes do we kill?
  • How can we control the rate of de-provisioning?