ideal ops checklist

In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:

Documentation

  • Accurate / up-to-date systems architecture diagram

  • Accurate / up-to-date network diagram

  • Out-of-hours support plan

  • Incident management plan

  • Change management plan

  • Application documentation

Metric collection:

  • comprehensive system metrics (eg. cpu, load, mem, disk, network, etc)

  • application metrics instrumented in code (eg. queue length, time to post new job) [statsd]

  • business metrics instrumented in code as well (eg. registrations) [statsd]

  • include network devices (eg. firewall, loadbalancers, switches, vpns, vpc)

  • include storage (eg. netapp)

  • include database

  • include cron jobs

  • include CD pipeline systems/applications (e.g., jenkins, chef, build / test farm)

  • majority of monitoring from internal systems

  • also monitor from external systems (e.g., Nimsoft/Watchmouse)

  • retrieve external monitoring data into internal collection for correlation
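
As an illustration of the "instrumented in code [statsd]" items above, here is a minimal sketch of emitting metrics in the StatsD line protocol (`name:value|type`) over UDP. The metric names, the host and port defaults, and the helper names `format_metric` and `send_metric` are assumptions made for the example, not part of the checklist; a real deployment would more likely use an existing StatsD client library.

```python
import socket

# Metric types from the StatsD line protocol: counter, gauge, timer (ms).
VALID_TYPES = ("c", "g", "ms")


def format_metric(name: str, value: float, kind: str) -> bytes:
    """Build one StatsD datagram of the form b'name:value|type'."""
    if kind not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {kind!r}")
    return f"{name}:{value}|{kind}".encode("ascii")


def send_metric(name: str, value: float, kind: str,
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send over UDP (host and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_metric(name, value, kind), (host, port))


if __name__ == "__main__":
    # Hypothetical application and business metrics.
    send_metric("jobs.queue_length", 42, "g")     # application gauge
    send_metric("jobs.time_to_post", 12.5, "ms")  # application timer
    send_metric("registrations", 1, "c")          # business counter
```

UDP keeps instrumentation cheap: if the collector is down, the application does not block waiting for it.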

Alert system:

  • alert off data collected (passive)

  • alert on checks (active)

  • call-out on important alerts

  • email, irc/chat, sms, mobile escalation

  • call-out rotation, escalation plans
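
A rough sketch of two of the ideas above: alerting off collected data (passive) and escalating through channels when nobody acknowledges. The "three consecutive samples" rule, the minute thresholds, and the function names are all assumptions chosen for illustration; the channel order simply follows the list above.

```python
from typing import List, Sequence, Tuple


def should_alert(samples: Sequence[float], threshold: float,
                 consecutive: int = 3) -> bool:
    """Passive alert: fire only if the last `consecutive` samples all
    exceed the threshold, which filters out one-off spikes."""
    if consecutive < 1 or len(samples) < consecutive:
        return False
    return all(s > threshold for s in samples[-consecutive:])


# (minutes unacknowledged, channel); the timings are hypothetical.
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "email"),
    (5, "chat"),
    (15, "sms"),
    (30, "phone call"),
]


def escalation_channel(minutes_unacknowledged: int) -> str:
    """Return the most severe channel reached after the given wait."""
    channel = ESCALATION_STEPS[0][1]
    for after_minutes, name in ESCALATION_STEPS:
        if minutes_unacknowledged >= after_minutes:
            channel = name
    return channel


if __name__ == "__main__":
    load = [1.2, 7.9, 8.4, 9.1]
    if should_alert(load, threshold=5.0):
        print("notify via", escalation_channel(0))
```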

Dashboards:

  • Real-time dashboards of all services

  • Real-time dashboard of what is being viewed on the site, where traffic is coming from

  • Dashboards to include event / deploy lines

  • Anyone can create/share dashboards

  • No passwords to access dashboards

  • Key dashboards visible in the office on screen

  • Dashboard of environments - what's deployed

  • Cost dashboard (IaaS, SaaS)

Correlation / Investigation

  • Graphing system which allows ad-hoc metric correlation (eg. Graphite)

  • Centralized logging with search (eg. Logstash, Graylog)

  • Record of everything that has changed, by whom, when, and what the change was

  • Access to all relevant systems

Infrastructure as Code

  • Infrastructure DB with API (Chef server)
  • All infra changes tracked, done via configuration management

Security

  • Automated view of what needs to be patched/updated
  • Regular vulnerability scans with recorded history
  • ssh-key as only authentication
  • segregated environments (dev, test, prod)
  • data anonymisation for performance testing

Performance testing

  • Prod-like environment to test in
  • Good performance test, with assumptions and approximations documented
  • Record of all previous test results
  • Automated running of tests
  • Automated comparison of test results with previous tests
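
One possible shape for the automated comparison against previous results, as a sketch only. It assumes each run is stored as a mapping from metric name to a "lower is better" number (for example latency in milliseconds) and that a 10% slowdown counts as a regression; both the data layout and the tolerance are assumptions.

```python
from typing import Dict


def find_regressions(previous: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.10) -> Dict[str, float]:
    """Return metrics whose value grew by more than `tolerance`
    (relative change) compared with the previous run."""
    regressions = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue  # metric missing from this run, or no usable baseline
        change = (new - old) / old
        if change > tolerance:
            regressions[name] = change
    return regressions


if __name__ == "__main__":
    last_run = {"p50_ms": 50.0, "p95_ms": 200.0}
    this_run = {"p50_ms": 52.0, "p95_ms": 250.0}
    for metric, change in find_regressions(last_run, this_run).items():
        print(f"{metric} regressed by {change:.0%}")
```

Keeping every previous result, as the list asks, is what makes this comparison possible; the same function can be run against any older baseline, not just the last one.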

Communications

  • whole company using the same instant messaging / chat system
  • task/kanban system for giving work to systems engineers / infrastructure developers
  • ops twitter
  • ops status (eg. etsystatus.com; stashboard; amazon status)

Deployment

  • single-click deploy
  • rollback-able
  • performed by developer
  • dashboard/KPI used to validate release
  • zero-downtime
  • dark-launches
  • feature flags can be turned on/off via webui

Standards

  • Published standards of web systems requirements

Process

  • Light-weight post-mortem process, blame-free
  • Daily operations review
  • Monthly/quarterly architecture summit
  • Daily stand-ups
  • Iteration planning/review
  • Regular capacity planning / cost optimization

Meta-metrics

  • MTTD
  • MTTR
  • Availability
  • Service degradation (Slow versus broken; features disabled to protect site)
  • CD Pipeline Availability
  • Release tracking (type, success/failure, success rate, length of incident)
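
To make the first three meta-metrics concrete, here is a small sketch that derives them from incident records. Definitions vary between teams; this example assumes MTTD is the mean of (detected - started), MTTR is the mean of (resolved - started), and availability is the fraction of the reporting period not covered by incidents. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was restored


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to recover, measured here from start to resolution."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def availability(incidents: List[Incident], period: timedelta) -> float:
    """Fraction of `period` without an incident (assumes no overlaps)."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1.0 - downtime / period


if __name__ == "__main__":
    day = datetime(2012, 5, 31)
    incidents = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=5),
                 day.replace(hour=10, minute=35)),
        Incident(day.replace(hour=12), day.replace(hour=12, minute=15),
                 day.replace(hour=12, minute=25)),
    ]
    print(mttd(incidents), mttr(incidents),
          availability(incidents, timedelta(hours=10)))
```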

I note there is nothing around backups or disaster recovery plans here, or are you including those in the Incident Management Plan?

pkhamre commented May 31, 2012

When all points are checked, operations team can leave for a 3 month vacation.

morgajel commented Jun 4, 2012

I'm just gonna drop this here because the people who love this may also find these useful:

As an ops team using Jenkins for automation, I wrote about how we use it: http://morgajel.net/2011/12/12/1108

Regarding monitoring, I wrote up this doc a while back: http://morgajel.net/2010/06/30/755

One thing I'd add to the list is "solid hostname conventions," http://morgajel.net/2012/01/16/1049

Another point, which could also be included in the incident management plan, would be something like a DDoS mitigation plan.
A CDN can handle lots of traffic, but it should not be responsible for delivering junk traffic. Something like this should be mentioned after the first incident.

There are services like:
http://www.prolexic.com/
http://www.blacklotus.net/

On security:

  • Only ephemeral creds (ssh-key only) baked into images, i.e., once the node is bootstrapped and under centralized control, the original creds are disabled.
  • Single well-known route for getting app-level credentials to a host. No creds in repos.

Flexibility and resiliency:

  • Related to feature flags -- having big red buttons that are easy to push to disable especially expensive/non-mission-critical paths when under heavy load. James Hamilton has an excellent paper which mentions this technique, and other good tips.
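
A minimal sketch of the "big red button" idea, assuming an in-memory flag store; in practice the flags would live somewhere operators can flip them at runtime (a web UI, a config service). The class name, the flag names, and the page-rendering function are all hypothetical.

```python
from typing import Dict, List


class FeatureFlags:
    """In-memory flag store; unknown flags default to off."""

    def __init__(self, defaults: Dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def render_page(flags: FeatureFlags) -> List[str]:
    """Always serve the core path; add expensive extras only if enabled."""
    parts = ["core content"]
    if flags.is_enabled("recommendations"):
        parts.append("recommendations")  # expensive, non-mission-critical
    return parts


if __name__ == "__main__":
    flags = FeatureFlags({"recommendations": True})
    print(render_page(flags))
    flags.set("recommendations", False)  # the big red button, under load
    print(render_page(flags))
```

The point of the design is that disabling the expensive path is a data change, not a deploy, so it stays fast and safe to do in the middle of an incident.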

Training/preparedness:

  • Have regular training exercises that simulate failures. I've seen this done well in a few places, typically modeled after Google-ops's Wheel of Misfortune exercises. (Smaller counterpart to DiRT)