ideal ops checklist

In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:

Documentation

  • Accurate / up-to-date systems architecture diagram

  • Accurate / up-to-date network diagram

  • Out-of-hours support plan

  • Incident management plan

  • Change management plan

  • Application documentation

Metric collection:

  • comprehensive system metrics (eg. cpu, load, mem, disk, network, etc)

  • application metrics instrumented in code (eg. queue length, time to post new job) [statsd]

  • business metrics instrumented in code as well (eg. registrations) [statsd]

  • include network devices (eg. firewall, loadbalancers, switches, vpns, vpc)

  • include storage (eg. netapp)

  • include database

  • include cron jobs

  • include CD pipeline systems/applications (e.g., jenkins, chef, build / test farm)

  • majority of monitoring from internal systems

  • also monitor from external systems (e.g., Nimsoft/Watchmouse)

  • retrieve external monitoring data into internal collection for correlation
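
As a rough sketch of the [statsd] items above: statsd accepts plain-text datagrams of the form `name:value|type` over UDP, so instrumenting code takes very little. The metric names, the host, and port 8125 (the usual statsd default) are assumptions for illustration, not part of the original list.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> bytes:
    """Build one statsd datagram: <name>:<value>|<type>.

    Common types: "c" (counter), "ms" (timer), "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget send over UDP; host and port are assumed values."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), (host, port))
    finally:
        sock.close()
```

With this in place, `send_metric("business.registrations", 1, "c")` would count one registration and `send_metric("app.jobs.queue_length", 42, "g")` would report the current queue length. UDP keeps instrumentation from blocking the application if the collector is down.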

Alert system:

  • alert off data collected (passive)

  • alert on checks (active)

  • call-out on important alerts

  • email, irc/chat, sms, mobile escalation

  • call-out rotation, escalation plans
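
One way to picture the passive alerting and escalation items above is the minimal sketch below. The rule fields, severity names, and channel lists are all hypothetical; a real setup would load them from the alerting tool's own configuration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    severity: str  # hypothetical levels: "notify" or "page"


# Hypothetical escalation chains: only "page" reaches sms and phone call-out.
ESCALATION = {
    "notify": ["email", "chat"],
    "page": ["email", "chat", "sms", "phone"],
}


def breached(rule: Rule, samples: list) -> bool:
    """Passive check on already-collected data.

    Fires only if every recent sample is above the threshold, which
    avoids calling someone out for a single spike.
    """
    return bool(samples) and all(s > rule.threshold for s in samples)


def channels_for(rule: Rule, samples: list) -> list:
    """Return the channels to notify, or an empty list if all is well."""
    return ESCALATION[rule.severity] if breached(rule, samples) else []
```

For example, a rule `Rule("disk.used_pct", 90, "page")` evaluated against the last three samples `[93, 95, 97]` would return all four channels, while `[93, 85, 97]` would return none.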

Dashboards:

  • Real-time dashboards of all services

  • Real-time dashboard of what is being viewed on the site, where traffic is coming from

  • Dashboards to include event / deploy lines

  • Anyone can create/share dashboards

  • No passwords to access dashboards

  • Key dashboards visible in the office on screen

  • Dashboard of environments - what's deployed

  • Cost dashboard (IaaS, SaaS)

Correlation / Investigation

  • Graphing system which allows ad-hoc metric correlation (eg. Graphite)

  • Centralized logging with search (eg. Logstash, Graylog)

  • Record of everything that has changed, by whom, when, and what the change was

  • Access to all relevant systems

Infrastructure as Code

  • Infrastructure DB with API (Chef server)
  • All infra changes tracked, done via configuration management

Security

  • Automated view of what needs to be patched/updated
  • Regular vulnerability scans with recorded history
  • ssh-key as only authentication
  • segregated environments (dev, test, prod)
  • data anonymisation for performance testing

Performance testing

  • Prod-like environment to test in
  • Good performance test, with assumptions and approximations documented
  • Record of all previous test results
  • Automated running of test
  • Automated comparison of test results with previous tests
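
The automated comparison with previous tests could look something like this sketch. It assumes each run is stored as a mapping of metric name to value where lower is better (latencies, error counts), and the 10% tolerance is an arbitrary example value.

```python
def regressions(previous: dict, current: dict, tolerance: float = 0.10) -> dict:
    """Return metrics that got worse by more than `tolerance` (a fraction).

    Assumes lower values are better. Metrics missing from the current
    run, or with a non-positive baseline, are skipped.
    """
    flagged = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue
        change = (new - old) / old
        if change > tolerance:
            flagged[name] = round(change, 3)
    return flagged
```

Given a previous run of `{"p95_ms": 200, "p99_ms": 400}` and a current run of `{"p95_ms": 250, "p99_ms": 410}`, only `p95_ms` is flagged, with a relative change of 0.25. A non-empty result can be used to fail the automated test run.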

Communications

  • whole company using the same instant messaging / chat system
  • task/kanban system for giving work to systems engineers / infrastructure developers
  • ops twitter
  • ops status (eg. etsystatus.com; stashboard; amazon status)

Deployment

  • single-click deploy
  • rollback-able
  • performed by developer
  • dashboard/KPI used to validate release
  • zero-downtime
  • dark-launches
  • feature flags can be turned on/off via webui
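
To illustrate the dark-launch and feature-flag items, here is a minimal in-memory sketch. The class and method names are hypothetical; a web UI would simply call `set` to move a flag between 0% (off), a partial rollout, and 100% (on).

```python
import hashlib


class FeatureFlags:
    """In-memory flag store: flag name -> rollout percentage (0..100)."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._flags[name] = percent

    def enabled(self, name: str, user_id: str) -> bool:
        """Unknown flags are off. A given user always gets the same answer."""
        percent = self._flags.get(name, 0)
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).digest()
        # Hash the (flag, user) pair into one of 100 roughly equal buckets.
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < percent
```

Hashing the flag name together with the user id keeps each user's experience stable across requests while letting different flags roll out to different subsets of users.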

Standards

  • Published standards of web systems requirements

Process

  • Light-weight post-mortem process, blame-free
  • Daily operations review
  • Monthly/quarterly architecture summit
  • Daily stand-ups
  • Iteration planning/review
  • Regular capacity planning / cost optimization

Meta-metrics

  • MTTD
  • MTTR
  • Availability
  • Service degradation (Slow versus broken; features disabled to protect site)
  • CD Pipeline Availability
  • Release tracking (type, success/failure, success rate, length of incident)
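
The first three meta-metrics can be derived from a simple incident log, as in the sketch below. It assumes each incident records when it started, was detected, and was resolved, and it takes MTTR as time from start to resolution, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def _mean(deltas: list) -> timedelta:
    # Raises ZeroDivisionError on an empty list: no incidents, no mean.
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list) -> timedelta:
    """Mean time to detect: incident start until detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list) -> timedelta:
    """Mean time to recover: incident start until resolution (assumed)."""
    return _mean([i.resolved - i.started for i in incidents])


def availability(incidents: list, period: timedelta) -> float:
    """Fraction of the period not covered by incidents (assumes no overlap)."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1 - downtime / period
```

Two incidents lasting 30 and 60 minutes within a single day give an availability of 0.9375 for that day.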

WheresAlice commented May 27, 2012

I note there is nothing around backups or disaster recovery plans here, or are you including those in the Incident Management Plan?


pkhamre commented May 31, 2012

When all points are checked, the operations team can leave for a 3-month vacation.


morgajel commented Jun 4, 2012

I'm just gonna drop this here because the people who love this may also find these useful:

As an ops team using Jenkins for automation, I wrote about how we use it: http://morgajel.net/2011/12/12/1108

Regarding monitoring, I wrote up this doc a while back: http://morgajel.net/2010/06/30/755

One thing I'd add to the list is "solid hostname conventions," http://morgajel.net/2012/01/16/1049


foetterer commented Jun 16, 2012

Another point, which could also be included in the incident management plan, would be something like a DDoS mitigation plan.
A CDN can handle lots of traffic, but it should not be responsible for delivering junk traffic. Something like this should be in place by the time of the first incident.

There are services like:
http://www.prolexic.com/
http://www.blacklotus.net/


hatchetation commented Feb 8, 2013

On security:

  • Only ephemeral creds (ssh-key only) baked into images, i.e. once the node is bootstrapped and under centralized control, the original creds are disabled.
  • Single well-known route for getting app-level credentials to a host. No creds in repos.

Flexibility and resiliency:

  • Related to feature flags -- having big red buttons that are easy to push to disable especially expensive/non-mission-critical paths when under heavy load. James Hamilton has an excellent paper which mentions this technique, and other good tips.

Training/preparedness:

  • Have regular training exercises that simulate failures. I've seen this done well in a few places, typically modeled after Google-ops's Wheel of Misfortune exercises. (Smaller counterpart to DiRT)