ideal ops checklist

In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:

Documentation

  • Accurate / up-to-date systems architecture diagram

  • Accurate / up-to-date network diagram

  • Out-of-hours support plan

  • Incident management plan

  • Change management plan

  • Application documentation

Metric collection

  • comprehensive system metrics (eg. cpu, load, mem, disk, network, etc)

  • application metrics instrumented in code (eg. queue length, time to post new job) [statsd]

  • business metrics instrumented in code as well (eg. registrations) [statsd]

  • include network devices (eg. firewall, loadbalancers, switches, vpns, vpc)

  • include storage (eg. netapp)

  • include database

  • include cron jobs

  • include CD pipeline systems/applications (e.g., jenkins, chef, build / test farm)

  • majority of monitoring from internal systems

  • also monitor from external systems (e.g., Nimsoft/Watchmouse)

  • retrieve external monitoring data into internal collection for correlation
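
As a rough illustration of the [statsd] items above, here is a minimal sketch of emitting application and business metrics using the plain-text StatsD line format (`name:value|type`) over UDP. The metric names, host, and port default are assumptions for the example (8125 is the usual StatsD port); a real codebase would more likely use an existing StatsD client library.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric in the StatsD line format: name:value|type.

    kind is "c" (counter), "g" (gauge) or "ms" (timer).
    """
    return f"{name}:{value}|{kind}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP (fire-and-forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, chosen only for this example.
    # Application metric: current queue length (gauge).
    send_metric(statsd_line("app.jobs.queue_length", 42, "g"))
    # Application metric: time to post a new job, in milliseconds (timer).
    send_metric(statsd_line("app.jobs.post_time", 137, "ms"))
    # Business metric: one new registration (counter).
    send_metric(statsd_line("business.registrations", 1, "c"))
```

UDP is used here so that a slow or missing collector does not hold up the application code being instrumented.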

Alert system

  • alert off data collected (passive)

  • alert on checks (active)

  • call-out on important alerts

  • email, irc/chat, sms, mobile escalation

  • call-out rotation, escalation plans
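
One way to picture alerting off collected data with escalation is the sketch below. The rule shape, severity names, and channel routing table are all assumptions made up for the example, not a description of any particular alerting product.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rule:
    metric: str
    threshold: float
    severity: str  # "info", "warning" or "critical" (hypothetical levels)


# Hypothetical routing: which channels each severity reaches.
# Only critical alerts trigger a call-out.
ROUTES = {
    "info": ["chat"],
    "warning": ["chat", "email"],
    "critical": ["chat", "email", "sms", "call-out"],
}


def evaluate(rule, samples):
    """Return the channels to notify for this rule.

    The rule fires when the mean of the recent samples is above the
    threshold; an empty sample list never fires.
    """
    if not samples or mean(samples) <= rule.threshold:
        return []
    return ROUTES[rule.severity]


disk_rule = Rule("disk.used_pct", 90, "critical")
channels = evaluate(disk_rule, [92, 95, 97])
```

Averaging over a few recent samples rather than alerting on a single reading is a design choice to reduce flapping; the window size would need tuning per metric.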

Dashboards

  • Real-time dashboards of all services

  • Real-time dashboard of what is being viewed on the site, where traffic is coming from

  • Dashboards to include event / deploy lines

  • Anyone can create/share dashboards

  • No passwords to access dashboards

  • Key dashboards visible in the office on screen

  • Dashboard of environments - what's deployed

  • Cost dashboard (IaaS, SaaS)

Correlation / Investigation

  • Graphing system which allows ad-hoc metric correlation (eg. Graphite)

  • Centralized logging with search (eg. Logstash, Graylog)

  • Record of everything that has changed, by whom, when, and what the change was

  • Access to all relevant systems
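
The "record of everything that has changed" item could be as simple as one structured line per change, which centralized logging can then index and search. The field names below are assumptions for the sketch.

```python
import json
from datetime import datetime, timezone


def change_record(who, system, what, when=None):
    """Build one JSON line describing a change: by whom, when, where, and what."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "when": when.isoformat(),
            "who": who,
            "system": system,
            "what": what,
        },
        sort_keys=True,
    )


# Hypothetical example entry.
line = change_record("alice", "loadbalancer", "raised connection limit")
```

Emitting the same records as events to the graphing system is one way to get deploy lines on dashboards from a single source.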

Infrastructure as Code

  • Infrastructure DB with API (Chef server)
  • All infra changes tracked, done via configuration management

Security

  • Automated view of what needs to be patched/updated
  • Regular vulnerability scans with recorded history
  • ssh-key as only authentication
  • segregated environments (dev, test, prod)
  • data anonymisation for performance testing

Performance testing

  • Prod-like environment to test in
  • Good performance test, with assumptions and approximations documented
  • Record of all previous test results
  • Automated running of tests
  • Automated comparison of test results with previous tests
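
Automated comparison with previous results can start very small. The sketch below flags metrics that got worse by more than a tolerance; it assumes every metric is "lower is better" (eg. latencies), and the metric names and 10% tolerance are made up for the example.

```python
def find_regressions(previous, current, tolerance=0.10):
    """Return {metric: fractional increase} for metrics that regressed.

    Assumes lower values are better. Metrics missing from the current
    run, or with a non-positive previous value, are skipped.
    """
    regressions = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue
        change = (new - old) / old
        if change > tolerance:
            regressions[name] = change
    return regressions


# Hypothetical results from two test runs.
previous_run = {"p50_ms": 80.0, "p95_ms": 200.0}
current_run = {"p50_ms": 82.0, "p95_ms": 250.0}
regressed = find_regressions(previous_run, current_run)
```

A non-empty result could fail the pipeline step or raise an alert, depending on how much the test environment can be trusted.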

Communications

  • whole company using the same instant messaging / chat system
  • task/kanban system for giving work to systems engineers / infrastructure developers
  • ops twitter
  • ops status (eg. etsystatus.com; stashboard; amazon status)

Deployment

  • single-click deploy
  • rollback-able
  • performed by developer
  • dashboard/KPI used to validate release
  • zero-downtime
  • dark-launches
  • feature flags can be turned on/off via webui
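
A feature flag that supports a gradual dark launch might look roughly like the sketch below, taking "dark launch" to mean shipping code switched off and then exposing it to a growing share of users. The class, flag name, and in-memory store are hypothetical; a web UI would simply call `set` on whatever shared store backs the flags.

```python
import hashlib


class FeatureFlags:
    """In-memory flag store with percentage rollout (illustrative only)."""

    def __init__(self):
        self._rollout = {}  # flag name -> percentage of users (0-100)

    def set(self, name, percent):
        """Turn a flag off (0), fully on (100), or partially on."""
        self._rollout[name] = max(0, min(100, percent))

    def is_enabled(self, name, user_id):
        """Decide per user; unknown flags are treated as off."""
        percent = self._rollout.get(name, 0)
        # Hash flag and user together so each user gets a stable bucket
        # per flag, and different flags split users differently.
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < percent


flags = FeatureFlags()
flags.set("new_checkout", 5)  # dark launch to roughly 5% of users
show_new_checkout = flags.is_enabled("new_checkout", user_id=1234)
```

Hashing rather than random sampling keeps a given user on the same side of the flag between requests, which also makes rollback a matter of setting the percentage back to 0.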

Standards

  • Published standards of web systems requirements

Process

  • Light-weight post-mortem process, blame-free
  • Daily operations review
  • Monthly/quarterly architecture summit
  • Daily stand-ups
  • Iteration planning/review
  • Regular capacity planning / cost optimization

Meta-metrics

  • MTTD (mean time to detect)
  • MTTR (mean time to recover)
  • Availability
  • Service degradation (Slow versus broken; features disabled to protect site)
  • CD Pipeline Availability
  • Release tracking (type, success/failure, success rate, length of incident)
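
The first three meta-metrics can be derived from a simple incident log. The sketch below reads MTTD as mean time to detect and MTTR as mean time to recover, measures both from the start of the incident, and treats every incident as a full, non-overlapping outage; those are assumptions, and teams define these terms differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def _mean(deltas):
    """Average a sequence of timedeltas (needs at least one item)."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time from start of incident to detection."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time from start of incident to resolution."""
    return _mean(i.resolved - i.started for i in incidents)


def availability(incidents, period):
    """Fraction of the period with no incident in progress."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1 - downtime / period
```

Service degradation would need an extra field per incident (eg. slow versus broken) before it could be reported in the same way.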