@adyrcz
Last active August 29, 2015 14:24
The DevOps Dream

In a perfect world, where things are done well and not just quickly, I would expect to find the following when joining a company:

Documentation

  • Accurate / up-to-date systems architecture diagram
  • Accurate / up-to-date network diagram
  • Out-of-hours support plan
  • Incident management plan
  • Change management plan
  • Application documentation

Metric collection:

  • comprehensive system metrics (e.g., CPU, load, memory, disk, network)
  • application metrics instrumented in code (e.g., queue length, time to post new job) [statsd]
  • business metrics instrumented in code as well (e.g., registrations) [statsd]
  • include network devices (e.g., firewalls, load balancers, switches, VPNs, VPC)
  • include storage (e.g., NetApp)
  • include database
  • include cron jobs
  • include CD pipeline systems/applications (e.g., Jenkins, Chef, build / test farm)
  • majority of monitoring from internal systems
  • also monitor from external systems (e.g., Nimsoft/Watchmouse)
  • retrieve external monitoring data into internal collection for correlation
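As a minimal sketch of the in-code instrumentation mentioned above, the following builds StatsD-style metric lines (`<name>:<value>|<type>`) and sends them over UDP. The metric names, the host, and port 8125 are assumptions chosen for illustration; a real setup would more likely use an existing StatsD client library.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line: <name>:<value>|<type>."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the host and port here are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Application metric: current queue length (gauge). Name is hypothetical.
queue_length_line = format_metric("jobs.queue_length", 17, "g")
# Business metric: one new registration (counter). Name is hypothetical.
registration_line = format_metric("business.registrations", 1, "c")
# Application metric: time to post a new job, in milliseconds (timer).
post_time_line = format_metric("jobs.post_time", 42, "ms")

if __name__ == "__main__":
    for line in (queue_length_line, registration_line, post_time_line):
        send_metric(line)
```

Because the send is a single UDP datagram, instrumentation of this kind adds very little overhead to the application, which is part of why it suits both application and business metrics.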

Alert system:

  • alert on collected data (passive)
  • alert on checks (active)
  • call-out on important alerts
  • email, IRC/chat, SMS, mobile escalation
  • call-out rotation, escalation plans
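The alerting items above could be sketched roughly as follows: a passive check against collected data, a weekly call-out rotation, and escalation when a page goes unacknowledged. The names in `ROTATION`, the 15-minute escalation step, and the channel choices are all assumptions for illustration, not a recommendation of specific values.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    severity: str  # "page" calls someone out; anything else only notifies


ROTATION = ["alice", "bob", "carol"]  # hypothetical on-call order


def evaluate(rule, samples):
    """Passive alerting: fire when the latest collected sample exceeds the threshold."""
    return bool(samples) and samples[-1] > rule.threshold


def on_call(week_number, escalation_level=0):
    """The primary rotates weekly; each escalation level moves to the next person."""
    return ROTATION[(week_number + escalation_level) % len(ROTATION)]


def route(rule, week_number, minutes_unacknowledged):
    """Pick channels and a recipient; escalate every 15 minutes without an ack."""
    if rule.severity != "page":
        return {"channels": ["email", "chat"], "recipient": None}
    level = minutes_unacknowledged // 15
    return {"channels": ["sms", "chat"], "recipient": on_call(week_number, level)}


disk_rule = Rule(metric="disk.used_pct", threshold=90, severity="page")
if evaluate(disk_rule, [70, 85, 95]):
    print(route(disk_rule, week_number=0, minutes_unacknowledged=20))
```

An active check would replace `evaluate` with a probe that runs against the service itself, while the rotation and routing logic would stay the same.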

Dashboards:

  • Real-time dashboards of all services
  • Real-time dashboard of what is being viewed on the site, where traffic is coming from
  • Dashboards to include event / deploy lines
  • Anyone can create/share dashboards
  • No passwords to access dashboards
  • Key dashboards visible in the office on screen
  • Dashboard of environments - what's deployed
  • Cost dashboard (IaaS, SaaS)

Correlation / Investigation

  • Graphing system which allows ad-hoc metric correlation (e.g., Graphite)
  • Centralized logging with search (e.g., Logstash, Graylog)
  • Record of everything that has changed, by whom, when, and what the change was
  • Access to all relevant systems

Infrastructure as Code

  • Infrastructure DB with API (Chef server)
  • All infra changes tracked, done via configuration management

Security

  • Automated view of what needs to be patched/updated
  • Regular vulnerability scans with recorded history
  • SSH keys as the only authentication method
  • segregated environments (dev, test, prod)
  • data anonymisation for performance testing

Performance testing

  • Prod-like environment to test in
  • Good performance test, with assumptions and approximations documented
  • Record of all previous test results
  • Automated running of test
  • Automated comparison of test results with previous tests
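One way the automated comparison with previous tests might look, as a sketch: take the mean of earlier runs as a baseline and flag any metric that is worse by more than a tolerance. The metric names, the 10% tolerance, and the assumption that higher values are worse (as with latency) are illustrative only.

```python
from statistics import mean


def compare_to_baseline(current, history, tolerance=0.10):
    """Return metrics whose current value exceeds the mean of previous runs
    by more than `tolerance`. Assumes higher is worse (e.g., latency)."""
    regressions = {}
    for name, value in current.items():
        previous = [run[name] for run in history if name in run]
        if not previous:
            continue  # no recorded history for this metric yet
        baseline = mean(previous)
        if baseline > 0 and (value - baseline) / baseline > tolerance:
            regressions[name] = {"baseline": baseline, "current": value}
    return regressions


# Hypothetical stored results from two earlier runs and the latest run.
history = [
    {"p95_ms": 200, "p50_ms": 80},
    {"p95_ms": 220, "p50_ms": 80},
]
current = {"p95_ms": 260, "p50_ms": 82}

print(compare_to_baseline(current, history))
```

Running this in the pipeline after each test, and failing the build when the result is non-empty, would cover both the automated run and the automated comparison.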

Communications

  • whole company using the same instant messaging / chat system
  • task/kanban system for giving work to systems engineers / infrastructure developers
  • ops twitter
  • ops status (e.g., etsystatus.com; stashboard; amazon status)

Deployment

  • single-click deploy
  • rollback-able
  • performed by developer
  • dashboard/KPI used to validate release
  • zero-downtime
  • dark-launches
  • feature flags can be turned on/off via web UI
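Dark launches and feature flags can be sketched as an in-memory flag store with a percentage rollout. The class name, the flag name, and the hashing scheme are assumptions for illustration; in practice the flag state would live in a shared store that a web UI can edit.

```python
import hashlib


class FeatureFlags:
    """Hypothetical in-memory flag store; unknown flags are treated as off."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled, rollout_percent=100):
        self._flags[name] = (enabled, rollout_percent)

    def is_enabled(self, name, user_id):
        enabled, percent = self._flags.get(name, (False, 0))
        if not enabled:
            return False
        # Hash flag name and user together so each user lands in a stable
        # bucket from 0 to 99, independent of other flags.
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


flags = FeatureFlags()
# Dark launch: the code is deployed, but the feature stays off for everyone.
flags.set("new_checkout", enabled=False)
print(flags.is_enabled("new_checkout", user_id=42))
# Later, expose it to roughly a quarter of users without a new deploy.
flags.set("new_checkout", enabled=True, rollout_percent=25)
print(flags.is_enabled("new_checkout", user_id=42))
```

Keeping the bucket deterministic per user means a user does not flip between the old and new behaviour from one request to the next while the rollout percentage is unchanged.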

Standards

  • Published standards of web systems requirements

Process

  • Light-weight post-mortem process, blame-free
  • Daily operations review
  • Monthly/quarterly architecture summit
  • Daily stand-ups
  • Iteration planning/review
  • Regular capacity planning / cost optimization

Meta-metrics

  • MTTD (mean time to detect)
  • MTTR (mean time to recover)
  • Availability
  • Service degradation (Slow versus broken; features disabled to protect site)
  • CD Pipeline Availability
  • Release tracking (type, success/failure, success rate, length of incident)
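As a rough sketch, MTTD, MTTR, and availability can be derived from incident records. The record fields (`started`, `detected`, `resolved`) are hypothetical, MTTR is measured here from incident start to resolution (some teams measure from detection instead), and availability assumes incidents are full outages that do not overlap.

```python
from datetime import datetime, timedelta


def meta_metrics(incidents, period_start, period_end):
    """Compute MTTD, MTTR, and availability for one reporting period."""
    count = len(incidents)
    time_to_detect = sum(
        (i["detected"] - i["started"] for i in incidents), timedelta()
    )
    time_to_recover = sum(
        (i["resolved"] - i["started"] for i in incidents), timedelta()
    )
    period = period_end - period_start
    return {
        "mttd": time_to_detect / count if count else None,
        "mttr": time_to_recover / count if count else None,
        # Dividing one timedelta by another yields a plain float ratio.
        "availability": 1 - time_to_recover / period,
    }


start = datetime(2015, 7, 1)
end = start + timedelta(days=10)
# Hypothetical incident records for the period.
incidents = [
    {
        "started": start + timedelta(days=1),
        "detected": start + timedelta(days=1, minutes=5),
        "resolved": start + timedelta(days=1, minutes=35),
    },
    {
        "started": start + timedelta(days=6),
        "detected": start + timedelta(days=6, minutes=15),
        "resolved": start + timedelta(days=6, minutes=109),
    },
]

print(meta_metrics(incidents, start, end))
```

Service degradation would need a separate record type, since a slow or partially disabled site is neither fully up nor fully down under this simple model.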