alexjs/ideal ops.md

## ideal ops.md

      
In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:

Documentation


Accurate / up-to-date systems architecture diagram


Accurate / up-to-date network diagram


Out-of-hours support plan


Incident management plan


Change management plan


Application documentation


Metric collection:


comprehensive system metrics (e.g., CPU, load, memory, disk, network)


application metrics instrumented in code (e.g., queue length, time to post a new job) [statsd]


business metrics instrumented in code (e.g., registrations) [statsd]


include network devices (e.g., firewalls, load balancers, switches, VPNs, VPC)


include storage (e.g., NetApp)


include database


include cron jobs


include CD pipeline systems/applications (e.g., Jenkins, Chef, build/test farm)


majority of monitoring from internal systems


also monitor from external systems (e.g., Nimsoft/Watchmouse)


retrieve external monitoring data into internal collection for correlation
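
The `[statsd]` items above could be instrumented along these lines. This is a minimal sketch using only the standard library and the statsd line format (`<name>:<value>|<type>`). The metric names, the `format_metric` and `send_metric` helpers, and the daemon address are assumptions for illustration, not part of any existing setup.

```python
import socket

# Assumption: a statsd daemon listening locally on its default UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)


def format_metric(name, value, kind):
    """Render one metric in the statsd line format: <name>:<value>|<type>."""
    if kind not in ("c", "g", "ms"):  # counter, gauge, timer
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}"


def send_metric(name, value, kind, addr=STATSD_ADDR):
    """Send a metric over UDP; failures are swallowed so metrics never break the app."""
    payload = format_metric(name, value, kind).encode("ascii")
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, addr)
    except OSError:
        pass


# Application metric: current queue length (gauge). Hypothetical name.
send_metric("app.jobs.queue_length", 42, "g")
# Application metric: time to post a new job, in milliseconds (timer).
send_metric("app.jobs.post_time", 187, "ms")
# Business metric: one new registration (counter).
send_metric("business.registrations", 1, "c")
```

UDP is fire-and-forget, so instrumenting a hot code path this way adds little latency and no hard dependency on the collector being up.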


Alert system:


alert on collected data (passive)


alert on checks (active)


call-out on important alerts


email, IRC/chat, SMS, mobile escalation


call-out rotation, escalation plans
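
The passive/active distinction and the call-out rule above might look roughly like this. It is a sketch under assumptions: the function names, the "three consecutive breaches" rule, and the severity labels are invented for illustration.

```python
import socket


def passive_alert(samples, threshold, min_breaches=3):
    """Passive: alert when the last `min_breaches` collected samples all exceed the threshold."""
    recent = samples[-min_breaches:]
    return len(recent) == min_breaches and all(s > threshold for s in recent)


def active_check(host, port, timeout=2.0):
    """Active: probe the service directly by attempting a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def route_alert(severity):
    """Pick notification channels by severity; only critical alerts call someone out."""
    channels = ["chat", "email"]
    if severity == "critical":
        channels += ["sms", "call-out"]
    return channels
```

Requiring several consecutive breaches before alerting is one simple way to avoid paging on a single noisy sample.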


Dashboards:


Real-time dashboards of all services


Real-time dashboard of what is being viewed on the site, where traffic is coming from


Dashboards to include event / deploy lines


Anyone can create/share dashboards


No passwords to access dashboards


Key dashboards visible in the office on screen


Dashboard of environments - what's deployed


Cost dashboard (IaaS, SaaS)


Correlation / Investigation


Graphing system which allows ad-hoc metric correlation (e.g., Graphite)


Centralized logging with search (e.g., Logstash, Graylog)


Record of everything that has changed, by whom, when, and what the change was


Access to all relevant systems
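
As a rough idea of what ad-hoc metric correlation means in practice, the sketch below computes a Pearson correlation coefficient between two metric series sampled at the same timestamps. The series names and values are made up, and a graphing tool would normally do this visually by overlaying the graphs.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally sampled metric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps.
queue_length = [3, 5, 9, 14, 22, 31]
response_ms = [110, 118, 140, 171, 230, 305]
print(f"queue length vs response time: r = {pearson(queue_length, response_ms):.2f}")
```

A value near 1 or -1 only suggests where to look next; it does not establish that one metric causes the other.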


Infrastructure as Code

Infrastructure DB with API (e.g., Chef server)
All infra changes tracked, done via configuration management

Security

Automated view of what needs to be patched/updated
Regular vulnerability scans with recorded history
SSH keys as the only authentication method
segregated environments (dev, test, prod)
data anonymisation for performance testing
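
One possible approach to data anonymisation for performance testing is keyed pseudonymisation: identifying fields are replaced with stable tokens so that joins and data volumes stay realistic. The sketch below assumes a hypothetical user record with `id`, `email`, `name`, and `plan` fields and a secret that never leaves the production side. Whether this is sufficient depends on the data and on local regulations.

```python
import hashlib
import hmac


def pseudonymise(value, secret):
    """Map a value to a stable, non-reversible token using HMAC-SHA256."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymise_user(user, secret):
    """Replace identifying fields; keep non-identifying ones so load profiles stay realistic."""
    token = pseudonymise(user["email"], secret)
    return {
        "id": user["id"],
        "email": f"user-{token}@example.invalid",
        "name": f"User {token[:8]}",
        "plan": user["plan"],
    }


# Hypothetical record and secret, for illustration only.
record = {"id": 1, "email": "alice@example.com", "name": "Alice", "plan": "pro"}
print(anonymise_user(record, secret=b"rotate-me"))
```

Because the same input always maps to the same token, repeated exports remain consistent with each other.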

Performance testing

Prod-like environment to test in
Good performance test, with assumptions and approximations documented
Record of all previous test results
Automated running of test
Automated comparison of test results with previous tests
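
Automated comparison with previous tests can be as simple as checking one summary statistic against the last accepted run. The sketch below uses the 95th-percentile latency and a 10% tolerance. Both choices, and the function names, are assumptions for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples (pct is an integer from 1 to 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def compare_runs(current, previous, tolerance=0.10):
    """Flag a regression when the current p95 exceeds the previous p95 by more than `tolerance`."""
    cur = percentile(current, 95)
    prev = percentile(previous, 95)
    change = (cur - prev) / prev
    return {
        "current_p95": cur,
        "previous_p95": prev,
        "change": change,
        "regression": change > tolerance,
    }
```

Storing the returned dictionary alongside each run also gives the record of previous results that the list asks for.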

Communications

whole company using the same instant messaging / chat system
task/kanban system for giving work to systems engineers / infrastructure developers
ops Twitter account
ops status page (e.g., etsystatus.com, Stashboard, Amazon status)

Deployment

single-click deploy
rollback-able
performed by developer
dashboard/KPI used to validate release
zero-downtime
dark-launches
feature flags can be turned on/off via web UI
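
A feature-flag check with percentage rollout might be sketched as follows. Here a dark launch is modelled as code shipped behind a flag at 0% and ramped up later, which is one common interpretation. The flag names, the in-memory `FLAGS` store, and the helper functions are hypothetical; a web UI would call something like `set_rollout` against a shared store.

```python
import hashlib

# Hypothetical flag store: flag name -> percentage of users who see the feature.
FLAGS = {"new_checkout": 0, "dark_search": 100}


def bucket(flag, user_id):
    """Deterministically map a (flag, user) pair to a bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag, user_id, flags=FLAGS):
    """A user sees the feature when their bucket falls below the rollout percentage."""
    return bucket(flag, user_id) < flags.get(flag, 0)


def set_rollout(flag, percent, flags=FLAGS):
    """Turn a flag on (100), off (0), or somewhere in between."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    flags[flag] = percent
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests while giving different flags independent rollout populations.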

Standards

Published standards of web systems requirements

Process

Light-weight post-mortem process, blame-free
Daily operations review
Monthly/quarterly architecture summit
Daily stand-ups
Iteration planning/review
Regular capacity planning / cost optimization

Meta-metrics

MTTD (mean time to detect)
MTTR (mean time to repair/recover)
Availability
Service degradation (Slow versus broken; features disabled to protect site)
CD Pipeline Availability
Release tracking (type, success/failure, success rate, length of incident)
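
The first three meta-metrics can be derived from incident records. The sketch below assumes each incident is a dictionary with hypothetical `started`, `detected`, and `resolved` timestamps. MTTR is measured here from incident start to resolution; some teams measure from detection instead, so the convention should be written down.

```python
from datetime import datetime, timedelta


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time from the start of an incident to its detection."""
    return mean_delta([i["detected"] - i["started"] for i in incidents])


def mttr(incidents):
    """Mean time from the start of an incident to its resolution (assumed convention)."""
    return mean_delta([i["resolved"] - i["started"] for i in incidents])


def availability(incidents, period):
    """Fraction of the period not spent in an incident (assumes incidents do not overlap)."""
    downtime = sum((i["resolved"] - i["started"] for i in incidents), timedelta())
    return 1 - downtime / period


# Hypothetical incident log for one day.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0), "detected": datetime(2024, 1, 1, 10, 5),
     "resolved": datetime(2024, 1, 1, 10, 35)},
    {"started": datetime(2024, 1, 1, 14, 0), "detected": datetime(2024, 1, 1, 14, 15),
     "resolved": datetime(2024, 1, 1, 15, 25)},
]
print(mttd(incidents), mttr(incidents), availability(incidents, timedelta(days=1)))
```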