The Operations Report Card

Source: http://www.opsreportcard.com/.

Public Facing Practices

  1. Are user requests tracked via a ticket system?
  2. Are "the 3 empowering policies" defined and published?
  3. How do users get help?
  4. What is an emergency?
  5. What is supported?
  6. Does the team record monthly metrics?

Modern Team Practices

  1. Do you have a "policy and procedure" wiki?
  2. Do you have a password safe?
  3. Is your team's code kept in a source code control system?
  4. Does your team use a bug-tracking system for their own code?
  5. In your bugs/tickets, does stability have a higher priority than new features?
  6. Does your team write "design docs"?
  7. Do you have a "post-mortem" process?

Operational Practices

  1. Does each service have an OpsDoc?
  2. Does each service have appropriate monitoring?
  3. Do you have a pager rotation schedule?
  4. Do you have separate development, QA, and production systems?
  5. Do roll-outs to many machines have a "canary process"?
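
As an illustration of the canary question above, here is a minimal sketch of one way a roll-out could be gated on a small first batch. The names `canary_rollout`, `deploy`, `healthy`, and `canary_count` are hypothetical and are not taken from the report card; a real process would likely also wait between stages and support rollback.

```python
from typing import Callable, Iterable, List


def canary_rollout(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    canary_count: int = 1,
) -> List[str]:
    """Deploy to a few canary hosts first; continue only if they stay healthy.

    `deploy` and `healthy` are caller-supplied (hypothetical) hooks.
    Returns the list of hosts that were updated.
    """
    hosts = list(hosts)
    canaries, rest = hosts[:canary_count], hosts[canary_count:]
    updated: List[str] = []

    # Stage 1: update the canaries and stop at the first unhealthy one.
    for host in canaries:
        deploy(host)
        updated.append(host)
        if not healthy(host):
            raise RuntimeError(f"canary {host} unhealthy; rollout halted")

    # Stage 2: the canaries look fine, so update the remaining machines.
    for host in rest:
        deploy(host)
        updated.append(host)
    return updated
```

If a canary fails its health check, the remaining machines are never touched, which is the point of the practice.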

Automation Practices

  1. Do you use configuration management tools like CFEngine/Puppet/Chef?
  2. Do automated administration tasks run under role accounts?
  3. Do automated processes that generate e-mail only do so when they have something to say?
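
One way to approach the last question is to wrap scheduled jobs so that a clean, silent run produces no message at all. The sketch below is only an assumption about how that might look; `build_report` and `run_quietly` are hypothetical helper names, and the rule used here (report on any output or a non-zero exit status) is one possible policy among several.

```python
import subprocess
import sys
from typing import Optional, Sequence


def build_report(command: Sequence[str], returncode: int, output: str) -> Optional[str]:
    """Return text worth sending, or None when the run has nothing to say."""
    output = output.strip()
    if returncode == 0 and not output:
        return None  # clean, silent run: send nothing
    status = "succeeded" if returncode == 0 else f"FAILED (exit {returncode})"
    return f"{' '.join(command)} {status}\n\n{output}".rstrip()


def run_quietly(command: Sequence[str]) -> Optional[str]:
    """Run a command, capture stdout and stderr together, and build a report."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return build_report(command, result.returncode, result.stdout)
```

A scheduler entry could call `run_quietly` and print or mail the result only when it is not `None`, so an inbox message always means something needs attention.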

Fleet Management Processes

  1. Is there a database of all machines?
  2. Is OS installation automated?
  3. Can you automatically patch software across your entire fleet?
  4. Do you have a PC refresh policy?

Disaster Preparation Practices

  1. Can your servers keep operating even if 1 disk dies?
  2. Is the network core N+1?
  3. Are your backups automated?
  4. Are your disaster recovery plans tested periodically?
  5. Do machines in your data center have remote power / console access?

Security Practices

  1. Do desktops/laptops/servers run self-updating, silent, anti-malware software?
  2. Do you have a written security policy?
  3. Do you submit to periodic security audits?
  4. Can a user's account be disabled on all systems in 1 hour?
  5. Can you change all privileged (root) passwords in 1 hour?