@sjourdan
Last active December 24, 2023 16:35
The Operations Report Card

Source: http://www.opsreportcard.com/.

Public Facing Practices

  1. Are user requests tracked via a ticket system?
  2. Are "the 3 empowering policies" defined and published?
  3. Does the team record monthly metrics?

Modern Team Practices

  1. Do you have a "policy and procedure" wiki?
  2. Do you have a password safe?
  3. Is your team's code kept in a source code control system?
  4. Does your team use a bug-tracking system for their own code?
  5. In your bugs/tickets, does stability have a higher priority than new features?
  6. Does your team write "design docs"?
  7. Do you have a "post-mortem" process?

Operational Practices

  1. Does each service have an OpsDoc?
  2. Does each service have appropriate monitoring?
  3. Do you have a pager rotation schedule?
  4. Do you have separate development, QA, and production systems?
  5. Do roll-outs to many machines have a "canary process"?

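
As an illustration of the "canary process" question above, here is a minimal Python sketch. It assumes a caller-supplied `deploy` function and `is_healthy` check; both names, along with `canary_rollout` and `canary_count`, are hypothetical and do not come from the source. The idea is to update a small number of machines first and touch the rest of the fleet only if those canaries stay healthy.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_count: int = 1,
) -> List[str]:
    """Deploy to a few canaries first; continue only if they all stay healthy.

    Returns the hosts that received the new release.
    """
    canaries = list(hosts[:canary_count])
    rest = list(hosts[canary_count:])
    updated: List[str] = []

    for host in canaries:
        deploy(host)
        updated.append(host)
        if not is_healthy(host):
            # Halt here: the rest of the fleet keeps the old release.
            return updated

    for host in rest:
        deploy(host)
        updated.append(host)
    return updated
```

A real roll-out would probably also wait between the canary stage and the full stage and roll the canary back on failure; the sketch leaves both out to stay short.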
Automation Practices

  1. Do you use configuration management tools like CFEngine, Puppet, or Chef?
  2. Do automated administration tasks run under role accounts?
  3. Do automated processes that generate e-mail only do so when they have something to say?
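
One way to approach the e-mail question above is to wrap each scheduled job so that it stays silent on success. The Python sketch below is a rough illustration under that assumption. The names `run_quietly` and `main` are hypothetical, and it relies on the common cron behavior of mailing whatever a job prints to the job's owner.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def run_quietly(cmd: Sequence[str]) -> Optional[str]:
    """Run cmd; return a report only if it failed or printed something."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    if result.returncode == 0 and not output:
        return None  # nothing to say, so nothing is sent
    return (
        f"command: {' '.join(cmd)}\n"
        f"exit status: {result.returncode}\n"
        f"{output}"
    )


def main(argv: List[str]) -> int:
    """Print a report (which cron would mail) only when there is one."""
    report = run_quietly(argv)
    if report is None:
        return 0
    print(report)
    return 1
```

Under these assumptions a crontab entry would invoke the wrapper with the real command as its arguments, for example by calling `sys.exit(main(sys.argv[1:]))` from a small script.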

Fleet Management Processes

  1. Is there a database of all machines?
  2. Is OS installation automated?
  3. Can you automatically patch software across your entire fleet?
  4. Do you have a PC refresh policy?
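
The machine database and refresh policy questions above fit together: once every machine is recorded, a refresh policy can be checked mechanically. The Python sketch below assumes a hypothetical `Machine` record and a three-year refresh interval. The field names and the interval are illustrative choices, not taken from the source.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Machine:
    hostname: str
    owner: str
    os_version: str
    purchased: date


def due_for_refresh(
    machines: Iterable[Machine],
    today: date,
    max_age_years: int = 3,  # assumed policy interval
) -> List[str]:
    """Return hostnames whose age exceeds the refresh interval."""
    cutoff_days = max_age_years * 365  # approximate: ignores leap days
    return [
        m.hostname
        for m in machines
        if (today - m.purchased).days > cutoff_days
    ]
```

In practice the records would come from an inventory system rather than being built by hand; the in-memory list keeps the sketch self-contained.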

Disaster Preparation Practices

  1. Can your servers keep operating even if 1 disk dies?
  2. Is the network core N+1?
  3. Are your backups automated?
  4. Are your disaster recovery plans tested periodically?
  5. Do machines in your data center have remote power / console access?

Security Practices

  1. Do desktops, laptops, and servers run self-updating, silent anti-malware software?
  2. Do you have a written security policy?
  3. Do you submit to periodic security audits?
  4. Can a user's account be disabled on all systems in 1 hour?
  5. Can you change all privileged (root) passwords in 1 hour?