The Operations Report Card
Public Facing Practices
- Are user requests tracked via a ticket system?
- Are "the 3 empowering policies" defined and published?
  - How do users get help?
  - What is an emergency?
  - What is supported?
- Does the team record monthly metrics?
Modern Team Practices
- Do you have a "policy and procedure" wiki?
- Do you have a password safe?
- Is your team's code kept in a source code control system?
- Does your team use a bug-tracking system for their own code?
- In your bugs/tickets, does stability have a higher priority than new features?
- Does your team write "design docs"?
- Do you have a "post-mortem" process?
- Does each service have an OpsDoc?
- Does each service have appropriate monitoring?
- Do you have a pager rotation schedule?
- Do you have separate development, QA, and production systems?
- Do roll-outs to many machines have a "canary process"?
- Do you use configuration management tools like CFEngine/Puppet/Chef?
- Do automated administration tasks run under role accounts?
- Do automated processes that generate e-mail only do so when they have something to say?
Fleet Management Processes
- Is there a database of all machines?
- Is OS installation automated?
- Can you automatically patch software across your entire fleet?
- Do you have a PC refresh policy?
Disaster Preparation Practices
- Can your servers keep operating even if 1 disk dies?
- Is the network core N+1?
- Are your backups automated?
- Are your disaster recovery plans tested periodically?
- Do machines in your data center have remote power / console access?
Security Practices
- Do desktops/laptops/servers run self-updating, silent, anti-malware software?
- Do you have a written security policy?
- Do you submit to periodic security audits?
- Can a user's account be disabled on all systems in 1 hour?
- Can you change all privileged (root) passwords in 1 hour?