Regular Tasks

Regular Tasks covers how normal, non-emergency operational duties are handled: how work is received, queued, distributed, processed, and verified, plus how periodic tasks are scheduled and performed. All services have some kind of normal, scheduled or unscheduled work that needs to be done. Often web operations teams do not perform direct customer support, but there are interteam requests, requests from stakeholders, and escalations from direct customer support teams. These topics are covered in Chapters 12 and 14.
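
As a minimal illustration of tracking requests against an SLA (one of the assessment points below), the following Python sketch flags queued requests that have waited longer than a hypothetical four-hour acknowledgment SLA. The Request fields and the threshold are assumptions made for this example, not part of the original assessment.

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Optional

    # Hypothetical SLA for acknowledging a regular (non-emergency) request.
    ACK_SLA = timedelta(hours=4)

    @dataclass
    class Request:
        ticket_id: str
        received_at: datetime
        acknowledged_at: Optional[datetime] = None

        def breaches_sla(self, now: datetime) -> bool:
            """True if the request waited longer than ACK_SLA for acknowledgment."""
            waited_until = self.acknowledged_at or now
            return waited_until - self.received_at > ACK_SLA

    def overdue(queue: list[Request], now: datetime) -> list[Request]:
        """Return the queued requests that have breached the acknowledgment SLA."""
        return [r for r in queue if r.breaches_sla(now)]

    if __name__ == "__main__":
        now = datetime(2021, 3, 17, 12, 0)
        queue = [
            Request("REQ-1", received_at=now - timedelta(hours=6)),   # still waiting, overdue
            Request("REQ-2", received_at=now - timedelta(hours=1)),   # within SLA
            Request("REQ-3", received_at=now - timedelta(hours=7),
                    acknowledged_at=now - timedelta(hours=2)),        # acknowledged after the SLA
        ]
        for r in overdue(queue, now):
            print(f"{r.ticket_id} breached the {ACK_SLA} acknowledgment SLA")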

Sample Assessment Questions

  • What are the common and periodic operational tasks and duties?
  • Is there a playbook for common operational duties?
  • What is the SLA for regular requests?
  • How is the need for new playbook entries identified? Who may write new entries? Edit existing ones?
  • How are requests from users received and tracked?
  • Is there a playbook for common user requests?
  • How often are user requests not covered by the playbook?
  • How do users engage us for support? (online and physical locations)
  • How do users know how to engage us for support?
  • How do users know what is supported and what isn’t?
  • How do we respond to requests for support of the unsupported?
  • What are the limits of regular support (hours of operation, remote or on-site)? How do users know these limits?
  • Are different size categories handled differently? How is size determined?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • There is no playbook, or it is out of date and unused.
  • Results are inconsistent.
  • Different people do tasks differently.
  • Two users requesting the same thing usually get different results.
  • Processes aren’t documented.
  • The team can’t enumerate all the processes it performs (even at a high level).
  • Requests get lost or stalled indefinitely.
  • The organization cannot predict how long common tasks will take to complete.
  • Operational problems, if reported, don’t get attention.

Level 2: Repeatable

  • There is a finite list of which services are supported by the team.
  • Each end-to-end process has each step enumerated, with dependencies.
  • Each end-to-end process has each step’s process documented.
  • Different people do the tasks the same way.
  • Sadly, there is some duplication of effort seen in the flow.
  • Sadly, some information needed by multiple tasks may be re-created by each step that needs it.

Level 3: Defined

  • The team has an SLA defined for most requests, though it may not be adhered to.
  • Each step has a QA checklist to be completed before handing off to next step.
  • Teams learn of process changes by other teams ahead of time.
  • Information or processing needed by multiple steps is created once.
  • There is no (or minimal) duplication of effort.
  • The ability to turn up new capacity is a repeatable process.

Level 4: Managed

  • The defined SLA is measured.
  • There are feedback mechanisms for all steps.
  • There is periodic (weekly?) review of defects and reworks.
  • Postmortems are published for all to see, with a draft report available within x hours and a final report completed within y days.
  • There is periodic review of alerts by the affected team. There is periodic review of alerts by a cross-functional team.
  • Process change requests require data to measure the problem being fixed.
  • Dashboards report data in business terms (i.e., not just technical terms).
  • Every “failover procedure” has a “date of last use” dashboard.
  • Capacity needs are predicted ahead of need.

Level 5: Optimizing

  • After process changes are made, before/after data are compared to determine success.
  • Process changes are reverted if before/after data shows no improvement.
  • Process changes that have been acted on come from a variety of sources.
  • At least one process change has come from every step (in recent history).
  • Cycle time enjoys month-over-month improvements.
  • Decisions are supported by modeling “what if” scenarios using extracted actual data.

Emergency Response

Emergency Response covers how outages and disasters are handled. This includes engineering resilient systems that prevent outages plus technical and non-technical processes performed during and after outages (response and remediation). These topics are covered in Chapters 6, 14, and 15.
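
Level 2 below expects a repeatable process for creating the oncall calendar. The following Python sketch shows one way that could look, assuming a simple weekly round-robin over a hypothetical roster; a real calendar would also need escalation, overrides, and time-zone handling.

    from datetime import date, timedelta
    from itertools import cycle

    def weekly_oncall_calendar(team: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
        """Assign one primary oncall engineer per week, rotating through the roster."""
        rotation = cycle(team)
        return [(start + timedelta(weeks=i), next(rotation)) for i in range(weeks)]

    if __name__ == "__main__":
        # Hypothetical roster; a real calendar would also record escalation and overrides.
        roster = ["alice", "bob", "carol", "dave"]
        for week_start, engineer in weekly_oncall_calendar(roster, date(2021, 4, 5), weeks=6):
            print(f"week of {week_start}: {engineer} (primary)")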

Sample Assessment Questions

  • How are outages detected? (automatic monitoring? user complaints?)
  • Is there a playbook for common failover scenarios and outage-related duties?
  • Is there an oncall calendar?
  • How is the oncall calendar created?
  • Can the system withstand failures on the local level (component failure)?
  • Can the system withstand failures on the geographic level (alternative datacenters)?
  • Are staff geographically distributed (i.e., can other regions cover for each other for extended periods of time)?
  • Do you write postmortems? Is there a deadline for when a postmortem must be completed?
  • Is there a standard template for postmortems?
  • Are postmortems reviewed to assure action items are completed?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Outages are reported by users rather than a monitoring system.
  • No one is ever oncall, a single person is always oncall, or everyone is always oncall.
  • There is no oncall schedule.
  • There is no oncall calendar.
  • There is no playbook of what to do for various alerts.

Level 2: Repeatable

  • A monitoring system contacts the oncall person.
  • There is an oncall schedule with escalation plan.
  • There is a repeatable process for creating the next month’s oncall calendar.
  • A playbook item exists for any possible alert.
  • A postmortem template exists.
  • Postmortems are written occasionally but not consistently.
  • Oncall coverage is geographically diverse (multiple time zones).

Level 3: Defined

  • Outages are classified by size (i.e., minor, major, catastrophic).
  • Limits (and minimums) for how often people should be oncall are defined.
  • Postmortems are written for all major outages.
  • There is an SLA defined for alert response: initial, hands-on-keyboard, issue resolved, postmortem complete.

Level 4: Managed

  • The oncall pain is shared by the people most able to fix problems.
  • How often people are oncall is verified against the policy.
  • Postmortems are reviewed.
  • There is a mechanism to triage recommendations in postmortems and assure they are completed.
  • The SLA is actively measured.

Level 5: Optimizing

  • Stress testing and failover testing are done frequently (quarterly or monthly).
  • “Game Day” exercises (intensive, system-wide tests) are done periodically.
  • The monitoring system alerts before outages occur (indications of “sick” systems rather than “down” systems).
  • Mechanisms exist so that any failover procedure not utilized in recent history is activated artificially.
  • Experiments are performed to improve SLA compliance.

Monitoring and Metrics (MM)

Monitoring and Metrics covers collecting and using data to make decisions. Monitoring collects data about a system. Metrics uses that data to measure a quantifiable component of performance. This includes technical metrics such as bandwidth, speed, or latency; derived metrics such as ratios, sums, averages, and percentiles; and business goals such as the efficient use of resources or compliance with a service level agreement (SLA). These topics are covered in Chapters 16, 17, and 19.
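
As an illustration of derived metrics, the following Python sketch computes an average, a nearest-rank percentile, and an availability ratio from hypothetical raw monitoring samples; the sample values are invented for the example.

    import math
    import statistics

    def percentile(samples: list[float], p: float) -> float:
        """Nearest-rank percentile of raw samples (0 < p <= 100)."""
        ordered = sorted(samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    def availability(successes: int, failures: int) -> float:
        """Fraction of requests that succeeded; a common SLA-style ratio."""
        total = successes + failures
        return successes / total if total else 1.0

    if __name__ == "__main__":
        # Hypothetical raw latency samples (milliseconds) collected by monitoring.
        latencies_ms = [12, 15, 14, 230, 18, 16, 17, 400, 13, 15]
        print(f"average latency: {statistics.mean(latencies_ms):.1f} ms")
        print(f"p90 latency:     {percentile(latencies_ms, 90):.1f} ms")
        print(f"availability:    {availability(successes=9990, failures=10):.4%}")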

Sample Assessment Questions

  • Is the service level objective (SLO) documented? How do you know your SLO matches customer needs?
  • Do you have a dashboard? Is it in technical or business terms?
  • How accurate are the collected data and the predictions? How do you know?
  • How efficient is the service? Are machines over- or under-utilized? How is utilization measured?
  • How is latency measured?
  • How is availability measured?
  • How do you know if the monitoring system itself is down?
  • How do you know if the data used to calculate key performance indicators (KPIs) is fresh? Is there a dashboard that shows measurement freshness and accuracy?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • No SLOs are documented.
  • If there is monitoring, not everything is monitored, and there is no way to check completeness.
  • Systems and services are manually added to the monitoring system, if at all: there is no process.
  • There are no dashboards.
  • Little or no measurement or metrics.
  • You think customers are happy but they aren’t.
  • It is common (and rewarded) to enact optimizations that benefit a person or small group to the detriment of the larger organization or system.
  • Departmental goals emphasize departmental performance to the detriment of organizational performance.

Level 2: Repeatable

  • The process for creating machines/server instances assures they will be monitored.

Level 3: Defined

  • SLOs are documented.
  • Business KPIs are defined.
  • The freshness of business KPI data is defined.
  • A system exists to verify that all services are monitored.
  • The monitoring system itself is monitored (meta-monitoring).

Level 4: Managed

  • SLOs are documented and monitored.
  • Defined KPIs are measured.
  • Dashboards exist showing each step’s completion time; the lag time of each step is identified.
  • Dashboards exist showing current bottlenecks, backlogs, and idle steps.
  • Dashboards show defect and rework counts.
  • Capacity planning is performed for the monitoring system and all analysis systems.
  • The freshness of the data used to calculate KPIs is measured.

Level 5: Optimizing

  • The accuracy of collected data is verified through active testing.
  • KPIs are calculated using data that is less than a minute old.
  • Dashboards and other analysis displays are based on fresh data.
  • Dashboards and other displays load quickly.
  • Capacity planning for storage, CPU, and network of the monitoring system is done with the same sophistication as any major service.

Capacity Planning (CP)

Capacity Planning covers determining future resource needs. All services require some kind of planning for future resources. Services tend to grow. Capacity planning involves the technical work of understanding how many resources are needed per unit of growth, plus non-technical aspects such as budgeting, forecasting, and supply chain management. These topics are covered in Chapter 18.
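
As a sketch of the "resources per unit of growth" idea, the following Python example extrapolates demand linearly from hypothetical monthly measurements and converts it into whole servers with a headroom margin. The driver (queries per second), the capacity per server, and the 20 percent headroom are assumptions; a real capacity plan would use the statistical models the questions below ask about.

    import math
    import statistics

    def forecast_servers(monthly_qps: list[float], months_ahead: int,
                         qps_per_server: float, headroom: float = 0.20) -> int:
        """Project demand forward using the average month-over-month growth,
        then convert it into whole servers, keeping a headroom margin for surges."""
        growth = statistics.mean(b - a for a, b in zip(monthly_qps, monthly_qps[1:]))
        projected_qps = monthly_qps[-1] + growth * months_ahead
        return math.ceil(projected_qps * (1 + headroom) / qps_per_server)

    if __name__ == "__main__":
        # Hypothetical core driver: peak queries per second observed each month.
        history = [1200, 1350, 1500, 1680, 1850]
        print("servers needed in 3 months:",
              forecast_servers(history, months_ahead=3, qps_per_server=400))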

Sample Assessment Questions

  • How much capacity do you have now?
  • How much capacity do you expect to need three months from now? Twelve months from now?
  • Which statistical models do you use for determining future needs?
  • How do you load-test?
  • How much time does capacity planning take? What could be done to make it easier?
  • Are metrics collected automatically?
  • Are metrics available always or does their need initiate a process that collects them?
  • Is capacity planning the job of no one, everyone, a specific person, or a team of capacity planners?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • No inventory is kept.
  • The system runs out of capacity from time to time.
  • Determining how much capacity to add is done by tradition, guessing, or luck.
  • Operations is reactive about capacity planning, often not being able to fulfill the demand for capacity in time.
  • Capacity planning is everyone’s job, and therefore no one’s job.
  • No one is specifically assigned to handle CP duties.
  • A large amount of headroom exists rather than knowing precisely how much slack is needed.

Level 2: Repeatable

  • CP metrics are collected on demand, or only when needed.
  • The process for collecting CP metrics is written and repeatable.
  • Load testing is done occasionally, perhaps when a service is new.
  • Inventory of all systems is accurate, possibly due to manual effort.

Level 3: Defined

  • CP metrics are automatically collected.
  • Capacity required for a certain amount of growth is well defined.
  • There is a dedicated CP person on the team.
  • CP requirements are defined at a subsystem level.
  • Load testing is triggered by major software and hardware changes.
  • Inventory is updated as part of capacity changes.
  • The amount of headroom needed to survive typical surges is defined.

Level 4: Managed

  • CP metrics are collected continuously (daily/weekly instead of monthly or quarterly).
  • Additional capacity is gained automatically, with human approval.
  • Performance regressions are detected during testing; CP is involved if the regression will survive into production (i.e., it is not a bug).
  • Dashboards include CP information.
  • Changes in correlation are automatically detected and raise a ticket for CP to verify and adjust relationships between core drivers and resource units.
  • Unexpected increases in demand are automatically detected using MACD metrics or a similar technique, which generates a ticket for the CP person or team.
  • The amount of headroom in the system is monitored.

Level 5: Optimizing

  • Past CP projections are compared with actual results.
  • Load testing is done as part of a continuous test environment.
  • The team employs a statistician.
  • Additional capacity is gained automatically.
  • The amount of headroom is systematically optimized to reduce waste.

Change Management (CM)

Change Management covers how services are deliberately changed over time. This includes the software delivery platform—the steps involved in a software release: develop, build, test, and push into production. For hardware, this includes firmware upgrades and minor hardware revisions. These topics are covered in Chapters 9, 10, and 11.
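
Levels 3 and 4 below call for defining and measuring the gap between release availability and deployment, and for tracking deployment success. The following Python sketch computes both from a hypothetical release log; the Release fields and dates are invented for the example.

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from statistics import mean

    @dataclass
    class Release:
        version: str
        available_at: datetime    # when developers delivered the release
        deployed_at: datetime     # when it was pushed to production
        succeeded: bool           # whether the deployment met the defined success criteria

    def mean_release_to_deploy_gap(releases: list[Release]) -> timedelta:
        """Average time between a release becoming available and its deployment."""
        gaps = [(r.deployed_at - r.available_at).total_seconds() for r in releases]
        return timedelta(seconds=mean(gaps))

    def change_success_rate(releases: list[Release]) -> float:
        """Fraction of deployments that met the success definition."""
        return sum(r.succeeded for r in releases) / len(releases)

    if __name__ == "__main__":
        # Hypothetical release history.
        log = [
            Release("1.4.0", datetime(2021, 3, 1, 9), datetime(2021, 3, 2, 14), True),
            Release("1.4.1", datetime(2021, 3, 8, 9), datetime(2021, 3, 8, 16), True),
            Release("1.5.0", datetime(2021, 3, 15, 9), datetime(2021, 3, 17, 11), False),
        ]
        print("mean availability-to-deploy gap:", mean_release_to_deploy_gap(log))
        print(f"change success rate: {change_success_rate(log):.0%}")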

Sample Assessment Questions

  • How often are deployments (releases pushed into production)?
  • How much human labor does it take?
  • When a release is received, does the operations team need to change anything in it before it is pushed?
  • How does operations know if a release is major or minor, a big or small change? How are these types of releases handled differently?
  • How does operations know if a release is successful?
  • How often have releases failed?
  • How does operations know that new releases are available?
  • Are there change-freeze windows?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Deployments are done sparingly, as they are very risky.
  • The deployment process is ad hoc and laborious.
  • Developers notify operations of new releases when a release is ready for deployment.
  • Releases are not deployed until weeks or months after they are available.
  • Operations and developers bicker over when to deploy releases.

Level 2: Repeatable

  • The deployment is no longer ad hoc.
  • Deployment is manual but consistent.
  • Releases are deployed as delivered.
  • Deployments fail often.

Level 3: Defined

  • What constitutes a successful deployment is defined.
  • Minor and major releases are handled differently.
  • The expected time gap between release availability and deployment is defined.

Level 4: Managed

  • Deployment success/failure is measured against definitions.
  • Deployments fail rarely.
  • The expected time gap between release availability and deployment is measured.

Level 5: Optimizing

  • Continuous deployment is in use.
  • Failed deployments are extremely rare.
  • New releases are deployed with little delay.

New Product Introduction and Removal (NPI/NPR)

New Product Introduction and Removal covers how new products and services are introduced into the environment and how they are removed. This is a coordination function: introducing a new product or service requires a support infrastructure that may touch multiple teams.

For example, before a new model of computer hardware is introduced into the datacenter environment, certain teams must have access to sample hardware for testing and qualification, the purchasing department must have a process to purchase the machines, and datacenter technicians need documentation. For introducing software and services, there should be tasks such as requirements gathering, evaluation and procurement, licensing, and creation of playbooks for the helpdesk and operations.

Product removal might involve finding all machines with a particularly old release of an operating system and seeing that all of them get upgraded. Product removal requires identifying current users, agreeing on timelines for migrating them away, updating documentation, and eventually decommissioning the product, any associated licenses, maintenance contracts, monitoring, and playbooks. The majority of the work consists of communication and coordination between teams.

Sample Assessment Questions

  • How is new hardware introduced into the environment? Which teams are involved and how do they communicate? How long does the process take?
  • How is old hardware or software eliminated from the system?
  • What is the process for disposing of old hardware?
  • Which steps are taken to ensure disks and other storage are erased when disposed?
  • How is new software or a new service brought into being? Which teams are involved and how do they communicate? How long does the process take?
  • What is the process for handoff between teams?
  • Which tools are used?
  • Is documentation current?
  • Which steps involve human interaction? How could it be eliminated?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • New products are introduced through ad hoc measures and individual heroics.
  • Teams are surprised by NPI, often learning they must deploy something into production with little notice.
  • NPI is delayed due to lack of capacity, miscommunication, or errors.
  • Deprecating old products is rarely done, resulting in operations having to support an “infinite” number of hardware or software versions.

Level 2: Repeatable

  • The process used for NPI/NPR is repeatable.
  • The handoff between teams is written and agreed upon.
  • Each team has a playbook for tasks related to its involvement with NPR/NPI.
  • Equipment erasure and disposal is documented and verified.

Level 3: Defined

  • Expectations for how long NPI/NPR will take are defined.
  • The handoff between teams is encoded in a machine-readable format.
  • Members of all teams understand their role as it fits into the larger, overall process.
  • The maximum number of products supported by each team is defined.
  • The list of each team’s currently supported products is available to all teams.

Level 4: Managed

  • There are dashboards for observing NPI and NPR progress.
  • The handoff between teams is actively revised and improved.
  • The number of no-longer-supported products is tracked.
  • Decommissioning no-longer-supported products is a high priority.

Level 5: Optimizing

  • NPI/NPR tasks have become API calls between teams.
  • NPI/NPR processes are self-service by the team responsible.
  • The handoff between teams is a linear flow (or for very complex systems, joining multiple linear flows).

Service Deployment and Decommissioning (SDD)

Service Deployment and Decommissioning covers how instances of an existing service are created and how they are turned off (decommissioned). After a service is designed, it is usually deployed repeatedly. Deployment may involve turning up satellite replicas in new datacenters or creating a development environment of an existing service. Decommissioning could be part of turning down a datacenter, reducing excess capacity, or turning down a particular service instance such as a demo environment.

Sample Assessment Questions

  • What is the process for turning up a service instance?
  • What is the process for turning down a service instance?
  • How is new capacity added? How is unused capacity turned down?
  • Which steps involve human interaction? How could it be eliminated?
  • How many teams touch these processes?
  • Do all teams know how they fit into the overall picture?
  • What is the workflow from team to team?
  • Which tools are used?
  • Is documentation current?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • The process is undocumented and haphazard. Results are inconsistent.
  • The process is defined by who does something, not what is done.
  • Requests get delayed due to miscommunication, lack of resources, or other avoidable reasons.
  • Different people do the tasks differently.

Level 2: Repeatable

  • The processes required to deploy or decommission a service are understood and documented.
  • The process for each step is documented and verified.
  • Each step has a QA checklist to be completed before handing off to the next step.
  • Teams learn of process changes by other teams ahead of time.
  • Information or processing needed by multiple steps is created once.
  • There is no (or minimal) duplication of effort.
  • The ability to turn up new capacity is a repeatable process.
  • Equipment erasure and disposal is documented and verified.

Level 3: Defined

  • The SLA for how long each step should take is defined.
  • For physical deployments, standards for removal of waste material (boxes, wrappers, containers) are based on local environmental standards.
  • For physical decommissions, standards for disposing of old hardware are based on local environmental standards as well as the organization’s own standards for data erasure.
  • Tools exist to implement many of the steps and processes.

Level 4: Managed

  • The defined SLA for each step is measured.
  • There are feedback mechanisms for all steps.
  • There is periodic review of defects and reworks.
  • Capacity needs are predicted ahead of need.
  • Equipment disposal compliance is measured against organization standards as well as local environmental law.
  • Waste material (boxes, wrappers, containers) involved in deployment is measured.
  • Quantity of equipment disposal is measured.

Level 5: Optimizing

  • After process changes are made, before/after data is compared to determine success.
  • Process changes are reverted if before/after data shows no improvement.
  • Process changes that have been acted on come from a variety of sources.
  • Cycle time enjoys month-over-month improvements.
  • Decisions are supported by modeling “what if” scenarios using extracts from actual data.
  • Equipment disposal is optimized by the reduction of equipment deployment.

Performance and Efficiency (PE)

Performance and Efficiency covers how cost-effectively resources are used and how well the service performs. A running service needs to have good performance without wasting resources. We can generally improve performance by using more resources, or we may be able to improve efficiency to the detriment of performance. Achieving both requires a large effort to bring about equilibrium. Cost-efficiency is cost of resources divided by quantity of use. Resource efficiency is quantity of resources divided by quantity of use. To calculate these statistics, one must know how many resources exist; thus some kind of inventory is required.
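
Applying the two formulas above to hypothetical numbers, the following Python sketch computes cost efficiency and resource efficiency for one month of a service. The server count, cost, and query volume are invented for the example; the server count is the kind of figure the inventory must supply.

    def cost_efficiency(resource_cost: float, quantity_of_use: float) -> float:
        """Cost of resources divided by quantity of use (e.g., dollars per million queries)."""
        return resource_cost / quantity_of_use

    def resource_efficiency(resource_quantity: float, quantity_of_use: float) -> float:
        """Quantity of resources divided by quantity of use (e.g., servers per million queries)."""
        return resource_quantity / quantity_of_use

    if __name__ == "__main__":
        # Hypothetical month: 40 servers (from the inventory) costing $12,000, serving 300M queries.
        servers, monthly_cost, million_queries = 40, 12_000.0, 300.0
        print(f"cost efficiency:     ${cost_efficiency(monthly_cost, million_queries):.2f} per million queries")
        print(f"resource efficiency: {resource_efficiency(servers, million_queries):.3f} servers per million queries")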

Sample Assessment Questions

  • What is the formula used to measure performance?
  • What is the formula used to determine utilization?
  • What is the formula used to determine resource efficiency?
  • What is the formula used to determine cost efficiency?
  • How is performance variation measured?
  • Are performance, utilization, and resource efficiency monitored automatically? Is there a dashboard for each?
  • Is there an inventory of the machines and servers used in this service?
  • How is the inventory kept up-to-date?
  • How would you know if something was missing from the inventory?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Performance and utilization are not consistently measured.
  • What is measured depends on who set up the systems and services.
  • Resource efficiency is not measured.
  • Performance problems often come as a surprise and are hard to diagnose and resolve because there is insufficient data.
  • Inventory is not up-to-date.
  • Inventory may or may not be updated, depending on who is involved in receiving or disposing of items.

Level 2: Repeatable

  • All metrics relevant to performance and utilization are collected across all systems and services.
  • The process for bringing up new systems and services is documented and everyone follows the process.
  • Systems are associated with services when configured for use by a service, and disassociated when released.
  • Inventory is up-to-date. The inventory process is well documented and everyone follows the process.

Level 3: Defined

  • Performance and utilization monitoring is automatically configured for all systems and services during installation and removed during decommission.
  • Performance targets for each service are defined.
  • Resource usage targets for each service are defined.
  • Formulas for service-oriented performance and utilization metrics are defined.
  • Performance of each service is monitored continuously.
  • Resource utilization of each service is monitored continuously.
  • Idle capacity that is not currently used by any service is monitored.
  • The desired amount of headroom is defined.
  • The roles and responsibilities for keeping the inventory up-to-date are defined.
  • Systems for tracking the devices that are connected to the network and their hardware configurations are in place.

Level 4: Managed

  • Dashboards track performance, utilization, and resource efficiency.
  • Minimum, maximum, and 90th percentile headroom are tracked and compared to the desired headroom and are visible on a dashboard.
  • Goals for performance and efficiency are set and tracked.
  • There are periodic reviews of performance and efficiency goals and status for each service.
  • KPIs are used to set performance, utilization, and resource efficiency goals that drive optimal behavior.
  • Automated systems track the devices that are on the network and their configurations and compare them with the inventory system, flagging problems when they are found.

Level 5: Optimizing

  • Bottlenecks are identified using the performance dashboard. Changes are made as a result.
  • Services that use large amounts of resources are identified and changes are made.
  • Changes are reverted if the changes do not have a positive effect.
  • Computer hardware models are regularly evaluated to find models where utilization of the different resources is better balanced.
  • Other sources of hardware and other hardware models are regularly evaluated to determine if cost efficiency can be improved.

Service Delivery: The Build Phase

Service delivery is the technical process of how a service is created. It starts with source code created by developers and ends with a service running in production.
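
As a minimal sketch of an automated build, the following Python driver checks out sources from version control, runs unit tests, and builds a package, stopping at the first failure. The repository URL and make targets are placeholders; a real build system would also collect the lead-time and coverage metrics asked about below.

    import subprocess
    import sys

    # Hypothetical build steps; the real commands depend on the project's toolchain.
    BUILD_STEPS = [
        ["git", "clone", "--depth", "1", "https://example.com/repo.git", "build-src"],
        ["make", "-C", "build-src", "test"],      # run unit tests
        ["make", "-C", "build-src", "package"],   # produce an installable package
    ]

    def run_build() -> int:
        """Run each build step in order, stopping at the first failure."""
        for step in BUILD_STEPS:
            print("running:", " ".join(step))
            result = subprocess.run(step)
            if result.returncode != 0:
                print("build failed at:", " ".join(step), file=sys.stderr)
                return result.returncode
        print("build succeeded; package ready for the deployment phase")
        return 0

    if __name__ == "__main__":
        sys.exit(run_build())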

Sample Assessment Questions

  • How is software built from source code to packages?
  • Is the final package built from source or do developers deliver precompiled elements?
  • What percentage of code is covered by unit tests?
  • Which tests are fully automated?
  • Are metrics collected about bug lead time, code lead time, and patch lead time?
  • To build the software, do all raw source files come from version control repositories?
  • To build the software, how many places (repositories or other sources) are accessed to attain all raw source files?
  • Is the resulting software delivered as a package or a set of files?
  • Is everything required for deployment delivered in the package?
  • Which package repository is used to hand off the results to the deployment phase?
  • Is there a single build console for status and control of all steps?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Each person builds in his or her own environment.
  • People check in code without checking that it builds.
  • Developers deliver precompiled elements to be packaged.
  • Little or no unit testing is performed.
  • No metrics are collected.
  • Version control systems are not used to store source files.
  • Building the software is a manual process or has manual steps.
  • The master copies of some source files are kept in personal home directories or computers.

Level 2: Repeatable

  • The build environment is defined; everyone uses the same system for consistent results.
  • Building the software is still done manually.
  • Testing is done manually.
  • Some unit tests exist.
  • Source files are kept in version-controlled repositories.
  • Software packages are used as the means of delivering the end result.
  • If multiple platforms are supported, each is repeatable, though possibly independently.

Level 3: Defined

  • Building the software is automated.
  • Triggers for automated builds are defined.
  • Expectations around unit test coverage are defined; they are less than 100 percent.
  • Metrics for bug lead time, code lead time, and patch lead time are defined.
  • Inputs and outputs of each step are defined.

Level 4: Managed

  • Success/fail build ratios are measured and tracked on a dashboard.
  • Metrics for bug lead time, code lead time, and patch lead time are collected automatically.
  • Metrics are presented on a dashboard.
  • Unit test coverage is measured and tracked.

Level 5: Optimizing

  • Metrics are used to select optimization projects.
  • Attempts to optimize the process involve collecting before and after metrics.
  • Each developer can perform the end-to-end build process in his or her own sandbox before committing changes to a centralized repository.
  • Insufficient unit test code coverage stops production.
  • If multiple platforms are supported, building for one is as easy as building for them all.
  • The software delivery platform is used for building infrastructure as well as applications.

Service Delivery: The Deployment Phase

The goal of the deployment phase is to create a running environment. It creates the service in one or more testing and production environments, which are then used for testing or for live production services.
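
The questions and levels below reference metrics such as frequency of deployment and mean time to restore service. The following Python sketch computes both from a hypothetical deployment and outage history; the dates and the 30-day window are assumptions for the example.

    from datetime import datetime, timedelta
    from statistics import mean

    def deployment_frequency(deploy_times: list[datetime], window_days: int = 30) -> float:
        """Average number of production deployments per week over the window."""
        cutoff = max(deploy_times) - timedelta(days=window_days)
        recent = [t for t in deploy_times if t >= cutoff]
        return len(recent) / (window_days / 7)

    def mean_time_to_restore(outages: list[tuple[datetime, datetime]]) -> timedelta:
        """Average time from a service-impacting failure to restoration."""
        durations = [(restored - began).total_seconds() for began, restored in outages]
        return timedelta(seconds=mean(durations))

    if __name__ == "__main__":
        # Hypothetical deployment and outage history.
        deploys = [datetime(2021, 3, d, 10) for d in (1, 3, 8, 10, 15, 17, 22, 24, 29)]
        outages = [
            (datetime(2021, 3, 10, 10, 5), datetime(2021, 3, 10, 10, 35)),
            (datetime(2021, 3, 24, 10, 2), datetime(2021, 3, 24, 11, 12)),
        ]
        print(f"deployments per week: {deployment_frequency(deploys):.1f}")
        print("mean time to restore:", mean_time_to_restore(outages))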

Sample Assessment Questions

  • How are packages deployed in production?
  • How much downtime is required to deploy the service in production?
  • Are metrics collected about frequency of deployment, mean time to restore service, and change success rate?
  • How is the decision made to promote a package from testing to production?
  • Which kind of testing is done (system, performance, load, user acceptance)?
  • How is deployment handled differently for small, medium, and large releases?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply?

Level 1: Initial

  • Deployment involves or requires manual steps.
  • Deployments into the testing and production environments are different processes, each with its own tools and procedures.
  • Different people on the team perform deployments differently.
  • Deployment requires downtime, and sometimes significant downtime.
  • How a release is promoted to production is ad hoc or ill defined.
  • Testing is manual, ill defined, or not done.

Level 2: Repeatable

  • Deployment is performed in a documented, repeatable process.
  • If deployment requires downtime, it is predictable.
  • Testing procedures are documented and repeatable.

Level 3: Defined

  • Metrics for frequency of deployment, mean time to restore service, and change success rate are defined.
  • How downtime due to deployments is to be measured is defined; limits and expectations are defined.
  • How a release is promoted to production is defined.
  • Testing results are clearly communicated to all stakeholders.

Level 4: Managed

  • Metrics for frequency of deployment, mean time to restore service, and change success rate are collected automatically.
  • Metrics are presented on a dashboard.
  • Downtime due to deployments is measured automatically.
  • Reduced production capacity during deployment is measured.
  • Tests are fully automated.

Level 5: Optimizing

  • Metrics are used to select optimization projects.
  • Attempts to optimize the process involve collecting before and after metrics.
  • Deployment is fully automated.
  • Promotion decisions are fully automated, perhaps with a few specific exceptions.
  • Deployment requires no downtime.

Toil Reduction

Toil Reduction is the process by which we improve the use of people within our system. When we reduce toil (i.e., exhausting physical labor), we create a more sustainable working environment for operational staff. While reducing toil is not a service per se, this OR can be used to assess the amount of toil and determine whether practices are in place to limit it.
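
Levels 2 and 3 below define a red-flag threshold and a target level for toil. The following Python sketch computes the toil fraction from time-tracking data and applies that red-flag logic; the 50 percent threshold, 30 percent target, and weekly hours are assumptions for the example.

    # Hypothetical thresholds: raise a red flag above 50% toil, lower it once toil
    # drops back to the 30% target (values are assumptions for illustration).
    RED_FLAG_THRESHOLD = 0.50
    TARGET_LEVEL = 0.30

    def toil_fraction(toil_hours: float, project_hours: float) -> float:
        """Fraction of tracked working time spent on toil rather than project work."""
        total = toil_hours + project_hours
        return toil_hours / total if total else 0.0

    def red_flag_state(fraction: float, currently_raised: bool) -> bool:
        """Raise the flag above the threshold; lower it only at or below the target."""
        if fraction > RED_FLAG_THRESHOLD:
            return True
        if fraction <= TARGET_LEVEL:
            return False
        return currently_raised  # between target and threshold: keep the current state

    if __name__ == "__main__":
        # (toil_hours, project_hours) tracked per week.
        weeks = [(22, 18), (25, 15), (14, 26), (10, 30)]
        raised = False
        for toil, project in weeks:
            fraction = toil_fraction(toil, project)
            raised = red_flag_state(fraction, raised)
            print(f"toil {fraction:.0%} -> red flag {'raised' if raised else 'lowered'}")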

Sample Assessment Questions

  • How many hours each week are spent on coding versus non-coding projects?
  • What percent of time is spent on project work versus manual labor that could be automated?
  • What percentage of time spent on manual labor should raise a red flag?
  • What is the process for detecting that the percentage of manual labor has exceeded the red flag threshold?
  • What is the process for raising a red flag? Whose responsibility is it?
  • What happens after a red flag is raised? When is it lowered?
  • How are projects for reducing toil identified? How are they prioritized?
  • How is the effectiveness of those projects measured?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Toil is not measured and grows until no project work, or almost no project work, can be accomplished.
  • There is no process for raising a red flag.
  • Some individuals recognize when toil is becoming a problem and look for solutions, but others are unaware of the problem.
  • Individuals choose to work on the projects that are the most interesting to them, without looking at which projects will have the biggest impact.

Level 2: Repeatable

  • The amount of time spent on toil versus on projects is measured.
  • The percentage of time spent on toil that constitutes a problem is defined and communicated.
  • The process for raising a red flag is documented and communicated.
  • Individuals track their own toil to project work ratio, and are individually responsible for raising a red flag.
  • Red flags may not always be raised when they should be.
  • The process for identifying which projects will have the greatest impact on toil reduction is defined.
  • The method for prioritizing projects is documented.

Level 3: Defined

  • For each team, the person responsible for tracking toil and raising a red flag is identified.
  • The people involved in identifying and prioritizing toil-reduction projects are known.
  • Both a red flag level of toil and a target level are defined. The red flag is lowered when toil reaches the target level.
  • During the red flag period, the team works on only the highest-impact toil-reduction projects.
  • During the red flag period, the team has management support for putting other projects on hold until toil is reduced to a target level.
  • After each step in a project, statistics on toil are closely monitored, providing feedback on any positive or negative changes.

Level 4: Managed

  • Project time versus toil is tracked on a dashboard, and the amount of time spent on each individual project or manual task is also tracked.
  • Red flags are raised automatically, and the dashboard gives an overview of where the problems lie.
  • The time-tracking data is monitored for trends that give an early alert for teams that are showing an increase in toil in one or more areas.
  • KPIs are defined and tracked to keep toil within the desired range and minimize the red flag periods.

Level 5: Optimizing

  • The target and red flag levels are adjusted, and the results are monitored for the effect on overall flow, performance, and innovation.
  • Changes to the main project prioritization process are introduced and evaluated for positive or negative impact, including the impact on toil.
  • Changes to the red flag toil-reduction task prioritization process are introduced and evaluated.

Disaster Preparedness

An operations organization needs to be able to handle outages well, and it must have practices that reduce the chance of repeating past mistakes. Disasters and major outages happen. Everyone in the company from the top down needs to recognize that fact, and adopt a mind-set that accepts outages and learns from them. Systems should be designed to be resilient to failure.
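
Level 5 here (and in Emergency Response) expects that any failover procedure not exercised recently is triggered artificially. The following Python sketch scans "date of last use" records and lists the procedures due for a drill; the procedure names and the 90-day policy are assumptions for the example.

    from datetime import date, timedelta

    # Hypothetical policy: every failover procedure must be exercised at least once
    # every 90 days, via a drill if no real outage has exercised it naturally.
    MAX_AGE = timedelta(days=90)

    def procedures_due_for_drill(last_use: dict[str, date], today: date) -> list[str]:
        """Return failover procedures whose 'date of last use' is older than MAX_AGE."""
        return [name for name, used in last_use.items() if today - used > MAX_AGE]

    if __name__ == "__main__":
        last_use = {
            "primary-db-failover": date(2020, 11, 2),
            "secondary-datacenter-cutover": date(2021, 2, 20),
            "cache-cold-restart": date(2020, 12, 15),
        }
        for procedure in procedures_due_for_drill(last_use, today=date(2021, 3, 17)):
            print("schedule a drill for:", procedure)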

Sample Assessment Questions

  • What is the SLA? Which tools and processes are in place to ensure that the SLA is met?
  • How complete are the playbooks?
  • When was each scenario in the playbooks last exercised?
  • What is the mechanism for exercising different failure modes?
  • How are new team members trained to be prepared to handle disasters?
  • Which roles and responsibilities apply during a disaster?
  • How do you prepare for disasters?
  • How are disasters used to improve future operations and disaster response?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Disasters are handled in an ad hoc manner, requiring individual heroics.
  • Playbooks do not exist, or do not cover all scenarios.
  • Little or no training exists.
  • Service resiliency and different failure scenarios are never tested.

Level 2: Repeatable

  • Playbooks exist for all failure modes, including large-scale disasters.
  • New team members receive on-the-job training.
  • Disasters are handled consistently, independent of who is responding.
  • If multiple team members respond, their roles, responsibilities, and handoffs are not clearly defined, leading to some duplication of effort.

Level 3: Defined

  • The SLA is defined, including dates for postmortem reports.
  • Handoff procedures are defined, including checks to be performed and documented.
  • How to scale the responding team to make efficient use of more team members is defined.
  • The roles and responsibilities of team members in a disaster are defined.
  • Specific disaster preparedness training for new team members is defined and implemented.
  • The team has regular disaster preparedness exercises.
  • The exercises include fire drills performed on the live service.
  • After every disaster, a postmortem report is produced and circulated.

Level 4: Managed

  • The SLA is tracked using dashboards.
  • The timing for every step in the process from the moment the event occurred is tracked on the dashboard.
  • A program for disaster preparedness training ensures that all aspects are covered.
  • The disaster preparedness program measures the results of disaster preparedness training.
  • As teams become better at handling disasters, the training expands to cover more complex scenarios.
  • Teams are involved in cross-functional fire drills that involve multiple teams and services.
  • Dates for publishing initial and final postmortem reports are tracked and measured against the SLA.

Level 5: Optimizing

  • Areas for improvement are identified from the dashboards.
  • New techniques and processes are tested and the results measured and used for further decision making.
  • Automated systems ensure that every failure mode is exercised within a certain period, by artificially causing a failure if one has not occurred naturally.