Regular Tasks

Regular Tasks covers how normal, non-emergency operational duties are handled: how work is received, queued, distributed, processed, and verified, plus how periodic tasks are scheduled and performed. All services have some kind of normal, scheduled or unscheduled work that needs to be done. Often web operations teams do not perform direct customer support, but there are interteam requests, requests from stakeholders, and escalations from direct customer support teams. These topics are covered in Chapters 12 and 14.
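
As a minimal illustration of tracking requests against an SLA (one of the assessment points below), the following Python sketch flags queued requests that have waited longer than a hypothetical four-hour acknowledgment SLA. The Request fields and the threshold are assumptions made for this example, not part of the original assessment.

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Optional

    # Hypothetical SLA for acknowledging a regular (non-emergency) request.
    ACK_SLA = timedelta(hours=4)

    @dataclass
    class Request:
        ticket_id: str
        received_at: datetime
        acknowledged_at: Optional[datetime] = None

        def breaches_sla(self, now: datetime) -> bool:
            """True if the request waited longer than ACK_SLA for acknowledgment."""
            waited_until = self.acknowledged_at or now
            return waited_until - self.received_at > ACK_SLA

    def overdue(queue: list[Request], now: datetime) -> list[Request]:
        """Return the queued requests that have breached the acknowledgment SLA."""
        return [r for r in queue if r.breaches_sla(now)]

    if __name__ == "__main__":
        now = datetime(2021, 3, 17, 12, 0)
        queue = [
            Request("REQ-1", received_at=now - timedelta(hours=6)),   # still waiting, overdue
            Request("REQ-2", received_at=now - timedelta(hours=1)),   # within SLA
            Request("REQ-3", received_at=now - timedelta(hours=7),
                    acknowledged_at=now - timedelta(hours=2)),        # acknowledged after the SLA
        ]
        for r in overdue(queue, now):
            print(f"{r.ticket_id} breached the {ACK_SLA} acknowledgment SLA")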

Sample Assessment Questions

  • What are the common and periodic operational tasks and duties?
  • Is there a playbook for common operational duties?
  • What is the SLA for regular requests?
  • How is the need for new playbook entries identified? Who may write new entries? Edit existing ones?
  • How are requests from users received and tracked?
  • Is there a playbook for common user requests?
  • How often are user requests not covered by the playbook?
  • How do users engage us for support? (online and physical locations)
  • How do users know how to engage us for support?
  • How do users know what is supported and what isn’t?
  • How do we respond to requests for support of the unsupported?
  • What are the limits of regular support (hours of operation, remote or on-site)? How do users know these limits?
  • Are different size categories handled differently? How is size determined?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • There is no playbook, or it is out of date and unused.
  • Results are inconsistent.
  • Different people do tasks differently.
  • Two users requesting the same thing usually get different results.
  • Processes aren’t documented.
  • The team can’t enumerate all the processes it performs (even at a high level).
  • Requests get lost or stalled indefinitely.
  • The organization cannot predict how long common tasks will take to complete.
  • Operational problems, if reported, don’t get attention.

Level 2: Repeatable

  • There is a finite list of which services are supported by the team.
  • Each end-to-end process has each step enumerated, with dependencies.
  • Each end-to-end process has each step’s process documented.
  • Different people do the tasks the same way.
  • Sadly, there is some duplication of effort seen in the flow.
  • Sadly, some information needed by multiple tasks may be re-created by each step that needs it.

Level 3: Defined

  • The team has an SLA defined for most requests, though it may not be adhered to.
  • Each step has a QA checklist to be completed before handing off to next step.
  • Teams learn of process changes by other teams ahead of time.
  • Information or processing needed by multiple steps is created once.
  • There is no (or minimal) duplication of effort.
  • The ability to turn up new capacity is a repeatable process.

Level 4: Managed

  • The defined SLA is measured.
  • There are feedback mechanisms for all steps.
  • There is periodic (weekly?) review of defects and reworks.
  • Postmortems are published for all to see, with a draft report available within x hours and a final report completed within y days.
  • There is periodic review of alerts by the affected team. There is periodic review of alerts by a cross-functional team.
  • Process change requests require data to measure the problem being fixed.
  • Dashboards report data in business terms (i.e., not just technical terms).
  • Every “failover procedure” has a “date of last use” dashboard.
  • Capacity needs are predicted ahead of need.

Level 5: Optimizing

  • After process changes are made, before/after data are compared to determine success.
  • Process changes are reverted if before/after data shows no improvement.
  • Process changes that have been acted on come from a variety of sources.
  • At least one process change has come from every step (in recent history).
  • Cycle time enjoys month-over-month improvements.
  • Decisions are supported by modeling “what if” scenarios using extracted actual data.

Emergency Response

Emergency Response covers how outages and disasters are handled. This includes engineering resilient systems that prevent outages plus technical and non-technical processes performed during and after outages (response and remediation). These topics are covered in Chapters 6, 14, and 15.
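
Level 2 below expects a repeatable process for creating the oncall calendar. The following Python sketch shows one way that could look, assuming a simple weekly round-robin over a hypothetical roster; a real calendar would also need escalation, overrides, and time-zone handling.

    from datetime import date, timedelta
    from itertools import cycle

    def weekly_oncall_calendar(team: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
        """Assign one primary oncall engineer per week, rotating through the roster."""
        rotation = cycle(team)
        return [(start + timedelta(weeks=i), next(rotation)) for i in range(weeks)]

    if __name__ == "__main__":
        # Hypothetical roster; a real calendar would also record escalation and overrides.
        roster = ["alice", "bob", "carol", "dave"]
        for week_start, engineer in weekly_oncall_calendar(roster, date(2021, 4, 5), weeks=6):
            print(f"week of {week_start}: {engineer} (primary)")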

Sample Assessment Questions

  • How are outages detected? (automatic monitoring? user complaints?)
  • Is there a playbook for common failover scenarios and outage-related duties?
  • Is there an oncall calendar?
  • How is the oncall calendar created?
  • Can the system withstand failures on the local level (component failure)?
  • Can the system withstand failures on the geographic level (alternative datacenters)?
  • Are staff geographically distributed (i.e., can other regions cover for each other for extended periods of time)?
  • Do you write postmortems? Is there a deadline for when a postmortem must be completed?
  • Is there a standard template for postmortems?
  • Are postmortems reviewed to assure action items are completed?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Outages are reported by users rather than a monitoring system.
  • No one is ever oncall, a single person is always oncall, or everyone is always oncall.
  • There is no oncall schedule.
  • There is no oncall calendar.
  • There is no playbook of what to do for various alerts.

Level 2: Repeatable

  • A monitoring system contacts the oncall person.
  • There is an oncall schedule with escalation plan.
  • There is a repeatable process for creating the next month’s oncall calendar.
  • A playbook item exists for any possible alert.
  • A postmortem template exists.
  • Postmortems are written occasionally but not consistently.
  • Oncall coverage is geographically diverse (multiple time zones).

Level 3: Defined

  • Outages are classified by size (i.e., minor, major, catastrophic).
  • Limits (and minimums) for how often people should be oncall are defined.
  • Postmortems are written for all major outages.
  • There is an SLA defined for alert response: initial, hands-on-keyboard, issue resolved, postmortem complete.

Level 4: Managed

  • The oncall pain is shared by the people most able to fix problems.
  • How often people are oncall is verified against the policy.
  • Postmortems are reviewed.
  • There is a mechanism to triage recommendations in postmortems and assure they are completed.
  • The SLA is actively measured.

Level 5: Optimizing

  • Stress testing and failover testing are done frequently (quarterly or monthly).
  • “Game Day” exercises (intensive, system-wide tests) are done periodically.
  • The monitoring system alerts before outages occur (indications of “sick” systems rather than “down” systems).
  • Mechanisms exist so that any failover procedure not utilized in recent history is activated artificially.
  • Experiments are performed to improve SLA compliance.

Monitoring and Metrics (MM)

Monitoring and Metrics covers collecting and using data to make decisions. Monitoring collects data about a system. Metrics uses that data to measure a quantifiable component of performance. This includes technical metrics such as bandwidth, speed, or latency; derived metrics such as ratios, sums, averages, and percentiles; and business goals such as the efficient use of resources or compliance with a service level agreement (SLA). These topics are covered in Chapters 16, 17, and 19.
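
As an illustration of derived metrics, the following Python sketch computes an average, a nearest-rank percentile, and an availability ratio from hypothetical raw monitoring samples; the sample values are invented for the example.

    import math
    import statistics

    def percentile(samples: list[float], p: float) -> float:
        """Nearest-rank percentile of raw samples (0 < p <= 100)."""
        ordered = sorted(samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    def availability(successes: int, failures: int) -> float:
        """Fraction of requests that succeeded; a common SLA-style ratio."""
        total = successes + failures
        return successes / total if total else 1.0

    if __name__ == "__main__":
        # Hypothetical raw latency samples (milliseconds) collected by monitoring.
        latencies_ms = [12, 15, 14, 230, 18, 16, 17, 400, 13, 15]
        print(f"average latency: {statistics.mean(latencies_ms):.1f} ms")
        print(f"p90 latency:     {percentile(latencies_ms, 90):.1f} ms")
        print(f"availability:    {availability(successes=9990, failures=10):.4%}")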

Sample Assessment Questions

  • Is the service level objective (SLO) documented? How do you know your SLO matches customer needs?
  • Do you have a dashboard? Is it in technical or business terms?
  • How accurate are the collected data and the predictions? How do you know?
  • How efficient is the service? Are machines over- or under-utilized? How is utilization measured?
  • How is latency measured?
  • How is availability measured?
  • How do you know if the monitoring system itself is down?
  • How do you know if the data used to calculate key performance indicators (KPIs) is fresh? Is there a dashboard that shows measurement freshness and accuracy?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • No SLOs are documented.
  • If there is monitoring, not everything is monitored, and there is no way to check completeness.
  • Systems and services are manually added to the monitoring system, if at all: there is no process.
  • There are no dashboards.
  • Little or no measurement or metrics.
  • You think customers are happy but they aren’t.
  • It is common (and rewarded) to enact optimizations that benefit a person or small group to the detriment of the larger organization or system.
  • Departmental goals emphasize departmental performance to the detriment of organizational performance.

Level 2: Repeatable

  • The process for creating machines/server instances assures they will be monitored.

Level 3: Defined

  • SLOs are documented.
  • Business KPIs are defined.
  • The freshness of business KPI data is defined.
  • A system exists to verify that all services are monitored.
  • The monitoring system itself is monitored (meta-monitoring).

Level 4: Managed

  • SLOs are documented and monitored.
  • Defined KPIs are measured.
  • Dashboards exist showing each step’s completion time; the lag time of each step is identified.
  • Dashboards exist showing current bottlenecks, backlogs, and idle steps.
  • Dashboards show defect and rework counts.
  • Capacity planning is performed for the monitoring system and all analysis systems.
  • The freshness of the data used to calculate KPIs is measured.

Level 5: Optimizing

  • The accuracy of collected data is verified through active testing.
  • KPIs are calculated using data that is less than a minute old.
  • Dashboards and other analysis displays are based on fresh data.
  • Dashboards and other displays load quickly.
  • Capacity planning for storage, CPU, and network of the monitoring system is done with the same sophistication as any major service.

Capacity Planning (CP)

Capacity Planning covers determining future resource needs. All services require some kind of planning for future resources. Services tend to grow. Capacity planning involves the technical work of understanding how many resources are needed per unit of growth, plus non-technical aspects such as budgeting, forecasting, and supply chain management. These topics are covered in Chapter 18.
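
As a sketch of the "resources per unit of growth" idea, the following Python example extrapolates demand linearly from hypothetical monthly measurements and converts it into whole servers with a headroom margin. The driver (queries per second), the capacity per server, and the 20 percent headroom are assumptions; a real capacity plan would use the statistical models the questions below ask about.

    import math
    import statistics

    def forecast_servers(monthly_qps: list[float], months_ahead: int,
                         qps_per_server: float, headroom: float = 0.20) -> int:
        """Project demand forward using the average month-over-month growth,
        then convert it into whole servers, keeping a headroom margin for surges."""
        growth = statistics.mean(b - a for a, b in zip(monthly_qps, monthly_qps[1:]))
        projected_qps = monthly_qps[-1] + growth * months_ahead
        return math.ceil(projected_qps * (1 + headroom) / qps_per_server)

    if __name__ == "__main__":
        # Hypothetical core driver: peak queries per second observed each month.
        history = [1200, 1350, 1500, 1680, 1850]
        print("servers needed in 3 months:",
              forecast_servers(history, months_ahead=3, qps_per_server=400))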

Sample Assessment Questions

  • How much capacity do you have now?
  • How much capacity do you expect to need three months from now? Twelve months from now?
  • Which statistical models do you use for determining future needs?
  • How do you load-test?
  • How much time does capacity planning take? What could be done to make it easier?
  • Are metrics collected automatically?
  • Are metrics available always or does their need initiate a process that collects them?
  • Is capacity planning the job of no one, everyone, a specific person, or a team of capacity planners?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • No inventory is kept.
  • The system runs out of capacity from time to time.
  • Determining how much capacity to add is done by tradition, guessing, or luck.
  • Operations is reactive about capacity planning, often not being able to fulfill the demand for capacity in time.
  • Capacity planning is everyone’s job, and therefore no one’s job.
  • No one is specifically assigned to handle CP duties.
  • A large amount of headroom exists rather than knowing precisely how much slack is needed.

Level 2: Repeatable

  • CP metrics are collected on demand, or only when needed.
  • The process for collecting CP metrics is written and repeatable.
  • Load testing is done occasionally, perhaps when a service is new.
  • Inventory of all systems is accurate, possibly due to manual effort.

Level 3: Defined

  • CP metrics are automatically collected.
  • Capacity required for a certain amount of growth is well defined.
  • There is a dedicated CP person on the team.
  • CP requirements are defined at a subsystem level.
  • Load testing is triggered by major software and hardware changes.
  • Inventory is updated as part of capacity changes.
  • The amount of headroom needed to survive typical surges is defined.

Level 4: Managed

  • CP metrics are collected continuously (daily/weekly instead of monthly or quarterly).
  • Additional capacity is gained automatically, with human approval.
  • Performance regressions are detected during testing; CP is involved if the regression will survive into production (i.e., it is not a bug).
  • Dashboards include CP information.
  • Changes in correlation are automatically detected and raise a ticket for CP to verify and adjust relationships between core drivers and resource units.
  • Unexpected increases in demand are automatically detected using MACD metrics or a similar technique, which generates a ticket for the CP person or team.
  • The amount of headroom in the system is monitored.

Level 5: Optimizing

  • Past CP projections are compared with actual results.
  • Load testing is done as part of a continuous test environment.
  • The team employs a statistician.
  • Additional capacity is gained automatically.
  • The amount of headroom is systematically optimized to reduce waste.

Change Management (CM)

Change Management covers how services are deliberately changed over time. This includes the software delivery platform—the steps involved in a software release: develop, build, test, and push into production. For hardware, this includes firmware upgrades and minor hardware revisions. These topics are covered in Chapters 9, 10, and 11.
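
Levels 3 and 4 below call for defining and measuring the gap between release availability and deployment, and for tracking deployment success. The following Python sketch computes both from a hypothetical release log; the Release fields and dates are invented for the example.

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from statistics import mean

    @dataclass
    class Release:
        version: str
        available_at: datetime    # when developers delivered the release
        deployed_at: datetime     # when it was pushed to production
        succeeded: bool           # whether the deployment met the defined success criteria

    def mean_release_to_deploy_gap(releases: list[Release]) -> timedelta:
        """Average time between a release becoming available and its deployment."""
        gaps = [(r.deployed_at - r.available_at).total_seconds() for r in releases]
        return timedelta(seconds=mean(gaps))

    def change_success_rate(releases: list[Release]) -> float:
        """Fraction of deployments that met the success definition."""
        return sum(r.succeeded for r in releases) / len(releases)

    if __name__ == "__main__":
        # Hypothetical release history.
        log = [
            Release("1.4.0", datetime(2021, 3, 1, 9), datetime(2021, 3, 2, 14), True),
            Release("1.4.1", datetime(2021, 3, 8, 9), datetime(2021, 3, 8, 16), True),
            Release("1.5.0", datetime(2021, 3, 15, 9), datetime(2021, 3, 17, 11), False),
        ]
        print("mean availability-to-deploy gap:", mean_release_to_deploy_gap(log))
        print(f"change success rate: {change_success_rate(log):.0%}")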

Sample Assessment Questions

  • How often are deployments (releases pushed into production)?
  • How much human labor does it take?
  • When a release is received, does the operations team need to change anything in it before it is pushed?
  • How does operations know if a release is major or minor, a big or small change? How are these types of releases handled differently?
  • How does operations know if a release is successful?
  • How often have releases failed?
  • How does operations know that new releases are available?
  • Are there change-freeze windows?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Deployments are done sparingly, as they are very risky.
  • The deployment process is ad hoc and laborious.
  • Developers notify operations of new releases when a release is ready for deployment.
  • Releases are not deployed until weeks or months after they are available.
  • Operations and developers bicker over when to deploy releases.

Level 2: Repeatable

  • The deployment is no longer ad hoc.
  • Deployment is manual but consistent.
  • Releases are deployed as delivered.
  • Deployments fail often.

Level 3: Defined

  • What constitutes a successful deployment is defined.
  • Minor and major releases are handled differently.
  • The expected time gap between release availability and deployment is defined.

Level 4: Managed

  • Deployment success/failure is measured against definitions.
  • Deployments fail rarely.
  • The expected time gap between release availability and deployment is measured.

Level 5: Optimizing

  • Continuous deployment is in use.
  • Failed deployments are extremely rare.
  • New releases are deployed with little delay.

New Product Introduction and Removal (NPI/NPR)

New Product Introduction and Removal covers how new products and services are introduced into the environment and how they are removed. This is a coordination function: introducing a new product or service requires a support infrastructure that may touch multiple teams.

For example, before a new model of computer hardware is introduced into the datacenter environment, certain teams must have access to sample hardware for testing and qualification, the purchasing department must have a process to purchase the machines, and datacenter technicians need documentation. For introducing software and services, there should be tasks such as requirements gathering, evaluation and procurement, licensing, and creation of playbooks for the helpdesk and operations.

Product removal might involve finding all machines with a particularly old release of an operating system and seeing that all of them get upgraded. Product removal requires identifying current users, agreeing on timelines for migrating them away, updating documentation, and eventually decommissioning the product, any associated licenses, maintenance contracts, monitoring, and playbooks. The majority of the work consists of communication and coordination between teams.

Sample Assessment Questions

  • How is new hardware introduced into the environment? Which teams are involved and how do they communicate? How long does the process take?
  • How is old hardware or software eliminated from the system?
  • What is the process for disposing of old hardware?
  • Which steps are taken to ensure disks and other storage are erased when disposed?
  • How is new software or a new service brought into being? Which teams are involved and how do they communicate? How long does the process take?
  • What is the process for handoff between teams?
  • Which tools are used?
  • Is documentation current?
  • Which steps involve human interaction? How could it be eliminated?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • New products are introduced through ad hoc measures and individual heroics.
  • Teams are surprised by NPI, often learning they must deploy something into production with little notice.
  • NPI is delayed due to lack of capacity, miscommunication, or errors.
  • Deprecating old products is rarely done, resulting in operations having to support an “infinite” number of hardware or software versions.

Level 2: Repeatable

  • The process used for NPI/NPR is repeatable.
  • The handoff between teams is written and agreed upon.
  • Each team has a playbook for tasks related to its involvement with NPR/NPI.
  • Equipment erasure and disposal is documented and verified.

Level 3: Defined

  • Expectations for how long NPI/NPR will take are defined.
  • The handoff between teams is encoded in a machine-readable format.
  • Members of all teams understand their role as it fits into the larger, overall process.
  • The maximum number of products supported by each team is defined.
  • The list of each team’s currently supported products is available to all teams.

Level 4: Managed

  • There are dashboards for observing NPI and NPR progress.
  • The handoff between teams is actively revised and improved.
  • The number of no-longer-supported products is tracked.
  • Decommissioning no-longer-supported products is a high priority.

Level 5: Optimizing

  • NPI/NPR tasks have become API calls between teams.
  • NPI/NPR processes are self-service by the team responsible.
  • The handoff between teams is a linear flow (or for very complex systems, joining multiple linear flows).

Service Deployment and Decommissioning (SDD)

Service Deployment and Decommissioning covers how instances of an existing service are created and how they are turned off (decommissioned). After a service is designed, it is usually deployed repeatedly. Deployment may involve turning up satellite replicas in new datacenters or creating a development environment of an existing service. Decommissioning could be part of turning down a datacenter, reducing excess capacity, or turning down a particular service instance such as a demo environment.

Sample Assessment Questions

  • What is the process for turning up a service instance?
  • What is the process for turning down a service instance?
  • How is new capacity added? How is unused capacity turned down?
  • Which steps involve human interaction? How could it be eliminated?
  • How many teams touch these processes?
  • Do all teams know how they fit into the overall picture?
  • What is the workflow from team to team?
  • Which tools are used?
  • Is documentation current?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • The process is undocumented and haphazard. Results are inconsistent.
  • The process is defined by who does something, not what is done.
  • Requests get delayed due to miscommunication, lack of resources, or other avoidable reasons.
  • Different people do the tasks differently.

Level 2: Repeatable

  • The processes required to deploy or decommission a service are understood and documented.
  • The process for each step is documented and verified.
  • Each step has a QA checklist to be completed before handing off to the next step.
  • Teams learn of process changes by other teams ahead of time.
  • Information or processing needed by multiple steps is created once.
  • There is no (or minimal) duplication of effort.
  • The ability to turn up new capacity is a repeatable process.
  • Equipment erasure and disposal is documented and verified.

Level 3: Defined

  • The SLA for how long each step should take is defined.
  • For physical deployments, standards for removal of waste material (boxes, wrappers, containers) are based on local environmental standards.
  • For physical decommissions, standards for disposing of old hardware are based on local environmental standards as well as the organization’s own standards for data erasure.
  • Tools exist to implement many of the steps and processes.

Level 4: Managed

  • The defined SLA for each step is measured.
  • There are feedback mechanisms for all steps.
  • There is periodic review of defects and reworks.
  • Capacity needs are predicted ahead of need.
  • Equipment disposal compliance is measured against organization standards as well as local environmental law.
  • Waste material (boxes, wrappers, containers) involved in deployment is measured.
  • Quantity of equipment disposal is measured.

Level 5: Optimizing

  • After process changes are made, before/after data is compared to determine success.
  • Process changes are reverted if before/after data shows no improvement.
  • Process changes that have been acted on come from a variety of sources.
  • Cycle time enjoys month-over-month improvements.
  • Decisions are supported by modeling “what if” scenarios using extracts from actual data.
  • Equipment disposal is optimized by the reduction of equipment deployment.

Performance and Efficiency (PE)

Performance and Efficiency covers how cost-effectively resources are used and how well the service performs. A running service needs to have good performance without wasting resources. We can generally improve performance by using more resources, or we may be able to improve efficiency to the detriment of performance. Achieving both requires a large effort to bring about equilibrium. Cost-efficiency is cost of resources divided by quantity of use. Resource efficiency is quantity of resources divided by quantity of use. To calculate these statistics, one must know how many resources exist; thus some kind of inventory is required.
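
Applying the two formulas above to hypothetical numbers, the following Python sketch computes cost efficiency and resource efficiency for one month of a service. The server count, cost, and query volume are invented for the example; the server count is the kind of figure the inventory must supply.

    def cost_efficiency(resource_cost: float, quantity_of_use: float) -> float:
        """Cost of resources divided by quantity of use (e.g., dollars per million queries)."""
        return resource_cost / quantity_of_use

    def resource_efficiency(resource_quantity: float, quantity_of_use: float) -> float:
        """Quantity of resources divided by quantity of use (e.g., servers per million queries)."""
        return resource_quantity / quantity_of_use

    if __name__ == "__main__":
        # Hypothetical month: 40 servers (from the inventory) costing $12,000, serving 300M queries.
        servers, monthly_cost, million_queries = 40, 12_000.0, 300.0
        print(f"cost efficiency:     ${cost_efficiency(monthly_cost, million_queries):.2f} per million queries")
        print(f"resource efficiency: {resource_efficiency(servers, million_queries):.3f} servers per million queries")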

Sample Assessment Questions

  • What is the formula used to measure performance?
  • What is the formula used to determine utilization?
  • What is the formula used to determine resource efficiency?
  • What is the formula used to determine cost efficiency?
  • How is performance variation measured?
  • Are performance, utilization, and resource efficiency monitored automatically? Is there a dashboard for each?
  • Is there an inventory of the machines and servers used in this service?
  • How is the inventory kept up-to-date?
  • How would you know if something was missing from the inventory?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Performance and utilization are not consistently measured.
  • What is measured depends on who set up the systems and services.
  • Resource efficiency is not measured.
  • Performance problems often come as a surprise and are hard to diagnose and resolve because there is insufficient data.
  • Inventory is not up-to-date.
  • Inventory may or may not be updated, depending on who is involved in receiving or disposing of items.

Level 2: Repeatable

  • All metrics relevant to performance and utilization are collected across all systems and services.
  • The process for bringing up new systems and services is documented and everyone follows the process.
  • Systems are associated with services when configured for use by a service, and disassociated when released.
  • Inventory is up-to-date. The inventory process is well documented and everyone follows the process.

Level 3: Defined

  • Performance and utilization monitoring is automatically configured for all systems and services during installation and removed during decommission.
  • Performance targets for each service are defined.
  • Resource usage targets for each service are defined.
  • Formulas for service-oriented performance and utilization metrics are defined.
  • Performance of each service is monitored continuously.
  • Resource utilization of each service is monitored continuously.
  • Idle capacity that is not currently used by any service is monitored.
  • The desired amount of headroom is defined.
  • The roles and responsibilities for keeping the inventory up-to-date are defined.
  • Systems for tracking the devices that are connected to the network and their hardware configurations are in place.

Level 4: Managed

  • Dashboards track performance, utilization, and resource efficiency.
  • Minimum, maximum, and 90th percentile headroom are tracked and compared to the desired headroom and are visible on a dashboard.
  • Goals for performance and efficiency are set and tracked.
  • There are periodic reviews of performance and efficiency goals and status for each service.
  • KPIs are used to set performance, utilization, and resource efficiency goals that drive optimal behavior.
  • Automated systems track the devices that are on the network and their configurations and compare them with the inventory system, flagging problems when they are found.

Level 5: Optimizing

  • Bottlenecks are identified using the performance dashboard. Changes are made as a result.
  • Services that use large amounts of resources are identified and changes are made.
  • Changes are reverted if the changes do not have a positive effect.
  • Computer hardware models are regularly evaluated to find models where utilization of the different resources is better balanced.
  • Other sources of hardware and other hardware models are regularly evaluated to determine if cost efficiency can be improved.

Service Delivery: The Build Phase

Service delivery is the technical process of how a service is created. It starts with source code created by developers and ends with a service running in production.
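
As a minimal sketch of an automated build, the following Python driver checks out sources from version control, runs unit tests, and builds a package, stopping at the first failure. The repository URL and make targets are placeholders; a real build system would also collect the lead-time and coverage metrics asked about below.

    import subprocess
    import sys

    # Hypothetical build steps; the real commands depend on the project's toolchain.
    BUILD_STEPS = [
        ["git", "clone", "--depth", "1", "https://example.com/repo.git", "build-src"],
        ["make", "-C", "build-src", "test"],      # run unit tests
        ["make", "-C", "build-src", "package"],   # produce an installable package
    ]

    def run_build() -> int:
        """Run each build step in order, stopping at the first failure."""
        for step in BUILD_STEPS:
            print("running:", " ".join(step))
            result = subprocess.run(step)
            if result.returncode != 0:
                print("build failed at:", " ".join(step), file=sys.stderr)
                return result.returncode
        print("build succeeded; package ready for the deployment phase")
        return 0

    if __name__ == "__main__":
        sys.exit(run_build())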

Sample Assessment Questions

  • How is software built from source code to packages?
  • Is the final package built from source or do developers deliver precompiled elements?
  • What percentage of code is covered by unit tests?
  • Which tests are fully automated?
  • Are metrics collected about bug lead time, code lead time, and patch lead time?
  • To build the software, do all raw source files come from version control repositories?
  • To build the software, how many places (repositories or other sources) are accessed to attain all raw source files?
  • Is the resulting software delivered as a package or a set of files?
  • Is everything required for deployment delivered in the package?
  • Which package repository is used to hand off the results to the deployment phase?
  • Is there a single build console for status and control of all steps?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Each person builds in his or her own environment.
  • People check in code without checking that it builds.
  • Developers deliver precompiled elements to be packaged.
  • Little or no unit testing is performed.
  • No metrics are collected.
  • Version control systems are not used to store source files.
  • Building the software is a manual process or has manual steps.
  • The master copies of some source files are kept in personal home directories or computers.

Level 2: Repeatable

  • The build environment is defined; everyone uses the same system for consistent results.
  • Building the software is still done manually.
  • Testing is done manually.
  • Some unit tests exist.
  • Source files are kept in version-controlled repositories.
  • Software packages are used as the means of delivering the end result.
  • If multiple platforms are supported, each is repeatable, though possibly independently.

Level 3: Defined

  • Building the software is automated.
  • Triggers for automated builds are defined.
  • Expectations around unit test coverage are defined; they are less than 100 percent.
  • Metrics for bug lead time, code lead time, and patch lead time are defined.
  • Inputs and outputs of each step are defined.

Level 4: Managed

  • Success/fail build ratios are measured and tracked on a dashboard.
  • Metrics for bug lead time, code lead time, and patch lead time are collected automatically.
  • Metrics are presented on a dashboard.
  • Unit test coverage is measured and tracked.

Level 5: Optimizing

  • Metrics are used to select optimization projects.
  • Attempts to optimize the process involve collecting before and after metrics.
  • Each developer can perform the end-to-end build process in his or her own sandbox before committing changes to a centralized repository.
  • Insufficient unit test code coverage stops production.
  • If multiple platforms are supported, building for one is as easy as building for them all.
  • The software delivery platform is used for building infrastructure as well as applications.

Service Delivery: The Deployment Phase

The goal of the deployment phase is to create a running environment. It creates the service in one or more testing and production environments, which are then used for testing or for live production services.
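
The questions and levels below reference metrics such as frequency of deployment and mean time to restore service. The following Python sketch computes both from a hypothetical deployment and outage history; the dates and the 30-day window are assumptions for the example.

    from datetime import datetime, timedelta
    from statistics import mean

    def deployment_frequency(deploy_times: list[datetime], window_days: int = 30) -> float:
        """Average number of production deployments per week over the window."""
        cutoff = max(deploy_times) - timedelta(days=window_days)
        recent = [t for t in deploy_times if t >= cutoff]
        return len(recent) / (window_days / 7)

    def mean_time_to_restore(outages: list[tuple[datetime, datetime]]) -> timedelta:
        """Average time from a service-impacting failure to restoration."""
        durations = [(restored - began).total_seconds() for began, restored in outages]
        return timedelta(seconds=mean(durations))

    if __name__ == "__main__":
        # Hypothetical deployment and outage history.
        deploys = [datetime(2021, 3, d, 10) for d in (1, 3, 8, 10, 15, 17, 22, 24, 29)]
        outages = [
            (datetime(2021, 3, 10, 10, 5), datetime(2021, 3, 10, 10, 35)),
            (datetime(2021, 3, 24, 10, 2), datetime(2021, 3, 24, 11, 12)),
        ]
        print(f"deployments per week: {deployment_frequency(deploys):.1f}")
        print("mean time to restore:", mean_time_to_restore(outages))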

Sample Assessment Questions

  • How are packages deployed in production?
  • How much downtime is required to deploy the service in production?
  • Are metrics collected about frequency of deployment, mean time to restore service, and change success rate?
  • How is the decision made to promote a package from testing to production?
  • Which kind of testing is done (system, performance, load, user acceptance)?
  • How is deployment handled differently for small, medium, and large releases?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply?

Level 1: Initial

  • Deployment involves or requires manual steps.
  • Deployments into the testing and production environments are different processes, each with its own tools and procedures.
  • Different people on the team perform deployments differently.
  • Deployment requires downtime, and sometimes significant downtime.
  • How a release is promoted to production is ad hoc or ill defined.
  • Testing is manual, ill defined, or not done.

Level 2: Repeatable

  • Deployment is performed in a documented, repeatable process.
  • If deployment requires downtime, it is predictable.
  • Testing procedures are documented and repeatable.

Level 3: Defined

  • Metrics for frequency of deployment, mean time to restore service, and change success rate are defined.
  • How downtime due to deployments is to be measured is defined; limits and expectations are defined.
  • How a release is promoted to production is defined.
  • Testing results are clearly communicated to all stakeholders.

Level 4: Managed

  • Metrics for frequency of deployment, mean time to restore service, and change success rate are collected automatically.
  • Metrics are presented on a dashboard.
  • Downtime due to deployments is measured automatically.
  • Reduced production capacity during deployment is measured.
  • Tests are fully automated.

Level 5: Optimizing

  • Metrics are used to select optimization projects.
  • Attempts to optimize the process involve collecting before and after metrics.
  • Deployment is fully automated.
  • Promotion decisions are fully automated, perhaps with a few specific exceptions.
  • Deployment requires no downtime.

Toil Reduction

Toil Reduction is the process by which we improve the use of people within our system. When we reduce toil (i.e., exhausting physical labor), we create a more sustainable working environment for operational staff. While reducing toil is not a service per se, this OR can be used to assess the amount of toil and determine whether practices are in place to limit it.
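
Levels 2 and 3 below define a red-flag threshold and a target level for toil. The following Python sketch computes the toil fraction from time-tracking data and applies that red-flag logic; the 50 percent threshold, 30 percent target, and weekly hours are assumptions for the example.

    # Hypothetical thresholds: raise a red flag above 50% toil, lower it once toil
    # drops back to the 30% target (values are assumptions for illustration).
    RED_FLAG_THRESHOLD = 0.50
    TARGET_LEVEL = 0.30

    def toil_fraction(toil_hours: float, project_hours: float) -> float:
        """Fraction of tracked working time spent on toil rather than project work."""
        total = toil_hours + project_hours
        return toil_hours / total if total else 0.0

    def red_flag_state(fraction: float, currently_raised: bool) -> bool:
        """Raise the flag above the threshold; lower it only at or below the target."""
        if fraction > RED_FLAG_THRESHOLD:
            return True
        if fraction <= TARGET_LEVEL:
            return False
        return currently_raised  # between target and threshold: keep the current state

    if __name__ == "__main__":
        # (toil_hours, project_hours) tracked per week.
        weeks = [(22, 18), (25, 15), (14, 26), (10, 30)]
        raised = False
        for toil, project in weeks:
            fraction = toil_fraction(toil, project)
            raised = red_flag_state(fraction, raised)
            print(f"toil {fraction:.0%} -> red flag {'raised' if raised else 'lowered'}")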

Sample Assessment Questions

  • How many hours each week are spent on coding versus non-coding projects?
  • What percent of time is spent on project work versus manual labor that could be automated?
  • What percentage of time spent on manual labor should raise a red flag?
  • What is the process for detecting that the percentage of manual labor has exceeded the red flag threshold?
  • What is the process for raising a red flag? Whose responsibility is it?
  • What happens after a red flag is raised? When is it lowered?
  • How are projects for reducing toil identified? How are they prioritized?
  • How is the effectiveness of those projects measured?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Toil is not measured and grows until no project work, or almost no project work, can be accomplished.
  • There is no process for raising a red flag.
  • Some individuals recognize when toil is becoming a problem and look for solutions, but others are unaware of the problem.
  • Individuals choose to work on the projects that are the most interesting to them, without looking at which projects will have the biggest impact.

Level 2: Repeatable

  • The amount of time spent on toil versus on projects is measured.
  • The percentage of time spent on toil that constitutes a problem is defined and communicated.
  • The process for raising a red flag is documented and communicated.
  • Individuals track their own toil to project work ratio, and are individually responsible for raising a red flag.
  • Red flags may not always be raised when they should be.
  • The process for identifying which projects will have the greatest impact on toil reduction is defined.
  • The method for prioritizing projects is documented.

Level 3: Defined

  • For each team, the person responsible for tracking toil and raising a red flag is identified.
  • The people involved in identifying and prioritizing toil-reduction projects are known.
  • Both a red flag level of toil and a target level are defined. The red flag is lowered when toil reaches the target level.
  • During the red flag period, the team works on only the highest-impact toil-reduction projects.
  • During the red flag period, the team has management support for putting other projects on hold until toil is reduced to a target level.
  • After each step in a project, statistics on toil are closely monitored, providing feedback on any positive or negative changes.

Level 4: Managed

  • Project time versus toil is tracked on a dashboard, and the amount of time spent on each individual project or manual task is also tracked.
  • Red flags are raised automatically, and the dashboard gives an overview of where the problems lie.
  • The time-tracking data is monitored for trends that give an early alert for teams that are showing an increase in toil in one or more areas.
  • KPIs are defined and tracked to keep toil within the desired range and minimize the red flag periods.

Level 5: Optimizing

  • The target and red flag levels are adjusted, and the results are monitored for the effect on overall flow, performance, and innovation.
  • Changes to the main project prioritization process are introduced and evaluated for positive or negative impact, including the impact on toil.
  • Changes to the red flag toil-reduction task prioritization process are introduced and evaluated.

Disaster Preparedness

An operations organization needs to be able to handle outages well, and it must have practices that reduce the chance of repeating past mistakes. Disasters and major outages happen. Everyone in the company from the top down needs to recognize that fact, and adopt a mind-set that accepts outages and learns from them. Systems should be designed to be resilient to failure.
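
Level 5 here (and in Emergency Response) expects that any failover procedure not exercised recently is triggered artificially. The following Python sketch scans "date of last use" records and lists the procedures due for a drill; the procedure names and the 90-day policy are assumptions for the example.

    from datetime import date, timedelta

    # Hypothetical policy: every failover procedure must be exercised at least once
    # every 90 days, via a drill if no real outage has exercised it naturally.
    MAX_AGE = timedelta(days=90)

    def procedures_due_for_drill(last_use: dict[str, date], today: date) -> list[str]:
        """Return failover procedures whose 'date of last use' is older than MAX_AGE."""
        return [name for name, used in last_use.items() if today - used > MAX_AGE]

    if __name__ == "__main__":
        last_use = {
            "primary-db-failover": date(2020, 11, 2),
            "secondary-datacenter-cutover": date(2021, 2, 20),
            "cache-cold-restart": date(2020, 12, 15),
        }
        for procedure in procedures_due_for_drill(last_use, today=date(2021, 3, 17)):
            print("schedule a drill for:", procedure)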

Sample Assessment Questions

  • What is the SLA? Which tools and processes are in place to ensure that the SLA is met?
  • How complete are the playbooks?
  • When was each scenario in the playbooks last exercised?
  • What is the mechanism for exercising different failure modes?
  • How are new team members trained to be prepared to handle disasters?
  • Which roles and responsibilities apply during a disaster?
  • How do you prepare for disasters?
  • How are disasters used to improve future operations and disaster response?
  • If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

Level 1: Initial

  • Disasters are handled in an ad hoc manner, requiring individual heroics.
  • Playbooks do not exist, or do not cover all scenarios.
  • Little or no training exists.
  • Service resiliency and different failure scenarios are never tested.

Level 2: Repeatable

  • Playbooks exist for all failure modes, including large-scale disasters.
  • New team members receive on-the-job training.
  • Disasters are handled consistently, independent of who is responding.
  • If multiple team members respond, their roles, responsibilities, and handoffs are not clearly defined, leading to some duplication of effort.

Level 3: Defined

  • The SLA is defined, including dates for postmortem reports.
  • Handoff procedures are defined, including checks to be performed and documented.
  • How to scale the responding team to make efficient use of more team members is defined.
  • The roles and responsibilities of team members in a disaster are defined.
  • Specific disaster preparedness training for new team members is defined and implemented.
  • The team has regular disaster preparedness exercises.
  • The exercises include fire drills performed on the live service.
  • After every disaster, a postmortem report is produced and circulated.

Level 4: Managed

  • The SLA is tracked using dashboards.
  • The timing for every step in the process from the moment the event occurred is tracked on the dashboard.
  • A program for disaster preparedness training ensures that all aspects are covered.
  • The disaster preparedness program measures the results of disaster preparedness training.
  • As teams become better at handling disasters, the training expands to cover more complex scenarios.
  • Teams are involved in cross-functional fire drills that involve multiple teams and services.
  • Dates for publishing initial and final postmortem reports are tracked and measured against the SLA.

Level 5: Optimizing

  • Areas for improvement are identified from the dashboards.
  • New techniques and processes are tested and the results measured and used for further decision making.
  • Automated systems ensure that every failure mode is exercised within a certain period, by artificially causing a failure if one has not occurred naturally.