When building a service, we have a responsibility to define some baseline agreements about the service’s expected uptime and performance. This document covers the terminology we use to define these values.
- Service Level Indicator (SLI)
  What the service owner has chosen to measure progress towards their goal.
  e.g. What counts as good or bad service for your users.
- Service Level Objective (SLO)
  What the service owner’s goal is for the given indicator.
  e.g. A target percentage that determines how many SLI failures your service can have in a specific period of time.
- Service Level Agreement (SLA)
  What the service owner is promising its users (typically the SLO plus some wiggle room).

For example:
- Indicator: request latency.
- Objective: 99.5% of requests will be completed within 5ms.
- Agreement: 99% of requests will be completed within 5ms, or you get a refund.
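To make the relationship between the three terms concrete, here is a minimal Python sketch of the example above. The constants come from the example; the function names (`sli`, `evaluate`) and the idea of feeding in a plain list of latencies are assumptions made for illustration, not part of any monitoring tool.

```python
# Hypothetical sketch: names and input shape are illustrative assumptions.
LATENCY_THRESHOLD_MS = 5.0  # indicator: a request is "good" if it completes within 5ms
SLO_TARGET = 0.995          # objective: 99.5% of requests are good
SLA_TARGET = 0.99           # agreement: 99% of requests are good, or a refund is due


def sli(latencies_ms):
    """Return the fraction of requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute the SLI")
    good = sum(1 for ms in latencies_ms if ms <= LATENCY_THRESHOLD_MS)
    return good / len(latencies_ms)


def evaluate(latencies_ms):
    """Compare the measured SLI against the objective and the agreement."""
    value = sli(latencies_ms)
    return {
        "sli": value,
        "slo_met": value >= SLO_TARGET,
        "sla_met": value >= SLA_TARGET,
    }
```

With 1,000 requests of which 8 are slow, the SLI is 99.2%: the SLO is missed (an internal concern) while the SLA is still met (no refund). That gap between the two targets is the wiggle room.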
Below is an alternative example where we state (via a Datadog graph) that if the average request latency over a 1hr period exceeds one second, then we have a ‘service’ issue and the graph displays with a red background. This signifies that we’ve failed our SLA, which is defined as an average latency of less than one second.
If the average request latency over a 1hr period is greater than half a second (but has not exceeded one second), then we have a ‘team’ issue and the graph displays with an orange background. This signifies that we’ve failed our SLO, which is defined as an average latency of less than half a second, while still meeting our SLA.
If the average request latency over a 1hr period is less than half a second, then we have no issues and the graph displays with a green background. This signifies that we’ve met our SLO.
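The three-colour rule can be sketched in a few lines of Python. This is a standalone illustration, not Datadog configuration: the `classify` function and constant names are hypothetical, and because the text does not say what happens at exactly half a second or exactly one second, the sketch assumes those boundary values fall into the less severe band.

```python
from statistics import mean

# Thresholds taken from the example; names are illustrative assumptions.
SLO_SECONDS = 0.5  # team-level objective
SLA_SECONDS = 1.0  # user-facing agreement


def classify(latencies_seconds):
    """Map the average latency of a 1hr window to a background colour."""
    average = mean(latencies_seconds)
    if average > SLA_SECONDS:
        return "red"     # 'service' issue: SLA failed
    if average > SLO_SECONDS:
        return "orange"  # 'team' issue: SLO failed, SLA still met
    return "green"       # no issue (assumes exactly 0.5s still counts as green)
```

Checking the SLA threshold first keeps the ordering explicit: a window that breaches the SLA has necessarily breached the SLO too, so it should be reported at the more severe level.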