
@lalitkale
Created March 8, 2021 13:54
Topic: High Availability

Availability: Availability is the proportion of time that the system is functional and working. It is usually measured as a percentage of uptime. Application errors, infrastructure problems, and system load can all reduce availability.

A cloud application should have a service level objective (SLO) that clearly defines the expected availability, and how the availability is measured. When defining availability, look at the critical path. The web front-end might be able to service client requests, but if every transaction fails because it can't connect to the database, the application is not available to users.

Availability is often described in terms of "9s" — for example, "four 9s" means 99.99% uptime. The following table shows the potential cumulative downtime at different availability levels.

Availability   Downtime per week   Downtime per month   Downtime per year
99%            1.68 hours          7.2 hours            3.65 days
99.9%          10 minutes          43.2 minutes         8.76 hours
99.95%         5 minutes           21.6 minutes         4.38 hours
99.99%         1 minute            4.32 minutes         52.56 minutes
99.999%        6 seconds           26 seconds           5.26 minutes

Notice that 99% uptime could translate to an almost 2-hour service outage per week. For many applications, especially consumer-facing applications, that is not an acceptable SLO. On the other hand, five 9s (99.999%) means only about 5 minutes of downtime in a year. It's challenging enough just detecting an outage that quickly, let alone resolving the issue. To get very high availability (99.99% or higher), you can't rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing, which is where resiliency becomes crucial.
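The downtime figures above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name allowed_downtime and the PERIODS table are illustrative, and it assumes a 30-day month and a 365-day year, which is what the table's numbers imply.

```python
from datetime import timedelta

# Period lengths are assumptions: a 30-day month and a 365-day year
# match the figures in the table above.
PERIODS = {
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(availability_percent: float) -> dict:
    """Return the downtime budget per period for a given availability level."""
    unavailable_fraction = 1 - availability_percent / 100
    # Multiplying a timedelta by a float yields a timedelta.
    return {name: period * unavailable_fraction for name, period in PERIODS.items()}


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.95, 99.99, 99.999):
        budget = allowed_downtime(level)
        print(level, {name: str(value) for name, value in budget.items()})
```

Working from the unavailable fraction (1 minus availability) keeps the calculation identical for every period, so adding a "day" or "quarter" row only requires another entry in PERIODS.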

In the cloud, the Service Level Agreement (SLA) describes the cloud provider's commitments for uptime and connectivity. If the SLA for a particular service is 99.95%, you should expect the service to be available 99.95% of the time.

Applications often depend on multiple services. In general, the probability of either service having downtime is independent. For example, suppose your application depends on two services, each with a 99.9% SLA. The composite SLA for both services is 99.9% × 99.9% ≈ 99.8%, or slightly less than each service by itself.

Calculating availability with hard dependencies: Many systems have hard dependencies on other systems, where an interruption in a dependent system directly translates to an interruption of the invoking system. This is opposed to a soft dependency, where a failure of the dependent system is compensated for in the application. Where such hard dependencies occur, the invoking system availability is the product of the dependent systems’ availabilities.

For example, if you have a system designed for 99.99% availability that has a hard dependency on two other independent systems that each are designed for 99.99% availability, the system can theoretically achieve 99.97% availability:

invoking system * dependent 1 * dependent 2 = 99.99% * 99.99% * 99.99% ≈ 99.97%

It’s therefore important to understand your dependencies and their availability design goals as you calculate your own.
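The product rule for hard dependencies is easy to check in code. The sketch below is illustrative only: series_availability is a hypothetical helper name, and it assumes the component failures are independent, as the text does. Availabilities are passed as fractions (0.9999 rather than 99.99%).

```python
import math


def series_availability(*availabilities: float) -> float:
    """Availability of a system whose parts must all be up at the same time.

    Each argument is a fraction between 0 and 1. Assumes failures are
    independent, so the individual availabilities simply multiply.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    # Invoking system plus two hard dependencies, each designed for 99.99%.
    print(f"{series_availability(0.9999, 0.9999, 0.9999):.4%}")
    # Composite SLA of two services, each with a 99.9% SLA.
    print(f"{series_availability(0.999, 0.999):.4%}")
```

The same function covers the composite SLA case described earlier, since two services that must both be up behave like a chain of hard dependencies.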

Calculating availability with redundant components: When a system involves the use of independent, redundant components (for example, redundant Availability Zones), the theoretical availability is computed as 100% minus the product of the component failure rates, where each failure rate is 100% minus that component's availability.

For example, if a system makes use of two independent components, each with an availability of 99.9%, the resulting system availability is >99.999%:

maximum availability - ((downtime of dependent 1) * (downtime of dependent 2)) = 100% - (0.1% * 0.1%) = 99.9999%
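The redundancy calculation can be sketched the same way. The helper name redundant_availability is an assumption made for illustration, and it again takes availabilities as fractions and assumes the components fail independently. The system is down only when every redundant component is down at once.

```python
import math


def redundant_availability(*availabilities: float) -> float:
    """Availability of a system that is up while at least one component is up.

    Each argument is a fraction between 0 and 1. Assumes independent
    failures, so the probability that all components are down together
    is the product of their individual failure rates.
    """
    all_down = math.prod(1 - a for a in availabilities)
    return 1 - all_down


if __name__ == "__main__":
    # Two independent components, each 99.9% available.
    print(f"{redundant_availability(0.999, 0.999):.6%}")
```

Note how the two rules pull in opposite directions: every added hard dependency lowers availability, while every added redundant component raises it.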

But what if I don’t know the availability of a dependency?

Calculating dependency availability: Some dependencies provide guidance on their availability, including availability design goals for many AWS services (see Appendix A: Designed-For Availability for Select AWS Services in the AWS Reliability Pillar whitepaper). But in cases where this isn’t available (for example, a component where the manufacturer does not publish availability information), one simple way to estimate is to determine the Mean Time Between Failure (MTBF) and Mean Time to Recover (MTTR). An availability estimate can be established by:

Availability Estimate = MTBF / (MTBF + MTTR)

For example, if the MTBF is 150 days and the MTTR is 1 hour, the availability estimate is 99.97%.
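The MTBF/MTTR estimate can be worked through in code as well. This is a small sketch under stated assumptions: availability_estimate is a hypothetical function name, and using timedelta values is a design choice that avoids unit mistakes when MTBF is given in days and MTTR in hours.

```python
from datetime import timedelta


def availability_estimate(mtbf: timedelta, mttr: timedelta) -> float:
    """Estimate availability as MTBF / (MTBF + MTTR), returned as a fraction."""
    # Dividing one timedelta by another yields a plain float.
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    estimate = availability_estimate(timedelta(days=150), timedelta(hours=1))
    print(f"{estimate:.2%}")
```

With 150 days converted to 3,600 hours, the ratio is 3600 / 3601, which rounds to the 99.97% quoted above.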

References:

https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html

https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf
