
# amodm/wdp-20220129.md

Last active March 9, 2023 04:05
Response to the #WeekendDevPuzzle of 2022-01-29

# On availability aspects of microservices

This post is in response to the WeekendDevPuzzle of 2022-01-29, which has submissions from people who chose to share their perspectives on the puzzle. I'll be linking this post to this Twitter thread, to avoid polluting the original one.

## Motivation for today's topic

Unlike the usual peel-the-onion kind of topics, this one focuses more on how we think & our mental models of architectural decisions. But there's more.

Over the past 15 yrs, programming has become extremely accessible. While that's undoubtedly a good thing, it's become much easier to just rely on best practice, with or without context. Today's puzzle was designed to bring that context to the discussion table, albeit in a super simplified fashion.

## Dissecting the puzzle

Let's dissect the question, and bring out the core elements of it.

### Basics

For a distributed system where the call graph looks like A → B → C, it's easy to see that the call succeeds only when all of the components are available. Assuming the components fail independently, we can write this as: `P(U) = P(A) * P(B) * P(C)`, where

• `P(U)` is the probability of the call being processed correctly, as seen by the calling user
• `P(A)` is the probability of component A being up
• and so on...
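As a minimal sketch of the product rule above (assuming independent failures), the helper name `series_availability` and the 99.9% figures below are made up purely for illustration:

```python
from math import prod

def series_availability(*components: float) -> float:
    """Availability of a call chain that needs every component to be up,
    assuming the components fail independently of each other."""
    return prod(components)

# Hypothetical numbers, purely for illustration: A, B and C at 99.9% each.
p_u = series_availability(0.999, 0.999, 0.999)
print(f"P(U) = {p_u:.4%}")
```

Note how three components at 99.9% each already give the user something below 99.9%.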

Side note: If reading the word probability immediately switched off your mind, don't worry. We'll keep the maths super basic, given our super simplified scenarios. Who knows, it might even encourage you to get more comfortable with it!

It would seem like we have the answer to our puzzle then. But do we? Let's dig deeper.

### A deeper look

The puzzle mentions the following:

• LB has never gone down.
• Web servers appear to go down for ~3 hrs/mo.
• DB appears to go down for ~1 hr/mo

What do you think it means for `P(LB)`, `P(WEB)` and `P(DB)`?

• Is `P(LB) == 1.0`, as we've never observed it go down? Is observed availability the same as designed availability? Maybe the switchover (planned flip from active to standby) incurs no loss, but failover (active fails, standby picks up) incurs 4 secs of loss. What number would you take then?
• For `P(WEB)`, it would seem that it should be `1 - 3/(30*24) == 99.58%` (3 hrs down out of 24x30 hrs in a month). But there are two issues with that:
   1. It's not stated if this downtime includes the DB related outages. If it does, then what we're calculating here is really `P(WEB) * P(DB)`, and not `P(WEB)` alone. Can you see why?
   2. As with LB above, this is more like observed availability than the designed one.
• For `P(DB)`, the observed availability should be `1 - 1/(30*24) == 99.86%`
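The observed-availability arithmetic above can be sketched as follows. The function name is hypothetical, and the 30-day month is the same simplification the bullets use:

```python
HOURS_PER_MONTH = 30 * 24  # 30-day month, as in the calculation above

def observed_availability(downtime_hours: float,
                          period_hours: float = HOURS_PER_MONTH) -> float:
    """Fraction of the period during which the component was observed to be up."""
    return 1 - downtime_hours / period_hours

p_web = observed_availability(3)  # web servers: ~3 hrs/mo of downtime
p_db = observed_availability(1)   # DB: ~1 hr/mo of downtime
print(f"P(WEB) = {p_web:.2%}, P(DB) = {p_db:.2%}")
```

This only captures what was observed. It says nothing about the designed availability, which is the distinction the bullets are drawing.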

So, at least we now have some richer perspective, even if more questions got added, about Scenario A. What about Scenarios B & C? After all, there was no mention about downtime numbers for them in the puzzle.

### An even deeper look

For Scenario B, the factors are:

• `P(LB)` - not defined explicitly, but is there any reason for it to change?
• `P(WEB)` - would this change? As we're separating out some code from it, this code is changing. Clearly, that should change its availability too. But for better or for worse?
• `P(LB2)` - not defined explicitly, but any reason for it to be different than `P(LB)` above?
• `P(SVC)` - not defined explicitly. Given that this service was created by carving out a piece of code from the web server, are there assumptions we can make about it, in relative terms to `P(WEB)`?

Similarly, we can analyse the picture for Scenario C, with only one additional point of interest:

• The smart client - this topology aware client will have its own availability. Remember that for it to be topology aware, it'll need to fetch that information over the network. As such, even if it's perfectly implemented, its availability is going to be bound by that of the network, `P(N)` (am I correct in this statement here?)

Is there anything else we're missing?

### Implicit assumptions

We made an implicit assumption above, that our network is perfect. But that isn't always the case (cable breaks, SFP failures, OS network table full, network saturation, router misconfiguration, the list is long). A surprising number of engineers tend to assume perfect networks just because they haven't observed one fail, until it does, again and again.

Let's denote the network's availability as `P(N) < 1.0`. Clearly, every distributed call over that network has its availability reduced by this factor. So, Scenario A would look something like `P(U) = P(LB) * P(N) * P(WEB) * P(N) * P(DB)`. Two network hops mean availability is reduced by a factor of `P(N)^2`. Can you see why?
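To get a feel for the per-hop cost, here is a small sketch. The helper `hop_penalty` is a hypothetical name, and the value of `P(N)` is the one assumed later in this post:

```python
P_N = 0.99995  # assumed per-hop network availability (value used later in this post)

def hop_penalty(hops: int, p_n: float = P_N) -> float:
    """Factor by which availability shrinks after `hops` independent network hops."""
    return p_n ** hops

for hops in (1, 2, 4):
    print(f"{hops} hop(s): availability multiplied by {hop_penalty(hops):.6f}")
```

Each extra hop multiplies in another `P(N)`, so the penalty compounds with the depth of the call graph.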

Nitesh correctly points this out, though he mistakes my comment about observed availability to be design availability.

There are a few other implicit assumptions being made here, but for the sake of brevity, let's ignore them for now.

## I need a break!

At this point, some of you might be thinking: "WTH! Dissecting it was supposed to simplify the question, not add to it!". Let's take a deep breath.

Sometimes, breaking down a problem may noticeably amplify the number of factors, but the story from there only gets better, as we start hacking & slashing away at the problem, by making simplifying (yet informed and explicit) assumptions. So let's start doing that.

## My submission

Let's quickly go over the calculations we did earlier:

• Scenario A: `P(A) = P(LB) * P(N) * P(WEB) * P(N) * P(DB)`
• Scenario B: `P(B) = P(LB) * P(N) * P(WEB) * P(N) * P(LB2) * P(N) * P(SVC) * P(N) * P(DB)`. Did you notice a hidden assumption here (that every single call requires every component, as against some calls being served by web server alone)?
• Scenario C: `P(C) = P(LB) * P(N) * P(WEB) * P(SMART_CLIENT) * P(N) * P(SVC) * P(N) * P(DB)`. Some of you might notice how removal of `P(LB2) * P(N)` might help C get better. Let's see if it does. Also, the hidden assumption of Scenario B applies here too.
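The hidden assumption mentioned for Scenario B can be made explicit with a rough sketch. The knob `svc_fraction` (the share of calls that go past the web server) is hypothetical and not part of the puzzle, and the component numbers are the H1 assumptions listed further down in this post:

```python
from math import prod

# Assumed values, taken from the assumptions made further down (hypothesis H1).
LB = LB2 = N = 0.99995
WEB, SVC = 0.997, 0.9965
DB = 1 - 1 / (30 * 24)

def scenario_b(svc_fraction: float) -> float:
    """P(B) when only `svc_fraction` of the calls need the service and the DB.

    `svc_fraction` is a hypothetical knob; the formula in the list above
    corresponds to svc_fraction == 1.0.
    """
    front = prod([LB, N, WEB])            # every call needs this part
    back = prod([N, LB2, N, SVC, N, DB])  # only some calls need this part
    return front * ((1 - svc_fraction) + svc_fraction * back)

for f in (1.0, 0.5, 0.1):
    print(f"svc_fraction={f:.1f}  P(B)={scenario_b(f):.2%}")
```

The fewer calls that need the whole chain, the less the extra components cost, which is why the "every call needs every component" assumption is a pessimistic one.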

I'm gonna make the following assumptions. The exact numbers don't matter, it's the model that matters:

• `P(LB) == 99.995%` and to remain the same in all scenarios
• `P(LB) == P(LB2)`
• `P(N) == 99.995%`
• `P(SMART_CLIENT) == 99.99%`. There are details to this which are tricky to capture, but let's run with a simplified model for now.
• `P(WEB)` and `P(DB)` are independent of each other, i.e. web server's availability numbers do not include DB's availability. We assume this for simplicity for now, you can always play around with the numbers later.

For the remaining parameters, let's evaluate two situations, H1 (for hypothesis 1) and H2.

H1 (refactoring improved availability because of reduced code complexity, no more buggy DB driver code eating up web resources, etc):

• `P(WEB) == 99.7%` for Scenarios B & C
• `P(SVC) == 99.65%`

H2 (refactoring reduced availability due to poor coding, poor abstractions, lesser integration testing, etc):

• `P(WEB) == 99.5%` for Scenarios B & C
• `P(SVC) == 99.58%`. It's difficult for me to visualise the micro service's availability going below that of a more complex web server that includes the same logic.

Let's see what we get

| Scenario | H1 | H2 |
|---|---|---|
| A | 99.43% | 99.43% |
| B | 99.18% | 98.91% |
| C | 99.18% | 98.91% |
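For anyone who wants to play with the numbers, here is a sketch that recomputes the table from the formulas and assumptions above. The variable and function names are mine, and `P(WEB)` for Scenario A is taken to be the observed 3 hrs/mo figure:

```python
from math import prod

# Assumptions from this post.
LB = LB2 = N = 0.99995
SMART_CLIENT = 0.9999
WEB_A = 1 - 3 / (30 * 24)  # observed web availability before the refactor
DB = 1 - 1 / (30 * 24)     # observed DB availability
HYPOTHESES = {
    "H1": {"web": 0.997, "svc": 0.9965},
    "H2": {"web": 0.995, "svc": 0.9958},
}

def scenario_a() -> float:
    return prod([LB, N, WEB_A, N, DB])

def scenario_b(web: float, svc: float) -> float:
    return prod([LB, N, web, N, LB2, N, svc, N, DB])

def scenario_c(web: float, svc: float) -> float:
    return prod([LB, N, web, SMART_CLIENT, N, svc, N, DB])

# One row of results per hypothesis: availability of A, B and C.
results = {
    name: (scenario_a(), scenario_b(**h), scenario_c(**h))
    for name, h in HYPOTHESES.items()
}
for name, (a, b, c) in results.items():
    print(f"{name}: A={a:.2%}  B={b:.2%}  C={c:.2%}")
```

Changing any of the constants at the top is the quickest way to see how sensitive the comparison is to each assumption.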

So, this is very interesting. Whether you believe me or not, I didn't plan for the numbers to be like this. The following observations stand out to me:

1. Scenario A is noticeably better than B or C, even in H1, i.e. where we're assuming a betterment of availability.
2. Scenario B == Scenario C on availability.
3. H2 being worse off in Scenarios B & C is no surprise, as we're deliberately assuming a worsening of post-refactor availabilities.

Let's analyse these outcomes one by one.

## Analysing the outcomes

#### Scenario A is noticeably better than B & C even in H1

One can read this as: "Breaking a monolith, while increasing individual availabilities, still reduced the overall availability." How did this happen? Well, two things:

1. More moving parts means product of probabilities decreases faster.
2. Network being assumed to be 99.995% available added to the cost. Even if we assume `P(N) == 100%` (ideal network), we still get `P(B) = 99.20%` for H1, which means that the number of moving parts tends to dominate.
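The ideal-network check in point 2 can be sketched like this, with the network availability as a parameter. The function names are mine, and the other numbers are the H1 assumptions from this post:

```python
from math import prod

LB = LB2 = 0.99995
WEB_A, DB = 1 - 3 / (30 * 24), 1 - 1 / (30 * 24)
WEB_H1, SVC_H1 = 0.997, 0.9965

def scenario_a(p_n: float) -> float:
    return prod([LB, p_n, WEB_A, p_n, DB])

def scenario_b(p_n: float) -> float:
    return prod([LB, p_n, WEB_H1, p_n, LB2, p_n, SVC_H1, p_n, DB])

for p_n in (0.99995, 1.0):
    print(f"P(N)={p_n:.5f}  A={scenario_a(p_n):.2%}  B={scenario_b(p_n):.2%}")
```

Even with a perfect network, B stays below A under H1, so the gap is mostly down to the extra components rather than the extra hops.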

Can we assume this to be a universal truth? I'd say no. The takeaway here is that the number of moving parts impacts availability substantially, and when breaking a monolith, the individual availability improvements should be large enough to compensate for the reduction due to the increase in moving parts, e.g. if the refactoring in H1 had improved availabilities to `P(WEB) == 99.85%` and `P(SVC) == 99.75%`, Scenarios B & C would become better than A.
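A rough way to find where that compensation kicks in, under the assumptions of this post: setting `P(B) >= P(A)` and cancelling the common factors leaves a requirement on the product `P(WEB) * P(SVC)` after the refactor. The names below are mine:

```python
LB2 = N = 0.99995
WEB_A = 1 - 3 / (30 * 24)  # web availability before the refactor

# P(B) >= P(A) reduces to: P(WEB) * P(SVC) >= WEB_A / (P(LB2) * P(N)^2),
# because B adds LB2 and two extra network hops compared to A.
required_product = WEB_A / (LB2 * N ** 2)

candidates = {
    "H1 as assumed": 0.997 * 0.9965,
    "improved example": 0.9985 * 0.9975,
}
print(f"required P(WEB) * P(SVC) >= {required_product:.4%}")
for label, product in candidates.items():
    verdict = "beats A" if product >= required_product else "worse than A"
    print(f"{label}: {product:.4%} -> {verdict}")
```

The improved example clears the bar only narrowly, which shows how much individual improvement is needed just to break even.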

#### Scenario B == Scenario C on availability

This is just an artifact of choosing `P(SMART_CLIENT) == 99.99%`, which is almost exactly the `P(LB2) * P(N)` that it replaces. A higher number would put Scenario C ahead of B, and a lower number would reverse that. The takeaway here is that when focussing on embedded smart clients, their code quality (which reflects in `P(SMART_CLIENT)`) needs to be very high for it to be better than a HA dedicated hardware LB.
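Comparing the two formulas, B and C differ only in that B has `P(LB2) * P(N)` where C has `P(SMART_CLIENT)`, so the comparison boils down to a single threshold. A small sketch, with a hypothetical function name and the availability numbers assumed in this post:

```python
LB2 = N = 0.99995

def better_scenario(p_smart_client: float,
                    p_lb2: float = LB2, p_n: float = N) -> str:
    """Return which of Scenario B or C has the higher availability."""
    threshold = p_lb2 * p_n  # what the smart client replaces in Scenario B
    if p_smart_client > threshold:
        return "C"
    if p_smart_client < threshold:
        return "B"
    return "tie"

print(f"threshold = {LB2 * N:.6%}")
for p in (0.9998, 0.99995):
    print(f"P(SMART_CLIENT)={p:.5f} -> {better_scenario(p)} is better")
```

With these numbers the threshold sits at roughly 99.99%, which is why the table shows B and C as equal.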

This part, though so clear mathematically, was a bit of a surprise. The fact that I was surprised, tells me that there was a bias in my head, a chink in my mental model, that assumed embedded smart clients to be better. Reflecting deeper, I feel the performance implications were leaking into the availability assumptions in my head.

#### An important point

It's not the answer that's important here, it's the model or the factors that lead to the answer. The value of each parameter would vary depending upon the circumstances, so the outcome can be different. So, I'd request you to focus on the takeaways, instead of the answer.

## Conclusion

Phew! This was one of the longer ones.

Some of you might say that I've taken a particularly elaborate (to the point of being unnecessary) approach to the puzzle. I wouldn't disagree with you. But, in my defense, this is supposed to be a weekend puzzle 😄, to be noodled over & over, in different forms, lazily & elaborately. It isn't designed to be a race.

My own reason to be this elaborate is simply that I wanted to lay out, as exhaustively as I could, all the different elements of the puzzle. For some of you, it might bring attention to a hidden chink in your mental model. For a few others, it might've helped convert implicit assumptions into explicit ones. For the remaining, it could either be an affirmation of how they thought, or an opportunity to help correct my calculations here.

Irrespective of whether you liked this elaborate approach or not, or even agree with my view to the puzzle, I hope you still had fun thinking about it, including all of the different aspects of it.

### amodm commented Jan 30, 2022

One factor to consider in this refactoring, is the fact that not all the requests to the web servers require a call to the DB, which means that removing an assumedly complicated implementation into a separate service, might actually improve `P(WEB)` far more than the 99.7% I've assumed in H1.

I left this point out in favour of brevity, because it doesn't change the essence of what we're discussing, only moves the availability number assumption slightly.

### amodm commented Jan 30, 2022

If someone cites this post as a reason for you to not build out micro services, this comment should help.

A common reason why monoliths start becoming shaky (in terms of availability), is because different parts are moving at different speeds. Sometimes, moving the high frequency changes into a separate service (if logical abstractions permit), can improve the availability significantly.

Another reason to break a monolith sometimes can be the trade off b/w availability & performance/cost, e.g. if a piece of code gets used in only 10% of the calls, but takes up 70% of the memory, it might make sense to pull it out, because you can keep the rest of the code quite lean (resource consumption wise).

As I said at the beginning of OP, context matters. There are no blind rules.

I wanted to understand how did we come up to

`P(C) = P(LB) * P(N) * P(WEB) * P(SMART_CLIENT) * P(N) * P(SVC) * P(N) * P(DB)`?
Especially the last `P(SVC) * P(N)`. Is the client talking to a service as well or are we assuming this client lies with the sidecar in the original web_server and maintains connections with database. (Kind of like service mesh, not sure if using the term correctly here.)

PS : amazing question btw and thanks for detailed analysis ^_^

### amodm commented Mar 8, 2023

@uds5501, by smart client I mean a client that's effectively running inside the web server, either as an SDK inside the same runtime, or as a sidecar. Yes, this smart client talks to a service. See this diagram for Scenario C that I'd attached in the original tweet thread:

The probability definitions are based on the usual probability theory: if A and B are two independent events, where `P(A)` is the probability of A being successful and `P(B)` is the probability of B being successful, then `P(A and B) = P(A) * P(B)`.

As to why `P(SVC) * P(N)`, remember that any call over the network has to account for a network failure (physical infra or OS network stack) too, so whenever there's a network hop, you'll see a `P(N)` getting multiplied, e.g. from the perspective of the smart client, the downstream probability of success that the smartclient sees would be `P(N) * P(SVC) * P(N) * P(DB)`.

I hope I've understood your question correctly.

### amodm commented Mar 8, 2023

Do note that a sidecar introduces another potential failure point (OS IPC layer), but in this example, we've assumed that it never fails.

### uds5501 commented Mar 9, 2023

Ah okay! I missed that Micro Svc layer somehow while reading the answer, thanks for pointing out.