Instantly share code, notes, and snippets.

# pervognsen/branch_math.md

Last active May 16, 2020

Let's say you have to perform a 4-way case analysis and are given a choice between cascaded 2-way branches or a single 4-way branch.

What's the difference in branch misprediction penalties, assuming the 4 cases are random?

Denote the branch penalty by x (which is 15-20 cycles on Sky Lake according to Agner Fog).

With the 4-way branch, the worst-case branch penalty is x. The expected branch penalty is 3/4 x: there's a 3/4 chance of x penalty.

With the cascaded 2-way branches, the worst-case branch penalty is 2x. The expected branch penalty is 1/2 x + 1/4 2x = x: there's a 1/4 chance of 2x penalty, and there's a 1/4 + 1/4 = 1/2 chance of x penalty.

A difference of 1/4 x for x = 20 cycles is 5 cycles. If you want a 4-way branch implemented with a jump table to win, you need to keep the fixed overhead down to less than 5 cycles, at least in this toy model. An L1-hitting load with a complex addressing mode is 5 cycles on Sky Lake, but instructions that only feed into a branch instruction aren't part of the subsequent dependency chain as far as latency goes, so it isn't as simple as adding up the latencies.

Now suppose we extend this comparison to a n-way case analysis where n = 2^k. The expected penalty for the multi-way branch is (n-1)/n x, which is ~x for larger n. The expected penalty for the cascaded 2-way branches is given by the recurrence relation f(n) = 1/2 x + f(n/2), f(1) = 0, which has the closed-form solution f(n) = k/2 x.

There isn't really any grand conclusion from this other than a more quantitative assessment of how multi-way branches widen the gap as n increases. I'm just trying to see if there's some useful napkin math for thinking about different branch structures for case analysis. E.g. how skewed does the case distribution need to be before you should pull out the most frequent cases from the multi-way branch into leading 2-way branches. The easy answer is that you should test it on real hardware with real datasets and pick the winner, but it would be nice to have a mental model that captures the major factors in simpler instances.

### aardappel commented May 7, 2020 • edited

 What if the probabilities of the 4 cases are not all 25% ? For example, with probabilities of `40% 30% 20% 10%`, 4-way chance of x penalty is `0.4 * 0.6 + 0.3 * 0.7 + 0.2 * 0.8 + 0.1 * 0.9` = `0.7` Cascaded 2-way expected penalty is `0.3 (first branch) + ( 0.4 * (3/7) + 0.3 * (4/7) + 0.2 * (1/3) + 0.1 * (2/3) ) ` = `0.776` Which are not only both lower, but they're very close. Which probably means the break-even point is with only slightly more uneven probabilities than in this example. Computing that break-even point left as an exercise for the reader. But now what if probabilities are not just uneven, but they're also correlated in time? That math gets too hairy for an example, but my guess is that Cascaded 2-way would get an even earlier break-even point as its advantage with uneven probabilities compounds, but I am just guessing here. A real life example is this code: read a 1/2/4/8 byte scalar from memory, where the size is dynamic: https://github.com/google/flatbuffers/blob/f12cca8bcb017618773aaa0f19623f44168bd57a/include/flatbuffers/flexbuffers.h#L130-L168 Here probabilities are indeed uneven (the smaller sizes are way more frequent) and also correlated (it is often used to access a vector of all the same size), hence why I assumed cascaded would be faster. But it really depends on the data and access patterns.
Owner Author

### pervognsen commented May 7, 2020

 The reason I kept the analysis to the simplest case of fully random cases is that it seems too hard to analyze with napkin math beyond that point given how modern branch predictors work. You can see some examples in this thread: https://twitter.com/pervognsen/status/1256369404817715200
Owner Author

### pervognsen commented May 8, 2020 • edited

 I just looked at the code you linked. I'm not sure how it would work out in practice given your use case, but all my code relies heavily on unaligned reads and writes for exactly this reason. You can see an example here: https://gist.github.com/pervognsen/12d358c8fdd531d21fbccdca251fc2cd#file-spaceskip-c-L15. You can usually ensure that the data you're reading has sufficient end-padding to avoid the issue with unowned memory. For something like a serialization format where you own both the write side and the read side, it's especially easy; for something like a programming language lexer as in the above example, it's a little trickier (but usually doable with some trickery) if you want to use memory-mapped file IO or read user data directly.

### aardappel commented May 8, 2020

 @pervognsen thanks, I agree unaligned reads would be the best way to implement this, as I mentioned in the comments. Problem is, this is in a very general library that reads data from any (pointer + size_t) pair, so strictly it is not possible to guarantee an unaligned read does not touch a page boundary, unless I explicitly specified it is up to the client to guarantee this is not the case, which is tricky. Maybe I could make it an #ifdef option.