Created
June 9, 2021 07:36
-
-
Save laughinghan/3493f08fbad812d863a04479fb628f13 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<style> | |
body { | |
font-family: monospace; | |
font-size: 16px; | |
color: #202122; | |
white-space: pre-wrap; | |
} | |
.nibble-sep { | |
margin: 0 -.15em; | |
font-size: 0.8em; | |
} | |
.line { | |
display: inline-block; | |
width: 24em; | |
border-top: solid 1px black; | |
vertical-align: 0.25em; | |
} | |
.red { color: #b32424; } | |
.blue { color: #2a4b8d; } | |
.green { color: #14866d; } | |
.brown { color: #ac6600; } | |
</style> | |
This is based on <a href="https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation">the Wikipedia example</a>, <a href="https://stackoverflow.com/a/109025/362030">the StackOverflow thread</a>, and <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">Bit Twiddling Hacks</a> by Sean Anderson. | |
Note that in practice, modern CPUs have a dedicated instruction for this, interoperable enough that <a href="https://github.com/sunfishcode/wasm-reference-manual/blob/master/WebAssembly.md#integer-population-count">it's even in WebAssembly</a>. <a href="https://vaibhavsagar.com/blog/2019/09/08/popcount/">This post</a> explains what this instruction is useful for. (A variation of this parallelization trick can also be used for <a href="https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel">reversing the bits in a word</a> (<a href="https://stackoverflow.com/a/746203/362030">see also StackOverflow</a>), which has a dedicated CPU instruction in ARM but not x86, so maybe that's more useful.) | |
As an example, consider 1,825,859,237<sub>10</sub> = 0110<span class="nibble-sep">⋅</span>1100 1101<span class="nibble-sep">⋅</span>0100 0110<span class="nibble-sep">⋅</span>0110 1010<span class="nibble-sep">⋅</span>0101<sub>2</sub>, which has 16 ones. | |
(For readability, the bytes are separated by spaces and the nibbles are separated by dots.) | |
We can of course go through and count them one-by-one, but there's a clever trick where we count them in parallel. | |
Sketch: | |
First, to count the ones in each pair of bits AB, we just want to do A+B. If we mask, we can pull out the Bs; if we shift and mask, we can pull out the As; then we can add them: | |
0110<span class="nibble-sep">⋅</span>1100 1101<span class="nibble-sep">⋅</span>0100 0110<span class="nibble-sep">⋅</span>0110 1010<span class="nibble-sep">⋅</span>0101 | |
& <span class="blue">0101<span class="nibble-sep">⋅</span>0101 1101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101</span> | |
<span class="line"></span> | |
<span class="green">0100<span class="nibble-sep">⋅</span>0100 0101<span class="nibble-sep">⋅</span>0100 0100<span class="nibble-sep">⋅</span>0100 0000<span class="nibble-sep">⋅</span>0101</span> | |
0110<span class="nibble-sep">⋅</span>1100 1101<span class="nibble-sep">⋅</span>0100 0110<span class="nibble-sep">⋅</span>0110 1010<span class="nibble-sep">⋅</span>0101 >> 1 | |
= <span class="red">0011<span class="nibble-sep">⋅</span>0110 0110<span class="nibble-sep">⋅</span>1010 0011<span class="nibble-sep">⋅</span>0011 0101<span class="nibble-sep">⋅</span>0010</span> | |
<span class="red">0011<span class="nibble-sep">⋅</span>0110 0110<span class="nibble-sep">⋅</span>1010 0011<span class="nibble-sep">⋅</span>0011 0101<span class="nibble-sep">⋅</span>0010</span> | |
& <span class="blue">0101<span class="nibble-sep">⋅</span>0101 1101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101</span> | |
<span class="line"></span> | |
<span class="brown">0001<span class="nibble-sep">⋅</span>0100 0100<span class="nibble-sep">⋅</span>0000 0001<span class="nibble-sep">⋅</span>0001 0101<span class="nibble-sep">⋅</span>0000</span> | |
Then for each pair of 2-bit counts, we can do the same trick, mask with 0011<span class="nibble-sep">⋅</span>0011 0011<span class="nibble-sep">⋅</span>0011 0011<span class="nibble-sep">⋅</span>0011 0011<span class="nibble-sep">⋅</span>0011 to pull out every other 2-bit count, then shift and mask to pull out the <i>other</i> every other 2-bit counts, and add them, then repeat with 0000<span class="nibble-sep">⋅</span>1111 0000<span class="nibble-sep">⋅</span>1111 0000<span class="nibble-sep">⋅</span>1111 0000<span class="nibble-sep">⋅</span>1111, etc: | |
x = (x & 0x55555555) + ((x >> 1) & 0x55555555) | |
x = (x & 0x33333333) + ((x >> 2) & 0x33333333) | |
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F) | |
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF) | |
x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF) | |
This is pretty cool, but there's 3 tricks we can do to even further optimize this: | |
Trick 1: We can simplify the first step because for just two bits, instead of (x & 01) + ((x >> 1) & 01), it so happens that you can just do x - ((x >> 1) & 01); this easy to verify by trying all 4 values of 2 bits (0, 1, 2, 3). | |
Trick 2: We can simplify the third step by adding before masking, rather than two separate masks before adding, because in the maximum of 8 possible ones easily fits in 4 bits. (Can't do this in earlier steps because, for example, the maximum of 4 possible ones does not fit in 2 bits.) | |
Trick 3: We can simplify the fourth and later steps by noticing that the maximum possible total of 32 ones easily fits in 8 bits, so we can just multiply by 0x01010101, which is equivalent to x + (x<<8) + (x<<16) + (x<<24). | |
End result: | |
x = x - ((x >> 1) & 0x55555555); // Trick 1 | |
x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // original | |
x = (x + (x >> 4)) & 0x0F0F0F0F; // Trick 2 | |
x = x * 0x01010101; // Trick 3 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
https://output.jsbin.com/yureris