laughinghan/popcount.html

## popcount.html
<style>
  body {
    font-family: monospace;
    font-size: 16px;
    color: #202122;
    white-space: pre-wrap;
  }
  .nibble-sep {
    margin: 0 -.15em;
    font-size: 0.8em;
  }
  .line {
    display: inline-block;
    width: 24em;
    border-top: solid 1px black;
    vertical-align: 0.25em;
  }

  .red   { color: #b32424; }
  .blue  { color: #2a4b8d; }
  .green { color: #14866d; }
  .brown { color: #ac6600; }
</style>

This is based on <a href="https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation">the Wikipedia example</a>, <a href="https://stackoverflow.com/a/109025/362030">the StackOverflow thread</a>, and <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">Bit Twiddling Hacks</a> by Sean Anderson.

Note that in practice, modern CPUs have a dedicated instruction for this, interoperable enough that <a href="https://github.com/sunfishcode/wasm-reference-manual/blob/master/WebAssembly.md#integer-population-count">it's even in WebAssembly</a>. <a href="https://vaibhavsagar.com/blog/2019/09/08/popcount/">This post</a> explains what this instruction is useful for. (A variation of this parallelization trick can also be used for <a href="https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel">reversing the bits in a word</a> (<a href="https://stackoverflow.com/a/746203/362030">see also StackOverflow</a>), which has a dedicated CPU instruction in ARM but not x86, so maybe that's more useful.)

As an example, consider 1,825,859,237<sub>10</sub> = 0110<span class="nibble-sep">&sdot;</span>1100 1101<span class="nibble-sep">&sdot;</span>0100 0110<span class="nibble-sep">&sdot;</span>0110 1010<span class="nibble-sep">&sdot;</span>0101<sub>2</sub>, which has 16 ones.

(For readability, the bytes are separated by spaces and the nibbles are separated by dots.)

We can of course go through and count them one-by-one, but there's a clever trick where we count them in parallel.

Sketch:
First, to count the ones in each pair of bits AB, we just want to do A+B. If we mask, we can pull out the Bs; if we shift and mask, we can pull out the As; then we can add them:

  0110<span class="nibble-sep">&sdot;</span>1100 1101<span class="nibble-sep">&sdot;</span>0100 0110<span class="nibble-sep">&sdot;</span>0110 1010<span class="nibble-sep">&sdot;</span>0101
&amp; <span class="blue">0101<span class="nibble-sep">&sdot;</span>0101 1101<span class="nibble-sep">&sdot;</span>0101 0101<span class="nibble-sep">&sdot;</span>0101 0101<span class="nibble-sep">&sdot;</span>0101</span>
<span class="line"></span>
  <span class="green">0100<span class="nibble-sep">&sdot;</span>0100 0101<span class="nibble-sep">&sdot;</span>0100 0100<span class="nibble-sep">&sdot;</span>0100 0000<span class="nibble-sep">&sdot;</span>0101</span>


  0110<span class="nibble-sep">&sdot;</span>1100 1101<span class="nibble-sep">&sdot;</span>0100 0110<span class="nibble-sep">&sdot;</span>0110 1010<span class="nibble-sep">&sdot;</span>0101 &gt;&gt; 1
= <span class="red">0011<span class="nibble-sep">&sdot;</span>0110 0110<span class="nibble-sep">&sdot;</span>1010 0011<span class="nibble-sep">&sdot;</span>0011 0101<span class="nibble-sep">&sdot;</span>0010</span>

  <span class="red">0011<span class="nibble-sep">&sdot;</span>0110 0110<span class="nibble-sep">&sdot;</span>1010 0011<span class="nibble-sep">&sdot;</span>0011 0101<span class="nibble-sep">&sdot;</span>0010</span>
&amp; <span class="blue">0101<span class="nibble-sep">&sdot;</span>0101 1101<span class="nibble-sep">&sdot;</span>0101 0101<span class="nibble-sep">&sdot;</span>0101 0101<span class="nibble-sep">&sdot;</span>0101</span>
<span class="line"></span>
  <span class="brown">0001<span class="nibble-sep">&sdot;</span>0100 0100<span class="nibble-sep">&sdot;</span>0000 0001<span class="nibble-sep">&sdot;</span>0001 0101<span class="nibble-sep">&sdot;</span>0000</span>

Then for each pair of 2-bit counts, we can do the same trick, mask with 0011<span class="nibble-sep">&sdot;</span>0011 0011<span class="nibble-sep">&sdot;</span>0011 0011<span class="nibble-sep">&sdot;</span>0011 0011<span class="nibble-sep">&sdot;</span>0011 to pull out every other 2-bit count, then shift and mask to pull out the <i>other</i> every other 2-bit counts, and add them, then repeat with 0000<span class="nibble-sep">&sdot;</span>1111 0000<span class="nibble-sep">&sdot;</span>1111 0000<span class="nibble-sep">&sdot;</span>1111 0000<span class="nibble-sep">&sdot;</span>1111, etc:

x = (x & 0x55555555) + ((x >>  1) & 0x55555555)
x = (x & 0x33333333) + ((x >>  2) & 0x33333333)
x = (x & 0x0F0F0F0F) + ((x >>  4) & 0x0F0F0F0F)
x = (x & 0x00FF00FF) + ((x >>  8) & 0x00FF00FF)
x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF)

This is pretty cool, but there's 3 tricks we can do to even further optimize this:

Trick 1: We can simplify the first step because for just two bits, instead of (x & 01) + ((x >> 1) & 01), it so happens that you can just do x - ((x >> 1) & 01); this easy to verify by trying all 4 values of 2 bits (0, 1, 2, 3).

Trick 2: We can simplify the third step by adding before masking, rather than two separate masks before adding, because in the maximum of 8 possible ones easily fits in 4 bits. (Can't do this in earlier steps because, for example, the maximum of 4 possible ones does not fit in 2 bits.)

Trick 3: We can simplify the fourth and later steps by noticing that the maximum possible total of 32 ones easily fits in 8 bits, so we can just multiply by 0x01010101, which is equivalent to x + (x&lt;&lt;8) + (x&lt;&lt;16) + (x&lt;&lt;24).

End result:

x = x - ((x >> 1) & 0x55555555);  // Trick 1
x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // original
x = (x + (x >>  4)) & 0x0F0F0F0F; // Trick 2
x = x * 0x01010101;               // Trick 3
	<style>
	body {
	font-family: monospace;
	font-size: 16px;
	color: #202122;
	white-space: pre-wrap;
	}
	.nibble-sep {
	margin: 0 -.15em;
	font-size: 0.8em;
	}
	.line {
	display: inline-block;
	width: 24em;
	border-top: solid 1px black;
	vertical-align: 0.25em;
	}

	.red { color: #b32424; }
	.blue { color: #2a4b8d; }
	.green { color: #14866d; }
	.brown { color: #ac6600; }
	</style>

	This is based on <a href="https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation">the Wikipedia example</a>, <a href="https://stackoverflow.com/a/109025/362030">the StackOverflow thread</a>, and <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">Bit Twiddling Hacks</a> by Sean Anderson.

	Note that in practice, modern CPUs have a dedicated instruction for this, interoperable enough that <a href="https://github.com/sunfishcode/wasm-reference-manual/blob/master/WebAssembly.md#integer-population-count">it's even in WebAssembly</a>. <a href="https://vaibhavsagar.com/blog/2019/09/08/popcount/">This post</a> explains what this instruction is useful for. (A variation of this parallelization trick can also be used for <a href="https://graphics.stanford.edu/~seander/bithacks.html#ReverseParallel">reversing the bits in a word</a> (<a href="https://stackoverflow.com/a/746203/362030">see also StackOverflow</a>), which has a dedicated CPU instruction in ARM but not x86, so maybe that's more useful.)

	As an example, consider 1,825,859,237<sub>10</sub> = 0110<span class="nibble-sep">⋅</span>1100 1101<span class="nibble-sep">⋅</span>0100 0110<span class="nibble-sep">⋅</span>0110 1010<span class="nibble-sep">⋅</span>0101<sub>2</sub>, which has 16 ones.

	(For readability, the bytes are separated by spaces and the nibbles are separated by dots.)

	We can of course go through and count them one-by-one, but there's a clever trick where we count them in parallel.

	Sketch:
	First, to count the ones in each pair of bits AB, we just want to do A+B. If we mask, we can pull out the Bs; if we shift and mask, we can pull out the As; then we can add them:

	0110<span class="nibble-sep">⋅</span>1100 1101<span class="nibble-sep">⋅</span>0100 0110<span class="nibble-sep">⋅</span>0110 1010<span class="nibble-sep">⋅</span>0101
	& <span class="blue">0101<span class="nibble-sep">⋅</span>0101 1101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101</span>
	<span class="line"></span>
	<span class="green">0100<span class="nibble-sep">⋅</span>0100 0101<span class="nibble-sep">⋅</span>0100 0100<span class="nibble-sep">⋅</span>0100 0000<span class="nibble-sep">⋅</span>0101</span>


	0110<span class="nibble-sep">⋅</span>1100 1101<span class="nibble-sep">⋅</span>0100 0110<span class="nibble-sep">⋅</span>0110 1010<span class="nibble-sep">⋅</span>0101 >> 1
	= <span class="red">0011<span class="nibble-sep">⋅</span>0110 0110<span class="nibble-sep">⋅</span>1010 0011<span class="nibble-sep">⋅</span>0011 0101<span class="nibble-sep">⋅</span>0010</span>

	<span class="red">0011<span class="nibble-sep">⋅</span>0110 0110<span class="nibble-sep">⋅</span>1010 0011<span class="nibble-sep">⋅</span>0011 0101<span class="nibble-sep">⋅</span>0010</span>
	& <span class="blue">0101<span class="nibble-sep">⋅</span>0101 1101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101 0101<span class="nibble-sep">⋅</span>0101</span>
	<span class="line"></span>
	<span class="brown">0001<span class="nibble-sep">⋅</span>0100 0100<span class="nibble-sep">⋅</span>0000 0001<span class="nibble-sep">⋅</span>0001 0101<span class="nibble-sep">⋅</span>0000</span>

	Then for each pair of 2-bit counts, we can do the same trick, mask with 0011<span class="nibble-sep">⋅</span>0011 0011<span class="nibble-sep">⋅</span>0011 0011<span class="nibble-sep">⋅</span>0011 0011<span class="nibble-sep">⋅</span>0011 to pull out every other 2-bit count, then shift and mask to pull out the <i>other</i> every other 2-bit counts, and add them, then repeat with 0000<span class="nibble-sep">⋅</span>1111 0000<span class="nibble-sep">⋅</span>1111 0000<span class="nibble-sep">⋅</span>1111 0000<span class="nibble-sep">⋅</span>1111, etc:

	x = (x & 0x55555555) + ((x >> 1) & 0x55555555)
	x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
	x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F)
	x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF)
	x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF)

	This is pretty cool, but there's 3 tricks we can do to even further optimize this:

	Trick 1: We can simplify the first step because for just two bits, instead of (x & 01) + ((x >> 1) & 01), it so happens that you can just do x - ((x >> 1) & 01); this easy to verify by trying all 4 values of 2 bits (0, 1, 2, 3).

	Trick 2: We can simplify the third step by adding before masking, rather than two separate masks before adding, because in the maximum of 8 possible ones easily fits in 4 bits. (Can't do this in earlier steps because, for example, the maximum of 4 possible ones does not fit in 2 bits.)

	Trick 3: We can simplify the fourth and later steps by noticing that the maximum possible total of 32 ones easily fits in 8 bits, so we can just multiply by 0x01010101, which is equivalent to x + (x<<8) + (x<<16) + (x<<24).

	End result:

	x = x - ((x >> 1) & 0x55555555); // Trick 1
	x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // original
	x = (x + (x >> 4)) & 0x0F0F0F0F; // Trick 2
	x = x * 0x01010101; // Trick 3