Morwenn/bit_floor.md

## bit_floor.md

      
    Raw
  

              bit_floor.md
            
          
    Intrinsics to optimize bit_floor

Considering that make_heap runs in O(n) time, the function sort_heap dominates the complexity when sorting a
sequence from scratch, so it naturally deserves the most attention when trying to find optimizations. I did not manage
to find a better algorithm that did not use extra memory for the job, but I was nevertheless able to obtain a 8~50%
speedup by optimizing bit_floor with compiler intrinsics.
The bit_floor implementation shown previously runs in O(log k) time when computing the bit floor of a k-bit unsigned
integer. Using intrinsices it is possible to compute the bit floor in O(1) by computing the position of the highest set
bit and shifting a single bit to that position. The algorithm below shows how to compute the bit floor of an unsigned
integer with GCC and Clang intrinsics:
unsigned bit_floor(unsigned n)
{
    constexpr unsigned k = std::numeric_limits<unsigned>::digits;
    if (n == 0) return 0;
    return 1u << (k - 1u - __builtin_clz(n));
}
There are equivalent __builtin_clzl and __builtin_clzll intrinsics to handle the long and long long variations
of the unsigned integers type; handling even bigger types requires more tricks that aren't showed here but can be found
in the standard library implementations of the C++20 function std::countl_zero.
Considering that bit_floor is a hot path in the sorting logic, I wanted to get rid of the branch for 0, so I did
just that. The function still returned the expected result but for the wrong reasons: when n == 0 the last line would
return 1u << (k - 1u - k), and shifting by a negative number is undefined behaviour. Shifts by more than k-1 are
also undefined behaviour. Here is the solution I found to get rid of the branch for n == 0 without undefined
behaviour:
unsigned bit_floor(unsigned n)
{
    constexpr unsigned k = std::numeric_limits<unsigned>::digits;
    unsigned shift = k - 1u - __builtin_clz(n);
    return (1u << (shift & (k - 1u))) & n;
}
The solution two bitwise AND to generate the correct result. Here is the reasoning:

First we mask the the number of positions to shift with k-1: thanks to the way unsigned integers are defined in the
standard, k is a power of 2 and k-1 is made entirely of set bits. The result of shift & (k - 1u) is an unsigned
integer that is always smaller or equal to k-1. In our case it should be k-1 when n == 0 and shift unmodified
otherwise.
The shift results in the bit floor of n for any n != 0. Taking the bit floor of an unsigned integer corresponds
to isolating its highest set bit, so bitfloor(n) & n == bitfloor(n): the last bitwise AND with n yields the bit
floor of n unmodified when n != 0. When n == 0 the bitwise AND returns 0 anyway, which is the result we want.

The code in poplar.h handles unsigned int, unsigned long and unsigned long long through compiler intrinsics for
GCC and Clang. Handling more unsigned types or compilers is possible but not relevant for this project, and thus left
as an exercise to the reader.