Skip to content

Instantly share code, notes, and snippets.

@Morwenn
Created March 8, 2020 16:49
Show Gist options
  • Save Morwenn/58403f415473a6005140833d4a909cc1 to your computer and use it in GitHub Desktop.
Save Morwenn/58403f415473a6005140833d4a909cc1 to your computer and use it in GitHub Desktop.
Explaining how I implemented an UB bit_floor

Intrinsics to optimize bit_floor

Considering that make_heap runs in O(n) time, the function sort_heap dominates the complexity when sorting a sequence from scratch, so it naturally deserves the most attention when trying to find optimizations. I did not manage to find a better algorithm that did not use extra memory for the job, but I was nevertheless able to obtain a 8~50% speedup by optimizing bit_floor with compiler intrinsics.

The bit_floor implementation shown previously runs in O(log k) time when computing the bit floor of a k-bit unsigned integer. Using intrinsices it is possible to compute the bit floor in O(1) by computing the position of the highest set bit and shifting a single bit to that position. The algorithm below shows how to compute the bit floor of an unsigned integer with GCC and Clang intrinsics:

unsigned bit_floor(unsigned n)
{
    constexpr unsigned k = std::numeric_limits<unsigned>::digits;
    if (n == 0) return 0;
    return 1u << (k - 1u - __builtin_clz(n));
}

There are equivalent __builtin_clzl and __builtin_clzll intrinsics to handle the long and long long variations of the unsigned integers type; handling even bigger types requires more tricks that aren't showed here but can be found in the standard library implementations of the C++20 function std::countl_zero.

Considering that bit_floor is a hot path in the sorting logic, I wanted to get rid of the branch for 0, so I did just that. The function still returned the expected result but for the wrong reasons: when n == 0 the last line would return 1u << (k - 1u - k), and shifting by a negative number is undefined behaviour. Shifts by more than k-1 are also undefined behaviour. Here is the solution I found to get rid of the branch for n == 0 without undefined behaviour:

unsigned bit_floor(unsigned n)
{
    constexpr unsigned k = std::numeric_limits<unsigned>::digits;
    unsigned shift = k - 1u - __builtin_clz(n);
    return (1u << (shift & (k - 1u))) & n;
}

The solution two bitwise AND to generate the correct result. Here is the reasoning:

  • First we mask the the number of positions to shift with k-1: thanks to the way unsigned integers are defined in the standard, k is a power of 2 and k-1 is made entirely of set bits. The result of shift & (k - 1u) is an unsigned integer that is always smaller or equal to k-1. In our case it should be k-1 when n == 0 and shift unmodified otherwise.
  • The shift results in the bit floor of n for any n != 0. Taking the bit floor of an unsigned integer corresponds to isolating its highest set bit, so bitfloor(n) & n == bitfloor(n): the last bitwise AND with n yields the bit floor of n unmodified when n != 0. When n == 0 the bitwise AND returns 0 anyway, which is the result we want.

The code in poplar.h handles unsigned int, unsigned long and unsigned long long through compiler intrinsics for GCC and Clang. Handling more unsigned types or compilers is possible but not relevant for this project, and thus left as an exercise to the reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment