Considering that make_heap
runs in O(n) time, the function sort_heap
dominates the complexity when sorting a
sequence from scratch, so it naturally deserves the most attention when trying to find optimizations. I did not manage
to find a better algorithm that did not use extra memory for the job, but I was nevertheless able to obtain a 8~50%
speedup by optimizing bit_floor
with compiler intrinsics.
The bit_floor
implementation shown previously runs in O(log k) time when computing the bit floor of a k-bit unsigned
integer. Using intrinsices it is possible to compute the bit floor in O(1) by computing the position of the highest set
bit and shifting a single bit to that position. The algorithm below shows how to compute the bit floor of an unsigned
integer with GCC and Clang intrinsics:
unsigned bit_floor(unsigned n)
{
constexpr unsigned k = std::numeric_limits<unsigned>::digits;
if (n == 0) return 0;
return 1u << (k - 1u - __builtin_clz(n));
}
There are equivalent __builtin_clzl
and __builtin_clzll
intrinsics to handle the long
and long long
variations
of the unsigned integers type; handling even bigger types requires more tricks that aren't showed here but can be found
in the standard library implementations of the C++20 function std::countl_zero
.
Considering that bit_floor
is a hot path in the sorting logic, I wanted to get rid of the branch for 0
, so I did
just that. The function still returned the expected result but for the wrong reasons: when n == 0
the last line would
return 1u << (k - 1u - k)
, and shifting by a negative number is undefined behaviour. Shifts by more than k-1
are
also undefined behaviour. Here is the solution I found to get rid of the branch for n == 0
without undefined
behaviour:
unsigned bit_floor(unsigned n)
{
constexpr unsigned k = std::numeric_limits<unsigned>::digits;
unsigned shift = k - 1u - __builtin_clz(n);
return (1u << (shift & (k - 1u))) & n;
}
The solution two bitwise AND to generate the correct result. Here is the reasoning:
- First we mask the the number of positions to shift with
k-1
: thanks to the way unsigned integers are defined in the standard,k
is a power of 2 andk-1
is made entirely of set bits. The result ofshift & (k - 1u)
is an unsigned integer that is always smaller or equal tok-1
. In our case it should bek-1
whenn == 0
andshift
unmodified otherwise. - The shift results in the bit floor of
n
for anyn != 0
. Taking the bit floor of an unsigned integer corresponds to isolating its highest set bit, sobitfloor(n) & n == bitfloor(n)
: the last bitwise AND withn
yields the bit floor ofn
unmodified whenn != 0
. Whenn == 0
the bitwise AND returns0
anyway, which is the result we want.
The code in poplar.h
handles unsigned int
, unsigned long
and unsigned long long
through compiler intrinsics for
GCC and Clang. Handling more unsigned types or compilers is possible but not relevant for this project, and thus left
as an exercise to the reader.