In 2018, uses of the qsort function in GCC were redirected to a custom implementation. A trivial change compared to typical compiler engineering problems, but the new implementation is claimed to be notably fast, so I'll try to describe the optimization ideas here.
Throughout the article, blocks like this one give something to ponder on.
Of course, there's no algorithmic novelty in the new `gcc_qsort`: it uses a combination of two well-known schemes, merge sort and sorting networks. The main interest here is in reducing constant factors, not the big-O time complexity.
But first, a few words on why that happened at all.
Like many tools, GCC wants to offer output reproducibility: machine code emitted by the compiler should not depend on the current phase of the moon, the amount of free memory, the number of CPU cores in the machine, etc. Of particular importance is invariance with regard to debug info generation: the instruction stream generated by the compiler shouldn't change if the compiler is asked to augment it with annotations for the debugger.
To ensure this, GCC developers are required to test their changes with a three-stage bootstrap procedure. The first stage is built using a pre-existing compiler. The second stage is built using the first-stage compiler, without debug info. Finally, the third stage is built with the same flags but debug info enabled, and machine code for the last two stages is tested for equality. The build is terminated with an error if some object files differ.
How does LLVM solve this? Standard LLVM build procedure does not involve a three-stage bootstrap.
This approach has been used for a long time to ensure, on an ongoing basis, that this invariance is kept, and now there's another mechanism, `-fcompare-debug`, that can be used to automatically check it on any valid program. A similar technique could be used, for instance, to check invariance with regard to requested warnings.
Now wait a second! You say that GCC strives to be deterministic, but if I run it under `strace` it clearly reads from `/dev/urandom`, and there's a `-frandom-seed` command-line switch? What gives? Surely it would be only more deterministic if it didn't use random numbers in the first place?
Another desirable property (and something that is not ensured by bootstrapping at all) is invariance with respect to compiler host platform: it would be nice if GCC developers using Glibc systems could easily reproduce compiler crashes found on musl-based or BSD systems, or if cross-compiled binaries would not come out slightly different depending on what architecture the toolchain was running on.
Calls to qsort are one well-recognized cause of this host-dependent variability. Indeed, C qsort is not a stable sort, so there's no guarantee on how it reorders elements that compare equal according to the comparator. If it swaps two elements that compare equal, but nevertheless hold different data, it may affect how the rest of compilation proceeds.
Suppose the comparator passed to qsort never returns 0, except if compared elements are bitwise identical. Is it possible for different qsort implementations to reorder the array differently in this case? And if not, can't you simply fall back to bitwise comparison via `memcmp` in all comparators for last tie-breaking to ensure determinism?
Furthermore, it is possible for comparators to be invalid by failing to impose a total order on the input set, by lacking transitivity or the anti-symmetry property. This results in undefined behavior, so development builds of GCC now use the `qsort_chk` wrapper that checks those properties.
What is the time complexity of algorithms checking reflexivity, anti-symmetry and transitivity properties?
To make the compiler's output independent of the qsort implementation, one might attempt to ensure all comparators passed to qsort implement a sufficiently strict order. Unfortunately, there are between 100 and 200 qsort call sites in GCC, almost all with different comparators, so that sounds impractical (and Sisyphean: even if at some point all comparators are claimed to be "proper", a new comparator added the next day might be broken again).
Did we need to apply a patch with 100–200 hunks to redirect all qsort calls to the `qsort_chk` wrapper?
Another option is to use a stable sort instead of qsort. This is an overkill solution: stability is a stronger property than required here, stable sorting is relatively more expensive, and there's no standard stable sort in C. Using C++ `std::stable_sort` is also a poor option.
There's also the option of avoiding the external dependency on qsort, and using an internal qsort implementation instead. Taking this option gives a nice excuse for writing a shiny new sort function.
Naturally, if we don't want to sacrifice compilation speed, we want the replacement function to be efficient. Indeed, if the replacement improves output reproducibility at the cost of slowing down the compiler by 1%, it might worry reviewers. Therefore, we should begin by investigating what constraints it should obey and what the workloads look like.
For sorting, constraints make a lot of difference. The C qsort interface is very generic: the implementation cannot rely on being able to allocate storage for even a single element, and needs to check both size and alignment of elements to efficiently swap them. At the other end of the spectrum, sorting tasks where comparators are fixed in advance admit specialized sorting implementations, perhaps even employing vector extensions, that reach very high throughput.
Since there are many qsort call sites in GCC, we want the replacement function to be signature-compatible with C qsort. On the other hand, we don't need it to be strictly API-compatible, and may relax its contract, namely:

- we may allow `gcc_qsort` to allocate memory, aborting on failure
- we can assume the comparator may be applied to elements residing in the allocated buffer
Now let's run GCC with a wrapper to see how it invokes qsort: the typical number of elements, element size, time spent under qsort, etc. This article takes an unscientific approach of using a single workload, compiling a 56094-line C++ file.
In my experiment, gcc-6 made 406971 calls to qsort, spending roughly 204 million cycles under qsort, with total run time about 57 billion cycles. We can already say that optimizing qsort is not going to significantly improve overall compiler performance, as it accounts for less than 0.5% of total time.
Next, let's look at the distribution of element counts. Plotting call count against element count, we see that GCC frequently calls qsort with a tiny element count, and calls sorting just 2 or 3 elements really stand out:
By recording addresses of functions invoking qsort with `__builtin_return_address` and then resolving them with `addr2line`, we can quickly find out that 257794 calls, forming a majority of two-three element sorts, come from a single call site in the "dominance walker" class, where it's used to reorder child nodes in a dominator tree.
We can note that nodes in dominator trees typically have 0, 2, or 3 children, and introduce a dedicated path for those cases that doesn't use library qsort at all. This was addressed during GCC 8 development, and now running the same test results in 161703 qsort calls, with much less pronounced outliers for 2 and 3 element counts:
Instead of call count, we can try plotting total cycle count against element count. The corresponding graph might give a (misleading!) impression that most of the time we sort 20 elements or less, and that's where the optimization focus should be:
Plotting cumulative time (sum of times for given and smaller counts) gives a more honest impression. Alternatively, aggregating into bins with exponentially increasing sizes would also show the distribution better: time is roughly equally split between small (10 or less) and large element counts.
Aggregating by element size rather than element count shows that GCC mostly sorts 8-byte elements (usually pointers; statistics collected from a 32-bit compiler would look different):

| Element size, bytes | CPU cycles | Call count |
|---|---|---|
| 4 | 10.69 × 10⁶ | 4992 |
| 8 | 152.76 × 10⁶ | 147812 |
| 16 | 0.42 × 10⁶ | 1912 |
| 32 | 6.94 × 10⁶ | 5185 |
| 40 | 0.49 × 10⁶ | 1802 |
This suggests we want to have dedicated paths for moving pointer-sized and perhaps int-sized elements. Nevertheless, we should keep the code simple and small: qsort by itself is not a hot function, and we want to avoid evicting other useful code from the L1 cache.
Sorting networks work nicely for small arrays. They need to be paired with some divide-and-conquer algorithm for larger arrays. For `gcc_qsort` I went with classic merge sort. Although it requires dynamic allocation, and so cannot be used to implement C qsort, which is not allowed to fail due to memory exhaustion, it still can be used for sorting in the compiler.
In `gcc_qsort`, applying a sorting network either works in-place or relocates elements from one buffer to another: this is used to avoid copying in merge sort.
Here's how it works. First, some helper declarations:
```c
#ifdef __GNUC__
#define likely(cond) __builtin_expect((cond), 1)
#define noinline __attribute__((__noinline__))
#else
#define likely(cond) (cond)
#define noinline
#endif

typedef int cmp_fn(const void *, const void *);

/* Structure holding read-mostly (read-only in netsort) context. */
struct sort_ctx {
  cmp_fn *cmp;  // pointer to comparator
  char *out;    // output buffer
  size_t n;     // number of elements
  size_t size;  // element size
};
```
And the sorting network implementation itself:
```c
/* Helper for netsort. Invoke comparator CMP on E0 and E1.
   Return E0^E1 if E0 compares less than E1, zero otherwise.
   This is noinline to reduce code growth and confine invocation
   to a single call site, assisting indirect branch prediction. */
noinline static intptr_t cmp1(char *e0, char *e1, cmp_fn *cmp)
{
  intptr_t x = (intptr_t) e0 ^ (intptr_t) e1;
  return x & (cmp(e0, e1) >> 31);
}

/* Apply a sorting network to 2 to 5 elements from IN, placing them into C->OUT.
   IN may be equal to C->OUT, in which case elements are sorted in place. */
static void netsort(char *in, struct sort_ctx *c)
{
#define CMP(e0, e1) do {               \
    intptr_t x = cmp1(e1, e0, c->cmp); \
    e0 = (char *)((intptr_t)e0 ^ x);   \
    e1 = (char *)((intptr_t)e1 ^ x);   \
  } while (0)
  char *e0 = in, *e1 = e0 + c->size, *e2 = e1 + c->size;
  CMP(e0, e1);
  if (likely(c->n == 3)) {
    CMP(e1, e2);
    CMP(e0, e1);
  }
  if (c->n <= 3) {
    reorder23(c, e0, e1, e2);
    return;
  }
  char *e3 = e2 + c->size, *e4 = e3 + c->size;
  if (likely(c->n == 5)) {
    CMP(e3, e4);
    CMP(e2, e4);
  }
  CMP(e2, e3);
  if (likely(c->n == 5)) {
    CMP(e0, e3);
    CMP(e1, e4);
  }
  CMP(e0, e2);
  CMP(e1, e3);
  CMP(e1, e2);
  reorder45(c, e0, e1, e2, e3, e4);
}
```
Does this `netsort` function implement a stable sort?
Note how this sorting network implementation keeps pointers to elements in registers and swaps them in a branchless fashion based on the comparator result. In this approach, no auxiliary array is needed to store the order of comparison steps in the network: they are explicit in program flow. This requires "unrolling" the network, but note how the implementation folds together four networks and ends up only two comparators longer than the longest individual network (11 comparators here vs. 9 comparators for the 5-input sorting network).
Of course, extending this implementation to handle 6 inputs would blow up code size, but at that point we'd also run out of available registers on x86-64: there are 6 callee-saved registers, which we need for the `c` pointer and `e0`, …, `e4`. Also, sorting networks get less efficient in terms of comparator invocations as size grows: at size 5 they already use more comparisons than necessary (9, while only 7 are necessary and sufficient; merge sort uses up to 8).
As already noted, the code takes care to swap pointers via bitwise operations rather than conditional branches. It trades control dependencies for more data dependencies, so there's never any risk of throwing away pipelined operations when a branch depending on the comparator result is mispredicted. The extra data dependencies imply that some steps are serialized where they could have overlapped under correct prediction, but four- and five-input networks nevertheless have some completely independent comparison steps.
As the `netsort` function only permutes pointers in registers, it is the job of helper functions like `reorder23` to reorder the corresponding pointed-to values, which can be of arbitrary size. This is where using sorting networks really pays off compared to swap-based sort implementations.
Doesn't this have lots of overhead due to `memcpy` calls? And why do we need those at all, what's wrong with good old `t0 = *(TYPE *)e0`?
```c
/* Helper for netsort. Permute, possibly in-place, 2 or 3 elements,
   placing E0 to C->OUT, E1 to C->OUT + C->SIZE, and so on. */
static void reorder23(struct sort_ctx *c, char *e0, char *e1, char *e2)
{
#define REORDER_23(TYPE, STRIDE, OFFSET) do {          \
    TYPE t0, t1;                                       \
    memcpy(&t0, e0 + OFFSET, sizeof t0);               \
    memcpy(&t1, e1 + OFFSET, sizeof t1);               \
    char *out = c->out + OFFSET;                       \
    if (likely(c->n == 3))                             \
      memmove(out + 2*STRIDE, e2 + OFFSET, sizeof t0); \
    memcpy(out, &t0, sizeof t0); out += STRIDE;        \
    memcpy(out, &t1, sizeof t1);                       \
  } while (0)
  if (likely(c->size == sizeof(size_t)))
    REORDER_23(size_t, sizeof(size_t), 0);
  else if (likely(c->size == sizeof(int)))
    REORDER_23(int, sizeof(int), 0);
  else {
    size_t offset = 0, step = sizeof(size_t);
    for (; offset + step <= c->size; offset += step)
      REORDER_23(size_t, c->size, offset);
    for (; offset < c->size; offset++)
      REORDER_23(char, c->size, offset);
  }
}
```
Why does the middle step in `REORDER_23` need to use `memmove` instead of `memcpy`?
Finally, for sizes larger than 5, a simple recursive merge sort, again with branchless selection of the smaller element in the merging loop:
Does this `mergesort` function implement a stable sort?
```c
/* Execute merge sort on N elements from IN, placing them into OUT,
   using TMP as temporary storage if IN is equal to OUT. */
static void
mergesort(char *in, struct sort_ctx *c, size_t n, char *out, char *tmp)
{
  if (likely(n <= 5)) {
    c->out = out;
    c->n = n;
    netsort(in, c);
    return;
  }
  size_t nl = n / 2, nr = n - nl, sz = nl * c->size;
  char *mid = in + sz, *r = out + sz, *l = in == out ? tmp : in;
  /* Sort the right half, outputting to right half of OUT. */
  mergesort(mid, c, nr, r, tmp);
  /* Sort the left half, leaving left half of OUT free. */
  mergesort(in, c, nl, l, mid);
  /* Merge sorted halves given by L, R to [OUT, END). */
#define MERGE_ELTSIZE(SIZE) do {             \
    intptr_t mr = c->cmp(r, l) >> 31;        \
    intptr_t lr = (intptr_t)l ^ (intptr_t)r; \
    lr = (intptr_t)l ^ (lr & mr);            \
    out = memcpy(out, (char *)lr, SIZE);     \
    out += SIZE;                             \
    r += mr & SIZE;                          \
    if (r == out) return;                    \
    l += ~mr & SIZE;                         \
  } while (r != end)
  if (likely(c->cmp(r, l + (r - out) - c->size) < 0)) {
    char *end = out + n * c->size;
    if (sizeof(size_t) == 8 && likely(c->size == 8))
      MERGE_ELTSIZE(8);
    else if (likely(c->size == 4))
      MERGE_ELTSIZE(4);
    else
      MERGE_ELTSIZE(c->size);
  }
  memcpy(out, l, r - out);
}
```
What does the branch following the first comparator invocation achieve?
And at last, a qsort-compatible wrapper that takes care of filling `sort_ctx` and allocating a scratch buffer:
```c
void gcc_qsort(void *base, size_t n, size_t size, cmp_fn *cmp)
{
  if (n < 2)
    return;
  struct sort_ctx c = {cmp, base, n, size};
  long long scratch[32];
  size_t bufsz = (n / 2) * size;
  void *buf = bufsz <= sizeof scratch ? scratch : xmalloc(bufsz);
  mergesort(base, &c, n, base, buf);
  if (buf != scratch)
    free(buf);
}
```
Plotting cumulative cycle counts as before, and also the ratio of new vs. old cycle counts, shows that proper early-out buys a 10x speedup for element counts 0 and 1, and for counts 2–4 sorting networks net a roughly 2x speedup. As element counts grow, speedups diminish and slowdowns become more frequent, until at 1530 elements a single sort originally taking about a million cycles gets 35% slower.
This is the machine code of the comparator used in the sort that gets dramatically slower:
```
ipa_icf::sort_congruence_class_groups_by_decl_uid:
    mov    (%rdi),%rax
    mov    0x8(%rax),%rax
    mov    0x8(%rax),%rax
    mov    0x8(%rax),%rax
    mov    0x8(%rax),%rax
    mov    0x18(%rax),%rcx
    mov    (%rsi),%rax
    mov    0x8(%rax),%rax
    mov    0x8(%rax),%rax
    mov    0x8(%rax),%rax
    mov    0x8(%rax),%rax
    mov    0x18(%rax),%rdx
    mov    0x1c(%rdx),%eax
    cmp    %eax,0x1c(%rcx)
    mov    $0xffffffff,%edx
    setg   %al
    movzbl %al,%eax
    cmovl  %edx,%eax
    retq
```
Given the size of the array and the chain of dereferences in the comparator, it's easy to guess the main reason for the slowdown: the comparator frequently incurs cache misses, and where the original qsort could have overlapped execution of adjacent comparator invocations thanks to branch prediction, the new one has data dependencies preventing that.
How to modify `mergesort` to get the best of both worlds: no branches depending on comparator output, and some possibility of overlapping execution?
In conclusion, I find that this sort implementation strikes a very interesting complexity/size/speed compromise. A stable sort can be implemented on top of it with very few changes, and increasing pipelining in the merge loop is also possible. On the other hand, it's also possible to reduce size further by, say, dropping the 5-input sorting network.
CC-BY-SA © 2018 Alexander Monakov <amonakov@gmail.com>