amonakov/gcc-qsort.md Secret

## gcc-qsort.md

      
    Raw
  

              gcc-qsort.md
            
          
    In 2018, uses of qsort function in GCC were redirected to a
custom implementation. A trivial change compared to typical
compiler engineering problems, but the new implementation is claimed to
be notably fast, so I'll try to describe the optimization ideas here.

Throughout the article, blocks like this one give something to ponder on.

Of course, there's no algorithmic novelty in the new gcc_qsort: it uses a combination of
two well-known schemes, merge sort and sorting networks. The main interest here is in reducing
constant factors, not the big-O time complexity.
But first, a few words on why that happened at all.
Deterministic translation

Like many tools, GCC wants to offer output reproducibility: machine
code emitted by the compiler should not depend on current phase of the moon, amount of
free memory, number of CPU cores in the machine, etc. Of particular importance
is invariance with regards to debug info generation: instruction stream
generated by the compiler shouldn't change if the compiler is asked to augment
it with annotations for the debugger.
To ensure this, GCC developers are required to test their changes with a
three-stage bootstrap procedure. The first stage is built using a pre-existing
compiler. The second stage is built using the first-stage compiler, without
debug info. Finally, the third stage is built with the same flags but debug
info enabled, and machine code for the last two stages is tested for equality.
The build is terminated with an error if some object files differ.

How does LLVM solve this? Standard LLVM build procedure does not involve a
three-stage bootstrap.

This approach has been used for a long time to ensure, on an ongoing basis, that
this invariance is kept, and now there's another mechanism,
-fcompare-debug, that can be used to automatically check it on any
valid program. A similar technique could be used, for instance, to check
invariance with regards to requested warnings.

Now wait a second! You say that GCC strives to be deterministic, but if I run
it under strace it clearly reads from /dev/urandom,
and there's a -frandom-seed command-line switch? What gives?
Surely it would be only more deterministic if it didn't use random numbers
in the first place?

Another desirable property (and something that is not ensured by bootstrapping
at all) is invariance with respect to compiler host platform: it would be nice
if GCC developers using Glibc systems could easily reproduce compiler crashes
found on musl-based or BSD systems, or if cross-compiled binaries would not come out
slightly different depending on what architecture the toolchain was running on.
Calls to qsort is one well-recognized cause of this host-dependent variability.
Indeed, C qsort is not a stable sort, so there's no guarantee on how it reorders
elements that compare equal according to the comparator. If it swaps two
elements that compare equal, but nevertheless hold different data, it may affect
how rest of compilation proceeds.

Suppose the comparator passed to qsort never returns 0, except if compared
elements are bitwise identical. Is it possible for different qsort
implementations to reorder the array differently in this case? And if not,
can't you simply fall back to bitwise comparison via memcmp in
all comparators for last tie-breaking to ensure determinism?

Furthermore, it is possible for comparators to be invalid by failing to
impose total order on the input set by lacking transitivity or anti-symmetric
property. This results in undefined behavior, so now development builds of GCC
use the qsort_chk wrapper that checks those properties.

What is the time complexity of algorithms checking reflexivity, anti-symmetry
and transitivity properties?

To make compiler's output independent of qsort implementation, one might attempt
to ensure all comparators passed to qsort implement a sufficiently strict order.
Unfortunately, there are between 100 and 200 qsort call sites in GCC, almost all
with different comparators, so that sounds impractical (and Sysyphean: even if
at some point all comparators are claimed to be "proper", a new comparator added
the next day might be broken again).

Did we need to apply a patch with 100–200
hunks to redirect all qsort calls to the qsort_chk wrapper?

Another option is to use a stable sort instead of qsort. This is an overkill
solution: stability is a stronger property than required here, stable sorting
is relatively more expensive, and there's no standard stable sort in C. Using C++
std::stable_sort is also a poor option.
There's also the option of avoiding the external dependency on qsort, and using
an internal qsort implementation instead. Taking this option gives a nice excuse
for writing a shiny new sort function.
Constraints

Naturally, if we don't want to sacrifice compilation speed, we want the
replacement function to be efficient. Indeed, if the replacement improves output
reproducibility at the cost of slowing down the compiler by 1%, it might
worry reviewers. Therefore, we should
begin by investigating what constraints it should obey and how workloads
look like.
For sorting, constraints make a lot of difference. C qsort interface is very
generic: the implementation cannot rely on being able to allocate storage for
even a single element, and needs to check both size and alignment of elements to
efficiently swap them. On the other hand of the spectrum, sorting tasks where
comparators are fixed in advance admit specialized sorting implementations,
perhaps even employing vector extensions, that reach very high throughput.
Since there are many qsort call sites in GCC, we want the replacement function
to be signature-compatible to C qsort. On the other hand, we don't need it to
be strictly API-compatible, and may relax its contract, namely:

  we may allow gcc_qsort to allocate memory, aborting on failure 
  we can assume the comparator may be applied to elements residing in the allocated
    buffer

Profiles

Now let's run GCC with a wrapper to see how it invokes qsort: what are the
typical number of elements, element size, time under qsort, etc. This article
takes an un-scientific approach of using a single workload,
compiling a 56094-line C++ file.
In my experiment, gcc-6 made 406971 calls to qsort, spending roughly 204 million
cycles under qsort, with total run time about 57 billion cycles. We can already
say that optimizing qsort is not going to significantly improve overall compiler
performance, as it accounts for less than 0.5% of total time.
Next, let's look at the distribution of element counts. Plotting call count
against element count, we see that GCC frequently calls qsort with a tiny
element count, and calls sorting just 2 or 3 elements really stand out:

By recording addresses of functions invoking qsort with
__builtin_return_address and then resolving them with
addr2line, we can quickly find out that 257794 calls, forming a
majority of two-three element sorts, come from a single call site in "dominance
walker" class, where it's used to reorder child nodes in a dominator tree.
We can note that typically nodes in dominator trees have 0, 2 or
3 children, and introduce a dedicated path for those cases that doesn't use library
qsort at all. This was addressed during GCC 8 development, and now running the
same test results in 161703 qsort calls, with not so strong outliers for 2 and 3
element counts:

Instead of call count, we can try plotting total cycle count for given element
size. The corresponding graph might make a (misleading!) impression that most of
the time we sort 20 elements or less, and that's where the optimization focus
should be:

Plotting cumulative time (sum of times for given and smaller counts) gives a
more honest impression. Alternatively, aggregating into bins with exponentially
increasing sizes would also show the distribution better: time is roughly
equally split between small (10 or less) and large element counts.

Aggregating by element size rather than element count shows that GCC mostly
sorts 8 byte sized elements (usually pointers; statistics collected from a
32-bit compiler would look different):

  Element size, bytes  CPU cycles       Call count
  4                     10.69 × 10⁶                     4992
  8                    152.76 × 10⁶                     147812
  16                     0.42 × 10⁶                     1912
  32                     6.94 × 10⁶                     5185
  40                     0.49 × 10⁶                     1802

This suggests we want to have dedicated paths for moving pointer-sized and
perhaps int-sized elements. Nevertheless, we should keep the code simple and
small: qsort by itself is not a hot function, and we want to avoid evicting
other useful code from the L1 cache.
Implementation

Sorting networks work nicely for small arrays. They needs to be paired with some
divide-and-conquer algorithm for larger arrays. For gcc_qsort I
went with classic merge sort. Although it requires dynamic allocation, and so
cannot be used to implement C qsort, which is not allowed to fail due to memory
exhaustion, it still can be used for sorting in the compiler.
In gcc_qsort, applying a sorting network either works in-place or relocates
from one buffer to another: this is used to avoid copying in merge sort.
Here's how it works. First, some helper declarations:
#ifdef __GNUC__
#define likely(cond)  __builtin_expect((cond), 1)
#define noinline      __attribute__((__noinline__))
#else
#define likely(cond)  (cond)
#define noinline
#endif

typedef int cmp_fn(const void *, const void *);

/* Structure holding read-mostly (read-only in netsort) context. */
struct sort_ctx {
	cmp_fn *cmp; // pointer to comparator
	char *out;   // output buffer
	size_t n;    // number of elements
	size_t size; // element size
};
And the sorting network implementation itself:
/* Helper for netsort. Invoke comparator CMP on E0 and E1.
   Return E0^E1 if E0 compares less than E1, zero otherwise.
   This is noinline to reduce code growth and confine invocation
   to a single call site, assisting indirect branch prediction. */
noinline static intptr_t cmp1(char *e0, char *e1, cmp_fn *cmp)
{
	intptr_t x = (intptr_t) e0 ^ (intptr_t) e1;
	return x & (cmp(e0, e1) >> 31);
}

/* Apply a sorting network to 2 to 5 elements from IN, placing them into C->OUT.
   IN may be equal to C->OUT, in which case elements are sorted in place. */
static void netsort(char *in, struct sort_ctx *c)
{
#define CMP(e0, e1) do {        	   \
	intptr_t x = cmp1(e1, e0, c->cmp); \
	e0 = (char *)((intptr_t)e0 ^ x);   \
	e1 = (char *)((intptr_t)e1 ^ x);   \
} while (0)
	char *e0 = in, *e1 = e0 + c->size, *e2 = e1 + c->size;
	CMP(e0, e1);
	if (likely(c->n == 3)) {
		CMP(e1, e2);
		CMP(e0, e1);
	}
	if (c->n <= 3) {
		reorder23(c, e0, e1, e2);
		return;
	}
	char *e3 = e2 + c->size, *e4 = e3 + c->size;
	if (likely(c->n == 5)) {
		CMP(e3, e4);
		CMP(e2, e4);
	}
	CMP(e2, e3);
	if (likely(c->n == 5)) {
		CMP(e0, e3);
		CMP(e1, e4);
	}
	CMP(e0, e2);
	CMP(e1, e3);
	CMP(e1, e2);
	reorder45(c, e0, e1, e2, e3, e4);
}

Does this netsort function implement a stable sort?

Note how this sorting network implementation keeps pointers to elements on
registers and swaps them in a branchless fashion based on comparator result.
In this approach, no auxiliary array is needed to store the order of comparison
steps in the network: they are explicit in program flow. This requires
"unrolling" the network, but note how the implementation folds together four
networks and ends up only two comparators longer than the longest individual
network (11 comparators here vs. 9 comparators for 5-input sorting network).
Of course, extending this implementation to handle 6 inputs would blow up code
size, but at that point we'd also run out of available registers on x86-64:
there are 6 callee-saved registers, which we need for the c pointer
and e0, …, e4. Also, sorting networks get less
efficient in terms of comparator invocations as size grows: at size 5 already
they use more comparisons than necessary (9, while only 7 are necessary and sufficient;
merge sort uses up to 8).
As already noted, the code takes care to swap pointers via bitwise operations
rather than conditional branches. It trades control dependencies for more data
dependencies, so there's never any risk of throwing away pipelined operations
when a branch depending on comparator result is mispredicted. Extra data
dependencies imply that some steps are serialized where they could run with
overlapping if correctly predicted, but four- and five-input networks
nevertheless have some completely independent comparison steps.
As netsort function only permutes pointers in registers, it is the
job of helper functions like reorder23 to reorder the corresponding
pointed-to values that can be of arbitrary size. This is where using sorting networks
really pays off compared to swap-based sort implementations.

Doesn't this have lots of overhead due to memcpy calls? And why
do we need those at all, what's wrong with good old
t0 = *(TYPE *)e0?

/* Helper for netsort. Permute, possibly in-place, 2 or 3 elements,
   placing E0 to C->OUT, E1 to C->OUT + C->SIZE, and so on. */
static void reorder23(struct sort_ctx *c, char *e0, char *e1, char *e2)
{
#define REORDER_23(TYPE, STRIDE, OFFSET) do {   		 \
	TYPE t0, t1;                                    	 \
	memcpy(&t0, e0 + OFFSET, sizeof t0);            	 \
	memcpy(&t1, e1 + OFFSET, sizeof t1);            	 \
	char *out = c->out + OFFSET;                    	 \
	if (likely(c->n == 3))                          	 \
		memmove(out + 2*STRIDE, e2 + OFFSET, sizeof t0); \
	memcpy(out, &t0, sizeof t0); out += STRIDE;     	 \
	memcpy(out, &t1, sizeof t1);                    	 \
} while (0)

	if (likely(c->size == sizeof(size_t)))
		REORDER_23(size_t, sizeof(size_t), 0);
	else if (likely(c->size == sizeof(int)))
		REORDER_23(int, sizeof(int), 0);
	else {
		size_t offset = 0, step = sizeof(size_t);
		for (; offset + step <= c->size; offset += step)
			REORDER_23(size_t, c->size, offset);
		for (; offset < c->size; offset++)
			REORDER_23(char, c->size, offset);
	}
}

Why does the middle step in REORDER_23 need to use
memmove instead of memcpy?

Finally, for sizes larger than 5, a simple recursive merge sort, again with
branchless selection of the smaller element in the merging loop:

Does this mergesort function implement a stable sort?

/* Execute merge sort on N elements from IN, placing them into OUT,
   using TMP as temporary storage if IN is equal to OUT. */
static void
mergesort(char *in, struct sort_ctx *c, size_t n, char *out, char *tmp)
{
	if (likely(n <= 5)) {
		c->out = out;
		c->n = n;
		netsort(in, c);
		return;
	}
	size_t nl = n / 2, nr = n - nl, sz = nl * c->size;
	char *mid = in + sz, *r = out + sz, *l = in == out ? tmp : in;
	/* Sort the right half, outputting to right half of OUT. */
	mergesort(mid, c, nr, r, tmp);
	/* Sort the left half, leaving left half of OUT free.  */
	mergesort(in, c, nl, l, mid);

	/* Merge sorted halves given by L, R to [OUT, END). */
#define MERGE_ELTSIZE(SIZE) do {	         \
	intptr_t mr = c->cmp(r, l) >> 31;        \
	intptr_t lr = (intptr_t)l ^ (intptr_t)r; \
	lr = (intptr_t)l ^ (lr & mr);            \
	out = memcpy(out, (char *)lr, SIZE);     \
	out += SIZE;                             \
	r += mr & SIZE;                          \
	if (r == out) return;                    \
	l += ~mr & SIZE;                         \
} while (r != end)

	if (likely(c->cmp(r, l + (r - out) - c->size) < 0)) {
		char *end = out + n * c->size;
		if (sizeof(size_t) == 8 && likely(c->size == 8))
			MERGE_ELTSIZE(8);
		else if (likely(c->size == 4))
			MERGE_ELTSIZE(4);
		else
			MERGE_ELTSIZE(c->size);
	}
	memcpy(out, l, r - out);
}

What does the branch following the first comparator invocation achieve?

And at last, a qsort-compatible wrapper that takes care of filling
sort_ctx and allocating a scratch buffer:
void gcc_qsort(void *base, size_t n, size_t size, cmp_fn *cmp)
{
	if (n < 2)
		return;
	struct sort_ctx c = {cmp, base, n, size};
	long long scratch[32];
	size_t bufsz = (n / 2) * size;
	void *buf = bufsz <= sizeof scratch ? scratch : xmalloc(bufsz);
	mergesort(base, &c, n, base, buf);
	if (buf != scratch)
		free(buf);
}
Comparison

Plotting cumulative cycle counts as before and also ratio of new vs. old cycle
counts shows that proper early-out buys a 10x speedup for sizes 0 and 1, and for
sizes 2–4 sorting networks net a roughly 2x speedup. As element counts grow,
speedups diminish and slowdowns get more frequent, until at size 1530 a single sort
originally taking about a million cycles gets 35% slower.

This is the machine code of the comparator used in the sort that gets dramatically slower:
ipa_icf::sort_congruence_class_groups_by_decl_uid:
mov    (%rdi),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x18(%rax),%rcx
mov    (%rsi),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x18(%rax),%rdx
mov    0x1c(%rdx),%eax
cmp    %eax,0x1c(%rcx)
mov    $0xffffffff,%edx
setg   %al
movzbl %al,%eax
cmovl  %edx,%eax
retq
Given the size of the array and the chain of dereferences in the comparator,
it's easy to guess the main reason for the slowdown: the comparator frequently
incurs cache misses, and where the original qsort could have overlapped
execution of adjacent comparator invocations thanks to branch prediction, the new one has data dependencies
preventing that.

How to modify mergesort to get the best of both
worlds: no branches depending on comparator output, and some possibility of
overlapping execution?

The End

In conclusion, I want to say that I find that this sort implementation strikes a
very interesting complexity/size/speed compromise. A stable sort can be
implemented on top of it with very little changes, and increasing pipelining in
merge loop is also possible. On the other hand, it's also possible to reduce
size further by, say, dropping the 5-input sorting net.

CC-BY-SA © 2018 Alexander Monakov <amonakov@gmail.com>
Element size, bytes	CPU cycles	Call count
4	10.69 × 10⁶	4992
8	152.76 × 10⁶	147812
16	0.42 × 10⁶	1912
32	6.94 × 10⁶	5185
40	0.49 × 10⁶	1802