Skip to content

Instantly share code, notes, and snippets.

@amonakov
Last active October 6, 2020 12:50
Show Gist options
  • Star 7 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save amonakov/1de8c395f544e4d65ade63e0b9f50395 to your computer and use it in GitHub Desktop.
Save amonakov/1de8c395f544e4d65ade63e0b9f50395 to your computer and use it in GitHub Desktop.
Optimizing a Sort Function

In 2018, uses of qsort function in GCC were redirected to a custom implementation. A trivial change compared to typical compiler engineering problems, but the new implementation is claimed to be notably fast, so I'll try to describe the optimization ideas here.

Throughout the article, blocks like this one give something to ponder on.

Of course, there's no algorithmic novelty in the new gcc_qsort: it uses a combination of two well-known schemes, merge sort and sorting networks. The main interest here is in reducing constant factors, not the big-O time complexity.

But first, a few words on why that happened at all.

Deterministic translation

Like many tools, GCC wants to offer output reproducibility: machine code emitted by the compiler should not depend on current phase of the moon, amount of free memory, number of CPU cores in the machine, etc. Of particular importance is invariance with regards to debug info generation: instruction stream generated by the compiler shouldn't change if the compiler is asked to augment it with annotations for the debugger.

To ensure this, GCC developers are required to test their changes with a three-stage bootstrap procedure. The first stage is built using a pre-existing compiler. The second stage is built using the first-stage compiler, without debug info. Finally, the third stage is built with the same flags but debug info enabled, and machine code for the last two stages is tested for equality. The build is terminated with an error if some object files differ.

How does LLVM solve this? Standard LLVM build procedure does not involve a three-stage bootstrap.

This approach has been used for a long time to ensure, on an ongoing basis, that this invariance is kept, and now there's another mechanism, -fcompare-debug, that can be used to automatically check it on any valid program. A similar technique could be used, for instance, to check invariance with regards to requested warnings.

Now wait a second! You say that GCC strives to be deterministic, but if I run it under strace it clearly reads from /dev/urandom, and there's a -frandom-seed command-line switch? What gives? Surely it would be only more deterministic if it didn't use random numbers in the first place?

Another desirable property (and something that is not ensured by bootstrapping at all) is invariance with respect to compiler host platform: it would be nice if GCC developers using Glibc systems could easily reproduce compiler crashes found on musl-based or BSD systems, or if cross-compiled binaries would not come out slightly different depending on what architecture the toolchain was running on.

Calls to qsort is one well-recognized cause of this host-dependent variability. Indeed, C qsort is not a stable sort, so there's no guarantee on how it reorders elements that compare equal according to the comparator. If it swaps two elements that compare equal, but nevertheless hold different data, it may affect how rest of compilation proceeds.

Suppose the comparator passed to qsort never returns 0, except if compared elements are bitwise identical. Is it possible for different qsort implementations to reorder the array differently in this case? And if not, can't you simply fall back to bitwise comparison via memcmp in all comparators for last tie-breaking to ensure determinism?

Furthermore, it is possible for comparators to be invalid by failing to impose total order on the input set by lacking transitivity or anti-symmetric property. This results in undefined behavior, so now development builds of GCC use the qsort_chk wrapper that checks those properties.

What is the time complexity of algorithms checking reflexivity, anti-symmetry and transitivity properties?

To make compiler's output independent of qsort implementation, one might attempt to ensure all comparators passed to qsort implement a sufficiently strict order. Unfortunately, there are between 100 and 200 qsort call sites in GCC, almost all with different comparators, so that sounds impractical (and Sysyphean: even if at some point all comparators are claimed to be "proper", a new comparator added the next day might be broken again).

Did we need to apply a patch with 100–200 hunks to redirect all qsort calls to the qsort_chk wrapper?

Another option is to use a stable sort instead of qsort. This is an overkill solution: stability is a stronger property than required here, stable sorting is relatively more expensive, and there's no standard stable sort in C. Using C++ std::stable_sort is also a poor option.

There's also the option of avoiding the external dependency on qsort, and using an internal qsort implementation instead. Taking this option gives a nice excuse for writing a shiny new sort function.

Constraints

Naturally, if we don't want to sacrifice compilation speed, we want the replacement function to be efficient. Indeed, if the replacement improves output reproducibility at the cost of slowing down the compiler by 1%, it might worry reviewers. Therefore, we should begin by investigating what constraints it should obey and how workloads look like.

For sorting, constraints make a lot of difference. C qsort interface is very generic: the implementation cannot rely on being able to allocate storage for even a single element, and needs to check both size and alignment of elements to efficiently swap them. On the other hand of the spectrum, sorting tasks where comparators are fixed in advance admit specialized sorting implementations, perhaps even employing vector extensions, that reach very high throughput.

Since there are many qsort call sites in GCC, we want the replacement function to be signature-compatible to C qsort. On the other hand, we don't need it to be strictly API-compatible, and may relax its contract, namely:

  • we may allow gcc_qsort to allocate memory, aborting on failure
  • we can assume the comparator may be applied to elements residing in the allocated buffer

Profiles

Now let's run GCC with a wrapper to see how it invokes qsort: what are the typical number of elements, element size, time under qsort, etc. This article takes an un-scientific approach of using a single workload, compiling a 56094-line C++ file.

In my experiment, gcc-6 made 406971 calls to qsort, spending roughly 204 million cycles under qsort, with total run time about 57 billion cycles. We can already say that optimizing qsort is not going to significantly improve overall compiler performance, as it accounts for less than 0.5% of total time.

Next, let's look at the distribution of element counts. Plotting call count against element count, we see that GCC frequently calls qsort with a tiny element count, and calls sorting just 2 or 3 elements really stand out:

call count by element count, old

By recording addresses of functions invoking qsort with __builtin_return_address and then resolving them with addr2line, we can quickly find out that 257794 calls, forming a majority of two-three element sorts, come from a single call site in "dominance walker" class, where it's used to reorder child nodes in a dominator tree.

We can note that typically nodes in dominator trees have 0, 2 or 3 children, and introduce a dedicated path for those cases that doesn't use library qsort at all. This was addressed during GCC 8 development, and now running the same test results in 161703 qsort calls, with not so strong outliers for 2 and 3 element counts:

call count by element count, new

Instead of call count, we can try plotting total cycle count for given element size. The corresponding graph might make a (misleading!) impression that most of the time we sort 20 elements or less, and that's where the optimization focus should be:

cycle count by element count

Plotting cumulative time (sum of times for given and smaller counts) gives a more honest impression. Alternatively, aggregating into bins with exponentially increasing sizes would also show the distribution better: time is roughly equally split between small (10 or less) and large element counts.

cumulative cycle count

Aggregating by element size rather than element count shows that GCC mostly sorts 8 byte sized elements (usually pointers; statistics collected from a 32-bit compiler would look different):

Element size, bytes CPU cycles Call count
4 10.69 × 106 4992
8 152.76 × 106 147812
16 0.42 × 106 1912
32 6.94 × 106 5185
40 0.49 × 106 1802

This suggests we want to have dedicated paths for moving pointer-sized and perhaps int-sized elements. Nevertheless, we should keep the code simple and small: qsort by itself is not a hot function, and we want to avoid evicting other useful code from the L1 cache.

Implementation

Sorting networks work nicely for small arrays. They needs to be paired with some divide-and-conquer algorithm for larger arrays. For gcc_qsort I went with classic merge sort. Although it requires dynamic allocation, and so cannot be used to implement C qsort, which is not allowed to fail due to memory exhaustion, it still can be used for sorting in the compiler.

In gcc_qsort, applying a sorting network either works in-place or relocates from one buffer to another: this is used to avoid copying in merge sort. Here's how it works. First, some helper declarations:

#ifdef __GNUC__
#define likely(cond)  __builtin_expect((cond), 1)
#define noinline      __attribute__((__noinline__))
#else
#define likely(cond)  (cond)
#define noinline
#endif

typedef int cmp_fn(const void *, const void *);

/* Structure holding read-mostly (read-only in netsort) context. */
struct sort_ctx {
	cmp_fn *cmp; // pointer to comparator
	char *out;   // output buffer
	size_t n;    // number of elements
	size_t size; // element size
};

And the sorting network implementation itself:

/* Helper for netsort. Invoke comparator CMP on E0 and E1.
   Return E0^E1 if E0 compares less than E1, zero otherwise.
   This is noinline to reduce code growth and confine invocation
   to a single call site, assisting indirect branch prediction. */
noinline static intptr_t cmp1(char *e0, char *e1, cmp_fn *cmp)
{
	intptr_t x = (intptr_t) e0 ^ (intptr_t) e1;
	return x & (cmp(e0, e1) >> 31);
}

/* Apply a sorting network to 2 to 5 elements from IN, placing them into C->OUT.
   IN may be equal to C->OUT, in which case elements are sorted in place. */
static void netsort(char *in, struct sort_ctx *c)
{
#define CMP(e0, e1) do {        	   \
	intptr_t x = cmp1(e1, e0, c->cmp); \
	e0 = (char *)((intptr_t)e0 ^ x);   \
	e1 = (char *)((intptr_t)e1 ^ x);   \
} while (0)
	char *e0 = in, *e1 = e0 + c->size, *e2 = e1 + c->size;
	CMP(e0, e1);
	if (likely(c->n == 3)) {
		CMP(e1, e2);
		CMP(e0, e1);
	}
	if (c->n <= 3) {
		reorder23(c, e0, e1, e2);
		return;
	}
	char *e3 = e2 + c->size, *e4 = e3 + c->size;
	if (likely(c->n == 5)) {
		CMP(e3, e4);
		CMP(e2, e4);
	}
	CMP(e2, e3);
	if (likely(c->n == 5)) {
		CMP(e0, e3);
		CMP(e1, e4);
	}
	CMP(e0, e2);
	CMP(e1, e3);
	CMP(e1, e2);
	reorder45(c, e0, e1, e2, e3, e4);
}

Does this netsort function implement a stable sort?

Note how this sorting network implementation keeps pointers to elements on registers and swaps them in a branchless fashion based on comparator result. In this approach, no auxiliary array is needed to store the order of comparison steps in the network: they are explicit in program flow. This requires "unrolling" the network, but note how the implementation folds together four networks and ends up only two comparators longer than the longest individual network (11 comparators here vs. 9 comparators for 5-input sorting network).

Of course, extending this implementation to handle 6 inputs would blow up code size, but at that point we'd also run out of available registers on x86-64: there are 6 callee-saved registers, which we need for the c pointer and e0, …, e4. Also, sorting networks get less efficient in terms of comparator invocations as size grows: at size 5 already they use more comparisons than necessary (9, while only 7 are necessary and sufficient; merge sort uses up to 8).

As already noted, the code takes care to swap pointers via bitwise operations rather than conditional branches. It trades control dependencies for more data dependencies, so there's never any risk of throwing away pipelined operations when a branch depending on comparator result is mispredicted. Extra data dependencies imply that some steps are serialized where they could run with overlapping if correctly predicted, but four- and five-input networks nevertheless have some completely independent comparison steps.

As netsort function only permutes pointers in registers, it is the job of helper functions like reorder23 to reorder the corresponding pointed-to values that can be of arbitrary size. This is where using sorting networks really pays off compared to swap-based sort implementations.

Doesn't this have lots of overhead due to memcpy calls? And why do we need those at all, what's wrong with good old t0 = *(TYPE *)e0?

/* Helper for netsort. Permute, possibly in-place, 2 or 3 elements,
   placing E0 to C->OUT, E1 to C->OUT + C->SIZE, and so on. */
static void reorder23(struct sort_ctx *c, char *e0, char *e1, char *e2)
{
#define REORDER_23(TYPE, STRIDE, OFFSET) do {   		 \
	TYPE t0, t1;                                    	 \
	memcpy(&t0, e0 + OFFSET, sizeof t0);            	 \
	memcpy(&t1, e1 + OFFSET, sizeof t1);            	 \
	char *out = c->out + OFFSET;                    	 \
	if (likely(c->n == 3))                          	 \
		memmove(out + 2*STRIDE, e2 + OFFSET, sizeof t0); \
	memcpy(out, &t0, sizeof t0); out += STRIDE;     	 \
	memcpy(out, &t1, sizeof t1);                    	 \
} while (0)

	if (likely(c->size == sizeof(size_t)))
		REORDER_23(size_t, sizeof(size_t), 0);
	else if (likely(c->size == sizeof(int)))
		REORDER_23(int, sizeof(int), 0);
	else {
		size_t offset = 0, step = sizeof(size_t);
		for (; offset + step <= c->size; offset += step)
			REORDER_23(size_t, c->size, offset);
		for (; offset < c->size; offset++)
			REORDER_23(char, c->size, offset);
	}
}

Why does the middle step in REORDER_23 need to use memmove instead of memcpy?

Finally, for sizes larger than 5, a simple recursive merge sort, again with branchless selection of the smaller element in the merging loop:

Does this mergesort function implement a stable sort?

/* Execute merge sort on N elements from IN, placing them into OUT,
   using TMP as temporary storage if IN is equal to OUT. */
static void
mergesort(char *in, struct sort_ctx *c, size_t n, char *out, char *tmp)
{
	if (likely(n <= 5)) {
		c->out = out;
		c->n = n;
		netsort(in, c);
		return;
	}
	size_t nl = n / 2, nr = n - nl, sz = nl * c->size;
	char *mid = in + sz, *r = out + sz, *l = in == out ? tmp : in;
	/* Sort the right half, outputting to right half of OUT. */
	mergesort(mid, c, nr, r, tmp);
	/* Sort the left half, leaving left half of OUT free.  */
	mergesort(in, c, nl, l, mid);

	/* Merge sorted halves given by L, R to [OUT, END). */
#define MERGE_ELTSIZE(SIZE) do {	         \
	intptr_t mr = c->cmp(r, l) >> 31;        \
	intptr_t lr = (intptr_t)l ^ (intptr_t)r; \
	lr = (intptr_t)l ^ (lr & mr);            \
	out = memcpy(out, (char *)lr, SIZE);     \
	out += SIZE;                             \
	r += mr & SIZE;                          \
	if (r == out) return;                    \
	l += ~mr & SIZE;                         \
} while (r != end)

	if (likely(c->cmp(r, l + (r - out) - c->size) < 0)) {
		char *end = out + n * c->size;
		if (sizeof(size_t) == 8 && likely(c->size == 8))
			MERGE_ELTSIZE(8);
		else if (likely(c->size == 4))
			MERGE_ELTSIZE(4);
		else
			MERGE_ELTSIZE(c->size);
	}
	memcpy(out, l, r - out);
}

What does the branch following the first comparator invocation achieve?

And at last, a qsort-compatible wrapper that takes care of filling sort_ctx and allocating a scratch buffer:

void gcc_qsort(void *base, size_t n, size_t size, cmp_fn *cmp)
{
	if (n < 2)
		return;
	struct sort_ctx c = {cmp, base, n, size};
	long long scratch[32];
	size_t bufsz = (n / 2) * size;
	void *buf = bufsz <= sizeof scratch ? scratch : xmalloc(bufsz);
	mergesort(base, &c, n, base, buf);
	if (buf != scratch)
		free(buf);
}

Comparison

Plotting cumulative cycle counts as before and also ratio of new vs. old cycle counts shows that proper early-out buys a 10x speedup for sizes 0 and 1, and for sizes 2–4 sorting networks net a roughly 2x speedup. As element counts grow, speedups diminish and slowdowns get more frequent, until at size 1530 a single sort originally taking about a million cycles gets 35% slower.

comparison

This is the machine code of the comparator used in the sort that gets dramatically slower:

ipa_icf::sort_congruence_class_groups_by_decl_uid:
mov    (%rdi),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x18(%rax),%rcx
mov    (%rsi),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x8(%rax),%rax
mov    0x18(%rax),%rdx
mov    0x1c(%rdx),%eax
cmp    %eax,0x1c(%rcx)
mov    $0xffffffff,%edx
setg   %al
movzbl %al,%eax
cmovl  %edx,%eax
retq

Given the size of the array and the chain of dereferences in the comparator, it's easy to guess the main reason for the slowdown: the comparator frequently incurs cache misses, and where the original qsort could have overlapped execution of adjacent comparator invocations thanks to branch prediction, the new one has data dependencies preventing that.

How to modify mergesort to get the best of both worlds: no branches depending on comparator output, and some possibility of overlapping execution?

The End

In conclusion, I want to say that I find that this sort implementation strikes a very interesting complexity/size/speed compromise. A stable sort can be implemented on top of it with very little changes, and increasing pipelining in merge loop is also possible. On the other hand, it's also possible to reduce size further by, say, dropping the 5-input sorting net.


CC-BY-SA © 2018 Alexander Monakov <amonakov@gmail.com>
@swenson
Copy link

swenson commented Oct 13, 2018

Thanks for the analysis. This is a fascinating way to benchmark sorting functions. I would love to see how well sort.h does with similar analysis, and if it could benefit from using a network sort for very small sizes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment