Massively reducing MoarVM Fixed Size Allocator contention

The latest MoarVM release, 2017.04, contains a significant improvement for multi-threaded applications, especially those that are CPU-bound. I mentioned the improvement briefly on Twitter, because it's hard to be anything other than brief on Twitter. This post contains a decidedly less brief, though hopefully somewhat interesting, description of what I did.

The Fixed Size Allocator

The most obvious way to allocate memory in a C program is through functions like malloc and calloc. Indeed, we do this plenty in MoarVM. The malloc and calloc implementations in C libraries have certainly been tuned a bunch, but at the same time they have to have good behavior for a very wide range of programs. They also need to keep track of the sizes of allocations, since a call to free does not pass the size of the memory being released. And they need to try to avoid fragmentation, which can lead to out-of-memory errors occurring because the heap ends up with lots of small gaps, but none big enough to allocate a larger object.

When we know a few more properties of the memory usage of a program, and we have information around to know the size of the memory block we are freeing, it's possible to do a little better. MoarVM does this in multiple ways.

One of them is by using a bump-the-pointer allocator for memory that is managed by the garbage collector. These have a header that points to a type table that knows the size of the object that was allocated, meaning the size information is readily available. And the GC can move objects around in memory, since it can find all of the references to an object and update them, meaning there is a way out of the fragmentation trap too.
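
To make that concrete, here is a minimal bump-the-pointer sketch in C. It is purely illustrative: the names are hypothetical, and MoarVM's real GC allocator is more involved.

    #include <stddef.h>

    /* A minimal bump-the-pointer sketch; hypothetical names, not
     * MoarVM's actual code. */
    typedef struct {
        char *cur;   /* next free byte in the region */
        char *end;   /* end of the region */
    } Nursery;

    void *nursery_alloc(Nursery *n, size_t size) {
        if (n->cur + size > n->end)
            return NULL;      /* full; a real GC would collect here */
        void *result = n->cur;
        n->cur += size;       /* "bump the pointer" */
        return result;
    }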

The call stack is another example. In the absence of closures, it is possible to allocate a block of memory and use it like a stack. When a program makes a call, the current location in the memory block is taken as the address for the call frame memory, and the location is bumped by the frame size. This could be seen as a "push", in stack terms. Because call frames are left in the opposite order to which they are entered, freeing them is just subtraction. This could be seen as a "pop". Since holes are impossible, fragmentation cannot occur.
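
Here is how that stack discipline might look in C, with hypothetical names; real call frame handling also has to worry about alignment, overflow, and (as noted) closures.

    #include <stddef.h>

    /* Call frames allocated stack-style; hypothetical sketch. */
    static char  frame_block[1 << 20];   /* one block used as a stack */
    static char *frame_top = frame_block;

    void *frame_enter(size_t frame_size) {
        void *frame = frame_top;
        frame_top += frame_size;         /* "push" */
        return frame;
    }

    void frame_leave(size_t frame_size) {
        frame_top -= frame_size;         /* "pop" is just subtraction */
    }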

A third case is covered by the fixed size allocator. This is the most difficult of the three. It tries to do a better job than malloc and friends in the case that, at the point when memory is freed, we readily know the size of the memory. This allows it to create regions of memory that consist of N blocks of a fixed size, and allocate the memory out of those regions (which it calls "pages"). When a memory request comes in, the allocator first checks if it's within the size range that the fixed size allocator is willing to handle. If it isn't, it's passed along to malloc. Otherwise, the size is rounded up to the nearest "bin size" (which are 8 bytes, 16 bytes, 24 bytes, and so forth). A given bin consists of:

  • 1 or more pages, each with space for a certain number of allocations of the bin size
  • A free list of available memory locations of that size, which is maintained as a linked list (that is, each location contains a pointer to the next free location, with a NULL marking the end)

If the free list contains any entries, then one of them will be taken. If not, then the pages will be considered. If the current page is not full, then the allocation will be made from it. Otherwise, another page will be allocated. When memory is freed, it is always added to the free list of the appropriate bin. Therefore, a longer-running program, in steady state, will typically end up getting all of its allocations from the free list.
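
Putting the pieces together, here is a single-threaded sketch of one bin in C. The names are hypothetical and error handling is omitted; MoarVM's own data structures differ in detail.

    #include <stdlib.h>

    #define PAGE_ITEMS 128      /* assumed allocations per page */

    typedef struct FreeNode {
        struct FreeNode *next;  /* each free location points to the next */
    } FreeNode;

    typedef struct {
        size_t    item_size;    /* the bin size: 8, 16, 24, ... bytes */
        FreeNode *free_list;    /* NULL-terminated list of freed items */
        char     *page_cur;     /* next unused slot in the current page */
        char     *page_end;     /* end of the current page */
    } SizeBin;

    void *bin_alloc(SizeBin *bin) {
        /* First preference: the free list. */
        if (bin->free_list) {
            FreeNode *taken = bin->free_list;
            bin->free_list = taken->next;
            return taken;
        }
        /* Current page full (or no page yet): allocate another. */
        if (bin->page_cur == bin->page_end) {
            bin->page_cur = malloc(PAGE_ITEMS * bin->item_size);
            bin->page_end = bin->page_cur + PAGE_ITEMS * bin->item_size;
        }
        /* Allocate from the current page. */
        void *result = bin->page_cur;
        bin->page_cur += bin->item_size;
        return result;
    }

    void bin_free(SizeBin *bin, void *p) {
        /* Freed memory always goes on the free list. */
        FreeNode *node = p;
        node->next = bin->free_list;
        bin->free_list = node;
    }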

Enter threads

Building a fixed size allocator for a single-threaded environment isn't all that difficult. But what happens when it needs to cope with being used in a multi-threaded program? Well...it's complicated. Clearly, it is not possible to have a single global fixed size allocator and have all of the threads just use it without any kind of concurrency control. Taking an item off the freelist is a multi-step process, and allocating from a page - or allocating a new page - is even more steps. Without concurrency control, there will be data races all over, and we'll be handed a SIGSEGV in record time.

It's worth stopping to consider what would happen if we were to give every thread its own fixed size allocator. This turns out to get messy fast, as memory allocated on one thread may be freed by another. A seemingly simple scheme is to say that the freed memory is simply appended to the freelist of the freeing thread's fixed size allocator. Unfortunately, this has two bad properties.

  1. When the thread ends, we can't just throw away the pages - because bits of them may still be in use by other threads, or referenced in the free lists of other threads. So they'd need to be somehow "re-homed", which is going to need some kind of coordination. Further measures may be needed to mitigate memory fragmentation in programs that spawn and join many threads during their lifetimes.
  2. Imagine a producer/consumer setup, where one thread does allocations and passes the allocated memory to another thread, which processes the data in the memory and frees it. The producing thread will build up a lot of pages to allocate out of. The consuming thread will build up an ever longer free list. Memory runs out. D'oh.

So, MoarVM went with a single global fixed size allocator. Of course, this has the drawback of needing concurrency control.

Concurrency control

The easiest possible form of concurrency control is to have threads acquire a mutex on every allocate and free operation. This has the benefit of being very straightforward to understand and reason about. It has the disadvantage of being extremely costly. Mutex acquisition can be relatively cheap, but it gets expensive when there is high contention - that is, lots of threads trying to obtain the lock. And since all CPU-bound threads will typically allocate some working memory, particularly in a VM for a dynamic language that doesn't yet do escape analysis, that adds up to a lot of contention.

So, MoarVM did something more sophisticated.

First, the easy part. It's possible to append to a free list with a CPU-provided atomic operation, provided taking from the free list is also done with one. So, no mutex acquisition is required for freeing memory. However, an atomic operation still requires a kind of locking down at the CPU level. It's cheaper than a mutex acquire/release for sure, but there will still be contention between CPU cores for the cache line holding the head of the free list.
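
In C11 terms, the free operation is a compare-and-swap loop that pushes onto the list head. This is a standard C11 illustration with hypothetical names, not MoarVM's actual code:

    #include <stdatomic.h>

    typedef struct FreeNode { struct FreeNode *next; } FreeNode;

    /* Lock-free free: push node onto the free list head. */
    void lockfree_free(_Atomic(FreeNode *) *head, FreeNode *node) {
        FreeNode *current = atomic_load(head);
        do {
            node->next = current;
            /* On failure, current is reloaded with the latest head. */
        } while (!atomic_compare_exchange_weak(head, &current, node));
    }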

What about allocation? It turns out that, in a language with manual memory management, we cannot just take from a free list using an atomic operation without hitting the ABA problem (gory details in the footnote). Therefore, some kind of locking is needed to ensure an ordering on the operations. In most cases, the atomic operation will work on the first attempt (it's only competing with frees, which happen without any kind of locking, so a retry will occasionally be needed). In cases where something will complete very rapidly, a spinlock may be used in place of a full-on mutex. So, the MoarVM fixed size allocator's allocation scheme boiled down to the following (sketched in C after the list):

  1. Acquire the spin lock.
  2. Try to take from the free list in a loop, until either we succeed or the free list is seen to be empty.
  3. Release the spin lock.
  4. If we failed to obtain memory from the free list, take the slow path to get memory from a page, allocating another page if needed. This slow path does acquire a real mutex.
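
Sketched in C11, again with hypothetical names (fsa_alloc_slow stands in for the mutex-protected page path) and the spin lock simplified to an atomic flag:

    #include <stdatomic.h>

    typedef struct FreeNode { struct FreeNode *next; } FreeNode;
    typedef struct { _Atomic(FreeNode *) free_list; /* pages, etc. */ } SizeBin;

    extern void *fsa_alloc_slow(SizeBin *bin);  /* hypothetical slow path */

    static atomic_flag alloc_spin = ATOMIC_FLAG_INIT;

    void *fsa_alloc(SizeBin *bin) {
        void *result = NULL;

        /* 1. Acquire the spin lock. */
        while (atomic_flag_test_and_set(&alloc_spin))
            ;  /* spin; the critical section is very short */

        /* 2. Take from the free list until success or seen empty. */
        FreeNode *head = atomic_load(&bin->free_list);
        while (head) {
            if (atomic_compare_exchange_weak(&bin->free_list, &head, head->next)) {
                result = head;
                break;
            }
            /* CAS failed: a concurrent free moved the head; head has
             * been reloaded by the CAS, so just try again. */
        }

        /* 3. Release the spin lock. */
        atomic_flag_clear(&alloc_spin);

        /* 4. Fall back to page allocation under a real mutex. */
        return result ? result : fsa_alloc_slow(bin);
    }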

Contention

First up, I'll note that the strategy outlined above does beat the "just take a mutex for every allocate/free" approach - at least, in all of the benchmarks I've considered. Frees end up being lock free, and most of the allocations just do a spin lock and an atomic operation.

At the same time, contention is contention. No lock free data structure or spinlock changes this. If multiple threads are constantly scrambling to work on the same memory location - such as the head of a free list - it's going to get expensive. How expensive? On an Intel Core i7, obtaining a cache line that is held by another core exclusively - which it typically will be under contention - costs somewhere around 70 CPU cycles. It gets worse in a multi-CPU setup, where it could easily cost hundreds of CPU cycles.

But how much could this end up costing in a real world Perl 6 application? I recently had the chance to find out, and the numbers were ugly. Measurements obtained by perf showed that a stunning 40% of the application's runtime was spent inside of the fixed size allocator. (Side note: perf is a sampling profiler, which - in my handwavey understanding - snapshots the callstack at regular intervals to figure out where time is being spent. My experience has been that sampling profilers tend to be better at showing up surprising costs like this than instrumenting profilers are, even if they are in some senses less precise.)

Making things better

Clearly, there was significant room for improvement. And, happily, things are now muchly improved and my real-world program did get something close to a 40% performance boost.

To make things better, I introduced per-thread free lists, while leaving pages global and also retaining the global free lists.

Memory is allocated in the first place from global pages, as before. However, when it is freed, it is instead placed on a per-thread free list (with one free list per thread per size bin). When a thread needs memory, it first checks its thread-local free list to see if there is anything there. It will only then look at the global free list, or the global pages, if the thread-local free list cannot satisfy the memory request. The upshot of this is that the vast majority of allocations and frees performed by the fixed size allocator no longer have any contention.
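
The allocation fast path can be sketched like this (hypothetical names again); note that the thread-local list needs no atomic operations at all:

    #include <stddef.h>

    typedef struct FreeNode { struct FreeNode *next; } FreeNode;

    typedef struct {
        FreeNode *free_list;  /* this thread's list for one size bin */
        size_t    length;     /* used to cap growth; see the free path below */
    } ThreadBin;

    /* Hypothetical: global free list, then global pages. */
    extern void *fsa_alloc_global(int bin_index);

    void *fsa_alloc(ThreadBin *tbin, int bin_index) {
        /* Fast path: no other thread touches tbin, so no atomics. */
        if (tbin->free_list) {
            FreeNode *taken = tbin->free_list;
            tbin->free_list = taken->next;
            tbin->length--;
            return taken;
        }
        /* Slow path: global free list, then global pages, as before. */
        return fsa_alloc_global(bin_index);
    }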

However, as I mentioned earlier, one needs to be very careful when introducing things like thread-local freelists to not create bad behavior when a thread terminates or in producer/consumer scenarios. Therefore:

  • When a thread terminates, it will donate all the locations on its free list back to the global free list. Since pages are entirely global, there isn't a fragmentation issue.
  • When a thread's local free list for a particular size hits a limit, it will instead start to free to the global free list, which prevents unbounded free list growth in producer/consumer style programs (see the sketch below).
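
The corresponding free path, reusing the ThreadBin sketch above; LOCAL_FREE_LIMIT is an assumed value for illustration, not the limit MoarVM actually uses:

    #define LOCAL_FREE_LIMIT 1024  /* assumed cap, for illustration */

    /* Hypothetical: atomic push onto the global free list. */
    extern void fsa_free_global(int bin_index, FreeNode *node);

    void fsa_free(ThreadBin *tbin, int bin_index, void *p) {
        FreeNode *node = p;
        if (tbin->length < LOCAL_FREE_LIMIT) {
            /* Common case: thread-local, contention-free. */
            node->next = tbin->free_list;
            tbin->free_list = node;
            tbin->length++;
        }
        else {
            /* Local list at its cap: free to the global list instead,
             * bounding growth in producer/consumer programs. */
            fsa_free_global(bin_index, node);
        }
    }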

So, I believe this improvement is good for performance without being badly behaved in any cases that previously would have worked out fine.

Can we do better?

Always! While the major contention bottleneck is gone, there are further opportunities for improvement that are worth exploring in the future.

  • It would probably make sense to donate a bunch of free list entries to the global free list at a time, rather than donating them individually. This would mean the contention of doing so only happens every N entries, not every entry, in the producer/consumer case. This is a fairly easy fix. (Homework, anyone? :-))
  • The current scheme is vulnerable to false sharing. The strategy of packing allocations of the same size together in pages is, in a single-threaded program, typically good for the CPU cache. But in a multi-threaded program, we can end up placing allocations that different threads are using next to each other. This means they may end up on the same cache line (the 32-byte or 64-byte memory regions that CPU caches hold). This is a harder nut to crack.

In summary...

If you have CPU-bound multi-threaded Perl 6 programs, MoarVM 2017.04 could offer a big performance improvement. For my case, it was close to 40%. And the design lesson from this: on modern hardware, contention is really costly, and using a lock free data structure or picking the "best kind of lock" will not overcome that.


Footnote on the ABA vulnerability: It's decidedly interesting - at least to me - that prepending to a free list can be safely done with a single atomic operation, but taking from it cannot be. Here I'll attempt to provide a proof for these claims.

We'll consider a single free list whose head lives at memory location F, and two threads, T1 and T2. We assume the existence of an atomic operation, TRY-CAS(location, old, new), which will - in a single CPU instruction that may not be interrupted - compare the value in memory pointed to by location with old and, if they match, replace it with new. (CAS is short for Compare And Swap.) The TRY-CAS function evaluates to true if the replacement took place, and false if not. The threads may be preempted (that is, taken off the CPU) at any point in time.
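
For reference, TRY-CAS corresponds closely to C11's compare-exchange; a hypothetical direct translation:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* TRY-CAS(location, old, new) in C11 atomics. */
    bool try_cas(_Atomic(void *) *location, void *old, void *new) {
        /* The strong variant fails only if *location != old. */
        return atomic_compare_exchange_strong(location, &old, new);
    }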

To show that allocation is vulnerable to the ABA problem, we just need to find an execution where it happens. First of all, we'll define the operation ALLOCATE as:

1: do
2:     allocated = *F
3:     if allocated != NULL
4:         next = allocated.next    
5: while allocated != NULL && !TRY-CAS(F, allocated, next)
6: return allocated

And FREE(C) as:

1: do
2:     current = *F
3:     C.next = current;
4: while !TRY-CAS(F, current, C)

Let's consider a case where we have 3 memory cells, C1, C2, and C3. The free list head F points to C1, which in turn points to C2, which in turn points to C3.

Thread T1 enters ALLOCATE, but is preempted immediately after the execution of line 4. At this point, allocated contains C1 and next contains C2.

Next, T2 calls ALLOCATE, and succeeds in making an allocation. F now points to C2. It again calls ALLOCATE, meaning that F now points to C3. It then calls FREE(C1). At this point, F points to C1 again, and C1 points to C3. Notice that at this point, cell C2 is considered to be allocated and in use.

Consider what happens if T1 is resumed. It performs TRY-CAS(F, C1, C2). This operation will succeed, because F does indeed currently point to C1. This means that F now comes to point to C2. However, we earlier stated that C2 is allocated and in use, and therefore should not be in the free list. Therefore we have demonstrated the code to be buggy, and shown how the bug arises as a result of the ABA problem.

What of the claim that the FREE(C) is not vulnerable to the ABA problem? To be vulnerable to the ABA problem, another thread must be able to change the state of something that the correctness of the operation depends upon, but that is not tested by the TRY-CAS operation. Looking at FREE(C) again:

1: do
2:     current = *F
3:     C.next = current;
4: while !TRY-CAS(F, current, C)

We need to consider C and current. We can very reasonably make the assumption that the calling program is well-behaved, and will never use the cell C again after passing it to FREE(C) (unless it obtains it again in the future through another call to ALLOCATE, which cannot happen until FREE has inserted it into the free list). Therefore, C cannot be changed in any way other than the code in FREE changes it. The FREE operation holds the sole reference to C at this point.

Life is much more complicated for current. It is possible for a preemption to take place at line 3 of FREE, followed by another thread allocating the cell pointed to by current and then freeing it again, which is certainly a case of an ABA state change. However, unlike the situation we saw in ALLOCATE, the FREE operation does not depend on the content of current. We can see this by noticing how it never looks inside of it, and instead just holds a reference to it. An operation cannot depend upon a value it never accesses. Therefore, FREE is not vulnerable to the ABA problem.
