
@no-defun-allowed
Last active November 23, 2021 07:59
Notes on Pauseless Immix with Thread-Local Heaps (sometimes)

Introduction

We want a thread-local GC for SICL. If most collections are thread-local, we improve both latency and throughput, even with many threads.

The plan for a while was to use something like the Doligez-Leroy-Gonthier (DLG) collector [fn:thread-local-ml]. However, the DLG collector “crucially relies on the ML compile-time distinction between mutable and immutable objects”: immutable objects are allocated thread-locally while mutable objects are always allocated globally. Common Lisp has no such distinction; even objects with no mutable slots have identity under eq.

[fn:eliminating-read-barriers] describes a modification of the DLG collector which allows for thread-local mutable objects. When an object is made global, generally the whole heap would have to be traced to move references to the object into the global heap. The cost of this tracing is reduced by a “cleanliness” heuristic, which tracks the bounds of where the object may be referenced in the nursery. However, making objects global still requires scanning the stack, the registers, and, at best, some suffix of the heap.

One option for reducing tracing is to promote more objects into the global heap, but this increases pressure on the global collector, so it is not a great solution. And, of course, if objects are made global very frequently, there is not much we could do which is still thread-local.

The heaps

Instead, we are looking to migrate objects to the global heap without moving them in memory. [fn:thread-local-java] achieves this by using mark-sweep for thread-local and global heaps, so that migration can be achieved by modifying lists of objects. However, we are not sure that using mark-sweep for a nursery is a good idea.

Speaking of nursery collectors, we would prefer a collector which preserves allocation order, so that it is easy to identify the oldest objects to promote. As such, we still plan to use Strandh’s sliding collector [fn:sliding-gc].

We plan to use the Immix mark-region collector [fn:immix] for the global heap, as it can collect memory without compacting all of the time. Using different collectors means that we must rebuild some of the metadata used by Immix; specifically the line allocation bitmap must be rebuilt. But, as mark-compact uses contiguous allocation, it is a matter of filling the bitmap up to the last line used.
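
As a minimal sketch of what rebuilding that metadata could look like (in C, with line size, region size, and the `region` struct all being our own assumptions, not SICL's):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE    256  /* Immix line size in bytes (assumed). */
#define REGION_LINES 1024 /* Lines per region (assumed). */

/* Hypothetical region descriptor: a base address, a bump pointer, and
   the per-line allocation bitmap Immix expects. */
struct region {
    uintptr_t base;
    uintptr_t bump; /* First free byte; mark-compact allocates contiguously. */
    uint8_t line_map[REGION_LINES];
};

/* After a sliding (mark-compact) nursery collection, all survivors sit
   contiguously at the bottom of the region, so rebuilding the line map
   is just "mark every line up to the bump pointer as used". */
static void rebuild_line_map(struct region *r) {
    size_t used = r->bump - r->base;
    size_t last = (used + LINE_SIZE - 1) / LINE_SIZE; /* Round up. */
    memset(r->line_map, 1, last);
    memset(r->line_map + last, 0, REGION_LINES - last);
}
```

No per-object scan is needed, which is the payoff of using a sliding collector in the nursery.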

As we still want a concurrent and parallel collector, and ideally one without global pauses, we intend to use most of the Pauseless collection algorithm [fn:pauseless], with some modifications.

Initially, we will just have the nursery be one region, so it follows that the region size should be a good nursery size. On our hardware, the L2 cache appears to be the largest cache which is not shared between cores, so we would probably make regions about the size of L2 cache.

Collection

Mark

Marking is performed as usual for an on-the-fly collector. Pauseless has another marking scheme which uses a “not marked through” bit on references, with the benefit that “the concurrent Mark phase will complete in a single pass despite the mutators busily modifying the heap.” However, it also requires a read barrier which is frequently trapped, resulting in a “trap storm”. As we intend to rely on the memory management unit for trapping, and it is not awfully fast, we do not want read barriers to trap frequently.

We are also not convinced that other write barrier schemes would fail to complete in a single pass, regardless of mutation rate. It is also not as if writes occur in a vacuum; we could expect roughly every write to be accompanied by a read (as otherwise we would have pointless writes), and we suspect Common Lisp programs use relatively more immutable/persistent data.

Note that DLG uses a write barrier which sets a global “dirty” bit, signifying that the heap needs to be scanned again for grey objects; but repeated full scans are inefficient with large heaps and high mutation rates. Pauseless and C4 instead enqueue objects to be processed by marking threads when a read barrier encounters an unmarked reference. Strandh proposes multiple bitmaps to be used as an “index” to zero in on a grey object quickly (p. 113 of [fn:sicl]). While the Pauseless approach indeed bounds execution time by the number of references in the heap, we anticipate Strandh’s index will be faster in practice, as it is unlikely that every object is modified in each GC cycle, and enqueuing is typically slower than updating bitmaps.
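
A two-level bitmap is one plausible realization of such an index; here is a hedged sketch in C (the field names and sizes are our own assumptions, not Strandh's design):

```c
#include <stdint.h>

#define WORDS 64 /* Fine bitmap: 64 words cover 4096 object slots (assumed). */

/* Two-level "index": summary bit i is set iff fine[i] is nonzero, so
   finding a grey object costs two count-trailing-zeros operations
   instead of a linear scan over the whole bitmap. */
struct grey_index {
    uint64_t summary;
    uint64_t fine[WORDS];
};

static void mark_grey(struct grey_index *ix, unsigned slot) {
    ix->fine[slot / 64] |= 1ull << (slot % 64);
    ix->summary |= 1ull << (slot / 64);
}

/* Returns the lowest grey slot and clears it, or -1 if none remain. */
static int pop_grey(struct grey_index *ix) {
    if (ix->summary == 0)
        return -1;
    unsigned w = (unsigned)__builtin_ctzll(ix->summary);
    unsigned b = (unsigned)__builtin_ctzll(ix->fine[w]);
    ix->fine[w] &= ~(1ull << b);
    if (ix->fine[w] == 0)
        ix->summary &= ~(1ull << w);
    return (int)(w * 64 + b);
}
```

With few grey objects per cycle, most of both bitmaps stays zero, which is exactly the case where this beats an enqueue-per-reference scheme.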

We read that ZGC uses the same read barrier as C4 on x86-64. As described in [fn:zgc], a bit mask is tested against an object pointer, and if the mask matches the pointer, the pointer must be corrected. While this avoids MMU trap overhead, it does not avoid the “trap storm” problem.

Sweep

The same as for Immix: use the mark bits to free lines in regions and return them to free lists. We separate sweeping and compacting to make reasoning about this collector somewhat more tractable.

Weak references

We probably have to handle weak references at some point, so here is a mediocre idea: a weak reference is a pair of a “lock” and the actual reference. The lock is in one of three states: unused, read by mutator, and deleted. When a mutator wants to read a weak reference:

  • if the lock is unused, try to CAS to read by mutator, else try again
  • if the lock has been read by mutator, read the actual reference
  • if the lock has been deleted, the weak reference has been killed
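
The protocol above can be sketched in C with C11 atomics; the state names and the spin loop are illustrative, not a real SICL interface:

```c
#include <stdatomic.h>
#include <stddef.h>

enum weak_state { WEAK_UNUSED, WEAK_READ_BY_MUTATOR, WEAK_DELETED };

/* A weak reference as described above: a small "lock" plus the referent. */
struct weak_ref {
    _Atomic int lock;
    void *referent;
};

/* Mutator-side read: loop until we either pin the reference by moving
   the lock to "read by mutator", or learn the collector deleted it.
   Returns the referent, or NULL if the weak reference was killed. */
static void *weak_read(struct weak_ref *w) {
    for (;;) {
        int s = atomic_load(&w->lock);
        if (s == WEAK_DELETED)
            return NULL;
        if (s == WEAK_READ_BY_MUTATOR)
            return w->referent;
        int expected = WEAK_UNUSED;
        if (atomic_compare_exchange_weak(&w->lock, &expected,
                                         WEAK_READ_BY_MUTATOR))
            return w->referent;
        /* CAS lost a race; try again. */
    }
}

/* Collector-side kill: only succeeds if no mutator has pinned it. */
static int weak_kill(struct weak_ref *w) {
    int expected = WEAK_UNUSED;
    return atomic_compare_exchange_strong(&w->lock, &expected, WEAK_DELETED);
}
```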

The collector should process weak references at the end of the marking phase. However, weak references pose a termination problem: it is possible for the mutator to create a new grey object by reading a weak reference. To disallow this, we imagine we could efficiently disallow mutators from following unlocked weak references, ensure that there are no grey objects, then perform a proper atomic flip to have every mutator check mark bits before deciding to strengthen a weak reference.

We could also use a properly “on the fly” solution, such as the one proposed in [fn:reference-objects], and it would not be too much more sophisticated.

Relocate

Here is where the algorithm gets interesting. The first thing the GC does is protect the regions to relocate, and then request that all mutators checkpoint and flip any roots they have to tospace. We imagine that this would require mutators to perform local collections and update any references to protected regions, much like the DLG collector.

There is still the possibility that a mutator will attempt to load a reference to a protected region before reaching a checkpoint, or after reaching a checkpoint but before every other thread has reached a checkpoint. In this case, the trap handler will checkpoint and then block the mutator until every other thread has checkpointed.
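
The rendezvous in that trap handler could be sketched as follows, assuming a fixed thread count and a spinning wait (a real implementation would block on a condition variable, and every name here is ours):

```c
#include <stdatomic.h>

#define NTHREADS 4 /* Assumed fixed mutator count, for illustration. */

static _Atomic int checkpointed = 0;

/* Reaching a checkpoint: perform a local collection, flip this
   thread's roots to tospace, then announce completion. */
static void reach_checkpoint(void) {
    /* ...local collection and root flipping elided... */
    atomic_fetch_add(&checkpointed, 1);
}

/* Trap handler for a load from a protected region: checkpoint this
   thread, then block until every other mutator has checkpointed too,
   after which the faulting load can be retried against tospace. */
static void trap_handler(void) {
    reach_checkpoint();
    while (atomic_load(&checkpointed) < NTHREADS)
        ; /* Spin: the "pause" described above. */
}
```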

This is not a global pause, but it is arguably “a pause” that is out of the control of a thread nonetheless. Oh well. However, as regions targeted for relocation have fewer objects, and we only relocate a small number of regions, trapping like this should be rare. Otherwise, threads can continue execution in tospace.

Also note that this would require register and stack maps to be generated for every read barrier, presumably.

Remap

Collector threads scan the heap and update references to the regions which are being relocated. (We could use remembered sets a la G1 [fn:g1gc] to reduce the number of regions which must be visited.) If a mutator read barrier causes a trap, the loaded reference is corrected in the place that the reference was loaded from.

Thread-local allocation and heaps

So, now how do we make some collections thread-local? We will typically use one region as a nursery. As long as no references are made from the global heap into the nursery, we can collect it using a thread-local collector. If such a reference is made (which will be detected by a write barrier, just like DLG), the nursery will be “jettisoned” by giving up control to the global heap, and we will merely use it as a thread-local allocation buffer.

When a jettisoned nursery is exhausted, we have the choice of either acquiring a new region to use as a thread-local nursery, or just using an allocation buffer. The latter option might be used if a mutator thread consistently jettisons nurseries, or memory is tight and we need to reuse more regions. Note that we can’t use partly filled regions as thread-local nurseries, as they may be referred to by the rest of the global heap.
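
A sketch of the jettison check in the write barrier, with an invented per-thread `nursery` descriptor (this ignores everything else a DLG-style barrier must do):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed per-thread nursery: the bounds of one region, plus a flag
   recording whether it has been given up to the global heap. */
struct nursery {
    uintptr_t lo, hi;
    bool jettisoned;
};

static bool in_nursery(const struct nursery *n, uintptr_t p) {
    return p >= n->lo && p < n->hi;
}

/* Write barrier: storing a nursery reference into a slot outside the
   nursery makes the nursery globally visible, so it is jettisoned and
   used as a mere allocation buffer from then on. */
static void write_barrier(struct nursery *n, uintptr_t *slot,
                          uintptr_t value) {
    if (!n->jettisoned && !in_nursery(n, (uintptr_t)slot) &&
        in_nursery(n, value))
        n->jettisoned = true;
    *slot = value;
}
```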

Of course, this is (close enough to) a generational collector, so we also may reuse partly filled regions to promote old objects into.

The read barrier

In Java, where there is a horrid notion of a “primitive” type, something which produces a reference to an object will always produce a valid reference (or null). In Common Lisp, however, a load may instead produce some other tagged datum. The C4 collector could produce code like:

        mov rax, [place] % The actual load instruction.
        test rax, rax    % Check for null.
        jz skip          % Don't load null.
        mov rbx, [rax]   % Bogus read to trip the MMU.
skip:

As the reference validity test in Common Lisp is more complex, we may consider optimizing the read barrier. We have two heuristics in mind which would reduce the number of required read barriers:

  • An immediate datum has its type tag checked before use. We do not need to check again if we know the value is of some immediate type.
  • A reference to a heap-allocated object is likely to be loaded eventually. If that load instruction is not too far away, we may use that instruction instead of another load.

Another point of concern is how frequently we trip the read barrier, and how often we need to reprotect memory, as both of these operations are considered slow enough to be concerning on an x86 machine. ([fn:compacting] reports 10 microseconds for a trap in 2004, and we found around 2 microseconds on a more recent machine.) We only change memory protections once per compaction cycle, and the read barrier again only trips on references to regions which are being relocated, which are uncommon. So we are mostly concerned with making the fast path of a read barrier very fast.

While we have not tested it, we could also follow [fn:compacting] and clear out (part of) a region on a trap, in order to have fewer (but larger) trap handlers. We could also enqueue such regions to be scanned by GC threads first.

A stupid idea

If we use 16-byte alignment, we get four tag bits and thus can make the tag bits more regular:

  • xxx0 A fixnum.
  • x001 A character.
  • x101 A single float.
  • 0011 A cons cell.
  • 0111 A standard object.
  • 1011 An object rack.
  • 1111 Unused.
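
Under this scheme the tag predicates become simple mask tests; a sketch in C, with values chosen to match the table above:

```c
#include <stdint.h>

/* Tag predicates for the 16-byte-alignment scheme sketched above. */
static int is_fixnum(uintptr_t x)       { return (x & 1) == 0; } /* xxx0 */
static int is_character(uintptr_t x)    { return (x & 7) == 1; } /* x001 */
static int is_single_float(uintptr_t x) { return (x & 7) == 5; } /* x101 */

/* All heap references (cons, standard object, rack) end in ...11,
   which is what makes the two-instruction barrier test possible. */
static int is_heap_ref(uintptr_t x)     { return (x & 3) == 3; }
```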

Now our read barrier is:

        mov rax, [place] % The actual load instruction.
        mov rcx, rax
        and rcx, 3       % All heap references have low bits 11.
        cmp rcx, 3
        jne skip         % Don't load immediate data.
        mov rbx, [rax]   % Bogus read to trip the MMU.
skip:

The only costs are fixing all the magic numbers in SICL, and some storage overhead due to 16-byte alignment (something like 33% at worst for a rack with three slots). On the other hand, 16-byte alignment means dyads won’t straddle cache lines, and SBCL manages just fine with four-bit tags.

References

[fn:pauseless] “The Pauseless GC Algorithm” https://www.usenix.net/legacy/events/vee05/full_papers/p46-click.pdf

[fn:immix] “Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection and Mutator Performance” https://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf

[fn:sliding-gc] “An Improvement to Sliding Garbage Collection” http://metamodular.com/SICL/sliding-gc.pdf

[fn:thread-local-java] “Thread-local heaps for Java” https://dl.acm.org/doi/10.1145/512429.512439

[fn:thread-local-ml] “A concurrent, generational garbage collector for a multithreaded implementation of ML” https://xavierleroy.org/publi/concurrent-gc.pdf

[fn:eliminating-read-barriers] “Eliminating Read Barriers through Procrastination and Cleanliness” https://www.cs.purdue.edu/homes/suresh/papers/ismm12.pdf

[fn:g1gc] “Garbage-First Garbage Collection” http://cs.williams.edu/~dbarowy/cs334s18/assets/p37-detlefs.pdf

[fn:compacting] “Mostly concurrent compaction for mark-sweep GC” https://dl.acm.org/doi/10.1145/1029873.1029877

[fn:sicl] “SICL: Building blocks for creators of Common Lisp implementations” http://metamodular.com/SICL/sicl-specification.pdf

[fn:zgc] “A first look into ZGC” https://dinfuehr.github.io/blog/a-first-look-into-zgc/

[fn:reference-objects] “Reference Object Processing in On-The-Fly Garbage Collection” https://kar.kent.ac.uk/40820/1/ismm012-ugawa.pdf
