
@no-defun-allowed
Last active November 23, 2021 07:59
Notes on Pauseless Immix with Thread-Local Heaps (sometimes)

Introduction

We want a thread-local GC for SICL. If most collections are thread-local, we improve both latency and throughput, even with many threads.

The plan for a while was to use something like the Doligez-Leroy-Gonthier (DLG) collector [fn:thread-local-ml]. However, the DLG collector “crucially relies on the ML compile-time distinction between mutable and immutable objects”: immutable objects are allocated thread-locally while mutable objects are always allocated globally. Common Lisp has no such distinction; even objects with no mutable slots have identity under eq.

[fn:eliminating-read-barriers] describes a modification of the DLG collector which allows for thread-local mutable objects. When an object is made global, generally the whole heap would have to be traced to move references to the object into the global heap. The cost of this tracing is reduced by a “cleanliness” heuristic, which tracks the bounds of where the object may be referenced in the nursery. However, making objects global still requires scanning the stack, the registers, and, at best, some suffix of the heap.

One option for reducing tracing is to promote more objects into the global heap, but this increases pressure on the global collector, so it is not a great solution. And, of course, if objects are made global very frequently, there is not much we could do which is still thread-local.

The heaps

Instead, we are looking to migrate objects to the global heap without moving them in memory. [fn:thread-local-java] achieves this by using mark-sweep for thread-local and global heaps, so that migration can be achieved by modifying lists of objects. However, we are not sure that using mark-sweep for a nursery is a good idea.

Speaking of nursery collectors, we would prefer a collector which preserves allocation order, so that it is easy to identify the oldest objects to promote. As such, we still plan to use Strandh’s sliding collector [fn:sliding-gc].

We plan to use the Immix mark-region collector [fn:immix] for the global heap, as it can collect memory without compacting all of the time. Using different collectors means that we must rebuild some of the metadata used by Immix; specifically the line allocation bitmap must be rebuilt. But, as mark-compact uses contiguous allocation, it is a matter of filling the bitmap up to the last line used.
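
As a minimal sketch of what rebuilding that metadata could look like (in C, with line size, region size, and the `region` struct all being our own assumptions, not SICL's):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE    256  /* Immix line size in bytes (assumed). */
#define REGION_LINES 1024 /* Lines per region (assumed). */

/* Hypothetical region descriptor: a base address, a bump pointer, and
   the per-line allocation bitmap Immix expects. */
struct region {
    uintptr_t base;
    uintptr_t bump; /* First free byte; mark-compact allocates contiguously. */
    uint8_t line_map[REGION_LINES];
};

/* After a sliding (mark-compact) nursery collection, all survivors sit
   contiguously at the bottom of the region, so rebuilding the line map
   is just "mark every line up to the bump pointer as used". */
static void rebuild_line_map(struct region *r) {
    size_t used = r->bump - r->base;
    size_t last = (used + LINE_SIZE - 1) / LINE_SIZE; /* Round up. */
    memset(r->line_map, 1, last);
    memset(r->line_map + last, 0, REGION_LINES - last);
}
```

No per-object scan is needed, which is the payoff of using a sliding collector in the nursery.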

As we still want a concurrent and parallel collector, and ideally one without global pauses, we intend to use most of the Pauseless collection algorithm [fn:pauseless], with some modifications.

Initially, we will just have the nursery be one region, so it follows that the region size should be a good nursery size. On our hardware, the L2 cache appears to be the largest cache which is not shared between cores, so we would probably make regions about the size of L2 cache.

Collection

Mark

Marking is performed as usual for an on-the-fly collector. Pauseless has another marking scheme which uses a “not marked through” bit on references, with the benefit that “the concurrent Mark phase will complete in a single pass despite the mutators busily modifying the heap.” However, it also requires a read barrier which is frequently trapped, resulting in a “trap storm”. As we intend to rely on the memory management unit for trapping, and it is not awfully fast, we do not want read barriers to trap frequently.

We are also not convinced that other write barrier schemes would fail to complete in a single pass, regardless of mutation rate. It is also not as if writes occur in a vacuum; we could expect roughly every write to be accompanied by a read (as otherwise we would have pointless writes), and we suspect Common Lisp programs use relatively more immutable/persistent data.

Note that DLG uses a write barrier which sets a global “dirty” bit, signifying that the heap needs to be scanned again for grey objects; but repeated full scans are inefficient with large heaps and high mutation rates. Pauseless and C4 instead enqueue objects to be processed by marking threads when a read barrier encounters an unmarked reference. Strandh proposes multiple bitmaps to be used as an “index” to zero in on a grey object quickly (p. 113 of [fn:sicl]). While the Pauseless approach indeed bounds execution time by the number of references in the heap, we anticipate Strandh’s index will be faster in practice, as it is unlikely that every object is modified in each GC cycle, and enqueuing is typically slower than updating bitmaps.
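
A two-level bitmap is one plausible realization of such an index; here is a hedged sketch in C (the field names and sizes are our own assumptions, not Strandh's design):

```c
#include <stdint.h>

#define WORDS 64 /* Fine bitmap: 64 words cover 4096 object slots (assumed). */

/* Two-level "index": summary bit i is set iff fine[i] is nonzero, so
   finding a grey object costs two count-trailing-zeros operations
   instead of a linear scan over the whole bitmap. */
struct grey_index {
    uint64_t summary;
    uint64_t fine[WORDS];
};

static void mark_grey(struct grey_index *ix, unsigned slot) {
    ix->fine[slot / 64] |= 1ull << (slot % 64);
    ix->summary |= 1ull << (slot / 64);
}

/* Returns the lowest grey slot and clears it, or -1 if none remain. */
static int pop_grey(struct grey_index *ix) {
    if (ix->summary == 0)
        return -1;
    unsigned w = (unsigned)__builtin_ctzll(ix->summary);
    unsigned b = (unsigned)__builtin_ctzll(ix->fine[w]);
    ix->fine[w] &= ~(1ull << b);
    if (ix->fine[w] == 0)
        ix->summary &= ~(1ull << w);
    return (int)(w * 64 + b);
}
```

With few grey objects per cycle, most of both bitmaps stays zero, which is exactly the case where this beats an enqueue-per-reference scheme.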

We read that ZGC uses the same read barrier as C4 on x86-64. As described in [fn:zgc], a bit mask is tested against an object pointer, and if the mask matches the pointer, the pointer must be corrected. While this avoids MMU trap overhead, it does not avoid the “trap storm” problem.

Sweep

The same as for Immix: use the mark bits to free lines in regions and return them to free lists. We separate sweeping and compacting to make reasoning about this collector somewhat more tractable.

Weak references

We probably have to handle weak references at some point, so here is a mediocre idea: a weak reference is a pair of a “lock” and the actual reference. The lock is in one of three states: unused, read by mutator, and deleted. When a mutator wants to read a weak reference:

  • if the lock is unused, try to CAS to read by mutator, else try again
  • if the lock has been read by mutator, read the actual reference
  • if the lock has been deleted, the weak reference has been killed
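
The protocol above can be sketched in C with C11 atomics; the state names and the spin loop are illustrative, not a real SICL interface:

```c
#include <stdatomic.h>
#include <stddef.h>

enum weak_state { WEAK_UNUSED, WEAK_READ_BY_MUTATOR, WEAK_DELETED };

/* A weak reference as described above: a small "lock" plus the referent. */
struct weak_ref {
    _Atomic int lock;
    void *referent;
};

/* Mutator-side read: loop until we either pin the reference by moving
   the lock to "read by mutator", or learn the collector deleted it.
   Returns the referent, or NULL if the weak reference was killed. */
static void *weak_read(struct weak_ref *w) {
    for (;;) {
        int s = atomic_load(&w->lock);
        if (s == WEAK_DELETED)
            return NULL;
        if (s == WEAK_READ_BY_MUTATOR)
            return w->referent;
        int expected = WEAK_UNUSED;
        if (atomic_compare_exchange_weak(&w->lock, &expected,
                                         WEAK_READ_BY_MUTATOR))
            return w->referent;
        /* CAS lost a race; try again. */
    }
}

/* Collector-side kill: only succeeds if no mutator has pinned it. */
static int weak_kill(struct weak_ref *w) {
    int expected = WEAK_UNUSED;
    return atomic_compare_exchange_strong(&w->lock, &expected, WEAK_DELETED);
}
```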

The collector should process weak references at the end of the marking phase. However, weak references pose a termination problem: it is possible for the mutator to create a new grey object by reading a weak reference. To disallow this, we imagine we could efficiently disallow mutators from following unlocked weak references, ensure that there are no grey objects, then perform a proper atomic flip to have every mutator check mark bits before deciding to strengthen a weak reference.

We could also use a properly “on the fly” solution, such as the one proposed in [fn:reference-objects], and it would not be too much more sophisticated.

Relocate

Here is where the algorithm gets interesting. The first thing the GC does is protect the regions to relocate, and then request that all mutators checkpoint and flip any roots they have to tospace. We imagine that this would require mutators to perform local collections and update any references to protected regions, much like the DLG collector.

There is still the possibility that a mutator will attempt to load a reference to a protected region before reaching a checkpoint, or after reaching a checkpoint but before every other thread has reached a checkpoint. In this case, the trap handler will checkpoint and then block the mutator until every other thread has checkpointed.
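
The rendezvous in that trap handler could be sketched as follows, assuming a fixed thread count and a spinning wait (a real implementation would block on a condition variable, and every name here is ours):

```c
#include <stdatomic.h>

#define NTHREADS 4 /* Assumed fixed mutator count, for illustration. */

static _Atomic int checkpointed = 0;

/* Reaching a checkpoint: perform a local collection, flip this
   thread's roots to tospace, then announce completion. */
static void reach_checkpoint(void) {
    /* ...local collection and root flipping elided... */
    atomic_fetch_add(&checkpointed, 1);
}

/* Trap handler for a load from a protected region: checkpoint this
   thread, then block until every other mutator has checkpointed too,
   after which the faulting load can be retried against tospace. */
static void trap_handler(void) {
    reach_checkpoint();
    while (atomic_load(&checkpointed) < NTHREADS)
        ; /* Spin: the "pause" described above. */
}
```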

This is not a global pause, but it is arguably “a pause” that is out of the control of a thread nonetheless. Oh well. However, as regions targeted for relocation have fewer objects, and we only relocate a small number of regions, trapping like this should be rare. Otherwise, threads can continue execution in tospace.

Also note that this would require register and stack maps to be generated for every read barrier, presumably.

Remap

Collector threads scan the heap and update references to the regions which are being relocated. (We could use remembered sets a la G1 [fn:g1gc] to reduce the number of regions which must be visited.) If a mutator read barrier causes a trap, the loaded reference is corrected in the place that the reference was loaded from.

Thread-local allocation and heaps

So, now how do we make some collections thread-local? We will typically use one region as a nursery. As long as no references are made from the global heap into the nursery, we can collect it using a thread-local collector. If such a reference is made (which will be detected by a write barrier, just like DLG), the nursery will be “jettisoned” by giving up control to the global heap, and we will merely use it as a thread-local allocation buffer.

When a jettisoned nursery is exhausted, we have the choice of either acquiring a new region to use as a thread-local nursery, or just using an allocation buffer. The latter option might be used if a mutator thread consistently jettisons nurseries, or memory is tight and we need to reuse more regions. Note that we can’t use partly filled regions as thread-local nurseries, as they may be referred to by the rest of the global heap.
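
A sketch of the jettison check in the write barrier, with an invented per-thread `nursery` descriptor (this ignores everything else a DLG-style barrier must do):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed per-thread nursery: the bounds of one region, plus a flag
   recording whether it has been given up to the global heap. */
struct nursery {
    uintptr_t lo, hi;
    bool jettisoned;
};

static bool in_nursery(const struct nursery *n, uintptr_t p) {
    return p >= n->lo && p < n->hi;
}

/* Write barrier: storing a nursery reference into a slot outside the
   nursery makes the nursery globally visible, so it is jettisoned and
   used as a mere allocation buffer from then on. */
static void write_barrier(struct nursery *n, uintptr_t *slot,
                          uintptr_t value) {
    if (!n->jettisoned && !in_nursery(n, (uintptr_t)slot) &&
        in_nursery(n, value))
        n->jettisoned = true;
    *slot = value;
}
```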

Of course, this is (close enough to) a generational collector, so we also may reuse partly filled regions to promote old objects into.

The read barrier

In Java, where there is a horrid notion of a “primitive” type, something which produces a reference to an object will always produce a valid reference (or null). In Common Lisp, however, a load may instead produce some other tagged datum. The C4 collector could produce code like:

        mov rax, [place] % The actual load instruction.
        test rax, rax    % Check for null.
        jz skip          % Don't load null.
        mov rbx, [rax]   % Bogus read to trip the MMU.
skip:

As the reference validity test in Common Lisp is more complex, we may consider optimizing the read barrier. We have two heuristics in mind which would reduce the number of required read barriers:

  • An immediate datum has its type tag checked before use. We do not need to check again if we know the value is of some immediate type.
  • A reference to a heap-allocated object is likely to be loaded eventually. If that load instruction is not too far away, we may use that instruction instead of another load.

Another point of concern is how frequently we trip the read barrier, and how often we need to reprotect memory, as both of these operations are considered slow enough to be concerning on an x86 machine. ([fn:compacting] reports 10 microseconds for a trap in 2004, and we found around 2 microseconds on a more recent machine.) We only change memory protections once per compaction cycle, and the read barrier again only trips on references to regions which are being relocated, which are uncommon. So we are mostly concerned with making the fast path of a read barrier very fast.

While we have not tested it, we could also follow [fn:compacting] and clear out (part of) a region on a trap, in order to have fewer (but larger) trap handlers. We could also enqueue such regions to be scanned by GC threads first.

A stupid idea

If we use 16-byte alignment, we get four tag bits and thus can make the tag bits more regular:

  • xxx0 A fixnum.
  • x001 A character.
  • x101 A single float.
  • 0011 A cons cell.
  • 0111 A standard object.
  • 1011 An object rack.
  • 1111 Unused.
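
Under this scheme the tag predicates become simple mask tests; a sketch in C, with values chosen to match the table above:

```c
#include <stdint.h>

/* Tag predicates for the 16-byte-alignment scheme sketched above. */
static int is_fixnum(uintptr_t x)       { return (x & 1) == 0; } /* xxx0 */
static int is_character(uintptr_t x)    { return (x & 7) == 1; } /* x001 */
static int is_single_float(uintptr_t x) { return (x & 7) == 5; } /* x101 */

/* All heap references (cons, standard object, rack) end in ...11,
   which is what makes the two-instruction barrier test possible. */
static int is_heap_ref(uintptr_t x)     { return (x & 3) == 3; }
```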

Now our read barrier is:

        mov rax, [place] % The actual load instruction.
        mov rcx, rax
        and rcx, 3       % All heap references have low bits 11.
        cmp rcx, 3
        jne skip         % Don't load immediate data.
        mov rbx, [rax]   % Bogus read to trip the MMU.
skip:

The only costs are fixing all the magic numbers in SICL, and some storage overhead due to 16-byte alignment (something like 33% at worst for a rack with three slots). On the other hand, 16-byte alignment means dyads won’t straddle cache lines, and SBCL manages just fine with four-bit tags.

References

[fn:pauseless] “The Pauseless GC Algorithm” https://www.usenix.net/legacy/events/vee05/full_papers/p46-click.pdf

[fn:immix] “Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection and Mutator Performance” https://www.cs.utexas.edu/users/speedway/DaCapo/papers/immix-pldi-2008.pdf

[fn:sliding-gc] “An Improvement to Sliding Garbage Collection” http://metamodular.com/SICL/sliding-gc.pdf

[fn:thread-local-java] “Thread-local heaps for Java” https://dl.acm.org/doi/10.1145/512429.512439

[fn:thread-local-ml] “A concurrent, generational garbage collector for a multithreaded implementation of ML” https://xavierleroy.org/publi/concurrent-gc.pdf

[fn:eliminating-read-barriers] “Eliminating Read Barriers through Procrastination and Cleanliness” https://www.cs.purdue.edu/homes/suresh/papers/ismm12.pdf

[fn:g1gc] “Garbage-First Garbage Collection” http://cs.williams.edu/~dbarowy/cs334s18/assets/p37-detlefs.pdf

[fn:compacting] “Mostly concurrent compaction for mark-sweep GC” https://dl.acm.org/doi/10.1145/1029873.1029877

[fn:sicl] “SICL: Building blocks for creators of Common Lisp implementations” http://metamodular.com/SICL/sicl-specification.pdf

[fn:zgc] “A first look into ZGC” https://dinfuehr.github.io/blog/a-first-look-into-zgc/

[fn:reference-objects] “Reference Object Processing in On-The-Fly Garbage Collection” https://kar.kent.ac.uk/40820/1/ismm012-ugawa.pdf
