no-defun-allowed/partial-read-barriers.md

## partial-read-barriers.md

      
    Raw
  

              partial-read-barriers.md
            
          
    The "partial read barrier" and Common Lisp

Most CLOS implementations use a representation for instances which
uses an indirection, in order to implement class redefinition and
change-class. In this document I propose a technique which
avoids the indirection most of the time, describe how to optimise
around the assumption that a class does not change, and how
this technique fares with multiple threads.
The partial read barrier and Smalltalk

Smalltalk also allows for class redefinition, and has a become:
method which swaps objects in memory (with similar power to
change-class in Common Lisp). Thus, Smalltalk implementations
either required an indirection for all objects (often through the
"object table"), or the implementations required the forementioned
operations to scan the whole heap to rewrite references to objects
affected by the operations. We'll focus solely on how to implement
class redefinition in most of this document; the only difference
between class redefinition and change-class is the cost model.
Miranda and Béra
implemented a partial read barrier for Smalltalk.
In their scheme, an object may be forwarded, requiring accesses
of the objects to follow a forwarding pointer to access the instance
variables of the object. The forwarding pointer points to the target
of a forwarded object.
Accessing the instance variables ("slots" in Common Lisp terminology)
does not require following a forwarding pointer, if an object has not
had its class redefined, and the object has not been involved in a
become: operation. Forwarding is only required after one of the two
forementioned events occurs to an instance, and the object is marked
as forwarded when such events occur.
(There is somewhat a jump in terminology here, from "indirection"
to "read barrier". But it is justified; the conventional "indirect"
representation requires something like Brooks's read barrier,
which always follows a forwarding pointer, before an instance can
be used. The Miranda-Béra partial read barrier resembles
Baker's read barrier, which
tests if the object has been forwarded, before following a forwarding
pointer. So I'll call the actions that must be done before accessing the
slots of an instance a "read barrier".)
The described read barrier requires another branch, in order to
check if an indirection is necessary or not. But the branch is
easily predictable, as almost all objects are not forwarded; and
the barrier eliminates a data dependency as no forwarding
pointer needs to be followed.
Miranda and Béra, however, use a special class ID to represent
if an object has been forwarded, which allows method dispatch
itself to test if an object has been forwarded. As such, their
read barrier is "partial" in that the barrier only is performed
after method dispatch has encountered a forwarded object.
Regardless of how the need for forwarding is tested, the
forwarding pointer is stored by overwriting the first instance
variable of an object, as the instance variables of that object
will not be accessed anymore. (If an object has no instance
variables, another word is added to the end of the object,
and the word is only used for a forwarding pointer.)
It is also worth noting that the garbage collector could replace
each reference to a forwarded object with the target of each
object. If the garbage collector performs this task concurrently[1],
the mutator must follow forwarding pointers before performing
pointer-equality tests, as the mutator can encounter different
pointers which still represent the same object.
[1] Similarly to incremental compaction,
the collector could find all locations with pointers that must be
fixed without pausing the mutator, and then fix the locations
in a much shorter pause, providing short pause times and
avoiding introducing overhead on the mutator.
An adaptation for Common Lisp

The forementioned technique works perfectly for a Common
Lisp implementation, but we would need to make some changes
for a high-performance implementation.
Most Common Lisp implementations support using multiple
threads. The forwarding pointer is stored in a way which
produces undesirable behaviour in the presence of
data races. Suppose we have a class and instance
(defclass c () ((%a :initform 'a :reader a)))
(defvar *instance* (make-instance 'c))
and we redefine the class c, while another thread is evaluating
(a *instance*). The thread has just dispatched on the class
of the instance, and we forward the object just before the
thread reads the slot %a. The thread reads a forwarding
pointer, rather than any value which the program has ever stored
in %a, which is unexplainable by the semantics of the programming
language. I believe this is an undesirable possibility, and that
concurrent programming is hard enough without the possibility of an
implementation producing unexplainable values.
We may instead store the forwarding pointer in another word
in an instance, leaving all slots of the instance intact. This
introduces some space overhead, but it is no worse than the
conventional representation of instances.
There is another somewhat similar situation, where we might
be tempted to reuse an object when the size of the object does
not change after redefinition. I also think that this optimisation
produces unexplainable values due to data races. Suppose we
redefined the class by
(defmethod update-instance-for-redefined-class
    ((c c) added deleted plist &key)
  (setf (slot-value c '%b) 'b))
  
(defclass c () ((%b :reader b)))
and the same data race occurs. The thread now reads b
which is also quite odd, as b was never written into the
slot %a before. I would also suggest to not reuse objects in this
manner.
Optimisations

We might want to optimise under the assumption that a class
is not redefined, or that the class of an instance does not change.
For example, in
(defmethod d ((c c))
  (unknown-function)
  (a c))
we may want to inline the method of a which is specialised on c,
but such inlining is not sound as (unknown-function) may redefine
the class c. We may inline if we have a way to replace the call
frame if either situation occurs (dynamic deoptimisation).
Then we would never reach the inlined code after redefinition, and
so inlining is sound.
However, I am not sure if scanning the stack for code which has
invalidated assumptions would be fast enough for change-class.
An application may legitimately call change-class many times,
and the function only actually modifies one instance at a time,
leading to little work being done per scan. One approach may be
to scan the stack lazily,
which minimises the work done per call to change-class. Such
lazy scanning also accumulates many instances to check for
in older call frames, as long as those frames are not returned to.
It may not be necessary to scan the stacks of other threads. Either
the other threads correctly synchronise, and thus will naturally
encounter that the class of the instance has changed, or the threads
are involved in a data race. In the latter case, the threads may
access the object as if its class had not been changed yet, which
is acceptable behaviour.
But if it is also desired to avoid such behaviour, the other threads
can resume execution immediately after scanning their own stacks,
with no stop-the-world pause [2]. Escape analysis and/or thread-local
heaps may also help to prove that no other threads could possibly
be relying on assumptions of the class of an instance, when the
object is not reachable by other threads.
Scanning the stacks of all threads, however, could save some space.
The representation used by Miranda and Béra (with the forwarding
pointer stored in the first slot of an instance) could be used in
a concurrent setting, as no threads will attempt to access the slots
of the forwarding object after scanning the stack.
[2] Actually, I think the protocol to avoid stopping the world would be
more complicated, like the handshaking protocol of any on-the-fly
garbage collector. But it might not matter; the Go garbage collector
and the ZGC of Java lazily scan stacks but still stop the world, and
produce pause times below a hundred microseconds.