pervognsen/cliff_threadsafe_inline_caching.md

## cliff_threadsafe_inline_caching.md

      
Per,
Patching inline-caches is rare, so taking a lock & doing single-threaded updates is fine.  No CAS needed for performance, nor is false-sharing an issue (this data is also code, so nearly always resides with other read-only code).
Patching typically covers a bunch of X86 ops, at least 2 but maybe 3 or 4, depending.  If an update to these words lands in the middle of an instruction, another CPU racing with the patch might see a partially updated instruction.
Patching covers a set of X86 instruction words, more than fits in an 8-byte CAS.  16-byte CAS's must be aligned properly.
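
To make the "lock plus whole-word CAS" idea concrete, here is a minimal sketch in C11. It only models the patch site as an aligned 8-byte atomic word; a real JIT would also have to deal with writable code pages and instruction encoding. The names `code_word_t`, `patch_lock` and `patch_word` are made up for illustration, and the tiny spinlock stands in for whatever lock the runtime actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for 8 aligned bytes of patchable code. */
typedef _Atomic uint64_t code_word_t;

/* Patching is rare, so one global lock serializes all patchers. */
static atomic_flag patch_lock = ATOMIC_FLAG_INIT;

/* Replace one aligned word that holds only whole instructions.
   Returns false if the word no longer holds `expected`,
   for example because another thread already patched it. */
bool patch_word(code_word_t *site, uint64_t expected, uint64_t desired) {
    while (atomic_flag_test_and_set_explicit(&patch_lock, memory_order_acquire)) {
        /* spin: contention is expected to be very low */
    }
    /* The CAS is not here for performance; it makes the whole word
       change at once for any CPU that is executing it. */
    bool ok = atomic_compare_exchange_strong(site, &expected, desired);
    atomic_flag_clear_explicit(&patch_lock, memory_order_release);
    return ok;
}
```

A site that needs more bytes than one CAS can cover would call something like `patch_word` several times, once per aligned word.
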
Putting all this together, these constraints imply:

- Updates are done via CAS covering whole instructions.  No CAS spans a partial instruction.
- Instructions are not allowed to span a cache-line boundary; really a 16-byte boundary for D-CAS.
- Typically, multiple CASes are needed to cover all instructions.
- Since instructions cannot span a 16-byte boundary, sometimes up to 3 NOPs are used for padding.
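
The padding rule can be sketched as a small helper that a code emitter might call before emitting a patchable instruction. This is only an illustration: `nops_before` and its parameters are hypothetical names, and it assumes single-byte NOPs.

```c
#include <assert.h>
#include <stddef.h>

/* Number of 1-byte NOPs to emit before an instruction of `len` bytes
   placed at byte offset `offset`, so that the instruction does not
   cross a `boundary`-byte boundary (16 for a 16-byte CAS). */
size_t nops_before(size_t offset, size_t len, size_t boundary) {
    assert(len > 0 && len <= boundary);
    size_t start = offset % boundary;   /* position inside the current block */
    if (start + len <= boundary) {
        return 0;                       /* already fits, no padding needed */
    }
    return boundary - start;            /* pad up to the next boundary */
}
```

For example, `nops_before(13, 5, 16)` is 3: a 5-byte instruction starting at offset 13 would cross the boundary, so it gets pushed to offset 16. In this model a 4-byte item never needs more than 3 NOPs, which is consistent with the figure above.
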

The instruction updates themselves can be executed by racing other CPUs, and any partial update can be seen by any CPU.  Since whole instructions are patched, other CPUs can see some combination of patched and unpatched instructions (but no partial instructions).  Patches may not be seen in the order patched, even with fences.  I-Cache flushing only helps the patching CPU.  Putting this together, these constraints imply:

- The instruction ops monotonically move from some safe-but-boring state (initial) to a caching monomorphic or megamorphic state.  If any thread sees a partial patch, it always defaults to the "safe-but-boring" mode.
- You can't change the caching monomorphic state to cache any value other than the first one it cached, EXCEPT with a Safepoint.  Otherwise some CPU might test the original Key but end up calling the newly cached Value.
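
Here is a toy model of why the first patch is safe and a later re-patch is not. It is a sketch under assumptions, not how any particular VM does it: the site is reduced to a (key, target) pair, the initial state is assumed to call a resolver stub that always takes the slow path, real class ids are assumed to be nonzero, and all names (`ic_view`, `call_is_safe`, `patch_is_safe`) are invented for this example.

```c
#include <stdbool.h>

enum { NO_CLASS = 0, RESOLVER = -1 };

/* What one CPU happens to observe at the call site.
   key:    class constant loaded before the call
   target: class whose method is cached, or RESOLVER */
typedef struct { int key; int target; } ic_view;

/* A call is safe if it reaches the slow path or the right method. */
bool call_is_safe(ic_view v, int receiver_class) {
    if (v.target == RESOLVER) return true;        /* safe-but-boring */
    if (receiver_class != v.key) return true;     /* entry check misses */
    return v.target == receiver_class;            /* check passed: must be the right method */
}

/* Key and target are patched separately, so a racing CPU may see
   any mix of old and new. Check all four combinations. */
bool patch_is_safe(ic_view old, ic_view next, int receiver_class) {
    int keys[2]    = { old.key, next.key };
    int targets[2] = { old.target, next.target };
    for (int k = 0; k < 2; k++) {
        for (int t = 0; t < 2; t++) {
            ic_view seen = { keys[k], targets[t] };
            if (!call_is_safe(seen, receiver_class)) return false;
        }
    }
    return true;
}
```

In this model, going from `{NO_CLASS, RESOLVER}` to `{1, 1}` is safe for every receiver, while going from `{1, 1}` to `{2, 2}` is not: a CPU can see the old key 1 with the new target 2, pass the check with a class-1 receiver, and run the method cached for class 2.
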

There are several X86 op sequences that meet all these constraints.  I've tried a few different ones.  I think the OpenJDK uses:
  mov RAX,0xclass_constant // must patch this constant klass
  call cached_unverified_entry // must patch this target address

At the entry point we have:
  cmp [RDI+4],RAX // check the object header for being the correct class
  jne slow_path // inline-cache misses, jump to slow fixup code
  // else we are the correct class, so execute the cached method
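
The same call-site and entry-point pair can be modeled in C to show the control flow. This is a rough sketch only: the object layout (a 32-bit class field at offset 4), the `call_site` struct, and the function names are assumptions for illustration, and real code would jump to fixup code instead of returning an enum.

```c
#include <stdint.h>

/* Assumed layout: class id stored 4 bytes into the object header. */
typedef struct { uint32_t mark; uint32_t klass; } object;

typedef enum { RAN_CACHED, TOOK_SLOW_PATH } outcome;

/* Models the callee's entry check: compare the receiver's class
   with the constant the call site loaded into RAX. */
outcome unverified_entry(const object *receiver, uint32_t rax_klass) {
    if (receiver->klass != rax_klass) {
        return TOOK_SLOW_PATH;   /* inline-cache miss */
    }
    return RAN_CACHED;           /* correct class: run the cached method */
}

/* Models the two patched fields of the call site. */
typedef struct {
    uint32_t klass_constant;                       /* the mov's constant */
    outcome (*target)(const object *, uint32_t);   /* the call's target */
} call_site;

outcome invoke(const call_site *site, const object *receiver) {
    return site->target(receiver, site->klass_constant);
}
```

With a site caching class 7, a receiver of class 7 runs the cached method and a receiver of class 9 falls to the slow path.
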

Hope this helps,
Cliff