Background

We store class information in Klass, and resolving Klass from an oop is a hot path. One example is GC: during GCs, we churn through tons of objects and frequently need at least the object size (layout helper) and the OopMap from Klass. Therefore, the way we resolve an nKlass matters for performance.

Today (non-Lilliput), with compressed class pointers, we go from object to Klass this way: we pluck the nKlass from the word adjacent to the mark word (MW). We then calculate Klass* from nKlass by, typically, just adding the encoding base as an immediate. We may or may not omit that add, and we may or may not shift the nKlass, but the most typical case (CDS enabled) is just the addition.

Today's decoding needs not a single memory access; it can happen in registers only.
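As a minimal sketch (the identifiers and the base value are illustrative, not the actual HotSpot names):

```c++
#include <cstdint>

struct Klass; // opaque here

// Illustrative encoding base; the real base is chosen at class-space setup.
static const uintptr_t kEncodingBase = 0x0000000800000000ULL;

// Non-Lilliput decode in the typical CDS case: no shift, just one add.
static inline Klass* decode_nklass(uint32_t nklass) {
  return reinterpret_cast<Klass*>(kEncodingBase + nklass);
}
```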

In Lilliput, the nKlass lives in the MW itself (which allows us to load the nKlass with the same load that loads the MW). Therefore, nKlass needs to shrink. The problem with the classic 32-bit nKlass is that its value range is used ineffectively. Klass structures tend to be large, on average 500-700 bytes [1], which means a lot of values in that 32-bit range are "blind" (they point into the middle of a class) and are hence wasted. Ideally, one would want to use one nKlass value per class. In Lilliput, we reduced nKlass to 22 bits. We do this by placing Klass structures only at 1KB-aligned addresses; therefore, the lower 10 bits are 0 and can be shifted out. 1KB was chosen as a middle ground that lets us use both the nKlass value range and the Klass encoding range (4GB) effectively.
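Decoding then shifts those ten zero bits back in before adding the base; again just a sketch with illustrative names:

```c++
#include <cstdint>

struct Klass;

static const uintptr_t kEncodingBase = 0x0000000800000000ULL; // illustrative
static const int       kKlassShift   = 10; // log2 of the 1KB alignment

// Lilliput decode: a 22-bit nKlass shifted by 10 spans the full 4GB
// encoding range. Still register-only, one shift plus one add.
static inline Klass* decode_nklass_lilliput(uint32_t nklass) {
  return reinterpret_cast<Klass*>(kEncodingBase +
                                  ((uintptr_t)nklass << kKlassShift));
}
```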

The Problem

By keeping Klass structures 1KB-aligned, we march in lockstep with respect to CPU caches. With a cache line size of 64 bytes (6 bits) and an alignment of 1KB (10 bits), we lose 4 bits of entropy. Therefore, loads from a Klass structure have a high chance of evicting earlier loads from other Klass structures. Depending on saturation and the number of cache ways, we may use only a 16th of the cache.
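To make the entropy loss concrete, here is a small sketch; it assumes a typical L1 data cache of 32KB with 64-byte lines and 8 ways, i.e. 64 sets:

```c++
#include <cstdio>
#include <cstdint>

int main() {
  // Assumed cache geometry: 32KB L1, 64-byte lines, 8-way => 64 sets.
  const unsigned num_sets = 64;
  // Walk 1KB-aligned Klass start addresses and compute their cache set.
  for (uintptr_t addr = 0; addr < 16 * 1024; addr += 1024) {
    unsigned set = (unsigned)((addr >> 6) % num_sets); // set index = bits [11:6]
    printf("Klass at offset %5zu -> cache set %2u\n", (size_t)addr, set);
  }
  // Only sets 0, 16, 32 and 48 appear: 4 of 64 sets, a 16th of the cache.
  return 0;
}
```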

We see this effect clearly, especially in GC pause times. The bad cache behavior is plainly noticeable and needs to be solved.

The solutions

Short-term mitigation: Increasing the nKlass size to 26 bits

A simple short-term mitigation for Lilliput Milestone 1 is to just increase the nKlass size again. Shrinking nKlass to 22 bits freed up 10 bits, but to date we use only 6 of them, leaving 4 spare bits in the header for now. We could increase the nKlass to 26 bits and work with a shift of 6 bits instead of 10. That would require no additional work. Klass would be aligned to just 64 bytes, one cache line, so the cache performance problems would disappear.

Note, however, that we earmarked those 4 spare bits for Valhalla's use. Therefore, reverting to a 26-bit nKlass can only be an intermediate step. And, obviously, it won't do for 32-bit headers.

Use a pointer indirection table

This idea resurfaces every couple of years. The idea is to replace the class space with an indirection pointer table. In this approach, an nKlass would be an index into a global pointer table, and that pointer table contains the real Klass* pointers.

The enticing part of this approach is that we could throw away the class space and a bunch of supporting code. Klass structures could live in normal Metaspace like all the other data. However, we would need some new code to maintain the global Klass* table, recycle empty pointer slots after class unloading, etc.

Decoding an nKlass would mean:

  • load the nKlass from the object
  • load the Klass* from the indirection table, using nKlass as the index
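A sketch of that two-step decode, with an assumed global table (the prototypes' actual code may differ):

```c++
#include <cstdint>

struct Klass;

// Global indirection table, indexed by nKlass; slots are filled on class
// load and recycled after class unloading (maintenance code not shown).
extern Klass** g_klass_table;

static inline Klass* decode_nklass_table(uint32_t nklass) {
  // This table load is the extra memory access in the hot decoding path.
  return g_klass_table[nklass];
}
```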

The approach would solve the cache problem described above since removing any alignment requirement from Klass structures allows us to place them wherever we like (e.g., in standard Metaspace), and their start addresses would not march in lockstep.

However, it introduces a new cache problem, since we now have a new load in the hot decoding path. And the locality of the Klass pointer table can only be improved so much: only 8 uncompressed pointers fit into a 64-byte cache line. Beyond a certain number of classes, subsequent table accesses will have little spatial locality.

Place Klass on alternating cache lines

This was originally brought up by John Rose [2] when we did the first iteration of 22-bit class pointers in Lilliput. The idea is to stagger the locations of Klass structures across cache lines. Nothing forces us to use a power-of-two stride; we can use any odd multiple of the cache line size we like. For example, a stride of 11 cache lines (704 bytes) means that successive Klass structures start on different cache lines.

With a non-pow2 stride, decoding becomes a bit more complex. We cannot use a shift; we need an integer multiplication:

  • multiply nKlass by 704
  • add the base.

But all of this can still happen in registers only. No memory load is needed.
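A sketch of this decode, reusing the illustrative names from the earlier sketches:

```c++
#include <cstdint>

struct Klass;

static const uintptr_t kEncodingBase = 0x0000000800000000ULL; // illustrative
static const uintptr_t kKlassStride  = 11 * 64; // 704 bytes, non-pow2

// Decode with a non-pow2 stride: one integer multiply plus one add,
// register-only, like the shift variant but with imul instead of shl.
static inline Klass* decode_nklass_stride(uint32_t nklass) {
  return reinterpret_cast<Klass*>(kEncodingBase +
                                  (uintptr_t)nklass * kKlassStride);
}
```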

The prototypes

I wanted to measure the performance impact of each approach, so I compared four JVMs:

  • A) (the unmitigated case) a Lilliput JVM as it is today: 22-bit nKlass, 10-bit shift
  • B) a Lilliput JVM that uses a 26-bit nKlass with a 6-bit shift
  • C) a Lilliput JVM that uses a 22-bit nKlass and a Klass pointer indirection table [3]
  • D) a Lilliput JVM that uses a 22-bit nKlass and a non-pow2 alignment of Klass of 704 [4]

It turned out that Coleen had also written a prototype with a Klass pointer indirection table [5]; it is identical to mine (C) in all relevant points. The only difference is that Coleen based hers on the mainline JVM, while mine is based on Lilliput. I nevertheless repeated all my tests with Coleen's prototype.

The Tests

I ran both SpecJBB2015 and a custom-written microbenchmark [6].

The microbenchmark was designed to stress object-to-Klass dereferencing during GC. It fills the heap with many objects of randomly chosen classes and keeps them alive in an array. It then executes several Full GCs and measures pause times. Walking these objects forces the GCs to dereference many nKlass values. Loads from the MW should be cache-friendly since the oops are laid out linearly in an array, so detrimental cache effects should come mostly from dereferencing the various Klass structures.

The microbenchmark was run on a Ryzen Zen 2 and SpecJBB on an older i7-4770; I also tested Coleen's prototype on a Raspberry Pi 4. The tests were isolated to 8 cores (well, apart from the Raspberry Pi), and I tried to minimize scheduler interference. The microbenchmark results were pretty stable, but the SpecJBB2015 results fluctuated despite my attempts at stabilizing them.

I repeated the microbenchmark for varying numbers of classes (512..16384) and three different collectors (Serial, Parallel, G1).

The Results

The microbenchmark shows very clear results, depending on the number of classes involved. Its results were a lot more stable than those of SpecJBB2015.

Microbenchmark, G1GC

See graph [7].

Graph: G1 microbench, absolute GC pauses

  • (A) - overall worst performance, with a 41% increase over the best performer (B) at 16k classes.
  • (B) - best performance
  • (C) - worst of the three mitigation prototypes; in some ranges (around 4k..8k classes) even worse than the unmitigated case (A). Maxes out at +36% over the best performer (B) at 16k classes.
  • (D) - second best performance, for certain class ranges even best. Maxes out at +11% at 16k classes.

Microbenchmark, ParallelGC

See graphs [8].

Graph: ParallelGC microbench, absolute GC pauses

Differences are less pronounced, but the results are similar: (B) is better than (D), which is better than (C).

Microbenchmark, SerialGC

See graphs [9].

Graph: SerialGC microbench, absolute GC pauses

Again, the same result. Here the deltas are most pronounced; the cache inefficiency, measured via GC pauses, is most apparent.

Coleen's prototype

I repeated the measurements with Coleen's prototype (with G1). No surprises; it shows behavior similar to (C) vs. (B). See [10].

I also did a run with perf to measure L1 misses; we see up to 32% more L1 cache misses with the Klass pointer table [11].

SpecJBB2015

SpecJBB2015 results were quite volatile despite my stabilization efforts. The deltas between maxJOps and critJOps did not rise above random noise. The GC pause times showed (percentages relative to (A) == 100%):

      Run 1     Run 2     Run 3
B   108.68%    95.88%    97.01%
C   104.26%    92.15%    90.25%
D    96.20%    94.31%    86.44%

(all with G1GC)

Again, (D) seems to perform best. I am unsure what the problem is with (B) here (the 26-bit class pointer approach), since it seems to perform worst in all runs. I will look into that.

Side considerations

Running out of class space/addressable classes?

This issue does not much affect the number of addressable classes; that is limited by the size of nKlass. With a 22-bit nKlass, that is about 4 million. We can only go beyond that with the concept of near vs. far classes suggested by John Rose. In any case, that is not the focus here.

The class-space-based approaches (A), (B), and (D) also have a soft limit in that the number of Klass structures we can store is bounded by the size of the encoding range (4GB). That, however, is a rather soft limit: there is no hard technical reason preventing us from having a larger encoding range. The 4GB limitation exists only because we optimize the addition of the base immediate by using, e.g., 16-bit moves on some platforms, so the nKlass must not extend beyond bit 31. Note that 4GB is also very generous.

Reducing the number of Klass structures addressable via nKlass

Coleen had the very good idea that never-instantiated classes (e.g., Lambda Forms) don't need an nKlass at all and therefore don't have to live within the Klass encoding range. That would be a great improvement, since these classes are typically generated and their number is unpredictable. Removing this kind of class from the equation makes the question of the number of addressable classes a lot more relaxed.

Conclusions

For the moment, I prefer (D), the uneven-cache-lines approach. It shows the best overall performance, in parts even outperforming the 26-bit approach (B).

The Klass pointer table approach (C) would be nice, since we could eliminate the class space and a lot of the code that goes with it. But the additional load in the hot decoding path seems to really hurt.

There is also a vague wish to future-proof: I am apprehensive about sacrificing Klass resolution performance, since Klass lookup is something we will always do. That said, all input (especially from you, Coleen!) is surely welcome.

Materials

All test results, tests, etc. can be found here: [12]

Thanks, Thomas
