Faster box/unbox and Int operations
My work on Perl 6 performance continues, thanks to a renewal of my grant from The Perl Foundation. I'm especially focusing on making common basic operations faster, the idea being that if those go faster, then programs composed out of them should too. This appears to be working out well: I've not been directly trying to make the Text::CSV benchmark run faster, but that has been the observed result of this work.
I'll be writing a few posts on various of the changes I've made. This one will take a look at some related optimizations around boxing, unboxing, and common mathematical operations on Int.
Boxing and unboxing
Boxing is taking a natively typed value and wrapping it into an object. Unboxing is the opposite: taking an object that wraps a native value and getting the native value out of it.
In Perl 6, these are extremely common. Num and Str are boxes around num and str respectively. Thus, unless dealing with natives explicitly, working with floating point numbers and strings will involve lots of box and unbox operations.
There's nothing particularly special about Num and Str. They are normal objects with the P6opaque representation, meaning they can be mixed in to. The only thing that makes them slightly special is that they have an attribute marked as being a box target. This marks the attribute as the one to write to or read from in a box or unbox operation.
Thus, a box operation is something like:
- Create an instance of the box type
- Find out where in that object to write the value to
- Write the value there
While unbox is:
- Find out where in the object to read a value from
- Read it from there
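To make this concrete, here's a minimal sketch in C of such generic box and unbox operations. This is purely illustrative, not MoarVM's actual code; the Type, Object, and box_target_offset shapes are assumptions made for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed shapes for illustration; not MoarVM's actual structures. */
typedef struct {
    size_t size;               /* size of the object body */
    size_t box_target_offset;  /* offset of the attribute marked as box target */
} Type;

typedef struct {
    const Type *type;          /* every object knows its type */
    char        body[];        /* attribute storage follows the header */
} Object;

/* Generic box: create an instance, find where to write, write the value. */
static Object *box_i64(const Type *type, int64_t value) {
    Object *obj = malloc(sizeof(Object) + type->size);  /* 1. create an instance */
    obj->type = type;
    size_t offset = type->box_target_offset;            /* 2. find where to write */
    *(int64_t *)(obj->body + offset) = value;           /* 3. write the value there */
    return obj;
}

/* Generic unbox: find where the value lives, read it from there. */
static int64_t unbox_i64(const Object *obj) {
    size_t offset = obj->type->box_target_offset;       /* 1. find where to read */
    return *(int64_t *)(obj->body + offset);            /* 2. read it */
}
```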
Specialization of box and unbox
A box is actually two operations: an allocation and a store. We know how to fast-path allocations and JIT them relatively compactly; however, that wasn't being done for box. So, step one was to decompose this higher-level op into the allocation and the write into the allocated object. The first step could then be optimized in the usual way allocations are.
In the unspecialized path, we first find out where to write the native value to, and then write it. However, when we're producing a specialized version, we almost always know the type we're boxing into. Therefore, the object offset to write to can be calculated once, and a very simple instruction emitted to do a write at that offset into the object. This JITs extremely well.
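As a sketch of the difference, reusing the illustrative Object shape from earlier: once we know the box type at specialization time, the per-operation offset lookup disappears and the offset becomes a constant baked into the code. The offset value here is hypothetical.

```c
/* Before specialization: the offset is looked up through the type each time. */
static void write_box_generic(Object *obj, int64_t value) {
    *(int64_t *)(obj->body + obj->type->box_target_offset) = value;
}

/* After specialization: the type is known, so the offset was computed once
 * at specialization time and is now a plain constant. The JIT can turn this
 * into a single store-at-offset machine instruction. */
#define INT_BOX_OFFSET 0  /* hypothetical value, resolved during specialization */
static void write_box_specialized(Object *obj, int64_t value) {
    *(int64_t *)(obj->body + INT_BOX_OFFSET) = value;
}
```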
There are a couple of other tricks. Binds into a P6opaque generally have to check that the object wasn't mixed in to; however, since we just allocated it, we know that can't be the case and can skip that check. Also, a string is a garbage-collectable object, and when assigning one GC-able object into another one, we need to perform a write barrier for the sake of generational GC. However, since the object was just allocated, we know very well that it is in the nursery, and so the write barrier will never trigger. Thus, we can omit it.
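To see why the barrier can go, here's the typical shape of a generational write barrier, sketched in C; the helper functions are assumptions for illustration, not MoarVM's exact API.

```c
/* Assumed GC helpers, for illustration only. */
extern int  in_old_generation(const Object *obj);
extern int  in_nursery(const Object *obj);
extern void add_to_remembered_set(Object *obj);

/* A generational write barrier only needs to remember the container when
 * an older object gains a reference to a nursery object, so that a
 * nursery-only collection can still find that reference. */
static void assign_ref(Object *container, size_t offset, Object *value) {
    *(Object **)(container->body + offset) = value;
    if (in_old_generation(container) && in_nursery(value))
        add_to_remembered_set(container);
}

/* If the container was allocated just an instruction ago, it is certainly
 * still in the nursery, so in_old_generation(container) is always false
 * and the whole barrier can be omitted from the specialized code. */
```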
Unboxing is easier to specialize: just calculate the offset, and emit a simpler instruction to load the value from there.
I've also started some early work on escape analysis, which will allow us to eliminate many box object allocations entirely. It's a great deal easier to implement this if allocations, reads, and writes to an object have a uniform representation. Lowering box and unbox operations into these lower-level operations thus eases the path to implementing escape analysis for them.
What about Int?
Some readers might have wondered why I talked about Num and Str as examples of boxed types, but not Int. It is one too - but there's a twist. Actually, there are two twists.
The first is that Int isn't actually a wrapper around an int, but rather an arbitrary precision integer. When we first implemented Int, we had it always use a big integer library. Of course, this is slow, so later on we made it so that any number fitting into a 32-bit range would be stored directly, and a big integer structure only allocated if the value is outside of this range.
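Based on that description, the storage can be sketched like this; it's an illustrative layout, not MoarVM's exact one. The same space holds either a 32-bit value plus a flag, or a pointer to a big integer structure (in the real scheme, the flag value is chosen so it can never collide with the bits of a genuine, aligned pointer).

```c
#include <stdint.h>

/* Illustrative layout: one word holds either a small integer plus a
 * "this is a small Int" flag, or a pointer to a big integer structure. */
typedef union {
    struct {
        int32_t  value;  /* the small integer itself */
        uint32_t flag;   /* non-zero marks the small case */
    } small;
    void *big;           /* pointer to an arbitrary precision integer */
} IntBody;
```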
Thus, boxing to a big integer means range-checking the value to box. If it fits into the 32-bit range, then we can write it directly, and set the flag indicating that it's a small Int. Machine code to perform these steps is now spat out directly by the JIT, and we only fall back to a function call in the case where we need a big integer. Once again, the allocation itself is emitted in a more specialized way too, and the offset to write to is determined once at specialization time.
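Reusing the IntBody sketch from above, the logic the JIT now emits inline looks roughly like this; make_big_integer stands in for the out-of-line function call and is an assumption of the sketch.

```c
extern void *make_big_integer(int64_t value);  /* assumed slow-path function call */

/* Range check the value to box: store small values directly, and only
 * fall back to a function call when a real big integer is needed. */
static void box_int(IntBody *body, int64_t value) {
    if (value >= INT32_MIN && value <= INT32_MAX) {
        body->small.value = (int32_t)value;   /* fits: write it directly */
        body->small.flag  = 1;                /* mark it as a small Int */
    }
    else {
        body->big = make_big_integer(value);  /* slow path: function call */
    }
}
```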
Unboxing is similar. Provided we're specializing on the type of the result, we calculate the offset at specialization time. Then, the JIT produces code to check if the small Int flag is set, and if so just reads and sign extends the value into a 64-bit register. Otherwise, it falls back to the function call to handle the real big integer case.
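Again in terms of the IntBody sketch, with an assumed fallback function:

```c
extern int64_t big_integer_to_i64(void *big);  /* assumed slow-path function call */

/* Specialized unbox: on the small path, read the 32-bit value and sign
 * extend it to 64 bits, with no function call at all. */
static int64_t unbox_int(const IntBody *body) {
    if (body->small.flag)
        return (int64_t)body->small.value;  /* read + sign extend */
    return big_integer_to_i64(body->big);   /* real big integer case */
}
```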
For boxing, however, there was a second twist: we have a boxed integer cache, so for small integers we don't have to repeatedly allocate objects boxing them. So boxing an integer is actually:
- Check if it's in the range of the box cache
- If so, return it from the cache
- Otherwise, do the normal boxing operation
When I first did these changes, I omitted the use of the box cache. It turns out, however, to have quite an impact in some programs: one benchmark I was looking at suffered quite a bit from the missing box cache, since it now had to do a lot more garbage collection runs.
So, I reinstated it, but this time with the JIT doing the range checks in the produced machine code and reading directly out of the cache in the case of a hit. Thus, in the cache hit case, we now don't even make a single function call for the box operation.
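A sketch of that hit path, reusing the Object shape from earlier; the cache bounds here are made up for illustration.

```c
#define CACHE_LOW  0
#define CACHE_HIGH 255  /* illustrative bounds, not MoarVM's actual ones */

/* Filled with preallocated boxed Int objects at startup (illustrative). */
static Object *int_box_cache[CACHE_HIGH - CACHE_LOW + 1];

extern Object *box_int_slow(int64_t value);  /* the normal boxing operation */

/* Cache hit: just a range check and an array load, emitted directly as
 * machine code, with no function call at all. */
static Object *box_int_cached(int64_t value) {
    if (value >= CACHE_LOW && value <= CACHE_HIGH)
        return int_box_cache[value - CACHE_LOW];
    return box_int_slow(value);  /* miss: normal boxing operation */
}
```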
Faster Int operations
You might wonder why we picked 32-bit integers as the limit for the small case of a big integer, and not 64-bit integers. There are two reasons. The most immediate is that we can then use the 32 bits that would be the lower 32 of a 64-bit pointer as our "this is a small integer" flag. This works reliably because pointers are always aligned to at least a 4-byte boundary, so a real pointer to a big integer structure would never have the lowest bits set.
The second reason is that there's no portable way in C to detect if a calculation overflowed. However, for the basic math operations, if we have two inputs that fit into 32-bit integers and we do the operation at 64-bit width, we know the result can never overflow a 64-bit integer. We can then range check the result and decide whether to store it back into the result object as a 32-bit value, or to store it as a big integer.
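In terms of the IntBody sketch, the portable approach for addition looks like this:

```c
/* Two 32-bit inputs added at 64-bit width can never overflow 64 bits,
 * so the result is exact and can safely be range checked afterwards. */
static void add_small(IntBody *result, int32_t a, int32_t b) {
    int64_t r = (int64_t)a + (int64_t)b;
    if (r >= INT32_MIN && r <= INT32_MAX) {
        result->small.value = (int32_t)r;   /* still small: store directly */
        result->small.flag  = 1;
    }
    else {
        result->big = make_big_integer(r);  /* grew: store as a big integer */
    }
}
```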
Since Int is immutable, all operations result in a newly allocated object. This allocation - you'll spot a pattern by now - is open to being specialized. Once again, finding the boxed value to operate on can also be specialized, by calculating its offset into the input objects and result object. So far, so familiar.
However, there's a further opportunity for improvement if we are JIT-compiling the operations to machine code: the CPU has flags for whether the last operation overflowed, and we can get at them! Thus, for two small Int inputs, we can simply do the following (sketched in C after the list):
- Grab the values
- Do the calculation at 32-bit width
- Check the flags, and store it into the result object if no overflow
- If it overflowed, fall back to doing it wider and storing it as a real big integer
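The JIT emits that check directly as machine code. A close C analogue uses the GCC/Clang overflow builtins, which compile down to exactly this add-then-branch-on-the-overflow-flag pattern; the big integer fallback is again an assumption of the sketch.

```c
extern void add_as_big_integers(IntBody *result, int32_t a, int32_t b);  /* assumed slow path */

/* Do the add at 32-bit width and consult the CPU's overflow flag, instead
 * of widening and range checking. __builtin_add_overflow compiles to an
 * add instruction followed by a branch on the overflow flag. */
static void add_int(IntBody *result, int32_t a, int32_t b) {
    int32_t r;
    if (!__builtin_add_overflow(a, b, &r)) {
        result->small.value = r;            /* no overflow: result stays small */
        result->small.flag  = 1;
    }
    else {
        add_as_big_integers(result, a, b);  /* overflowed: go the wide route */
    }
}
```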
I've done this for addition, subtraction, and multiplication. Those looking for a MoarVM specializer/JIT task might like to consider doing it for some of the other operations. :-)
In summary
Boxing, unboxing, and math on Int all came with various indirections for the sake of generality (coping with mixins, subclassing, and things like IntStr). However, when we are producing type-specialized code, we can get rid of most of the indirections, resulting in being able to perform them faster. Further, when we JIT-compile the optimized result into machine code, we can take some further opportunities, reducing function calls into C code as well as taking advantage of access to the overflow flags.