This document attempts to describe the structure of the MoarVM JIT compiler and document some of the design decisions that shape it.
This part is noteworthy because it is not, in fact, part of the JIT compiler at all. It is a separate subsystem that handles all of the type logging, graph construction and rewriting, and call inlining - in short, all of the interesting optimizations of a dynamic language system. The JIT (legacy and ‘expression’) is a code-generation backend for this system. MoarVM is rather unusual in this respect: in most systems, the order is reversed.
In contrast to what might be called ‘language-level’ optimizations implemented by spesh, the objective of the JIT is to remove interpreter-level inefficiencies.
The legacy or ‘lego’ JIT compiler was the first to be developed back in 2014. It has this name because it works by emitting independent fragments of machine code that are stitched together. These fragments are mapped directly from the ‘spesh graph’ (representing a MoarVM subroutine or ‘frame’) that is the input to the compiler. There are very few optimizations applied by the legacy compiler (one notable optimization is the inlining of a polymorphic REPR function pointer if the type of the object is known).
The lego JIT produces a structure called the ‘JIT graph’ which, while technically a correct name, is a bit of a misnomer since it is really a singly-linked list. It uses labels, rather than an explicit control flow graph (like spesh does), to indicate internal structure. The JIT graph is then compiled to JIT code. There is in fact little reason to maintain this two-phase setup; the legacy JIT could work just as well in a single phase.
Even for the new JIT, the legacy JIT provides the necessary scaffolding to integrate with the interpreter (such as a proper function prologue and epilogue, support for exceptions and invokish operators, and the various bits and pieces required for correct deoptimization).
DynASM is a third-party library from the LuaJIT project that we’ve modified to support register addressing on the x86-64 architecture. It allows you to interleave the assembly code you’d like to emit with the C code that does the regular compiling business. DynASM has two parts:
- A Lua script that preprocesses a ‘.dasc’ file (C interleaved with assembly code) and outputs a C file.
- A C header to #include into the generated C file, which contains all the functions to assemble and link the machine code at runtime.
DynASM source files are architecture-specific.
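To give a flavour of the format, here is a minimal sketch of a ‘.dasc’ fragment - illustrative only, not taken from the MoarVM source (`Dst` is the conventional DynASM output parameter):

|.arch x64
|.actionlist actions

/* The '|' lines are translated by the Lua preprocessor into
 * dasm_put() calls against Dst; plain C lines pass through as-is. */
static void emit_add_const(dasm_State **Dst, int value) {
    | add rax, value
}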
The expression ‘tree’ is the intermediate format of the new ‘expression’ JIT compiler. The name ‘tree’ is technically a misnomer - the structure actually represents a graph.
The expression tree is built from a spesh graph by mapping MoarVM instructions to templates. These templates are written in a textual format by a human (or perhaps a script) and precompiled to a header file during MoarVM compilation. Take note that while the expression templates look a bit like a high-level language some may be familiar with, the concepts they express are decidedly ‘low-level’ and map more-or-less directly to the machine code of a hypothetical RISC CPU.
The syntax used for expression templates is derived from LISP.
- Lists are the basic constructs of the language. A list is delimited by an opening `(` and a closing `)` parenthesis. Lists can be nested. Items in a list are separated by whitespace.
- A word is anything that matches the Perl regular expression `[^\s\(\)#"']`.
- A word that consists only of alphanumeric symbols (and possibly underscore) is called a symbol: `sp_p6oget_o`. The first symbol in a list is generally its operator.
- A word suffixed by `:` is generally a keyword, e.g. `let:`.
- A word prefixed by `$` is a reference, either to a MoarVM instruction operand (`$1`, `$2`) or to a declared name (`$val`).
- A word prefixed by `^` invokes a list macro, which can be declared with the `macro:` keyword.
- Macro parameters are prefixed by `,` (e.g. `,obj`), and they are replaced with whatever symbol or list is ‘bound’ to them when the macro is invoked.
- Parameter macros are indicated by the prefix `&`, e.g. `&sizeof`. They are always substituted with a function macro: `(&sizeof foo)` becomes `sizeof(foo)`. These must form constant expressions at compilation time.
- A string is anything delimited by `"` - but note that escape sequences are not supported, so `"foo \" bar"` doesn’t do what you might think it does.
An example of a template definition would be as follows:
(template: sp_p6oget_o
(let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
(if (nz $val) $val (^vmnull))))
The `let:` keyword is not an operator but a special construct for the template precompiler. It allows the declaration of names for values that are computed and reused in further operators.
And an example of a macro definition would be:
(macro: ^getf (,object ,type ,field)
(load (addr ,object (&offsetof ,type ,field))
(&SIZEOF_MEMBER ,type ,field)))
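To illustrate the substitution, an invocation such as the following (where `MVMObject` and `header` are merely illustrative stand-ins for a real type and field):

(^getf $1 MVMObject header)

would expand to:

(load (addr $1 (&offsetof MVMObject header))
      (&SIZEOF_MEMBER MVMObject header))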
These are all the operators that are known by the expression language as of today (<2017-09-25 Mon>).
Operator | Shape | Type | Semantics |
---|---|---|---|
LOAD | (load reg $size) | reg | Load a value from a memory address |
STORE | (store reg reg $size) | void | Store a value to a memory address |
CONST | (const $value $size) | reg | Load a constant value to a register |
ADDR | (addr reg $ofs) | reg | Add a constant offset to a pointer |
IDX | (idx reg reg $scale) | reg | Compute a pointer for an index in an array |
COPY | (copy reg) | reg | Copy this value (opaque to tiling) |
LT | (lt reg reg) | flag | First operand is smaller than second |
LE | (le reg reg) | flag | First operand is smaller than or equal to second |
EQ | (eq reg reg) | flag | First operand is equal to second |
NE | (ne reg reg) | flag | First operand is not equal to second |
GE | (ge reg reg) | flag | First operand is larger than or equal to second |
GT | (gt reg reg) | flag | First operand is larger than second |
NZ | (nz reg) | flag | Operand is nonzero |
ZR | (zr reg) | flag | Operand equals zero |
CAST | (cast reg $from $to $sign) | reg | Convert a value from smaller to larger representation |
FLAGVAL | (flagval flag) | reg | Binary value of comparison operator |
DISCARD | (discard reg) | void | Discard value of child operator |
ADD | (add reg reg) | reg | Add two integer values |
SUB | (sub reg reg) | reg | Subtract second operand from first |
AND | (and reg reg) | reg | Binary AND (intersection) of two operands |
OR | (or reg reg) | reg | Binary OR (union) of two operands |
XOR | (xor reg reg) | reg | Binary XOR of two operands |
NOT | (not reg) | reg | Binary negation of operand |
ALL | (all flag+) | flag | Variadic short-circuit logical AND |
ANY | (any flag+) | flag | Variadic short-circuit logical OR |
DO | (do void* reg) | reg | Execute multiple statements and return last expression |
DOV | (dov void+) | void | Execute multiple statements |
WHEN | (when flag void) | void | Execute statement if condition is met |
IF | (if flag reg reg) | reg | Yield value of conditional expression |
IFV | (ifv flag void void) | void | Conditionally execute one of two statements |
BRANCH | (branch reg) | void | Jump to code position |
LABEL | (label $label) | reg | Load position of code |
MARK | (mark (label $label)) | void | Mark this position |
CALL | (call reg (arglist ...) $size) | reg | Call a function and return a value |
CALLV | (callv reg (arglist ...)) | void | Call a function without a return value |
ARGLIST | (arglist (carg reg $type)+) | c_args | Setup function call arguments |
CARG | (carg reg $type) | void | Annotate value with ‘parameter type’ |
GUARD | (guard void $before $after) | void | Wrap a statement with code before and after |
TC | (tc) | reg | Refer to MoarVM thread context |
CU | (cu) | reg | Refer to current compilation unit |
LOCAL | (local) | reg | Refer to current frame working memory |
STACK | (stack) | reg | Refer to top of C frame stack |
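To show how these operators compose, the following expression (a made-up example, not an actual template; the offset `0x10` is arbitrary) stores VMNull over a pointer slot in the current frame whenever that slot holds zero:

(when (zr (load (addr (local) 0x10) ptr_sz))
      (store (addr (local) 0x10) (^vmnull) ptr_sz))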
Next to the register allocator, the instruction selection algorithm (or the ‘tiler’) is the most complex part of the JIT. It is fortunately fairly stable. The goal of tiling is to match the intermediate representation (the expression tree/graph) to the cheapest sequence of instructions on the target architecture (x86-64) that implements the semantics of the expression tree. The implementation is heavily based on the paper by Aho et al. - and in fact, the expression tree IR was designed mostly to accommodate it.
The following is a necessarily incomplete description of the actual process - much better described by either the paper linked above or the source code that implements it.
A ‘tile’ in the expression JIT describes a number of overlapping concepts, which I rely on the reader to disambiguate by context. (Humans are good at that.)
- A pattern that can be matched against an expression tree and which is defined textually, much like the expression templates.
- An object for the JIT compiler to represent a machine instruction, and
- A function that emits this machine code to the compiler buffer, with parameters substituted
The textual ‘tile definition’ contains the name of the function that implements it (in the example below `store_addr` and `test_addr_const`), the pattern that the tile matches, the symbol that is substituted for the pattern in a successful match, and the cost of doing so (in terms of compiled code).
(tile: store_addr
(store (addr reg $ofs) reg $size) void 5)
(tile: test_addr_const
(nz (and (load (addr reg $ofs) $size) (const $val $size))) flag 4)
Conceptually, tiling is a process of reducing the tree structure by replacing nodes by the symbols defined by the tiles. The following example may serve as an illustration. Given the following set of tiles:
Tile | Pattern | Symbol | Assembly code |
---|---|---|---|
local | (local) | reg | rbx |
const | (const $value) | reg | mov reg, $value |
addr | (addr reg $offset) | reg | lea reg, [reg+$offset] |
load | (load reg $size) | reg | mov out, [reg] |
add | (add reg reg) | reg | add reg1, reg2 |
store | (store reg reg $size) | void | mov [reg1], reg2 |
We can reduce a tree and generate code as follows (note that the order of reducing is bottom-up / postorder from left-to-right):
Tree | Tile | Code |
---|---|---|
(add (load (addr (local) 0x8)) (const 17)) | local -> reg | |
(add (load (addr reg 0x8)) (const 17)) | addr -> reg | lea rax, [rbx+0x8] |
(add (load reg) (const 17)) | load -> reg | mov rcx, [rax] |
(add reg (const 17)) | const -> reg | mov rdx, 17 |
(add reg reg) | add -> reg | add rcx, rdx |
Tiling never takes constant parameters into account, which is a severe limitation - some operators are radically different between 8-bit and 64-bit sizes on x86-64. Maybe we can implement architecture-specific templates to optimize our way out of this.
The differences between the model above and the real implementation are:
- A tile can cover more complex expression trees (see the example tiles above)
- A given subtree may be reduced by different sets of tiles, and
- We’d like to choose the cheapest such set, and we’d like to do that efficiently.
Before going further, be warned: the following is rather complex and even now it makes my head hurt. Feel free to skip this section.
In order to figure out whether a certain tile can be applied to a given tree, we need to know if its pattern matches. The pattern matches if its structure matches and if the expressions ‘below’ it can be matched to the symbols at its leaves. Because the tiler uses postorder iteration, we can ensure that symbols have been assigned to the leaves by the time we consider the head operator of the tile.
To avoid having to traverse the tree to find out whether the leaves match the pattern, the tile patterns are ‘flattened’ during precompilation and the nested lists replaced with ‘synthetic symbols’. From each sublist a new (partial) tile pattern is generated that reduces to this generated symbol and compiles to nothing. See the table below for some examples. Because the resulting patterns are flat, they can be matched by inspecting only the symbols of the direct children of the operator.
Original | Flat |
---|---|
(load (addr reg)) -> reg | (load @sym1) -> reg |
| (addr reg) -> @sym1 |
(store (addr reg) reg) -> void | (store @sym2 reg) -> void |
| (addr reg) -> @sym2 |
(add reg (const)) -> reg | (add reg @sym3) -> reg |
| (const) -> @sym3 |
(add reg (load (addr reg))) -> reg | (add reg @sym4) -> reg |
| (load @sym5) -> @sym4 |
| (addr reg) -> @sym5 |
From the table it is possible to see that a single tile list pattern (like `(addr reg)`) can be reduced to many different symbols. In fact, given a set of tiles and their associated costs, we can compute all possible combinations of symbols (symbol sets) that a tile pattern can be reduced to, and we can also compute which tile would be the most efficient implementation of that operator given the symbol sets of its children. By precomputing this information, selecting the optimal tile for an operator reduces to a table lookup at runtime.
For completeness I’ll try to describe how the precomputation process works. The central concept it is based on is that of ‘symbol sets’ (symsets). We begin by initializing a map of all symbol sets, which are initially just the individual symbols that are generated by all tile patterns (called `@rules`).
$sets{$_->{sym}} = [$_->{sym}] for @rules;
And a reverse lookup table is also generated, from all symbols within a set, to all symbol sets that they occur in. Again, initially this is just the symbol itself.
while (my ($k, $v) = each %sets) {
# Use a nested hash for set semantics
$lookup{$_}{$k} = 1 for @$v;
}
Then, for each tile pattern, we look up all symbols that could be combined with the symbols in the pattern, and we store the symbol that this reduces to. The name `%trie` is also somewhat inaccurate, but it gets the picture across: this is a nested associative lookup table. The purpose is to group together all tiles (and their associated symbols) that map to the same symbol sets - because that means that these tiles can cover the same tree structures.
for my $s_k1 (keys %{$lookup{$sym1}}) {
for my $s_k2 (keys %{$lookup{$sym2}}) {
$trie{$head, $s_k1, $s_k2}{$rule_nr} = $rule->{sym};
}
}
Having done that, we generate new symbol sets from this lookup table - all the values of the generated hashes are symbols that can be produced by tiles that map to the same trees:
my %new_sets;
for my $gen (values %trie) {
my @set = sort(main::uniq(values %$gen));
my $key = join(':', @set);
$new_sets{$key} = [@set];
}
In general, we run this process for multiple iterations with the new sets of symbols, because the symbol sets that exist influence the tile patterns that are considered to be equivalent. For instance, in the table above the tile patterns generating `@sym1`, `@sym2` and `@sym5` are equivalent, and so after a single iteration there will be only a single set containing those symbols. This means that the patterns `(load @sym1)` and `(load @sym5)` are equivalent, and hence that `@sym4` is always combined with `reg`. So this process has to iterate until there are no more changes in the sets, at which point we can read (from the same table) all possible combinations of sets. Having this, it is a small matter to compute the cheapest tile of the set to cover a given tree [fn:cost].
my (%seen, @rulesets);
for my $symset (values %trie) {
my @rule_nrs = main::sortn(keys %$symset);
my $key = join $;, @rule_nrs;
push @rulesets, [@rule_nrs] unless $seen{$key}++;
}
return @rulesets;
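The fixed-point iteration itself can be sketched roughly as follows (a simplified sketch; `compute_sets` and `build_lookup` are hypothetical helpers standing in for the steps shown above, not functions from the actual table generator):

# Repeat until the symbol sets stop changing (a fixed point).
my $prev = '';
while (1) {
    my %new_sets = compute_sets(\%lookup, \@rules);   # hypothetical
    my $state    = join '|', sort keys %new_sets;
    last if $state eq $prev;     # no new combinations appeared
    $prev   = $state;
    %sets   = %new_sets;
    %lookup = build_lookup(\%sets);                   # hypothetical
}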
The legacy JIT compiler did not require register allocation, because it never persisted values in registers between fragments. The expression JIT compiler does - that is the whole point of it.
There have been three attempts at implementing a register allocator for the expression JIT. Ironically, all of them have been based on the same algorithm, which is linear scan register allocation on SSA form.
The first attempt tried to do register allocation ‘online’ with code generation, i.e. during tree walking. This does not work because you need to know the live range of the values you are allocating; otherwise you don’t know when a register can be freed, so you’ll either run out of registers or free them incorrectly. Furthermore, you need to manage aliases, i.e. values that are the same (created by `COPY` operators) and nodes that ‘merge’ (created by `IF` operators). Never mind the case where we compile a `CALL` node in a conditional block.
The second attempt improved on that in two ways:
- It made a real attempt to compute the live range extents of values. This was initially based on tree iteration order; after the tile list was introduced, it became based on tile list position (which is what we use now).
- It used a ‘cross-linked-list’ to identify values that either shared a location or a definition. This was used to deal with aliases and merged values.
However, this proved still not quite sufficient. It had inherited from the earlier online algorithm the design of allocating a register for a value at the first instance it is encountered, and assigning that register to instructions that use the value as they are encountered. Hence it required the ability to determine the ‘current’ position for a value, including whether that value is spilled or not. This is somewhat complicated when a value is represented by multiple different things (due to aliasing and merging, in a cross-linked list). What is more, it would only try to resolve aliases at register-assignment time, i.e. when it encountered the relevant `COPY` or `IF` operator. So that would actually change the extent of the live range while it was ‘under operation’, which further complicates several key pieces of functionality, such as deciding when a register can be freed.
The third attempt is the current one and departs from the previous two attempts in the following crucial ways:
- Rather than have a value be synonymous with its definition (the instruction and operator that created it), it represents a set of definitions and uses that are joined together by aliases and merges. It uses a separate phase and a union-find data structure to implement these values (which are consistently called ‘live ranges’). This ensures that all uses of the ‘same’ value use the same register location for it.
- During the allocation phase, it iterates over the values in ascending order (i.e. by first definition), rather than over the instructions as the second version did or over the tree as the first. A register can be reused only after its value has expired, which is when the algorithm iterates past its last use. This means that at a single point in the program, no two values can be allocated to the same register, which is the essential correctness property for a register allocator. (A rough sketch of this loop follows the code example below.)
- Whenever a register is allocated, it is directly assigned to the instructions that use its value. When that register is changed for whatever reason, we update it as well. Thus we never have to query ‘what is the current location of the value of $x’, which gets tricky to answer in case of spilling, and the register assigned to an instruction is always ‘up to date’.
- It is sometimes necessary to ‘spill’ a value from a register to memory, either because we need more registers or because the ABI forces us to - this is true for all register allocation algorithms. At any point in the program where it is necessary to spill a value, all instructions that define or use that value are wrapped with instructions to move that value between register and memory. (This ensures that the value is spilled to the same position in all possible code paths). The single live range that covered all these uses is then split into ‘atomic’ live ranges that cover only the instruction to load or store the value and the original instruction that would use or define it. The upshot is that a live range that is processed always refers to a value that resides in a register.
- Because of value merging and aliasing, it is sometimes possible for a value to be undefined within its own live range. This can be demonstrated by the code below. In it, `$a` is first defined as 10, then conditionally redefined as the result of `$a * 2 - 3`. So between the definition of `$b` and the assignment of the ‘new’ value of `$a`, the ‘old’ value of `$a = 10` is no longer valid. Hence, it should not be stored to memory at that point. Such undefined ranges are called ‘live range holes’, and they are an example of the ‘complexity waterbed’ of compiling - SSA form code doesn’t have them, but makes it really complicated to ensure the same register is allocated for all the separate uses of a value.
my $a = 10;
if (rand < 0.5) {
my $b = $a * 2;
$a = $b - 3;
}
printf "\$a = %d\n", $a;
It is (unfortunately) necessary to have the register allocator handle the C function call ABI. This is because certain function arguments are supposed to be passed in specific registers (meaning they have to be placed there) and some registers can be overwritten by the function that is called (in fact, all of them). So in the general case, the JIT needs to:
- Place values in the correct registers and/or stack positions, making sure not to overwrite values that are still necessary for the call.
- Load values from memory if they are not present in registers. Note that a function call can have more arguments than registers, so it is not always possible to load all values into registers prior to the call. So we cannot rely on the normal register allocator mechanisms to ensure a value is present.
- Ensure all values that ‘survive’ the call are spilled to memory.
This is currently implemented by a function `prepare_arglist_and_call` that is over 144 lines long and implements topological sort, cycle-breaking, and register-conflict resolution with values used by the `CALL` operator itself (e.g. in dynamic dispatch).
I feel like this functionality - rearranging registers to deal with the register requirements of specific operators - could be generalized, but I’m not sure it is worth it. The x86-64 instruction set has plenty of ‘irregular’ instructions that have specific register requirements; however, virtually all of them use just the single `rax` register, which I’ve decided to keep as the spare register anyway. So I’m not sure what the value of generalizing would be, and I’m fairly sure it would introduce even more complexity.
What is still necessary for completeness is the ability to allocate multiple register classes, specifically floating point registers. The suggested way of handling that is by giving out different number ranges per class, and letting the tiles sort out the details of getting the correct number.
Another thing that is not necessary but certainly nice-to-have is a way to ensure that the register assignment actually meets the constraints of the target architecture. Specifically, while the expression JIT assumes a system that operates as follows:
a = operator(b, c)
What we actually have on x86-64 is more like:
a = operator(a, b)
The situation is actually considerably more complex due to the many addressing modes that are possible. (For instance, indexed memory access uses a third register ‘c’.) Currently we handle that in tiles by moving values around, but it would be much nicer if the register allocator could enforce this constraint. (The allocator keeps a ‘scratch’ register free at all times for this purpose.)
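Concretely, lowering `a = b + c` on such a two-operand architecture requires a copy first; with `a`, `b` and `c` in `rax`, `rbx` and `rcx` (illustrative register choices), a tile would emit something like:

mov rax, rbx    ; a = b
add rax, rcx    ; a = a + c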
Finally, the live range hole finding algorithm iterates over the basic blocks of the JIT code in backwards linear order. That is fine so long as there are no loops (the tile list control flow always flows forward), which is true so long as we only ever try to compile single basic blocks - but that is an assumption I’m explicitly trying to break. So I’ll need to find a better, more general iteration order.
To aid debugging, the JIT compiler logs to a file. This logging is ad hoc and line-based (for the most part). It is not currently useful for general consumption or instrumentation. Adding to that, the linear scan allocator has numerous debug logging statements that are conditionally compiled in.
It would probably be a nice improvement to write ‘structural’ logging information to the JIT or spesh log (which could be useful for instrumentation), and use conditionally compiled ‘ad hoc’ logging statements solely for debugging. And it’d be nicer still to have conditional compilation on the section of JIT output that you’d really be interested in (e.g. expression building or register allocation).
The JIT log is set up at initialization time in `src/moar.c`, based on the environment variable `MVM_JIT_LOG`. This is arguably not a good design for embedding; then again, for embedding, this is not a file you’d want to expose anyway.
Expression trees and tile lists have their own loggers, which implement a more structured format. An expression tree log represents the tree as a directed graph in the ‘dot’ language. There’s probably still some formatting we could apply to that, but on the whole I’m satisfied with that approach. The tile list logger represents each tile with a debug string, which is stored when the tiler table generator processes the tile pattern definitions. It also lists the basic block structure of the program.
The JIT uses three different strategies for memory management:
- Spesh region allocation for the JIT graph.
- Dynamically allocated vectors for the expression tree, tile lists and related structures.
- Anonymous memory pages for the generated machine code.
The `spesh` dynamic optimization framework contains a region allocator that is bound to the spesh graph.
Operating systems that try to provide some measure of security to C programs generally do not allow execution of heap-allocated memory.
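Instead, executable memory has to come from anonymous pages whose protection is flipped once the code has been written into them. A minimal POSIX sketch of the idea (illustrative only - MoarVM’s actual allocation code may well differ):

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy generated machine code into a fresh anonymous mapping and
 * make it executable (and no longer writable). */
static void *alloc_executable(const unsigned char *code, size_t size) {
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memcpy(mem, code, size);
    if (mprotect(mem, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, size);
        return NULL;
    }
    return mem;
}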
[fn:cost] Actually, it is not very simple at all, and this is mostly because we’ve ‘flattened’ the tile list, so we somehow have to ‘reconstruct’ the cost of a tile set that is equivalent to the tree matched by the original. This part really needs to be revisited at some point.
"since because"
"at the and"
"porperty"
"a instruction"
"Note that a" - cut-off sentence
"funcitonality"
"requiremennts"
" staements"