Bytecode to BIR compiler notes

In the last couple weeks I have been writing code to compile CVM bytecode to Cleavir BIR, which can then be native compiled. The ultimate goal here is to have a two-tier compilation system: Lisp can be very rapidly compiled to bytecode, and then anything that needs to run more efficiently can be further compiled into machine code. The bytecode is pretty fast, and good enough for code that's only evaluated once or not often, so I expect that this further compilation should only be necessary for comparatively little code. This allows a separation of concerns, in that the compiler to bytecode can focus on being fast, while the compiler from bytecode can disregard compilation speed in favor of better code generation. Hypothetically, the bytecode could be used as an IR in itself, totally replacing the AST representation currently used.

In this text I'll use "bytecode compiler" to mean the compiler to bytecode, i.e. that takes Lisp source as an input and produces bytecode as output. The compiler from bytecode, that takes bytecode as input and outputs BIR, is the "BTB compiler" (bytecode-to-BIR).

These are pretty informal notes. The BTB compiler is functional, but I haven't tested it extensively or done profiling.

Characteristics of bytecode

Details about the bytecode have already been explained elsewhere, so these are just a few summary points.

The bytecode is "minimally compiled" as described in CLHS 3.2.2.2. That is, all macros have been expanded and all load-time-value forms resolved. (The bytecode does now work with file compilation.) The bytecode compiler expands compiler macros unconditionally except for respecting cl:notinline declarations.

So then what is the BTB compiler to do? Well, the bytecode is in a sense "high level". The bytecode compiler only does very general optimizations. It is not a VM like say Guile's, which has a good deal of specialized arithmetic and so on. If you write (+ (the single-float x) (the single-float y)), for example, this will be rather uninspiredly compiled into a call to +. All data dealt with by the VM are boxed objects.

This lack of optimization is of course a disadvantage in that it means the bytecode can't execute optimizable code like this as quickly as it could if it were smarter. But the advantage is that the bytecode reflects the source code pretty closely while being compact and quick enough to execute. For example, any function call in the source code will, barring a compiler macro, actually end up as a call to the same function in the bytecode. This is pretty much the characteristic we want for an IR - the bytecode compiler purely takes care of minimal compilation and putting things into a compact representation, while leaving complicated optimizations for later passes.

Another advantage of this lack of optimization is that nothing in the bytecode is implementation specific, and it can be produced and run[1] in any implementation. The bytecode can also work as a neutral FASL format. (This is dependent on macros and compiler macros not expanding into implementation-dependent code, which is unfortunately a nontrivial condition.) The BTB compiler can then apply implementation-specific optimizations.

The main optimizations the bytecode compiler carries out are cell elision for non-closed-over variables, and eliding consing up dynamic environments for exit points (e.g. cl:block or cl:tagbody) that are not actually used nonlocally. These optimizations do not especially change the structure of the bytecode from that of the source.

One aspect of this correspondence is that any given source form is compiled to bytecode that takes up a single contiguous range. Any subforms become single contiguous ranges within that range - there is simple hierarchical nesting.

A problem for the BTB compiler is that the bytecode compiler totally ignores the parts of the source that it's not required to pay attention to - like cl:the forms, and declarations other than cl:special and cl:notinline. These constructs have no correspondence in the bytecode, but are basically required for any more sophisticated compiler. Knowing the current optimization value for safety is needed to know if unsafe optimizations can be performed, for example; the bytecode compiler skips optimize declarations entirely, as it is entitled to do because nothing it does can possibly make the code unsafe.

Another thing the bytecode ignores (or rather did ignore, before my work here) is source information. That is to say that the bytecode does not record information about the source locations for given bytecode regions. That sort of information is very important for debugging, and we want the resulting native code to have it.

The bytecode also flattens variable definitions - in (progn (let ((x ...)) ...) (let ((y ...)) ...)), x and y will be assigned the same register number, and nothing in the bytecode itself indicates that there were two distinct source variables. This is a serious problem for optimization by the BTB compiler, since these two variables may have different types, etc. This is also a problem for debugging as we may want at runtime to know what source variables the value in a given VM register corresponds to.

Bytecode is organized into modules. Each module includes a function and any inner functions (from lambda, flet, etc.). In the file compiler, a module corresponds roughly to a single form processed at toplevel - so (progn (defun ...) (defun ...)) would get two modules. Common Lisp allows jumps from one function into another via cl:return-from and cl:go, and the bytecode reflects this interdependence - so it would be difficult to have a bytecode module have only some of its functions further compiled to native.

The BTB compiler

Characteristics of the BTB compiler

The BTB compiler is there to not just generate native code corresponding to bytecode, but to generate good, optimized native code. This makes it different from say a template JIT that just maps each bytecode instruction to a series of machine instructions. A template JIT runs fast and produces okay code, but the BTB compiler is supposed to not care about speed and produce great code. This is why the BTB compiler targets Cleavir's BIR - once we have BIR, we can run the full suite of optimizations already used in Cleavir to generate good code.

Ideally the BTB compiler should also be simple. It transmutes one IR (the bytecode) into another (BIR) without needing to do any optimizations itself. If converting bytecode into BIR is a complex process that takes supralinear time, the value of bytecode as an IR is considerably diminished.

Annotations

To solve the problem of the paucity of information in bytecode, I used what I call "annotations".[2] Each bytecode module now contains, in addition to the bytecode itself, a line table containing accessory information about the bytecode. This information is totally ignored by the VM itself, but useful for the BTB compiler, as well as for debuggers operating on bytecode.

Each annotation contains a start point and an end point, and some other information. The start and end are indices into the bytecode, delimiting a region to which the annotation applies, and the other information varies.

As a simple example, one kind of annotation contains source information. The start and end delimit the bytecode corresponding to a form, and the other information is a source location. The BTB compiler can use this source information to generate line tables for the native code debugger. Additionally, a debugger running on the bytecode directly can use this source information. The mapping is quite simple - say the debugger is stopped on the instruction at index N - then the debugger just finds the most specific source location annotation containing N, and reports that as the source location at which the program is stopped.
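To make that concrete, here is a minimal sketch of the lookup, assuming a hypothetical source-annotation structure and the sorted annotation vector described further below; none of these names are the actual ones used in the implementation.

  ;; Hypothetical representation of a source annotation.
  (defstruct source-annotation
    start      ; first bytecode index covered by the annotation
    end        ; index just past the last covered instruction
    location)  ; the recorded source location

  (defun source-location-at (annotations index)
    "Return the location of the most specific source annotation containing INDEX, or NIL."
    (let ((best nil))
      (loop for a across annotations
            ;; The vector is sorted by start index, so we can stop early.
            while (<= (source-annotation-start a) index)
            ;; Because source forms nest hierarchically, a containing
            ;; annotation that starts later is more deeply nested, i.e.
            ;; more specific.
            when (and (< index (source-annotation-end a))
                      (or (null best)
                          (>= (source-annotation-start a)
                              (source-annotation-start best))))
              do (setf best a))
      (and best (source-annotation-location best))))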

A more complex form of annotation records information about variables. A variable annotation's excess information is an alist mapping register locations to variable names. Within the annotation's region, those register locations correspond to source variables of those names. (The bytecode never reuses registers within these ranges - another point of optimizationlessness.) One annotation can contain multiple variables, as a cl:let form or lambda list can bind several variables simultaneously; an alternate design would be to have multiple annotations for the same region, each with a different variable.
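As a concrete (and entirely hypothetical) example of the payload, the variable annotation covering the body of (let ((x ...) (y ...)) ...) might carry something like the following, assuming x was assigned register 0 and y register 1:

  ;; Hypothetical variable annotation payload: register index -> variable name.
  ;; Within the annotation's [start, end) region, register 0 holds X and
  ;; register 1 holds Y.
  ((0 . x) (1 . y))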

For the BTB compiler, there are annotations for declarations and for cl:the forms. The cl:the annotations have a degenerate bytecode region: the start and end indices are identical. What this means is that the last value pushed to the stack (or put in the multiple values register, depending on a slot in the annotation) at this point is of the given type. The declaration annotations are a bit tricky due to the trickiness of declaration scope in Common Lisp. They are placed so that any bound declarations have a region that starts just after a variable binding instruction. The BTB compiler then needs to note these annotations when processing a variable binding.

A final critical kind of annotation, and actually the first I put in, records the functions for a given bytecode module. A bytecode module can have bytecode for a set of functions, with inner functions etc. ending up in the same module. From the bytecode alone there is no indication of where functions begin or end. While it would be possible to infer the function boundaries from the bytecode itself, recording them explicitly as annotations keeps the BTB compiler (and the debugger) simple.

Annotations are generated by the bytecode compiler as it compiles. This takes extra time, but should only be a constant factor, as annotations can always be generated at the same time as the code they refer to. The labels already used by the bytecode compiler are used for the start and end of each annotation, and at bytecode linking time these labels are resolved into actual indices.

The representation of the annotations is simplistic - just a vector. Each annotation is its own full fledged object within this vector, with slots for the start and end bytecode indices. Probably something cleverer could be worked out. The bytecode compiler generates annotations in order - that is, if an annotation A precedes an annotation B, A's bytecode region starts at or before B's. This makes it easier for the debugger and BTB compiler to find relevant annotations.

The annotations provide enough information for the BTB compiler to produce BIR from bytecode. The produced BIR has type and other declaration information, allowing for optimization to proceed.

Structure of the BTB compiler

The BTB compiler operates in two phases. The first phase is very simple and just iterates through the bytecode making BIR functions and iblocks. The second phase iterates through the bytecode again, as well as the annotations, filling these functions and iblocks with generated IR. Both phases are essentially linear in the length of the bytecode, which is of course asymptotically as good as they could possibly be. It's probably possible to merge both passes into one and thereby iterate over the bytecode just once.

The second phase proceeds using "contexts" to keep track of a version of the VM state. For example, say the BTB compiler hits a const instruction. In the VM, hitting this instruction would push a constant value to the stack. The compiler method for this instruction will create a BIR constant, a bir:constant-reference instruction in the iblock currently being generated, and a bir:output from the instruction. This output is pushed to a stack in the context, reflecting how the VM would push a value. If the next instruction uses a value from the stack - say it's an fdefinition instruction - it will pop this output from the context reflecting how the VM would pop a value from the stack. This all makes straight line code quite simple to generate - probably a bit like a template JIT.
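In rough sketch form, the const and fdefinition cases might look like the following; the context structure and the make-/emit- helpers are hypothetical stand-ins, not the actual Cleavir BIR API.

  ;; Hypothetical context mirroring the VM state during compilation.
  (defstruct context
    stack       ; list of BIR data, mirroring the VM value stack
    mvals       ; datum standing in for the multiple values register
    dynenv      ; current dynamic environment
    variables)  ; map from VM register index to BIR variable

  (defun compile-const (value context iblock)
    ;; The VM would push the constant's value; we push the BIR output
    ;; of a constant-reference instruction instead.
    (let* ((constant (make-constant value))
           (output (emit-constant-reference iblock constant)))
      (push output (context-stack context))))

  (defun compile-fdefinition (context iblock)
    ;; As described above, this instruction consumes the previously pushed
    ;; datum and pushes a datum representing the named function.
    (let ((name (pop (context-stack context))))
      (push (emit-fdefinition iblock name) (context-stack context))))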

The context also includes information about dynamic environments, the multiple values register, and bound variables. BIR variable structures are created when a variable binding annotation is hit. Compiler methods for variable-related instructions use the context information to find the BIR variable that corresponds to a register.[3]

Control flow

The handling of control flow is the messiest part of the BTB compiler at the moment. The basic problem is that the context depends on the control flow: obviously the values on the stack at some position in the bytecode depend on what the previously executed code pushed, and the previously executed code may have been before a jump or branch. (This section is a bit technical and can be skipped in favor of waiting for me to come up with more elegant solutions myself.)

The BTB compiler keeps control flow handling relatively simple by relying on several features of the particular bytecode generated by the bytecode compiler. These are basic invariants that the bytecode compiler maintains without any trouble and that are useful for compilation. I wrote about a few of these in a previous document as I was getting started on the BTB compiler. Some of the important ones are that loop iterations always leave the stack size and MV register state where they found it (e.g. loop: push x; jump loop is never generated) and that when branches remerge they have the same stack and MV register state (e.g. jump-if then; else: push x; jump then; then: is never generated). By assuming these invariants the BTB compiler can skip a whole lot of bookkeeping. But there are a few issues.

First, going in I expected one of the features I could use was that the bytecode would be generated in dominance order. That is, if A dominates B, A appears before B in the bytecode. If this is the case, then by the time the BTB compiler gets to a block, it has always seen the context prior to execution reaching that block. This property mostly holds, but not in general, thanks to tagbody: in (tagbody (go test) loop ... test (when (whatever) (go loop))), as is commonly generated for cl:loop, the loop block is dominated by the test block, but the bytecode compiler generates tagbodies left to right and so loop precedes test. My solution to this has been to rely on another generated bytecode property - these tagbody blocks never pop from the stack or read the MV register from outside the block - so they can just use the context in place when the tagbody is entered.

Second, the bytecode does not make clear what should be phi nodes from merged control flow. That is, there is no obvious indication whether the bytecode from a branch is generating a value or not. Take (when x (print x)). If this code is in a single value context (e.g. (list (when x (print x)))), the bytecode will be ref 0; jump-if-8 L0; nil; jump-8 L1; L0: fdefinition 'print; ref 0; call-receive-one; L1:. The jump-if-8 branches, and then one branch pushes nil while the other pushes the result of the print call, so the BIR block for L1 needs a phi to represent this alternative. But in a zero value context, e.g. (progn (when x (print x)) ...), the bytecode will be ref 0; jump-if-8 L0; jump-8 L1; L0: fdefinition 'print; ref 0; call-receive-fixed 1 0; L1:, meaning that both branches push nothing, so L1 does not need a phi. It is not obvious which is which without actually computing the stack states at the end of both branches and comparing them. The BTB compiler does this, and it works, but it is easily the ugliest part of the compiler. I am hoping to add a new annotation type to eliminate this comparison.
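The comparison boils down to something like the following sketch, where make-phi-for is a hypothetical stand-in for however a phi actually gets created in an iblock, and each predecessor's simulated stack is a list of BIR data:

  (defun merge-stacks (iblock predecessor-stacks)
    "Return the stack to use after a merge point, creating phis in IBLOCK where the predecessors disagree."
    ;; One of the assumed invariants: every predecessor leaves the stack
    ;; at the same depth.
    (assert (apply #'= (mapcar #'length predecessor-stacks)))
    (apply #'mapcar
           (lambda (&rest data)
             (if (every (lambda (d) (eq d (first data))) data)
                 (first data)                 ; same datum from every branch: no phi
                 (make-phi-for iblock data))) ; branches disagree: phi needed
           predecessor-stacks))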

Finally, there's unreachable code. We want the bytecode to include unreachable code to keep the bytecode in a straightforward correspondence with the source, and so that later BIR passes can intelligently delete the unreachable code while possibly alerting the user. The problem is that the context to use when compiling the unreachable code is not so obvious, as it has no predecessors. For example take the basic function call (foo (go loop)). This will be compiled to bytecode as fdefinition 'foo; restore-sp 0; jump-8 L0; call-receive-fixed 1 0. The call-receive-fixed is the call to foo and is unreachable, but it pops a value - supposed to be the callee foo - from the stack. The first fix for this was to create sham BIR data for inputs to instructions generated in unreachable code. This works, but is ugly, and also means the data from the reachable part are not hooked up correctly (e.g., in this case the fdefinition instruction is reachable, but its output is used for nothing, instead of going into the unreachable call).

The newer fix is to use the context from before the jump. This is better, but has caused problems as unreachable code may mess up some of the other properties the BTB compiler expects from bytecode, such as the aforementioned property of having the same state at all entries to merge points. I actually had to add a new instruction to normalize the state, even though this instruction is never actually executed. Very bad kludge. I think the new annotation type should be a better fix.

Supporting changes to bytecode

I adopted some of the changes I was working on in that gist. Primarily, multiple value calls work differently now, with their arguments being taken from the VM stack rather than the multiple value register.

A few minor fixes have been put in place to keep the VM state consistent between branches. For example, calls in zero-value contexts now use call-receive-fixed n 0 instead of just call, so that the MV register is not set. Leaving the register set is harmless in the VM, but it makes the bytecode less regular.

Results

It works. I haven't done any metrics, but it looks like the BTB compiler can efficiently compile code as you'd expect; for example, type inference applies and float operations are reduced to unboxed operations.

Future and niggling questions

Manual compilation

Currently in Clasp the BTB compiler is only hooked up to be run manually by the programmer. We have this setup (illustrated by the session sketched after the list):

  1. Bytecoded functions are considered compiled for CL purposes (i.e. they are of type cl:compiled-function)
  2. cl:compile on a bytecoded function invokes the BTB compiler to replace the definition.
  3. BTB compiled functions are also considered compiled (are of type cl:compiled-function).
  4. cl:compile on a BTB compiled function is a no-op.
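For example, a session under this setup might look like the following (the function is of course made up):

  (defun square (x) (* x x))       ; bytecompiled by default
  (compiled-function-p #'square)   ; => T              (point 1)
  (compile 'square)                ; BTB compiles it   (point 2)
  (compiled-function-p #'square)   ; => T              (point 3)
  (compile 'square)                ; no-op             (point 4)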

The second point is the weirdest. I believe it is conformant, as the standard for cl:compile says "If the definition is already a compiled function, compile either produces that function itself (i.e., is an identity operation) or an equivalent function." (emphasis mine). Since BTB compilation by definition produces a function that behaves identically with respect to the allowed compilation semantics, I don't see a problem.

Bytecode functions are certainly minimally compiled so there is no conformance problem in treating them as compiled. An alternate approach would be to consider bytecode functions non-compiled - which I also think would be conforming - and therefore emphasize the difference between bytecode and BTB compiled code for programmers. But really I have no sense of what people expect the compiled-function type to indicate - I don't think people really use it much. So I don't know if it really matters either way. Possibly we could put in a bytecodedp extension predicate, or something.

The BTB compiler works fine on compile-file'd bytecode. The only possible snafu would relate to cl:load-time-value, but the bytecode doesn't know about that anyway - it just treats load time values as constants, with the values put in by the loader. The BTB compiler also works on closures, which I think is pretty cool.

JIT

One application of the BTB compiler is to act as a just-in-time (JIT) system. An implementation can bytecompile everything, and then automatically BTB compile only those functions that actually need to run efficiently. I expect that for most functions, the efficiency gain from BTB compilation will be minor enough that the compilation will take more time than it saves. But for some (e.g. arithmetic- or array-access-heavy code) BTB compilation will provide great gains.

It's not obvious what metric should be used to determine when a function should be further compiled. Ideally, we would BTB compile functions that will be called a lot before they are called a lot, and never BTB compile others, but obviously it's hard to guess what needs compilation a priori. The basic approach would be to associate a counter with each bytecode function, which is then incremented on each call; then once some threshold is reached the function is compiled.
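A minimal sketch of that counter approach, with the threshold, the counter table, and the hook into the call sequence all hypothetical:

  (defvar *btb-threshold* 1000)

  (defun note-call (name counters)
    "Increment NAME's call counter, BTB compiling it once the threshold is reached."
    (when (= (incf (gethash name counters 0)) *btb-threshold*)
      ;; As described under "Manual compilation": CL:COMPILE on a
      ;; bytecoded function invokes the BTB compiler.
      (compile name)))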

I haven't yet found any literature on this subject. JVM implementations generally seem to compile their bytecode immediately rather than ever run a VM, with the bytecode simply serving as an interchange format. Guile adopts the counter approach, additionally incrementing the counter for each loop iteration (I assume this means that it increments the counter for a function for each loop iteration within that function, but this is not clear from this page).

Another idea would be to quickly scan bytecode functions to guess whether BTB compilation would be helpful. This scanning could include checking what functions are called (e.g. arithmetic and aref would make BTB compilation more attractive), as well as checking saved declarations (high optimization qualities means the programmer wants compilation, and having lots of type declarations makes it more likely compilation will do good).
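Very roughly, such a scan might look like this; the accessor for a function's constants is made up, and a real version would examine the instruction stream and the declaration annotations as well:

  (defparameter *hot-operators* '(+ - * / aref svref))

  (defun probably-worth-compiling-p (bytecode-function)
    ;; BYTECODE-FUNCTION-CONSTANTS is a hypothetical accessor for the
    ;; function's constant vector, which here is assumed to include the
    ;; names of called functions.
    (loop for constant across (bytecode-function-constants bytecode-function)
            thereis (and (symbolp constant)
                         (member constant *hot-operators*))))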

There is some disadvantage to the BTB compiler being a real compiler rather than just a template JIT here, in that it can take an unpredictably long amount of time to work. JIT compilation can happen at any time without warning, which could disrupt programmers in various ways. Some dynamic control of the JIT would be useful, even something as simple as giving the programmer a without-jit macro for when they need predictable performance. It would probably also be possible to run the JIT in a worker thread rather than synchronously; assuming swapping out a function definition can be done atomically, this should be no problem.
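The without-jit control could be as simple as a special variable that the JIT hook (e.g. the note-call sketch above) consults before compiling:

  (defvar *jit-enabled* t)

  (defmacro without-jit (&body body)
    "Disable automatic BTB compilation within BODY's dynamic extent."
    `(let ((*jit-enabled* nil))
       ,@body))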

Compilation environments and timing

The BTB compiler does not fit into the CL picture of compilation straightforwardly. The compilation semantics in CLHS 3.2.2 basically imagine uncompiled code and compiled code; there is no further-compiled code, which is what the BTB compiler produces. Therefore it's not always obvious how compilation should work - whether the native code should be compiled under the compilation environment and settings in place at the time of the original bytecode compilation, or those in place at the time of the later BTB compilation. I think the least confusing option is to honor the settings at the time of bytecode compilation, as that is when the code was minimally compiled. This is especially the case if JIT is used, as then BTB compilation may happen in any context without prompting, and grabbing whatever settings are in place at that time could really confuse the programmer.

As an example of the difference, currently the BTB compiler uses global cl:optimize values when the code does not lexically declare them. Subtler differences include global type declarations. Keeping the bytecode-compilation-time information will require the bytecode compiler to store that information.

Ideally the bytecode compiler would work with an explicit environment structure rather than rely on side effectual global definitions. The portable version does, but not the one in Clasp.

It is possibly worth noting that most possible redefinitions would violate the restrictions on conforming code in 3.2.2.3. Of course the fact that they are defined to be illegal doesn't prevent them from happening.

Inlining and local calls

The most glaring weakness in the BTB compiler as far as optimizations go is that no inlining is performed. This is the only optimization performed by the source-code-to-BIR-to-native system in Clasp that the BTB compiler does not perform. Inlining is obviously a very important optimization so this needs to be rectified.

I think there are two basic approaches I could take:

  • Have the BTB compiler perform inlining - when it sees a function call, it looks for an inline definition in the environment, and if one is present it is used. This would be simple.
  • Have the bytecode compiler perform inlining. This may necessitate changes to the BTB compiler and possibly VM to account for argument parsing within functions.

I'm leaning towards option two for two reasons. First, inlining is a relatively simple optimization that would not impact the speed of bytecode compilation much. It would probably also improve the performance of the bytecode, as bytecode function calls are expensive relative to native function calls. Second, inlining at bytecode compilation time sidesteps the redefinition issues of the previous section - the inline definition available to the BTB compiler will be the one saved by the bytecode compiler.

I would also like to use bytecode as the format for inline definitions, in Clasp. Currently Clasp saves inline definitions as ASTs, which is very time- and memory-intensive[4], and additionally entails a good amount of complex infrastructure for serializing and deserializing a complex DAG. Bytecode is just as environment-independent as ASTs but faster to generate, and can already be serialized and deserialized.

Local calls have been a very helpful optimization in Cleavir and it may be worthwhile to implement them in the bytecode compiler as well. Bytecode instructions for local calls could then be directly translated by the BTB compiler into local call BIR instructions. I haven't thought through the details yet.

Verification

Currently the BTB compiler will fail weirdly if given invalid bytecode, including by generating native code that will run and possibly have damaging effects. Since bytecode can be loaded from files, which may be maliciously or accidentally damaged, this situation is unsafe. The BTB compiler currently does some checks itself, but nothing comprehensive.

The better solution is to establish formal rules for what bytecode is valid, as mentioned in that other gist, and then to write code to verify bytecode. The bytecode compiler can be proven to produce valid bytecode, so the verifier only needs to be run on loaded bytecode. With verification in place, the BTB compiler can simply assume the bytecode it's given is correct without further tests.

This verification would extend to checking that annotations on the bytecode are sensible, e.g. that variable bindings don't overlap, etc.
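The annotation checks could start out as simple structural validation, along these lines (hypothetical accessors, and only the most basic checks):

  (defun check-annotations (annotations bytecode-length)
    (loop with previous-start = 0
          for annotation across annotations
          for start = (annotation-start annotation)
          for end = (annotation-end annotation)
          do (unless (<= 0 start end bytecode-length)
               (error "Annotation region [~d, ~d) is out of bounds." start end))
             (unless (<= previous-start start)
               (error "Annotations are not ordered by start index."))
             (setf previous-start start)))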

Controlling annotation generation

Currently the bytecode compiler produces annotations unconditionally. This takes some time and memory. Doing so is unnecessary for some code, e.g. code that will never be BTB compiled or that will only be run once. It would be good to conditionalize annotation generation, e.g. based on cl:compilation-speed.

Some annotations may be possible to construct after the fact. The control flow related annotations mentioned above certainly could be. Many, like declarations, obviously couldn't. It might be conforming to BTB compile annotation-less code anyway as long as it's assumed to be safe code, and notinline declarations are respected by the bytecode compiler.

Footnotes

  1. Unfortunately the pure-CL implementation of the VM is inefficient in multiple values handling, and could use more optimization.

  2. These are totally distinct from the "annotations" described in the bytecode compiler paper. I probably should have picked a different term.

  3. An alternate option would be to treat each register as its own variable, and then rely on a later SSA conversion to split ranges into separate variables. This was rejected as being more complex than simply recording the variable bindings in the bytecode compiler, especially considering that those bindings need to be recorded anyway to provide information to the debugger. An SSA pass could still help the BTB compiler as it could help any compiler.

  4. For example, when bytecode compiling SLIME, over half the time is spent constructing ASTs for inline definitions, even though there aren't many inline definitions. The bytecode compiler is very fast and building ASTs is very not.
