teryror/hblc impl notes.md

## hblc impl notes.md

      
    Raw
  

              hblc impl notes.md
            
          
    Crafting a Compiler from Scratch: Implementation Notes

For the past two weeks or so, I've been working on a little compiler project
in C, mostly for educational purposes, i.e. to understand how a compiler
really works. I'm not using any libraries, other than the C runtime
library.
Introduction

I have a hand-written lexer and parser, and a simple code generator targetting
the Nintendo Gameboy Advance. I don't support named variables yet, but I do
have pointer dereference and assignment, so I can write programs like this:
# Test Program: Collatz Conjecture

0x02000000.* = 27   # Starting number
0x02000004.* =  0   # Step counter

loop {
    if 0x02000000.* & 1 {
        0x02000000.* *= 3
        0x02000000.* += 1
    } else {
        0x02000000.* = 0x02000000.* >> 1
    }
    
    0x02000004.* += 1
} while 0x02000000.* - 1

I'm saying I'm target the GBA, rather than "the ARMv4T instruction set"
because I'm only generating valid ROM images, no other types of executable
file. I'm also assuming the absence of memory protection featurs, and the
GBA's memory map - address 0x02000000 is where EWRAM (external work RAM) is
mapped.
As you can see from the snippet above, I have loops and if statements for
control flow, but no functions or type checking. Those features are planned, of
course, but I wanted to take a step back and talk about what I've learned so
far.

I've been primarilly referring to two sources on compiler construction;
Crafting Interpreters by Bob Nystrom, and Programming an x64 compiler
from scratch, by Per Vognsen.
The former is a freely available, though (as of the time of writing)
unfinished book that assumes familiarity with imperative programming, but not
much more. It's concise, accessible, and a fun read.
The latter is a series of recordings of Per live streaming development,
totaling almost 15 hours of video, with periods of silent thinking, debugging,
and refactoring thrown in. There is a whole host of valuable knowledge in
there, but in his explanations, Per assumes not only familiarity with various
C-specific idioms, but also some amount of familiarity with compiler-specific
terminology. It's not the most accessible source, to say the least.
I don't want to retread the ground covered in Crafting Interpreters here, so
I'll assume you've read at least the first few chapters. Instead, consider this
a complementary work, where I'll try to relate the new information from Per's
videos, as well as my own findings.
With all that said, let's get started. We'll work our way through the compiler
stages, starting with the lexer:
Lexing in O(1) Space

The lexer we see in Crafting Interpreters works as a precursory pass to parsing;
it calls scanToken() until it reaches the end of the file, and then passes an
array of all tokens to the parser. This is nice in principle, because having
random access into the token stream allows us to parse pretty much any grammar
we want.
In practice, it is totally unnecessary; language designers generally frown upon
languages that require unbounded lookahead, and some languages, like Lisp, can
be parsed without any lookahead at all.
Knowing this, Per opted to only keep the most recently lexed token around, and
to call ReadToken() directly from the parser. With this setup, we have a strict,
forward streaming interface. If we want lookahead, we have to manually keep track
of previous tokens in local variables.
I didn't want to do that either, though, because I knew I'd need at least two
tokens of lookahead for the grammar I wanted, and wouldn't be able to leverage
the call-stack to cleanly manage that state. As a compromise between the two
methods, I implemented a ring buffer of tokens, which would be refilled completely
whenever the parser tried to lookahead farther than had been lexed.
static Token lookahead(TokenStream * tokens, u32 offset) {
    // We want LookaheadBufferSize to be a power of two so that the modulo
    // reduction of ring buffer indices can be optimized to a bitwise AND.
    StaticAssert(IsPowerOf2(LookaheadBufferSize));
    
    Assert(offset <= MaxTokenLookahead);
    Assert(tokens->read_cursor <= tokens->write_cursor);
    
    u32 index = tokens->read_cursor + offset;
    if (index < tokens->write_cursor) {
        return tokens->lookahead_buffer[index % LookaheadBufferSize];
    }
    
    { // Refill lookahead buffer:
        while (tokens->write_cursor - tokens->read_cursor < LookaheadBufferSize) {
            u32 write_index = tokens->write_cursor++ % LookaheadBufferSize;
            Token * write_ptr = &tokens->lookahead_buffer[write_index];
            bool stop_lexing = read_token(tokens, write_ptr);
            
            if (stop_lexing) break;
        }
        
        Assert(tokens->read_cursor <= tokens->write_cursor);
        Assert(tokens->write_cursor - tokens->read_cursor > offset);
    }
    
    return tokens->lookahead_buffer[index % LookaheadBufferSize];
}

If I really wanted to micro-optimize this, I could tweak LookaheadBufferSize.
For now, I arbitrarilly chose 32, but the order of magnitude might really matter
for parsing speed: make it too large, and you won't be reading tokens from the
data cache; make it too small and you'll be refilling the ring buffer very often,
constanly switching between lexer and parser code, and potentially thrashing the
instruction cache.
Realistically, it doesn't matter that much. Compilers generally don't spend
the majority of their time in lexing or parsing; if it does end up being a
problem, profiling will show that, and I'll have one avenue of optimization
figured out already.
Lexing with Unicode Support

While this is strictly a research project to me, I want to learn what goes into
a production-grade compiler, and one thing I'd definetely want there is clean
handling of Unicode.
I figured requiring source files to be UTF-8 encoded would be okay, but what if
a file was using a different encoding? Just producing encoding errors, or worse,
producing (and attempting to parse) garbage tokens would make for a, shall we say,
interesting user experience.
So, when we begin lexing a file, before we even fill the lookahead buffer with
tokens, we do this:
{ // Detect Unicode encodings other than UTF-8
    const char *EncodingError =
        "Encoding Error in '%s': source files must be UTF-8! (Detected %s)\n";
    
    if (file_size >= 4) {
        const char *UTF_32_BE_BOM = "\x00\x00\xFE\xFF";
        const char *UTF_32_LE_BOM = "\xFF\xFE\x00\x00";
        
        if (memcmp(file_memory, UTF_32_BE_BOM, 4) == 0) {
            printf(EncodingError, file_name, "UTF-32 BE byte order mark");
            return true;
        }
        
        if (memcmp(file_memory, UTF_32_LE_BOM, 4) == 0) {
            printf(EncodingError, file_name, "UTF-32 LE byte order mark");
            return true;
        }
    }
    
    if (file_size >= 2) {
        const char *UTF_16_BE_BOM = "\xFE\xFF";
        const char *UTF_16_LE_BOM = "\xFF\xFE";
        
        if (memcmp(file_memory, UTF_16_BE_BOM, 2) == 0) {
            printf(EncodingError, file_name, "UTF-16 BE byte order mark");
            return true;
        }
        
        if (memcmp(file_memory, UTF_16_LE_BOM, 2) == 0) {
            printf(EncodingError, file_name, "UTF-16 LE byte order mark");
            return true;
        }
    }

    if (file_size >= 3) {
        const char *UTF_8_BOM = "\xEF\xBB\xBF";
        
        if (memcmp(file_memory, UTF_8_BOM, 3) == 0) {
            // Skip byte order mark in UTF8 files
            file_memory += 3;
        }
    }
}

In the context of this block, return true basically tells the calling function
"something went wrong, don't even bother attempting parsing". This code is adapted
from the example code in the "Check for BOM markers" section of this chapter
of the Programming with Unicode book.
For the actual UTF-8 decoding, I'll just refer you to this article written
by Bjoern Hoehrmann, where he describes a really nice technique based on finite
state machines.
There are some corners of Unicode that can cause trouble for programmers,
though. Currently I detect zero-width spaces, as well as a few other characters
that could be used for code obfuscation. I'm told jslint does similar checks,
but I couldn't find a list of sensible characters to forbid. An important one
is the Greek question mark (U+037E), which will use the same glyph as the semicolon
in most fonts.
Lexing Identifiers and Keywords

In order to do name resolution, you need some way of comparing two identifier tokens
for equality. The easiest way to do that is to store a slice (i.e. a pointer and
a length) into the file buffer directly on the token. This is bad, because it forces
you to keep the file in memory for at least the lifetime of all tokens, which makes
reclamation of resources a pain.
You could just copy the string, but then you're making a lot of small allocations,
and don't forget that a string compare is quite slow: two pointer dereferences,
at least one of which is likely to be a cache miss, and O(n) comparisons.
I picked up the solution to this problem from Jonathan Blow live streaming work
on his Jai compiler: string interning. The idea is to maintain a table of unique
strings you've seen before, and deduplicating as early as possible. This allows
us to replace any and all string comparisons we might otherwise have to do with
a single integer comparison, at the cost of doing some upfront work for each
identifier token. Since pretty much every identifier in a program will be looked
up in some table eventually, this trick is definetely worth it.
Another cool thing is that this allows you to do keyword recognition with the same
table lookup: At program startup, intern all your keyword strings, and remember
their unique representation. After lexing an identifier, intern it, and compare
it to all the keywords. If your "unique representation" is a sequential integer
ID, you can even do this with just one branch and one addition:
u32 atom = intern_string(token_start, length_in_bytes);
if (atom < KeywordCount) {
    result->token_type = KeywordIf + atom;
} else {
    result->token_type = IdentifierToken;
    result->value      = atom;
}

That's pretty much free. You have to be careful to intern your keywords in the
same order they appear in the TokenType enumeration, though.
Encoding and Emitting Instructions

This topic is pretty trivial if you've worked with assembly language before, but
it's not covered in Crafting Interpreters, so here's a 10000ft overview:
Conceptually, you can think of your CPU as a state machine. Its internal state
consists of a number of registers, tiny pieces of memory, typically 32 or 64
bits for machines.
One of those registers serves a special purpose: it holds the address of the
next instruction to be executed. This register is called, unsurprisingly, the
instruction pointer, or sometimes the program counter, depending on the
architecture.
Executing an instruction really means changing the state of the processor. We can
group different instructions based on how they affect the processor state:
arithmetic instructions such as ADD, SUB, or MUL, will move the data from
two registers into the arithmetic logic unit, where the result of the operation
is calculated, and then move the result into a third register. Memory access
instructions such as LDR and STR will move a register-size piece of data from
RAM into a register, or vice-versa. The CPU cannot do arithmetic on values in
memory, they have to first go through a register. The third important category are
branch instructions, which affect the instruction pointer.
Different instructions are systematically assigned binary representations. This
underlying system is called the instruction encoding. The x64 CPU inside your
desktop machine uses a variable-length encoding, where different instructions
take up a varying number of bytes, whereas the ARM CPU inside the GBA uses a
simpler fixed-length encoding, where all instructions are the same size.
It makes sense to split the compiler back-end into two parts: the code generator,
which walks the abstract syntax tree and determines what instructions should be
used, and the instruction emitter, which is just a big collection of functions
that encode instructions and insert their binary representation into an output
stream.
If you have a suitable output stream abstraction, writing an emitter is mostly
mechanical work. Just refer to the instruction set manual for the CPU you're
targeting, which should include relevant sections on instruction encoding. Here's
a snippet from my emitter:
static inline void
Emit3OperandSum(u16 op, u16 dst_register, u16 src_register, u16 operand) {
    Assert(dst_register < 8);
    Assert(src_register < 8);
    Assert(operand < 8);
    
    EmitRaw(0x1800 | op << 9 | operand << 6 |
            src_register << 3 | dst_register);
}

#define EmitADD_R_R_R(Dst, Src, Operand) Emit3OperandSum(0, (Dst), (Src), (Operand))
#define EmitSUB_R_R_R(Dst, Src, Operand) Emit3OperandSum(1, (Dst), (Src), (Operand))
#define EmitADD_R_R_I(Dst, Src, Operand) Emit3OperandSum(2, (Dst), (Src), (Operand))
#define EmitSUB_R_R_I(Dst, Src, Operand) Emit3OperandSum(3, (Dst), (Src), (Operand))

The idea here is that one function handles one instruction encoding, and macros
provide more convenient ways of calling the appropriate function to emit a specific
instruction. This is helpful because, in my case, there's also an EmitADD_R_I
and EmitADD_R_R, both of which call different functions, for example.
Now imagine about 150 more lines of very similar code, and you know what my
emitter looks like.
Generating Code

Now, figuring out which instructions to emit, and in what order, that's a little
more challenging. There are many algorithms for doing this with varying degrees
of sophistication; I mostly coded along with Per's videos here, so I ended up with
something more on the simple side.
Evaluating Expressions

You should already be familiar with the simple tree-walk approach to interpretation:
static u32 eval_expression(const * AstExpr expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            return expr->primary.value;
        } break;
        case BinaryExpression: {
            u32 lhs = eval_expression(expr->binary.lhs);
            u32 rhs = eval_expression(expr->binary.rhs);
            
            if (expr->binary.op == '+') {
                return lhs + rhs;
            } else if (expr->binary.op == '-') {
                return lhs - rhs;
            } /*
             ...
              */
        } break;
    }
}

We can use a perfectly analogous approach to code generation, where every return
statement is replaced with some Emit calls, and functions instead return the index
of the register that will contain the result of the operation:
static u16 emit_expression(const * AstExpr expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            u16 result = allocate_register();
            
            // MOVes the constant value ("Immediate") into the Register result
            EmitMOV_R_I(result, expr->primary.value);
            return result;
        } break;
        case BinaryExpression: {
            u16 lhs_reg = emit_expression(expr->binary.lhs);
            u16 rhs_reg = emit_expression(expr->binary.rhs);
            
            if (expr->binary.op == '+') {
                // ADDs the value in Register rhs_reg to the value in lhs_reg,
                // and stores the result in lhs_reg
                EmitADD_R_R_R(lhs_reg, lhs_reg, rhs_reg);
            } else if (expr->binary.op == '-') {
                EmitSUB_R_R_R(lhs_reg, lhs_reg, rhs_reg);
            } /* ... */
            
            free_register(rhs_reg);
            return lhs_reg;
        } break;
    }
}

You may have heard scary things about register allocation, but that really only
becomes hard when you try to keep a function's local variables in registers - just
like the tree walk interpreter becomes more complicated when you add state and
statements.
I won't show the implementation here, since it's a lot of bit-fiddling, but the
basic idea is to maintain 8 bits of state - one for each register - indicating
which registers are occupied with temporary values in the execution context
of the instruction we emit next. To allocate a register, you find the index of
any zero bit, set it to one, and return the index. To free the register, you mask
out the bit at the specified index.
This approach is nice and simple, but it doesn't exactly generate spectacular
code. For example, for the expression 1 + (2 + 3), it will generate the
following piece of assembly:
MOV r0, 0x1
MOV r1, 0x2
MOV r2, 0x3
ADD r1, r1, r2
ADD r0, r0, r1

At the very least, I'd like to see something more like this:
MOV r0, 0x1
MOV r1, 0x2
ADD r1, 0x3      ; small constants can be encoded in the same instruction!
ADD r0, r0, r1

We can achieve this by delaying code generation for primary expressions until
we know more about how they'll be used, i.e. on our way back up the call stack.
To do that, we cannot simply return register indices anymore, we need to introduce
a new type:
enum { ImmediateOperand, RegisterOperand };
typedef struct {
    u32 type;
    union {
        u32 imm;
        u16 reg;
    };
} Operand;

static Operand emit_expression(const AstExpr * expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            Operand result;
            result.type = ImmediateOperand;
            result.imm = expr->primary.value;
            return result;
        } break;
        case BinaryExpression: {
            Operand lhs = emit_expression(expr->binary.lhs);
            emit_as_register(&lhs);
            
            Operand rhs = emit_expression(expr->binary.rhs);
            
            if (expr->binary.op == '+') {
                if (rhs.type == ImmediateOperand && rhs.imm < 0x100) {
                    // We only have 8 bits to encode immediate operands...
                    EmitADD_R_I(lhs.reg, rhs.imm);
                } else {
                    emit_as_register(&rhs);
                    EmitADD_R_R_R(lhs.reg, lhs.reg, rhs.reg);
                    free_register(rhs.reg);
                }
            } /* ... */
            
            return lhs;
        } break;
    }
}

emit_as_register is a helper function that will do nothing when called with a
RegisterOperand, but when called with an ImmediateOperand, it will allocate a
register, emit the appropriate load instruction, and modify the Operand to
reflect that the value is now held in a register.
This approach is really flexible; we have enough wiggle room to rathole on
instruction selection (the above code will call EmitADD_R_I for 1000 + 1, but
not for 1 + 1000, for example), and we can get constant folding pretty much for
free by checking whether both operands are immediates.
If we restructure our code a bit, such that we do the check on expr->binary.op
before recursing to get lhs and rhs, we can even do operator-specific
optimizations, such as eliminating the right sub-expression of a multiplication
if the left sub-expression is a constant 0, or emitting a SUB instruction for
expressions of the shape a + (-b).
Dereferencing Pointers

If it isn't obvious from the code snippet in the introduction, I denote a pointer
dereference A.*. I chose this syntax for several reasons, none of which really
matter right now; all that matters is that this expression evaluates to the value
stored at address A. We can shoehorn that into the emit_expression function
from the previous section pretty easily:
case UnaryExpression: {
    Operand addr = emit_expression(expr->unary.operand);
    emit_as_register(&addr);
    
    if (expr->unary.op == OperatorDereference) {
        // Load from the address in addr.reg (offset by 0),
        // and store the value in the same register:
        EmitLDR_R_R_OFF(addr.reg, addr.reg, 0);
    } /* ... */
    
    return addr;
} break;

That does not allow us to store anything, though. For that, we need assignment
operators. I don't really like C's idea of assignments being expressions, and
wouldn't want users of my language to think of them as such. But if you were to
look at my parser, you'd see that I still treat them that way internally.
That's in part because it made my life a little bit easier, but also because it
allows for better error reporting. Think Error: assignments are not expressions,
rather than Syntax Error: unexpected token '='. This technique of parsing a
super-set of the language you intend to actually accept is generally pretty good,
and similar techniques can be applied to other parts of a compiler as well.
That said, let's look at two assignment operators, plain-old = and +=.
Here, I'll assume that we have not yet emitted code for the two sub-expressions
by the time we know we're looking at an assignment:
/* ... */
} else if (expr->binary.op == '=') {
    AstExpr lhs = expr->binary.lhs;
    Assert(lhs.type == UnaryExpression &&
           lhs.unary.op == OperatorDereference);
    
    Operand addr = emit_expression(lhs.unary.operand);
    emit_as_register(&addr);
    
    Operand value = emit_expression(expr->binary.rhs);
    emit_as_register(&value);
    
    EmitSTR_R_R_OFF(value.reg, addr.reg, 0);
    free_register(value.reg);
} else if (expr->binary.op == OperatorPlusEqual) {
    AstExpr lhs = expr->binary.lhs;
    Assert(lhs.type == UnaryExpression &&
           lhs.unary.op == OperatorDereference);
    
    Operand addr = emit_expression(lhs.unary.operand);
    emit_as_register(&addr);
    u16 tmp_reg = allocate_register();
    EmitLDR_R_R_OFF(tmp_reg, addr.reg, 0);
    
    Operand value = emit_expression(expr->binary.rhs);
    emit_as_register(&value);

    EmitADD_R_R_R(tmp_reg, tmp_reg, value.reg);
    EmitSTR_R_R_OFF(tmp_reg, addr.reg, 0);
    
    free_register(value.reg);
    free_register(tmp_reg);
} /* ... */

The code duplication, the fact that assignments are not actually expressions, and
don't evaluate to anything (and therefore don't need to return a RegisterOperand),
all suggest that we should split this code into a different function, which is in
fact what I did in my code.
Factoring style aside, though, the basic = assignment works pretty much like the
+ and - operators we've seen before, modulo the reaching into lhs to skip
the pointer dereference.
The += assignment is a bit more interesting, I think. Because we need both the
address and the value at that address, we allocate a temporary register and load
into that before generating code for rhs.
Note that I'm omitting the optimization of the addition itself; if value was an
ImmediateOperand, we could save another instruction here. Furthermore, we're
always emitting load and store instructions with 0 offset; we could pack pointer
arithmetic into the same instructions by again reaching into lhs.unary.operand.
Branching and Looping

Now that we have side effects at our disposal, we may become more interested in
control flow, i.e. if and while statements. The basic framework will assume
that a statement will point to the statement immediately following it; the last
statement in a scope has no successor, even if the scope is part of a statement
that does:
static void emit_statement(const AstStmt * stmt) {
    switch (stmt->type) {
        case IfElseStatement: {
            // ...
        } break;
        case LoopStatement: {
            // ...
        } break;
        case ExpressionStatement: {
            emit_expression(stmt->expr);
        } break;
    }
    
    if (stmt->next) {
        emit_statement(stmt->next);
    }
}

To figure out what we need to fill in the blanks here, we need to take a look at
how control flow works at the machine level: in addition to the instruction pointer,
there's another special register called the status register. It's not accessible
directly by most instructions. Instead, it holds information about previous
operations, such as the carry flag (telling us whether the last arithmetic
instruction caused an overflow), the zero flag (which is exactly what you think
it is), and a few others.
The CMP instruction is essentially the same as the SUB instruction - except
that the result of the operation is discarded. Only these flags are affected. The
other piece of the puzzle are conditional branches, which, depending on the
status register and the condition encoded in the instruction, may or may not add
some constant offset to the instruction pointer. (These are called relative jumps,
because the address of the code we want to jump to is encoded relative to the
location of the jump instruction itself. There are also jumps to absolute addresses,
and variable target jumps, where the target location is read from a general
purpose register).
With that, we are equipped to look at the kind of code we want to generate. For
instance, for the statement if x % 2 { /* ... */ } else { /* ... */ } we want
to generate this structure:
MOV r1, 0x1
AND r0, r1    ; assume x is in r0
CMP r0, 0x0   ; C-style implicit conversion of condition
BEQ else
; ... execute then-branch statements here
B end_if
else:
; ...
end_if:
; ...

There's a problem here: because we don't know how many instructions we'll emit for
the two branch bodies, we don't know the locations of the two labels, and thus
cannot encode the instructions BEQ else and B end_if. The solution here is
called backpatching: we emit a placeholder, and remember it's position. When we
reach the label we want to backpatch, we emit the missing instruction.
In my own code, I use a macro called Backpatch for that, which temporarily sets
the write cursor to the specified value, emits one instruction, and then restores
the cursor. We can write something like this:
case IfElseStatement: {
    Operand cond = emit_expression(stmt->if_else.cond);
    emit_as_register(&cond);
    EmitCMP_R_I(cond.reg, 0);
    free_register(cond.reg);
    
    u16 * beq_else = EmitPlaceholder();
    emit_statement(stmt->if_else.then_branch);
    u16 * b_end_if = EmitPlaceholder();
    
    s16 jmp_offset = JumpOffsetFwd(beq_else);
    Backpatch(beq_else, EmitBEQ_OFF(beq_else));
    
    emit_statement(stmt->if_else.else_branch);
    
    jmp_offset = JumpOffsetFwd(b_end_if);
    Backpatch(b_end_if, EmitB_OFF(b_end_if));
} break;    

Great, now let's do loops. There are a couple different ways to write loops in
assembly, and they have slightly different properties. The one you see the most
often is this:
B cond     ; pre-check, go to the _end_ to the loop
loop:
; ... loop body goes here
cond:      ; condition is evaluated at the bottom of the loop
MOV r1, 0x1
AND r0, r1
CMP r0, 0x0
BNE loop   ; Branch if Not Equal

To get a post-checked loop, we can just skip emitting the first instruciton, B cond - I'll only show the pre-checked variant here:
case LoopStatement: {
    u16 * b_cond = EmitPlaceholder();
    u16 * loop = emit_cursor;
    
    emit_statement(stmt->loop.body);
    s16 jmp_offset = JumpOffsetFwd(b_cond);
    Backpatch(b_cond, EmitB_OFF(jmp_offset));
    
    Operand cond = emit_expression(stmt->loop.cond);
    emit_as_register(&cond);
    EmitCMP_R_I(cond.reg, 0);
    EmitBNE_OFF(JumpOffsetBkwd(loop));
    
    free_register(cond.reg);
} break;

Now the code I show here works, but if you play around with it a little bit,
you'll see that it generates some super awkward code in some cases. Things like
jumps to jump instructions - which we could replace with a jump to the actual
target, without any indirection.
You could do that with the same "reaching in" trick I've discussed before (i.e.
looking at the interior nodes of the AST node you're currently looking at), but
there are many different combinations where this can occur, and the necessary
checks will get more and more messy.
I haven't actually done this myself yet, but a trick I've learned from a paper
(recommended by Fabian Giesen on his blog) is called destination-driven
code generation. The idea is similar to the delayed code generation we're
using for expressions: We introduce the notion of a control destination that we
pass around as we traverse the syntax tree, and delay the backpatching of jump
instructions until we know where they should go.
If you've never read a computer science paper before, this one might make a
good starting point. There's some scary-looking equations in there, but they
are explained in a very nice way.
Buffering Backpatch Locations

So far, the code I've shown is pretty much instruction set agnostic. It applies
to an x64 backend just as well as for the THUMB instruction set we've been using.
You'd need to replace the emitter and and adjust the calls to any Emit functions
accordingly, but other than that, it'll "just work". But now I want to talk about
an annoyance that is instruction set specific, and the solution I came up with:
As mentioned before a THUMB instruction is two bytes, and thus the MOV rd, imm
instruction you'd use inside emit_as_register can only encode immediate values
up to 255 - there's only so much encoding space. What you're supposed to do is
use a PC-relative load, LDR rd, off, which loads a 32-bit value from an address
close to the instruction pointer.
Since we don't want the CPU to try and execute that data, we need to store it in
a suitable location, called a literal pool. That is to say, you generate code
until you reach a good place to dump all the values you couldn't fit into the
instruction stream, and then backpatch a bunch of load instructions.
Any given expression might require any number of these, so we need to maintain a
buffer of backpatch locations that we can push requests into. The question then
becomes where to check whether emtiting a literal pool is necessary; if you
checked after every single statement, you'd need to emit jump instructions to
prevent it from being executed. That's not only ugly, those jumps aren't free,
so you're also making the generated program slower.
Instead, you want to flush the backpatch location buffer, as I call it, only in
places where you're emitting jumps anyway. In a pre-checked loop, we can put it
between the unconditional jump to the condition and the loop body; in an if/else
statement, we can fit it in between the unconditional jump at the end of the first
branch and the beginning of the second branch.
What's nice about this approach is that we can repurpose it to also handle break
and continue statements. What's not so nice is that it doesn't solve a fundamental
robustness problem: just like the size of an immediate value is limited, so is the
size of the offset we can encode in a load instruction - there's a hard limit to
how many instructions we can emit between putting a location into the buffer, and
flushing the buffer. That limit is quite large (about 500 instructions), and we're
not likely to run into any problems for simple programs. But there are pathological
cases where this will blow up and fail to generate valid code.
I don't have a nice solution for this problem yet, and I don't want to dedicate
too much effort to it for now. Going forward, I'd rather work on more language
features.
Conclusion

And there you have it, all the interesting things I've done and learned so far.
I hope this will be useful to somebody - if so, maybe I'll do a similar write-up
once I've made some progress on the project.
I will say that this has already made me a much better programmer. Not using any
libraries means you cannot rely on Stackoverflow to solve your cookie-cutter
problems. Since you're laying the foundations of a larger system, you cannot rely
on hacky bug fixes, you need to get down to the root cause. Successfully solving
challenging problems in this context has been a real boost to my confidence in my
skills.
At the same time, it was a really humbling experience. When you have a high-level
understanding of how compilers work, it's easy to fall into thinking that they are
simple. This illusion is shattered as soon as you get into the trenches and fight
with all the little details.
All in all, this has been a very rewarding project so far, and I'd recommend doing
something similar to anyone even remotely interested. I certainly don't intend to
quit anytime soon.