Skip to content

Instantly share code, notes, and snippets.

@teryror
Last active January 24, 2024 21:06
Show Gist options
  • Star 23 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save teryror/97a96e02a7766c54bef54b509a945b6c to your computer and use it in GitHub Desktop.
Save teryror/97a96e02a7766c54bef54b509a945b6c to your computer and use it in GitHub Desktop.

Crafting a Compiler from Scratch: Implementation Notes

For the past two weeks or so, I've been working on a little compiler project in C, mostly for educational purposes, i.e. to understand how a compiler really works. I'm not using any libraries, other than the C runtime library.

Introduction

I have a hand-written lexer and parser, and a simple code generator targetting the Nintendo Gameboy Advance. I don't support named variables yet, but I do have pointer dereference and assignment, so I can write programs like this:

# Test Program: Collatz Conjecture

0x02000000.* = 27   # Starting number
0x02000004.* =  0   # Step counter

loop {
    if 0x02000000.* & 1 {
        0x02000000.* *= 3
        0x02000000.* += 1
    } else {
        0x02000000.* = 0x02000000.* >> 1
    }
    
    0x02000004.* += 1
} while 0x02000000.* - 1

I'm saying I'm target the GBA, rather than "the ARMv4T instruction set" because I'm only generating valid ROM images, no other types of executable file. I'm also assuming the absence of memory protection featurs, and the GBA's memory map - address 0x02000000 is where EWRAM (external work RAM) is mapped.

As you can see from the snippet above, I have loops and if statements for control flow, but no functions or type checking. Those features are planned, of course, but I wanted to take a step back and talk about what I've learned so far.


I've been primarilly referring to two sources on compiler construction; Crafting Interpreters by Bob Nystrom, and Programming an x64 compiler from scratch, by Per Vognsen.

The former is a freely available, though (as of the time of writing) unfinished book that assumes familiarity with imperative programming, but not much more. It's concise, accessible, and a fun read.

The latter is a series of recordings of Per live streaming development, totaling almost 15 hours of video, with periods of silent thinking, debugging, and refactoring thrown in. There is a whole host of valuable knowledge in there, but in his explanations, Per assumes not only familiarity with various C-specific idioms, but also some amount of familiarity with compiler-specific terminology. It's not the most accessible source, to say the least.

I don't want to retread the ground covered in Crafting Interpreters here, so I'll assume you've read at least the first few chapters. Instead, consider this a complementary work, where I'll try to relate the new information from Per's videos, as well as my own findings.

With all that said, let's get started. We'll work our way through the compiler stages, starting with the lexer:

Lexing in O(1) Space

The lexer we see in Crafting Interpreters works as a precursory pass to parsing; it calls scanToken() until it reaches the end of the file, and then passes an array of all tokens to the parser. This is nice in principle, because having random access into the token stream allows us to parse pretty much any grammar we want.

In practice, it is totally unnecessary; language designers generally frown upon languages that require unbounded lookahead, and some languages, like Lisp, can be parsed without any lookahead at all.

Knowing this, Per opted to only keep the most recently lexed token around, and to call ReadToken() directly from the parser. With this setup, we have a strict, forward streaming interface. If we want lookahead, we have to manually keep track of previous tokens in local variables.

I didn't want to do that either, though, because I knew I'd need at least two tokens of lookahead for the grammar I wanted, and wouldn't be able to leverage the call-stack to cleanly manage that state. As a compromise between the two methods, I implemented a ring buffer of tokens, which would be refilled completely whenever the parser tried to lookahead farther than had been lexed.

static Token lookahead(TokenStream * tokens, u32 offset) {
    // We want LookaheadBufferSize to be a power of two so that the modulo
    // reduction of ring buffer indices can be optimized to a bitwise AND.
    StaticAssert(IsPowerOf2(LookaheadBufferSize));
    
    Assert(offset <= MaxTokenLookahead);
    Assert(tokens->read_cursor <= tokens->write_cursor);
    
    u32 index = tokens->read_cursor + offset;
    if (index < tokens->write_cursor) {
        return tokens->lookahead_buffer[index % LookaheadBufferSize];
    }
    
    { // Refill lookahead buffer:
        while (tokens->write_cursor - tokens->read_cursor < LookaheadBufferSize) {
            u32 write_index = tokens->write_cursor++ % LookaheadBufferSize;
            Token * write_ptr = &tokens->lookahead_buffer[write_index];
            bool stop_lexing = read_token(tokens, write_ptr);
            
            if (stop_lexing) break;
        }
        
        Assert(tokens->read_cursor <= tokens->write_cursor);
        Assert(tokens->write_cursor - tokens->read_cursor > offset);
    }
    
    return tokens->lookahead_buffer[index % LookaheadBufferSize];
}

If I really wanted to micro-optimize this, I could tweak LookaheadBufferSize. For now, I arbitrarilly chose 32, but the order of magnitude might really matter for parsing speed: make it too large, and you won't be reading tokens from the data cache; make it too small and you'll be refilling the ring buffer very often, constanly switching between lexer and parser code, and potentially thrashing the instruction cache.

Realistically, it doesn't matter that much. Compilers generally don't spend the majority of their time in lexing or parsing; if it does end up being a problem, profiling will show that, and I'll have one avenue of optimization figured out already.

Lexing with Unicode Support

While this is strictly a research project to me, I want to learn what goes into a production-grade compiler, and one thing I'd definetely want there is clean handling of Unicode.

I figured requiring source files to be UTF-8 encoded would be okay, but what if a file was using a different encoding? Just producing encoding errors, or worse, producing (and attempting to parse) garbage tokens would make for a, shall we say, interesting user experience.

So, when we begin lexing a file, before we even fill the lookahead buffer with tokens, we do this:

{ // Detect Unicode encodings other than UTF-8
    const char *EncodingError =
        "Encoding Error in '%s': source files must be UTF-8! (Detected %s)\n";
    
    if (file_size >= 4) {
        const char *UTF_32_BE_BOM = "\x00\x00\xFE\xFF";
        const char *UTF_32_LE_BOM = "\xFF\xFE\x00\x00";
        
        if (memcmp(file_memory, UTF_32_BE_BOM, 4) == 0) {
            printf(EncodingError, file_name, "UTF-32 BE byte order mark");
            return true;
        }
        
        if (memcmp(file_memory, UTF_32_LE_BOM, 4) == 0) {
            printf(EncodingError, file_name, "UTF-32 LE byte order mark");
            return true;
        }
    }
    
    if (file_size >= 2) {
        const char *UTF_16_BE_BOM = "\xFE\xFF";
        const char *UTF_16_LE_BOM = "\xFF\xFE";
        
        if (memcmp(file_memory, UTF_16_BE_BOM, 2) == 0) {
            printf(EncodingError, file_name, "UTF-16 BE byte order mark");
            return true;
        }
        
        if (memcmp(file_memory, UTF_16_LE_BOM, 2) == 0) {
            printf(EncodingError, file_name, "UTF-16 LE byte order mark");
            return true;
        }
    }

    if (file_size >= 3) {
        const char *UTF_8_BOM = "\xEF\xBB\xBF";
        
        if (memcmp(file_memory, UTF_8_BOM, 3) == 0) {
            // Skip byte order mark in UTF8 files
            file_memory += 3;
        }
    }
}

In the context of this block, return true basically tells the calling function "something went wrong, don't even bother attempting parsing". This code is adapted from the example code in the "Check for BOM markers" section of this chapter of the Programming with Unicode book.

For the actual UTF-8 decoding, I'll just refer you to this article written by Bjoern Hoehrmann, where he describes a really nice technique based on finite state machines.

There are some corners of Unicode that can cause trouble for programmers, though. Currently I detect zero-width spaces, as well as a few other characters that could be used for code obfuscation. I'm told jslint does similar checks, but I couldn't find a list of sensible characters to forbid. An important one is the Greek question mark (U+037E), which will use the same glyph as the semicolon in most fonts.

Lexing Identifiers and Keywords

In order to do name resolution, you need some way of comparing two identifier tokens for equality. The easiest way to do that is to store a slice (i.e. a pointer and a length) into the file buffer directly on the token. This is bad, because it forces you to keep the file in memory for at least the lifetime of all tokens, which makes reclamation of resources a pain.

You could just copy the string, but then you're making a lot of small allocations, and don't forget that a string compare is quite slow: two pointer dereferences, at least one of which is likely to be a cache miss, and O(n) comparisons.

I picked up the solution to this problem from Jonathan Blow live streaming work on his Jai compiler: string interning. The idea is to maintain a table of unique strings you've seen before, and deduplicating as early as possible. This allows us to replace any and all string comparisons we might otherwise have to do with a single integer comparison, at the cost of doing some upfront work for each identifier token. Since pretty much every identifier in a program will be looked up in some table eventually, this trick is definetely worth it.

Another cool thing is that this allows you to do keyword recognition with the same table lookup: At program startup, intern all your keyword strings, and remember their unique representation. After lexing an identifier, intern it, and compare it to all the keywords. If your "unique representation" is a sequential integer ID, you can even do this with just one branch and one addition:

u32 atom = intern_string(token_start, length_in_bytes);
if (atom < KeywordCount) {
    result->token_type = KeywordIf + atom;
} else {
    result->token_type = IdentifierToken;
    result->value      = atom;
}

That's pretty much free. You have to be careful to intern your keywords in the same order they appear in the TokenType enumeration, though.

Encoding and Emitting Instructions

This topic is pretty trivial if you've worked with assembly language before, but it's not covered in Crafting Interpreters, so here's a 10000ft overview:

Conceptually, you can think of your CPU as a state machine. Its internal state consists of a number of registers, tiny pieces of memory, typically 32 or 64 bits for machines.

One of those registers serves a special purpose: it holds the address of the next instruction to be executed. This register is called, unsurprisingly, the instruction pointer, or sometimes the program counter, depending on the architecture.

Executing an instruction really means changing the state of the processor. We can group different instructions based on how they affect the processor state: arithmetic instructions such as ADD, SUB, or MUL, will move the data from two registers into the arithmetic logic unit, where the result of the operation is calculated, and then move the result into a third register. Memory access instructions such as LDR and STR will move a register-size piece of data from RAM into a register, or vice-versa. The CPU cannot do arithmetic on values in memory, they have to first go through a register. The third important category are branch instructions, which affect the instruction pointer.

Different instructions are systematically assigned binary representations. This underlying system is called the instruction encoding. The x64 CPU inside your desktop machine uses a variable-length encoding, where different instructions take up a varying number of bytes, whereas the ARM CPU inside the GBA uses a simpler fixed-length encoding, where all instructions are the same size.

It makes sense to split the compiler back-end into two parts: the code generator, which walks the abstract syntax tree and determines what instructions should be used, and the instruction emitter, which is just a big collection of functions that encode instructions and insert their binary representation into an output stream.

If you have a suitable output stream abstraction, writing an emitter is mostly mechanical work. Just refer to the instruction set manual for the CPU you're targeting, which should include relevant sections on instruction encoding. Here's a snippet from my emitter:

static inline void
Emit3OperandSum(u16 op, u16 dst_register, u16 src_register, u16 operand) {
    Assert(dst_register < 8);
    Assert(src_register < 8);
    Assert(operand < 8);
    
    EmitRaw(0x1800 | op << 9 | operand << 6 |
            src_register << 3 | dst_register);
}

#define EmitADD_R_R_R(Dst, Src, Operand) Emit3OperandSum(0, (Dst), (Src), (Operand))
#define EmitSUB_R_R_R(Dst, Src, Operand) Emit3OperandSum(1, (Dst), (Src), (Operand))
#define EmitADD_R_R_I(Dst, Src, Operand) Emit3OperandSum(2, (Dst), (Src), (Operand))
#define EmitSUB_R_R_I(Dst, Src, Operand) Emit3OperandSum(3, (Dst), (Src), (Operand))

The idea here is that one function handles one instruction encoding, and macros provide more convenient ways of calling the appropriate function to emit a specific instruction. This is helpful because, in my case, there's also an EmitADD_R_I and EmitADD_R_R, both of which call different functions, for example.

Now imagine about 150 more lines of very similar code, and you know what my emitter looks like.

Generating Code

Now, figuring out which instructions to emit, and in what order, that's a little more challenging. There are many algorithms for doing this with varying degrees of sophistication; I mostly coded along with Per's videos here, so I ended up with something more on the simple side.

Evaluating Expressions

You should already be familiar with the simple tree-walk approach to interpretation:

static u32 eval_expression(const * AstExpr expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            return expr->primary.value;
        } break;
        case BinaryExpression: {
            u32 lhs = eval_expression(expr->binary.lhs);
            u32 rhs = eval_expression(expr->binary.rhs);
            
            if (expr->binary.op == '+') {
                return lhs + rhs;
            } else if (expr->binary.op == '-') {
                return lhs - rhs;
            } /*
             ...
              */
        } break;
    }
}

We can use a perfectly analogous approach to code generation, where every return statement is replaced with some Emit calls, and functions instead return the index of the register that will contain the result of the operation:

static u16 emit_expression(const * AstExpr expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            u16 result = allocate_register();
            
            // MOVes the constant value ("Immediate") into the Register result
            EmitMOV_R_I(result, expr->primary.value);
            return result;
        } break;
        case BinaryExpression: {
            u16 lhs_reg = emit_expression(expr->binary.lhs);
            u16 rhs_reg = emit_expression(expr->binary.rhs);
            
            if (expr->binary.op == '+') {
                // ADDs the value in Register rhs_reg to the value in lhs_reg,
                // and stores the result in lhs_reg
                EmitADD_R_R_R(lhs_reg, lhs_reg, rhs_reg);
            } else if (expr->binary.op == '-') {
                EmitSUB_R_R_R(lhs_reg, lhs_reg, rhs_reg);
            } /* ... */
            
            free_register(rhs_reg);
            return lhs_reg;
        } break;
    }
}

You may have heard scary things about register allocation, but that really only becomes hard when you try to keep a function's local variables in registers - just like the tree walk interpreter becomes more complicated when you add state and statements.

I won't show the implementation here, since it's a lot of bit-fiddling, but the basic idea is to maintain 8 bits of state - one for each register - indicating which registers are occupied with temporary values in the execution context of the instruction we emit next. To allocate a register, you find the index of any zero bit, set it to one, and return the index. To free the register, you mask out the bit at the specified index.

This approach is nice and simple, but it doesn't exactly generate spectacular code. For example, for the expression 1 + (2 + 3), it will generate the following piece of assembly:

MOV r0, 0x1
MOV r1, 0x2
MOV r2, 0x3
ADD r1, r1, r2
ADD r0, r0, r1

At the very least, I'd like to see something more like this:

MOV r0, 0x1
MOV r1, 0x2
ADD r1, 0x3      ; small constants can be encoded in the same instruction!
ADD r0, r0, r1

We can achieve this by delaying code generation for primary expressions until we know more about how they'll be used, i.e. on our way back up the call stack. To do that, we cannot simply return register indices anymore, we need to introduce a new type:

enum { ImmediateOperand, RegisterOperand };
typedef struct {
    u32 type;
    union {
        u32 imm;
        u16 reg;
    };
} Operand;

static Operand emit_expression(const AstExpr * expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            Operand result;
            result.type = ImmediateOperand;
            result.imm = expr->primary.value;
            return result;
        } break;
        case BinaryExpression: {
            Operand lhs = emit_expression(expr->binary.lhs);
            emit_as_register(&lhs);
            
            Operand rhs = emit_expression(expr->binary.rhs);
            
            if (expr->binary.op == '+') {
                if (rhs.type == ImmediateOperand && rhs.imm < 0x100) {
                    // We only have 8 bits to encode immediate operands...
                    EmitADD_R_I(lhs.reg, rhs.imm);
                } else {
                    emit_as_register(&rhs);
                    EmitADD_R_R_R(lhs.reg, lhs.reg, rhs.reg);
                    free_register(rhs.reg);
                }
            } /* ... */
            
            return lhs;
        } break;
    }
}

emit_as_register is a helper function that will do nothing when called with a RegisterOperand, but when called with an ImmediateOperand, it will allocate a register, emit the appropriate load instruction, and modify the Operand to reflect that the value is now held in a register.

This approach is really flexible; we have enough wiggle room to rathole on instruction selection (the above code will call EmitADD_R_I for 1000 + 1, but not for 1 + 1000, for example), and we can get constant folding pretty much for free by checking whether both operands are immediates.

If we restructure our code a bit, such that we do the check on expr->binary.op before recursing to get lhs and rhs, we can even do operator-specific optimizations, such as eliminating the right sub-expression of a multiplication if the left sub-expression is a constant 0, or emitting a SUB instruction for expressions of the shape a + (-b).

Dereferencing Pointers

If it isn't obvious from the code snippet in the introduction, I denote a pointer dereference A.*. I chose this syntax for several reasons, none of which really matter right now; all that matters is that this expression evaluates to the value stored at address A. We can shoehorn that into the emit_expression function from the previous section pretty easily:

case UnaryExpression: {
    Operand addr = emit_expression(expr->unary.operand);
    emit_as_register(&addr);
    
    if (expr->unary.op == OperatorDereference) {
        // Load from the address in addr.reg (offset by 0),
        // and store the value in the same register:
        EmitLDR_R_R_OFF(addr.reg, addr.reg, 0);
    } /* ... */
    
    return addr;
} break;

That does not allow us to store anything, though. For that, we need assignment operators. I don't really like C's idea of assignments being expressions, and wouldn't want users of my language to think of them as such. But if you were to look at my parser, you'd see that I still treat them that way internally.

That's in part because it made my life a little bit easier, but also because it allows for better error reporting. Think Error: assignments are not expressions, rather than Syntax Error: unexpected token '='. This technique of parsing a super-set of the language you intend to actually accept is generally pretty good, and similar techniques can be applied to other parts of a compiler as well.

That said, let's look at two assignment operators, plain-old = and +=. Here, I'll assume that we have not yet emitted code for the two sub-expressions by the time we know we're looking at an assignment:

/* ... */
} else if (expr->binary.op == '=') {
    AstExpr lhs = expr->binary.lhs;
    Assert(lhs.type == UnaryExpression &&
           lhs.unary.op == OperatorDereference);
    
    Operand addr = emit_expression(lhs.unary.operand);
    emit_as_register(&addr);
    
    Operand value = emit_expression(expr->binary.rhs);
    emit_as_register(&value);
    
    EmitSTR_R_R_OFF(value.reg, addr.reg, 0);
    free_register(value.reg);
} else if (expr->binary.op == OperatorPlusEqual) {
    AstExpr lhs = expr->binary.lhs;
    Assert(lhs.type == UnaryExpression &&
           lhs.unary.op == OperatorDereference);
    
    Operand addr = emit_expression(lhs.unary.operand);
    emit_as_register(&addr);
    u16 tmp_reg = allocate_register();
    EmitLDR_R_R_OFF(tmp_reg, addr.reg, 0);
    
    Operand value = emit_expression(expr->binary.rhs);
    emit_as_register(&value);

    EmitADD_R_R_R(tmp_reg, tmp_reg, value.reg);
    EmitSTR_R_R_OFF(tmp_reg, addr.reg, 0);
    
    free_register(value.reg);
    free_register(tmp_reg);
} /* ... */

The code duplication, the fact that assignments are not actually expressions, and don't evaluate to anything (and therefore don't need to return a RegisterOperand), all suggest that we should split this code into a different function, which is in fact what I did in my code.

Factoring style aside, though, the basic = assignment works pretty much like the + and - operators we've seen before, modulo the reaching into lhs to skip the pointer dereference.

The += assignment is a bit more interesting, I think. Because we need both the address and the value at that address, we allocate a temporary register and load into that before generating code for rhs.

Note that I'm omitting the optimization of the addition itself; if value was an ImmediateOperand, we could save another instruction here. Furthermore, we're always emitting load and store instructions with 0 offset; we could pack pointer arithmetic into the same instructions by again reaching into lhs.unary.operand.

Branching and Looping

Now that we have side effects at our disposal, we may become more interested in control flow, i.e. if and while statements. The basic framework will assume that a statement will point to the statement immediately following it; the last statement in a scope has no successor, even if the scope is part of a statement that does:

static void emit_statement(const AstStmt * stmt) {
    switch (stmt->type) {
        case IfElseStatement: {
            // ...
        } break;
        case LoopStatement: {
            // ...
        } break;
        case ExpressionStatement: {
            emit_expression(stmt->expr);
        } break;
    }
    
    if (stmt->next) {
        emit_statement(stmt->next);
    }
}

To figure out what we need to fill in the blanks here, we need to take a look at how control flow works at the machine level: in addition to the instruction pointer, there's another special register called the status register. It's not accessible directly by most instructions. Instead, it holds information about previous operations, such as the carry flag (telling us whether the last arithmetic instruction caused an overflow), the zero flag (which is exactly what you think it is), and a few others.

The CMP instruction is essentially the same as the SUB instruction - except that the result of the operation is discarded. Only these flags are affected. The other piece of the puzzle are conditional branches, which, depending on the status register and the condition encoded in the instruction, may or may not add some constant offset to the instruction pointer. (These are called relative jumps, because the address of the code we want to jump to is encoded relative to the location of the jump instruction itself. There are also jumps to absolute addresses, and variable target jumps, where the target location is read from a general purpose register).

With that, we are equipped to look at the kind of code we want to generate. For instance, for the statement if x % 2 { /* ... */ } else { /* ... */ } we want to generate this structure:

MOV r1, 0x1
AND r0, r1    ; assume x is in r0
CMP r0, 0x0   ; C-style implicit conversion of condition
BEQ else
; ... execute then-branch statements here
B end_if
else:
; ...
end_if:
; ...

There's a problem here: because we don't know how many instructions we'll emit for the two branch bodies, we don't know the locations of the two labels, and thus cannot encode the instructions BEQ else and B end_if. The solution here is called backpatching: we emit a placeholder, and remember it's position. When we reach the label we want to backpatch, we emit the missing instruction.

In my own code, I use a macro called Backpatch for that, which temporarily sets the write cursor to the specified value, emits one instruction, and then restores the cursor. We can write something like this:

case IfElseStatement: {
    Operand cond = emit_expression(stmt->if_else.cond);
    emit_as_register(&cond);
    EmitCMP_R_I(cond.reg, 0);
    free_register(cond.reg);
    
    u16 * beq_else = EmitPlaceholder();
    emit_statement(stmt->if_else.then_branch);
    u16 * b_end_if = EmitPlaceholder();
    
    s16 jmp_offset = JumpOffsetFwd(beq_else);
    Backpatch(beq_else, EmitBEQ_OFF(beq_else));
    
    emit_statement(stmt->if_else.else_branch);
    
    jmp_offset = JumpOffsetFwd(b_end_if);
    Backpatch(b_end_if, EmitB_OFF(b_end_if));
} break;    

Great, now let's do loops. There are a couple different ways to write loops in assembly, and they have slightly different properties. The one you see the most often is this:

B cond     ; pre-check, go to the _end_ to the loop
loop:
; ... loop body goes here
cond:      ; condition is evaluated at the bottom of the loop
MOV r1, 0x1
AND r0, r1
CMP r0, 0x0
BNE loop   ; Branch if Not Equal

To get a post-checked loop, we can just skip emitting the first instruciton, B cond - I'll only show the pre-checked variant here:

case LoopStatement: {
    u16 * b_cond = EmitPlaceholder();
    u16 * loop = emit_cursor;
    
    emit_statement(stmt->loop.body);
    s16 jmp_offset = JumpOffsetFwd(b_cond);
    Backpatch(b_cond, EmitB_OFF(jmp_offset));
    
    Operand cond = emit_expression(stmt->loop.cond);
    emit_as_register(&cond);
    EmitCMP_R_I(cond.reg, 0);
    EmitBNE_OFF(JumpOffsetBkwd(loop));
    
    free_register(cond.reg);
} break;

Now the code I show here works, but if you play around with it a little bit, you'll see that it generates some super awkward code in some cases. Things like jumps to jump instructions - which we could replace with a jump to the actual target, without any indirection.

You could do that with the same "reaching in" trick I've discussed before (i.e. looking at the interior nodes of the AST node you're currently looking at), but there are many different combinations where this can occur, and the necessary checks will get more and more messy.

I haven't actually done this myself yet, but a trick I've learned from a paper (recommended by Fabian Giesen on his blog) is called destination-driven code generation. The idea is similar to the delayed code generation we're using for expressions: We introduce the notion of a control destination that we pass around as we traverse the syntax tree, and delay the backpatching of jump instructions until we know where they should go.

If you've never read a computer science paper before, this one might make a good starting point. There's some scary-looking equations in there, but they are explained in a very nice way.

Buffering Backpatch Locations

So far, the code I've shown is pretty much instruction set agnostic. It applies to an x64 backend just as well as for the THUMB instruction set we've been using. You'd need to replace the emitter and and adjust the calls to any Emit functions accordingly, but other than that, it'll "just work". But now I want to talk about an annoyance that is instruction set specific, and the solution I came up with:

As mentioned before a THUMB instruction is two bytes, and thus the MOV rd, imm instruction you'd use inside emit_as_register can only encode immediate values up to 255 - there's only so much encoding space. What you're supposed to do is use a PC-relative load, LDR rd, off, which loads a 32-bit value from an address close to the instruction pointer.

Since we don't want the CPU to try and execute that data, we need to store it in a suitable location, called a literal pool. That is to say, you generate code until you reach a good place to dump all the values you couldn't fit into the instruction stream, and then backpatch a bunch of load instructions.

Any given expression might require any number of these, so we need to maintain a buffer of backpatch locations that we can push requests into. The question then becomes where to check whether emtiting a literal pool is necessary; if you checked after every single statement, you'd need to emit jump instructions to prevent it from being executed. That's not only ugly, those jumps aren't free, so you're also making the generated program slower.

Instead, you want to flush the backpatch location buffer, as I call it, only in places where you're emitting jumps anyway. In a pre-checked loop, we can put it between the unconditional jump to the condition and the loop body; in an if/else statement, we can fit it in between the unconditional jump at the end of the first branch and the beginning of the second branch.

What's nice about this approach is that we can repurpose it to also handle break and continue statements. What's not so nice is that it doesn't solve a fundamental robustness problem: just like the size of an immediate value is limited, so is the size of the offset we can encode in a load instruction - there's a hard limit to how many instructions we can emit between putting a location into the buffer, and flushing the buffer. That limit is quite large (about 500 instructions), and we're not likely to run into any problems for simple programs. But there are pathological cases where this will blow up and fail to generate valid code.

I don't have a nice solution for this problem yet, and I don't want to dedicate too much effort to it for now. Going forward, I'd rather work on more language features.

Conclusion

And there you have it, all the interesting things I've done and learned so far. I hope this will be useful to somebody - if so, maybe I'll do a similar write-up once I've made some progress on the project.

I will say that this has already made me a much better programmer. Not using any libraries means you cannot rely on Stackoverflow to solve your cookie-cutter problems. Since you're laying the foundations of a larger system, you cannot rely on hacky bug fixes, you need to get down to the root cause. Successfully solving challenging problems in this context has been a real boost to my confidence in my skills.

At the same time, it was a really humbling experience. When you have a high-level understanding of how compilers work, it's easy to fall into thinking that they are simple. This illusion is shattered as soon as you get into the trenches and fight with all the little details.

All in all, this has been a very rewarding project so far, and I'd recommend doing something similar to anyone even remotely interested. I certainly don't intend to quit anytime soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment