Skip to content

Instantly share code, notes, and snippets.

@molenzwiebel
Last active November 3, 2017 23:58
Show Gist options
  • Save molenzwiebel/2839f812b1d795ac22ee26a49d8e82ba to your computer and use it in GitHub Desktop.
Save molenzwiebel/2839f812b1d795ac22ee26a49d8e82ba to your computer and use it in GitHub Desktop.
Flash

Flash - Blazing Fast Brainfuck JIT

This will most likely be the fastest brainfuck runner in assembly that a student will ever hand in. This may sound over-confident, but I am certain that it is almost impossible to optimize brainfuck even more without implementing special cases for common constructs. I encourage you to read the code (and the accompanying comments). Although it is over 1000 lines of assembly, the comments give a lot of the information also outlined in this document.

Flash is an optimizing JIT compiler. JIT, or Just-In-Time, refers to the technique where the code to be ran is translated into machine instructions on the fly, which are then written to an executable block of memory and directly interpreted by the CPU. This is similar to how normal compilers work, but instead of writing the resulting instructions to an executable they are written and executed directly.

Program Flow

Flash goes through the following steps, in order:

  1. Construct And Optimize Code
    • Build IR
    • Optimize Minor
    • Compute Offsets
    • Unroll Loops
    • Detect If-Statements
  2. Generate Machine Code
    • Calculate Executable Size
    • Write Instructions
  3. Run Machine Code
    • Prepare Values And Registers
    • Jump To Allocated Executable Block

All of these steps are further explained below.


0. Building The IR

Flash starts by building an IR, or intermediate representation, of the brainfuck source code. This is done by simply looping over the initial set of instructions and creating an array of the following structs:

struct ir_node {
    unsigned char opcode;
    char coefficient;
    short offset;
} __attribute__((packed));

In this initial IR building loop, found in build_representation, subsequent +/- and >/< are folded into a single IR node and any invalid brainfuck instruction is ignored. The total value of the collapsed instructions, or the coefficient, is stored in the second byte of the IR representation. The last two bytes, the offset, are simply left as 0 on the initial iteration. For opcodes, see the table below. This initial building loop also recognizes the [-] construct and emits the WRITE_CELL opcode instead.

After this initial IR node array has been built, subsequent functions look at this array and perform potential optimizations, which are detailed below.

Value Name Description
1 OP_NOP Does nothing. Exists to prevent having to resize arrays when we change/remove instructions due to optimization.
2 OP_WRITE A simple addition or subtraction to a cell. p[offset] += coefficient;
3 OP_MOVE A simple pointer move. Optimized out after offset calculation. p += coefficient;
4 OP_READ A single character read from stdin (the , instruction).
5 OP_PRINT A single character write to stdout (the . instruction).
6 OP_LOOP_OPEN The start of a loop ([).
7 OP_LOOP_CLOSE The end of a loop (]).
8 OP_WRITE_CELL A direct write to a cell, instead of an addition or subtraction. Mainly used in zeroing loops ([-]). p[offset] = coefficient;
9 OP_MULT_WRITE A write to the cell at the specified offset by multiplying the cell at index 0 with a constant value. Used in unrolling loops. p[offset] += p[0] * coefficient;
10 OP_UNROLL_LOOP_BEGIN The start of an unrolled loop. See loop unrolling for more info.
11 OP_UNROLL_LOOP_END The end of an unrolled loop. See loop unrolling for more info.
12 OP_CELL_ADD A special case for OP_MULT_WRITE when coefficient == 1, so that we don't emit a useless imul.
13 OP_IF_END The end of a loop (]), where it can be statically proven that the loop only runs once. Different from LOOP_CLOSE in that it does not emit the checks to potentially loop back.
14 OP_UNROLL_SET A conditional version of OP_WRITE_CELL, used in unrolled loops. Equivalent to if (p[0]) p[offset] = coefficient; in C.
15 OP_POINTER_SEEK A single instruction variant of pointer seeks (loops similar to [>>] that search for the first empty cell). Emits slightly better assembly.

1. Initial Optimizations

Before computing the offsets, we first do some initial optimizations, as seen in pre_offset_optimizations. The following simple optimizations are done:

  • Convert OP_WRITE and OP_MOVE with a coefficient of 0 into OP_NOP. This should be self explanatory.
  • Convert an OP_WRITE_CELL followed by an OP_WRITE into a single OP_WRITE_CELL. This simply converts p[x] = 0; p[x] += 3; into a single p[x] = 3;, which is slightly faster and allows us to do more reasoning later.
  • Remove any OP_WRITE_CELL that is preceded by an OP_LOOP_CLOSE. It should be evident that after a loop, the cell at index 0 is currently guaranteed to be 0. As such, the construct ][-] effectively contains a useless zeroing loop that can be removed.
  • Find any sequence of OP_LOOP_OPEN, OP_MOVE, OP_LOOP_CLOSE and turn it into an OP_POINTER_SEEK. A pointer seek loop, or [>], is a loop that will search for the nearest empty cell. Since they are a fairly common construct too, we emit special code. As such, we need to find them, which we do here since OP_MOVE instructions will get nopped out in the next step.

2. Computing Offsets

Rationale:
When there is a long streak of brainfuck instructions that only write to cells and move the cell pointer, it is advantageous for pipelining if we do not move the pointer at all but simply supply offsets to the generated instructions instead. This way we prevent data dependencies between the instructions and we emit less move instructions.

Do note that before any kind of loop iteration, whether it is a [ or ], we will still need to emit any potential pointer moves we had queued up. This is because we cannot guess how many times a loop will run.

Effect:
If we cache pointer movements the following code, ++>+<<<->>>[>+, is translated into the following pseudo-C:

p[0] += 2;
p[1] += 1;
p[-1] += -1;
p += 3; // move since we are before a loop
while (*p) {
    p[1] += 1;

Implementation:
The assembly function compute_offsets simply loops through the entire construct and writes the current offset to the offset field of every instruction (regardless of whether that instruction actually uses it). If it encounters an OP_MOVE, it adjusts the current offset and then turns the move into a NOP. The offset field for any loop instructions is repurposed to instead store the amount that the pointer itself needs to move, but any instructions within said loop start at 0 again.

The only interesting detail within this function is the handling for OP_POINTER_SEEK instructions. Since the actual generated assembly for this instruction generates a do-while loop, we will always move over one iteration too much. As such, we add the inverse of the move count within the pointer seek to the current offset, as seen below with the code [>>>]+:

do {
    temp = *p != 0;
    p += 3;
} while (temp);
p[-3] += 1; // act at -3 since we always add 3 too much

3. Loop Unrolling/Linearizing/Unfolding/Flattening

Note: The operations described in this section are mostly referred to as unrolling within the code, but all of the terms in the title are used to refer to the same action.

Rationale:
One of the largest optimization grounds is to remove unnecessary loops. One of the most common of these loops is the move/add loop: [->>>+<<<]. It should be clear that this loop adds the value of cell 0 to the value of cell 3, then zeros cell 0. As such, we can express this loop in two different ways:

while (*p) {
    p[0]--;
    p[3]++;
}

// or

p[3] += p[0];
p[0] = 0;

It should be obvious here that the second alternative is much better, since it transforms an O(n) operation into an O(1) operation.

However, we can generalize this loop construct into a single definition: If a loop does not contain any IO and only increments and pointer moves, and if these pointer moves always cancel out, and if the adjustment to cell 0 is always -1, we can prove that a loop only iterates N times, where N is the amount in cell 0. Using this definition, it should be clear that [>++>+++<<<->--+] is also optimizable, since we can statically prove that this loop only runs N times. As such, we can turn that loop into:

p[1] += p[0] * 2;
p[2] += p[0] * 1;
p[-1] += p[0] * -1;
p[0] = 0;

Implementation:
The assembly function possibly_unroll_loop does exactly this. Given the start of a loop, it will iterate through all instructions within the loop. If it sees a subloop, any IO or any other operation that invalidates the above requirement, it will simply return without doing anything. While going through the instructions, it will also keep track of any adjustments made to p[0], to check if they end up being -1.

If the end of the loop is reached, and the last instruction has an offset value of 0 (which indicates that all moves cancelled out), and our total writes to p[0] amount to -1, we have a valid loop. At this point, we loop over the subloop instructions again, this time replacing any +/- with their OP_MULT_WRITE equivalents instead. The loop opening and closing node opcodes are also replaced by their unrolled variants.

Cell Writing And Unrolling:
Do note that the possibly_unroll_loop function is slightly different from the above specification. It has a special case for OP_WRITE_CELL instructions, if the cell does not write to p[0]. If such, they can be unrollable as well, but we need to make sure that they only run if p[0] was non-zero (which we cannot guarantee with the default OP_WRITE_CELL since we remove all conditionals). As such, we have a special OP_UNROLL_SET, which emits slightly differently. The code [->+>[-]<<] converts into the following pseudo-c:

p[1] += p[0];
if (p[0]) p[2] = 0;
p[0] = 0;

4. If-detection

Rationale:
If we have a zeroing loop or an unrolled loop, we can prove statically that a cell at a certain index will be 0. If such an instruction is found within a loop, and happens to adjust the index that the loop iterates on, we can convert the while loop into an if loop instead:

Take [>+<[-]]. Naively, this would be emitted as

while (p[0]) {
    p[1] += 1;
    p[0] = 0;
}

Instead, we can see that the cell at index 0 is zero'd and emit an if instead:

if (p[0]) {
    p[1] += 1;
    p[0] = 0;
}

The effects of this optimization are minor, but they add up in the end.

Implementation:
The possibly_convert_to_if assembly function works fairly similar to the loop unrolling one. A single, cursorly loop is done through the instructions to see if the cell at index 0 is zero'd. A simple valid flag starts at 0 and is set to 1 if an OP_WRITE_CELL or OP_UNROLL_LOOP_BEGIN is encountered, and back to 0 if a write to index 0 is detected.

If at the end of the loop the valid flag is true, the closing ] is replaced with an OP_IF_END, which emits slightly different machine-code, most importantly not emitting a jump and a comparison.


JIT

The actual JIT is easier than might be expected. To make a jit, one only needs to allocate an executable block of memory using mmap, then write instructions to said block, and finally make the processor jump to the block, which will cause it to interpret the data as instructions.

To compute the size of the allocated block, we simply use a lookup table (CODE_SIZES), as seen in compute_jit_size. For every instruction, we have it's length defined as the "worst-case" value for that particular instruction, so we never go over the allocated size.

Once the allocated block is created, compile_bf does a simple loop over the optimized instruction tree and emits the correct bytecode for every instruction. The generated bytecode uses the following registers:

  • rbx: The current pointer (p in all examples).
  • r12: The address of clib putchar (used for printing).
  • r13: The address of clib getchar (used for reading).

The function run_compiled_bf will make sure that the context, including these registers, is appropriate. After doing this, it will cause the processor to jump to the block of memory.

Now follows a brief overview of what every instruction compiles into. The OFFSET value is the 16-bit byte-extended value of the offset field, the VALUE field is the 8-bit value inside the coefficient field:

OP_WRITE

addb VALUE, OFFSET(%rbx)

OP_READ

call *%r13
mov %al, OFFSET(%rbx)

OP_PRINT

movzbl OFFSET(%rbx), %edi
call *%r12

OP_LOOP_OPEN

leaq OFFSET(%rbx), %rbx # only emitted if OFFSET != 0
cmpb $0, (%rbx)
je <offset> # offset is patched later by the loop/if close

OP_LOOP_CLOSE

leaq OFFSET(%rbx), %rbx # only emitted if OFFSET != 0
cmpb $0, (%rbx)
jne <offset> # offset is the difference between current addr and [.

OP_WRITE_CELL

movb VALUE, OFFSET(%rbx)

OP_MULT_WRITE

movb VALUE, %eax
imul %edx, %eax # edx contains the value of p[0], see table for more details
add %al, OFFSET(%rbx)

OP_UNROLL_BEGIN

leaq OFFSET(%rbx), %rbx # only emitted if OFFSET != 0
movzbl (%rbx), %edx # mult write expects value in edx

OP_UNROLL_END

movb 0, (%rbx)

OP_CELL_ADD

addb %dl, OFFSET(%rbx) # dl contains the value of p[0], see table for more details

OP_IF_END

leaq OFFSET(%rbx), %rbx # only emitted if OFFSET != 0
# also patches up the address for the opening instruction

OP_UNROLL_SET

movb VALUE, %al
movzbl OFFSET(%rbx), %ecx
testb %dl, %dl # if dl isn't 0
cmovbne %al, %cx # then we move the constant value into cx
movb %cl, OFFSET(%rbx) # cl is old if dl == 0, else cl is constant

OP_POINTER_SEEK

leaq OFFSET(%rbx), %rbx # only emitted if OFFSET != 0
cmpb $0, (%rbx)
leaq VALUE(%rbx), %rbx
jne -13 # total size for this snippet is 13 bytes, move back to start

Final Notes And Remarks

That's everything! If you have any questions, please do not hesitate to send me a message (my NetID is tmolendijk). Also provided in this gist is a sample C file that will dump the internal representation as pseudo-C. It is not runnable in all cases and lacks the initial setup and teardown at the end, but it is a great tool to display how Flash interprets a certain piece of brainfuck.

In my virtual machine, on the default TU-delft student laptop project hardware, Hanoi runs in 0-4ms, which is a huge accomplishment considering I started at 30s with a simple interpreter loop. For testing purposes, I confirmed that the compiler works with all programs in this sample benchmark suite. I'd love to know how fast my JIT was on real hardware (even if that hardware might be slower), so please publish all the numbers!

If possible, I'd like to publish this project on github once every brainfuck compiler/interpreter has been tested. Is that okay?

#include "stdio.h"
#include "stdlib.h"
enum opcode {
OP_NONE = 0,
OP_NOP,
OP_WRITE,
OP_MOVE,
OP_READ,
OP_PRINT,
OP_LOOP_OPEN,
OP_LOOP_CLOSE,
OP_WRITE_CELL,
OP_MULT_WRITE,
OP_UNROLL_LOOP_BEGIN,
OP_UNROLL_LOOP_END,
OP_CELL_ADD,
OP_IF_END,
OP_UNROLL_SET,
OP_POINTER_SEEK
};
struct ast_entry {
unsigned char opcode;
char extra;
short offset;
} __attribute__((packed));
extern unsigned char* build_representation(const char* a);
extern void pre_offset_optimizations(unsigned char* buf);
extern void compute_offsets(unsigned char* buf);
extern void possibly_unroll_loop(struct ast_entry* buf);
extern unsigned char* compile_bf(unsigned char* buf);
extern void possibly_convert_to_if(struct ast_entry* buf);
void print_extended_ast(struct ast_entry* entry) {
if (entry->opcode == OP_WRITE) {
printf("p[%hd] += %hhi;\n", entry->offset & 0xFFFF, entry->extra & 0xFF);
} else if (entry->opcode == OP_READ) {
printf("p[%hd] = getchar();\n", entry->offset & 0xFFFF);
} else if (entry->opcode == OP_PRINT) {
printf("putchar(p[%hd]);\n", entry->offset & 0xFFFF);
} else if (entry->opcode == OP_LOOP_OPEN) {
if (entry->offset != 0) printf("p += %hhi; ", entry->offset & 0xFFFF);
printf("%s (p[0]) { \n", entry->extra == -1 ? "if" : "while");
} else if (entry->opcode == OP_LOOP_CLOSE) {
if (entry->offset != 0) printf("p += %hhi; ", entry->offset & 0xFFFF);
printf("}\n");
} else if (entry->opcode == OP_WRITE_CELL) {
printf("p[%hd] = %hhi;\n", entry->offset & 0xFFFF, entry->extra & 0xFF);
} else if (entry->opcode == OP_MULT_WRITE) {
printf("p[%hd] += p[0] * %hhi;\n", entry->offset & 0xFFFF, entry->extra & 0xFF);
} else if (entry->opcode == OP_UNROLL_LOOP_BEGIN) {
if (entry->offset != 0) printf("p += %hhi; ", entry->offset & 0xFFFF);
printf("\n");
} else if (entry->opcode == OP_UNROLL_LOOP_END) {
printf("p[0] = 0;\n");
} else if (entry->opcode == OP_CELL_ADD) {
printf("p[%hd] += p[0];\n", entry->offset & 0xFFFF);
} else if (entry->opcode == OP_IF_END) {
printf("}\n");
} else if (entry->opcode == OP_UNROLL_SET) {
printf("if (p[0]) p[%hd] = %hhi;\n", entry->offset & 0xFFFF, entry->extra & 0xFF);
} else if (entry->opcode == OP_POINTER_SEEK) {
if (entry->offset != 0) printf("p += %hhi; ", entry->offset & 0xFFFF);
printf("do { p += %hhi; } while (p[0]);\n", entry->extra & 0xFF);
}
}
int main(int argc, char** argv) {
FILE *f = fopen(argv[1], "r");
fseek(f, 0, SEEK_END);
long size = ftell(f);
rewind(f);
char* code = calloc(1, size + 1);
fread(code, size, 1, f);
unsigned char *repr = build_representation(code);
unsigned char *repr2 = repr;
pre_offset_optimizations(repr);
compute_offsets(repr);
struct ast_entry* entries = (struct ast_entry*) repr;
int len = 0;
while (entries->opcode) {
if (entries->opcode == OP_LOOP_OPEN) possibly_unroll_loop(entries);
len++;
entries++;
}
entries = (struct ast_entry*) repr;
while (len >= 0) {
struct ast_entry* entry = &entries[len - 1];
if (entry->opcode == OP_LOOP_OPEN) possibly_convert_to_if(entry);
len--;
}
int depth = 0;
while (*repr) {
struct ast_entry* entry = (struct ast_entry*)repr;
if (entry->opcode == 7 || entry->opcode == OP_IF_END) depth += 2;
if (entry->opcode != 1) {
printf("%*s", depth, "");
print_extended_ast(entry);
}
if (entry->opcode == 6) depth -= 2;
repr += 4;
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment