wxsBSD/bf2y.md

## bf2y.md

      
    Raw
  

              bf2y.md
            
          
    bf2y: Brainfuck to YARA

bf2y is a brainfuck compiler which targets the YARA virtual machine. You can get it from my bf2y branch at
https://github.com/wxsBSD/yara/tree/bf2y.
Requirements

You will need python3 and ply (https://pypi.org/project/ply/)
The Toolchain

The toolchain comes in two steps. The first is a brainfuck compiler, which generates bytecode suitable for execution on
the YARA VM. The second step is a C program which loads the bytecode into memory and executes it on the YARA VM.
The compiler is called bf2yc.py and the C program which actually executes the generated code is bf2y.
Building

Building the toolchain is just the normal build process (./configure && make). Once it's built you should have a bf2y
binary.
Examples

There are two example brainfuck programs included: cat.bf and helloworld.bf. You can execute them like this:
wxs@wxs-mbp yara % python3 bf2yc.py cat.bf cat.bin
wxs@wxs-mbp yara % ./bf2y cat.bin
A
A
B
B
wxs@wxs-mbp yara %

For this example you can break out of the loop by hitting ^D.
wxs@wxs-mbp yara % python3 bf2yc.py helloworld.bf helloworld.bin
wxs@wxs-mbp yara % ./bf2y helloworld.bin
Hello World!
wxs@wxs-mbp yara %

How Does It Work (The Compiler)

Brainfuck is a simple language which operates on memory cells and a pointer. The pointer can be moved forward and backward
between the cells, one cell at a time. The contents of those cells can be incremented or decremented. Brainfuck also
supports a print operator, which prints the contents of the current cell, and an input operator which reads input into the
current cell.
These are the 8 brainfuck commands:
> - Move the pointer to the right
< - Move the pointer to the left
+ - Increment the memory cell at the pointer
- - Decrement the memory cell at the pointer
. - Output the character at the pointer
, - Read a character of input and store it at the pointer
[ - Jump past the matching ] if the cell at the pointer is 0
] - Jump back to the matching [ if the cell under the pointer is non-zero.

The YARA VM (the thing that executes your condition statements) is a stack based virtual machine with a handful of
registers and memory addresses. I decided to use the zero'th memory address in the YARA VM as the brainfuck pointer and
left the rest of the addresses for the brainfuck program to use.
With this understanding the bf commands can be translated into YARA bytecode as follows:
> - OP_INCR_M 0                 ; incremement YARA mem[0]

< - OP_PUSH -1                  ; push -1 on the stack
    OP_ADD_M 0                  ; add the value on the stack to YARA mem[0]
    
+ - OP_BF_PUSH_P                ; push the value at mem[mem[0]] onto the stack
    OP_PUSH 1                   ; push 1 on the stack
    OP_ADD                      ; add the two values on the stack and store the result on the stack
    OP_BF_POP_P                 ; pop the value off the stack into mem[mem[0]]
    
- - OP_BF_PUSH_P                ; This operates the same as + but it adds -1
    OP_PUSH -1
    OP_ADD
    OP_BF_POP_P
    
. - OP_BF_OUTPUT                ; Print the character at mem[mem[0]]

, - OP_BF_INPUT                 ; Read a character from stdin to mem[mem[0]]

The two remaining commands ([ and ]) are interesting because they require fixups to happen. These fixups are stored in a
list by the compiler. We also use a stack to make sure we have the matching brackets lined up and to ensure we have no
extra brackets when we are done.
When a [ is found by the compiler it records the offset into the bytecode for the instructions it is about to emit and
then emits the following:
OP_BF_PUSH_P                    ; Push mem[mem[0]] onto the stack
OP_JFALSE_P 0                   ; Jump to 0 if value on the stack is false

The "0" emitted into the bytecode is just a placeholder for the fixup that will happen later.
When a ] is encountered the compiler first makes sure there is a corresponding [ on the list and then records the
offset of the bytecode it is about to emit, which is:
OP_BF_PUSH_P                    ; Push mem[mem[0]] onto the stack
OP_JTRUE_P 0                    ; Jump to 0 if value on the stack is true

Once all the bytecode is emitted the compiler prepends it with a list of fixups that need to happen before it can be
executed.
How Does It Work (The Runtime)

bf2y is very minimal. It takes a file as an argument, which is supposed to be the output from the compiler. It maps that
file into memory and then does the fixups needed.
The fixup structure look like this:
DWORD number_of_fixups
DWORD lb_offset
DWORD rb_offset

If there are no fixups there will be 0 in number_of_fixups and there will be no lb_offset or rb_offset value.
The runtime then calculates where the start of the code is (immediately following the fixups, if any) and then uses the
lb_offset and rb_offset values to calculate the correct memory addresses to replace the "0" inserted by the compiler.
This step is necessary because the compiler doesn't know where in memory the bytecode will be mapped and YARA does not
support relative jumps.
After the fixups have been applied bf2y creates fake YARA context and rules structures. These are needed because the
YARA virtual machine relies on them existing. Once the fake structures are allocated the start of the bytecode (which we
just fixed up) is injected into the fake rules and yr_execute_code() is called directly.
Hacks

I had to do some hacks to get this to work, mostly because I wasn't interested in generating a complete YARA rule set
similar to what yarac generates. There are a lot of things in there that are not necessary for my goal of compiling
brainfuck to YARA bytecode.
I had to expose yr_execute_bytecode() as part of the YR_API.
I had to add 4 new YARA instructions: OP_BF_PUSH_P, OP_BF_POP_P, OP_BF_OUTPUT, and OP_BF_INPUT. The I/O functions
are obvious and unavoidable. The fact that I'm using mem[0] as a layer of indirection requires that there be two new
instructions since I couldn't figure out a nice way to do it otherwise.
Finally, when the execution is complete YARA attempts to unload all objects - which ends up using a hashtable. Rather than
allocate a hash table and put it in the fake context I just modified the unloading code to see if that hash table was
null.
Improvements

The amount of memory available to brainfuck is quite limited. There are, I think, 8 memory addresses in the YARA VM. Each
holds a YARA_VALUE which is a union of a bunch of different types, but is ultimately 64 bits wide. One area for
improvement is to find a way to change the brainfuck pointer so it is pointing to each individual byte in the YARA memory
addresses instead of just moving 64bits at a time.
Another area for improvement is to remove the OP_BF_PUSH_P and OP_BF_POP_P operations and find a way to do it with
native YARA instructions. It was late at night when I made this change and I haven't revisited it yet.
Lastly, if bf2yc.py generated a fully compliant YARA rules file we could eliminate bf2y entirely and just let YARA
load the compiled rules and execute them. Doing this is left as an exercise for the reader...
Why?

Honestly, I needed a break from other things and this had been in my head for quite a while. It has absolutely no use I
can see, but it was fun to think through it as I built it. Sometimes I just like to do things purely for my own
amusement...