Skip to content

Instantly share code, notes, and snippets.

Last active May 8, 2021
What would you like to do?
Tails! A tiny Forth core written as a hack for May Forth 2021, using some of Wasm3's secret optimization sauce.
// Copyright (C) 2021 Jens Alfke. All Rights Reserved.
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// See the License for the specific language governing permissions and
// limitations under the License.
TAILS is a very minimal, incomplete Forth interpreter core, made as a hack on May Fo[u]rth 2021
by me, Jens Alfke (@snej on Github.)
What can it do? Not much. It knows how to add and multiply integers!! and call functions!!!1!
As proof, the test program in this source file evaluates ((4 + 3) ^2 * 2) ^2.
It can't even load external programs; the program has to be hardcoded as a C++ static array.
But it's miniscule! Only 172 bytes, and some of those are NOPs I couldn't get rid of.
And it's very easy to extend -- see the to-do list at the end.
The reason I wrote this, besides to fulfil a decades-old dream of building a working Forth
interpreter, is to apply the really elegant and efficient implementation technique used by Wasm3
(a WebAssembly interpreter), which is probably the fastest pure non-JIT interpreter there is.
See below under "Performance".
In Forth, "threading" refers to the unusual way the interpreter runs, not anything to do with
concurrency. Forth code consists of a list of "word" (function) pointers, which the interpreter
dispatches to one after the other. [1]
Tails uses "direct threading", where each address in the code is literally a pointer to a native
word. That makes dispatching as fast as possible. However, it wouldn't be a real language if
you couldn't define functions in it, so there has to be a way to call a non-native word!
This is done by the CALL word, which reads a Forth word address out of the instruction stream
and calls it. The downsides are that calling Forth words is somewhat slower, and Forth words
take up twice as much space in the code.
(The opposite approach is called "indirect threading", where there's another layer of indirection
between the pointers in a word and the native code to run. This adds overhead to each call,
especially on modern CPUs where memory fetches are serious bottlenecks. But it's more flexible.)
Tails was inspired by the design of the Wasm3 interpreter[2], which cleverly uses tail calls and
register-based parameter passing to remove most of the overhead and allow C (or C++) code to
be nearly as optimal as hand-written. I realized this would permit the bootstrap part of a
Forth interpreter -- the part that implements the way words (functions) are called, and the core
words -- to be written in C++.
It did turn out to be as efficient as Steven Massey promised, though only with the proper
build flags. For Clang, and I think also GCC, they are:
-O3 -fomit-frame-pointer -fno-stack-check -fno-stack-protector
The key parts are enabling optimizations so that tail calls become jumps, and disabling stack
frames so functions don't unnecessarily push and pop the native stack.
For example, here's the x86-64 assembly code of the PLUS function, compiled by Clang 11:
3cd0 movl (%rdi), %eax ; load top of data stack into eax
3cd2 addl %eax, 0x4(%rdi) ; add eax to second data stack item
3cd5 addq $0x4, %rdi ; pop the data stack
3cd9 movq (%rsi), %rax ; load next word pointer into rax
3cdc addq $0x8, %rsi ; bump the program counter
3ce0 jmpq *%rax ; jump to the next word
Storing the top-of-stack in a register/parameter would be even more optimal, I bet.
Wasm3 doesn't use the native stack; instead it keeps its own call stack. This is probably more
optimal, but I haven't done it yet simply because it's more work.
* Add an explicit interpreter call stack(?)
* Add memory that programs can read/write
* Add metadata for words: their name, attributes and code pointer.
* Collect the metadata into a structure so a word can be looked up by name (regular Forths use a
linked list.)
* Define a bunch more core words in native code -- SWAP, ROT, `-`, ...
* Define the basic lexer to read words from stdin
* Define the all-important `:` and `;` words, so words can be defined in Forth itself...
C VS C++
I don't write in C. But this code uses very few C++ specific features, so with a few changes it
could compile as C.
* The constructors of Instruction -- you could remove them, but the function literals
(`Square`, `Program`) would become uglier since you'd have to use C to initialize the unions.
* The stack is declared as a `std::array`, but you could easily change that to a C array.
#include <array>
#include <stdlib.h>
// Magic function attributes!
// Note: The `musttail` attribute is new (as of early 2021) and only supported by Clang.
// Without it, ops will not use tail recursion _in unoptimized builds_, meaning
// the call stack will grow with every word called and eventually overflow.
// <>
#if defined(__has_attribute) && __has_attribute(musttail)
# define MUSTTAIL [[clang::musttail]]
# define MUSTTAIL
#define ALWAYS_INLINE [[gnu::always_inline]]
#define NOINLINE [[gnu::noinline]]
// If ENABLE_TRACING is defined, a function `TRACE(sp,pc)` will be called after each Instruction.
//======================== INTERPRETER CORE ========================//
/// A native word is a C++ function with this signature.
/// Interpreted words consist of an array of (mostly) Op pointers,
/// but some native ops are followed by a parameter read by the function.
/// @param sp Stack pointer. Top is sp[0], next is sp[1], ...
/// @param pc Program counter. Points to the _next_ op to run.
/// @return The stack pointer. (But almost all ops tail-call via `NEXT()`
/// instead of explicitly returning a value.)
using Op = int* (*)(int *sp, union Instruction *pc);
/// A Forth instruction. Code ("words") is a sequence of these.
union Instruction {
Op native; // Every instruction starts with a native op
int literal; // This form appears after a LITERAL op
Instruction* word; // This form appears after a CALL op
Instruction(Op o) :native(o) { }
Instruction(int i) :literal(i) { }
Instruction(Instruction *w) :word(w) { }
/// Tracing function called at the end of each native op when `ENABLE_TRACING` is defined.
/// \warning Enabling this makes the code much less optimal, so only use when debugging.
NOINLINE static void TRACE(int *sp, Instruction *pc);
# define TRACE(SP,PC) (void)0
// The standard Forth NEXT routine, found at the end of every native op,
// that jumps to the next op.
// It uses tail-recursion, so (in an optimized build) it _literally does jump_,
// without growing the call stack.
#define NEXT() TRACE(sp, pc); MUSTTAIL return pc->native(sp, pc + 1)
/// Calls an interpreted word pointed to by `fn`. Used by `CALL` and `run`.
/// @param sp Stack pointer
/// @param word The first instruction of the word to run
/// @return The stack pointer on completion.
static inline int* call(int *sp, Instruction *word) {
return word->native(sp, word + 1);
// TODO: It would be more efficient to avoid the native call stack,
// and instead pass an interpreter call stack that `call` and `RETURN` manipulate.
//======================== NATIVE OPS ========================//
// ( -> i) Pushes the following instruction as an integer
static int* LITERAL(int *sp, Instruction *pc) {
*(--sp) = (pc++)->literal;
// (a -> a a)
static int* DUP(int *sp, Instruction *pc) {
sp[0] = sp[1];
// (a b -> a+b)
static int* PLUS(int *sp, Instruction *pc) {
sp[1] += sp[0];
// (a b -> a*b)
static int* MULT(int *sp, Instruction *pc) {
sp[1] *= sp[0];
// ( -> ) Returns from the current word. Every word ends with this.
static int* RETURN(int *sp, Instruction *pc) {
return sp;
// (? -> ?) Calls the subroutine pointed to by the following instruction.
static int* CALL(int *sp, Instruction *pc) {
sp = call(sp, (pc++)->word);
//======================== TOP LEVEL ========================//
/// The data stack. Starts at `DataStack.back()` and grows downwards/backwards.
static std::array<int,1000> DataStack;
/// Initializes & runs the interpreter, and returns the top value left on the stack.
static int run(Instruction* start) {
return *call(DataStack.end(), start);
//======================== TEST CODE ========================//
static Instruction Square[] = {
static Instruction Program[] = {
CALL, Square,
CALL, Square,
/// Tracing function called at the end of each native op -- prints the stack
static void TRACE(int *sp, Instruction *pc) {
printf("At %p: ", pc);
for (auto i = &DataStack.back(); i >= sp; --i)
printf(" %d", *i);
int main(int argc, char *argv[]) {
printf("Program is at %p\n", (void*)Program);
printf("Square is at %p\n", (void*)Square);
int n = run(Program);
printf("Result is %d\n", n);
if (n != 9604)
return 0;
Copy link

snej commented May 8, 2021

that repo again is

or take a trip in time to the original upload of this gist

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment