Skip to content

Instantly share code, notes, and snippets.

@paniq
Last active Aug 12, 2022
Embed
What would you like to do?
ad-hoc devnotes (September 2021+)

12.8.2022

Musings on memory allocation and the stack.

Logically, we can model every program as a graph of branching and merging continuations. But in practice, we distinguish between calls and returns. A call provides a return address (a continuation), and a return jumps to that address, without providing yet another address. Also, a call allocates memory on the stack, whereas the matching return frees that memory.

In fact, I suspect that use of the stack is the primary reason why we use this dichotomy, giving us a simple implicit way to operate a function-scoped arena allocator. In addition, we use the (C) convention that the caller provides all stack addresses, for both call and return arguments. A function can not return in a way that leaves new data on the stack - that would prevent it from cleaning up its own temporary stack memory. Instead, it must write to an address provided by the caller. Hence, if we call with a variable sized buffer argument, the task is easy. But if we want to return a variable sized buffer argument on the stack, the caller must perform several steps: a first call computes the size of the buffer to be returned. Then the caller allocates the buffer on the stack, and finally provides its address to the function which is supposed to fill the buffer.

Despite this simple arrangement, program stacks are small. On Linux, we only get 8 MB of stack memory or so. We are supposed to use heap memory for the bulk of data we're fiddling with, and then implement our own allocators there. Of course nothing stops us from implementing application specific stacks there.

The benefit of heap memory, apart from its huge size, is that we can cross-connect pointers across allocations since we're in full control of their lifetimes. On a stack, it is only truly safe for pointers to point downward, to earlier allocations; only this way we can guarantee that no pointer will ever point to invalid memory. A side effect of this rule is that we have topological ordering. We can still support upward pointers that describe cyclical graphs, by guaranteeing that they reside within single allocations that we pop as a whole, e.g. an array.

So, in principle, the stack is "memory complete". We do not need a heap. But the fact that the caller must provide the memory and perform a layout pass (estimate size, allocate, then fill the buffer) can slow down algorithms. For some algorithms, size estimation entails allocating temporary buffers within the function, only to count their size, throw the work away, and then recompute the data we had already computed before. At best this only increases complexity by a factor (2, to be precise), and constitutes negligible overhead, in my opinion. At worst, complexity increases by factor 2 with every nested call, and so we get exponential complexity. That's not so nice.

In many cases, we are likely able to compute upper bounds for memory requirements of a function at compile time, or can even require the program to always be written in a way that makes upper bounds inferrable. For such functions, we can skip the size computations (or generate a function reduced to the fewest necessary operations to compute the size) and provide a buffer that fits.

Now, I wrote all this with purely immutable memory in mind. If we consider mutable memory, perhaps just as a simple optimization where a buffer is replaced with an altered copy of itself, then it is possible that the buffer, should it contain pointers, ends up pointing to memory above it. The altered copy would have been strictly pointing down, but mutating an earlier allocation instead will give us both up- and down-pointers. The only way to avoid this is to always duplicate buffers into which we're inserting references to data that is newer than that buffer. String-like buffers can always be mutated, since they hold no references. This means a pointer tree can only be safely mutated when we swap some of its contents, or insert data from a different pointer tree of same or older age.

Similar caution must be applied when writing references to a return buffer - no reference in that buffer may point to data in the current function's stack. This complicates returning pointer trees: our return buffer must hold space not only for the root buffer, but also all new, locally generated data elements we intend to reference from it. This implies that a buffer is always allocated in the function in which its lifetime ends i.e. where it is not returned from the function. Thus every function requires memory with super lifetime, provided by the caller, for the results it intends to return, and memory for local lifetime, for the results it consumes before it returns (e.g. through a reduction and comparison).

When we use this stack allocation scheme with compile time estimation, it is possible that we waste a lot of memory on conditional returns that return very little on one branch, and a lot on the other. At every turn, the worst case has to be considered, and so allocations could grow much larger than they have to be. Particularly for this class of function, it could be worth it to perform a double pass.

Another concern is finite system resources like handles to files or GPU buffers. By definition, these resources do not point to our stack, and so constitute leaf structures, just like string buffers.

It stands to reason that, if we split up a program into pure functions that produce system commands and continuations, we will need to provide, at the top level of the program, enough memory to hold the system command, the closure environment and everything that environment references, which constitutes the complete state of the program.

It gets tricky when we move from one root continuation to the next. The values returned by the second iteration could be referencing data from the first, and so we need to stack the second iteration on top of the first. It becomes obvious here that without a GC and data introspection, we can never free prior buffers, unless we can guarantee somehow that there are no old references held. It is obvious that we will have similar problems within the program, namely with loops or multiple functions executed in sequence.

Considering these caveats, I wonder if it wouldn't be best to just stick to the present system of refcounted heap allocated buffers, and simplify usage where we can. For example, where we perform refcounting at compile time, we can elide most of the runtime bookkeeping. Where we can prove that a heap allocated buffer fits register size, we can pass it by value. Where we can prove that a mid-sized buffer remains function local, we can allocate it on the stack.

9.8.2022

Three more changes to FLIR that simplify the IR and IR typing:

  1. Support for vectorized operators is being removed i.e. the stride attribute disappears from all instructions that use them, and shuffle, extract, insert and vecmap will be removed entirely. The main reason is that vectorized operations can not even be guaranteed for every target, and writing them is only of interest to specialists - which runs somewhat contrary to the intent of FLIR, which is to keep abstractions at a logical minimum, and make optimization purely the compiler's business. This means that blobs no longer compete with cells when it comes to composite structuring. You want a vector? Use a cell and cellmap. It's the compilers job to figure out later how to optimize these instructions, and it is not difficult to describe code in a way that makes vectorization easy. it turns out gnu/clang C support vector extensions, and so I'm keeping this for a while longer.

  2. System commands will be generated by a new set of instructions, rather than constructed from cells. This allows us to give them a special opaque value type, and infer argument types, making the signature explicit, but also most importantly: we type arguments where the instructions are applied, before they are merged by branches, rather than at the program endpoint, where we only see the merged command structures, which makes typing template closures an unnecessary special case.

  3. The closure ABI changes from concatenating a tuple of arguments to the environment cell (env...) + args -> (env..., args) to always producing a pair of env + args -> (env, args), which makes it easier to maintain structure when env has a size that is not determinable at compile time, and also allows for env to be a blob rather than a cell. Since nested functions in flassembly need to be able to index their environment and arguments statically, the function concatenates args to env... as it usually does when it constructs subsequent calls, which was previously the callers job, but now falls to the callee - should they need it. With this convention, we wouldn't be able to isolate values external to the scope that depend on environment arguments from a different function. Presently that's no problem, because environments stack call arguments for nested closures in a way that is invariant to the subclosure we're in. That would not be the case anymore with this change.

1.8.2022

Okay! I have now made major changes to the FLIR structures that allow for better handling of recursive references. Functions now no longer reference each other directly by content, but by their index in the top level function table, represented by the program instruction. This also plays well into the code generation pass which requires functions to be top level anyway, and is easy to generate in the parsing and syntactical expansion steps.

Specializing this structure now allows us to rewrite individual functions without having to rewrite all dependencies, or generate thunks for functions we're still in the process of typing, since functions are no longer addressed by content.

The next big task is to figure out when and how to merge reentrant function types during specialization. We have two kinds of reentrant structures:

  1. A loopfunction (formerly loopbody) tail-recurses into itself (or out of itself, however you want to see it), fully updating its own environment. These have a cycle budget, so must halt within a fixed time of computing time spent.
  2. A continuation closure consisting of a function (formerly body) and an environment passed to the system and invoked in the next program iteration, often with an additional argument appended to the environment. Since these kind of closures keep the program responsive as they give the system time to suspend/resume, they are allowed to be cyclical. They are head-recursive during typing, so we can detect this type of reentrance with a compile time stack.

For both kinds, for the same function, the type of the environment can change on every iteration. Some elements may be iterating constants, some elements remain variables of constant type, some elements can be variables of varying type.

When the input arguments to a loopfunction are constant i.e. they do not depend on an outer environment, then the environment can be inlined and the loop fully computed to its natural conclusion: the loop is then a compile time loop. For a fully constant continuation closure, the environment can be inlined into the function so that in the closure it is empty.

For varying types, we can iteratively type a loop multiple times, hoping for the types to settle: as soon as we deduce an environment type that has been previously seen, we have hit a cycle in the type itself.

In either case however, it is likely that we never settle, simply because of structures like {N <type>}, where N is an integer that steadily increases. Or we could also generate infinitely nesting types, or growing elements, the list of permutations is endless. The art here is to figure out which elements never change, and to generalize the others in the most useful, least destructive way, i.e. perform some kind of type integration.

I would probably start by picking the most destructive way and then improve where obvious. For loopfunction, this is a local process. For continuation closures, we only employ this technique when we recognize that reentrant typing is taking place:

We merge the initial and the new environment type to produce a more general type that fits both, which increases the likelihood that on the next iteration, a merge results in the same type, which allows us to close the loop - it could possibly take multiple iterations for the loop to generalize successfully, and it would be useful to be able to prove here that generalization always terminates.

For loopfunction, we can then simply use this closed loop as the loops only form. For continuation closures, we could patch the stack so that the topmost reentrant function uses this updated, more generic implementation instead.

The type merging process is not dissimilar to merging branch types, though with branches, we can often afford to build sets, since the number of branches is finite. However with loops, we could potentially end up performing many iterations until sets are complete, and simple numerical sets can be 2^32 big. Hence, this gives us an avenue to optimization levels during compilation: the higher the desired optimization level, the later we give up on a set and generalize it as continuous.

We will, however, never generalize function references (fref) and maintain their sets, which must by definition be finite.

29.7.2022 B

After sleeping on it, I may have the beginning of an approach to reentrant function typing that doesn't suck and extends well to dealing with other types of indirect recursion. I also figured out something about optimizing buffer usage through better instructions, which I'll explain after.

First, let's think about how we're going to port continuation based code to the GPU later on. There is no support for free control flow. That means no goto labels, let alone function pointers. We do have switch cases though, which means that we can encode function pointers as program-level unique integer labels, and at runtime, use a switch case to dispatch to the correct function call. That can create a lot of overhead on divergence, but it runs!

If we don't perform this transformation at code generation time but directly in FLIR form (which is possible, since we have switch cases), we end up being able to elide function pointers entirely: a function reference always appears directly at a call site, and any indirect handling of functions is performed through integer label and switches. By collecting sets of labels as we type expressions, we keep track of their possible values (there can only be as many values in a set as there are functions in the program), and so we can transform apply operations into switches with the exact number of cases required.

In other words, where required, we turn closures into sum types and applications into switch trees. How do we figure out when it is required? As to the continuation recursion issue, it helps during specialization to keep track of the current function reference stack. That is, when we apply or continue function C within function B within function A, this creates a stack of A B C. If function A is now referenced again in function C, then we know that this parameter can not be inlined into the function as it is presently being applied. In other words: it is no longer a body constant, and must become a label - this one is then safe to inline.

There's still one point where we need function pointers: at the edges of the program, where we produce commands with attached continuations. Since we return a sum type instead of a closure (i.e. a pair whose first member is a label), but the exterior dispatcher needs an entry point, we can instead generate a wrapper that performs the call. We can generate this wrapper only after the wrapped function has been typed, since we would just enter another recursion otherwise.

The only corner case here is that recursive calls within the same hierarchy can not be resolved, but FLIR does not permit direct recursion anyway - precisely because that allows more code to be GPU compatible.

Now. On the CPU, resorting to a switch case is not the fastest method, since we do have function pointers available, so here we can instead set up a static LUT (lookup table) that is initialized with the function pointers when the executable starts. When a label is referenced, we can just fetch its index from the LUT then and there. Fortunately, the LUT is already generated by the compiler, and so we can solve this by simple forward declaration: first, all functions are forward declared, then they are implemented, and now every function has access to the function pointer of every other function. Since the mapping of label to function is known at compile time, we can now directly select the correct function.

Since we have now slightly different behavior on CPU and GPU, but need to transform the code during specialization, we need to find the method that is a superset of both. It is easy to start with a simple label transform and permit apply to be called on labels. The GPU code generator can then later also use forward declaration and generate a simple switch case in-place.

We do have the added issue on GPU that unions aren't supported either, but that can be overcome with canonical forward and inverse transforms: we can transform a structure to a canonical form the GPU supports, i.e. an array of words, and then perform an inverse transform to a different structure. The center parts can often be elided and so we get the shortest path transformation between two structures. It's not as fast as casting a pointer, but it's safe.

Alright. Now on to the second topic, which is about optimizing buffer usage. On present hardware, any program benefits greatly from being able to avoid allocations at runtime, and so it should create as few buffers as necessary, possibly so small in their entirety that they could be passed around in registers or on the stack before we need to resort to using the heap. But when we need heap buffers, it would be useful if we could reuse them. Therefore, instructions that "generate" new memory should, where possible, instead be rewritten to reuse existing memory, so that we can effectively limit the use of (re)allocations to a single instruction, which helps with analysis. And this can be done in a pure functional way.

Following exemplary instructions and commands, which produce values greater than the data that went in should change to become operations that do not always provoke implicit allocations, and make reuse explicit:

# now derives `count` from the buffer provided, and potentially moves that buffer when possible
vecmap stride:U32 count:u32 closure:fn(u32) -> vecmap stride:U32 count:<0... x stride> closure:fn(u32)

# we extend the definition of `insert` to permit elements that can be a multiple of `stride`, making this a
  `memcpy`-like function. We also rename it.
insert stride:u32 value:<0... x stride> element:<1 x stride> index:u32 -> <delete>
join a:<0... x u8> b:<0... x u8> ~ overlay stride=1 a:<0... x u8> b:<0... x u8> index:u32

# read some memory potentially reusing the provided buffer, calling the continuation with a buffer of same or smaller size.
: flir_cmd_stdin size:u32 cont:closure -> : flir_cmd_stdin dest:<0... x u8> cont:closure

In addition, we need a way to construct, extend and discard buffers:

# declares a blob with undefined contents of requested count; undefined contents of a buffer will be
  in practice initialized with zeroes, but accessing them will produce undefined buffers as well.
undef stride:U32 count:u32

# an annotation operation that forwards the buffer and declares its contents undefined, allowing for additional
  optimizations, see below.
delete value:<0... x ?>

The join operation can then be implemented as:

join = fn (a b)
    result = overlay 1 (undef 1 (+ (len 1 a) (len 1 b))) a 0
    overlay 1 result b (len 1 a)

We can then later gather all uses of an undef instruction and find, through lifetime analysis, places where buffers are dropped, and chain up these buffers with discard placed inbetween. This chaining can overcome function boundaries, by routing undefined buffers into and out of functions if it benefits reuse.

29.7.2022

I now have support for typing expressions and functions - just the typing, the C types are still generic.

But while typing continuation closures of command buffers I am running into infinite recursions, and that's not fun at all. i'm still chewing on that one.

The problem is that the program data is both immutable and directed acyclic, which means we can't directly patch in a recursive reference, but have to instead pass the self-reference through the function arguments so the function can (indirectly) call itself. Fortunately, this is already done by the user, so we can always be sure that when we see a connected function refer one further up in the continuation order, there has been an explicit body constant passed to its environment.

So, theoretically, all we need to do is recognize this indirect argument and make sure that it is properly typed where it is first passed, but then it is also possible that this by-argument passed function is specialized in multiple ways, which means now we need to patch the environment and call order to bring in more instances. However, this can be simplified a little by just replacing the single body reference with a cell of bodies that resides in the same spot, but performs one more indirection to retrieve the correct instance.

How to implement this mess with an elegant rewriting technique though, I'm not sure yet. Say we have a graph like:

%10 = body
    cell "STDO" "10 GOTO 10\n"
        cell
            cellextract (fenv) 0 0
            cellslice (fenv) 0 1
cell %10 (cell (cell %10)) # the closure

We can see that the closure passes the body to itself so it can construct a closure for itself. Simple deductive typing gives us

%10 = body
    cell "STDO" "10 GOTO 10\n"
        cell
            cellextract (fenv) 0 0 == %10
            cellslice (fenv) 0 1 == (cell (cell %10)) 
cell %10 (cell (cell %10))

The problem is that even though we are able to extract a constant, we can't put %10 into itself there. To make matters worse, typing %10 itself deductively specializes the function but not the captured template, and so we lose the relationship:

%11 = body<(cell (cell %10) (cell))>
    cell "STDO" "10 GOTO 10\n"
        cell %10 (cell (cell %10)) 
cell %11 (cell (cell %10))

Continuing the typing, we just create an infinite recursion:

%11 = body<(cell (cell %10) (cell))>
    cell "STDO" "10 GOTO 10\n"
        %12 = body<(cell (cell %10) (cell))>
            cell "STDO" "10 GOTO 10\n"
                cell %10 (cell (cell %10)) 
        cell %12 (cell (cell %10)) 
cell %11 (cell (cell %10))

22.7.2022

It occurred to me earlier that FLIR isn't really "necessary" as a binary format. It's a specialized intermediate representation that when loaded into a FLIR program itself consists of nothing blobs and cells, so I wonder whether it was really worth it to require a SPIR-V like format. For the compiler, barely anything changes. The interpreter becomes a little more complex.

In a generic format of blobs and cells only, we are able to replace blob ... instructions with just blobs, but each instruction must still be a cell that begins with a blob, and each and every cell must be an instruction. Here, it is technically possible to produce a malformed instruction by beginning with a cell rather than a blob, but a validation layer can catch these.

How the binary format would work here in the simplest case is that for either blob or cell we specify a size in bytes, followed by the data, padded with zeroes at the end for DWORD alignment, without signifying whether the content is of blob or cell type. The header of the file, the first four bytes, is an id, the index of the root node of the graph, with its 4 most significant bits reserved for the node kind, indicating whether the reference is a blob, a cell, or inlined directly into the id. From here, we can type the entire tree. The contents of a blob can be arbitrary, but each cell is an array of id values. As with FLIR, we'd enforce that references can only point back to earlier nodes, so that the root node is at the end of the file, and all nodes in the file are in topological order.

The benefit of this format would be that we can use it for serializing both programs and data.

21.7.2022

On eliding conditional function pointers in pure functional code.

There are effectively two methods: one by inlining, and the other by switch tables. Inlining works as follows:

f = fn (...)
g = fn (...)
(if x f g) args... -> if x (f args...) (g args...)

But this has limited application, since a function that conditionally generates functions would have to be completely inlined as well so we can resolve the condition.

Switch tables operate on a conditional type. if x f g produces a union f|g. If each function in our code has a unique index, then we can assign index F to function f, and index G to function g. At the point of function application, we can insert a switch case to dispatch to the correct function. This method also works in shaders, provided we do not end up with a recursion:

h = fn (x)
    if x F G
switch (h x)
case F: f args...
case G: g args...

20.7.2022 B

Thought I'd quickly describe how closures are implemented in frameloop.

First of all, they aren't. FLIR, the intermediate representation has no concept of closures. It has body (or loopbody for tail-recursive functions), which is an instruction that binds the root of an expression graph that depends on the fenv instruction, i.e. can only be evaluated when fenv is bound to an environment using the apply instruction (apply <body> <env>), which by convention is a cell of arbitrary content - i.e. in the simplest case, just a list of arguments.

But FLIR has also two map-like constructs, one for blobs (vecmap) and one for cells (cellmap), and these do indeed require a "closure", which really is just a cell containing template arguments to use when cellmap is invoking apply internally. We also use this definition of a closure when the system calls back into the program after a I/O operation was requested.

These constraints prescribe the basic format for a closure: it is a cell with two elements, of which the first element is a body value, and the second element is a parent environment of cell type. On application, the argument(s) passed to apply are appended to the parent environment to form the final function environment that fenv will be bound to when the body is evaluated. Or, as a formula:

closure := [body,[x0,...,xN,[body,y0,...,yN]]]
environment := [x0,...,xN,[body,y0,...,yN],args]
closure(args) -> apply(body, environment)

This format for environments provides several affordances, provided we track the lexical depth of a function as level L while generating the IR from text:

  1. Every environment env represents a stack of applied closures, in which a function of level L (which for a top level function is L=0) finds its local environment at index env[L*2] and its local arguments at index env[L*2+1], and can in addition also access all values bound to its parent closures this way.

  2. A function can provide a closure to itself from its environment, simply by truncating the environment to level L*2+1, or as formula:

environment := [x0,...,xN,[body,y0,...,yN],args]
closure := [environment[2*L][0],environment[:2*L+1]]

FLIR does not permit reentrant functions, but it permits self-captures that can be passed as continuation to a function-loop like construction (see 8.7.2022).

  1. Since a closure permits passing extra arguments, these can be used to capture/memoize expressions in the parent function and pass them on into the new environment. This too can be done during parsing, by recording names to be captured, including a cell of captured values in the closure, and then binding those names to cell extractions from a constant location in the next level.

As a result of these affordances, we can conceptually work with runtime closures even though all function bodies defined in FLIR (including loop bodies) are top level functions and are easily translatable to C.

In addition, since we built all these abstractions with cells and pure instructions, an optimization pass can identify unused values and strip/flatten arguments and environments, or even bake additional expressions that only depend on the parent environment right into the closure.

If a function does not require its own environment, then the optimizer can reduce its closure to [body,[]], and the environment of the function will therefore simply be [args], and can be further simplified during code generation.

20.7.2022

A function-loop like continuation interface (as discussed on 8.7.2022) seems to be the best way for Frameloop programs to hand off control to I/O: the program's return value is a command identifier for the I/O operation to perform, followed by arguments, typically including a closure; which is called with the result of the completed operation, which becomes the next entry point into the program.

Presently I solve this through a list of commands, which has the problem that several of them accept callback closures, which can not cooperate on the same state. A continuation engine serializes all operations, and so processing is sequential.

But this means that we need the interface to model parallel I/O operations explicitly. The most generic example would be to permit threading for a program, which is akin to branching: the program supplies two or more closures to be run as parallel threads, as well as a merge closure (a fence), to invoke when all threads have completed, with the results of those threads.

Each thread has its own system I/O (since I/O would be the only reason to perform explicit async operations anyway) and can exit with a result other than just a return code.

The only issue I can see here is hanging - if one of the threads is accidentally caught in an endless I/O loop due to user error, the merge closure will never be reached. This is detectable by the system, as it is managing all threads, and it can present a debuggable error.

It is then possible to write small collector-like components that generate I/O statements and link each other up as continuations, and a bit of macro sugar around this, to get syntax like this:

include "std.flasm"

IO.main
    IO.stdout "program start\n"
    IO.loop () # a macro that binds IO.break
        IO.read 1
            IO.then (s)
                if (is s "\n")
                    IO.block # open a new sequence
                        IO.stdout s
                        IO.break # break from IO.loop
                    #else
                    IO.stdout s
    IO.stdout "done"
    IO.stdout "\n"

How this works is that e.g. IO.stdout does not perform the operation immediately, but instead returns another closure that takes a continuation argument; when that is passed, it produces the final function. this allows us to link up several continuations from back to front in order to build the final sequence.

Consequently applied to the entire program, a pure functional program can build imperative control flow this way, without needing monads - all we require is an impure driver loop that wraps the program and isolates it from side effects. This driver could then interface with the file system, maintain threads, sockets and even perform indirect FFI calls into C.

There are not only benefits, though. This model lacks system-side resource management. The lifetime of monads can be tracked to figure out when they're no longer needed. Here, for every allocated resource, we would require opaque handles returned by the system with a special type whose lifetime can be tracked. It does not matter that these handles belong to a system with side effects, since we only ever return them from the program in order to interact with them, and what happens outside the program doesn't effect its purity.

New tracked handle types still pose a complication to the system. Sometimes though, as in the case of the threading system outlined above, it is possible to avoid adding new types, by aligning the lifetime of the resource with the lifetime of a continuation.

18.7.2022

Now we come to specializing the types of a FLIR program. We start out with polymorphic expressions composed of either of our three high level types: body, blob and cell, and try to lower those to C, depending on the invariants we discover in the types.

To begin with, each of the three types already has a C implementation, but by discovering invariants, we can simplify the type.

Any value is, at first a fl_value, which is a pointer-sized tagged union, consisting of a 16-byte aligned pointer, and the tag encoded in the lower 4 bits.

body is modeled as fl_value (*fl_body_func)(fl_value), but if we know more about the environment or the return type, we can specialize the signature and improve by-value passing.

blob and cell are generally fairly similar: they're both vectors, implemented as piece tables of a single element type, i.e. they join at most log2(n) slices from different chunks for n elements. Where they differ is in their element type: blobs are always byte-sized, whereas cells hold elements of type fl_value.

So here are the invariants we can discover for fl_value:

  • It is either a body (callable) or a piece of data (blob or cell). This can be discovered either by seeing it appear as first argument to apply (when typing upstream), or by the appearance of a body instruction (when typing downstream).
  • If it is a body, we can tell from apply and by downstream-typing its graph what its argument and return type is.
  • We can discover it is constant, i.e., it consists of nothing but cell, body and blob instructions.
  • If it is a piece of data, it is a blob or a cell. We can discover this upstream by it being passed to a single-typed instruction, or downstream, by being returned from a single-typed instruction, starting with the cell or blob constructor.
  • If it is a piece of data, we can discover its minimum size 0 <= Smin and its maximum size Smax <= infinity. If Smin == Smax, then the size is constant. We can also discover that its size always shrinks or grows in a fixed increment (i.e. its stride).
  • If it is a cell, we can discover that all its elements are of the same type (it is an array) or the type of its elements is exactly known (it is a tuple).
  • cells can have a recursive definition, which is always created by loops. This recursive type is polymorphic and always has a terminal/leaf type that indicates the end of the recursion.
  • Some invariants are conditional i.e. they only apply within branches.
  • if and switch produce tagged unions of types, where the tag is equivalent to the condition/switch value. But we can also leave out the tag for less precision and greater generality.

So we land at these annotated types:

  • "..." for a perfectly constant blob.
  • cell C0 ... CN for a perfectly constant cell.
  • sum (tag0 T0) ... (tagN TN) for an explicitly tagged polymorphic value.
  • union T0 ... TN for an implicitly tagged polymorphic value.
  • blob S for a blob of fixed size S.
  • vblob stride Smin Smax for a variadic blob of minimum size Smin, maximum size Smax ("" for infinity) and stride stride.
  • array elementT S for a cell array of fixed element count S.
  • varray elementT Smin Smax for a variadic cell array of minimum element count Smin and maximum count Smax ("" for infinity).
  • tuple T0 ... TN for a cell tuple of fixed element count.
  • vtuple T0 ... TN for a cell tuple of fixed element count, however the last element is used as an extension, so that indices [i] >= N are transformed to [N][i-N]. The last element must be of type varray, vtuple or vblob.
  • let ((name0 T0) ... (nameN TN)) T for a recursive cell type, e.g. let ((CONS (tuple (union (blob 1 0 "") CONS) CONS))) CONS to describe a cons cell, or let ((R (union (blob 4 4 4) (tuple B B))) (B (union (blob 4 4 4) (tuple R R)))) (tuple (blob 1 0 "") B) for a red-black tree with a top level string label.
  • body returnT envT for a body with provided return type and environment type. If the environment type is a tuple, and it is never processed as a whole, then we can elide the tuple and pass values by arguments. I doubt that this brings great performance wins though.

These types can then be lowered to C accordingly, which we'll call the Frameloop ABI:

  • "..." becomes a global string constant.
  • cell becomes a global struct initializer.
  • When a types final size exceeds 16 bytes, its data is moved to the heap and refcounted.
  • sum becomes a struct type with a tag field followed by a union type field.
  • union becomes a struct type with an implicit tag field followed by a union type field.
  • blob becomes a uint8_t[] array. Other element types may be chosen based on the largest possible alignment.
  • vblob becomes a struct type with a size field, followed by a uint8_t[] field. Other element types may be chosen based on the largest possible alignment.
  • array follows the similar rules as blob, but uses elementT for its array type.
  • varray follows the similar rules as vblob, but uses elementT for its array type.
  • tuple becomes a struct type.
  • vtuple becomes a struct type whose last element contains an unsized array.
  • let enforces every named value to become a refcounted heap pointer of opaque struct type.
  • body turns into its equivalent C function pointer type.

17.7.2022 B

Finally! flc (The FLIR compiler) can compile itself!

  1. compiling flc.flir to flc.c by running flc.flir in the FLIR interpreter takes 17 seconds
  2. compiling flc.flir to flc.c by running flc.elf (which is the previously compiled flc.c) takes 8 seconds

We only get about a x2 speed-up because we merely optimize out the interpreter and its conditional branching; the FLIR code itself is not optimized (lots of redundancy at function boundaries), and datatypes are not specialized either (lots of malloc/free and copying).

17.7.2022

Hooray! After some heavy bug fixing, testing.flasm compiles to C and executes for the first time. flc.flasm mostly compiles, but I found a problem in my assumptions: the topological order is in itself conditional; failing to account for that, we get cyclical dependencies on conditions we can not resolve.

Perhaps it's best to start out from the simplest solution instead: do a depth-first walk that forks on each conditional, so we get a DAG of different topological orderings, that is a working (yet 2^n redundant for n conditions) imperative sequence in itself, which readily translates to C.

Many branches in this DAG will have redundant parts, i.e. instructions that appear in both branches, and we will have to figure out how to merge them safely where possible. This needs to be done ASAP, because multiple conditions in series produce an exponential number of branches, which may all copy a common preamble.

15.7.2022

I have now all parts together to get to compact and efficient C function body generation, save for one: merging common conditional blocks. It appears the best way to go about this is to first build a fragmented conditional instruction block tree, and then merge adjacent blocks without changing the overall order.

The entire block tree is built out of these primitives:

BLOCK := 
    let %ID ...
    (%ID = INSTRUCTION) | IF
    ...
    %PARENT-ID... = %ID...
IF :=
    if CONDITIONS: BLOCK
    else: BLOCK

Technically we could declare all variables at the beginning of the function, but scoping the declarations will help with debugging compiler bugs later.

Now let's apply this in practice. Take this fragmented example of 20 instructions:

let %1
%1 = fenv
let %2
%2 = tupleextract %1 1 1
let %3
%3 = typeof %2
let %4
%4 = blob "cell"
let %5
%5 = is %3 %4
if (%5 0 0 1): let %6
if (%5 0 0 1): %6 = countof %2
if (%5 0 0 1): let %7
if (%5 0 0 1): %7 = blob 2
if (%5 0 0 1): let %8
if (%5 0 0 1): %8 = ieq 4 %6 %7
if (%5 0 0 (%8 0 0 1)): let %9
if (%5 0 0 (%8 0 0 1)): %9 = blob 0
if (%5 0 0 (%8 0 0 1)): let %10
if (%5 0 0 (%8 0 0 1)): %10 = cellextract %2 %9
if (%5 0 0 (%8 0 0 1)): let %11
if (%5 0 0 (%8 0 0 1)): %11 = typeof %10
if (%5 0 0 (%8 0 0 1)): let %12
if (%5 0 0 (%8 0 0 1)): %12 = blob "body"
if (%5 0 0 (%8 0 0 1)): let %13
if (%5 0 0 (%8 0 0 1)): %13 = is %11 %12
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %14 = blob 1
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %15
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %15 = cellextract %2 %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %16
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %16 = typeof %15
if (%5 0 0 (%8 0 0 1)): let %17
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %17 = is %16 %4
if (%5 0 0 1): let %18
if (%5 0 0 (%8 0 0 1)): %18 = if %13 %17 %13
let %19
if (%5 0 0 1): %19 = if %8 %18 %8
let %20
%20 = if %5 %19 %5

The conditions used here are described in multivariate BDD notation (%x C t f) ~ (%x == C)?t:f. I left out the (mandatory) else: blocks here since they're empty yet.

The first transformation we can do is to rewrite if expressions as separate conditions:

if (%5 0 0 1): let %18
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %18 = %17
if (%5 0 0 (%8 0 0 (%13 0 1 0))): %18 = %13
let %19
if (%5 0 0 (%8 0 0 1)): %19 = %18
if (%5 0 0 (%8 0 1 0)): %19 = %8
let %20
if (%5 0 0 1): %20 = %19
if (%5 0 1 0): %20 = %5

Also, we always reorder the let declarations within the same block, which are leaves and thus not constrained upward. The rule is that in each block all let declarations come first, and can in fact become two ordered sets per block:

let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 0 1): let %6
if (%5 0 0 1): %6 = countof %2
if (%5 0 0 1): let %7
if (%5 0 0 1): %7 = blob 2
if (%5 0 0 1): let %8
if (%5 0 0 1): %8 = ieq 4 %6 %7
if (%5 0 0 (%8 0 0 1)): let %9
if (%5 0 0 (%8 0 0 1)): %9 = blob 0
if (%5 0 0 (%8 0 0 1)): let %10
if (%5 0 0 (%8 0 0 1)): %10 = cellextract %2 %9
if (%5 0 0 (%8 0 0 1)): let %11
if (%5 0 0 (%8 0 0 1)): %11 = typeof %10
if (%5 0 0 (%8 0 0 1)): let %12
if (%5 0 0 (%8 0 0 1)): %12 = blob "body"
if (%5 0 0 (%8 0 0 1)): let %13
if (%5 0 0 (%8 0 0 1)): %13 = is %11 %12
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %14 = blob 1
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %15
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %15 = cellextract %2 %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %16
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %16 = typeof %15
if (%5 0 0 (%8 0 0 1)): let %17
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %17 = is %16 %4
if (%5 0 0 1): let %18
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %18 = %17
if (%5 0 0 (%8 0 0 (%13 0 1 0))): %18 = %13
if (%5 0 0 (%8 0 0 1)): %19 = %18
if (%5 0 0 (%8 0 1 0)): %19 = %8
if (%5 0 0 1): %20 = %19
if (%5 0 1 0): %20 = %5

In the next round, we merge N adjacent blocks by the rule:

if (%x C %t0 %f0): let X0; X0
if (%x C %t1 %f1): let X1; X1
...
if (%x C %tN %fN): let XN; XN
<=>
if (%x C 1 0):
    let X0... X1... XN...
    if %t0: X0
    if %t1: X1
    ...
    if %tN: XN
else:
    let X0... X1... XN...
    if %f0: X0
    if %f1: X1
    ...
    if %fN: XN

Which gives us, if we only do the top level:

let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 1 0):
    %20 = %5
else:
    let %6 %7 %8 %18
    %6 = countof %2
    %7 = blob 2
    %8 = ieq 4 %6 %7
    if (%8 0 0 1): let %9
    if (%8 0 0 1): %9 = blob 0
    if (%8 0 0 1): let %10
    if (%8 0 0 1): %10 = cellextract %2 %9
    if (%8 0 0 1): let %11
    if (%8 0 0 1): %11 = typeof %10
    if (%8 0 0 1): let %12
    if (%8 0 0 1): %12 = blob "body"
    if (%8 0 0 1): let %13
    if (%8 0 0 1): %13 = is %11 %12
    if (%8 0 0 (%13 0 0 1)): let %14
    if (%8 0 0 (%13 0 0 1)): %14 = blob 1
    if (%8 0 0 (%13 0 0 1)): let %15
    if (%8 0 0 (%13 0 0 1)): %15 = cellextract %2 %14
    if (%8 0 0 (%13 0 0 1)): let %16
    if (%8 0 0 (%13 0 0 1)): %16 = typeof %15
    if (%8 0 0 1): let %17
    if (%8 0 0 (%13 0 0 1)): %17 = is %16 %4
    if (%8 0 0 (%13 0 0 1)): %18 = %17
    if (%8 0 0 (%13 0 1 0)): %18 = %13
    if (%8 0 0 1)): %19 = %18
    if (%8 0 1 0)): %19 = %8
    %20 = %19

Performing this breadth-first, the next layer will coagulate like this:

let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 1 0):
    %20 = %5
else:
    let %6 %7 %8 %18
    %6 = countof %2
    %7 = blob 2
    %8 = ieq 4 %6 %7
    if (%8 0 1 0):
        %19 = %8
    else:
        let %9 %10 %11 %12 %13 %17
        %9 = blob 0
        %10 = cellextract %2 %9
        %11 = typeof %10
        %12 = blob "body"
        %13 = is %11 %12
        if (%13 0 0 1): let %14
        if (%13 0 0 1): %14 = blob 1
        if (%13 0 0 1): let %15
        if (%13 0 0 1): %15 = cellextract %2 %14
        if (%13 0 0 1): let %16
        if (%13 0 0 1): %16 = typeof %15
        if (%13 0 0 1): %17 = is %16 %4
        if (%13 0 0 1): %18 = %17
        if (%13 0 1 0): %18 = %13
        %19 = %18
    %20 = %19

And after the last layer is done, we have our block tree:

let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 1 0):
    %20 = %5
else:
    let %6 %7 %8 %18
    %6 = countof %2
    %7 = blob 2
    %8 = ieq 4 %6 %7
    if (%8 0 1 0):
        %19 = %8
    else:
        let %9 %10 %11 %12 %13 %17
        %9 = blob 0
        %10 = cellextract %2 %9
        %11 = typeof %10
        %12 = blob "body"
        %13 = is %11 %12        
        if (%13 0 1 0): 
            %18 = %13
        else:
            let %14 %15 %16
            %14 = blob 1
            %15 = cellextract %2 %14
            %16 = typeof %15
            %17 = is %16 %4
            %18 = %17
        %19 = %18
    %20 = %19

As it appears, we can build the entire canonical tree just by a single block concatenation function which performs concatenations recursively.

9.7.2022 B

For FLIR, It appears that it is a good structural improvement to reorder instructions in topological order (starting from the leaves) so that every instruction is as close to its last dependency preceding it as possible, of which the instructions' conditional nodes are additional dependencies.

For example,

%0 = null 0 0 [0]
%1 = fenv [1]
%2 = tupleextract %1 2 1 0 [1]
%3 = tupleextract %1 1 2 0 [1]
%4 = ult 4 %2 %3 [1]
%5 = blob 1 [%4]
%6 = iadd 4 %2 %5 [%4]
%7 = tupleextract %1 2 1 1 [1]
%8 = tupleextract %1 1 1 [%4]
%9 = extract 1 %8 %2 [%4]
%10 = uconvert 1 %9 4 [%4]
%11 = blob 32 [%4]
%12 = ult 4 %10 %11 [%4]
%13 = blob 126 [(%4 & ~%12)]
%14 = ugt 4 %10 %13 [(%4 & ~%12)]
%15 = if %12 %12 %14 [%4]
%16 = blob "\x" [(%4 & %15)]
%17 = cell [(%4 & %15)]
%18 = cell %17 [(%4 & %15)]
%19 = cell %<body 1> %18 [(%4 & %15)]
%20 = tupleextract %19 0 [(%4 & %15)]
%21 = tupleextract %19 1 [(%4 & %15)]
%22 = cell %20 %9 [(%4 & %15)]
%23 = cell %22 [(%4 & %15)]
%24 = celljoin %21 %23 [(%4 & %15)]
%25 = apply %20 %24 [(%4 & %15)]
%26 = join %16 %25 [(%4 & %15)]
%27 = blob "\"" [(%4 & ~%15)]
%28 = ieq 1 %9 %27 [(%4 & ~%15)]
%29 = blob "\\"" [(%4 & (~%15 & %28))]
%30 = if %28 %29 %9 [(%4 & ~%15)]
%31 = if %15 %26 %30 [%4]
%32 = join %7 %31 [%4]
%33 = cell %6 %32 [%4]
%34 = countof %1 [%4]
%35 = isub 4 %34 %5 [%4]
%36 = cellinsert %1 %33 %35 %5 [%4]
%37 = repeat %36 [%4]
%38 = break %7 [~%4]
%39 = if %4 %37 %38 [1]

can be reordered to (I did this one manually)

%0 = null 0 0 [0]
%1 = fenv [1]
%7 = tupleextract %1 2 1 1 [1]
%3 = tupleextract %1 1 2 0 [1]
%2 = tupleextract %1 2 1 0 [1]
%4 = ult 4 %2 %3 [1]
%38 = break %7 [~%4]
%34 = countof %1 [%4]
%8 = tupleextract %1 1 1 [%4]
%11 = blob 32 [%4]
%9 = extract 1 %8 %2 [%4]
%10 = uconvert 1 %9 4 [%4]
%12 = ult 4 %10 %11 [%4]
%13 = blob 126 [(%4 & ~%12)]
%14 = ugt 4 %10 %13 [(%4 & ~%12)]
%15 = if %12 %12 %14 [%4]
%27 = blob "\"" [(%4 & ~%15)]
%28 = ieq 1 %9 %27 [(%4 & ~%15)]
%29 = blob "\\"" [(%4 & (~%15 & %28))]
%30 = if %28 %29 %9 [(%4 & ~%15)]
%17 = cell [(%4 & %15)]
%18 = cell %17 [(%4 & %15)]
%19 = cell %<body 1> %18 [(%4 & %15)]
%21 = tupleextract %19 1 [(%4 & %15)]
%20 = tupleextract %19 0 [(%4 & %15)]
%22 = cell %20 %9 [(%4 & %15)]
%23 = cell %22 [(%4 & %15)]
%24 = celljoin %21 %23 [(%4 & %15)]
%25 = apply %20 %24 [(%4 & %15)]
%16 = blob "\x" [(%4 & %15)]
%26 = join %16 %25 [(%4 & %15)]
%31 = if %15 %26 %30 [%4]
%32 = join %7 %31 [%4]
%5 = blob 1 [%4]
%35 = isub 4 %34 %5 [%4]
%6 = iadd 4 %2 %5 [%4]
%33 = cell %6 %32 [%4]
%36 = cellinsert %1 %33 %35 %5 [%4]
%37 = repeat %36 [%4]
%39 = if %4 %37 %38 [1]

which when linearly conditionally scoped, produces a good contiguous structure:

%0 = null 0 0 [0]
%1 = fenv [1]
%7 = tupleextract %1 2 1 1 [1]
%3 = tupleextract %1 1 2 0 [1]
%2 = tupleextract %1 2 1 0 [1]
%4 = ult 4 %2 %3 [1]
if %4:
    %34 = countof %1 [%4]
    %8 = tupleextract %1 1 1 [%4]
    %11 = blob 32 [%4]
    %9 = extract 1 %8 %2 [%4]
    %10 = uconvert 1 %9 4 [%4]
    %12 = ult 4 %10 %11 [%4]
    if %12:
    else:
        %13 = blob 126 [(%4 & ~%12)]
        %14 = ugt 4 %10 %13 [(%4 & ~%12)]
    %15 = if %12 %12 %14 [%4]
    if %15:
        %17 = cell [(%4 & %15)]
        %18 = cell %17 [(%4 & %15)]
        %19 = cell %<body 1> %18 [(%4 & %15)]
        %21 = tupleextract %19 1 [(%4 & %15)]
        %20 = tupleextract %19 0 [(%4 & %15)]
        %22 = cell %20 %9 [(%4 & %15)]
        %23 = cell %22 [(%4 & %15)]
        %24 = celljoin %21 %23 [(%4 & %15)]
        %25 = apply %20 %24 [(%4 & %15)]
        %16 = blob "\x" [(%4 & %15)]
        %26 = join %16 %25 [(%4 & %15)]
    else:
        %27 = blob "\"" [(%4 & ~%15)]
        %28 = ieq 1 %9 %27 [(%4 & ~%15)]
        if %28:
            %29 = blob "\"" [(%4 & (~%15 & %28))]
        %30 = if %28 %29 %9 [(%4 & ~%15)]
    %31 = if %15 %26 %30 [%4]
    %32 = join %7 %31 [%4]
    %5 = blob 1 [%4]
    %35 = isub 4 %34 %5 [%4]
    %6 = iadd 4 %2 %5 [%4]
    %33 = cell %6 %32 [%4]
    %36 = cellinsert %1 %33 %35 %5 [%4]
    %37 = repeat %36 [%4]
else:
    %38 = break %7 [~%4]
%39 = if %4 %37 %38 [1]

9.7.2022

Four more BDD insights:

  1. The complex transform (a b c) d e -> a (b d e) (c d e) can be avoided by requiring that terminal nodes must always be 0 or 1, and implementing operators to use substitution instead of combination, ((a b c) d e) -> (a b c).subst(0=d, 1=e).

This changes how boolean operations are implemented:

# old form to new form
a & b -> (a 0 b) -> a.subst(0=0, 1=b)
a | b -> (a b 1) -> a.subst(0=b, 1=1)
~x -> (x 1 0) -> x.subst(0=1, 1=0) (leaf swap)
(~a b c) -> (a c b) (root swap)
  1. Because we can avoid complex conditional intermediates, we can now specialize branches to type Term := {id, 0|1|Term, 0|1|Term}, disallowing complex terms in the conditional.

This also changes variable bubbling (see previous post):

a (b c d) e -> a (b c d) (b e e) -> b (a c e) (a d e)
a b (c d e) -> a (c b b) (c d e) -> c (a b d) (a b e)

In fact, any variable a in the BDD can be pulled to the top by the operation

u = Q.subst(a = 0)
v = Q.subst(a = 1)
Q' = (a u v)
  1. The BDD can be extended to support and optimize switch cases by allowing integer leafs and extending the term to Term := {id, number, number|Term, number|Term}; (a b c d) then translates to C as (a == b)?c:d or switch(a){case b: c; break; default: d}; We can then port BDD nodes as (a b c) -> (a 0 b c). The arithmetic is generally fairly similar, only that specialization changes a little.

The operator a == C can then be described via (a C 1 0), or, for a complex left hand, by comparing any leaf number L of a with C.

With these extensions, we can now perform constant set testing. x in (A B C) can be described as (x A 1 (x B 1 (x C 1 0))), whereas x not in (A B C) becomes (x A 0 (x B 0 (x C 0 1))).

  1. If we consistently apply the swap rule (see previous post) to order our variables, and also order for constants that way when the variables are the same, then redundant selectors on the same variable will always crop up next to each other, and we do not need a contextual mapping to drop the variable. Instead, the following replace rule will suffice:
# regular BDD
(a (a b c) (d e f)) -> (a b (d e f))
(a (b c d) (a e f)) -> (a (b c d) f)
# BCDD (binary comparison decision diagram)
(a B (a B c d) (e F g h)) -> (a B c (e F g h))
(a B (a Q c d) (e F g h)) -> (a B d (e F g h)) # a==B so can't be Q
(a B (c D e f) (a B g h)) -> (a B (c D e f) h)
(a B (c D e f) (a Q g h)) -> (a B (c D e f) (a Q g h)) # isn't B, but could be Q, so no change

update: with the new model and the new replace rules, my test case reduced from 110ms to 19ms, as the memoizer can now work correctly. Without result caching, I get 300ms.

8.7.2022 B

For Binary Decision Diagram (BDD) normalization, it becomes important to bubble-up/push-down switch variables in order to guarantee that the nesting always uses the same ordering. This is possible whenever we have a BDD term of this shape, where both branches use the same conditional:

a (b c d) (b e f)

The truth table for this term is

a b out
0 0 c
0 1 d
1 0 e
1 1 f

If we swap out the middle two lines, we effectively swap the conditional order:

a b out
0 0 c
1 0 e
0 1 d
1 1 f

And from this, we can project back to the term, where a, b, d and e are swapped:

a (b c d) (b e f) -> b (a c e) (a d f)

It is possible to perform this operation on any term, by first transforming it to a redundant version:

a b c -> a (b 0 1) (b c c) -> b (a 0 c) (a 1 c)
a b c -> a (c b b) (c 0 1) -> c (a b 0) (a b 1)

Is this correct? Let's apply it twice!

b (a 0 c) (a 1 c) -> b ((a 0 c) 0 1) ((a 0 c) (a 1 c) (a 1 c)) -> (a 0 c) (b 0 (a 1 c)) (b 1 (a 1 c))

and reduce:

-> a (0 (b 0 (a 1 c)) (b 1 (a 1 c))) (c (b 0 (a 1 c)) (b 1 (a 1 c))) # reduce
-> a (b 0 (a 1 c)) (c (b 0 (a 1 c)) (b 1 (a 1 c))) # fold constant
-> a (b 0 1) (c (b 0 c) (b 1 c)) # apply a
-> a (b 0 1) (c (b 0 0) (b 1 1)) # apply c
-> a (b 0 1) (c 0 1) # fold constant
-> a b c # fold constant

8.7.2022

Here's a pattern for performing recursive evaluation in programming models that do not permit recursion, but allow closures as first class values. The technique works by inverting the control flow, by having a reactor loop apply continuations that produce continuations. The closures then maintain their own stack/coroutine memory.

# described in unrestricted frameloop assembly
function-loop = fn (entryf args)
    loop (f args = entryf args)
        if (empty? f) # raise/return
            break args
            #else
            repeat (f args)

As a (somewhat useless) example, a recursive fibonacci function can then be written as follows:

fib = fn (n)
    it = fn (cont n)
        self = this-function
        if (> n 1)
            do self
                cell
                    fn (x)
                        do self
                            cell
                                fn (y) cont (+ x y)
                                - n 2
                    - n 1
            #else
            cont n
    function-loop it (cell (fn (x) empty x) n)

function-loop may then choose to prescribe a stricter argument format and memoize the arguments, or mandate a user-defined state argument through which functions can implement their own shared memoization, stack or state machine.

7.7.2022

After pondering for weeks and coming to no conclusion on how to properly handle lazily evaluating conditional expressions when translating pure functional code to imperative C, it occurred to me that, I might finally have a clean and elegant way to deal with its corner cases.

The FLIR compiler already solves closures (by not supporting them, as they can be easily generated explicitly during parsing) and loops (by treating every loop as a function with special boundary rules), which means our functions are top level and loopless, but still not branchless.

Take this simple FLIR snippet as an exemplary corner case:

# assuming G, F, t, h, k, a, b are defined
g = apply G t
u = if a g h
v = if b k g
uv = cell u v
result = apply F uv

A straightforward translation (ignoring memory management for now) would be:

// ...
fl_value g = fl_apply(G, t);
fl_value u = fl_tobool(a)?g:h;
fl_value v = fl_tobool(b)?k:g;
fl_value uv = fl_cell(u,v);
fl_value result = fl_apply(F, uv);

Only that this code violates the requirement that g must be evaluated lazily, i.e. only when it is selected by a conditional branch. So we translate g as scoped:

fl_value u; if (fl_tobool(a)) { fl_value g = fl_apply(G, t); u = g; } else { u = h; }
fl_value v; if (fl_tobool(b)) { v = k; } else { fl_value g = fl_apply(G, t); v = g; }
fl_value uv = fl_cell(u,v);
fl_value result = fl_apply(F, uv);

This is better, yet now we violate the requirement that each expression must be evaluated at most once, whereas here, for the case that a && !b, g will be evaluated twice.

The question is now: how to create a proper imperative form in which we guarantee that

  • each expression of a function is evaluated at most once
  • conditional expressions are evaluated lazily

The solution I have arrived at is to generate a flat, imperative block without lexical subscopes, in which every expression x = op ... is converted to a statement of the form

fl_value x; if (<conditions>) x = op(...);

where <conditions> is computed by propagating branch conditions up the function graph, starting from the sink node, so that we know exactly for every expression the totality of conditions that lead to its evaluation. The statement allows for x to remain undefined, but since all dependent expressions use compatible conditions, no statement will ever depend on an undefined variable.

Our example program then is first annotated by conditions ? <term>: expr as follows:

# assuming G, F, t, h, k, a, b are defined
? a || !b: g = apply G t
? true: u = if a g h
? true: v = if b k g
? true: uv = cell u v
? true: result = apply F uv

It then straightforwardly translates to imperative form (where we can shorthand ? true to a direct assignment), by maintaining topological order:

// ...
fl_value g; if (fl_tobool(a) || !fl_tobool(b)) g = fl_apply(G, t);
fl_value u = fl_tobool(a)?g:h;
fl_value v = fl_tobool(b)?k:g;
fl_value uv = fl_cell(u,v);
fl_value result = fl_apply(F, uv);

In practice, we would not directly translate to C, but rather to a single static assignment tree that is initially flat, and in a subsequent optimization step merges multiple concurrent statements depending on the same conditions into nested lexically scoped branches.

To represent conditions, we may either use Binary Decision Diagrams (I explored them on 19.3.2021, extending the idea to switch cases), which are DAGs of ternary expressions (where the leaves are either variable, true or false), or the simpler, non-recursive Sum Of Product (SOP) form, though BDDs describe our desired final ternary form well, and make it easy to construct nested branches.

It won't be easy to always optimize branches so they assume their most efficient form, but it can generally be said that this structure is already quite SIMD friendly, since our form somewhat mimicks execution masking, and we don't duplicate any statements, keeping the code compact.

Lastly, what to do with switches? We can trivially understand switches as control flow following a test whether an integer is or is not contained within a set of constant literals. A switch instruction is useful because C compilers can often translate it into efficient jump tables that make it fast to perform branching set and equality tests on single variables simultaneously.

Hence we should abstract this feature specifically, as an instruction in x c0 ... cN that evaluates to true when x is contained in the set, and false otherwise. We can then rewrite terms such as (x == 1)||(x == 2) as x in set(1,2), which translates to C as:

bool q; switch (x) { case 1: case 2: q = true; break; default: q = false; }

and this expands our ternary from c?a:b to (c in i1[,...])?a:b, which we can reflect as a new branch type in the BDD.

As an example, a switch case like

y = switch(x) {
case 1:
case 2:
case 3: y = a; break;
case 5: y = b; break;
default: y = c;
}

would be specified in conditional FLIR as

? x in set(1,2,3): a = ...
? x in set(5): b = ...
? !(x in set(1,2,3,5)): c = ...
y = switch x 1 a 2 a 3 a 5 b null c

which naively translates to the less optimal but correct

bool c_a; switch(x) { case 1: case 2: case 3: c_a = true; break; default: c_a = false; }
fl_value a; if (c_a) a = ...;
bool c_b = (x == 5);
fl_value b; if (c_b) b = ...;
bool c_c; switch(x) { default: c_c = true; break; case 1: case 2: case 3: case 5: c_c = false; }
fl_value c; if (c_c) c = ...;
fl_value y; switch(x) { case 1: case 2: case 3: y = a; break; case 5: y = b; break; default: y = c; }

Clang and GCC differ wildly in how well they can optimize such constructs. Here's what GCC makes of it, showing how it has been used as an intermediate target platform for a long time:

testf:
        sub     rsp, 8
        xor     eax, eax
        call    op0
        lea     edx, [rax-1]
        cmp     edx, 2
        jbe     .L6
        cmp     eax, 5
        mov     eax, 0
        je      .L7
        add     rsp, 8
        jmp     op3
.L7:
        add     rsp, 8
        jmp     op2
.L6:
        xor     eax, eax
        add     rsp, 8
        jmp     op1

And here is clang's unholy mess:

testf:                                  # @testf
        push    rbp
        push    r14
        push    rbx
        xor     eax, eax
        call    op0@PLT
        mov     ebx, eax
        lea     ebp, [rbx - 1]
        cmp     ebp, 2
        ja      .LBB0_2
        xor     eax, eax
        call    op1@PLT
        mov     r14, rax
        jmp     .LBB0_3
.LBB0_2:
        cmp     ebx, 5
        jne     .LBB0_3
        xor     eax, eax
        pop     rbx
        pop     r14
        pop     rbp
        jmp     op2@PLT                         # TAILCALL
.LBB0_3:
        cmp     ebp, 5
        jae     .LBB0_4
        mov     eax, 23
        bt      eax, ebp
        jb      .LBB0_6
.LBB0_4:
        xor     eax, eax
        call    op3@PLT
.LBB0_6:
        cmp     ebx, 5
        ja      .LBB0_8
        mov     ecx, 46
        bt      ecx, ebx
        jae     .LBB0_8
.LBB0_9:
        mov     rax, r14
        pop     rbx
        pop     r14
        pop     rbp
        ret
.LBB0_8:
        mov     r14, rax
        jmp     .LBB0_9

5.7.2022

Learned today that the correct term for the logic of borrow checking is based on is linear logic, also known as "logic of resources".

For such a fundamental model, this proposal is rather new, from 1987, and thus directly connected to the computer revolution. One prerequisite to understanding the notation is to be familiar with sequent calculus, which was invented in the 1930's. It also helps to search latex symbols.

Unfortunately the author lifts yet another set of symbols from computer science, and then finally a third set from lambda calculus, always expecting me to already be on the same page, and that's too intractable. I don't understand why there are no "explain like i'm five" versions of this that allow me to translate this to the field I am familiar with. It takes way too long to figure out whether the ideas would help or not. So I understand, such ideas only travel by accident; If you have someone who was forced as a student to understand all this, and then later, by pure happenstance, be in the fortunate position to translate it to a real world application after graduation. If you're already in the muck, that shit just reads like the gods laughing at you.

At least I'm somewhat in the same headspace and so I can deduce some things intuitively. The revelation that linear logic only allows each expression to be used once, that we are dealing with sequential operations, and that there are two versions of AND and OR, a linear and a bilinear one, has helped to make some sense of it.

There are ways to read AND/OR as timeline operations on events: x is an event that happens. NOT x means event x doesn't happen. x AND y means that events x,y occur in the same timeline (parallel or sequential - AND is unclear here); whereas x OR y means x,y potentially occur on separate timelines; x XOR y guarantees that either x happens or y happens, but not both - so describes a real split timeline.

AND could also be called "all" (all these events occur), whereas OR is called "any" (any of those events occur), and XOR is "either".

IMPLY (the material conditional) describes a dependency of events in this context: x IMPLY y means that when x occurs, y should also occur, though it does not prescribe non-occurence of y when x does not occur, hence x must be a cause of y, but not the cause. This means that IMPLY can not fully separate causation from coincidence. y could occur along with x by coincidence. Only when x occurs and y does not, do we know that they are causally unrelated.

I also learned about notation in sequential calculus, where (x AND y AND z AND ...) implies (u OR v OR w OR ...) can be shorthanded as x,y,z,...⊢u,v,w,... where is called the turnstile operator. In terms of events it means both "if all x,y,z,... have occurred, then any of u,v,w,... occur" and "if any of u,v,w,... occur, then all of x,y,z,... have occurred". So we get a time-reversible expression that we can follow in either direction, as opposed to the (material conditional; "implies") arrow, which is not readily reversible.

You may also imagine a,b,c⊢d,e,f as sort of a TODO-list snapshot or dependency notation; the left-hand side of ⊢ is what has been done (all these things have happened); the right-hand side of ⊢ is what has to be done next (any of these have to happen). at the end of recorded time, the right-hand side is empty; at the beginning of time, the left-hand side is empty.

So a,b⊢c is a dependency: in order for c to happen, a and b must happen. There are though statements such as x,...⊢x,..., and these are called axioms. Here, x depends on itself.

How do we describe the ternary operator c?t:f in sequent calculus? c stands for the condition, t means the true branch is evaluated, f means the false branch is evaluated. I tried to derive it from classical formulation, but ended up formulating the expression from scratch, using two branching terms, which I think is complete in this way. For this, I use the absurdity constant to represent false:

⊢t,f    c,f⊢⊥ 

The first term means "t or f is implied", suggesting either will be evaluated (or both), whereas the second term means "c and f is false/impossible", thus turning c into the condition that decides whether t is selected. We can use c to perform a cut of ⊢t,f to get

⊢c,f    c⊢t    c,f⊢⊥ 

Which reads as "c is true and/or f is evaluated", "if c is true, t is evaluated" and "c is true and f being evaluated is impossible".

21.6.2022 (II)

Having some optimization success by replacing cells with an immutable data structure similar to piece tables, requiring essentially only two operations: slice and join. slice truncates a piece table, culling pieces, while join merges two tables. After merging tables, if the piece count exceeds log2(element_count), the smallest two adjacent pieces are merged until the piece count satisfies the constraint. The effect is that indexing into the table, which requires linear search, has O(log n) complexity at most, even though we're not walking or sorting a tree.

This brought down the run time of the transpiler test from 1670ms to 672ms. It worked so well that I will apply the same idea to blobs, so that we also significantly reduce the amount of work involved in string concatenations.

Insertions are presently naively modeled as follows:

insert(cell,element,i) -> cell[0:i] + [element] + cell[i+1:countof(cell)]

Update: A better heuristic for merging pieces appears to be to pick the two pieces with the lowest absolute (square root) difference, or the lowest absolute log ratio, i.e. neighboring pieces of symmetrical size.

Here are some examples for 32 insertions for back, center and front.

back:
(1,)
(2,)
(2, 1)
(2, 2)
(2, 2, 1)
(4, 1, 1)
(4, 2, 1)
(4, 2, 2)
(4, 2, 2, 1)
(4, 4, 1, 1)
(8, 1, 1, 1)
(8, 2, 1, 1)
(8, 2, 2, 1)
(8, 4, 1, 1)
(8, 4, 2, 1)
(8, 4, 2, 2)
(8, 4, 2, 2, 1)
(8, 4, 4, 1, 1)
(8, 8, 1, 1, 1)
(16, 1, 1, 1, 1)
(16, 2, 1, 1, 1)
(16, 2, 2, 1, 1)
(16, 4, 1, 1, 1)
(16, 4, 2, 1, 1)
(16, 4, 2, 2, 1)
(16, 4, 4, 1, 1)
(16, 8, 1, 1, 1)
(16, 8, 2, 1, 1)
(16, 8, 2, 2, 1)
(16, 8, 4, 1, 1)
(16, 8, 4, 2, 1)
(16, 8, 4, 2, 2)
front:
(1,)
(2,)
(1, 2)
(2, 2)
(1, 2, 2)
(2, 2, 2)
(1, 4, 2)
(2, 4, 2)
(1, 2, 4, 2)
(2, 2, 4, 2)
(1, 4, 4, 2)
(2, 4, 4, 2)
(1, 2, 8, 2)
(2, 2, 8, 2)
(1, 4, 8, 2)
(2, 4, 8, 2)
(1, 2, 4, 8, 2)
(2, 2, 4, 8, 2)
(1, 4, 4, 8, 2)
(2, 4, 4, 8, 2)
(1, 2, 8, 8, 2)
(2, 2, 8, 8, 2)
(1, 4, 8, 8, 2)
(2, 4, 8, 8, 2)
(1, 2, 4, 16, 2)
(2, 2, 4, 16, 2)
(1, 4, 4, 16, 2)
(2, 4, 4, 16, 2)
(1, 2, 8, 16, 2)
(2, 2, 8, 16, 2)
(1, 4, 8, 16, 2)
(2, 4, 8, 16, 2)
center:
(1,)
(2,)
(1, 2)
(2, 2)
(2, 1, 2)
(2, 2, 2)
(2, 1, 4)
(2, 2, 4)
(2, 1, 2, 4)
(2, 2, 2, 4)
(4, 1, 2, 4)
(4, 2, 2, 4)
(4, 3, 2, 4)
(7, 1, 2, 4)
(7, 2, 2, 4)
(7, 3, 2, 4)
(7, 3, 1, 2, 4)
(7, 3, 2, 2, 4)
(7, 3, 1, 4, 4)
(7, 3, 2, 4, 4)
(7, 3, 1, 2, 8)
(7, 3, 2, 2, 8)
(7, 3, 1, 4, 8)
(7, 3, 2, 4, 8)
(7, 3, 3, 4, 8)
(7, 3, 1, 7, 8)
(7, 3, 2, 7, 8)
(7, 3, 1, 2, 15)
(7, 3, 2, 2, 15)
(7, 3, 1, 4, 15)
(7, 3, 2, 4, 15)
(7, 3, 3, 4, 15)

21.6.2022

Had a short vacation last week and used the opportunity to mull over whether to use move semantics in the Frameloop IR. What the proposed change means in particular:

  • Several instructions "move" one or more of their arguments, meaning that the values providing the argument must only be moved once, and can not be accessed after the move. In practice, this means that every value can be a viewed (referenced) argument multiple times, but a moved argument only once.
  • A move instruction forces a move of a value and provides the reference implementation.
  • A copy instruction conceptually duplicates a value so that the duplicate may be moved separately.
  • Instructions that extract or slice values like extract, cellextract, slice, cellslice, produce slice views that allow us to defer a copy. These views are weak, i.e. they become inaccessible when the value they reference is moved.
  • Values can not be moved out of composites.
  • Views can be inserted into composites. To prevent this, move or copy must be used on the element to be inserted. A composite with view elements can become partially inaccessible.
  • When an element of a composite view element is extracted, the view can be truncated. e.g. viewof value 1 0 1 can be reduced to just viewof (localof (cellextract value 1)) 0 1, if element 1 of value is a view.
  • We move only where an imagined mutation would be of benefit, otherwise we view.
  • We change native argument evaluation order from right to left, as the first argument is often by convention the one to be moved, and auxiliary arguments are often just views of the first.
  • All arithmetic unary operations, repeat, break and merge move their argument.
  • All arithmetic binary operations, select, join and celljoin move the left-hand argument and view the right-hand one.
  • insert and cellinsert move their composite argument.
  • All values produced within a function (or loop) are dropped (moved to nowhere) when the function returns, except for the return value.

Here's the pros and cons.

Pros:

  • Instructions can directly mutate their arguments, meaning no speculative duplication is required, which saves us memcpy and malloc calls. Append operations that would before require n*(n-1)/2 memory units for n operations could now use dynamic array allocation patterns instead. Repeated insertions into arrays of fixed length could all be transformed into in-place mutations, reducing a memory operation to n. All this makes it possible to implement indexof in the language instead.
  • We get fine-grained control over which data to duplicate, and which data to transform in-place, allowing us to reason about performance and memory usage, just based on the amount of copy instructions we see used.

Cons:

  • The type system becomes more complex. We need to annotate for arguments whether they are moved or viewed. Return values can be local or remote (a slice of values contained within a composite). Static analysis requires computing lifetimes of views.
  • For debugging, the interpreter must keep track of which instruction moved a value, and even undefined values must be annotated with the instruction that caused the move.
  • The user must maintain internal knowledge of which instructions move and/or produce views. It is less mentally taxing to have the compiler figure it out. But this could also be provided by a meta-language.
  • Requires manual application of copy instructions wherever we transform a value more than once. move or copy must be used after extraction and before insertion, if we want to prevent a weak insertion.
  • All views on a unique value are invalidated even when they reference slices that have not been mutated by an insertion, because the value as a whole has been moved.

13.6.2022

I have serious trouble getting the bootstrap interpreter for Frameloop to perform well enough that it can execute the compiler bytecode without taking 15 minutes to do so. So far I employed the strategy of memoizing instructions by function environment. But that is not enough.

Since all generated values are immutable, each mutation has to be modeled as a copy + modification of the copy. So in simple appending to a dynamic array, we need to duplicate the array every time, plus the new element we're appending, which goes through n*(n-1)/2 elements for n elements appended. If the item going in had a reference count of 1, we could perform the mutation in-place, but then we have to be careful not to mess up memoization - memoization and in-place mutation interfere with each other.

So I'm considering improvements. One fundamental improvement I was thinking about was using move semantics. Every transformative instruction we use requires its arguments to have a reference count of 1, and decreases the reference of its arguments. If the same argument is used a second time, the reference count is zero (we keep dead objects around just so we can safely check for this), and an error is thrown. So these functions treat values as monadic.

Any transformation expects its arguments to be unique, that is, they have a refcount of 1 (owned by the calling environment, and ownership is transferred to the transform). If the refcount is > 1, then the transform must first make a copy of the argument before using it (if it is composite, the refcount of each of its elements is increased by 1), otherwise it can mutate the memory directly, and produce a new value with refcount 1 (ownership is transferred back to the calling environment). After transformation, the refcount of arguments is reduced by 1. If they were unique, their refcount is now 0 and their memory can be freed.

In addition to these unique values, we also need non-destructive slices, which are weak references to ordered subsets of unique composite values that do not have ownership. Weak references allow us to keep the reference count low, and save memory copy operations. Instructions that just retrieve values from a composite do not change its reference count. If the value that a slice is indexing into has a refcount of 0, the slice becomes inaccessible.

When a transformation receives a slice instead of a unique value, it will not decrease the reference of the composite, but take ownership of the referenced elements, even though they are contained within the composite. This means that we can end up with composites of which some elements are inaccessible. This is by design, so we can treat the environment of a function as no different than any other composite.

We need two more auxiliary instructions: move and copy. move is an indentity transform that transfers ownership without changing the provided argument, becoming the reference operator used by all other transforms. copy increases the reference count of a unique value or a slice, allowing us to move that value one more time before its refcount reaches zero.

I've mostly talked about composites in the context of cells. For blobs, operation is slightly different. Since elements are octets and therefore have no refcount on their own, copy only increases the refcount on uniques, and makes actual copies of slices. move can not decrease the refcount of individual octets of a slice, and so will also make copies, except for when the entire blob is referenced. We could in theory simulate refcounting per octet, which requires maintaining a list of region refcounts, which are increased by copy, and decreased by move.

What I am a bit worried about is how well this arrangement works with pure functions and our automatic topological arrangement of instructions. A function is supposed to produce the same value every time it is called. But our calls fold, and so some behavior could be surprising. e.g. two separate instances of copy x in a function fold to the same expression %1 = copy x, and will both produce the same value, and so cell (copy x) (copy x) is really cell %1 %1, which will fail because the reference count of copy x is zero after the first argument has been moved, same with cell %1 (copy %1), which fails because we evaluate and move arguments in the order they are passed. Only cell (copy %1) %1 will work correctly, i.e. cell (copy (copy x)) (copy x). Effectively, we can only use a copy of a value once. For more copies, we must make copies of copies. This means that we should always copy at the call site, and not expect a pre-bound copy instruction to produce an unlimited amount of copies.

Can we break this behavior with memoization? %copy = fn (x y) (copy x) will produce two different instructions for apply %copy x 1 and apply %copy x 2. If apply will however always evaluate and move arguments, then the behavior is the same: the second attempt to pass x will fail. In any case however, this function will fail to operate as intended, since x has already been moved into the function, and will be dropped on exit, and so we haven't really made a copy. Only a macro function can perform here. But macros are always liable to accidentally fold copies. The copy instruction should always be used at the earliest point.

We could of course also consider the alternative, where copy is never memoized, and so every copy instruction will always increase the reference. If we now declare x* = copy x, then passing x* as an argument always produces a new copy, and copy x* will produce two copies, which probably makes no sense. Hence copy is now more of an annotation rather than a instruction.

In closing, let's walk through a very common operation that would benefit from move semantics, incrementing a value in a container:

%1 = cell %A %B %C
%2 = cellextract %1 2   # %1[2] = %C
%3 = iadd %2 1          # %1[2] + 1
%4 = cellinsert %1 %3 2 # %1[2] = %1[2] + 1

We'll annotate the refcount of each element after every instruction:

# starting out: %A:1, %B:2, %C:1 (for example)
%1 = cell %A %B %C         # %A:0, %B:1, %C:0, %1:1
%2 = cellextract %1 2      # %A:0, %B:1, %C:0, %1:1, %2:view(%1[2])
%3 = iadd %2 1             # %A:0, %B:1, %C:0, %1:1, %2:view(%1[2]), %3:1
%4 = cellinsert %1 %3 2    # %A:0, %B:1, %C:0, %1:0, %2:view(%1[2],dead), %3:0, %4:1

7.6.2022

A hybrid of the generator and the collector, the replacer:

R = {
   result,
   if empty/full: {}
   else:          {value, (value) -> R}
}

The replacer has an evolving state, a next item and an update function to replace that item. The user is expected to receive the item, process it and update the replacer until the replacer produces no new items, upon which the result is complete.

Because of this structure, a replacer can for example operate on a conditional expression: first, it returns its condition for evaluation; when it is fed back in, and is undecidable, the replacer continues to return and update both branches in sequence. When the condition is constant, the replacer can instead only return the selected branch, and replace the original conditional expression with the branch result that is fed back in.

It is impossible to skip elements in the replacer - one must feed as many items back into it as have been received. A replacer can accept a special value to indicate skipping the insertion, but it must still be able to update its state.

6.6.2022

Based on my yield excursion yesterday, I developed a generator and collector protocol that has feature parity with generators and collectors in Scopes, but is simpler to set up, as we can implicitly capture state with closures at runtime.

A generator is defined as follows:

G =
   if empty: {}
   else:     {value, () -> G}

The first value at generates the value. The second function next produces the next generator for a sequence. The generator indicates through its empty-state that it is spent.

A collector works like this:

C = {
   result,
   if full:  {}
   else:     (value) -> C
}

The first element holds the present state, or the result; The function push receives a value, possibly evaluates it, commits it to the result and produces a new collector with a new result. The collector indicates through setting push as empty that it is full and accepts no further input.

5.6.2022

I implemented the map primitives in the interpreter today, which are serially processed but semantically parallelizable. The C implementation will later use threads to massively parallelize processing.

An open problem I keep thinking about is supporting yield control flow. Do we really need coroutines to implement it?

The problem is essentially, how do we return and continue from inside a loop?

yieldrange = fn (count)
   loop (i = 0)
      if (< i count)
         yield i
            repeat (+ i 1)
         break;

We expect this function to produce a value and a continuation 0 (fn (x) (repeat (+ i 1))), which is an illegal form, as repeat can not be moved out of the loop. Were this a recursive function, it would be more straightforward:

yieldrange = fn (count)
    l = fn (i)
        self = this-function
        if (< i count)
            cell i
                fn () (self (+ i 1))
            cell;
    l 0

returns 0 (fn () (l (+ i 1))); loop is a closure, and so we can perform a recursive tail call here. It is not really recursive, because we have already exited loop and only carry its captured environment, and the application has been wrapped so it is evaluated after we have exited the function. This will port straightforwardly to C.

So yield is really rewrite sugar that transforms loop, repeat and break. We do not really need to support this in our assembly.

3.6.2022

That was an extraordinarily productive day, without any major obstaces. I generated and routed debug information straight into the interpreter, being able to print a stack trace now, and I wrote a simple stackless mark & sweep garbage collector that prevents memory from filling up.

A minor kink in the debug info protocol is that body and foldbody (that's the body of a loop) instructions need to reference a di node. While the interpreter can easily track the di stack parallel to regular processing, we don't associate di nodes with results, which means we don't have that information for closures, which are preassembled. When a previously returned body/foldbody references a di node though, we can continue debug info tracking from there.

We still lose information about the function itself though (as the di for the body is below it), and the presence of a di node in the dataflow interferes with deduplication and debug info stripping (before, we could just take the value of a di and voila, it's stripped). So what I would prefer here is that the interpreter entirely figures it out on its own, by rewriting bodies to their di nodes on return, so we can unwrap those later on application. We could also keep a separate map, with the drawback that a body can have multiple identities, so we don't know exactly what function we have originally been calling.

2.6.2022

The jerry-rigged FLIR interpreter works well now. It is reasonably performant, but wastes a lot of memory, which doesn't matter in the grand scheme of things.

I am now working on the actual FLIR compiler, which is written in frameloop assembly (flasm), a text based language I invented so I could write some FLIR. The compiler can load and debug print FLIR modules, as well as compute predecessors, postdominators and scope trees. I also wrote a C code generator that turns a template tree into C code and takes care of proper indenting, quoting and delimiting.

The first version of the compiler will generate C modules that work with basic refcounted sum types exclusively, i.e. no local specialization of datatypes or elision of copy-on-write operations yet. All functions use the same signature result:value <- (closure:value, args:value), and by convention always move their arguments, i.e. decrease the references. So to keep an argument alive, a caller must incref it before passing.

An open issue is tracking debug information for expressions. I want to make use of the approach I developed last year (26.7.2021), where a debug information graph is kept as a metastructure for our program, in a way that allows us to strip the metadata at any point.

All we need is one new node di (debug information):

"di" : defop('DI  ', d(value = id, md = id, subdi = array(id)))

The idea here is that we tag an instruction value with metadata md, and then provide a new di node for each of its id typed fields. For metadata, we start out with a simple text location node:

"md_at" : defop('MDAT', d(file = id, line = u32, column = u32))

Later, we can add more nodes to provide more detailed data.

25.5.2022

Finally figured out how to properly generate and evaluate pure closures in FLIR and the FLIR interpreter in a way that deduplicates well, without requiring the C stack machine. I'm only going to talk about functions, but loops work similarly to this.

A function definition fn <level> <body> requires only a scope level and the body of the template expression. The parameters of that function can then be addressed via argsof <level>. We require that topmost functions start with level 0. Nested functions will later require to access both their own parameters as well as the one of the function they're contained in, and so they need to be defined at a progressively higher level. We can later apply a function by apply <function> <arguments>.

With these mechanisms, we can then describe nested functions that curry arguments:

%1 = fn 0
   fn 1
      fn 2
         cell (argsof 0) (argsof 1) (argsof 2)
# produce a (cell 1 2 3)
%2 = apply (apply (apply %1 1) 2) 3

For execution, the interpreter requires an environment instruction env <level> <args> <prev-env> that can be chained, where the null instruction indicates an empty environment, as well as a closure instruction closure <env> <level> <body>.

The interpreter carries the currently active env as global state. At runtime, a fn instruction is evaluated to a closure instruction, which captures the currently active env (truncated to the parent of the env instruction matching the fn level). When an apply is evaluated, the first argument must be a closure. The currently active env is saved to a stack, the closures env is selected, and then a new env is chained to the closure, associating the closure's level with the arguments passed to apply. The state of apply is marked as "in evaluation", then the body of the closure is queued for evaluation. When evaluation has concluded and apply is visited for a second time, the original env is retrieved from the stack and activated, and the original apply instruction id is aliased to the id of the evaluation result, as we always do.

One more change to aliasing: under environment influence, an instruction does not always evaluate to the same result. For instance, argsof 0 will produce different results depending on which environment chain is presently active. Therefore we alias instructions together with their environment id, and also later resolve the aliasing with the presently active environment id, searching up the environment chain to find a match.

One can optimize for better reuse of instructions, by exploiting the fact that not all expressions evaluated within a nested function require all bound arguments of the environment. By tracking for each instruction which environment level it actually requires, we can pick a lower level for aliasing, so then later more branching environments can reuse the result, and by that also determine in what scope an instruction actually is. For example, an instruction with the null environment is always a global instruction.

13.5.2022

Alright. So these are the building blocks of Frameloop that presently exist:

  1. A compact FLIR instruction spec in Python that can autogenerate a C header file
  2. A FLIR macro assembler written in Python that uses (1) to parse a text file in SLN format and outputs a FLIR binary module.
  3. A FLIR library written in C that can prettyprint FLIR modules, but most importantly specialize a FLIR module to a new FLIR module.

(3) is essentially an interpreter whose datatypes are instruction types. Instructions that are undecidable at evaluation time are passed through. This folds all constant expressions, and for a module that is entirely decidable, the resulting module will only consist of constant expressions.

10.5.2022

What is purity? A pure function is a function which, given the same input, will always return the same output. A function that reads a piece of data can only be pure if no other function alters that data. A pure function that causes other functions to become impure, by e.g. altering state that other functions read, is said to have side effects.

Can we write a completely pure graphical application? Yes, when the program only describes one frame of the app, that is, one iteration of its mainloop.

This means that any side effects are allowed to happen between two frames - after the evaluation of the last, and before the evaluation of the next frame. Every repeated evaluation of the same expression during that frame however must yield the same result.

For functions that usually have side effects, e.g. such as a function that reads from a file, we can trivially introduce purity by caching the result until the next frame. This allows us to evaluate lazily: even though reading from a file has side effects, we can choose to only perform the read when it is actually requested. It does not matter how late in the frame this read comes, and whether the result would have been different had we read earlier in the frame, the only thing that matters (for purity) is that any subsequent read yields the same results.

This is particularly interesting when streaming. Say we read an 8-character string from a generating function f(size), as f(8). We want to read more data, but by the definition of purity, f(8) will always resolve to the same string, not the next eight characters.

An abstract, solve-all way is to have monadic state that breaks the purity of the function, so we get function arguments like f(offset,size)->newoffset,value, specifying our offset in the stream. With the same offset, we get the same result. We can also rewind to an earlier offset if we like. To implement this function, we need to cache the entire stream read to the largest requested offset+size to an append buffer, and then return slices from that buffer on request. But when the frame ends, we can clear this cache, and allow our offset to reset to zero.

7.5.2022

The Frameloop VM really only needs two kinds of immutable values, blobs and cells:

  • blobs (aka chunks, strings, bytes, etc) can represent any kind of contiguous serialized binary data (integers,floats,booleans,utf-8 text,symbols,vectors,bitmaps,matrices,file contents) but are flat and schemaless.
  • cells represent any ordered composition of values with overloadable operators (arrays,structs,tuples,maps,sets,string wrappers) but can not be serialized as-is.

A cell can be composed of cells and blobs and also specify a type, which is another cell, and in that way allow to implement typesafety and type based dispatching. Cell keys used for forms are also cells, as are functions and built-in forms, S-expression lists, and so on.

We can convert a cell to a contiguous blob using a generic serialization function, and a blob back to a cell provided the format is correct, and by that way we have a bridge between the two kinds.

We encode arithmetic types like integers, floats, booleans as single blobs in cells with a matching type that has the relevant operations overloaded.

Lowering code that uses blobs is simple, because they're just memory buffers; short buffers can be directly handled in registers based on their size and usage patterns, e.g. a 4-byte blob that is repeatedly added should best be stored in a 32-bit integer register.

Lowering cells to structs and arrays requires analysis of how these cells are used and transformed. Since all our cells are immutable, we do not need to account for mutation, but can rather optimize for elision of temporary values in favor of in-place transformations.

6.5.2022

After writing the terrain engine and reexperiencing, despite all the friction that Scopes has already removed in comparison to a C/C++ solution, how annoying and convoluted it still is to iterate on a GPU pipeline, when it could be as much fun as writing a Shadertoy, I am back at giving the idea of a GPU/CPU integrated VM a second try, because I really enjoyed the prototype I had built last year, and I think this backend simply needs to exist. I then want to reimplement Nowhere's terrain engine in it.

I singled out the runtime from Tukan and named the project "Frameloop", after its runtime model, which defines a program as a directed acyclical graph of operations that construct a single frame of a graphical app, which is then fed back to itself to maintain state.

  1. Write a throwaway interpreter in python that can execute Frameloop sourcecode (looks a bit like scopes, has *.flsc file extension), so the implementation of Frameloop can be written in itself. The interpreter exists but isn't entirely complete. This is where I am at at the moment.
  2. Write a transpiler in flsc that can convert Frameloop sourcecode to FLIR (Frameloop Intermediate Representation), which is based on the FIR model I developed for the Tukan VM. FLIR is the native, universal form of Frameloop programs, and there is no one source code representation for it.
  3. Write another transpiler in flsc that converts FLIR to standalone C source code.
  4. Write a compiler/optimizer in flsc that reads FLIR, lowers it to more efficient forms, and writes FLIR again.

This gives us a complete self-hosting compiler. With TCC added in, live coding is possible and we can provide JIT like services.

  1. Write a FLIR generating backend for Scopes, and build Frameloop integration for Scopes, so Scopes users can use Frameloop to provide sandboxed modding for their games.
  2. Write the rest of Nowhere's tooling in Frameloop and further develop its procedural engine.

In the long run, I also want to add web0 capabilities to Frameloop, so users can share and execute programs without the help of a browser.

28.3.2022

The project manager I theorized in my last post now exists.

18.3.2022

I keep returning to my idea for an agnostic, project oriented package manager that allows solo devs and small teams to manage dependencies for a project without having to subscribe to a global shared repository or a fixed way of doing things that doesn't match their development culture, without it becoming a dedicated job that requires sinking substantial time into it.

I have a personal need for such a tool, and even thought I presently don't have time to work on it, I thought I'd write down my feature list at least once, for future reference:

  • ships as single binary and is scriptable, very similar to the ideals of GENie or premake.
  • project oriented; that is, it uses its configuration script to find the root directory of the project, and then operates only within its subdirectories. the package cache also lives in this folder.
  • the project's recipe repository also lives in the project folder; we can check it in together with the source code and share it through source control.
  • we typically use the package manager to bootstrap and install project dependencies once when setting up the project on a new machine, and then subsequently only update partially, which requires removing an existing installation; this is done by first installing into a staging folder, archiving this folder for caching, unpacking it to the actual installation image (to verify our archive works), then later using the filename list from the archive to know which files to remove.
  • simple conditional dependency directed acyclic graph.
  • has a default, POSIX guided understanding of where installed packages go ($/bin, $/lib, $/include, and so on), but other installation directories can be configured.
  • recipes can be imported from other projects and from online template repositories.
  • Cross-platform; can operate on Linux, Windows and Mac OS X.
  • Can flexibly use existing system infrastructure to install additional tooling when required, namely svn, hg and git.
  • Project recipes can implicitly download files, check out source repositories (see previous point), unpack tar.gz, bz2, zip and 7z archives, copy files and directories, run shell commands and conditionally depend on other recipes to be executed first.
  • Platform recipes work the same way, but can install prerequisite command line tools system wide if need be, so that we can start off a machine project file to prepare the overall dev environment, e.g. install msys or build-essentials.
  • The recipe format should be compatible to scopes list notation. The format is super easy to parse, at lot easier to retain in human memory than YAML, and we can read recipes in Scopes too if need be.

About the implementation:

  • Could be prototyped in Scopes, and built as single binary from there.
  • Codebase should settle quickly, so that we can use a built binary over years.
  • Bootstrapping the PM requires the ability to download archives and files from HTTPS protocol, as well as the ability to handle a native introspectable archive format that we use for archive caching and bookkeeping.
  • A library like LMDB can be used to persist state where needed, although if state is already inferable from presence or absence of other files (e.g. archives), then that is to be prefered.
  • All other capabilities of the PM can be augmented with command line tools.
  • Most work will go into recipe parsing, topological execution of DAG, filesystem I/O, path and string formatting/replacement as well as piping data to and from command line tools.
  • Platform recipes should capture most of the "administrative wisdom".

9.3.2022

Nowherian space exists on a flat 3-torus, which means that if we travel long enough along one direction, orthogonally, we end up back at the beginning.

How do we introduce a daylight cycle into such a space? We can not put a sun at a distant point, because light cycles within our space and is gradually absorbed by the medium. There is no "outside" to our space, at least not in this dimension.

We can place the sun within our plane, as a point light, but it would be very hot, and nothing could be built near it. For a cycle, the sun would have to move, but not every place would experience a night. An alternative is for the sun to pulse, so that it is off for half of its cycle, which would cause a night everywhere at the same time. Places further apart would never fully shadow other places, because the light loops around and scatters through the medium, so any point has the ability to be lit by 27 immediately "adjacent" suns.

A different way to introduce daylight in a way that creates timezones would be to originate the light from a parallel dimension. Let's drop to the flatland case to make better sense of the situation. We live within a repeating plane. The light source is outside this plane, perhaps as a traveling emitter that irradiates the surface like a scanner light. "Photons" hitting the plane would be tied to its surface, effectively losing one dimension, and be randomly deflected along the plane to hit any solid object. Perhaps the photons would also only be created by the medium as a reaction of the exocosmic irradiation, as a luminescent effect.

Without further constraints, everything would be lit up everywhere, even the insides of caves. So concentrations of matter would have to deflect radiation before it hits the surface, effectively causing light to always build up further away from matter, preserving the darkness within cavities. This would look like in practice as if the medium glowed like a light in the fog, or a bright sky. The light would have a smooth ambient lighting character, seemingly coming from everywhere and producing no hard shadows.

If the irradiator is focused, the resulting irradiation pattern could be spherical and moving only along one orthogonal axis, so that depending on world position, we get more or less light in a day; there would be a concept of a spatial "equator", which receives the most light in a day, which would influence temperature and climate in that region, and the light could shift laterally to the orthogonal axis over the course of a year, so that we end up with changing daytime and "seasons". The intensity of light could also influence the color of the luminescence, so that at the fringes of the spherical pattern we get rainbow hued evening colors as are familiar on earth.

There is also the matter of recycling. The bubble is an apparent perpetuum mobile, in which waste heat is converted back to matter. How it could work is that an exocosmic coolant transfers heat away from the world (what the universal expansion does for us), concentratic it someplace else, and eventually injecting it back into a particularly empty region of space. One might imagine this cooling effect as the result of heat "falling" into a narrowing exocosmic funnel that loops back to a point in space, guided by the same field that repels exocosmic radiation, injecting primordial matter. To an observer it might seem like a volcano erupting from a point rather than from the ground, its lava eventually cooling down and forming a new planetoid.

The light and energy produced by the radiation emitter is also sourced from collected heat.

The funnel has a bottleneck and a threshold, and so there is a bandwidth limit as to how much heat can fall into the funnel per second (with the greatest part absorbed by the radiation emitter), and a threshold that requires a substantial build-up of temperature within the funnel that triggers a cascading effect causing an eruption. So the world may experience a greenhouse effect if more waste heat is produced than can be absorbed, but this would not cause a runaway cascade effect, as the amount of energy that can be burned within the closed system is finite.

There is an alternative outcome, in which the system does not burn extra energy because all metabolic activity has ceased. This will only cycle light energy, of which the smaller part will be lost in the funnel and occasionally ejected as matter, which does not degrade fast enough under radiation. Eventually there will be an abundance of matter as the rest energy left for light depletes, and at some point the ambient sun will glow so faintly that it can barely be registered, and the lights go out.

8.3.2022

Been crunching Nowherian time and space scales today to figure out how big to make the map and how to make the short time of the game feel like a long time. I do have some good metrics now, and the terrain can be 64km³ big (with lots of empty space inbetween), at a resolution of 1 meter per sample.

In general it's interesting to observe how spacetime changes in proportion. Let's say we map the approximate circumference of earth, 40075km, to a game map of diameter ~1336km, with the approximate conversion factor k = 1 / 30.

To orbit earth at 960km/h, the best common speed of a commercial aircraft, would require a little under 42 hours. We can scale this duration directly by k and estimate that our map would be orbited in a little over 83 minutes instead, at the same speed.

To arrive at the same time spent, we must either slow down the aircraft by factor k to 32km/h, or redefine the concept of an hour within the game, so that 1 game hour = k real hours (or k * game_duration = real_duration), exactly two minutes. We'll use gh for game hours, h for real hours in this text.

So now we can translate 960km/h directly as 960km/gh. Orbiting the map still takes 83 minutes, but since the clocks in the game run much faster, they will show 42 hours have passed. We also run the day cycle at this speed, so a day takes 48 minutes now. The impression of seeing the sun go down and then up again during flight will have the same proportions.

Because our days are compressed, our lifetime would be too. Assuming we skip 1/3 of each day in the game because of sleep, a game year takes a little over 8 days in the real world. 80 game years pass in a little over 1 real year and 9 months. A game completion time of 44 real hours would cover a little over 82 days in the game.

We do have a considerable scale issue here, because hardly anyone would see the end of the game if they had to spend 23360 hours to complete it; assuming you play 2 hours a day and 8 hours a day on the weekend, that's 26 hours a week, 1352 hours a year, so you'd effectively need over 17 real years to get to 80 game years.

But if you compress time so much that 80 game years fit into 44 real hours, then a day in the game would last about 8 seconds. The sun and the moon would whoosh by with confounding speed, and the days would be over faster than you could make a decision.

So we say OK, we compromise, and either days are, in relation to time, much longer, and this includes all cycles they entail, or the lifespan in the game is much shorter, so you age 1 year per day, and thus turn 80 game years around 42 real hours. In the end, it means the same thing.

If a game day is representative for a game year of one's life, then we could apply storytelling techniques and treat it as pars pro toto, the part for the whole, as if every day had the weight of 365 days. What you eat, you ate 365 times. What you earned or lost, is incurred 365 times. How you slept, you slept 365 times. Inflation rises in a game day by how much it rises in a real year. That seems a bit strong for some applications. Will you have to cut your hair back every day too, which grows with the pace of a magical shrub?

This arrangement also implies that you complete major stages of your life within 20 game days (or under 11 real hours), from childhood, to young adulthood, to middle-aged, to old age.

What happens to other annual cyclical change, like seasons? Let's say we want the player to go through the seasons five times, so that in each stage of their life they go through the seasons once, but do not end up in the same phase, then a season transitions within 4 game days, or 2 real hours. That's too fast.

If we, instead, map five seasons to a lifetime, then a season lasts 16 game days, or 8.5 real hours. As four seasons define the length of a year, a game life then lasts a little over a game year, i.e. our life is annual. So concerning some annual effects, we can ease up a little, and say that 16 game days have the impact of 91.25 real days, so 1 game day can instead have the impact of 5.7 real days, a bit over 1 day shorter than a week.

What happens to our concept of weeks and months? We only have 64 game days a game year to distribute. Say we only assign a single month to each season, then a game month has 16 game days. If we assign 5 game days to a game week (2:40 hours to play), then a game month has a little over 3 game weeks, which gives us a first week, a peak week, and a last week for each season.

7.3.2022

Trying to figure out what chunk size of terrain data I can practically support. Every time I update a chunk, I need to update all parent LOD chunks, so log2(W) steps for a world. Our morton codes should fit into 64-bits. 32 bits would be even better, but then max(log2(W)) == 10, and at one meter per sample, that's merely a cube of a kilometer sidelength. So we can do 21 bits per sidelength, or 2097 km cubic sidelength.

Let's say I want to spend as much bandwidth as I would on one full HD screen: B = 1920*1080 = 2073600. So I divide by the number of LOD levels to be updated, and I get C = 2073600 / 21 = 98742.6 cells per chunk to be updated. Since C = W³, W = C^(1/3) = 46.2, and so 46³ would be the chunk size to use to fit a HD screen. The next lower power of 2 is 32, so 32³ would be a good choice here.

8.2.2022

Notes on terrain sector meshing.

We recursively generate the u32 sector keys to process and write them into a mapped GPU buffer, which is used in the next step, one key per workgroup.

The next step can be done in the same shader with shared memory and barriers, as long as we keep temporaries under 32KB. If we can stick to 8 bytes per cell, then we can accomodate 4096 = 16³ cells.

  1. preprocess 4³ sector; per cell,
    • fetch 8 values of dual lattice (5³ environment)
    • compute and store 3 edge sign bits
    • compute and store feature plane
      • probably fits all into 8 bytes
      • packed normal + distance: 3 * 16snorm = 6 bytes
      • sign bits in last 2 bytes
  2. generate triangles from 4³ sector
    • read sign bits from 2x2x2 cluster
    • generate up to 3 quads for sign bits:
      • compute feature vertex from feature plane (normal * distance)
      • use plane normal for vertex normal

6.2.2022

From experiments, I see that sampled distance fields produce cleaner and sharper MC isosurfaces than density fields, as their wider support region enables C1 continuous gradients (and beyond), but i find them harder to reason with as source data.

Density fields make probabilistic statements about occupancy within a cell: a density of d=0 means there is no mass in that cell, d=1 means the cell is maximally dense, and d=1/2 means the cell is half full. We do not know where the particles are in that cell, but if we took a uniformly random sample, with d=1/2 we would hit a particle half of the time. That's what "probabilistic" means.

A density field has by definition no embedded isosurface, so by convention we set the isosurface at d=1/2, that is, where our field crosses the 50% boundary, we determine inside/outside of the surface. A more correct rendering of the field would instead use its density as inverted transmissiveness for a participating media raymarching, ending the march where transmission is 0%.

A signed distance field only makes statements about the distance the center point in our cell has to the surface, and we extract the isosurface at d=0, as we have positive and negative distances as support. We need the support region to extract a smooth gradient and curvature using finite differences. We can not directly derive density from the field; this would require integrating a step function over the interpolated cell, so that negative values map to zero, and positive values map to a constant.

Where density fields shine is when we integrate them to lower resolutions, because we can conserve the spatial energy with decreasing accuracy. Signed distance fields can not be properly integrated. Their weighted average can only be used at greater distances, and one always has to drop to the highest resolution the closer one comes to the surface. Lowering their resolution requires building a min-function over the cells to be merged.

1.2.2022

Further worked on the terrain engine with LOD support. This is the present approach to updating the terrain.

We use a double buffering scheme that updates over multiple frames.

  1. Compute addresses of all cells required for the player position.
  2. Copy all reused generated cells from the last vertex buffer to the new vertex buffer, and remember for each cell its region in the array, so we can find it in the next update.
  3. Over multiple frames, generate vertices for all new cells and append them to the vertex buffer. Also remember those cells regions in the array. Only process a fixed number of cells per frame.
  4. Swap out the old vertex buffer for the new one and repeat from (1).
  5. The vertex buffer can then be drawn as one.

We should also keep a MRU cache for triangles generated at cells, faces and edges that we can fetch from when a discarded tile reappears.

26.1.2022

Developed a stateless heuristic to maintain the tile list, such as int(min(9.0, -log2(length(p/16.0) + 1.0/1024.0))).

How it will work is that over several frames, we just statelessly generate the list of required tiles based on the latest position, and then request from the tile cache the cells and faces we need, possibly meshing new tiles that we haven't seen before.

The tile cache then has a memory cap, and starts throwing out old tiles (i.e. least recently used) when it runs out of its allocated memory.

With 32-bit addressing, we can cover a tile space of 1024³; each tile then has a resolution of 32³, which gives us an effective resolution of 32768³; that is more than enough.

When a change in the terrain occurs, we regenerate all visible tiles in the cache overlapping with the invalidated area, and discard the invisible ones.

24.1.2022

Ok, let's see if I can formulate an algorithm to manage the terrain.

We need to manage two kinds of partition objects, cubes for cells and squares for faces, namely where cells are connecting. All cells have a dual sample resolution of 9³, so we can run 8³ jobs to build the mesh of a cell and get good GPU occupancy. Similarly, faces have a sample resolution of 9² and 8² jobs for good occupancy.

The scale of a cell or face only doubles or halves, so it's always of size 2^N, where N is an integer LOD level.

Each cell owns a triangle soup that represents the contoured mesh in its range.

We store cells by address in a hashtable, Lewiner style. This way, neighborhood, parent and child queries can be performed implicitly.

We also use three more hashtables for X, Y and Z oriented faces, where each face has the address of the cell whose lower front left corner it touches, and resides on the same level as the lower of the two LOD levels it connects.

LOD levels: 0 is the highest resolution (cell has smallest scale), and N the lowest (cell has largest scale).

We maintain the general rule that the LOD difference between two adjacent tiles can't be greater than 1, so that we can use an invariant 1:2 meshing kernel for boundaries of different scale.

21.1.2022

Sigh. i don't think my cascaded terrain mesh is panning out, or making lots of sense for that matter. It looks like I'll have to work with convex clusters like everyone else.

I'm thinking the simplest way to go about it is to connect dual meshed chunks of size 32³. We go up the LOD level for chunks that are further away, so that the maximum possible level is 5, with one vertex per chunk. With dual meshing, we only need the 6-neighborhood of each chunk to connect. We also further require that adjacent LOD levels can only differ by 1 maximum, so that we only need to deal with 1:2 border cases.

So what we end up with is an L³ lattice of lodable N³ chunks, with LOD levels increasing away from the center of the lattice. Once a cluster has reached its highest LOD level, the lattice becomes a L³ chunk of a higher K³ lattice (if that is even necessary). As we move across the lattice, high LOD level chunks are streamed in at the corners, and chunks are piecewise updated to reflect their correct LOD level for the distance they have to the center.

We do need to adapt the partial updating scheme to this arrangement and properly schedule chunk updates.

7.1.2022

Happy new year!

I finished my first prototype for a camera centered LOD lattice. The demo in the video even sources a precomputed volume for data. But I don't believe it's worth the effort to fix and optimize it further. The next step would have been to switch to dual marching cubes. Still, using volume cascades has multiple benefits, mainly that translation only invalidates a border of the volume.

Scheduling cascade updates is also simpler, as we can alternatively update one to two cascades per frame, depending on how much frame time we have.

    0         1         2         3         4         5         
    012345678901234567890123456789012345678901234567890123456789   60 FPS
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|   33 ms
     1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |   67 ms
       2   |   2   |   2   |   2   |   2   |   2   |   2   |   2   133 ms
           3       |       3       |       3       |       3       267 ms
                   4               |               4               533 ms
                                   5                               1067 ms

    0         1         2         3         4         5         
    012345678901234567890123456789012345678901234567890123456789   60 FPS
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    010101010101010101010101010101010101010101010101010101010101   33 ms, 33 ms
    2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|   33 ms
     3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |   67 ms
       4   |   4   |   4   |   4   |   4   |   4   |   4   |   4   133 ms
           5       |       5       |       5       |       5       267 ms
                   6               |               6               533 ms
                                   7                               1067 ms

So the effort to update a cascade depends on whether terrain was invalidated, which fully updates the cascade around the invalidated terrain, and/or whether the camera has translated, which requires a memory move within the cascade, and generating new slabs at the border.

8.12.2021

Developed a few terrain related solutions and published them as Shadertoys.

  • Features from Curvature, an alternative to QEF based feature finding for dual contouring, where we just do a quasimontecarlo search for high curvature points, by using gradient descent to find the surface, then picking the point with the highest curvature. If none of the gradient descent samples managed to hit a corner exactly, we don't get a perfect point though.
  • Subjective Filtering, using spherical inversion and hyperbolic transformation to produce a volume in which sampling detail gradually decreases with distance from the center, but never reaches a border, so that the volume conceptually stretches out into infinity, described with a simple average sample. This was meant as a continuous alternative to sampling cascades, so that we can generate information into a single grid rather than into several ones, and don't have to manage continuity at the edges. Bi/triquadratic interpolation is enough to produce a smooth field, but must be performed in the inverted space. As long as ray directions originate from the center though, they remain linear under inversion. Only lateral rays become curved.
  • Baryquadratic Interpolation (2D), interpolating four samples with C1 continuity and strictly quadratic interpolants using triangle subdivision. I was hoping to extend this to 3D, but the interpolant would have to be subdivided into 40 tetrahedra.
  • First Positive Root of Hextic, using gradient interval arithmetic to find first roots on higher order polynomials. This was preparation for raytracing a triquadratically interpolated surface.
  • Triquadratic Raymarching, using gradient interval arithmetic and dimensional reduction of a polynomial to efficiently raymarch a triquadratic surface. Such a surface can be used as an interpolant for terrain data.
  • Relaxed Distance Field (JFA), inspired by Relaxed Cone Stepping, computes the relaxed distance field for a signed distance field, which is the distance of a point to its nearest backfacing surface. The data can be used to march through a field with significantly fewer iterations, as we can guarantee that we don't miss any roots.
  • Relaxed Sphere Marching, extending the ideas of the RDF to analytic shapes such as circles and rectangles, and exploring composability. RDFs don't work that well for implicit functions, unfortunately.

I'm now continuing work on the functional reactive semantics I was working on earlier, while writing actual systems for the game in it. Right when it was time to make a visual programming GUI for it, I lost all motivation, and I guess that's because i want to write the GUI in the semantics of the language, but I want to use the GUI to write the GUI! A typical hen and egg problem. So I figured I could probably get a lot of mileage out of working in text alone for the time being, and once that is comfortable enough, I can start working on a GUI for it, and at least do it with comfortable semantics.

I also see so many things that need to be done but I have to somehow ignore all that future work and stop longing for ideal outcomes and instead just plough through, and solve problems when they come up.

It's already got great ideas. We have closures without scope limitations, side-effect free programming, no GPU/CPU barriers, direct access to GPU compute and GPU rasterizer, implicit deduplication of expressions and culling of unused expressions and automatic GPU control instruction ordering deduced from the expression dependency graph.

What we don't have yet is metaprogramming facilities like operators, custom types and operator overloading. We don't have loops, recursive state and state serialization yet, as well as interfacing with other I/O, namely files, audio, game controllers and network sockets.

These problems all have easy solutions. A harder optimization problem that I have no solution to yet is functional compression: since the functional DAG is entirely lambda dropped, there's an opportunity to reduce code size by lambda lifting, which requires finding expression subgraphs that have similiar shape, and replacing them with function calls. But we'll cross that bridge when we get there.

10.11.2021

I'm looking at the sampling mesh that produces truncated octahedra. This one has a bunch of desirable properties that made me look into it:

  • The sampling mesh is a tetragonal disphenoid honeycomb, in which we build dual vertices within simplices exclusively; here it is impossible that dual vertices produce broken 2-manifolds because every possible configuration only produces one surface.
  • When extra balancing vertices are added to the center of faces, the truncated octahedron allows plane symmetrical 3-coloring of vertices, greatly simplifying barycentric UV mapping.
  • Only six hermite points go into the QEF process, increasing precision and saving computations.
  • A singular interior sign produces a well tesselated spherical surface with 44 to 72 faces (depending on face subdivision) and 24 to 38 vertices, producing well tesselated curvature for small curvaceous details.

The spatial structure of the mesh isn't the most straightforward though. We have these data points:

  • sign bits are on a BCC lattice with two unique points per eight cells, at the zero corners
  • body-diagonal hermite points are on a cubic lattice with one point per cell (eight per eight cells), on BCC mirrored diagonals
  • axis-aligned hermite points are on a triplanar lattice, six unique per eight cells
  • feature vertices are on a triplanar lattice, four per axis i.e. twelve unique per eight cells
  • depending on tesselation, up to 6*2+8*6 triangles per eight cells, or 12 triangles per cell (96 per eight cells); but not really, as we're building 2-manifold meshes, so the real triangle demand will probably be around half that.

The hermite to vertex ratio is an optimal 6:1; the hermite to feature ratio is 7:6. the float data to feature ratio is 23:2 or 11.5:1.

Storage: per cell we have 2/8 bits of signs, 7/4 hermite points and 3/2 feature vertices. Uncompressed, this costs us X*Y*Z*(30 + 2/64) bytes (log2 offset 4.908392). A 32³ volume would be 961 KiB in size.

Working with the sheared cubic lattice is vastly easier though.

9.11.2021

More documentation on the FCC lattice with miller index (111).

Samples in the 111 lattice, per cell:

  • 1 distance field sign (1 bool per sign)
  • 6 hermite edges (normal + offset, 4 floats per edge)
  • 3 contoured dual vertices (3 floats per vertex)

All in all 33 floats, but can be losslessly reduced to 27 with octahedral normal packing (I'm not counting the sign bit, which can be infered or stored in the sign of another absolute coordinate, of which there are plenty)

The lattice can be subdivided without taking new samples, which is important for level of detail. New corner sign bits are inferable from hermite planes, new hermite points are extrapolated from the dual surface, new dual vertices are computed from hermite points.

The cubic lattice based equivalent would carry per cell 1 field sign, 3 hermite edges and one dual vertex, which comes down to 15/12 floats, that's a reduction of 1:2.2 to about 44-45% the size. But the cubic lattice carries only one third the amount of dual vertices per cell, so the ratio of data to vertex is 15:1, 12:1 at best, whereas the 111 lattice has a better ratio of 11:1, 9:1 at best.

Building the octahedral dual vertex requires 12 hermite points, similar to building a cubic dual vertex, whereas the tetrahedral points only require half that. So the hermite to vertex ratio is a constant 12:1 for the cubic lattice, the 111 lattice's ratio is 8:1 on average.

With a body diagonal, a tetrahedral partition and a squeeze, the cubic lattice could reduce its data footprint. It would require four more edge samples, seven in total, but could generate six feature vertices, bringing down the data to feature ratio from 12:1 to 39:6, or 6.5:1; and the hermite to vertex ratio to a constant 6:1. We'd generate faces of two types: hexagonal and quadliteral. The resulting, uncontoured surface would have the shape of a bitruncated cubic honeycomb with the more spherical looking cells in BCC close-packing arrangement, as opposed to lattice 111's rhombic dodecahedra in FCC close-packing arrangement.

After pondering packing configurations a little further, I have concluded that I want to maximize the density of feature points. The best dual vertex packing configuration is FCC, whose vertex image is the Tet-Oct Honeycomb, which produces only equilateral triangular faces, the ideal image of a surface. It feels correct. Each vertex is constrained by the same space, a rhombic dodecahedron, which comes down to clamping the vertex against three planes in the octant it is in.

I look at the FCC neighborhood and conclude that we end up using everything! corner signs are defined on a cube lattice, axis signs in a FCC pattern L of half frequency (and so we have irregular samples on the BCC lattice), hermite points on the rhombic dodecahedron's edges connecting the signs, the dual vertices contoured from the hermite points on the opposite half-frequency FCC lattice of L (clamped by the rhombic dode), and, finally, the connected faces of tetra/octahedra.

It's a little more involved; to compute each point, we need to add up twice as much edges as for the cubic lattice solution.

Data storage / addressing is as follows: RD corner bits can be stored in a volume of full density, if need be. RD axis bits are 4 per 8 cells in an even/odd pattern, so we merge cells from odd z layers into the even layers, halving the Z resolution. Generated vertices can be stored the same way, for the alternating pattern. All corner-axis edges of a cubic cell, of which each edge holds one or no hermite point, are connected to the same RD axis bit, so they too can use a volume with half Z resolution, eight edges per cell; we could also instead force spatial equivalency and double the volume's XY resolution and use one corner-axis edge per cell.

To summarize storage; for dual contouring a XYZ cubic space (minus the borders):

  • X*Y*Z*1 bit for SDF signs at rhombic dodecahedral's cubic corners
  • X*Y*(Z/2)*1 bit for SDF signs at rhombic dodecahedral's pyramid corners
  • (X*2)*(Y*2)*Z*vec4 for hermite points; .xyz components define the normal gradient, while .w holds the edge scalar (with a value < 0 encoding no sample). A normal can be losslessly compressed as octahedral (bringing us down to vec3), and as all values are in the normal range, 16bit floats are probably enough. A vec2 might suffice.
  • X*Y*(Z/2)*vec3 for dual vertices; these vertices have cell-relative coordinates, so 16bit floats packed into a vec2 might be enough.

So, the storage cost for a dense X*Y*Z volume is X*Y*Z*(54 + 3/16) bytes uncompressed (log2 offset 5.76), X*Y*Z*(36 + 3/16) bytes compressed (log2 offset 5.18). A 32³ volume would thus require 1.69 MiB uncompressed or 1.13 MiB compressed.

Let's compare with a pure, cubic lattice solution, which would require X*Y*Z*(48 1/8) bytes uncompressed (log2 offset 5.59), X*Y*Z*(32 1/8) bytes compressed (log2 offset 5.17). Doesn't make that much of a difference, does it?

8.11.2021

It's time to revisit dual contouring and rewrite the landscape engine. Based on incomplete work I did in 2015, I laid out the new sampling & mesh generation system on Twitter, but I'm going to repeat it here for posterity.

We generate 0,3 or 4 out of 6 quads per type A simplex, using the signs at the simplices' vertices to decide. The neighborhood of 11 QEF contoured dual vertices is sourced for the quads: 1 at the center, 4 from octahedral neighbors and 6 from type B simplex neighbors.

To keep working with the coordinate system and underlying datastructures neatly orthogonal (read: to prevent ourselves from going crazy), we project the coordinates from a FCC coordinate system. This allows us to store samples in a cubic lattice; per cell we only need to store 1 corner sample, 6 edge samples and/or 3 body samples, depending on what we're doing (sampling, QEF contouring or meshing). The shadertoy demonstrating the transform is here.

The reasons why I chose this particular projection for dual contouring:

  • we can use marching tetrahedra, so the mesh is a true manifold, without self-intersections or double-booked vertices
  • We only generate quads
  • Each quad triangle is equilateral, the surface produces sphere-packed rhombic dodecahedra, so mesh quality is great

The only caveat here is that contoured vertices are constrained to different polyhedral spaces, but the math is trivial in FCC space:

  • All samples are constrained to the unit cube
  • Tet/Oct/Tet vertices are constrained to manhattan distance 0..1 / 1..2 / 2..3 respectively

1.11.2021

Been improving my math knowledge, mostly by working on problems involving piecewise and analytic integration, least squares fitting, preintegrated convolution, fourier decomposition, complex numbers and spherical harmonics.

As a result, I have a much better understanding of things that I have previously made use of hastily, and I am therefore now able to work on problems more accurately, and with the aid of wider inspiration.

So, I might have done LPV occlusions wrong until now. A quirk of the LPV is that we're using the center point of each cell as a kind of environment map; an occlusion should therefore be represented as a masking with the oriented solid angle of the contour of the occluder.

When sampling a distance field, we can only approximate this information. Previously, I've simply scaled a projected gradient normal by 0..1, depending on how deep we were into the surface; so, assumed coverage of the cell translates to an oriented occluder.

An actual integrated sample at that position would never be able to mask a solid angle greater than 180°; for when we're integrating beneath the surface, there's a sharp discontinuity to a solid 4 * pi of coverage.

Because of this discontinuity, when we're sampling a SH field, surfaces should not incorporate SH samples that lie inside surfaces.

I also noticed that simple scalar interpolation in linear light space is wrong. Instead, the light should be interpolated logarithmically, with the exponential taken afterwards, which greatly reduces discontinuities.

I suspect that SH samples also should not be interpolated linearly, as we're simulating the integration of what a surface sees as it moves between two samples. I suspect that, as SH coefficients are supposed to be squared in the dot product, and we're dealing with perspective projected areas, they need to be interpolated in the space of x² or 1/x².

4.9.2021

The smallest detail visible with the naked eye is ~0.05mm in size; that's about the median size of a plant cell. A game map area perceived as "huge" is 32km in diameter. this gives us a "perfect" resolution scale of 1:640,000,000; or about 29 log2 mip levels.

Devising an efficient integration/differentiation rule that can be applied to a (2^29)³ sized spatial reality is an interesting problem. We must not only consider a static world, but a moving one too. It's obvious that we don't have 2^87 bits of storage available, so some kind of compression must be employed. Even in a sparse, surface only representation, we're still left with 2^58 bits of data, a terabyte of information.

My preferred solution is semantic compression. Particularly when we get to microstructures, materials have great crystalline homogeinity. Instead of a bottom-up datastructure that is integrated as we go up, we should instead consider a top-down representation that procedurally generates detail from statistics. The lowest miplevel of this world is a single composite value describing the entire world. If this value were a seed number and a given total mass, the next highest level, in an octree, would be a 2³ structure with that mass distributed among all octants, and new sub-seed numbers for each octant.

What we want here is a universal but configurable fractal production rule that can be recursively applied to all 29 levels, and which we use to generate local detail. The headache begins when player induced state changes are involved. Depending on which levels those changes appear, we can still use the same data format and project it to lower miplevels. But we now need to integrate those changes to realize and update higher miplevels, and that means switching from seed based generation to data storage, unless we can figure out how to find the correct seed for the change we've made.

29.8.2021

continued here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment