paniq/rubberducking2.md Secret

## rubberducking2.md

      
    Raw
  

              rubberducking2.md
            
          
    26.6.2024
I am still hard at work shipping Scopes². I have been asked what the roadmap to Scopes² is, so here it goes:
The primary goal is to get to self-hosting. UVM has now become STIL, the Scopes Template Intermediate Language. The STIL multi-stage compiler is the first element; it's a standalone executable that parses STIL, written in Scopes notation, and generates C, compiles shared binaries from generated C (using clang) and finally loads  the shared binaries back in to execute the generated functions (to turn a AOT pass into a JIT pass, which constitutes the multi-stage element). Each of those stages is accessible separately.
Through the multi-stage features, the STIL->C code generator will support plugins, not unlike the sugars and spices we have now, but they return STIL - i.e. they need to perform both expansion and typechecking, for which utility functions will be provided.
The plugins then bootstrap the rest of the environment, including the full type system and an improved C generator with more optimizing instructions. The existing Scopes environment, wich includes the SPIR-V backend and the libclang bridge, will become an interface layer on top. Our present environment is a bit too complex in some parts so over time I expect to simplify and retire bits and pieces of it.
The defining features of the new architecture, in addition to all existing features, are:

Self-hosting
Perfect C ABI compatibility
Monotypic
Memory-safe
Realtime-friendly memory management
Offline persistence
Closures anywhere
Arbitrarily deep stacks
Unstructured control flow
Aggressive program optimization
Instant compiler bootup

21.12.2023
In the last months, I've been working on Scopes² and have made significant progress in rewriting its compiler from C++ to a Scopes-based system called Universal VM (UVM) (C++ implementation, scopes bindings, compiler target). This new interpreter allows for offline caching of code, improving compile times significantly. UVM executes programs by operating with immutable cells instead of using the heap directly, which is better suited for the cacheable environment and makes it easier to rewrite existing Scopes sources. The current target for this work is a self-hosting compiler in February, followed by an optimizer that can lower back from UVM to Scopes IL and then C. This will further boost compile performance while providing support for users who want to directly target UVM.
25.4.2023
Apologies, it's been a while since my last entry, because we moved and I couldn't program as much as I wanted to. The move is now thoroughly completed, the new place furnished, my new office in familiar disarray, and hence coding can commence.
In the past year I have been experimenting with a new unified CPU/GPU compiler architecture and low-level VM runtime called Frameloop for more productive prototyping of our game NOWHERE. While the intermediate results were very promising, I was still miles away from a completed bootstrapping, while neglecting the existing Scopes codebase. Torn between a tool that works and a shiny new idea I fell between two chairs, and I felt paralyzed, unsure how to progress.
That is now over, as I have found the perfect solution. I'm going to implement Frameloop on top of Scopes, but first longstanding issues with the compiler have to be fixed:

The main reason why I do not enjoy working on Scopes is the bootstrapped C++ compiler which is difficult to extend. I have delayed numerous ideas simply because that would require writing more C++, something that I want to do less of.
The C++ implementation leaks all its compile-time memory, which makes it unsuitable for applications that need its multi-stage compilation features over a long runtime, such as e.g. a web browser. The reason for leaking is that Scopes did initially have no borrow checking features or support for types with linear logic, and so had no way of managing memory for objects produced by the API. It was simply easier to fully ignore that part and get 64GB of RAM instead in order to iterate faster.

Another bit that draws my ire is the LLVM code generator. As I have learned over the years, LLVM is not fun to work with/for, for these reasons:

The full API is only accessible through C++
To get reflection in debuggers, debug information has to be painstakingly annotated by hand on a by-variable basis.
Not C ABI compliant. All the C ABI bits, including multi/cross-platform compatibility and calling conventions, are part of clang.
It is easy to procedurally produce LLVM IR that crashes the code generator, requiring manual debug builds of LLVM to investigate, something which I do not have the time for.
It's a moving target that requires more and more annotations. Code that compiled fine in one version crashes the code generator in the next one.
Code optimization crashes or produces crashing code when the annotations aren't exactly matching how clang does it.
There's only one tool that can and wants to deal with LLVM IR, and that's LLVM.
A piece of LLVM IR is generated for one target, and one target only.
LLVM's code generator practically hangs itself on single static assignments with large array types, producing pages of inefficient register swapping operations.

So I figured it's time to finally get to that self-hosting stage, and port the compiler from C++ to Scopes. We have all the basic data structures in Scopes required for a Scopes compiler written in Scopes. Originally I wanted to do it part by part, and gradually "retcon" scopes parts into the C++ compiler, but that proved to be a waste of time, as it would require extensive changes to the C++ codebase to interop well with external modules.
The first thing I wanted to replace was the LLVM backend, and I figured that C, despite its many failings, could be a useful code generation target and replacement for LLVM. "clang has it all - so why don't we talk to clang instead?" I thought to myself. This work is completed, and there's now a C code generator that can translate typed scopes IL to C, with all existing tests passing. I'm quite happy with it. The benefits of C as a compile target are:

Lo-fi communication only through text file formats and command-line, no API interfacing. Easier to debug, simpler to build caching mechanism.
Reflection fully works in the debugger with no additional work. Even though we only get #line to sync up instructions, it's already enormously helpful.
Fully C ABI compliant - duh. Support for cross-compilation and all the calling conventions out of the box. Saves me tons of time.
The C compiler practically never crashes. You always get clear error messages. Very easy to debug.
C is for the most part done and remains thoroughly backwards compatible. New versions don't fail existing standards, through -std we can pick our own time to upgrade.
Optimization works immediately and reliably, from -O1 to -O3.
There are multiple C compilers to choose from: GCC, clang, MSVC, TCC and more. In practice, presently only clang provides the extensions necessary for feature parity with LLVM, but GCC is catching up. In any case, adding compatibility modes for other compilers is always possible.
It is possible to generate retargetable code that compiles differently for different targets.
All data operations conceptually happen on the stack, and are only lowered to registers during optimization. Large array types aren't lowered and don't hang the code generator.

The second bit that is completed is the C bridge, which allows us to import declarations from C header files directly into Scopes. It works exactly like its C++ counterpart, but talks to libclang's C interface instead, as we can't access C++ code from Scopes (yet).
I also streamlined and respecified the FLIF file format which I developed for Frameloop, as the .bag file format which comes with its own reasonably efficient minimalistic flat schemaless immutable object hierarchy, and is a great way to de/serialize directed acyclic graph structures, a task that is useful nearly anywhere, from disk caching to network transport. The new compiler will extensively work with blobs and cells as replacement for strings and lists, and with them there's going to be an implementation independent cycle-free specification of Scopes IL schema that I spec'ed for Frameloop and will port over.
Only a few more major components need to be ported from C++ to Scopes so we can achieve feature parity with the existing compiler, and finally replace the C++ codebase for good, in topological order (I'm leaving out the datastructures here since they don't really "do" anything):
(10.07.2023: Updated for recent changes)

lexerparser, which reads text streams in the SLN format and produces a hierarchy of location-annotated lists. This has been ported to the compiler.LexerParser module.
expander, which takes lists produced by (1) and expands them into untyped Scopes IL nodes. This has been ported to the compiler.expander module.
prover, which types untyped Scopes IL nodes produced by (2) and produces compileable typed Scopes IL nodes.
quote, a subpart of (3), which quasi-/pseudoquotes untyped Scopes IL, transforming it into instructions that produce that code.
gen_llvm, which translates typed Scopes IL produced by (3) into object files. This part is fully replaced by the new compiler.target.C module.
gen_spirv, which translates typed Scopes IL produced by (3) to SPIR-V files. We also use SPIRV-Cross internally to port it to GLSL, which would require a C wrapper so we can talk to it from Scopes. .bag files are strongly inspired by SPIR-V by the way. This part is now fully covered by compiler.target.SPIR-V, and has already supplanted the C++ implementation.
globals, which prepares the runtime environment that core.sc needs to bootstrap the rest of the compiler, exposing the reflection API, types, basic forms and instructions.
c_import, which compiles C code and translates its declarations to Scope globals and types. This part is fully replaced by the new bridge module.

When all that is done, and the new compiler can compile itself, we can start adding new features to Scopes (see 13.9.2022), and then reimplement Frameloop in it.
1.10.2022
How difficult would it be to simulate mutation with immutable operations? Let's consider the simplest case, where we rebind the name of a counter:
i = 0
i = i + 1

We haven't really mutated i, yet get similar benefits out of simply rebinding the name. However, our operation is always limited to the innermost lexical scope. See this (failing) example:
i = 0
if true
    # bind i to 1 locally
    i = i + 1
# will fail, since we left the conditional lexical scope
assert (i > 0)

To fix this, we have to route i back out of the conditional block:
i = 0
i = if true
    i = i + 1
    i
else i
# success
assert (i > 0)

The general implication here is that we simulate "mutating" a value by implicitly exporting its changed binding up to the scope in which it was initially bound. Let's call this binding mutation. This can be decided by recognizing a special assignment instruction set! that tags the local scope with names to export. This is done during AST generation, which is still binding-aware, not execution:
i = 0
if true
    set! i (i + 1)
    # implicitly exports `i` as mutated
    # `if` sees `i` exported from `true-branch`, also exports it from `false-branch` to make it match up
# success
assert (i > 0)

Similarly, loops need to export mutated bindings to themselves as well as to the parent scope they're breaking to.
Top level functions shouldn't export mutated bindings, so we avoid ending up with global state changes, but within closures, it possibly makes sense. A closure that exports a mutated binding also alters every callsite by a new return value. However, since the closure also captures the binding, and that capture is typically contained in the closure's environment, the closure environment must be altered as well for the next time the function is called, which means now the closure binding itself got mutated, and must also be exported.
Perhaps a variant that is easier to understand is a function that mutates the binding of one of its arguments. The caller, receiving the new exported binding, must now also mutate the argument binding after the call, and export that binding in turn. This way, we can simulate functions returning values through their inputs.
Lastly, and this is possibly the hardest part, we must take care of simulating mutations of values that have been pulled from containers. We can only simulate mutation under copy semantics, as retroactively altering a value in an unrelated container is a fools errand. See this simple example:
i = 0
j = (f i)
set! i 1
# we can rebind `i` but not rebind `j`, as we'd have to trace `i`'s path through `f` for that
assert (i == 1) # true
assert (j == (cell 0 0)) # true

However, if our binding has been trivially extracted from a named container, we can support a virtual "reference extraction" operation that functions on names rather than values, and stores an extraction U @ V as e.g. bindextract "U" V. With this, we can then rebind U when the target changes, by rewriting set! (bindextract "U" V...) W as set-insert! U W V...:
C = (: 0)
i = C @ 0 # i = bindextract "C" 0
# we can implicitly update index 0 of `C` as well, which also mutates the binding of `C`.
set! i 1 
assert (i == 1) # true
assert (C == (: 1)) # true

This requires though that bindextract is always performed on an unconditionally bound named value. The index can be variable, but the container must not be. So nested extractions such as bindextract (bindextract "U" V) W are still traceable, but should be contracted to bindextract "U" V W.
So in closing we can conclude that simulating mutation as a convenience layer on top of copy semantics is possible, if we maintain the ability of lexical scopes to update their parents, and function scopes to update the scope of their callsite.
29.9.2022
Updating Merkle tree mipmaps: let's say we want to maintain and update the hash for a mutable array as it changes. It is obviously too costly to hash the entire array every time we make a change. But with a merkle tree, we only need to update log2(n) slots for n elements - even less if our blocks are larger. 32-bit (4 byte) hashes of 32 byte blocks gives us an even better ratio of 8:1, reducing the number of pointer indirections we need to follow, and putting more values in the read cache.
The tree update can be made lazy, by instead zeroing all non-zero hashes that our mutation affects, going up the tree, and queueing the zeroed hash addresses for recalculation later. When the hash of the array is requested, and the top level hash is zero, the queue must be worked off first, and is already in correct order of dependency. After rehashing all queued blocks, the top level hash is current again.
19.9.2022
Two invariants I have discovered today:
(A) I only see two distinct reasons to hide a large value behind a pointer indirection within a composite structure:

the value is an element shared by two or more other values/functions
the value is of dynamic size and the size difference covers magnitudes

(B) When we want to memoize a function, we only have to preserve a mapping in the functions cache for as long as all items of its keys can be constructed. If there is e.g. any value in the key which is no longer accessible, then the mapping can also be removed. At no point do we know better that a value can be purged from caches than when it is dropped. Its drop point in a program is always the last location of its use. The functions in a program are finite. We could analytically determine, going forward in the program, every function that this value has conditionally been passed to as an argument, and then at the drop point, generate code to purge all relevant call results.
17.9.2022
While exploring pattern matching, I kept noticing that many recursive and iterative algorithms really like the concept of consing (prepending a new argument to a linked list), and that cons cells do seem to be a natural fit for pattern matching.
Too bad then, that the naive implementation of a cons cell, as a pair of two pointers, is so inefficient for the hardware: going through a single list, we already keep chasing pointers, where indexing a pointer array would have been a lot more efficient. In addition, we can't seek freely in a linked list.
So what can we do? A better fit for the cons operation cons x y is to use reference counted dynamic arrays, where we duplicate shared blocks, and otherwise apply.
Internally, we represent the cons cell for the empty list () as a pair of base pointer and element count, in this case {NULL, 0}. cons x () = cons x {NULL, 0} then produces our first real memory allocation, {{x}, 1}.
Now, for any other cons, we duplicate the referenced array if our element count is smaller than the array size (since it has been appended to since), and then append our new element, so that cons x {{y0, y1, ..., yN-1}, N} = {{y0, y1, ..., yN-1, x}, N+1}. This has the advantage that we keep appending to the same dynamic array as long as we haven't appended twice to the same "list". Other list pointers can keep addressing this array, even though it has been mutated, because they point at the part that hasn't changed.
Hence, the cons list representation of {{y0, y1, ..., yT-1}, N} is (yN-1 ... y1 y0) - the elements in one structure are listed in reverse in the other.
This cons array, as I call it, has the added benefit that we can now iterate through the "list" without touching any memory. To truncate the list by one element using (next (cons x y)) = y, we can now implement it as (next {A, N}) = {A, N-1}. The element at the front can be retrieved by (at (cons x y)) = x, implemented as (at {A, N}) = A?A[N-1]:{NULL, 0}.
From here, we can make one more obvious improvement to really get the best out of the arrayness of our cons cells. We extend our virtualized cons cells from {array, size} to {array, offset, size} i.e. a fully implemented slice, so we can now slice, truncate and randomly seek within lists without any cost but altering offset and size. This also allows us to pop elements from the back of the list (the front of the array), and saves us the cost of reversing the list for some algorithms
For backwards compatibility, one might think it useful to extend the cons cell to {array, offset, size, reversed?}, so that reversing the list merely toggles a flag, and all existing list operations then function the other way around, saving us the extra cost of a duplication merely for reading data in a different direction. Since consing to a logically reversed list would require us to prepend to the front of the array, we must make another copy here, during which we perform a physical reversion, so that subsequent cons operations will append again.
14.9.2022
An interesting aspect of loops is that we can freely choose their phase with a simple transform, which comes in handy when we are trying to reduce the number of live variables at the boundary. Take this pseudo-flasm example, which reverses a cons list and increments its values:
in = cons 1 (cons 2 (cons 3 '()))
out = '()
out = loop (in out = in out)
    #--- one iteration
    if (empty? in)
        break out
    at next = in
    in = next
    out = cons (+ at 1) out
    #--- loop around    
    repeat in out
return out

Now to freely choose our phase, we duplicate the loop body. This does change nothing about the loop, just makes it a bit more redundant.
in = cons 1 (cons 2 (cons 3 '()))
out = '()
out = loop (in out = in out)
    #--- even iteration
    if (empty? in)
        break out
    at next = in
    in = next
    out = cons (+ at 1) out
    #--- odd iteration
    if (empty? in)
        break out
    at next = in
    in = next
    out = cons (+ at 1) out
    #--- loop around
    repeat in out
return out

Now we treat the loop as a modular space, and shift-rotate the spacing (to a less practical position) to indicate the new outline of the loop. Nothing about the order changes, we just visually indicate where we want to make the new cuts:
in = cons 1 (cons 2 (cons 3 '()))
out = '()
out = loop (in out = in out)
    if (empty? in)
        break out
    at next = in
    #--- begin of new iteration
    in = next
    out = cons (+ at 1) out
    if (empty? in)
        break out
    at next = in
    #--- end of new iteration
    in = next
    out = cons (+ at 1) out
    
    repeat in out
return out

And finally we move the preamble and epilog out of the loop, to either side and adjust the loop variables to route through the live variables at that point in the phase. The epilog can be deleted:
in = cons 1 (cons 2 (cons 3 '()))
out = '()
#--- preamble
if (empty? in)
    return out
at next = in

out = loop (at next out = at next out)
    #--- new iteration
    in = next
    out = cons (+ at 1) out
    if (empty? in)
        break out
    at next = in
    
    repeat at next out
return out

And that's it.
13.9.2022
I'm making more and more discoveries about how to handle side effects in a pure functional way, while also sensing the need to be able to map high functional concepts to lowlevel platform concepts all within the same dataformat, and I'm also getting increasingly tired by the unwieldiness of LLVM as a compile time dependency, so I've decided to reorganize FLIR to cover the needs of Scopes as well, while keeping the C target. That effectively means that the Scopes compiler backend is getting a rewrite.
Currently, Scopes doesn't even have its own intermediate format. We do have an intermediate representation (SSAT), but it doesn't exist in a serializable form. The new intermediate format is going to be SCIR, which is going to get all SSAT elements for the low level representation. SCIR, like FLIR, in turn is a FLIF (Frameloop Interchange Format) file, that means it only consists of nested blobs (byte strings) and cells, which look like s-expression lists.
Each instruction in the SCIR format is a cell that starts with a switchable 32-bit code to identify the kind of instruction, followed by the instruction data.
Based on what I've learned, different to Scopes, the SCIR compiler bootstraps itself.
We start off with the Scassembler, which is a 2000 line program implemented in Python and consists of a SLN-compatible parser and a code generator that takes the parsed syntactic structure (informally called "Scassembly", and using the file extension .scasm), produces a SCIR graph from it and serializes it back to disk (using the file extension .scir). The Flassembler helps us to get out of the turing tarpit ASAP, providing light syntactic support for conveniently defining instruction DAGs, lexical closures, loops, modular dependencies. It also implements an unhygienic macro system, on which it is possible to e.g. implement structural schemas.
The Scassembler needs to be light because it is essentially a throwaway implementation, only required for bootstrapping. As such, it only implements the layer of the language we need to write the rest of it, conveniently.
To run SCIR files, we need an interpreter since the compiler doesn't exist yet. This interpreter, written in C and called scvm, is also just a throwaway bootstrap interpreter, which is not particularly performant but allows us to run SCIR code at all.
Implemented in Scassembly then is scirc, the SCIR compiler, which can load a .scir file from stdin and output a .c file to stdout. The generated C code shares with scvm a common runtime called scirt.h, which implements the high level datatypes that the interpreter requires, and that C code can fall back on where no specialization is possible.
Naturally, scirc can compile itself, although it is not particularly good at lowering instructions yet, and so its compiled performance only improves by about factor x3-x10 compared to its interpreted run.
So far, so good.
To fully support Scopes then, we need

Forking the FLIR codebase to a SCIR one, porting the existing tools and redefining the fundamental instructions.
A SCIR native adaptation of Scassembly's parser and code generator, which is altered and extended to provide all the bells and whistles that the Scopes language provides, namely the sugar and spice expanders. This one is also written in scasm. Since we target C now, we need to ship and invoke an external C compiler and linker to provide on-demand compilation features, but it's going to feel the same as the old system in practice.
A new, separate bridge to libclang that translates C headers to SCIR modules, written in scasm. The present implementation in Scopes can also compile modules in-memory, which we'll have to refit to the external on-demand system described in (2).
A new backend for scc that generates SPIR-V and GLSL, probably written in Scopes.

As a result, we'll get these new features in Scopes:

No compilation dependency on LLVM anymore.
Support for different C compiler backends (GCC and Clang supported out of the box, TCC also planned).
Support for all platform targets that existing C compilers support, support for cross-compilation.
Flawless C ABI support through existing C compilers.
Faster bootstrap and compile times since we can now cache modules at the SCIR level.
Support for writing new native frontends (languages) and backends (targets) for SCIR, as well as custom SCIR processing tools.
Programmable compile time state tracking through simplifications of the language model.
Support for improved and programmable borrow and mutation checking, using programmable compile time state tracking.
Access to the high-level Frameloop DSL, which allows users to write sandboxed, web-proof unified CPU and GPU compute & rasterization pipeline programs.
Unions can now be represented natively.
Better type deduction for loops and recursive functions.
Automatic transformation of head-recursive functions to loops, solving the problem of stack overflows.
Compile time capacity estimation through implicit interval arithmetic.
Simpler conditions for C++ support in the future.

These are the changes that Scopes users can expect with the rewrite:

Cyclical pointers in hand-assembled AST nodes are no longer permitted.
A backwards compatibility layer that can be imported as a module, which ensures that existing code keeps running as-is, even though it will take slightly longer to compile than code that has been written for the new version. Therefore, all following changes are opt-in.
Structural changes in the AST provided to sugar and spices itself can be expected.
LLVM intrinsics are no longer available (until I provide a new LLVM backend, which is planned).

12.9.2022
For good error reporting, every value in a compiler needs to have a source location (called anchor in my codebase), but that is not enough. In practice, every value has two sources of origin:

The source location of its unevaluated expression. e.g. an evaluated value may be 2, which has no source location, but it is the result of evaluating the expression (+ x y), which does.
The source location of the context expression under which the evaluation took place. e.g. (+ x y) may have a source code location, but it is inside a function g, so we must also point at the application of g - the source code of its unevaluated expression. Or, (+ x y) may have been generated by a macro, so (1) points at the line in the macro, but (2) must point at the location where the macro was applied.

Information (2) forms a chain, since application/call contexts are nested. So, we're listing the source location of the callsite (1), but then we also need to list the context under which the callsite was applied. And so on, and so on. In practice, that means that we're building a cactus tree, a cons list of anchors. Hence the format of an anchor is
Anchor = (source-path:str source-lineno:int source-column:int [source-offset:int] context:Anchor)
For the value of a disputed symbol x in (apply (fn (x) (apply f x)) y), its first anchor would be its argument location in apply f ->x<-, its context anchor then the x in fn (->x<-), and the context anchor of that the y in the application binding apply ... ->y<-.
For let x = y; f x, we can understand the order similarly, as this is sugar for the same expression above. The context order is let x = y; f ->x<-, let ->x<- = y; f x and then let x = ->y<-; f x.
10.9.2022
As part of reading about and experimenting with linear logic, I devised a small data structure for a cyclical directed memory graph with explicit sharing nodes and 1:1 pointer-pointee relationships otherwise. With an ownership model, the graph can be used to implement GC-free memory management, where the read cost lies in double-dereferencing shared pointers, and the write cost is maintaining shared nodes and bidirectional pointers. Memory overhead is 8 bytes per allocation, and 16 bytes per shared edge.
Here are my notes on the data structure:
GC-less memory with linear pointers (one address, one user):

two kinds of allocations: data / rake

|  | | | | |     
|  +-+-+-+-+ rake
|  |
|  
|  |
|  +-+-+-+-+ data
v  | | | | |

each pointer to an allocation is tagged with 1 bit:
    * kind of allocation: data / rake

each data allocation has a 8-byte header, describing the back reference:
    * tagged address of pointer to this allocation
    * (+ 8-bytes for applications that need backedge traversion: 
       tagged address of allocation containing pointer)

allocation pointers always point to the content after the header
the header is retrieved by decrementing the pointer by sizeof(header)

for data, the content is user-defined

a rake allows the sharing of a data resource by multiple pointers

a rake structure for N users looks as follows:
    { <data*> <edge>[N] }

where <edge> is a structure that looks as follows:
    { <header> <data**> }

    * <header> is the same 8-byte header used for data allocations, only here
      we store the backreference to each pointer to the shared resource
    * `data**` is the pointer to the beginning of the rake

a pointer to the edge of a rake (called a rake pointer) points to its offset field. 
we can then convert a rake pointer to a data pointer using:

    data_pointer = **(data ***)rake_pointer
    
we can compute the index of our in-edge from the offset:

    offset = (uintptr_t)rake_pointer - **(uintptr_t **)rake_pointer
    edge_index = offset / 16 - 1

Some properties of this structure:

Data pointers are dereferenced once to get to the data. This makes access as fast as any native solution when the pointer is linear.
Rake pointers are dereferences thrice, but rake edges are stored sequentially and cache-friendly.
Types can be used to specialize pointers to data/rake where the kind is known at compile time, in which case the pointer tag is not required. For polymorphic pointers, a conditional check is required first to canonicalize the data pointer.
Each data allocation has only one user, whose backreference can be read from the header. Each pointer has a unique target address and direct access to the backreference to it. Targets and sources have a bidirectional 1:1 relationship.
The only exception to this rule is the rake, which encapsulates many-to-one edge relationships. But since all duplicated pointers are local, bidirectional changes are local as well, and therefore trivial to perform.
Swapping out a data allocation requires updating one backreference, and updating the backreferences of all contained pointers, if any.
Edges can be followed and cut bidirectionally. That turns every pointer into an optional weak reference.
Unique references can be shared simply by inserting a rake inbetween the source and the target. A shared reference can be made unique simply by duplicating the data allocation.
When a data pointer is nulled, the referenced data has no more users and can be deleted. Referenced rakes delete the relevant edge (typically the last edge is moved to fill the gap), and can be reduced to a linear pointer when only one edge remains.
Rakes can be grown when their capacity is exceeded, which requires fixing the data's backreference and all pointers to the rake.
Cyclical references can be quickly dissolved at rakes through an ownership model, either by having an invariant policy such as "edge 0 is strong, edges 1..N are weak", or by extending rake edges with a weak flag.

4.9.2022
The simplest way to manage the memory of a directed acyclic graph is to use reference counting, which is equivalent to counting the out-edges of a node, which represents a resource to be managed. In such a graph, we only have strong references, and we free a node as soon as the number of out-edges reaches zero.
But it is also possible to instead compute the scope of a node, which is the parent node whose removal also definitely removes all references to our node. I posted an example graph and resource tree for that graph on mastodon. As you can see, the postdominator of a node is also it scope node.
A resource tree effectively splits up a DAG into a nesting of DAGs, in which each nested root, in addition to its in-edges, manages the memory required by the nodes in its scope.
A better way to imagine this arrangement is to think of a directory tree, in which we want the same file appear in different directories. We can use ln to produce multiple links, but in what directory should the original file live? If we impose the constraint that the target path for ln must not descend into parallel directories, then we must know: what is the deepest directory we can place the file in so that any link to it can reach it with the fewest repetitions of ../? This directory is the scope of our file.
The first question is now: what sort of supportive structures and algorithms do we need at runtime to have a DAG manage a parallel resource tree as it is transformed? And the second question is: how much about this process can we already infer at compile time?
A node with such support will need to maintain a resource set, an array of nodes that are in its scope (that it owns), as well as an array of in-edges. Each edge is a local reference, describing a path to the parent node that holds the resource, and the index of that node in its resource array. We could also specify a direct pointer to that node.
Consider a depth-first search of the resource tree: because in a tree (as opposed to a DAG) the path to a node is always linear, the stack of nodes we are presently visiting is also the scope path to the topmost node on the stack. This nodes' edges path to the owner could then simply consist of two numbers: the number of hops we must traverse up the stack to find the resource set, and the index of the node within that set.
A depth-first search of the DAG itself then could be performed on the resource tree, if we follow a (to-visit) stack of resource tree stacks/paths instead: at any point that we need to visit a node that is owned by a parent further up, we can truncate the tree path and push it to our to-visit stack, so that afterwards we can pick up work on the previous tree path we were traversing. This always gives us the canonical path of a DAG node.
Moving on to insertion. Referencing a new node Q to a parent node P, Q will also be appended to the resource set of P. If we now also reference Q from node M, then we need to move the ownership of Q from its current parent P to scope(P, M). scope(X,...) can be implemented by finding the longest common path of all arguments. The scope can only expand.
However, removing the edge M->Q, we can contract the scope of Q, but figuring out to which parent node would require once again the longest common path of all out-edges.
My general intuition would be that this is really impractical to do at runtime, but it might be good enough for compile time, when all we want to know for a value is: what is the scope at which we should manage its memory?
2.9.2022
For the past weeks, I've been racking my brain trying to devise a (pure functional) alternative to borrow checking, and I think the following technique could work, or at least be the beginning of something that works.
In order to facilitate parallel mutation in a pure functional expression tree, the mutation history on a blob can be represented by a tree of zoned revisions. We begin with a single root zone with revision r0, that covers our entire space 0..sizeof(blob):
[r0                                                   ]

We can now mark subzones as changed, by giving the parent zone a list of non-overlapping child zones, which in turn can become parent zones as well. All non-overlapping zones may use the same incremented revision.
[r0  [r1   ]          [r1]    [r1                  ]  ]

A subzone may fully change multiple times.
[r0  [r1   ]          [r8]    [r2                  ]  ]

And a subzone can in turn have subzones, where revisions must be greater than that of the parent:
[r0  [r1   ]          [r8]    [r2 [r4    ][r3     ]]  ]

We can now check the availability of a zone (e.g. before performing a read or write) with a desired revision by testing a zone for its largest revision:
[r0  [r1   ] <r?    > [r8]    [r2 [r4    ][r3     ]]  ] = r0
[r0  [r1<r?]          [r8]    [r2 [r4 >  ][r3     ]]  ] = r8

When some or all parallel operations have concluded, the tree can be flattened by propagating the latest revisions upwards:
[r0  [r1   ]          [r8]    [r2 [r4    ][r3     ]]  ]
[r0  [r1   ]          [r8]    [r4                  ]  ]
[r8                                                   ]

It is also valid to upgrade subregions to the latest revision:
[r0  [r1   ]          [r8]    [r8                  ]  ]

By a similar mechanism, we can create overlapping revisions:
[r0  [r1   ]          [r8]<r8 [r2 [r4  > ][r3     ]]  ] -->
[r0  [r1   ]          [r8][r8                      ]  ]

Or another example for the simplest case, see this sequence, where the second operation overlaps the first one:
1) [r0       <r1                   >                     ]
2) [r0       [r1          <r2      ]         >           ]
3) [r0       [r2                             ]           ]

We could also perform smarter rewrites, that keep more of previous regions accessible:
1) [r0       <r1                   >                     ]
2) [r0       [r1          <r2      ]         >           ]
3) [r0       [r1         ][r2                ]           ]

We track these operations at compile time in order to determine which slices and subslices of a blob are still accessible even though multiple mutations have been performed on the same blob, or even within the same slice, and which previously fetched references are no longer accessible because they overlap regions that have since been mutated.
30.8.2022
A complication of immutable trees is that a change in a tree node requires a pointer update in the path to it. But if we want to replace all occurrences of that node in the tree, and we're already in an iterative process that isn't compatible with depth-first traversion, an algorithm for this can become needlessly complex.
Here, it would help to instead represent each node by an index (a local pointer) and store its content in a table (a list of nodes). Traversing the graph now requires to follow indirections through the table. There are two types of elements in the table, lists and strings (as the terminal nodes). Each list only holds indices into the table.
It needs to be said though: while the graph table is also a good precursor to a serialization format, and deduplication is easier to manage, it's not great for every job. Tables allow for cyclical references (which may be of benefit, depending on what kind of graph you want). Non-sparse tables require index reordering. Unused nodes do not disappear automatically, unless we add refcounting. Node indices aren't unique when they're used by multiple tables, and nodes can be foreign to each other when they belong to different tables.
We can understand the graph in its refcounted pointer form as being stored in a global table, the program's heap memory. But the heap is mutable, while we want to retain the immutability guarantee so we can make copies of a table.
Once we have the graph to table transform, we can also formulate its inverse, to produce a graph without indirections again.
29.8.2022
Ruminations.
I noticed that translating high level pure functional code to different semantics likely requires extensions to the IR that allow us to describe the types and operations that C supports, before we do the final translation to C. I already do an underspecified translation from nested functional dependencies to nested conditional instruction blocks, and some optimization operations are easier to perform in this space, while others better work in the functional form.
The current stages of translation roughly look like this:
flasm:
1. flasm text -> flasm parser-lexer -> generic syntax tree
2. generic syntax tree -> flasm codegen -> FLIR tree
3. FLIR tree -> flasm serializer -> FLIR binary
flc:
4. FLIR binary -> flc deserializer -> FLIR tree
5. FLIR tree -> flc strip unused functions -> FLIR tree
6. FLIR tree -> flc fold constant expressions -> FLIR tree
7. FLIR tree -> flc conditional toposort -> SSA tree
8. SSA tree -> flc annotate refcounts + generate destructors -> SSA tree
9. SSA tree -> C code generator -> C formatter tree
10. C formatter tree -> C formatter -> C text

In addition to bringing instructions into an explicit order of evaluation, the SSA tree format puts all instructions into a table and remaps all instruction arguments to indices into that table. Doing this labels every instruction so it's easy to derive a variable name for it in the C form, but it makes tree rewrites much harder. Likewise, explicit instruction blocks are easy to append and inject to, and they give us the final lexical scope in which an instruction appears, but we can't easily rewrite operator trees anymore.
Type specialization, which requires replacing instructions, does not yet really happen in any stage, and this is something I think should be performed on the FLIR tree. Presently, we perform type deduction with a separate cached self-updating structure that for any expression deduces its resulting type by descending into the tree and deducing bottom-up.
But we might need two sets of types: the first set is for propagation, deduction and constant expression folding. The second set would be for C. Suppose we needed a more low level set of FLIR instructions that maps closer to C instructions, a kind of version of C that turns mutations into "pure" operations and enforces proper memory management, and is trivial to translate to C syntax or some dialect. In this version, we don't have explicit ordering yet, so it should be sort of "C with implicit ordering". The type system and set of instructions mirror the ones of C.
for C types, I'm thinking of a set comparable to this one:
# types
void
bool
int 8|16|32|64 <signed?>
float 32|64
array <size> <type>
struct <type> ... # unnamed structs
union <type> ... # unnamed unions
pointer <type> # address or NULL
function <return-type> <argument-type> ... # we don't support variadic C functions
typeref <index> # valid inside a typedef, used like a name, references indices of the parent typedef
typedef <type> <type> ... # used to make recursive types

# typing
typed <value> <type>

A linked list would then be defined as follows:
typedef (struct (pointer "void") (pointer (typeref 0)))

In addition, we would need C specific instructions, such as
field <value> <offset> # value._offset # for structs and unions
at <value> <offset> # value[offset], *value
cast <value> <type> # (type)value
extern <name>
static <value>
call <value> <args> ...

To support mutation as pure, deduplicable operation, we would need operations that allow multiple read operations (and enforce an ordering in which all relevant reads must be performed before a write), but are restricted to a single write operation per branch - moving the value.
Stacks are private to functions and are freed on exit, whereas there is only one heap that needs to be routed through functions and loops.
# operations
store <element> <pointer[A]> -> <pointer[A+1]>
heap # describes the initial heap
malloc <heap|pointer> <size> -> <pointer>
free <malloc|free> <pointer> -> <heap>
stack # describes the initial stack
alloca <stack|alloca> <size>

I haven't addressed one of the biggest issues yet, which is that we often need to offset a pointer and then write a value to it, which implicitly mutates the pointer base offset. Worse, if that pointer is retrieved from somewhere else, we don't have the base anymore. In principle: any pointer exhibits "local heap" properties, in that there can be independent suballocations that are referenced by other suballocations, and mutating those technically mutates the entire local heap. It's a lifetime nightmare to sort out.
In addition, I wonder what the value of a separate C instruction set is if there are no meaningful transformations we can or want to run on it, besides transforming it to C text. So I guess I won't do this.
But some good has come from this: I see now that explicit type annotation through a typed decorator makes it easier to glance from the generated tree what a value's type is, without having to permanently extend an expression with a new argument. The "hidden" type annotator that I presently use, with its own map, makes it harder to debug print what is going on.
Also, perhaps it's better after all not to use a separate SSAT, but instead extend FLIR with new instructions that give us the same functionality:
block <value> ... <result> # evaluate all arguments from first to last, return result
incref <offset> <value> # increment the refcount of value by offset
decref <offset> <value> # decrement the refcount of value by offset

16.8.2022
A non sequitur about simple string compression. I've been reading about Huffman coding, suffix trees and longest repeating substrings, and came up with a fun (but not particularly good) way to pack a string with the help of a suffix tree.
We start with a fun example I lifted off the Wikipedia article:
A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED

The longest repeated substring in this sequence is DED_A, so we use a new character to replace this sequence in the string:
a = DED_A
A_DEAD_DAD_CEa_BAD_BABE_A_BEAaBACA_BED

Now we find the longest repeated substring again and iterate on this until our longest repeated substring is smaller than a character. We end up with
a = "DED_A"
b = "A_BE"
c = "_BA"
d = "AD_"
"A_DEdDdCEacDcBE_bAaBACbD"

We have everything to build a Huffman coding for our sequence. We start by counting the frequency of each letter. I do this manually by first ordering the letters, counting the repetitions and finally ordering the table by frequency:
__AAABBCCDDDDEEEaabbccdd
_ = 2
B = 2
C = 2
a = 2
b = 2
c = 2
d = 2
A = 3
E = 3
D = 4

The next step is building our tree, by repeatedly replacing the two lowest scoring nodes with a pair of the two, which gets the combined score. After the first step, we have
C = 2
a = 2
b = 2
c = 2
d = 2
A = 3
E = 3
D = 4
(_ B) = 4

After the second, we get
b = 2
c = 2
d = 2
A = 3
E = 3
D = 4
(_ B) = 4
(C a) = 4

Drawing the rest of the owl
(((E D) ((_ B) (C a))) ((b c) (d A))) = 24

Now the huffman code of each character is the sequence of indices we require to index into this structure; since we can only go left (0) or right (1), the code is equivalent to a binary number, whose number of digits equals the tree depth of the character. We get the table
E = 000
D = 001
b = 100
c = 101
d = 110
A = 111
_ = 0100
B = 0101
C = 0110
a = 0111

Renumbering these entries to get the canonical coding, we get
A = 000
D = 001
E = 010
b = 011
c = 100
d = 101
_ = 1100
B = 1101
C = 1110
a = 1111

And now we can replace the entire string with this 80 bit sequence:
00011000010101010011011110010111110000110011010101100011000111111010001110011001

Presto, we have packed a 46 byte string into 10 bytes. Well, not exactly, since there's still the table to take care of. And there we see that it wasn't such a good idea to substitute the longest strings, since that means we also need to pack the table.
# encoding number of elements for each bit width, followed by dictionary
(0,0,6,4) (A,D,E,A_BE,_BA,AD_,_,B,C,DED_A)

Consider a huge chunk of text appearing twice. By committing it to the table that early, we lose all opportunities to pack down that chunk further. It would have been better to start with short repeating substrings that appear very often. For example, requiring our substrings to be of length 2, we get
a = "D_"
b = "A_"
c = "bB"
d = "DE"
e = "Aa"
f = "da"
g = "cE"
h = "BA"
"bdeDeCEfcehBE_gAfAhCgD"

Which brings us down from 46 characters to 22+16=38, and would form a much better basis for huffman coding. Let's say we encode both table and content (without the table header), we get
((((g h) A) ((_ B) (e D))) ((E (C a)) ((b c) (d f))))
A 000
E 001
_ 0100
B 0101
C 0110
D 0111
a 1000
b 1001
c 1010
d 1011
e 1100
f 1101
g 1110
h 1111
bdeDeCEfcehBE_gAfAhCgDD_A_bBDEAadacEBA ==
10011011110001111100011000111011010110011110101001010011100001101000111101101110011101110100000010010010101011100100010001011100010100010101000

A 143 bit sequence, or about 18 bytes. Together with the table and header data, that's about 18+14+1+1+1=35 bytes. We have compressed the string to about 76% of its original size.
12.8.2022
Musings on memory allocation and the stack.
Logically, we can model every program as a graph of branching and merging continuations. But in practice, we distinguish between calls and returns. A call provides a return address (a continuation), and a return jumps to that address, without providing yet another address. Also, a call allocates memory on the stack, whereas the matching return frees that memory.
In fact, I suspect that use of the stack is the primary reason why we use this dichotomy, giving us a simple implicit way to operate a function-scoped arena allocator. In addition, we use the (C) convention that the caller provides all stack addresses, for both call and return arguments. A function can not return in a way that leaves new data on the stack - that would prevent it from cleaning up its own temporary stack memory. Instead, it must write to an address provided by the caller. Hence, if we call with a variable sized buffer argument, the task is easy. But if we want to return a variable sized buffer argument on the stack, the caller must perform several steps: a first call computes the size of the buffer to be returned. Then the caller allocates the buffer on the stack, and finally provides its address to the function which is supposed to fill the buffer.
Despite this simple arrangement, program stacks are small. On Linux, we only get 8 MB of stack memory or so. We are supposed to use heap memory for the bulk of data we're fiddling with, and then implement our own allocators there. Of course nothing stops us from implementing application specific stacks there.
The benefit of heap memory, apart from its huge size, is that we can cross-connect pointers across allocations since we're in full control of their lifetimes. On a stack, it is only truly safe for pointers to point downward, to earlier allocations; only this way we can guarantee that no pointer will ever point to invalid memory. A side effect of this rule is that we have topological ordering. We can still support upward pointers that describe cyclical graphs, by guaranteeing that they reside within single allocations that we pop as a whole, e.g. an array.
So, in principle, the stack is "memory complete". We do not need a heap. But the fact that the caller must provide the memory and perform a layout pass (estimate size, allocate, then fill the buffer) can slow down algorithms. For some algorithms, size estimation entails allocating temporary buffers within the function, only to count their size, throw the work away, and then recompute the data we had already computed before. At best this only increases complexity by a factor (2, to be precise), and constitutes negligible overhead, in my opinion. At worst, complexity increases by factor 2 with every nested call, and so we get exponential complexity. That's not so nice.
In many cases, we are likely able to compute upper bounds for memory requirements of a function at compile time, or can even require the program to always be written in a way that makes upper bounds inferrable. For such functions, we can skip the size computations (or generate a function reduced to the fewest necessary operations to compute the size) and provide a buffer that fits.
Now, I wrote all this with purely immutable memory in mind. If we consider mutable memory, perhaps just as a simple optimization where a buffer is replaced with an altered copy of itself, then it is possible that the buffer, should it contain pointers, ends up pointing to memory above it. The altered copy would have been strictly pointing down, but mutating an earlier allocation instead will give us both up- and down-pointers. The only way to avoid this is to always duplicate buffers into which we're inserting references to data that is newer than that buffer. String-like buffers can always be mutated, since they hold no references. This means a pointer tree can only be safely mutated when we swap some of its contents, or insert data from a different pointer tree of same or older age.
Similar caution must be applied when writing references to a return buffer - no reference in that buffer may point to data in the current function's stack. This complicates returning pointer trees: our return buffer must hold space not only for the root buffer, but also all new, locally generated data elements we intend to reference from it. This implies that a buffer is always allocated in the function in which its lifetime ends i.e. where it is not returned from the function. Thus every function requires memory with super lifetime, provided by the caller, for the results it intends to return, and memory for local lifetime, for the results it consumes before it returns (e.g. through a reduction and comparison).
When we use this stack allocation scheme with compile time estimation, it is possible that we waste a lot of memory on conditional returns that return very little on one branch, and a lot on the other. At every turn, the worst case has to be considered, and so allocations could grow much larger than they have to be. Particularly for this class of function, it could be worth it to perform a double pass.
Another concern is finite system resources like handles to files or GPU buffers. By definition, these resources do not point to our stack, and so constitute leaf structures, just like string buffers.
It stands to reason that, if we split up a program into pure functions that produce system commands and continuations, we will need to provide, at the top level of the program, enough memory to hold the system command, the closure environment and everything that environment references, which constitutes the complete state of the program.
It gets tricky when we move from one root continuation to the next. The values returned by the second iteration could be referencing data from the first, and so we need to stack the second iteration on top of the first. It becomes obvious here that without a GC and data introspection, we can never free prior buffers, unless we can guarantee somehow that there are no old references held. It is obvious that we will have similar problems within the program, namely with loops or multiple functions executed in sequence.
Considering these caveats, I wonder if it wouldn't be best to just stick to the present system of refcounted heap allocated buffers, and simplify usage where we can. For example, where we perform refcounting at compile time, we can elide most of the runtime bookkeeping. Where we can prove that a heap allocated buffer fits register size, we can pass it by value. Where we can prove that a mid-sized buffer remains function local, we can allocate it on the stack.
9.8.2022
Three more changes to FLIR that simplify the IR and IR typing:


Support for vectorized operators is being removed i.e. the stride attribute disappears from all instructions that use them, and shuffle, extract, insert and vecmap will be removed entirely. The main reason is that vectorized operations can not even be guaranteed for every target, and writing them is only of interest to specialists - which runs somewhat contrary to the intent of FLIR, which is to keep abstractions at a logical minimum, and make optimization purely the compiler's business. This means that blobs no longer compete with cells when it comes to composite structuring. You want a vector? Use a cell and cellmap. It's the compilers job to figure out later how to optimize these instructions, and it is not difficult to describe code in a way that makes vectorization easy. it turns out gnu/clang C support vector extensions, and so I'm keeping this for a while longer.


System commands will be generated by a new set of instructions, rather than constructed from cells. This allows us to give them a special opaque value type, and infer argument types, making the signature explicit, but also most importantly: we type arguments where the instructions are applied, before they are merged by branches, rather than at the program endpoint, where we only see the merged command structures, which makes typing template closures an unnecessary special case.


The closure ABI changes from concatenating a tuple of arguments to the environment cell (env...) + args -> (env..., args) to always producing a pair of env + args -> (env, args), which makes it easier to maintain structure when env has a size that is not determinable at compile time, and also allows for env to be a blob rather than a cell. Since nested functions in flassembly need to be able to index their environment and arguments statically, the function concatenates args to env... as it usually does when it constructs subsequent calls, which was previously the callers job, but now falls to the callee - should they need it. With this convention, we wouldn't be able to isolate values external to the scope that depend on environment arguments from a different function. Presently that's no problem, because environments stack call arguments for nested closures in a way that is invariant to the subclosure we're in. That would not be the case anymore with this change.


1.8.2022
Okay! I have now made major changes to the FLIR structures that allow for better handling of recursive references. Functions now no longer reference each other directly by content, but by their index in the top level function table, represented by the program instruction. This also plays well into the code generation pass which requires functions to be top level anyway, and is easy to generate in the parsing and syntactical expansion steps.
Specializing this structure now allows us to rewrite individual functions without having to rewrite all dependencies, or generate thunks for functions we're still in the process of typing, since functions are no longer addressed by content.
The next big task is to figure out when and how to merge reentrant function types during specialization. We have two kinds of reentrant structures:

A loopfunction (formerly loopbody) tail-recurses into itself (or out of itself, however you want to see it), fully updating its own environment. These have a cycle budget, so must halt within a fixed time of computing time spent.
A continuation closure consisting of a function (formerly body) and an environment passed to the system and invoked in the next program iteration, often with an additional argument appended to the environment. Since these kind of closures keep the program responsive as they give the system time to suspend/resume, they are allowed to be cyclical. They are head-recursive during typing, so we can detect this type of reentrance with a compile time stack.

For both kinds, for the same function, the type of the environment can change on every iteration. Some elements may be iterating constants, some elements remain variables of constant type, some elements can be variables of varying type.
When the input arguments to a loopfunction are constant i.e. they do not depend on an outer environment, then the environment can be inlined and the loop fully computed to its natural conclusion: the loop is then a compile time loop. For a fully constant continuation closure, the environment can be inlined into the function so that in the closure it is empty.
For varying types, we can iteratively type a loop multiple times, hoping for the types to settle: as soon as we deduce an environment type that has been previously seen, we have hit a cycle in the type itself.
In either case however, it is likely that we never settle, simply because of structures like {N <type>}, where N is an integer that steadily increases. Or we could also generate infinitely nesting types, or growing elements, the list of permutations is endless. The art here is to figure out which elements never change, and to generalize the others in the most useful, least destructive way, i.e. perform some kind of type integration.
I would probably start by picking the most destructive way and then improve where obvious. For loopfunction, this is a local process. For continuation closures, we only employ this technique when we recognize that reentrant typing is taking place:
We merge the initial and the new environment type to produce a more general type that fits both, which increases the likelihood that on the next iteration, a merge results in the same type, which allows us to close the loop - it could possibly take multiple iterations for the loop to generalize successfully, and it would be useful to be able to prove here that generalization always terminates.
For loopfunction, we can then simply use this closed loop as the loops only form. For continuation closures, we could patch the stack so that the topmost reentrant function uses this updated, more generic implementation instead.
The type merging process is not dissimilar to merging branch types, though with branches, we can often afford to build sets, since the number of branches is finite. However with loops, we could potentially end up performing many iterations until sets are complete, and simple numerical sets can be 2^32 big. Hence, this gives us an avenue to optimization levels during compilation: the higher the desired optimization level, the later we give up on a set and generalize it as continuous.
We will, however, never generalize function references (fref) and maintain their sets, which must by definition be finite.
29.7.2022 B
After sleeping on it, I may have the beginning of an approach to reentrant function typing that doesn't suck and extends well to dealing with other types of indirect recursion. I also figured out something about optimizing buffer usage through better instructions, which I'll explain after.
First, let's think about how we're going to port continuation based code to the GPU later on. There is no support for free control flow. That means no goto labels, let alone function pointers. We do have switch cases though, which means that we can encode function pointers as program-level unique integer labels, and at runtime, use a switch case to dispatch to the correct function call. That can create a lot of overhead on divergence, but it runs!
If we don't perform this transformation at code generation time but directly in FLIR form (which is possible, since we have switch cases), we end up being able to elide function pointers entirely: a function reference always appears directly at a call site, and any indirect handling of functions is performed through integer label and switches. By collecting sets of labels as we type expressions, we keep track of their possible values (there can only be as many values in a set as there are functions in the program), and so we can transform apply operations into switches with the exact number of cases required.
In other words, where required, we turn closures into sum types and applications into switch trees. How do we figure out when it is required? As to the continuation recursion issue, it helps during specialization to keep track of the current function reference stack. That is, when we apply or continue function C within function B within function A, this creates a stack of A B C. If function A is now referenced again in function C, then we know that this parameter can not be inlined into the function as it is presently being applied. In other words: it is no longer a body constant, and must become a label - this one is then safe to inline.
There's still one point where we need function pointers: at the edges of the program, where we produce commands with attached continuations. Since we return a sum type instead of a closure (i.e. a pair whose first member is a label), but the exterior dispatcher needs an entry point, we can instead generate a wrapper that performs the call. We can generate this wrapper only after the wrapped function has been typed, since we would just enter another recursion otherwise.
The only corner case here is that recursive calls within the same hierarchy can not be resolved, but FLIR does not permit direct recursion anyway - precisely because that allows more code to be GPU compatible.
Now. On the CPU, resorting to a switch case is not the fastest method, since we do have function pointers available, so here we can instead set up a static LUT (lookup table) that is initialized with the function pointers when the executable starts. When a label is referenced, we can just fetch its index from the LUT then and there. Fortunately, the LUT is already generated by the compiler, and so we can solve this by simple forward declaration: first, all functions are forward declared, then they are implemented, and now every function has access to the function pointer of every other function. Since the mapping of label to function is known at compile time, we can now directly select the correct function.
Since we have now slightly different behavior on CPU and GPU, but need to transform the code during specialization, we need to find the method that is a superset of both. It is easy to start with a simple label transform and permit apply to be called on labels. The GPU code generator can then later also use forward declaration and generate a simple switch case in-place.
We do have the added issue on GPU that unions aren't supported either, but that can be overcome with canonical forward and inverse transforms: we can transform a structure to a canonical form the GPU supports, i.e. an array of words, and then perform an inverse transform to a different structure. The center parts can often be elided and so we get the shortest path transformation between two structures. It's not as fast as casting a pointer, but it's safe.
Alright. Now on to the second topic, which is about optimizing buffer usage. On present hardware, any program benefits greatly from being able to avoid allocations at runtime, and so it should create as few buffers as necessary, possibly so small in their entirety that they could be passed around in registers or on the stack before we need to resort to using the heap. But when we need heap buffers, it would be useful if we could reuse them. Therefore, instructions that "generate" new memory should, where possible, instead be rewritten to reuse existing memory, so that we can effectively limit the use of (re)allocations to a single instruction, which helps with analysis. And this can be done in a pure functional way.
Following exemplary instructions and commands, which produce values greater than the data that went in should change to become operations that do not always provoke implicit allocations, and make reuse explicit:
# now derives `count` from the buffer provided, and potentially moves that buffer when possible
vecmap stride:U32 count:u32 closure:fn(u32) -> vecmap stride:U32 count:<0... x stride> closure:fn(u32)

# we extend the definition of `insert` to permit elements that can be a multiple of `stride`, making this a
  `memcpy`-like function. We also rename it.
insert stride:u32 value:<0... x stride> element:<1 x stride> index:u32 -> <delete>
join a:<0... x u8> b:<0... x u8> ~ overlay stride=1 a:<0... x u8> b:<0... x u8> index:u32

# read some memory potentially reusing the provided buffer, calling the continuation with a buffer of same or smaller size.
: flir_cmd_stdin size:u32 cont:closure -> : flir_cmd_stdin dest:<0... x u8> cont:closure


In addition, we need a way to construct, extend and discard buffers:
# declares a blob with undefined contents of requested count; undefined contents of a buffer will be
  in practice initialized with zeroes, but accessing them will produce undefined buffers as well.
undef stride:U32 count:u32

# an annotation operation that forwards the buffer and declares its contents undefined, allowing for additional
  optimizations, see below.
delete value:<0... x ?>

The join operation can then be implemented as:
join = fn (a b)
    result = overlay 1 (undef 1 (+ (len 1 a) (len 1 b))) a 0
    overlay 1 result b (len 1 a)

We can then later gather all uses of an undef instruction and find, through lifetime analysis, places where buffers are dropped, and chain up these buffers with discard placed inbetween. This chaining can overcome function boundaries, by routing undefined buffers into and out of functions if it benefits reuse.
29.7.2022
I now have support for typing expressions and functions - just the typing, the C types are still generic.
But while typing continuation closures of command buffers I am running into infinite recursions, and that's not fun at all. i'm still chewing on that one.
The problem is that the program data is both immutable and directed acyclic, which means we can't directly patch in a recursive reference, but have to instead pass the self-reference through the function arguments so the function can (indirectly) call itself. Fortunately, this is already done by the user, so we can always be sure that when we see a connected function refer one further up in the continuation order, there has been an explicit body constant passed to its environment.
So, theoretically, all we need to do is recognize this indirect argument and make sure that it is properly typed where it is first passed, but then it is also possible that this by-argument passed function is specialized in multiple ways, which means now we need to patch the environment and call order to bring in more instances. However, this can be simplified a little by just replacing the single body reference with a cell of bodies that resides in the same spot, but performs one more indirection to retrieve the correct instance.
How to implement this mess with an elegant rewriting technique though, I'm not sure yet. Say we have a graph like:
%10 = body
    cell "STDO" "10 GOTO 10\n"
        cell
            cellextract (fenv) 0 0
            cellslice (fenv) 0 1
cell %10 (cell (cell %10)) # the closure

We can see that the closure passes the body to itself so it can construct a closure for itself. Simple deductive typing gives us
%10 = body
    cell "STDO" "10 GOTO 10\n"
        cell
            cellextract (fenv) 0 0 == %10
            cellslice (fenv) 0 1 == (cell (cell %10)) 
cell %10 (cell (cell %10))

The problem is that even though we are able to extract a constant,  we can't put %10 into itself there. To make matters worse, typing %10 itself deductively specializes the function but not the captured template, and so we lose the relationship:
%11 = body<(cell (cell %10) (cell))>
    cell "STDO" "10 GOTO 10\n"
        cell %10 (cell (cell %10)) 
cell %11 (cell (cell %10))

Continuing the typing, we just create an infinite recursion:
%11 = body<(cell (cell %10) (cell))>
    cell "STDO" "10 GOTO 10\n"
        %12 = body<(cell (cell %10) (cell))>
            cell "STDO" "10 GOTO 10\n"
                cell %10 (cell (cell %10)) 
        cell %12 (cell (cell %10)) 
cell %11 (cell (cell %10))

22.7.2022
It occurred to me earlier that FLIR isn't really "necessary" as a binary format. It's a specialized intermediate representation that when loaded into a FLIR program itself consists of nothing blobs and cells, so I wonder whether it was really worth it to require a SPIR-V like format. For the compiler, barely anything changes. The interpreter becomes a little more complex.
In a generic format of blobs and cells only, we are able to replace blob ... instructions with just blobs, but each instruction must still be a cell that begins with a blob, and each and every cell must be an instruction. Here, it is technically possible to produce a malformed instruction by beginning with a cell rather than a blob, but a validation layer can catch these.
How the binary format would work here in the simplest case is that for either blob or cell we specify a size in bytes, followed by the data, padded with zeroes at the end for DWORD alignment, without signifying whether the content is of blob or cell type. The header of the file, the first four bytes, is an id, the index of the root node of the graph, with its 4 most significant bits reserved for the node kind, indicating whether the reference is a blob, a cell, or inlined directly into the id. From here, we can type the entire tree. The contents of a blob can be arbitrary, but each cell is an array of id values. As with FLIR, we'd enforce that references can only point back to earlier nodes, so that the root node is at the end of the file, and all nodes in the file are in topological order.
The benefit of this format would be that we can use it for serializing both programs and data.
21.7.2022
On eliding conditional function pointers in pure functional code.
There are effectively two methods: one by inlining, and the other by switch tables. Inlining works as follows:
f = fn (...)
g = fn (...)
(if x f g) args... -> if x (f args...) (g args...)

But this has limited application, since a function that conditionally generates functions would have to be completely inlined as well so we can resolve the condition.
Switch tables operate on a conditional type. if x f g produces a union f|g. If each function in our code has a unique index, then we can assign index F to function f, and index G to function g. At the point of function application, we can insert a switch case to dispatch to the correct function. This method also works in shaders, provided we do not end up with a recursion:
h = fn (x)
    if x F G
switch (h x)
case F: f args...
case G: g args...

20.7.2022 B
Thought I'd quickly describe how closures are implemented in frameloop.
First of all, they aren't. FLIR, the intermediate representation has no concept of closures. It has body (or loopbody for tail-recursive functions), which is an instruction that binds the root of an expression graph that depends on the fenv instruction, i.e. can only be evaluated when fenv is bound to an environment using the apply instruction (apply <body> <env>), which by convention is a cell of arbitrary content - i.e. in the simplest case, just a list of arguments.
But FLIR has also two map-like constructs, one for blobs (vecmap) and one for cells (cellmap), and these do indeed require a "closure", which really is just a cell containing template arguments to use when cellmap is invoking apply internally. We also use this definition of a closure when the system calls back into the program after a I/O operation was requested.
These constraints prescribe the basic format for a closure: it is a cell with two elements, of which the first element is a body value, and the second element is a parent environment of cell type. On application, the argument(s) passed to apply are appended to the parent environment to form the final function environment that fenv will be bound to when the body is evaluated. Or, as a formula:
closure := [body,[x0,...,xN,[body,y0,...,yN]]]
environment := [x0,...,xN,[body,y0,...,yN],args]
closure(args) -> apply(body, environment)

This format for environments provides several affordances, provided we track the lexical depth of a function as level L while generating the IR from text:


Every environment env represents a stack of applied closures, in which a function of level L (which for a top level function is L=0) finds its local environment at index env[L*2] and its local arguments at index env[L*2+1], and can in addition also access all values bound to its parent closures this way.


A function can provide a closure to itself from its environment, simply by truncating the environment to level L*2+1, or as formula:


environment := [x0,...,xN,[body,y0,...,yN],args]
closure := [environment[2*L][0],environment[:2*L+1]]

FLIR does not permit reentrant functions, but it permits self-captures that can be passed as continuation to a function-loop like construction (see 8.7.2022).

Since a closure permits passing extra arguments, these can be used to capture/memoize expressions in the parent function and pass them on into the new environment. This too can be done during parsing, by recording names to be captured, including a cell of captured values in the closure, and then binding those names to cell extractions from a constant location in the next level.

As a result of these affordances, we can conceptually work with runtime closures even though all function bodies defined in FLIR (including loop bodies) are top level functions and are easily translatable to C.
In addition, since we built all these abstractions with cells and pure instructions, an optimization pass can identify unused values and strip/flatten arguments and environments, or even bake additional expressions that only depend on the parent environment right into the closure.
If a function does not require its own environment, then the optimizer can reduce its closure to [body,[]], and the environment of the function will therefore simply be [args], and can be further simplified during code generation.
20.7.2022
A function-loop like continuation interface (as discussed on 8.7.2022) seems to be the best way for Frameloop programs to hand off control to I/O: the program's return value is a command identifier for the I/O operation to perform, followed by arguments, typically including a closure; which is called with the result of the completed operation, which becomes the next entry point into the program.
Presently I solve this through a list of commands, which has the problem that several of them accept callback closures, which can not cooperate on the same state. A continuation engine serializes all operations, and so processing is sequential.
But this means that we need the interface to model parallel I/O operations explicitly. The most generic example would be to permit threading for a program, which is akin to branching: the program supplies two or more closures to be run as parallel threads, as well as a merge closure (a fence), to invoke when all threads have completed, with the results of those threads.
Each thread has its own system I/O (since I/O would be the only reason to perform explicit async operations anyway) and can exit with a result other than just a return code.
The only issue I can see here is hanging - if one of the threads is accidentally caught in an endless I/O loop due to user error, the merge closure will never be reached. This is detectable by the system, as it is managing all threads, and it can present a debuggable error.
It is then possible to write small collector-like components that generate I/O statements and link each other up as continuations, and a bit of macro sugar around this, to get syntax like this:
include "std.flasm"

IO.main
    IO.stdout "program start\n"
    IO.loop () # a macro that binds IO.break
        IO.read 1
            IO.then (s)
                if (is s "\n")
                    IO.block # open a new sequence
                        IO.stdout s
                        IO.break # break from IO.loop
                    #else
                    IO.stdout s
    IO.stdout "done"
    IO.stdout "\n"

How this works is that e.g. IO.stdout does not perform the operation immediately, but instead returns another closure that takes a continuation argument; when that is passed, it produces the final function. this allows us to link up several continuations from back to front in order to build the final sequence.
Consequently applied to the entire program, a pure functional program can build imperative control flow this way, without needing monads - all we require is an impure driver loop that wraps the program and isolates it from side effects. This driver could then interface with the file system, maintain threads, sockets and even perform indirect FFI calls into C.
There are not only benefits, though. This model lacks system-side resource management. The lifetime of monads can be tracked to figure out when they're no longer needed. Here, for every allocated resource, we would require opaque handles returned by the system with a special type whose lifetime can be tracked. It does not matter that these handles belong to a system with side effects, since we only ever return them from the program in order to interact with them, and what happens outside the program doesn't effect its purity.
New tracked handle types still pose a complication to the system. Sometimes though, as in the case of the threading system outlined above, it is possible to avoid adding new types, by aligning the lifetime of the resource with the lifetime of a continuation.
18.7.2022
Now we come to specializing the types of a FLIR program. We start out with polymorphic expressions composed of either of our three high level types: body, blob and cell, and try to lower those to C, depending on the invariants we discover in the types.
To begin with, each of the three types already has a C implementation, but by discovering invariants, we can simplify the type.
Any value is, at first a fl_value, which is a pointer-sized tagged union, consisting of a 16-byte aligned pointer, and the tag encoded in the lower 4 bits.
body is modeled as fl_value (*fl_body_func)(fl_value), but if we know more about the environment or the return type, we can specialize the signature and improve by-value passing.
blob and cell are generally fairly similar: they're both vectors, implemented as piece tables of a single element type, i.e. they join at most log2(n) slices from different chunks for n elements. Where they differ is in their element type: blobs are always byte-sized, whereas cells hold elements of type fl_value.
So here are the invariants we can discover for fl_value:

It is either a body (callable) or a piece of data (blob or cell). This can be discovered either by seeing it appear as first argument to apply (when typing upstream), or by the appearance of a body instruction (when typing downstream).
If it is a body, we can tell from apply and by downstream-typing its graph what its argument and return type is.
We can discover it is constant, i.e., it consists of nothing but cell, body and blob instructions.
If it is a piece of data, it is a blob or a cell. We can discover this upstream by it being passed to a single-typed instruction, or downstream, by being returned from a single-typed instruction, starting with the cell or blob constructor.
If it is a piece of data, we can discover its minimum size 0 <= Smin and its maximum size Smax <= infinity. If Smin == Smax, then the size is constant. We can also discover that its size always shrinks or grows in a fixed increment (i.e. its stride).
If it is a cell, we can discover that all its elements are of the same type (it is an array) or the type of its elements is exactly known (it is a tuple).
cells can have a recursive definition, which is always created by loops. This recursive type is polymorphic and always has a terminal/leaf type that indicates the end of the recursion.
Some invariants are conditional i.e. they only apply within branches.
if and switch produce tagged unions of types, where the tag is equivalent to the condition/switch value. But we can also leave out the tag for less precision and greater generality.

So we land at these annotated types:

"..." for a perfectly constant blob.
cell C0 ... CN for a perfectly constant cell.
sum (tag0 T0) ... (tagN TN) for an explicitly tagged polymorphic value.
union T0 ... TN for an implicitly tagged polymorphic value.
blob S for a blob of fixed size S.
vblob stride Smin Smax for a variadic blob of minimum size Smin, maximum size Smax ("" for infinity) and stride stride.
array elementT S for a cell array of fixed element count S.
varray elementT Smin Smax for a variadic cell array of minimum element count Smin and maximum count Smax ("" for infinity).
tuple T0 ... TN for a cell tuple of fixed element count.
vtuple T0 ... TN for a cell tuple of fixed element count, however the last element is used as an extension, so that indices [i] >= N are transformed to [N][i-N]. The last element must be of type varray, vtuple or vblob.
let ((name0 T0) ... (nameN TN)) T for a recursive cell type, e.g. let ((CONS (tuple (union (blob 1 0 "") CONS) CONS))) CONS to describe a cons cell, or let ((R (union (blob 4 4 4) (tuple B B))) (B (union (blob 4 4 4) (tuple R R)))) (tuple (blob 1 0 "") B) for a red-black tree with a top level string label.
body returnT envT for a body with provided return type and environment type. If the environment type is a tuple, and it is never processed as a whole, then we can elide the tuple and pass values by arguments. I doubt that this brings great performance wins though.

These types can then be lowered to C accordingly, which we'll call the Frameloop ABI:

"..." becomes a global string constant.
cell becomes a global struct initializer.
When a types final size exceeds 16 bytes, its data is moved to the heap and refcounted.
sum becomes a struct type with a tag field followed by a union type field.
union becomes a struct type with an implicit tag field followed by a union type field.
blob becomes a uint8_t[] array. Other element types may be chosen based on the largest possible alignment.
vblob becomes a struct type with a size field, followed by a uint8_t[] field. Other element types may be chosen based on the largest possible alignment.
array follows the similar rules as blob, but uses elementT for its array type.
varray follows the similar rules as vblob, but uses elementT for its array type.
tuple becomes a struct type.
vtuple becomes a struct type whose last element contains an unsized array.
let enforces every named value to become a refcounted heap pointer of opaque struct type.
body turns into its equivalent C function pointer type.

17.7.2022 B
Finally! flc (The FLIR compiler) can compile itself!

compiling flc.flir to flc.c by running flc.flir in the FLIR interpreter takes 17 seconds
compiling flc.flir to flc.c by running flc.elf (which is the previously compiled flc.c) takes 8 seconds

We only get about a x2 speed-up because we merely optimize out the interpreter and its conditional branching; the FLIR code itself is not optimized (lots of redundancy at function boundaries), and datatypes are not specialized either (lots of malloc/free and copying).
17.7.2022
Hooray! After some heavy bug fixing, testing.flasm compiles to C and executes for the first time. flc.flasm mostly compiles, but I found a problem in my assumptions: the topological order is in itself conditional; failing to account for that, we get cyclical dependencies on conditions we can not resolve.
Perhaps it's best to start out from the simplest solution instead: do a depth-first walk that forks on each conditional, so we get a DAG of different topological orderings, that is a working (yet 2^n redundant for n conditions) imperative sequence in itself, which readily translates to C.
Many branches in this DAG will have redundant parts, i.e. instructions that appear in both branches, and we will have to figure out how to merge them safely where possible. This needs to be done ASAP, because multiple conditions in series produce an exponential number of branches, which may all copy a common preamble.
15.7.2022
I have now all parts together to get to compact and efficient C function body generation, save for one: merging common conditional blocks. It appears the best way to go about this is to first build a fragmented conditional instruction block tree, and then merge adjacent blocks without changing the overall order.
The entire block tree is built out of these primitives:
BLOCK := 
    let %ID ...
    (%ID = INSTRUCTION) | IF
    ...
    %PARENT-ID... = %ID...
IF :=
    if CONDITIONS: BLOCK
    else: BLOCK

Technically we could declare all variables at the beginning of the function, but scoping the declarations will help with debugging compiler bugs later.
Now let's apply this in practice. Take this fragmented example of 20 instructions:
let %1
%1 = fenv
let %2
%2 = tupleextract %1 1 1
let %3
%3 = typeof %2
let %4
%4 = blob "cell"
let %5
%5 = is %3 %4
if (%5 0 0 1): let %6
if (%5 0 0 1): %6 = countof %2
if (%5 0 0 1): let %7
if (%5 0 0 1): %7 = blob 2
if (%5 0 0 1): let %8
if (%5 0 0 1): %8 = ieq 4 %6 %7
if (%5 0 0 (%8 0 0 1)): let %9
if (%5 0 0 (%8 0 0 1)): %9 = blob 0
if (%5 0 0 (%8 0 0 1)): let %10
if (%5 0 0 (%8 0 0 1)): %10 = cellextract %2 %9
if (%5 0 0 (%8 0 0 1)): let %11
if (%5 0 0 (%8 0 0 1)): %11 = typeof %10
if (%5 0 0 (%8 0 0 1)): let %12
if (%5 0 0 (%8 0 0 1)): %12 = blob "body"
if (%5 0 0 (%8 0 0 1)): let %13
if (%5 0 0 (%8 0 0 1)): %13 = is %11 %12
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %14 = blob 1
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %15
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %15 = cellextract %2 %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %16
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %16 = typeof %15
if (%5 0 0 (%8 0 0 1)): let %17
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %17 = is %16 %4
if (%5 0 0 1): let %18
if (%5 0 0 (%8 0 0 1)): %18 = if %13 %17 %13
let %19
if (%5 0 0 1): %19 = if %8 %18 %8
let %20
%20 = if %5 %19 %5

The conditions used here are described in multivariate BDD notation (%x C t f) ~ (%x == C)?t:f. I left out the (mandatory) else: blocks here since they're empty yet.
The first transformation we can do is to rewrite if expressions as separate conditions:
if (%5 0 0 1): let %18
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %18 = %17
if (%5 0 0 (%8 0 0 (%13 0 1 0))): %18 = %13
let %19
if (%5 0 0 (%8 0 0 1)): %19 = %18
if (%5 0 0 (%8 0 1 0)): %19 = %8
let %20
if (%5 0 0 1): %20 = %19
if (%5 0 1 0): %20 = %5

Also, we always reorder the let declarations within the same block, which are leaves and thus not constrained upward. The rule is that in each block all let declarations come first, and can in fact become two ordered sets per block:
let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 0 1): let %6
if (%5 0 0 1): %6 = countof %2
if (%5 0 0 1): let %7
if (%5 0 0 1): %7 = blob 2
if (%5 0 0 1): let %8
if (%5 0 0 1): %8 = ieq 4 %6 %7
if (%5 0 0 (%8 0 0 1)): let %9
if (%5 0 0 (%8 0 0 1)): %9 = blob 0
if (%5 0 0 (%8 0 0 1)): let %10
if (%5 0 0 (%8 0 0 1)): %10 = cellextract %2 %9
if (%5 0 0 (%8 0 0 1)): let %11
if (%5 0 0 (%8 0 0 1)): %11 = typeof %10
if (%5 0 0 (%8 0 0 1)): let %12
if (%5 0 0 (%8 0 0 1)): %12 = blob "body"
if (%5 0 0 (%8 0 0 1)): let %13
if (%5 0 0 (%8 0 0 1)): %13 = is %11 %12
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %14 = blob 1
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %15
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %15 = cellextract %2 %14
if (%5 0 0 (%8 0 0 (%13 0 0 1))): let %16
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %16 = typeof %15
if (%5 0 0 (%8 0 0 1)): let %17
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %17 = is %16 %4
if (%5 0 0 1): let %18
if (%5 0 0 (%8 0 0 (%13 0 0 1))): %18 = %17
if (%5 0 0 (%8 0 0 (%13 0 1 0))): %18 = %13
if (%5 0 0 (%8 0 0 1)): %19 = %18
if (%5 0 0 (%8 0 1 0)): %19 = %8
if (%5 0 0 1): %20 = %19
if (%5 0 1 0): %20 = %5

In the next round, we merge N adjacent blocks by the rule:
if (%x C %t0 %f0): let X0; X0
if (%x C %t1 %f1): let X1; X1
...
if (%x C %tN %fN): let XN; XN
<=>
if (%x C 1 0):
    let X0... X1... XN...
    if %t0: X0
    if %t1: X1
    ...
    if %tN: XN
else:
    let X0... X1... XN...
    if %f0: X0
    if %f1: X1
    ...
    if %fN: XN

Which gives us, if we only do the top level:
let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 1 0):
    %20 = %5
else:
    let %6 %7 %8 %18
    %6 = countof %2
    %7 = blob 2
    %8 = ieq 4 %6 %7
    if (%8 0 0 1): let %9
    if (%8 0 0 1): %9 = blob 0
    if (%8 0 0 1): let %10
    if (%8 0 0 1): %10 = cellextract %2 %9
    if (%8 0 0 1): let %11
    if (%8 0 0 1): %11 = typeof %10
    if (%8 0 0 1): let %12
    if (%8 0 0 1): %12 = blob "body"
    if (%8 0 0 1): let %13
    if (%8 0 0 1): %13 = is %11 %12
    if (%8 0 0 (%13 0 0 1)): let %14
    if (%8 0 0 (%13 0 0 1)): %14 = blob 1
    if (%8 0 0 (%13 0 0 1)): let %15
    if (%8 0 0 (%13 0 0 1)): %15 = cellextract %2 %14
    if (%8 0 0 (%13 0 0 1)): let %16
    if (%8 0 0 (%13 0 0 1)): %16 = typeof %15
    if (%8 0 0 1): let %17
    if (%8 0 0 (%13 0 0 1)): %17 = is %16 %4
    if (%8 0 0 (%13 0 0 1)): %18 = %17
    if (%8 0 0 (%13 0 1 0)): %18 = %13
    if (%8 0 0 1)): %19 = %18
    if (%8 0 1 0)): %19 = %8
    %20 = %19

Performing this breadth-first, the next layer will coagulate like this:
let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 1 0):
    %20 = %5
else:
    let %6 %7 %8 %18
    %6 = countof %2
    %7 = blob 2
    %8 = ieq 4 %6 %7
    if (%8 0 1 0):
        %19 = %8
    else:
        let %9 %10 %11 %12 %13 %17
        %9 = blob 0
        %10 = cellextract %2 %9
        %11 = typeof %10
        %12 = blob "body"
        %13 = is %11 %12
        if (%13 0 0 1): let %14
        if (%13 0 0 1): %14 = blob 1
        if (%13 0 0 1): let %15
        if (%13 0 0 1): %15 = cellextract %2 %14
        if (%13 0 0 1): let %16
        if (%13 0 0 1): %16 = typeof %15
        if (%13 0 0 1): %17 = is %16 %4
        if (%13 0 0 1): %18 = %17
        if (%13 0 1 0): %18 = %13
        %19 = %18
    %20 = %19

And after the last layer is done, we have our block tree:
let %1 %2 %3 %4 %5 %19 %20
%1 = fenv
%2 = tupleextract %1 1 1
%3 = typeof %2
%4 = blob "cell"
%5 = is %3 %4
if (%5 0 1 0):
    %20 = %5
else:
    let %6 %7 %8 %18
    %6 = countof %2
    %7 = blob 2
    %8 = ieq 4 %6 %7
    if (%8 0 1 0):
        %19 = %8
    else:
        let %9 %10 %11 %12 %13 %17
        %9 = blob 0
        %10 = cellextract %2 %9
        %11 = typeof %10
        %12 = blob "body"
        %13 = is %11 %12        
        if (%13 0 1 0): 
            %18 = %13
        else:
            let %14 %15 %16
            %14 = blob 1
            %15 = cellextract %2 %14
            %16 = typeof %15
            %17 = is %16 %4
            %18 = %17
        %19 = %18
    %20 = %19

As it appears, we can build the entire canonical tree just by a single block concatenation function which performs concatenations recursively.
9.7.2022 B
For FLIR, It appears that it is a good structural improvement to reorder instructions in topological order (starting from the leaves) so that every instruction is as close to its last dependency preceding it as possible, of which the instructions' conditional nodes are additional dependencies.
For example,
%0 = null 0 0 [0]
%1 = fenv [1]
%2 = tupleextract %1 2 1 0 [1]
%3 = tupleextract %1 1 2 0 [1]
%4 = ult 4 %2 %3 [1]
%5 = blob 1 [%4]
%6 = iadd 4 %2 %5 [%4]
%7 = tupleextract %1 2 1 1 [1]
%8 = tupleextract %1 1 1 [%4]
%9 = extract 1 %8 %2 [%4]
%10 = uconvert 1 %9 4 [%4]
%11 = blob 32 [%4]
%12 = ult 4 %10 %11 [%4]
%13 = blob 126 [(%4 & ~%12)]
%14 = ugt 4 %10 %13 [(%4 & ~%12)]
%15 = if %12 %12 %14 [%4]
%16 = blob "\x" [(%4 & %15)]
%17 = cell [(%4 & %15)]
%18 = cell %17 [(%4 & %15)]
%19 = cell %<body 1> %18 [(%4 & %15)]
%20 = tupleextract %19 0 [(%4 & %15)]
%21 = tupleextract %19 1 [(%4 & %15)]
%22 = cell %20 %9 [(%4 & %15)]
%23 = cell %22 [(%4 & %15)]
%24 = celljoin %21 %23 [(%4 & %15)]
%25 = apply %20 %24 [(%4 & %15)]
%26 = join %16 %25 [(%4 & %15)]
%27 = blob "\"" [(%4 & ~%15)]
%28 = ieq 1 %9 %27 [(%4 & ~%15)]
%29 = blob "\\"" [(%4 & (~%15 & %28))]
%30 = if %28 %29 %9 [(%4 & ~%15)]
%31 = if %15 %26 %30 [%4]
%32 = join %7 %31 [%4]
%33 = cell %6 %32 [%4]
%34 = countof %1 [%4]
%35 = isub 4 %34 %5 [%4]
%36 = cellinsert %1 %33 %35 %5 [%4]
%37 = repeat %36 [%4]
%38 = break %7 [~%4]
%39 = if %4 %37 %38 [1]

can be reordered to (I did this one manually)
%0 = null 0 0 [0]
%1 = fenv [1]
%7 = tupleextract %1 2 1 1 [1]
%3 = tupleextract %1 1 2 0 [1]
%2 = tupleextract %1 2 1 0 [1]
%4 = ult 4 %2 %3 [1]
%38 = break %7 [~%4]
%34 = countof %1 [%4]
%8 = tupleextract %1 1 1 [%4]
%11 = blob 32 [%4]
%9 = extract 1 %8 %2 [%4]
%10 = uconvert 1 %9 4 [%4]
%12 = ult 4 %10 %11 [%4]
%13 = blob 126 [(%4 & ~%12)]
%14 = ugt 4 %10 %13 [(%4 & ~%12)]
%15 = if %12 %12 %14 [%4]
%27 = blob "\"" [(%4 & ~%15)]
%28 = ieq 1 %9 %27 [(%4 & ~%15)]
%29 = blob "\\"" [(%4 & (~%15 & %28))]
%30 = if %28 %29 %9 [(%4 & ~%15)]
%17 = cell [(%4 & %15)]
%18 = cell %17 [(%4 & %15)]
%19 = cell %<body 1> %18 [(%4 & %15)]
%21 = tupleextract %19 1 [(%4 & %15)]
%20 = tupleextract %19 0 [(%4 & %15)]
%22 = cell %20 %9 [(%4 & %15)]
%23 = cell %22 [(%4 & %15)]
%24 = celljoin %21 %23 [(%4 & %15)]
%25 = apply %20 %24 [(%4 & %15)]
%16 = blob "\x" [(%4 & %15)]
%26 = join %16 %25 [(%4 & %15)]
%31 = if %15 %26 %30 [%4]
%32 = join %7 %31 [%4]
%5 = blob 1 [%4]
%35 = isub 4 %34 %5 [%4]
%6 = iadd 4 %2 %5 [%4]
%33 = cell %6 %32 [%4]
%36 = cellinsert %1 %33 %35 %5 [%4]
%37 = repeat %36 [%4]
%39 = if %4 %37 %38 [1]

which when linearly conditionally scoped, produces a good contiguous structure:
%0 = null 0 0 [0]
%1 = fenv [1]
%7 = tupleextract %1 2 1 1 [1]
%3 = tupleextract %1 1 2 0 [1]
%2 = tupleextract %1 2 1 0 [1]
%4 = ult 4 %2 %3 [1]
if %4:
    %34 = countof %1 [%4]
    %8 = tupleextract %1 1 1 [%4]
    %11 = blob 32 [%4]
    %9 = extract 1 %8 %2 [%4]
    %10 = uconvert 1 %9 4 [%4]
    %12 = ult 4 %10 %11 [%4]
    if %12:
    else:
        %13 = blob 126 [(%4 & ~%12)]
        %14 = ugt 4 %10 %13 [(%4 & ~%12)]
    %15 = if %12 %12 %14 [%4]
    if %15:
        %17 = cell [(%4 & %15)]
        %18 = cell %17 [(%4 & %15)]
        %19 = cell %<body 1> %18 [(%4 & %15)]
        %21 = tupleextract %19 1 [(%4 & %15)]
        %20 = tupleextract %19 0 [(%4 & %15)]
        %22 = cell %20 %9 [(%4 & %15)]
        %23 = cell %22 [(%4 & %15)]
        %24 = celljoin %21 %23 [(%4 & %15)]
        %25 = apply %20 %24 [(%4 & %15)]
        %16 = blob "\x" [(%4 & %15)]
        %26 = join %16 %25 [(%4 & %15)]
    else:
        %27 = blob "\"" [(%4 & ~%15)]
        %28 = ieq 1 %9 %27 [(%4 & ~%15)]
        if %28:
            %29 = blob "\"" [(%4 & (~%15 & %28))]
        %30 = if %28 %29 %9 [(%4 & ~%15)]
    %31 = if %15 %26 %30 [%4]
    %32 = join %7 %31 [%4]
    %5 = blob 1 [%4]
    %35 = isub 4 %34 %5 [%4]
    %6 = iadd 4 %2 %5 [%4]
    %33 = cell %6 %32 [%4]
    %36 = cellinsert %1 %33 %35 %5 [%4]
    %37 = repeat %36 [%4]
else:
    %38 = break %7 [~%4]
%39 = if %4 %37 %38 [1]

9.7.2022
Four more BDD insights:

The complex transform (a b c) d e -> a (b d e) (c d e) can be avoided by requiring that terminal nodes must always be 0 or 1, and implementing operators to use substitution instead of combination, ((a b c) d e) -> (a b c).subst(0=d, 1=e).

This changes how boolean operations are implemented:
# old form to new form
a & b -> (a 0 b) -> a.subst(0=0, 1=b)
a | b -> (a b 1) -> a.subst(0=b, 1=1)
~x -> (x 1 0) -> x.subst(0=1, 1=0) (leaf swap)
(~a b c) -> (a c b) (root swap)


Because we can avoid complex conditional intermediates, we can now specialize branches to type Term := {id, 0|1|Term, 0|1|Term}, disallowing complex terms in the conditional.

This also changes variable bubbling (see previous post):
a (b c d) e -> a (b c d) (b e e) -> b (a c e) (a d e)
a b (c d e) -> a (c b b) (c d e) -> c (a b d) (a b e)

In fact, any variable a in the BDD can be pulled to the top by the operation
u = Q.subst(a = 0)
v = Q.subst(a = 1)
Q' = (a u v)


The BDD can be extended to support and optimize switch cases by allowing integer leafs and extending the term to Term := {id, number, number|Term, number|Term}; (a b c d) then translates to C as (a == b)?c:d or switch(a){case b: c; break; default: d}; We can then port BDD nodes as (a b c) -> (a 0 b c). The arithmetic is generally fairly similar, only that specialization changes a little.

The operator a == C can then be described via (a C 1 0), or, for a complex left hand, by comparing any leaf number L of a with C.
With these extensions, we can now perform constant set testing. x in (A B C) can be described as (x A 1 (x B 1 (x C 1 0))), whereas x not in (A B C) becomes (x A 0 (x B 0 (x C 0 1))).

If we consistently apply the swap rule (see previous post) to order our variables, and also order for constants that way when the variables are the same, then redundant selectors on the same variable will always crop up next to each other, and we do not need a contextual mapping to drop the variable. Instead, the following replace rule will suffice:

# regular BDD
(a (a b c) (d e f)) -> (a b (d e f))
(a (b c d) (a e f)) -> (a (b c d) f)
# BCDD (binary comparison decision diagram)
(a B (a B c d) (e F g h)) -> (a B c (e F g h))
(a B (a Q c d) (e F g h)) -> (a B d (e F g h)) # a==B so can't be Q
(a B (c D e f) (a B g h)) -> (a B (c D e f) h)
(a B (c D e f) (a Q g h)) -> (a B (c D e f) (a Q g h)) # isn't B, but could be Q, so no change

update: with the new model and the new replace rules, my test case reduced from 110ms to 19ms, as the memoizer can now work correctly. Without result caching, I get 300ms.
8.7.2022 B
For Binary Decision Diagram (BDD) normalization, it becomes important to bubble-up/push-down switch variables in order to guarantee that the nesting always uses the same ordering. This is possible whenever we have a BDD term of this shape, where both branches use the same conditional:
a (b c d) (b e f)

The truth table for this term is
a b out
0 0 c
0 1 d
1 0 e
1 1 f

If we swap out the middle two lines, we effectively swap the conditional order:
a b out
0 0 c
1 0 e
0 1 d
1 1 f

And from this, we can project back to the term, where a, b, d and e are swapped:
a (b c d) (b e f) -> b (a c e) (a d f)

It is possible to perform this operation on any term, by first transforming it to a redundant version:
a b c -> a (b 0 1) (b c c) -> b (a 0 c) (a 1 c)
a b c -> a (c b b) (c 0 1) -> c (a b 0) (a b 1)

Is this correct? Let's apply it twice!
b (a 0 c) (a 1 c) -> b ((a 0 c) 0 1) ((a 0 c) (a 1 c) (a 1 c)) -> (a 0 c) (b 0 (a 1 c)) (b 1 (a 1 c))

and reduce:
-> a (0 (b 0 (a 1 c)) (b 1 (a 1 c))) (c (b 0 (a 1 c)) (b 1 (a 1 c))) # reduce
-> a (b 0 (a 1 c)) (c (b 0 (a 1 c)) (b 1 (a 1 c))) # fold constant
-> a (b 0 1) (c (b 0 c) (b 1 c)) # apply a
-> a (b 0 1) (c (b 0 0) (b 1 1)) # apply c
-> a (b 0 1) (c 0 1) # fold constant
-> a b c # fold constant

8.7.2022
Here's a pattern for performing recursive evaluation in programming models that do not permit recursion, but allow closures as first class values. The technique works by inverting the control flow, by having a reactor loop apply continuations that produce continuations. The closures then maintain their own stack/coroutine memory.
# described in unrestricted frameloop assembly
function-loop = fn (entryf args)
    loop (f args = entryf args)
        if (empty? f) # raise/return
            break args
            #else
            repeat (f args)

As a (somewhat useless) example, a recursive fibonacci function can then be written as follows:
fib = fn (n)
    it = fn (cont n)
        self = this-function
        if (> n 1)
            do self
                cell
                    fn (x)
                        do self
                            cell
                                fn (y) cont (+ x y)
                                - n 2
                    - n 1
            #else
            cont n
    function-loop it (cell (fn (x) empty x) n)

function-loop may then choose to prescribe a stricter argument format and memoize the arguments, or mandate a user-defined state argument through which functions can implement their own shared memoization, stack or state machine.
7.7.2022
After pondering for weeks and coming to no conclusion on how to properly handle lazily evaluating conditional expressions when translating pure functional code to imperative C, it occurred to me that, I might finally have a clean and elegant way to deal with its corner cases.
The FLIR compiler already solves closures (by not supporting them, as they can be easily generated explicitly during parsing) and loops (by treating every loop as a function with special boundary rules), which means our functions are top level and loopless, but still not branchless.
Take this simple FLIR snippet as an exemplary corner case:
# assuming G, F, t, h, k, a, b are defined
g = apply G t
u = if a g h
v = if b k g
uv = cell u v
result = apply F uv

A straightforward translation (ignoring memory management for now) would be:
// ...
fl_value g = fl_apply(G, t);
fl_value u = fl_tobool(a)?g:h;
fl_value v = fl_tobool(b)?k:g;
fl_value uv = fl_cell(u,v);
fl_value result = fl_apply(F, uv);

Only that this code violates the requirement that g must be evaluated lazily, i.e. only when it is selected by a conditional branch. So we translate g as scoped:
fl_value u; if (fl_tobool(a)) { fl_value g = fl_apply(G, t); u = g; } else { u = h; }
fl_value v; if (fl_tobool(b)) { v = k; } else { fl_value g = fl_apply(G, t); v = g; }
fl_value uv = fl_cell(u,v);
fl_value result = fl_apply(F, uv);

This is better, yet now we violate the requirement that each expression must be evaluated at most once, whereas here, for the case that a && !b, g will be evaluated twice.
The question is now: how to create a proper imperative form in which we guarantee that

each expression of a function is evaluated at most once
conditional expressions are evaluated lazily

The solution I have arrived at is to generate a flat, imperative block without lexical subscopes, in which every expression x = op ... is converted to a statement of the form
fl_value x; if (<conditions>) x = op(...);

where <conditions> is computed by propagating branch conditions up the function graph, starting from the sink node, so that we know exactly for every expression the totality of conditions that lead to its evaluation. The statement allows for x to remain undefined, but since all dependent expressions use compatible conditions, no statement will ever depend on an undefined variable.
Our example program then is first annotated by conditions ? <term>: expr as follows:
# assuming G, F, t, h, k, a, b are defined
? a || !b: g = apply G t
? true: u = if a g h
? true: v = if b k g
? true: uv = cell u v
? true: result = apply F uv

It then straightforwardly translates to imperative form (where we can shorthand ? true to a direct assignment), by maintaining topological order:
// ...
fl_value g; if (fl_tobool(a) || !fl_tobool(b)) g = fl_apply(G, t);
fl_value u = fl_tobool(a)?g:h;
fl_value v = fl_tobool(b)?k:g;
fl_value uv = fl_cell(u,v);
fl_value result = fl_apply(F, uv);

In practice, we would not directly translate to C, but rather to a single static assignment tree that is initially flat, and in a subsequent optimization step merges multiple concurrent statements depending on the same conditions into nested lexically scoped branches.
To represent conditions, we may either use Binary Decision Diagrams (I explored them on 19.3.2021, extending the idea to switch cases), which are DAGs of ternary expressions (where the leaves are either variable, true or false), or the simpler, non-recursive Sum Of Product (SOP) form, though BDDs describe our desired final ternary form well, and make it easy to construct nested branches.
It won't be easy to always optimize branches so they assume their most efficient form, but it can generally be said that this structure is already quite SIMD friendly, since our form somewhat mimicks execution masking, and we don't duplicate any statements, keeping the code compact.
Lastly, what to do with switches? We can trivially understand switches as control flow following a test whether an integer is or is not contained within a set of constant literals. A switch instruction is useful because C compilers can often translate it into efficient jump tables that make it fast to perform branching set and equality tests on single variables simultaneously.
Hence we should abstract this feature specifically, as an instruction in x c0 ... cN that evaluates to true when x is contained in the set, and false otherwise. We can then rewrite terms such as (x == 1)||(x == 2) as x in set(1,2), which translates to C as:
bool q; switch (x) { case 1: case 2: q = true; break; default: q = false; }

and this expands our ternary from c?a:b to (c in i1[,...])?a:b, which we can reflect as a new branch type in the BDD.
As an example, a switch case like
y = switch(x) {
case 1:
case 2:
case 3: y = a; break;
case 5: y = b; break;
default: y = c;
}

would be specified in conditional FLIR as
? x in set(1,2,3): a = ...
? x in set(5): b = ...
? !(x in set(1,2,3,5)): c = ...
y = switch x 1 a 2 a 3 a 5 b null c

which naively translates to the less optimal but correct
bool c_a; switch(x) { case 1: case 2: case 3: c_a = true; break; default: c_a = false; }
fl_value a; if (c_a) a = ...;
bool c_b = (x == 5);
fl_value b; if (c_b) b = ...;
bool c_c; switch(x) { default: c_c = true; break; case 1: case 2: case 3: case 5: c_c = false; }
fl_value c; if (c_c) c = ...;
fl_value y; switch(x) { case 1: case 2: case 3: y = a; break; case 5: y = b; break; default: y = c; }

Clang and GCC differ wildly in how well they can optimize such constructs. Here's what GCC makes of it, showing how it has been used as an intermediate target platform for a long time:
testf:
        sub     rsp, 8
        xor     eax, eax
        call    op0
        lea     edx, [rax-1]
        cmp     edx, 2
        jbe     .L6
        cmp     eax, 5
        mov     eax, 0
        je      .L7
        add     rsp, 8
        jmp     op3
.L7:
        add     rsp, 8
        jmp     op2
.L6:
        xor     eax, eax
        add     rsp, 8
        jmp     op1

And here is clang's unholy mess:
testf:                                  # @testf
        push    rbp
        push    r14
        push    rbx
        xor     eax, eax
        call    op0@PLT
        mov     ebx, eax
        lea     ebp, [rbx - 1]
        cmp     ebp, 2
        ja      .LBB0_2
        xor     eax, eax
        call    op1@PLT
        mov     r14, rax
        jmp     .LBB0_3
.LBB0_2:
        cmp     ebx, 5
        jne     .LBB0_3
        xor     eax, eax
        pop     rbx
        pop     r14
        pop     rbp
        jmp     op2@PLT                         # TAILCALL
.LBB0_3:
        cmp     ebp, 5
        jae     .LBB0_4
        mov     eax, 23
        bt      eax, ebp
        jb      .LBB0_6
.LBB0_4:
        xor     eax, eax
        call    op3@PLT
.LBB0_6:
        cmp     ebx, 5
        ja      .LBB0_8
        mov     ecx, 46
        bt      ecx, ebx
        jae     .LBB0_8
.LBB0_9:
        mov     rax, r14
        pop     rbx
        pop     r14
        pop     rbp
        ret
.LBB0_8:
        mov     r14, rax
        jmp     .LBB0_9

5.7.2022
Learned today that the correct term for the logic of borrow checking is based on is linear logic, also known as "logic of resources".
For such a fundamental model, this proposal is rather new, from 1987, and thus directly connected to the computer revolution. One prerequisite to understanding the notation is to be familiar with sequent calculus, which was invented in the 1930's. It also helps to search latex symbols.
Unfortunately the author lifts yet another set of symbols from computer science, and then finally a third set from lambda calculus, always expecting me to already be on the same page, and that's too intractable. I don't understand why there are no "explain like i'm five" versions of this that allow me to translate this to the field I am familiar with. It takes way too long to figure out whether the ideas would help or not. So I understand, such ideas only travel by accident; If you have someone who was forced as a student to understand all this, and then later, by pure happenstance, be in the fortunate position to translate it to a real world application after graduation. If you're already in the muck, that shit just reads like the gods laughing at you.
At least I'm somewhat in the same headspace and so I can deduce some things intuitively. The revelation that linear logic only allows each expression to be used once, that we are dealing with sequential operations, and that there are two versions of AND and OR, a linear and a bilinear one, has helped to make some sense of it.
There are ways to read AND/OR as timeline operations on events: x is an event that happens. NOT x means event x doesn't happen. x AND y means that events x,y occur in the same timeline (parallel or sequential - AND is unclear here); whereas x OR y means x,y potentially occur on separate timelines; x XOR y guarantees that either x happens or y happens, but not both - so describes a real split timeline.
AND could also be called "all" (all these events occur), whereas OR is called "any" (any of those events occur), and XOR is "either".
IMPLY (the material conditional) describes a dependency of events in this context: x IMPLY y means that when x occurs, y should also occur, though it does not prescribe non-occurence of y when x does not occur, hence x must be a cause of y, but not the cause. This means that IMPLY can not fully separate causation from coincidence. y could occur along with x by coincidence. Only when x occurs and y does not, do we know that they are causally unrelated.
I also learned about notation in sequential calculus, where (x AND y AND z AND ...) implies (u OR v OR w OR ...) can be shorthanded as x,y,z,...⊢u,v,w,... where ⊢ is called the turnstile operator. In terms of events it means both "if all x,y,z,... have occurred, then any of u,v,w,... occur" and "if any of u,v,w,... occur, then all of x,y,z,... have occurred". So we get a time-reversible expression that we can follow in either direction, as opposed to the → (material conditional; "implies") arrow, which is not readily reversible.
You may also imagine a,b,c⊢d,e,f as sort of a TODO-list snapshot or dependency notation; the left-hand side of ⊢ is what has been done (all these things have happened); the right-hand side of ⊢ is what has to be done next (any of these have to happen). at the end of recorded time, the right-hand side is empty; at the beginning of time, the left-hand side is empty.
So a,b⊢c is a dependency: in order for c to happen, a and b must happen. There are though statements such as x,...⊢x,..., and these are called axioms. Here, x depends on itself.
How do we describe the ternary operator c?t:f in sequent calculus? c stands for the condition, t means the true branch is evaluated, f means the false branch is evaluated. I tried to derive it from classical formulation, but ended up formulating the expression from scratch, using two branching terms, which I think is complete in this way. For this, I use the absurdity constant ⊥ to represent false:
⊢t,f    c,f⊢⊥ 

The first term means "t or f is implied", suggesting either will be evaluated (or both), whereas the second term means "c and f is false/impossible", thus turning c into the condition that decides whether t is selected. We can use c to perform a cut of ⊢t,f to get
⊢c,f    c⊢t    c,f⊢⊥ 

Which reads as "c is true and/or f is evaluated", "if c is true, t is evaluated" and "c is true and f being evaluated is impossible".
21.6.2022 (II)
Having some optimization success by replacing cells with an immutable data structure similar to piece tables, requiring essentially only two operations: slice and join. slice truncates a piece table, culling pieces, while join merges two tables. After merging tables, if the piece count exceeds log2(element_count), the smallest two adjacent pieces are merged until the piece count satisfies the constraint. The effect is that indexing into the table, which requires linear search, has O(log n) complexity at most, even though we're not walking or sorting a tree.
This brought down the run time of the transpiler test from 1670ms to 672ms. It worked so well that I will apply the same idea to blobs, so that we also significantly reduce the amount of work involved in string concatenations.
Insertions are presently naively modeled as follows:
insert(cell,element,i) -> cell[0:i] + [element] + cell[i+1:countof(cell)]
Update: A better heuristic for merging pieces appears to be to pick the two pieces with the lowest absolute (square root) difference, or the lowest absolute log ratio, i.e. neighboring pieces of symmetrical size.
Here are some examples for 32 insertions for back, center and front.
back:
(1,)
(2,)
(2, 1)
(2, 2)
(2, 2, 1)
(4, 1, 1)
(4, 2, 1)
(4, 2, 2)
(4, 2, 2, 1)
(4, 4, 1, 1)
(8, 1, 1, 1)
(8, 2, 1, 1)
(8, 2, 2, 1)
(8, 4, 1, 1)
(8, 4, 2, 1)
(8, 4, 2, 2)
(8, 4, 2, 2, 1)
(8, 4, 4, 1, 1)
(8, 8, 1, 1, 1)
(16, 1, 1, 1, 1)
(16, 2, 1, 1, 1)
(16, 2, 2, 1, 1)
(16, 4, 1, 1, 1)
(16, 4, 2, 1, 1)
(16, 4, 2, 2, 1)
(16, 4, 4, 1, 1)
(16, 8, 1, 1, 1)
(16, 8, 2, 1, 1)
(16, 8, 2, 2, 1)
(16, 8, 4, 1, 1)
(16, 8, 4, 2, 1)
(16, 8, 4, 2, 2)
front:
(1,)
(2,)
(1, 2)
(2, 2)
(1, 2, 2)
(2, 2, 2)
(1, 4, 2)
(2, 4, 2)
(1, 2, 4, 2)
(2, 2, 4, 2)
(1, 4, 4, 2)
(2, 4, 4, 2)
(1, 2, 8, 2)
(2, 2, 8, 2)
(1, 4, 8, 2)
(2, 4, 8, 2)
(1, 2, 4, 8, 2)
(2, 2, 4, 8, 2)
(1, 4, 4, 8, 2)
(2, 4, 4, 8, 2)
(1, 2, 8, 8, 2)
(2, 2, 8, 8, 2)
(1, 4, 8, 8, 2)
(2, 4, 8, 8, 2)
(1, 2, 4, 16, 2)
(2, 2, 4, 16, 2)
(1, 4, 4, 16, 2)
(2, 4, 4, 16, 2)
(1, 2, 8, 16, 2)
(2, 2, 8, 16, 2)
(1, 4, 8, 16, 2)
(2, 4, 8, 16, 2)
center:
(1,)
(2,)
(1, 2)
(2, 2)
(2, 1, 2)
(2, 2, 2)
(2, 1, 4)
(2, 2, 4)
(2, 1, 2, 4)
(2, 2, 2, 4)
(4, 1, 2, 4)
(4, 2, 2, 4)
(4, 3, 2, 4)
(7, 1, 2, 4)
(7, 2, 2, 4)
(7, 3, 2, 4)
(7, 3, 1, 2, 4)
(7, 3, 2, 2, 4)
(7, 3, 1, 4, 4)
(7, 3, 2, 4, 4)
(7, 3, 1, 2, 8)
(7, 3, 2, 2, 8)
(7, 3, 1, 4, 8)
(7, 3, 2, 4, 8)
(7, 3, 3, 4, 8)
(7, 3, 1, 7, 8)
(7, 3, 2, 7, 8)
(7, 3, 1, 2, 15)
(7, 3, 2, 2, 15)
(7, 3, 1, 4, 15)
(7, 3, 2, 4, 15)
(7, 3, 3, 4, 15)

21.6.2022
Had a short vacation last week and used the opportunity to mull over whether to use move semantics in the Frameloop IR. What the proposed change means in particular:

Several instructions "move" one or more of their arguments, meaning that the values providing the argument must only be moved once, and can not be accessed after the move. In practice, this means that every value can be a viewed (referenced) argument multiple times, but a moved argument only once.
A move instruction forces a move of a value and provides the reference implementation.
A copy instruction conceptually duplicates a value so that the duplicate may be moved separately.
Instructions that extract or slice values like extract, cellextract, slice, cellslice, produce slice views that allow us to defer a copy. These views are weak, i.e. they become inaccessible when the value they reference is moved.
Values can not be moved out of composites.
Views can be inserted into composites. To prevent this, move or copy must be used on the element to be inserted. A composite with view elements can become partially inaccessible.
When an element of a composite view element is extracted, the view can be truncated. e.g. viewof value 1 0 1 can be reduced to just viewof (localof (cellextract value 1)) 0 1, if element 1 of value is a view.
We move only where an imagined mutation would be of benefit, otherwise we view.
We change native argument evaluation order from right to left, as the first argument is often by convention the one to be moved, and auxiliary arguments are often just views of the first.
All arithmetic unary operations, repeat, break and merge move their argument.
All arithmetic binary operations, select, join and celljoin move the left-hand argument and view the right-hand one.
insert and cellinsert move their composite argument.
All values produced within a function (or loop) are dropped (moved to nowhere) when the function returns, except for the return value.

Here's the pros and cons.
Pros:

Instructions can directly mutate their arguments, meaning no speculative duplication is required, which saves us memcpy and malloc calls. Append operations that would before require n*(n-1)/2 memory units for n operations could now use dynamic array allocation patterns instead. Repeated insertions into arrays of fixed length could all be transformed into in-place mutations, reducing a n² memory operation to n. All this makes it possible to implement indexof in the language instead.
We get fine-grained control over which data to duplicate, and which data to transform in-place, allowing us to reason about performance and memory usage, just based on the amount of copy instructions we see used.

Cons:

The type system becomes more complex. We need to annotate for arguments whether they are moved or viewed. Return values can be local or remote (a slice of values contained within a composite). Static analysis requires computing lifetimes of views.
For debugging, the interpreter must keep track of which instruction moved a value, and even undefined values must be annotated with the instruction that caused the move.
The user must maintain internal knowledge of which instructions move and/or produce views. It is less mentally taxing to have the compiler figure it out. But this could also be provided by a meta-language.
Requires manual application of copy instructions wherever we transform a value more than once. move or copy must be used after extraction and before insertion, if we want to prevent a weak insertion.
All views on a unique value are invalidated even when they reference slices that have not been mutated by an insertion, because the value as a whole has been moved.

13.6.2022
I have serious trouble getting the bootstrap interpreter for Frameloop to perform well enough that it can execute the compiler bytecode without taking 15 minutes to do so. So far I employed the strategy of memoizing instructions by function environment. But that is not enough.
Since all generated values are immutable, each mutation has to be modeled as a copy + modification of the copy. So in simple appending to a dynamic array, we need to duplicate the array every time, plus the new element we're appending, which goes through n*(n-1)/2 elements for n elements appended. If the item going in had a reference count of 1, we could perform the mutation in-place, but then we have to be careful not to mess up memoization - memoization and in-place mutation interfere with each other.
So I'm considering improvements. One fundamental improvement I was thinking about was using move semantics. Every transformative instruction we use requires its arguments to have a reference count of 1, and decreases the reference of its arguments. If the same argument is used a second time, the reference count is zero (we keep dead objects around just so we can safely check for this), and an error is thrown. So these functions treat values as monadic.
Any transformation expects its arguments to be unique, that is, they have a refcount of 1 (owned by the calling environment, and ownership is transferred to the transform). If the refcount is > 1, then the transform must first make a copy of the argument before using it (if it is composite, the refcount of each of its elements is increased by 1), otherwise it can mutate the memory directly, and produce a new value with refcount 1 (ownership is transferred back to the calling environment). After transformation, the refcount of arguments is reduced by 1. If they were unique, their refcount is now 0 and their memory can be freed.
In addition to these unique values, we also need non-destructive slices, which are weak references to ordered subsets of unique composite values that do not have ownership. Weak references allow us to keep the reference count low, and save memory copy operations. Instructions that just retrieve values from a composite do not change its reference count. If the value that a slice is indexing into has a refcount of 0, the slice becomes inaccessible.
When a transformation receives a slice instead of a unique value, it will not decrease the reference of the composite, but take ownership of the referenced elements, even though they are contained within the composite. This means that we can end up with composites of which some elements are inaccessible. This is by design, so we can treat the environment of a function as no different than any other composite.
We need two more auxiliary instructions: move and copy. move is an indentity transform that transfers ownership without changing the provided argument, becoming the reference operator used by all other transforms. copy increases the reference count of a unique value or a slice, allowing us to move that value one more time before its refcount reaches zero.
I've mostly talked about composites in the context of cells. For blobs, operation is slightly different. Since elements are octets and therefore have no refcount on their own, copy only increases the refcount on uniques, and makes actual copies of slices. move can not decrease the refcount of individual octets of a slice, and so will also make copies, except for when the entire blob is referenced. We could in theory simulate refcounting per octet, which requires maintaining a list of region refcounts, which are increased by copy, and decreased by move.
What I am a bit worried about is how well this arrangement works with pure functions and our automatic topological arrangement of instructions. A function is supposed to produce the same value every time it is called. But our calls fold, and so some behavior could be surprising. e.g. two separate instances of copy x in a function fold to the same expression %1 = copy x, and will both produce the same value, and so cell (copy x) (copy x) is really cell %1 %1, which will fail because the reference count of copy x is zero after the first argument has been moved, same with cell %1 (copy %1), which fails because we evaluate and move arguments in the order they are passed. Only cell (copy %1) %1 will work correctly, i.e. cell (copy (copy x)) (copy x). Effectively, we can only use a copy of a value once. For more copies, we must make copies of copies. This means that we should always copy at the call site, and not expect a pre-bound copy instruction to produce an unlimited amount of copies.
Can we break this behavior with memoization? %copy = fn (x y) (copy x) will produce two different instructions for apply %copy x 1 and apply %copy x 2. If apply will however always evaluate and move arguments, then the behavior is the same: the second attempt to pass x will fail. In any case however, this function will fail to operate as intended, since x has already been moved into the function, and will be dropped on exit, and so we haven't really made a copy. Only a macro function can perform here. But macros are always liable to accidentally fold copies. The copy instruction should always be used at the earliest point.
We could of course also consider the alternative, where copy is never memoized, and so every copy instruction will always increase the reference. If we now declare x* = copy x, then passing x* as an argument always produces a new copy, and copy x* will produce two copies, which probably makes no sense. Hence copy is now more of an annotation rather than a instruction.
In closing, let's walk through a very common operation that would benefit from move semantics, incrementing a value in a container:
%1 = cell %A %B %C
%2 = cellextract %1 2   # %1[2] = %C
%3 = iadd %2 1          # %1[2] + 1
%4 = cellinsert %1 %3 2 # %1[2] = %1[2] + 1

We'll annotate the refcount of each element after every instruction:
# starting out: %A:1, %B:2, %C:1 (for example)
%1 = cell %A %B %C         # %A:0, %B:1, %C:0, %1:1
%2 = cellextract %1 2      # %A:0, %B:1, %C:0, %1:1, %2:view(%1[2])
%3 = iadd %2 1             # %A:0, %B:1, %C:0, %1:1, %2:view(%1[2]), %3:1
%4 = cellinsert %1 %3 2    # %A:0, %B:1, %C:0, %1:0, %2:view(%1[2],dead), %3:0, %4:1

7.6.2022
A hybrid of the generator and the collector, the replacer:
R = {
   result,
   if empty/full: {}
   else:          {value, (value) -> R}
}

The replacer has an evolving state, a next item and an update function to replace that item. The user is expected to receive the item, process it and update the replacer until the replacer produces no new items, upon which the result is complete.
Because of this structure, a replacer can for example operate on a conditional expression: first, it returns its condition for evaluation; when it is fed back in, and is undecidable, the replacer continues to return and update both branches in sequence. When the condition is constant, the replacer can instead only return the selected branch, and replace the original conditional expression with the branch result that is fed back in.
It is impossible to skip elements in the replacer - one must feed as many items back into it as have been received. A replacer can accept a special value to indicate skipping the insertion, but it must still be able to update its state.
6.6.2022
Based on my yield excursion yesterday, I developed a generator and collector protocol that has feature parity with generators and collectors in Scopes, but is simpler to set up, as we can implicitly capture state with closures at runtime.
A generator is defined as follows:
G =
   if empty: {}
   else:     {value, () -> G}

The first value at generates the value. The second function next produces the next generator for a sequence. The generator indicates through its empty-state that it is spent.
A collector works like this:
C = {
   result,
   if full:  {}
   else:     (value) -> C
}

The first element holds the present state, or the result; The function push receives a value, possibly evaluates it, commits it to the result and produces a new collector with a new result. The collector indicates through setting push as empty that it is full and accepts no further input.
5.6.2022
I implemented the map primitives in the interpreter today, which are serially processed but semantically parallelizable. The C implementation will later use threads to massively parallelize processing.
An open problem I keep thinking about is supporting yield control flow. Do we really need coroutines to implement it?
The problem is essentially, how do we return and continue from inside a loop?
yieldrange = fn (count)
   loop (i = 0)
      if (< i count)
         yield i
            repeat (+ i 1)
         break;

We expect this function to produce a value and a continuation 0 (fn (x) (repeat (+ i 1))), which is an illegal form, as repeat can not be moved out of the loop. Were this a recursive function, it would be more straightforward:
yieldrange = fn (count)
    l = fn (i)
        self = this-function
        if (< i count)
            cell i
                fn () (self (+ i 1))
            cell;
    l 0

returns 0 (fn () (l (+ i 1))); loop is a closure, and so we can perform a recursive tail call here. It is not really recursive, because we have already exited loop and only carry its captured environment, and the application has been wrapped so it is evaluated after we have exited the function. This will port straightforwardly to C.
So yield is really rewrite sugar that transforms loop, repeat and break. We do not really need to support this in our assembly.
3.6.2022
That was an extraordinarily productive day, without any major obstaces. I generated and routed debug information straight into the interpreter, being able to print a stack trace now, and I wrote a simple stackless mark & sweep garbage collector that prevents memory from filling up.
A minor kink in the debug info protocol is that body and foldbody (that's the body of a loop) instructions need to reference a di node. While the interpreter can easily track the di stack parallel to regular processing, we don't associate di nodes with results, which means we don't have that information for closures, which are preassembled. When a previously returned body/foldbody references a di node though, we can continue debug info tracking from there.
We still lose information about the function itself though (as the di for the body is below it), and the presence of a di node in the dataflow interferes with deduplication and debug info stripping (before, we could just take the value of a di and voila, it's stripped). So what I would prefer here is that the interpreter entirely figures it out on its own, by rewriting bodies to their di nodes on return, so we can unwrap those later on application. We could also keep a separate map, with the drawback that a body can have multiple identities, so we don't know exactly what function we have originally been calling.
2.6.2022
The jerry-rigged FLIR interpreter works well now. It is reasonably performant, but wastes a lot of memory, which doesn't matter in the grand scheme of things.
I am now working on the actual FLIR compiler, which is written in frameloop assembly (flasm), a text based language I invented so I could write some FLIR. The compiler can load and debug print FLIR modules, as well as compute predecessors, postdominators and scope trees. I also wrote a C code generator that turns a template tree into C code and takes care of proper indenting, quoting and delimiting.
The first version of the compiler will generate C modules that work with basic refcounted sum types exclusively, i.e. no local specialization of datatypes or elision of copy-on-write operations yet. All functions use the same signature result:value <- (closure:value, args:value), and by convention always move their arguments, i.e. decrease the references. So to keep an argument alive, a caller must incref it before passing.
An open issue is tracking debug information for expressions. I want to make use of the approach I developed last year (26.7.2021), where a debug information graph is kept as a metastructure for our program, in a way that allows us to strip the metadata at any point.
All we need is one new node di (debug information):
"di" : defop('DI  ', d(value = id, md = id, subdi = array(id)))

The idea here is that we tag an instruction value with metadata md, and then provide a new di node for each of its id typed fields. For metadata, we start out with a simple text location node:
"md_at" : defop('MDAT', d(file = id, line = u32, column = u32))

Later, we can add more nodes to provide more detailed data.
25.5.2022
Finally figured out how to properly generate and evaluate pure closures in FLIR and the FLIR interpreter in a way that deduplicates well, without requiring the C stack machine. I'm only going to talk about functions, but loops work similarly to this.
A function definition fn <level> <body> requires only a scope level and the body of the template expression. The parameters of that function can then be addressed via argsof <level>. We require that topmost functions start with level 0. Nested functions will later require to access both their own parameters as well as the one of the function they're contained in, and so they need to be defined at a progressively higher level. We can later apply a function by apply <function> <arguments>.
With these mechanisms, we can then describe nested functions that curry arguments:
%1 = fn 0
   fn 1
      fn 2
         cell (argsof 0) (argsof 1) (argsof 2)
# produce a (cell 1 2 3)
%2 = apply (apply (apply %1 1) 2) 3

For execution, the interpreter requires an environment instruction env <level> <args> <prev-env> that can be chained, where the null instruction indicates an empty environment, as well as a closure instruction closure <env> <level> <body>.
The interpreter carries the currently active env as global state. At runtime, a fn instruction is evaluated to a closure instruction, which captures the currently active env (truncated to the parent of the env instruction matching the fn level). When an apply is evaluated, the first argument must be a closure. The currently active env is saved to a stack, the closures env is selected, and then a new env is chained to the closure, associating the closure's level with the arguments passed to apply. The state of apply is marked as "in evaluation", then the body of the closure is queued for evaluation. When evaluation has concluded and apply is visited for a second time, the original env is retrieved from the stack and activated, and the original apply instruction id is aliased to the id of the evaluation result, as we always do.
One more change to aliasing: under environment influence, an instruction does not always evaluate to the same result. For instance, argsof 0 will produce different results depending on which environment chain is presently active. Therefore we alias instructions together with their environment id, and also later resolve the aliasing with the presently active environment id, searching up the environment chain to find a match.
One can optimize for better reuse of instructions, by exploiting the fact that not all expressions evaluated within a nested function require all bound arguments of the environment. By tracking for each instruction which environment level it actually requires, we can pick a lower level for aliasing, so then later more branching environments can reuse the result, and by that also determine in what scope an instruction actually is. For example, an instruction with the null environment is always a global instruction.
13.5.2022
Alright. So these are the building blocks of Frameloop that presently exist:

A compact FLIR instruction spec in Python that can autogenerate a C header file
A FLIR macro assembler written in Python that uses (1) to parse a text file in SLN format and outputs a FLIR binary module.
A FLIR library written in C that can prettyprint FLIR modules, but most importantly specialize a FLIR module to a new FLIR module.

(3) is essentially an interpreter whose datatypes are instruction types. Instructions that are undecidable at evaluation time are passed through. This folds all constant expressions, and for a module that is entirely decidable, the resulting module will only consist of constant expressions.
10.5.2022
What is purity? A pure function is a function which, given the same input, will always return the same output. A function that reads a piece of data can only be pure if no other function alters that data. A pure function that causes other functions to become impure, by e.g. altering state that other functions read, is said to have side effects.
Can we write a completely pure graphical application? Yes, when the program only describes one frame of the app, that is, one iteration of its mainloop.
This means that any side effects are allowed to happen between two frames - after the evaluation of the last, and before the evaluation of the next frame. Every repeated evaluation of the same expression during that frame however must yield the same result.
For functions that usually have side effects, e.g. such as a function that reads from a file, we can trivially introduce purity by caching the result until the next frame. This allows us to evaluate lazily: even though reading from a file has side effects, we can choose to only perform the read when it is actually requested. It does not matter how late in the frame this read comes, and whether the result would have been different had we read earlier in the frame, the only thing that matters (for purity) is that any subsequent read yields the same results.
This is particularly interesting when streaming. Say we read an 8-character string from a generating function f(size), as f(8). We want to read more data, but by the definition of purity, f(8) will always resolve to the same string, not the next eight characters.
An abstract, solve-all way is to have monadic state that breaks the purity of the function, so we get function arguments like f(offset,size)->newoffset,value, specifying our offset in the stream. With the same offset, we get the same result. We can also rewind to an earlier offset if we like. To implement this function, we need to cache the entire stream read to the largest requested offset+size to an append buffer, and then return slices from that buffer on request. But when the frame ends, we can clear this cache, and allow our offset to reset to zero.
7.5.2022
The Frameloop VM really only needs two kinds of immutable values, blobs and cells:

blobs (aka chunks, strings, bytes, etc) can represent any kind of contiguous serialized binary data (integers,floats,booleans,utf-8 text,symbols,vectors,bitmaps,matrices,file contents) but are flat and schemaless.
cells represent any ordered composition of values with overloadable operators (arrays,structs,tuples,maps,sets,string wrappers) but can not be serialized as-is.

A cell can be composed of cells and blobs and also specify a type, which is another cell, and in that way allow to implement typesafety and type based dispatching. Cell keys used for forms are also cells, as are functions and built-in forms, S-expression lists, and so on.
We can convert a cell to a contiguous blob using a generic serialization function, and a blob back to a cell provided the format is correct, and by that way we have a bridge between the two kinds.
We encode arithmetic types like integers, floats, booleans as single blobs in cells with a matching type that has the relevant operations overloaded.
Lowering code that uses blobs is simple, because they're just memory buffers; short buffers can be directly handled in registers based on their size and usage patterns, e.g. a 4-byte blob that is repeatedly added should best be stored in a 32-bit integer register.
Lowering cells to structs and arrays requires analysis of how these cells are used and transformed. Since all our cells are immutable, we do not need to account for mutation, but can rather optimize for elision of temporary values in favor of in-place transformations.
6.5.2022
After writing the terrain engine and reexperiencing, despite all the friction that Scopes has already removed in comparison to a C/C++ solution, how annoying and convoluted it still is to iterate on a GPU pipeline, when it could be as much fun as writing a Shadertoy, I am back at giving the idea of a GPU/CPU integrated VM a second try, because I really enjoyed the prototype I had built last year, and I think this backend simply needs to exist. I then want to reimplement Nowhere's terrain engine in it.
I singled out the runtime from Tukan and named the project "Frameloop", after its runtime model, which defines a program as a directed acyclical graph of operations that construct a single frame of a graphical app, which is then fed back to itself to maintain state.

Write a throwaway interpreter in python that can execute Frameloop sourcecode (looks a bit like scopes, has *.flsc file extension), so the implementation of Frameloop can be written in itself. The interpreter exists but isn't entirely complete. This is where I am at at the moment.
Write a transpiler in flsc that can convert Frameloop sourcecode to FLIR (Frameloop Intermediate Representation), which is based on the FIR model I developed for the Tukan VM. FLIR is the native, universal form of Frameloop programs, and there is no one source code representation for it.
Write another transpiler in flsc that converts FLIR to standalone C source code.
Write a compiler/optimizer in flsc that reads FLIR, lowers it to more efficient forms, and writes FLIR again.

This gives us a complete self-hosting compiler. With TCC added in, live coding is possible and we can provide JIT like services.

Write a FLIR generating backend for Scopes, and build Frameloop integration for Scopes, so Scopes users can use Frameloop to provide sandboxed modding for their games.
Write the rest of Nowhere's tooling in Frameloop and further develop its procedural engine.

In the long run, I also want to add web0 capabilities to Frameloop, so users can share and execute programs without the help of a browser.
28.3.2022
The project manager I theorized in my last post now exists.
18.3.2022
I keep returning to my idea for an agnostic, project oriented package manager that allows solo devs and small teams to manage dependencies for a project without having to subscribe to a global shared repository or a fixed way of doing things that doesn't match their development culture, without it becoming a dedicated job that requires sinking substantial time into it.
I have a personal need for such a tool, and even thought I presently don't have time to work on it, I thought I'd write down my feature list at least once, for future reference:

ships as single binary and is scriptable, very similar to the ideals of GENie or premake.
project oriented; that is, it uses its configuration script to find the root directory of the project, and then operates only within its subdirectories. the package cache also lives in this folder.
the project's recipe repository also lives in the project folder; we can check it in together with the source code and share it through source control.
we typically use the package manager to bootstrap and install project dependencies once when setting up the project on a new machine, and then subsequently only update partially, which requires removing an existing installation; this is done by first installing into a staging folder, archiving this folder for caching, unpacking it to the actual installation image (to verify our archive works), then later using the filename list from the archive to know which files to remove.
simple conditional dependency directed acyclic graph.
has a default, POSIX guided understanding of where installed packages go ($/bin, $/lib, $/include, and so on), but other installation directories can be configured.
recipes can be imported from other projects and from online template repositories.
Cross-platform; can operate on Linux, Windows and Mac OS X.
Can flexibly use existing system infrastructure to install additional tooling when required, namely svn, hg and git.
Project recipes can implicitly download files, check out source repositories (see previous point), unpack tar.gz, bz2, zip and 7z archives, copy files and directories, run shell commands and conditionally depend on other recipes to be executed first.
Platform recipes work the same way, but can install prerequisite command line tools system wide if need be, so that we can start off a machine project file to prepare the overall dev environment, e.g. install msys or build-essentials.
The recipe format should be compatible to scopes list notation. The format is super easy to parse, at lot easier to retain in human memory than YAML, and we can read recipes in Scopes too if need be.

About the implementation:

Could be prototyped in Scopes, and built as single binary from there.
Codebase should settle quickly, so that we can use a built binary over years.
Bootstrapping the PM requires the ability to download archives and files from HTTPS protocol, as well as the ability to handle a native introspectable archive format that we use for archive caching and bookkeeping.
A library like LMDB can be used to persist state where needed, although if state is already inferable from presence or absence of other files (e.g. archives), then that is to be prefered.
All other capabilities of the PM can be augmented with command line tools.
Most work will go into recipe parsing, topological execution of DAG, filesystem I/O, path and string formatting/replacement as well as piping data to and from command line tools.
Platform recipes should capture most of the "administrative wisdom".

9.3.2022
Nowherian space exists on a flat 3-torus, which means that if we travel long enough along one direction, orthogonally, we end up back at the beginning.
How do we introduce a daylight cycle into such a space? We can not put a sun at a distant point, because light cycles within our space and is gradually absorbed by the medium. There is no "outside" to our space, at least not in this dimension.
We can place the sun within our plane, as a point light, but it would be very hot, and nothing could be built near it. For a cycle, the sun would have to move, but not every place would experience a night. An alternative is for the sun to pulse, so that it is off for half of its cycle, which would cause a night everywhere at the same time. Places further apart would never fully shadow other places, because the light loops around and scatters through the medium, so any point has the ability to be lit by 27 immediately "adjacent" suns.
A different way to introduce daylight in a way that creates timezones would be to originate the light from a parallel dimension. Let's drop to the flatland case to make better sense of the situation. We live within a repeating plane. The light source is outside this plane, perhaps as a traveling emitter that irradiates the surface like a scanner light. "Photons" hitting the plane would be tied to its surface, effectively losing one dimension, and be randomly deflected along the plane to hit any solid object. Perhaps the photons would also only be created by the medium as a reaction of the exocosmic irradiation, as a luminescent effect.
Without further constraints, everything would be lit up everywhere, even the insides of caves. So concentrations of matter would have to deflect radiation before it hits the surface, effectively causing light to always build up further away from matter, preserving the darkness within cavities. This would look like in practice as if the medium glowed like a light in the fog, or a bright sky. The light would have a smooth ambient lighting character, seemingly coming from everywhere and producing no hard shadows.
If the irradiator is focused, the resulting irradiation pattern could be spherical and moving only along one orthogonal axis, so that depending on world position, we get more or less light in a day; there would be a concept of a spatial "equator", which receives the most light in a day, which would influence temperature and climate in that region, and the light could shift laterally to the orthogonal axis over the course of a year, so that we end up with changing daytime and "seasons". The intensity of light could also influence the color of the luminescence, so that at the fringes of the spherical pattern we get rainbow hued evening colors as are familiar on earth.
There is also the matter of recycling. The bubble is an apparent perpetuum mobile, in which waste heat is converted back to matter. How it could work is that an exocosmic coolant transfers heat away from the world (what the universal expansion does for us), concentratic it someplace else, and eventually injecting it back into a particularly empty region of space. One might imagine this cooling effect as the result of heat "falling" into a narrowing exocosmic funnel that loops back to a point in space, guided by the same field that repels exocosmic radiation, injecting primordial matter. To an observer it might seem like a volcano erupting from a point rather than from the ground, its lava eventually cooling down and forming a new planetoid.
The light and energy produced by the radiation emitter is also sourced from collected heat.
The funnel has a bottleneck and a threshold, and so there is a bandwidth limit as to how much heat can fall into the funnel per second (with the greatest part absorbed by the radiation emitter), and a threshold that requires a substantial build-up of temperature within the funnel that triggers a cascading effect causing an eruption. So the world may experience a greenhouse effect if more waste heat is produced than can be absorbed, but this would not cause a runaway cascade effect, as the amount of energy that can be burned within the closed system is finite.
There is an alternative outcome, in which the system does not burn extra energy because all metabolic activity has ceased. This will only cycle light energy, of which the smaller part will be lost in the funnel and occasionally ejected as matter, which does not degrade fast enough under radiation. Eventually there will be an abundance of matter as the rest energy left for light depletes, and at some point the ambient sun will glow so faintly that it can barely be registered, and the lights go out.
8.3.2022
Been crunching Nowherian time and space scales today to figure out how big to make the map and how to make the short time of the game feel like a long time. I do have some good metrics now, and the terrain can be 64km³ big (with lots of empty space inbetween), at a resolution of 1 meter per sample.
In general it's interesting to observe how spacetime changes in proportion. Let's say we map the approximate circumference of earth, 40075km, to a game map of diameter ~1336km, with the approximate conversion factor k = 1 / 30.
To orbit earth at 960km/h, the best common speed of a commercial aircraft, would require a little under 42 hours. We can scale this duration directly by k and estimate that our map would be orbited in a little over 83 minutes instead, at the same speed.
To arrive at the same time spent, we must either slow down the aircraft by factor k to 32km/h, or redefine the concept of an hour within the game, so that 1 game hour = k real hours (or k * game_duration = real_duration), exactly two minutes. We'll use gh for game hours, h for real hours in this text.
So now we can translate 960km/h directly as 960km/gh. Orbiting the map still takes 83 minutes, but since the clocks in the game run much faster, they will show 42 hours have passed. We also run the day cycle at this speed, so a day takes 48 minutes now. The impression of seeing the sun go down and then up again during flight will have the same proportions.
Because our days are compressed, our lifetime would be too. Assuming we skip 1/3 of each day in the game because of sleep, a game year takes a little over 8 days in the real world. 80 game years pass in a little over 1 real year and 9 months. A game completion time of 44 real hours would cover a little over 82 days in the game.
We do have a considerable scale issue here, because hardly anyone would see the end of the game if they had to spend 23360 hours to complete it; assuming you play 2 hours a day and 8 hours a day on the weekend, that's 26 hours a week, 1352 hours a year, so you'd effectively need over 17 real years to get to 80 game years.
But if you compress time so much that 80 game years fit into 44 real hours, then a day in the game would last about 8 seconds. The sun and the moon would whoosh by with confounding speed, and the days would be over faster than you could make a decision.
So we say OK, we compromise, and either days are, in relation to time, much longer, and this includes all cycles they entail, or the lifespan in the game is much shorter, so you age 1 year per day, and thus turn 80 game years around 42 real hours. In the end, it means the same thing.
If a game day is representative for a game year of one's life, then we could apply storytelling techniques and treat it as pars pro toto, the part for the whole, as if every day had the weight of 365 days. What you eat, you ate 365 times. What you earned or lost, is incurred 365 times. How you slept, you slept 365 times. Inflation rises in a game day by how much it rises in a real year. That seems a bit strong for some applications. Will you have to cut your hair back every day too, which grows with the pace of a magical shrub?
This arrangement also implies that you complete major stages of your life within 20 game days (or under 11 real hours), from childhood, to young adulthood, to middle-aged, to old age.
What happens to other annual cyclical change, like seasons? Let's say we want the player to go through the seasons five times, so that in each stage of their life they go through the seasons once, but do not end up in the same phase, then a season transitions within 4 game days, or 2 real hours. That's too fast.
If we, instead, map five seasons to a lifetime, then a season lasts 16 game days, or 8.5 real hours. As four seasons define the length of a year, a game life then lasts a little over a game year, i.e. our life is annual. So concerning some annual effects, we can ease up a little, and say that 16 game days have the impact of 91.25 real days, so 1 game day can instead have the impact of 5.7 real days, a bit over 1 day shorter than a week.
What happens to our concept of weeks and months? We only have 64 game days a game year to distribute. Say we only assign a single month to each season, then a game month has 16 game days. If we assign 5 game days to a game week (2:40 hours to play), then a game month has a little over 3 game weeks, which gives us a first week, a peak week, and a last week for each season.
7.3.2022
Trying to figure out what chunk size of terrain data I can practically support. Every time I update a chunk, I need to update all parent LOD chunks, so log2(W) steps for a W³ world. Our morton codes should fit into 64-bits. 32 bits would be even better, but then max(log2(W)) == 10, and at one meter per sample, that's merely a cube of a kilometer sidelength. So we can do 21 bits per sidelength, or 2097 km cubic sidelength.
Let's say I want to spend as much bandwidth as I would on one full HD screen: B = 1920*1080 = 2073600. So I divide by the number of LOD levels to be updated, and I get C = 2073600 / 21 = 98742.6 cells per chunk to be updated. Since C = W³, W = C^(1/3) = 46.2, and so 46³ would be the chunk size to use to fit a HD screen. The next lower power of 2 is 32, so 32³ would be a good choice here.
8.2.2022
Notes on terrain sector meshing.
We recursively generate the u32 sector keys to process and write them into a mapped GPU buffer, which is used in the next step, one key per workgroup.
The next step can be done in the same shader with shared memory and barriers, as long as we keep temporaries under 32KB.
If we can stick to 8 bytes per cell, then we can accomodate 4096 = 16³ cells.

preprocess 4³ sector; per cell,

fetch 8 values of dual lattice (5³ environment)
compute and store 3 edge sign bits
compute and store feature plane

probably fits all into 8 bytes
packed normal + distance: 3 * 16snorm = 6 bytes
sign bits in last 2 bytes


generate triangles from 4³ sector

read sign bits from 2x2x2 cluster
generate up to 3 quads for sign bits:

compute feature vertex from feature plane (normal * distance)
use plane normal for vertex normal


6.2.2022
From experiments, I see that sampled distance fields produce cleaner and sharper MC isosurfaces than density fields, as their wider support region enables C1 continuous gradients (and beyond), but i find them harder to reason with as source data.
Density fields make probabilistic statements about occupancy within a cell: a density of d=0 means there is no mass in that cell, d=1 means the cell is maximally dense, and d=1/2 means the cell is half full. We do not know where the particles are in that cell, but if we took a uniformly random sample, with d=1/2 we would hit a particle half of the time. That's what "probabilistic" means.
A density field has by definition no embedded isosurface, so by convention we set the isosurface at d=1/2, that is, where our field crosses the 50% boundary, we determine inside/outside of the surface. A more correct rendering of the field would instead use its density as inverted transmissiveness for a participating media raymarching, ending the march where transmission is 0%.
A signed distance field only makes statements about the distance the center point in our cell has to the surface, and we extract the isosurface at d=0, as we have positive and negative distances as support. We need the support region to extract a smooth gradient and curvature using finite differences. We can not directly derive density from the field; this would require integrating a step function over the interpolated cell, so that negative values map to zero, and positive values map to a constant.
Where density fields shine is when we integrate them to lower resolutions, because we can conserve the spatial energy with decreasing accuracy. Signed distance fields can not be properly integrated. Their weighted average can only be used at greater distances, and one always has to drop to the highest resolution the closer one comes to the surface. Lowering their resolution requires building a min-function over the cells to be merged.
1.2.2022
Further worked on the terrain engine with LOD support. This is the present approach to updating the terrain.
We use a double buffering scheme that updates over multiple frames.

Compute addresses of all cells required for the player position.
Copy all reused generated cells from the last vertex buffer to the new vertex buffer, and remember for each cell its region in the array, so we can find it in the next update.
Over multiple frames, generate vertices for all new cells and append them to the vertex buffer. Also remember those cells regions in the array. Only process a fixed number of cells per frame.
Swap out the old vertex buffer for the new one and repeat from (1).
The vertex buffer can then be drawn as one.

We should also keep a MRU cache for triangles generated at cells, faces and edges that we can fetch from when a discarded tile reappears.
26.1.2022
Developed a stateless heuristic to maintain the tile list, such as int(min(9.0, -log2(length(p/16.0) + 1.0/1024.0))).
How it will work is that over several frames, we just statelessly generate the list of required tiles based on the latest position, and then request from the tile cache the cells and faces we need, possibly meshing new tiles that we haven't seen before.
The tile cache then has a memory cap, and starts throwing out old tiles (i.e. least recently used) when it runs out of its allocated memory.
With 32-bit addressing, we can cover a tile space of 1024³; each tile then has a resolution of 32³, which gives us an effective resolution of 32768³; that is more than enough.
When a change in the terrain occurs, we regenerate all visible tiles in the cache overlapping with the invalidated area, and discard the invisible ones.
24.1.2022
Ok, let's see if I can formulate an algorithm to manage the terrain.
We need to manage two kinds of partition objects, cubes for cells and squares for faces, namely where cells are connecting. All cells have a dual sample resolution of 9³, so we can run 8³ jobs to build the mesh of a cell and get good GPU occupancy. Similarly, faces have a sample resolution of 9² and 8² jobs for good occupancy.
The scale of a cell or face only doubles or halves, so it's always of size 2^N, where N is an integer LOD level.
Each cell owns a triangle soup that represents the contoured mesh in its range.
We store cells by address in a hashtable, Lewiner style. This way, neighborhood, parent and child queries can be performed implicitly.
We also use three more hashtables for X, Y and Z oriented faces, where each face has the address of the cell whose lower front left corner it touches, and resides on the same level as the lower of the two LOD levels it connects.
LOD levels: 0 is the highest resolution (cell has smallest scale), and N the lowest (cell has largest scale).
We maintain the general rule that the LOD difference between two adjacent tiles can't be greater than 1, so that we can use an invariant 1:2 meshing kernel for boundaries of different scale.
21.1.2022
Sigh. i don't think my cascaded terrain mesh is panning out, or making lots of sense for that matter. It looks like I'll have to work with convex clusters like everyone else.
I'm thinking the simplest way to go about it is to connect dual meshed chunks of size 32³. We go up the LOD level for chunks that are further away, so that the maximum possible level is 5, with one vertex per chunk. With dual meshing, we only need the 6-neighborhood of each chunk to connect. We also further require that adjacent LOD levels can only differ by 1 maximum, so that we only need to deal with 1:2 border cases.
So what we end up with is an L³ lattice of lodable N³ chunks, with LOD levels increasing away from the center of the lattice. Once a cluster has reached its highest LOD level, the lattice becomes a L³ chunk of a higher K³ lattice (if that is even necessary). As we move across the lattice, high LOD level chunks are streamed in at the corners, and chunks are piecewise updated to reflect their correct LOD level for the distance they have to the center.
We do need to adapt the partial updating scheme to this arrangement and properly schedule chunk updates.
7.1.2022
Happy new year!
I finished my first prototype for a camera centered LOD lattice. The demo in the video even sources a precomputed volume for data. But I don't believe it's worth the effort to fix and optimize it further. The next step would have been to switch to dual marching cubes. Still, using volume cascades has multiple benefits, mainly that translation only invalidates a border of the volume.
Scheduling cascade updates is also simpler, as we can alternatively update one to two cascades per frame, depending on how much frame time we have.
    0         1         2         3         4         5         
    012345678901234567890123456789012345678901234567890123456789   60 FPS
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|   33 ms
     1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |   67 ms
       2   |   2   |   2   |   2   |   2   |   2   |   2   |   2   133 ms
           3       |       3       |       3       |       3       267 ms
                   4               |               4               533 ms
                                   5                               1067 ms

    0         1         2         3         4         5         
    012345678901234567890123456789012345678901234567890123456789   60 FPS
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    010101010101010101010101010101010101010101010101010101010101   33 ms, 33 ms
    2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|2|   33 ms
     3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |   67 ms
       4   |   4   |   4   |   4   |   4   |   4   |   4   |   4   133 ms
           5       |       5       |       5       |       5       267 ms
                   6               |               6               533 ms
                                   7                               1067 ms

So the effort to update a cascade depends on whether terrain was invalidated, which fully updates the cascade around the invalidated terrain, and/or whether the camera has translated, which requires a memory move within the cascade, and generating new slabs at the border.
8.12.2021
Developed a few terrain related solutions and published them as Shadertoys.

Features from Curvature, an alternative to QEF based feature finding for dual contouring, where we just do a quasimontecarlo search for high curvature points, by using gradient descent to find the surface, then picking the point with the highest curvature. If none of the gradient descent samples managed to hit a corner exactly, we don't get a perfect point though.
Subjective Filtering, using spherical inversion and hyperbolic transformation to produce a volume in which sampling detail gradually decreases with distance from the center, but never reaches a border, so that the volume conceptually stretches out into infinity, described with a simple average sample. This was meant as a continuous alternative to sampling cascades, so that we can generate information into a single grid rather than into several ones, and don't have to manage continuity at the edges. Bi/triquadratic interpolation is enough to produce a smooth field, but must be performed in the inverted space. As long as ray directions originate from the center though, they remain linear under inversion. Only lateral rays become curved.
Baryquadratic Interpolation (2D), interpolating four samples with C1 continuity and strictly quadratic interpolants using triangle subdivision. I was hoping to extend this to 3D, but the interpolant would have to be subdivided into 40 tetrahedra.
First Positive Root of Hextic, using gradient interval arithmetic to find first roots on higher order polynomials. This was preparation for raytracing a triquadratically interpolated surface.
Triquadratic Raymarching, using gradient interval arithmetic and dimensional reduction of a polynomial to efficiently raymarch a triquadratic surface. Such a surface can be used as an interpolant for terrain data.
Relaxed Distance Field (JFA), inspired by Relaxed Cone Stepping, computes the relaxed distance field for a signed distance field, which is the distance of a point to its nearest backfacing surface. The data can be used to march through a field with significantly fewer iterations, as we can guarantee that we don't miss any roots.
Relaxed Sphere Marching, extending the ideas of the RDF to analytic shapes such as circles and rectangles, and exploring composability. RDFs don't work that well for implicit functions, unfortunately.

I'm now continuing work on the functional reactive semantics I was working on earlier, while writing actual systems for the game in it. Right when it was time to make a visual programming GUI for it, I lost all motivation, and I guess that's because i want to write the GUI in the semantics of the language, but I want to use the GUI to write the GUI! A typical hen and egg problem. So I figured I could probably get a lot of mileage out of working in text alone for the time being, and once that is comfortable enough, I can start working on a GUI for it, and at least do it with comfortable semantics.
I also see so many things that need to be done but I have to somehow ignore all that future work and stop longing for ideal outcomes and instead just plough through, and solve problems when they come up.
It's already got great ideas. We have closures without scope limitations, side-effect free programming, no GPU/CPU barriers, direct access to GPU compute and GPU rasterizer, implicit deduplication of expressions and culling of unused expressions and automatic GPU control instruction ordering deduced from the expression dependency graph.
What we don't have yet is metaprogramming facilities like operators, custom types and operator overloading. We don't have loops, recursive state and state serialization yet, as well as interfacing with other I/O, namely files, audio, game controllers and network sockets.
These problems all have easy solutions. A harder optimization problem that I have no solution to yet is functional compression: since the functional DAG is entirely lambda dropped, there's an opportunity to reduce code size by lambda lifting, which requires finding expression subgraphs that have similiar shape, and replacing them with function calls. But we'll cross that bridge when we get there.
10.11.2021
I'm looking at the sampling mesh that produces truncated octahedra. This one has a bunch of desirable properties that made me look into it:

The sampling mesh is a tetragonal disphenoid honeycomb, in which we build dual vertices within simplices exclusively; here it is impossible that dual vertices produce broken 2-manifolds because every possible configuration only produces one surface.
When extra balancing vertices are added to the center of faces, the truncated octahedron allows plane symmetrical 3-coloring of vertices, greatly simplifying barycentric UV mapping.
Only six hermite points go into the QEF process, increasing precision and saving computations.
A singular interior sign produces a well tesselated spherical surface with 44 to 72 faces (depending on face subdivision) and 24 to 38 vertices, producing well tesselated curvature for small curvaceous details.

The spatial structure of the mesh isn't the most straightforward though. We have these data points:

sign bits are on a BCC lattice with two unique points per eight cells, at the zero corners
body-diagonal hermite points are on a cubic lattice with one point per cell (eight per eight cells), on BCC mirrored diagonals
axis-aligned hermite points are on a triplanar lattice, six unique per eight cells
feature vertices are on a triplanar lattice, four per axis i.e. twelve unique per eight cells
depending on tesselation, up to 6*2+8*6 triangles per eight cells, or 12 triangles per cell (96 per eight cells); but not really, as we're building 2-manifold meshes, so the real triangle demand will probably be around half that.

The hermite to vertex ratio is an optimal 6:1; the hermite to feature ratio is 7:6. the float data to feature ratio is 23:2 or 11.5:1.
Storage: per cell we have 2/8 bits of signs, 7/4 hermite points and 3/2 feature vertices. Uncompressed, this costs us X*Y*Z*(30 + 2/64) bytes (log2 offset 4.908392). A 32³ volume would be 961 KiB in size.
Working with the sheared cubic lattice is vastly easier though.
9.11.2021
More documentation on the FCC lattice with miller index (111).
Samples in the 111 lattice, per cell:

1 distance field sign (1 bool per sign)
6 hermite edges (normal + offset, 4 floats per edge)
3 contoured dual vertices (3 floats per vertex)

All in all 33 floats, but can be losslessly reduced to 27 with octahedral normal packing (I'm not counting the sign bit, which can be infered or stored in the sign of another absolute coordinate, of which there are plenty)
The lattice can be subdivided without taking new samples, which is important for level of detail. New corner sign bits are inferable from hermite planes, new hermite points are extrapolated from the dual surface, new dual vertices are computed from hermite points.
The cubic lattice based equivalent would carry per cell 1 field sign, 3 hermite edges and one dual vertex, which comes down to 15/12 floats, that's a reduction of 1:2.2 to about 44-45% the size. But the cubic lattice carries only one third the amount of dual vertices per cell, so the ratio of data to vertex is 15:1, 12:1 at best, whereas the 111 lattice has a better ratio of 11:1, 9:1 at best.
Building the octahedral dual vertex requires 12 hermite points, similar to building a cubic dual vertex, whereas the tetrahedral points only require half that. So the hermite to vertex ratio is a constant 12:1 for the cubic lattice, the 111 lattice's ratio is 8:1 on average.
With a body diagonal, a tetrahedral partition and a squeeze, the cubic lattice could reduce its data footprint. It would require four more edge samples, seven in total, but could generate six feature vertices, bringing down the data to feature ratio from 12:1 to 39:6, or 6.5:1; and the hermite to vertex ratio to a constant 6:1. We'd generate faces of two types: hexagonal and quadliteral. The resulting, uncontoured surface would have the shape of a bitruncated cubic honeycomb with the more spherical looking cells in BCC close-packing arrangement, as opposed to lattice 111's rhombic dodecahedra in FCC close-packing arrangement.
After pondering packing configurations a little further, I have concluded that I want to maximize the density of feature points. The best dual vertex packing configuration is FCC, whose vertex image is the Tet-Oct Honeycomb, which produces only equilateral triangular faces, the ideal image of a surface. It feels correct. Each vertex is constrained by the same space, a rhombic dodecahedron, which comes down to clamping the vertex against three planes in the octant it is in.
I look at the FCC neighborhood and conclude that we end up using everything! corner signs are defined on a cube lattice, axis signs in a FCC pattern L of half frequency (and so we have irregular samples on the BCC lattice), hermite points on the rhombic dodecahedron's edges connecting the signs, the dual vertices contoured from the hermite points on the opposite half-frequency FCC lattice of L (clamped by the rhombic dode), and, finally, the connected faces of tetra/octahedra.
It's a little more involved; to compute each point, we need to add up twice as much edges as for the cubic lattice solution.
Data storage / addressing is as follows: RD corner bits can be stored in a volume of full density, if need be. RD axis bits are 4 per 8 cells in an even/odd pattern, so we merge cells from odd z layers into the even layers, halving the Z resolution. Generated vertices can be stored the same way, for the alternating pattern. All corner-axis edges of a cubic cell, of which each edge holds one or no hermite point, are connected to the same RD axis bit, so they too can use a volume with half Z resolution, eight edges per cell; we could also instead force spatial equivalency and double the volume's XY resolution and use one corner-axis edge per cell.
To summarize storage; for dual contouring a XYZ cubic space (minus the borders):

X*Y*Z*1 bit for SDF signs at rhombic dodecahedral's cubic corners
X*Y*(Z/2)*1 bit for SDF signs at rhombic dodecahedral's pyramid corners
(X*2)*(Y*2)*Z*vec4 for hermite points; .xyz components define the normal gradient, while .w holds the edge scalar (with a value < 0 encoding no sample). A normal can be losslessly compressed as octahedral (bringing us down to vec3), and as all values are in the normal range, 16bit floats are probably enough. A vec2 might suffice.
X*Y*(Z/2)*vec3 for dual vertices; these vertices have cell-relative coordinates, so 16bit floats packed into a vec2 might be enough.

So, the storage cost for a dense X*Y*Z volume is X*Y*Z*(54 + 3/16) bytes uncompressed (log2 offset 5.76), X*Y*Z*(36 + 3/16) bytes compressed (log2 offset 5.18). A 32³ volume would thus require 1.69 MiB uncompressed or 1.13 MiB compressed.
Let's compare with a pure, cubic lattice solution, which would require X*Y*Z*(48 1/8) bytes uncompressed (log2 offset 5.59), X*Y*Z*(32 1/8) bytes compressed (log2 offset 5.17). Doesn't make that much of a difference, does it?
8.11.2021
It's time to revisit dual contouring and rewrite the landscape engine. Based on incomplete work I did in 2015, I laid out the new sampling & mesh generation system on Twitter, but I'm going to repeat it here for posterity.
We generate 0,3 or 4 out of 6 quads per type A simplex, using the signs at the simplices' vertices to decide. The neighborhood of 11 QEF contoured dual vertices is sourced for the quads: 1 at the center, 4 from octahedral neighbors and 6 from type B simplex neighbors.
To keep working with the coordinate system and underlying datastructures neatly orthogonal (read: to prevent ourselves from going crazy), we project the coordinates from a FCC coordinate system. This allows us to store samples in a cubic lattice; per cell we only need to store 1 corner sample, 6 edge samples and/or 3 body samples, depending on what we're doing (sampling, QEF contouring or meshing). The shadertoy demonstrating the transform is here.
The reasons why I chose this particular projection for dual contouring:

we can use marching tetrahedra, so the mesh is a true manifold, without self-intersections or double-booked vertices
We only generate quads
Each quad triangle is equilateral, the surface produces sphere-packed rhombic dodecahedra, so mesh quality is great

The only caveat here is that contoured vertices are constrained to different polyhedral spaces, but the math is trivial in FCC space:

All samples are constrained to the unit cube
Tet/Oct/Tet vertices are constrained to manhattan distance 0..1 / 1..2 / 2..3 respectively

1.11.2021
Been improving my math knowledge, mostly by working on problems involving piecewise and analytic integration, least squares fitting, preintegrated convolution, fourier decomposition, complex numbers and spherical harmonics.
As a result, I have a much better understanding of things that I have previously made use of hastily, and I am therefore now able to work on problems more accurately, and with the aid of wider inspiration.
So, I might have done LPV occlusions wrong until now. A quirk of the LPV is that we're using the center point of each cell as a kind of environment map; an occlusion should therefore be represented as a masking with the oriented solid angle of the contour of the occluder.
When sampling a distance field, we can only approximate this information. Previously, I've simply scaled a projected gradient normal by 0..1, depending on how deep we were into the surface; so, assumed coverage of the cell translates to an oriented occluder.
An actual integrated sample at that position would never be able to mask a solid angle greater than 180°; for when we're integrating beneath the surface, there's a sharp discontinuity to a solid 4 * pi of coverage.
Because of this discontinuity, when we're sampling a SH field, surfaces should not incorporate SH samples that lie inside surfaces.
I also noticed that simple scalar interpolation in linear light space is wrong. Instead, the light should be interpolated logarithmically, with the exponential taken afterwards, which greatly reduces discontinuities.
I suspect that SH samples also should not be interpolated linearly, as we're simulating the integration of what a surface sees as it moves between two samples. I suspect that, as SH coefficients are supposed to be squared in the dot product, and we're dealing with perspective projected areas, they need to be interpolated in the space of x² or 1/x².
4.9.2021
The smallest detail visible with the naked eye is ~0.05mm in size; that's about the median size of a plant cell. A game map area perceived as "huge" is 32km in diameter. this gives us a "perfect" resolution scale of 1:640,000,000; or about 29 log2 mip levels.
Devising an efficient integration/differentiation rule that can be applied to a (2^29)³ sized spatial reality is an interesting problem. We must not only consider a static world, but a moving one too. It's obvious that we don't have 2^87 bits of storage available, so some kind of compression must be employed. Even in a sparse, surface only representation, we're still left with 2^58 bits of data, a terabyte of information.
My preferred solution is semantic compression. Particularly when we get to microstructures, materials have great crystalline homogeinity. Instead of a bottom-up datastructure that is integrated as we go up, we should instead consider a top-down representation that procedurally generates detail from statistics. The lowest miplevel of this world is a single composite value describing the entire world. If this value were a seed number and a given total mass, the next highest level, in an octree, would be a 2³ structure with that mass distributed among all octants, and new sub-seed numbers for each octant.
What we want here is a universal but configurable fractal production rule that can be recursively applied to all 29 levels, and which we use to generate local detail. The headache begins when player induced state changes are involved. Depending on which levels those changes appear, we can still use the same data format and project it to lower miplevels. But we now need to integrate those changes to realize and update higher miplevels, and that means switching from seed based generation to data storage, unless we can figure out how to find the correct seed for the change we've made.
29.8.2021
continued here