ad-hoc devnotes

(ad-hoc devnotes are continued here)

29.8.2021

Thought about "synthegrals" on a 1D example. Averaging 4 samples (blue) to 2 (orange) after applying a 1:2:1 lowpass kernel (magenta) destroys information; that means that upsampling 2 samples back to 4 can "create" information within the constraints.

Another interesting observation is the presence of a phenomenon that is sort of the "inverse" of the Nyquist limit: just as we have a resolution limit, we also always sample a "window" of the world; when the period of the lowest frequency exceeds this window, it becomes a constant.

A real-world example of that is that we assume the laws of physics (and by extension, chemistry) to apply to the entire universe, because they appear to be constant within our visible window of it. They could be varying over a larger range, but we can afford not to care.

So there are two natural boundaries to sampling, and by this, to our perception: low-amplitude, high-frequency details above the sample rate, and high-amplitude, low-frequency details whose period exceeds the sampling window.

Because of these limits, we can imaginatively and procedurally synthesize in both directions without running out of memory: zooming in, new high frequencies can be imagined while the lowest frequency can be dropped to a universal constant.

Zooming out, the universal constant can be used as center of a new low frequency structure, while the highest frequencies can be filtered out. This way, through movement, we maintain the consistent illusion that a) there is more detail b) there is a bigger picture

10.8.2021

I'm a little anxious about what comes next, because now we re-enter GUI design territory. Everything is in place, all hard fundamental performance questions explored and answered. I'm back at implementing an efficient visual programming interface, and I don't want to fuck it up.

What bothers me the most currently is keyboard editing (navigation/input) within the DAG, compared to working with a text medium. If we can't make keyboard editing a fluid experience the way text can, in fact, if we can't be even better than text (after accounting for missing muscle memory), then visual programming is doomed. But I'm confident that it's possible.

I'm worried that we might repeat the complex quirks of previous WYSIWYG interfaces. I'm thinking of the examples of LibreOffice Writer, Adobe Dreamweaver and Markdown/Markdeep. Dreamweaver operates on an HTML document and translates tags to widgets in realtime. Similarly, Markdown translates readable text syntax into semantic and graphic elements, but that's offline. Both techniques have in common that there is an underlying ANSI/Unicode formatted document which can be parsed and translated into a presentable, graphical version, where Dreamweaver supports backprojection, i.e. editing the graphical version alters the text document.

LibreOffice Writer is special in that it does not offer a "readable code" form of its documents; or rather, there is human-readable XML underlying the visual editing, it just does not expect you to know about it. Microsoft Word goes even further, in that the underlying format is binary and proprietary. Editing a Word document frequently has us encounter "invisible formatting": a space in a paragraph uses a different font setting than all other letters in the paragraph. We can't tell a non-breakable space from a real one. A special mode is offered in most text editors that replaces popular invisible control characters with visible characters: space, tab, carriage return.

We take from these observations that layout requires control elements; that the presence of visible control elements creates confusion and can feel overwhelming; that hiding control elements allows the editor to get a WYSIWYG feeling; but that it can become equally confusing to have to deal with unexpected behavior caused by invisible control elements.

From looking at code editors, we can see what programmers prefer in their work. Programmers are an empowered class of users who have the most agency over their editing environment and who are able to eat their own dog food in ways that no writer or designer merely using a computer can. So we can truly observe from their apps what users want, without falling for marketing gimmicks, for features that look useful to a novice but are generally loathed by experienced users, or for design that plays to the cognitive strengths of the programmer creating the app rather than the cognitive demands of the app's user.

Programmers use a both-hands-on-keyboard scheme, ideally being able to utilize all keys of the keyboard to perform every required task without ever needing to use the mouse. This surely has historical reasons: programming is much older than the mouse. Designers and artists in contrast use one-hand-on-keyboard-one-hand-on-mouse; Blender is a good example. In the classical right-handed setup, the left hand rests on the keyboard, activating functions from letter keys directly, without the need for modifier keys, while the right hand controls the mouse, its scroll wheel and its three or more buttons.

Here, we experience our first split in the design: text input requires the right hand to switch to the keyboard, but interaction with graphical, pseudotactile elements requires the right hand to return to the mouse. On laptops, often both hands drop from the keyboard to the touchpad. We can in general assume that only the left hand receives the luxury of resting in the same area.

What feels best for the user is to be able to stick to one position, and we are making a graphical app for procedural design that greatly benefits from pseudotactile operations like drag & drop, slide, twist, etc. But users also need to frequently input names for identifiers and functions, as well as write documentation. So, somehow, paradoxically, we need to support both modes equally well.

In the GUI paradigm this cognitive split manifests in the concept of "keyboard focus". Are we focusing a graphical window? Then the keyboard/mouse mapping applies. Are we focusing a text input field? Then the keyboard is inputting characters, and modifier keys are used to access special functions. Mismatched focus frequently leads to hilarity. And there seems to be no way out of this complexity.

Code editors rely heavily on the use of keyboard shortcuts, which is suboptimal for a beginner. All users intending to use an app seriously do at some point make use of a "shortcut cheat sheet" to start memorizing keyboard shortcuts, which is seen as vastly superior to fishing a function out of a context menu, but of course has poor discoverability. Menus allow us to browse and discover the functions that a program has to offer; the shortcut only becomes relevant when we need these functions more than once, and in rapid succession.

Sidenote: What bothers me greatly are tutorial videos where the user is instructed to activate a shortcut without explaining what text-searchable concept it is connected to. If a new version of the program has moved the shortcut around, the video becomes worthless. If we made shortcuts 100% user-configurable, going so far as not to offer any shortcut presets from the start, then tutorial videos would be forced to reference the name or symbol of the utilized function, rather than the keyboard shortcut, to communicate, as each user has their own shortcut setup.

I do like the approach of games (often enforced by the publishing rules of console vendors) to put shortcut legends on the screen, or at least keep them nearby in an explanatory menu. User-configured shortcuts or not, this greatly eases the cognitive load on the player and keeps the smallest necessary cheat sheet close.

Sidenote: A history of recently used functions, similar to the history of browsed websites, can also help to reconstruct activity for partial automation. Programmers cherish consoles for their natural tracing of past input and output, whereas GUI systems suffer from "out of sight, out of mind" syndrome, where we technically would have to record a video of ourselves using the app on a loop at all times so we can e.g. go back five minutes in time and check what we did before. Often enough, the undo buffer (which consoles do not provide) serves to provide that function in addition to ad-hoc versioning.

Programmers like their text free of formatting. They prefer characters of fixed width, so hand-aligning paragraphs and columns vertically comes easy (i.e. there's value in a raster that has a much lower resolution than pixels). Any coloring, italicizing or bolding of text is a lens that is applied on top of the document to enhance readability without interfering with editing at all. Most editors now support automatic insertion of syntactical characters such as parentheses, which some hate and others depend on. It can be difficult to get used to automatic parenthesis insertion when your muscle memory is already trained to add them yourself.

Lately, code editors also support folding of functions, but this feature is also debated, as it introduces content hiding. Typically, there is an ellipsis symbol present to indicate that folding has taken place, and attempting to write near it unfolds the function.

In programming, there are two cursors on screen: one is the mouse cursor, which is modeless and has pixel coordinates, and the other is the keyboard cursor, which highlights the current location for insert/delete operations and addresses the text by line and letter column. The mouse cursor controls the keyboard cursor. The keyboard cursor can be extended to become a selection, and this selection is a linear range from a beginning line:column to an ending line:column. Many editors now offer a (for me rather confusing) way to expand the keyboard cursor into multiple, vertically arranged cursors, supposedly in order to perform certain search/replace tasks with less cognitive overhead; I have accidentally activated it more than once. For now it only complicates an already complex situation I have to think about, so I shall pretend this mode doesn't exist.

The question is now, when we edit a DAG, how do we locate the keyboard cursor, and how do we describe selection? We do not have lines and text columns in the traditional sense. We also don't want to enforce the use of a monospace font.

We could simulate the presence of a line:column raster (with variable-sized character cells) that the cursor can use. Similar to text, we would then require a bijective mapping between document location and cursor, so that every coordinate points to one unique location in the DAG, and every unique location in the DAG has one coordinate on screen. Lines would have varying column length, and exceeding the line length would wrap into the next line. An increment/decrement of column should map to horizontal movement (except at the borders), and an increment/decrement of line should map to vertical movement.

This mapping would give us the added benefit of maintaining the format of error reporting in programming, which needs to point at text locations. It also allows us to straightforwardly convert from pixel to text location and back.

We would also need the concept of addressable spacing in order to support functions like jumping "from word to word", or often in this case, from widget to widget. That means spacing and margins must be clickable, and owned by the document. How typing into those blank spaces is interpreted depends on context. Typing after a cell could e.g. open a new symbol, but typing after a symbol would extend that symbol.

Cells can also opt to (modally) merge multiple characters into a single cursor location, in order to make intra-text selection impossible unless e.g. editing is activated. This is for example desirable for buttons and menu fields.

Lastly, and this is the most difficult problem, how do we handle selections? Other than in a text document, we have real hierarchical boundaries between cells. Say we have two different and incompatible edit fields next to each other. Can we select the tail of the first field over to the head of the second field? What happens when we hit "delete"? Would that merge the fields? How do we split them then? How do we copy and paste such a selection? And this is the simplest case I can think of.

Unlike with a source code file, we cannot edit selections to produce a semantically illegal document. Interacting with a selection whose beginning and end lie within the same cell can be handled by that cell. A selection that spans multiple cells could be regarded as many separate selections, which each cell must handle individually. So in the two-edit-field case that I posted as an example, we would split up the selection into three parts: the tail of the first field, the spacing in between and the head of the second field. A copy operation would create a sequence of three clipboard items. A delete operation would ask the fields and the containing cell to translate that accordingly. That's one way we could do this. We do not have to block the operation if we can still make sense of it, but the resulting behavior could become too surprising to the user. Here, it might help to indicate early where a selection will have no effect; for example, in the spacing, the selection could be striped to indicate read-only space.

I do like the line:column mapping solution because it flattens the problem and has well defined behavior, even in the proportional text editing space. Collaborating users can use this information to communicate document locations.

7.8.2021

It becomes apparent that we often need to compute information for a CADAG that needs to be only temporarily associated with the DAG, and for this I use hashmaps, which increases the complexity of implementations and isn't cache friendly. In addition, when a mapping is dense, we not only needlessly thrash the cache, but also keep a useless set of keys around. Arrays would often be enough, if only our DAG IDs weren't their actual buffer offsets, but their entry index. So that's the change I propose:

A DAG uses as IDs the ordered indices of its entries rather than their buffer offsets. The general rule here must be that if offset(a) < offset(b), then id(a) < id(b). We do not explicitly store offsets with the data, to keep a strict topological and linear order, but rather reconstruct indices on loading, by scanning the buffer once (reading size information), and building an array mapping from index to offset from it. When we serialize the DAG, we don't need to serialize the index list, saving a possible source of inconsistencies.

As a result, we can now replace dense hashmaps with simple arrays, and topological processing on a cleaned up DAG will perfectly hit the array front to back in linear order. We also regain the ability to do a lazy depth-first traversal, in which we just walk the index array front to back, and we can easily query the total number of nodes in the DAG.
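(A quick C sketch of the rebuild-on-load step; the buffer layout is made up for illustration: each node is assumed to start with a u32 word count followed by its payload.)

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t *offsets; /* node index (ID) -> word offset into the module buffer */
    uint32_t count;    /* total number of nodes in the DAG */
} IndexMap;

/* scan the append buffer once and rebuild the index -> offset mapping */
static IndexMap build_index(const uint32_t *buf, uint32_t words) {
    IndexMap m = { malloc(sizeof(uint32_t) * words), 0 };
    for (uint32_t off = 0; off < words;) {
        m.offsets[m.count++] = off;  /* the ID of this node is m.count - 1 */
        off += 1 + buf[off];         /* skip size header plus payload words */
    }
    return m;
}

/* dense per-node side data then becomes a plain array indexed by ID, e.g.
   float *costs = calloc(m.count, sizeof(float)); */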

I was worried that this change would make it harder to parallelize processing later, but it's not too bad: new allocations continue to first atomically allocate from the append buffer. Then a new offset entry needs to be appended to the index array; to do this safely, we check whether the topmost index array element is equivalent to the buffer offset prior to our increment. If yes, we are the next one to append to the index array; otherwise we must wait for other threads to finish their allocation and retry. In this way, all threads will continue to append to the index array in order of allocation.

6.8.2021

Invented a hash stack, i.e. a hash that is reversible given its last combined argument. It can be used to fingerprint a path during stackless traversal without requiring extra storage, given that we know the deepest path element at every step:

hstack() -> hash("")
hpush(stack, value) -> rol(stack ^ hash(value), 1)
hpop(stack, value) -> ror(stack,1) ^ hash(value)

The challenge was to find a non-commutative reversible operator that maintains hash bit distribution while avoiding collisions. Using ~ or bitreverse in place of rol/ror generates too many early collisions from permutations. I also used +/- instead of ^ before, but with the use of rol/ror that appears unnecessary now. hpush and hpop can also be swapped out; what's important is that one function is the inverse of the other.

As suggested by Paul Khuong, here's a different version that uses a large 64-bit prime and its modular inverse to do the same job, and has possibly better collision avoidance guarantees:

hpush(stack, value) -> (0x66d6cf4cc5ddd26du * stack) ^ hash(value)
hpop(stack, value) -> 0x24ffe0c7fcc70765u * (stack ^ hash(value))
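(A small C sketch of both variants; hash64 below is just a placeholder mixer, not the hash actually in use. The round trip hpop(hpush(s, v), v) == s holds for the rol/ror variant by construction, and for the multiplicative variant it relies on the two constants being modular inverses mod 2^64, as stated above.)

#include <stdint.h>
#include <assert.h>

static uint64_t rol64(uint64_t x, int k) { return (x << k) | (x >> (64 - k)); }
static uint64_t ror64(uint64_t x, int k) { return (x >> k) | (x << (64 - k)); }

static uint64_t hash64(uint64_t v) { /* placeholder 64-bit mixer */
    v ^= v >> 33; v *= 0xff51afd7ed558ccdULL;
    v ^= v >> 33; v *= 0xc4ceb9fe1a85ec53ULL;
    return v ^ (v >> 33);
}

/* rotate based variant */
static uint64_t hpush(uint64_t stack, uint64_t v) { return rol64(stack ^ hash64(v), 1); }
static uint64_t hpop (uint64_t stack, uint64_t v) { return ror64(stack, 1) ^ hash64(v); }

/* multiplicative variant */
static uint64_t hpush2(uint64_t stack, uint64_t v) { return (0x66d6cf4cc5ddd26dULL * stack) ^ hash64(v); }
static uint64_t hpop2 (uint64_t stack, uint64_t v) { return 0x24ffe0c7fcc70765ULL * (stack ^ hash64(v)); }

int main(void) {
    uint64_t s = hash64(0); /* hstack() */
    assert(hpop (hpush (s, 42), 42) == s);
    assert(hpop2(hpush2(s, 42), 42) == s);
    return 0;
}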

29.7.2021

Added a small function constructor macro to make generating functions easier. Functions do not receive a parameter list, but a single params object. Each params object carries the number of parameters it has and a scope level so we can build closures.

To keep code footprint small, we will try to keep as many functions intact and reduce inlining to a minimum. One of the first required transformations therefore is lambda lifting, as code generation requires all functions to be in the global scope. In effect that means we turn functions with scope level >= 1 into level 0 functions, by giving them an additional context argument of tuple type that contains their higher level dependencies. Call sites then need to pass those higher level dependencies in the context argument. We start with the deepest levels first, moving on further up, aggregating context arguments.

One example:

fn (%1=(params 0 1))
    fn (%2=(params 1 1))
        fn (%3=(params 2 1))
            add (add (va 0 %1) (va 0 %2)) (va 0 %3)

after step 1

fn %1=(params 0 1)
    fn %2=(params 1 1)
        closure
            fn %3=(params 0 2)
                %4=(va 0 %3)
                add (add (at 0 %4) (at 1 %4)) (va 1 %3)
            tuple
                va 0 %1
                va 0 %2

and then finally

fn %1=(params 0 1)
    closure
        fn %2=(params 0 2)
            closure
                fn %3=(params 0 2)
                    %4=(va 0 %3)
                    add (add (at 0 %4) (at 1 %4)) (va 1 %3)
                tuple
                    at 0 (va 0 %2)
                    va 1 %2
        tuple
            va 0 %1

Another challenge is to support "function pointers" on the GPU. What we are going to do here is build tagged union types, with unit types replacing functions. Say, for example, an expression conditionally returns functions f, g and h. Our union type for that is f|g|h, and the code generator will likely use an enumerator in the inclusive range 0..2 for it. A call site is then turned into a switch case that, depending on the enum value, calls one of the three functions. We can perform this lowering in the IR already.
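(A hypothetical C rendering of that lowering, just to illustrate the shape of the generated code; f, g and h are made-up example functions.)

typedef enum { FN_F, FN_G, FN_H } FnTag; /* the union type f|g|h as an enum */

static float f(float x) { return x + 1.0f; }
static float g(float x) { return x * 2.0f; }
static float h(float x) { return x - 1.0f; }

/* what a call site through the f|g|h value lowers to */
static float call_fgh(FnTag tag, float x) {
    switch (tag) {
    case FN_F: return f(x);
    case FN_G: return g(x);
    default:   return h(x);
    }
}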

A third concern is control flow. There are three forms we are particularly interested in: break and repeat for loops, and return for functions. If we want to keep the functional nature of flow, we always have to evaluate expressions to values (and thus, types), and define implicit treatment of what I dub "control values". They are not without precedent. Our implicit treatment of option types is similar, in that an application any of whose optional arguments is none also yields an option value set to none, propagating non-evaluation backwards.

The rule for encountering a break or repeat wrapped value is then as follows: only a loop expression can consume these values, and does in fact expect one of the two. Only a fn expression can consume a return. When an expression that is not a control structure encounters one of these values, it evaluates to them, just as with option types. Their value is opaque, so can not be transformed. When a conditional expression returns control values, its type becomes a tagged union, which we can use to merge types and propagate type constraints back down.

27.7.2021

Something else: a thing I'm thinking about for a few weeks already is an audio engine that simulates minimal physical parameters for improved echolocation. The idea here is that we start, as in OpenAL, with a listener and sound sources in 3D space. But we also allow specification of acoustically reflective panels, for which we simulate one acoustic bounce to the listener. For this to work, we need to consider not only the distance between listener and sound source, but also the delay. The speed of sound in dry air at 20°C is approximately 343 m/s. That means a sound source at a distance of 10 meters has a delay of 29ms (at an amplitude of roughly -20 dB, relative to the amplitude at 1m of distance), and even at a distance of 1m, we still have 2.9ms of delay.
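(Quick sanity check of those numbers in C, assuming a simple free-field model: delay = d / c, amplitude falling with 1/d relative to 1m.)

#include <math.h>
#include <stdio.h>

int main(void) {
    const double c = 343.0; /* speed of sound in dry air at 20 degrees C, m/s */
    const double distances[] = { 1.0, 10.0 };
    for (int i = 0; i < 2; ++i) {
        double d = distances[i];
        double delay_ms = d / c * 1000.0;        /* 2.9 ms at 1 m, 29 ms at 10 m */
        double gain_db  = 20.0 * log10(1.0 / d); /* 0 dB at 1 m, -20 dB at 10 m */
        printf("%5.1f m: %5.1f ms, %6.1f dB\n", d, delay_ms, gain_db);
    }
    return 0;
}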

Because we introduce this real delay, we can implement echoes. When a reflective panel is located in the space of sound source and listener, the total distance to the listener is the sum of the distance from sound source to panel, and panel to listener. When we filter the signal, we roughly emulate the absorption of the reflecting material. Diffusion can be emulated by a low pass filter. A material can be resonant. I also assume the signal changes in quality depending on the reflection angle, but I don't know yet what that quality is. The area of the reflective panel will influence how much sound is reflected. A small area can't reflect as much as a larger one.

We can in principle model a bounce as a new sound source. That makes the most sense when we assume that the panels can move. The reflection graph does not change when the number of panels does not change. Once a bounce has happened, the moving panel can not influence the origin of the sound. In principle, we could implement the whole system with a large number of channels and channel stealing, where we reuse the most silent, most filtered channels, and thus trade number of sound effects active for number of bounces.

Let's assume we want to simulate as many bounces as possible.

When we have more than one reflective panel, we also have to consider the path between both panels. At three panels, there are 13 paths to consider: 6 infinite bounce paths between the panels (forward and back), 3 enter paths from source to panels, 3 exit paths from panels to listener, and one direct path from source to listener. For N panels and L sound sources, the total number of paths is then bounce paths B plus enter paths I plus exit paths O plus one (the direct connection), where B = N*(N - 1) and I = O = N*L, so all in all N*(N + 2*L - 1) + 1 paths. So 16 panels and 16 sources would require processing of 753 paths. It could be worse. ;-)
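(The count as a small C helper, just to check the two figures above.)

/* total paths for n reflective panels and l sound sources:
   n*(n-1) bounce paths + 2*n*l enter/exit paths + 1 direct path */
static unsigned path_count(unsigned n, unsigned l) {
    return n * (n + 2 * l - 1) + 1;
}
/* path_count(3, 1) == 13, path_count(16, 16) == 753 */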

One last important aspect I haven't thought about much yet is occlusion. Any of these paths can in principle be blocked / muffled. The working principle here is different, because occluders are path modifiers. To do occlusion correctly, we might have to introduce the concept of solid angles, and use 2nd order spherical harmonics to compute the effective blocking of sound on a path.

I'm wondering if Light Propagation Volumes could be a useful inspiration here, for a sound propagation volume, a SPV. Signals travel slowly through a SPV though. A cube N meters wide at resolution R (cells per axis) and samplerate S requires a buffering of s samples per cell, where s = N * S / (R * 343). Assuming we feed one sample per iteration at 44.1khz (the perfect model), a 32³ SPV could cover only 0.73ms of delay, or 25cm of distance along one of its primary directions. A 10m wide cube of the same resolution would require a buffering of roughly 40 samples per cell. This gives us an event resolution of I = S / s = R * 343 / N, or 1098 ticks / iterations per second. The memory requirement would be, assuming full floats, M = sizeof(f32) * s * R³, so 5 MB per volume, so 15 MB total for propagation, surface and occlusion volume. That means we move at least about M * I memory per second, so 16 GB/s of bandwidth. It's about the equivalent of processing 34 1080p RGBA8 textures per frame at 60fps. Not great, but also not terrible.
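(The same back-of-the-envelope estimate in C, with the 10m / 32³ / 44.1khz numbers plugged in; the "three volumes moved per tick" assumption stands in for propagation, surface and occlusion volume.)

#include <stdio.h>

int main(void) {
    const double N = 10.0;    /* cube edge length in meters */
    const double R = 32.0;    /* cells per axis */
    const double S = 44100.0; /* samplerate in Hz */
    const double c = 343.0;   /* speed of sound in m/s */

    double s  = N * S / (R * c);     /* samples buffered per cell, ~40 */
    double I  = S / s;               /* event resolution, ~1098 ticks per second */
    double M  = 4.0 * s * R * R * R; /* bytes per volume (f32), ~5 MB */
    double bw = 3.0 * M * I;         /* three volumes moved per tick, ~16 GiB/s */

    printf("s = %.1f samples/cell, I = %.0f ticks/s, M = %.1f MiB, bw = %.1f GiB/s\n",
           s, I, M / (1024.0 * 1024.0), bw / (1024.0 * 1024.0 * 1024.0));
    return 0;
}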

26.7.2021

I chose to avoid explicit types for IR nodes and instead wrote a typer that selectively types nodes on request by deduction. The typer can be used by different passes that need it and memoizes its results. It's currently being used by the code generator and the lowerer. I'm very happy with this choice, and the more I work with my model, the less I can understand why other IRs insisted on making types explicit everywhere.

Something I'm still thinking about is how to add source location information, to allow the compiler to talk to the user and point at problems. I don't want to tag every single node, as that interferes with memoization. We do have the problem that we need to fold expressions without folding source locations. Therefore, we should keep multiple sinks, one for each channel we're interested in, so that we can e.g. eliminate the debug info sink (for hashing), and the program still compiles. The idea here is that the debug info sink mimics the structure of main tree nodes so we can visit both in parallel. For e.g. a main node with 6 parameters, we would offer a debug node with 6 parameters and its own source location. Constants are tagged with leaf debug info, while references to other nodes resolve to further branching debug nodes. We could also only provide subnodes for references, and constants get nothing. The question is how to transparently maintain this debug node graph without forcing all rewriters to atomically inform the module of changing meta information.

Perhaps the insight can help that treating the acyclic graph like a tree (duplicating nodes) gives us a unique debug identity for every node, depending on where in the traversal stack we are. This identity is equivalent to the path taken there. A node can not have more possible identities than the number of paths that lead to it. And when a rewriter operates on a node, we always know its current path from the traversal stack. We can further predict that nodes closer to the sink have fewer identities, and sources (particularly constants) have the most identities, because so many paths lead there.

When a rewriter references other nodes, it typically either sources inputs of the node it is operating on, or nodes from a local map. For inputs sourced locally, we can easily infer their path identity by appending them to the path we're presently on. For inputs sourced from further down the graph, we might be missing possible path elements inbetween. For inputs sourced from anywhere, we don't have any path information.

We also need to incorporate parameter positions into the path to be able to reflect two distinct references. Therefore, our path format might look like index0:%id0 index1:%id1 ... indexN:%idN, with the index denoting the node's index in the parent; the path does include the node whose path identity we wish to query. The identity of the sink is 0:%sinkid.

We also know that in the CADAG, the invariant %idN-1 > %idN always holds true for paths. That means we could keep unordered paths (i.e. sets, reversed paths), and order the path only when required. We also know the path doesn't contain nodes multiple times.

When we insert nodes into the graph, for example when first building it from source, we usually do this depth first. But the path identity of the node can not be known if we do not know the ids of the parent nodes yet. So we need to keep a node-local map of ids to source location, whose keys have to be extended by the parent id as soon as it is known, and which we bubble up and merge as the graph is built.

Also, source locations are more complex than just pointing at an originating expression. In the presence of macros and instantiated or inlined functions, we want to not only point at a value, but also at the expression that caused this value to be generated (for an inlined function, it would be the callsite). This means we need a breadcrumb trail of source locations that start with the first instantiator, and end in the generated expression. We also want to distinguish pointing at the argument of an expression from pointing at the expression bound to that argument (if it is elsewhere). This could also be incorporated into the breadcrumb trail: the argument is the instantiator, the expression it points at is the generated expression. This means we need to tag breadcrumb elements with how they came to be, so we can post better debug info, for instance: original, inlined, expanded, instantiated or referenced.

All that considered, I've come up with a possible solution.

Here is the example source code:

do
    print (+ 1 (+ 1 1))    
    print (+ 2 (+ 1 1))

And the IR we build for it:

%1 = loc 2 14
%2 = const 1
%3 = di %2 %1
%4 = loc 2 19
%5 = di %2 %4
%6 = loc 2 21
%7 = di %2 %6
%8 = loc 2 16
%9 = add %2 %2
%10 = di %9 %8 (%5 %7)
%11 = loc 2 11
%12 = add %2 %9
%13 = di %12 %11 (%3 %10)
%14 = loc 2 5
%15 = print %12
%16 = di %15 %14 (%13)
%17 = loc 3 14
%18 = const 2
%19 = di %18 %17
%20 = loc 3 19
%21 = di %2 %20
%22 = loc 3 21
%23 = di %2 %22
%24 = loc 3 16
%25 = di %9 %24 (%21 %23)
%26 = loc 3 11
%27 = add %18 %9
%28 = di %27 %26 (%19 %25)
%29 = loc 3 5
%30 = print %27
%31 = di %30 %29 (%28)
%32 = loc 1 1
%33 = do %15 %30
%34 = di %33 %32 (%16 %31)

We construct two graphs here, one with and one without debug info (di) nodes, where one graph is an overlay copy of the other. In this graph, %34 is the sink, holding a direct reference to the stripped program, %33, and a source location. But it also holds the same structure as %33 itself, this time referencing di nodes wrapping the arguments. Each di node gives us, at any point, the stripped representation, maintaining the ability to do quick hash based content comparisons and map lookups.

Now we transparently maintain di nodes during descents and rewrites. Querying the type of a di node will return the type of its wrapped node. Asking for a handle returns the wrapped copy embedded in the di node. Committing the handle also commits a new stripped node created from the di node's wrapped copy. Rewriters will then consistently read and copy di arguments.

In addition, when a fresh node is built, we want the builder to build a di node instead. The contextual source location for this node is implicitly known by the builder, and used by the di node we are building.

We also use the aforementioned breadcrumb technique to chain emitter info to source locations.

While this representation is somewhat wasteful, requiring twice the memory of a stripped graph (plus location overhead), we can remove most of the cognitive overhead for rewriters and maintain debug info over rewrite steps. I believe that to be a necessary sacrifice.

25.7.2021

From working with resources in the functional tree, it becomes apparent that we need a strategy for optimizing transforms that can safely mutate the resource they work on, because they're the only ones holding a reference to it. We have three options to do this:

  1. In the code generator, always copy-on-write by default, but count valence beforehand and subtract on access to figure out when it's safe to elide the copy.
  2. Use move semantics, in which a write on a resource renders the source operator inaccessible for other writes and reads. This is a good mode for analysis, as we can observe how often this condition is triggered in a purely topologically ordered tree, whether it is controllable, and where it needs work. Users can insert manual copy operations to fix errors.
  3. Use above semantics, but auto-insert copy operations for all but the last access, and elide superfluous copy operations.

(1) is relatively easy to program, but difficult to analyze, as we, at that point, already generate the underlying code. (2) is the easiest to realize in the beginning and naturally efficient. I've implemented move semantics in Scopes before, so I know what to look out for. (3) is a natural extension of (2), possibly difficult to get right, but not impossible.
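(A rough C sketch of option (1); the resource type and the prepass that fills remaining_uses are made up for illustration.)

#include <stdlib.h>
#include <string.h>

typedef struct { float *pixels; size_t count; } Resource;

static Resource *copy_resource(const Resource *r) {
    Resource *c = malloc(sizeof *c);
    c->count  = r->count;
    c->pixels = malloc(r->count * sizeof(float));
    memcpy(c->pixels, r->pixels, r->count * sizeof(float));
    return c;
}

typedef struct {
    Resource *res;
    int remaining_uses; /* valence, counted in a topological prepass */
} TrackedResource;

/* every consumer that wants to mutate calls this: all but the last one get a
   copy, the last one may mutate in place because nobody else reads it anymore */
static Resource *acquire_for_write(TrackedResource *t) {
    if (--t->remaining_uses > 0)
        return copy_resource(t->res);
    return t->res;
}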

24.7.2021

Thinking about useful and intuitive functional building blocks to the rasterization pipeline that we can easily lower to Graphics API instructions. I had a jab at this earlier, but I think it could be better.

The primitive constructor is

M := line | linestrip | lineloop | triangle | trianglestrip | trianglefan
primitive(M, T) -> Primitive<uvec2,T,M>

Like range, it produces a range, but of vertex and instance ids, which can be transformed arithmetically.

The fragment constructors are

fragcoord(Primitive<fvec4,T,M>) -> Fragment<fvec4,T,M>
I := smooth | flat
fragment(I, Primitive<C,T,M>) -> Fragment<C,T,M>

which takes a range of matching transformed vertex coordinates and attributes and produces a set of fragment attributes, which can be used in arithmetic. smooth produces a barycentrically interpolated fragment attribute. flat produces a non-interpolated fragment attribute.

To perform depth testing, we may perform comparison operators against constants or depth buffers, which will produce a new depth buffer of matching size. There are two test functions, one for early-out depth testing, and one for custom depth fragments.

OP := min | mineq | max | maxeq | always | never
overlay(Primitive<fvec4,T,M>) -> DepthFunc<Z,W,H,M>
depthtest(OP, Primitive<fvec4,T,M>, Range<Depth,W,H>) -> DepthFunc<Z,W,H,M>
fragdepth(OP, Fragment<Z,T,M>, Range<Depth,W,H>) -> DepthFunc<Z,W,H,M>
depthof(DepthFunc<Z,W,H,M>) -> Range<Depth,W,H>

The resulting depth buffer will use a z coordinate from the fragment where the x, y coordinates align, otherwise it uses the other operator argument for that coordinate. fragdepth allows altering the depth value (read from fragcoord) before it is tested (writing to gl_FragDepth), but this will disable early Z testing. overlay rasterizes without a Z-buffer, with the last written fragment of the vertex buffer winning the pixel.

Finally, we can compare against the depth function to write fragment attributes to a framebuffer:

selectfragment(DepthFunc<Z,W,H,M>, Fragment<C,T,M>, Range<C,W,H>) -> Range<C,W,H>
mixfragment(DepthFunc<Z,W,H,M>, Fragment<C,T,M>, Range<C,W,H>, BlendOp) -> Range<C,W,H>

which takes a depth function, a fragment and a default attribute buffer to produce a new attribute buffer. Where the depth buffer matches the fragment, the fragment is written, otherwise the default attribute at that coordinate is used.

Blending is performed by describing the blending operation with a mock arithmetic tree that describes how the target value is computed. Permitted as sources are fragments and the ranges blendsource(), blendsource1() and blendtarget(). A typical alpha blending equation would, for instance, be

blendsource() * blendsource().a + blendtarget() * (1 - blendsource().a)

The compiler will attempt to canonicalize the equation and match it to the respective arguments of, e.g. glBlendFuncSeparate, glBlendEquation and glBlendColor, and only fail when the equation is too complex.

Stencil testing will likely require an extension to the DepthFunc complex.

Indexed rendering can be performed implicitly where all vertex ranges sample the same index buffer. This behavior can be recognized and alters the draw call.

23.7.2021

I invented semantics for global feedback variables in a pure, non-recursive program. I wanted a way for library functions to be able to encapsulate their own global state in a pure functional way. All we need is a loop expression whose value updates only once per program iteration. For this description, I'm choosing a Scopes syntax.

static (x0 x1 ... xN = init0 init1 ... initN)
    break result0 ... resultM
    repeat expr0 expr1 ... exprN

Describes global variables x0..xN whose initial values are the evaluated expressions init0..initN, and which are updated by the repeat clause arguments expr0..exprN once per program iteration. The updated state becomes available in the next program iteration as bound to x0..xN.

static evaluates to the break clause arguments result0..resultM. This way, the results of static can be decoupled from its isolated state.

We transform functional code so that static expressions always bubble to the top. We do this so that we can optimize shared state, and know at a glance what state footprint a program has.

Here is an example for static arguments bubbling up:

f
    static (x = init)
        break (g x)
        repeat (g x)

becomes

static (x = init)
    break (f (g x))
    repeat (g x)

Parallel statics are then going to nest:

f
    static (x = init1)
        break (p x)
        repeat (p x)
    static (y = init2)
        break (q y)
        repeat (q y)

becomes

static (x = init1)
    break
        static (y = init2)
            break 
                f (p x) (q y)
            repeat (q y)
    repeat (p x)

Here, one static can not bubble out of the other. For this situation, we use a new rule that allows us to merge statics:

static (x y = init1 init2)
    break (f (p x) (q y))
    repeat (p x) (q y)

And so, parallel statics merge into a single one.

Let's look at two other forms of chaining and how they transform. The first uses a static in the init clause:

init2 :=
    static (x = init1)
        break (x + 1)
        repeat (x + 1)
static (y = init2)
    break (y + 1)
    repeat (y + 1)

Here, static (y) only evaluates init once, in the very first iteration of the program, and so the term can be reduced to

static (y = (init1 + 1))
    break (y + 1)
    repeat (y + 1)

The second chain expression puts a static in the repeat section:

static (x = init)
    break x
    repeat
        static (y = (h x))
            break (g x (f y))
            repeat (g x (f y))

which reduces to

static (x y = init (h init))
    break x
    repeat (g x (f y)) (g x (f y))

We observe that two states are the same after one iteration, so by inlining one iteration, we reduce the dependency of repeat to a single input:

static (x1 = (g init (f (h init))))
    break x1
    repeat (g x1 (f x1))

This optimization can only be performed after all exterior clauses have been moved to the interior, because the entire program now skips one iteration in the beginning.

We also need to examine the behavior of static constructs inside conditional expressions, loops and function declarations. In all three cases, we use the bubble rule to move complex exterior expressions inside.

Let's start with conditional expressions:

if x
    static (y = init)
        break (f x y)
        repeat (f x y)
else w    

would naively transform to

static (y = init)
    break
        if x (f x y)
        else w
    repeat (f x y)      # unguarded dependency

which means a static would always evaluate, even when inside a conditional. We could go with this rule, and thus require users to move conditionals inside static expressions when they wish to model an unchanging value, or attempt to perform a transformation that leaves the conditional exterior intact.

A proper transformation would transform the conditional to the repeat clause as well. To understand what goes on here, let's do the transformation in two steps. In the first step, the break and repeat clauses are duplicated in both branches (which is invalid syntax):

static (y = init)    
    if x
        break (f x y)
        repeat (f x y)
    else
        break w
        repeat y

In the second step, we merge arguments for each clause (which is valid syntax):

static (y = init)
    let br rep =
        if x 
            tmp := (f x y)
            _ tmp tmp
        else (_ w y)
    break br
    repeat rep

A similar situation occurs with a static in each branch:

if k
    static (y = init1)
        break (f u y)
        repeat (f u y)
else
    static (z = init2)
        break (f v z)
        repeat (f v z)

Which naturally becomes

static (y z = init1 init2)
    let br repy repz =
        if k 
            tmp := (f u y)
            _ tmp tmp z
        else
            tmp := (f v z)
            _ tmp y tmp
    break br
    repeat repy repz 

Moving on to static inside loop constructs. A static will only change once per program iteration, and this will also be the case for loops. Take this example:

loop (z x = 0 init1)
    if (z < 10)
        repeat (z + 1)
            if (z == 5)
                static (y = x)
                    break (g x (f x y))
                    repeat (f x y)
            else (h x)
    else
        break x

If we assume that static only updates on the very first conditional evaluation, and then subsequently holds the same value, we can move static out of the loop by specializing the loop for the first point at which static will be initialized to get its init value:

inity :=
    loop (z x = 0 init1)
        if (z == 5)
            break x
        else
            repeat (z + 1) (h x)
static (y = inity)
    loop (z x = 0 init1)
        if (z < 10)
            repeat (z + 1)
                if (z == 5) (g x (f x y))
                else (h x)
        else
            break x
    repeat (f inity y)

But initializing from a loop variable turns this into a complex operation. If we restrict static initializers to non-loop variables, we can get around this problem without, in my opinion, sacrificing too much expressivity.

The last behavior we need to look at is static used inside of function templates. Consider this situation:

fn f (x)
    static (y = x)
        break (h x y)
        repeat (k x y)
g
    f a
    f b

As we call f twice, with two different initializers, we would expect two separate instances of static to be materialized, since their results will in fact have different outcomes. This is difficult to model in a code generator, since functions do not have instance memory. By moving static out of template bodies into macros, we can ensure they are always inlined into the callsite:

fn fbr (x y) (h x y)
fn frep (x y) (k x y)
inline f (x)
    static (y = x)
        break (fbr x y)
        repeat (frep x y)
g
    f a
    f b

Which, after inlining, becomes

fn fbr (x y) (h x y)
fn frep (x y) (k x y)
g
    static (y = a)
        break (fbr a y)
        repeat (frep a y)
    static (y = b)
        break (fbr b y)
        repeat (frep b y)

And then ultimately

fn fbr (x y) (h x y)
fn frep (x y) (k x y)
static (y1 y2 = a b)
    break (g (fbr a y1) (fbr b y2))
    repeat (frep a y1) (frep b y2)

And that's it. That should cover the gist of it.

22.7.2021

To know how big a buffer allocation needs to be, I figure out the maximum for a range definition using interval arithmetic. Unfortunately, intervals quickly grow much larger than they need to be. Take this example, where the interval is annotated on the right:

%2 = input ScreenSize <1:u32 4096:u32> <1:u32 4096:u32>  
%21 = comp 0:u32 %2 <1:u32 4096:u32>   
%32 = range %21 <0:u32 4095:u32>
%37 = comp 0:u32 %32 <0:u32 4095:u32>   
%45 = utof %37 <0.0 4095.0>   
%48 = utof %21 <1.0 4096.0>   
%51 = fdiv %45 %48 <0.0 4095.0>   

As you can see, %51 is three orders of magnitude larger than it needs to be, because interval arithmetic can't capture that %45 and %48 are connected. When we use inequality terms, we can see the relationship:

1 <= %21 <= 4096
0 <= %37 < %21
0.0 <= (%45 = utof(%37)) < utof(%21)
%48 = utof(%21)
%51 = 0.0 / utof(%21) <= utof(%37) / utof(%21) < utof(%21) / utof(%21) --> 0.0 <= (utof(%37) / utof(%21)) < 1.0
0.0 <= %51 < 1.0

Fortunately, we have access to the entire functional graph, and so can make deeper optimizations to the interval. The question is how to introduce inequality forms into the IL.
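(A tiny C illustration of where the over-approximation comes from; the interval type and division here are simplified stand-ins for what the compiler does.)

typedef struct { double lo, hi; } Interval;

/* plain interval division, assuming 0 < b.lo as in the example above */
static Interval idiv(Interval a, Interval b) {
    Interval r = { a.lo / b.hi, a.hi / b.lo };
    return r;
}

/* %45 in <0, 4095>, %48 in <1, 4096>: idiv gives <0, 4095>, because the
   division can't see that the numerator is always strictly smaller than
   the denominator; the true range of %51 is 0.0 <= %51 < 1.0. */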

21.7.2021

I get monads now, independent of the syntax in which e.g. Haskell presents them. The idea is that you make opaque change of global state explicit in your functional model, as a moving value, and enforce a logical read-write order to prevent "impossible" operations.

Take this simple pair of operations:

write(oldstate, pointer, value) -> newstate
read(state, pointer) -> value

We require that a state can only ever have one dependent write(), but as many read() ops as we like. This maintains a linear, non-parallelizable mutation order.

We also track for all expressions what state they derive from. All arguments passed to write() must either be global or read from oldstate. All arguments passed to read() must either be global or read from state. Sourcing any other state would source an invalid state.

Finally, we need conditional branching, which creates parallel realities. For this, we need a split operator:

branch(state, condition) -> {true_state, false_state}

As well as a merge operator, which requires two independent states:

merge(true_state, false_state) -> state

All these rules move states whose value we never need to know. We just need to maintain their invalidation rules, and by this, we order operations and prevent impossible dependencies on invalidated states.

15.7.2021

The first implementation for a content agnostic acyclic graph is working. Registered types can be used as nodes, and topological transforms update the tree recursively, without requiring the stack, so graphs can be arbitrarily deep.

As modules are conceptually append-only, it should be straightforward to apply multithreading later, and process the entire DAG in parallel, without needing to change any user code.

Serializing the graph is as easy as writing a u32 array to a file, and deserialization only requires reading the array back. Memory mapping can also be used. The root node is always at the tail end. Every node in the array is only depended on by nodes that appear after it.

I decided against built-in deduplication because multiple instances of the same node can be contextually significant. Deduplication can be implemented by users as a type guided transform. But we could also give types a nodedup flag and automate deduplication.

Alignment constraints on the graph ensure that valid nodes could even be read and written in a GPU shader. A node's ID is equivalent to its offset in the module buffer.

One hard constraint is that nodes can only depend on nodes in the same module. I might support weak referencing nodes in the future though.

11.7.2021

Partially inspired by this paper, I started doing a version of the DAG module where each element is a cons cell. Developing a hylomorphic fold/unfold routine was straightforward, and it does indeed seem like a good paradigm for transforms.

While cons cells compress suffixes and can map lists of arbitrary length, storing a string requires twice the memory, and there is no way to randomly seek within a list. Packing arrays of variable length, we can't start reading the last array because we don't know where it starts. Taking a lesson from cons cells, we could write the length of the array at the tail, and use indices to tails as ids. But with this technique, we can only read backwards, and a linear forward scan becomes impossible. We also need to know which words are ids so we can unfold correctly. With cons cells, that information is already aggregated in the top 3 bits of references.

9.7.2021

The smaller, range based semantic model for the FIR is working now, but it made the code quite messy.

It feels like there should be some sort of grammar for DAG processing, a way to write transforms that can cover every use case. Here are some things I need to do with DAGs:

  • Cull unreachable nodes
  • Conditionally attach named attributes to nodes. Some attributes are polymorphic (such as operator arguments), some are not. Attributes can be complex containers, such as maps and sets. We can agree that, if vertices are rows, attributes should be columns.
  • Depth-first transforms
  • Conditionally collapse identical nodes
  • Multiple passes
  • Stackless iteration
  • Find predecessors (scoped)
  • Build postdominator tree (scoped)
  • Flex Layouting
  • Scoped context transforms, such as resolving symbolic bindings or inlining functions; This one is instancing vertices, giving them scoped attributes.
  • Cyclic typing of loops
  • Lambda lifting

5.7.2021 (2)

Been thinking more lately about proving iterative loops, and it becomes clear that the reason why predicting break clauses is hard is that iteration is integration, and trying to prove that an iteration breaks is to solve roots of an indefinite integral.

Take for example for (int x = 2; x < 50; x += 3); does this loop break, and if yes, when does it break? We can rearrange this expression as 2 + ∫ 3 dx = 50, which gives us 2 + 3*x = 50, and so ⌊x⌋ = 16, and that is the number of iterations on the initial value.
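(Checking that count in C: starting at 2 and stepping by 3, the body runs ⌈(50 - 2) / 3⌉ = 16 times before the condition fails.)

#include <assert.h>

int main(void) {
    int n = 0;
    for (int x = 2; x < 50; x += 3) n++;
    assert(n == 16 && n == (50 - 2 + 3 - 1) / 3); /* brute force matches the closed form */
    return 0;
}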

Now you'd probably think "oh that's neat, so I can solve a wide range of loops", but this range is smaller than you think. Integration is an art, and integrating integer arithmetic doubly so.

To make matters worse, root finding is only trivial on trivial functions, and once you deal with unbounded peripheral values, it gets really icky. The entire problem sector is mostly unfit for compile time automation unless you brute-force it and/or use monte carlo/fuzzing techniques.

All that said, what we can trivially do is apply metrics and compute the minimum and maximum cost of an iteration, and from that estimate e.g. how many iterations fit into the allocated timeframe. It is then possible to provide a loop construct that has, in addition to its initializers and loop expression, a timeout catch expression, which is evaluated when the iteration count is exceeded. We can then always guarantee that the loop will finish "in time", one way or another, and give the programmer a simple way to decide whether to end the entire program right there, or recover from the timeout.

What this method doesn't do is to actually count time, as that's expensive. So this is mostly useful in a pure functional context where we know that the loop doesn't perform any I/O with surprising stalls.

We're also not accounting for multiple and nested loops. In principle, the entire program has to be graded. The total cost C of the program p is C(p) = K + i1*C(l1) + ... + iN*C(lN) where K is the accumulated cost of non-iterative parts and loops whose upper iteration bound is known, l1..lN are the loops of the program, and i1..iN is an unknown upper bound on the iterations those loops will take. Say we wish to solve for a budget B = C(p). After we subtract the non-iterative cost from the budget, we have the remaining cost to be spent on all loops. We would naively try to find a common maximum iteration count, so i = (B - K) / (C(l1) + ... + C(lN)). This however will lead to great power disparities if the loop costs have wildly different magnitudes. Consider a program with three loops C(l1) = 10, C(l2) = 500, C(l3) = 1 and a budget of 10000 cycles. All three loops will only be permitted to perform at most 19 iterations each, even though l1 and l3 are rather light. If we instead distribute the budget so that each loop gets an equal slice of the budget, i.e. (B - K) / countof(l1..lN), then each loop in our example gets about 3333 cycles, and so i1 = 333, i2 = 6 and i3 = 3333, in other words iX = (B - K) / (countof(l1..lN) * C(lX)).

Of course this rule is also too rigid, so we'll have to support weights; they allow the user to specify how much of the budget is to be distributed to this loop. The constraint is then, for a set of loop weights w1..wN, iX = (B - K) * wX / (sum(w1..wN) * C(lX)). For our example, let's assume our center loop gets three quarters of the budget, and the other loops shouldn't need this much time, the third one even less so, so we author w1 = 20, w2 = 75, w3 = 5. The distributed iteration limits are then i1 = 200, i2 = 15, i3 = 500.
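(The weighted rule, spelled out in C; with budget = 10000, costs 10/500/1 and weights 20/75/5 it reproduces the limits 200/15/500.)

/* distribute the remaining budget (B - K) across n loops by weight */
static void distribute_iterations(double budget, const double *cost,
                                  const double *weight, unsigned n,
                                  unsigned *limit) {
    double wsum = 0.0;
    for (unsigned i = 0; i < n; ++i) wsum += weight[i];
    for (unsigned i = 0; i < n; ++i)
        limit[i] = (unsigned)(budget * weight[i] / (wsum * cost[i]));
}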

By now you might have noticed that all of this sounds a lot like flex layouting. Our estate in this example is cycles to spend, and we distribute space across loops, starting with loops whose size is fixed, and distributing the rest by weights. The algorithm should indeed always function like a layout constraint algorithm, and more complex arrangements are conceivable here too, for example that the iteration count of two or more loops is supposed to be the same. The entire process is performed at compile time though, so it's worth thinking about.

The question remains if this is user friendly. After all, just because we can build it doesn't imply it's fun to use. The system ensures that our program remains responsive, the behavior is fixed at run time and we can reason about it. It does not interfere with loops where the iteration bound is fixed. But it requires that we properly handle iteration overflow, and how we handle it is an important question. For data processing that is connected to presentation, we will almost always want to defer additional work to the next frame, and save / restore iteration state. But giving up and presenting an error indicator is just as good a choice.

5.7.2021

The new FIL is beginning to be operational. I started with low level nodes that map to API primitives, and managed to build a working test case that dynamically renders per pixel to a window on screen.

DISPATCH, which executes a compute shader, receives two sets of BIND operators (one for input, one for output) that bind shader attributes to resources. To preserve the pure functional nature of the tree and avoid mutations, outputs aren't bound to existing resources, but to resource templates. Images, for example, are produced by a TEXSTORAGE descriptor that defines the size and format of the image the shader will be writing to. The DISPATCH expression itself then returns the actual instances of the generated objects.

Which brings us to a problem. Images can have multiple LOD levels, and sometimes layers. A compute shader can only address one LOD level at a time, and often producing the next highest LOD level requires reading the lower one. How do we do this in a system without mutation?

This is what I came up with: we allow the system to produce partially undefined/uninitialized values, for example an image with multiple lod levels, of which some levels have not been initialized yet. These values may exist, but they may not be read, and if they're read, the behavior is undefined. To make the system safe, the typechecker would catch these undefined uses.

Now, when we first generate an image with a TEXSTORAGE template, we receive the final instance of it through the DISPATCH node. Any reader of its output would only have access to lod level 0 that has been initialized. To initialize e.g. lod level 1, we bind the output of this node in a new DISPATCH operation, and specify level 1 as bind target. Since binding for writes is the only way to alter a value, we enforce the constraint that there can only ever be one write bind per lod level in one execution path (which, to allow for branching, requires scoped reference bookkeeping, not great not terrible), and the output of this operation is produced as a new DISPATCH node.

With these elements in place, the subsequent mutation of an image's lod levels is not going to produce errors, as we cannot have readers that depend on this data until after it has been written, and then new readers depend on the new instance of the existing image with the added lod level, coexisting with possible readers attached to the older values who read fewer lod levels.

The principle demonstrated here can be applied to any partially undefined value whose defined values can be known at compile time. Even when we don't know which values are truly defined at compile time, we can always provide an unsafe operation in which the user must be careful not to intersect write operations or risk corrupting readers who don't expect it. Even when e.g. write indices are not known, we can make use of interval arithmetic to prove at compile time that operations do not overlap. Writing code to satisfy such a proof should be partially possible, e.g. by bounding indices with MIN/MAX. Needs more study.

2.7.2021

In terms of topological ordering (aka time ordering) of a directed acyclic graph, a functional operator f(x,y,...) has invariant properties in terms of sequence processing. There are four classes of events an operator can trigger:

  1. When the operator is first visited, this creates and enters the operator's context. This triggers a time sequenced event that precedes all nodes the operator depends on, but may not depend on these nodes. This is not a topologically ordered event. After this event, the arguments of the operator are visited, in an order the operator sees fit.
  2. When an argument has been visited, the operator context can be reexamined. This is a time sequenced, non-topological event within the operator's context that may change the context's meaning.
  3. When all arguments of an operator have been visited, this closes and finalizes the operator's context. This triggers a time sequenced, topologically ordered event that depends on the operator's arguments.
  4. When an operator is revisited, the operator is not reentered and nothing happens. Instead, the event's earlier output is typically reused.

What we can't glean this way is how many times we're going to revisit a node; particularly the last visit is often relevant in terms of determining lifetime. This requires a topological prepass that visits nodes & counts visits per node. When we later walk the tree for real, we decrement the visit counter on each visit, and when it reaches zero, we are the last visitor to that node, and its lifetime has ended.
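
A minimal sketch of the two passes (Python, with the DAG given as a node -> arguments dict; the names are illustrative):

def count_visits(root, args):
    counts = {}
    def visit(node):
        counts[node] = counts.get(node, 0) + 1
        if counts[node] > 1:
            return                        # revisits don't descend (event class 4)
        for arg in args.get(node, ()):
            visit(arg)
    visit(root)
    return counts

def walk(root, args, counts, evaluate):
    cache = {}
    def visit(node):
        if node not in cache:             # first visit: enter, visit arguments, finalize
            cache[node] = evaluate(node, [visit(a) for a in args.get(node, ())])
        counts[node] -= 1
        if counts[node] == 0:             # we are the last visitor; the node's lifetime ends here
            return cache.pop(node)
        return cache[node]
    return visit(root)

args = {"add": ("x", "x"), "x": ()}       # x is used twice by add
print(count_visits("add", args))          # {'add': 1, 'x': 2}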

1.7.2021

Designing intermediate representations for compilers, it becomes clear to me that an IR should have 2 instruction sets: a hardware set that the code generator can use to effortlessly generate code (which is the norm), and an abstract set that the optimizing stage translates.

I see typechecking as a part of optimization. An abstract set has support for type inferred instructions and versions of hardware instructions that accept constant expressions. The job of the optimizer is to elide all abstract instructions & produce a hardware-friendly program.

There's really no benefit from completely separating the two stages and have separate syntax trees or document formats. The abstract frontend needs to duplicate all hardware instructions anyway, and once the data is in hardware format, we can't make new abstract alterations.

From the example of LLVM IR, we can see that specifications become increasingly strict with time, as code generators don't want to figure out anything, and expect their IR to be as accurate and precooked as possible. That also makes the IR less and less valuable. In other words, the idea that right below language semantics there are hardware semantics is a fiction. Inbetween there's another set of abstract features shared between many languages: system ABI compliance, macros, closures, type attributes, checking, precomputation etc

A hardware-facing IR also has the problem that without an abstract layer, it is forced to break its API as it becomes more specific. An abstract layer on the other hand can absorb changes as software features, making transitions easier. In other words, good middleware must be two-faced: a direction that faces the hardware, and a direction that faces challenges in software. When you design under the illusion that the user facing side should be SEP, you get libs like Vulkan or LLVM.

The trick with providing abstract features is to refrain from trying to isolate the backend and to instead allow users to "reach through" and use low level interfaces where needed. When there are two designers at work rather than one, the frontend and the backend typically look very different from one another. And when the frontend is not an extension of the backends datastructures, working with the infrastructure will become tedious.

30.6.2021 (3)

The question is now how to best translate the new functional IL. We need to generate

  • GPU: Various compute shader programs, one for every root that's sourced by a GATHER operation / system output nodes.
  • CPU: A struct that holds all the GPU resource handles we'll need, as well as setup and shutdown functions that acquire and release resources. The function that acquires also needs to compile shader programs and upload constant buffers.
  • CPU: The driver function that changes states and executes programs in order.

We also need to track for each node

  • Its type: either a single scalar, a scalar within a buffer, or, for rasterization, a vertex buffer object, a rasterizer or a fragment.
  • For buffer scalars: Its width, if constant, or otherwise, the range it is sourcing.
  • For buffer scalars: Its value interval (min / max) among the entire buffer. This way, when a range operator is created, we know the memory bounds of the required buffer.
  • For fragments: The set of flat/bary attrib keys this fragment depends on. Fragments can not directly depend on buffer scalars, but must go through a GATHER node.

30.6.2021 (2)

The pure functional version of a parallelized scatter operation is in its naive implementation O(n*k), but when you have a O(n) hardware operation that does the same thing, that means you can translate problems that look like the one below to a scatter op.

fn scatter(indices : int[S], values : T[S], defaults : T[D]) {
    map(range(D),
        fn (d) {
            c := filter(shuffle(range(S)), fn (s) { (indices[s] == d) })
            values[c[0]] or defaults[d]
        })
}

The additive version looks like this

fn scatter_add(indices : int[S], values : T[S], defaults : T[D]) {
    map(range(D),
        fn (d) {
            c := filter(shuffle(range(S)), fn (s) { (indices[s] == d) })
            defaults[d] + sum(map(c, fn(s) { values[s] }))
        })
}
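
For comparison, here is what the O(S + D) hardware-style operation does imperatively (Python sketch; the shuffled filter above just expresses that an arbitrary colliding write may win):

def scatter(indices, values, defaults):
    out = list(defaults)                  # D writes
    for s, d in enumerate(indices):       # S writes; on collisions an arbitrary winner is valid
        out[d] = values[s]
    return out

def scatter_add(indices, values, defaults):
    out = list(defaults)
    for s, d in enumerate(indices):
        out[d] += values[s]               # the additive version accumulates all colliding writes
    return out

print(scatter([2, 0, 2], [10, 20, 30], [0, 0, 0, 0]))       # [20, 0, 30, 0] (any collision winner is valid)
print(scatter_add([2, 0, 2], [10, 20, 30], [0, 0, 0, 0]))   # [20, 0, 40, 0]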

30.6.2021

In my current design iteration of the functional IL, all values are now scalar, and RANGE (what I previously called DIM) operators are used to create workgroups / allocate buffers.

When a GET operator sources a buffer via gather, it creates buffer boundaries and requires sync events. We will use compile time interval arithmetic to determine upper bounds on buffer sizes. Interpolating and other special fetches from values will cause the underlying buffers to be hosted as textures. A combination of scattering PUT operations defines new buffers, where the upper bound of the index value defines how large the buffer will be.

I also want to make triangle filling / shading accessible with a special scatter operator whose design I don't know yet.

What goes into triangle drawing is:

  • A 1D vertex buffer of 4 scalars whose size is typically a multiple of 3
  • A list of vertex attributes, interpolated and non-interpolated
  • In which buffer to write vertices
  • Whether depth testing is to be performed, and which kind (requires a depth buffer)

What usually goes into a fragment input is:

  • vec4 fragment coordinate, computed from auto-transformation of triangle coordinate
  • barycentrically interpolated per-vertex user attributes, such as uv-coordinate or color
  • non-interpolated per-vertex user attributes

The fragment shader then computes:

  • Depth of the fragment
  • various scalars to be generated at the fragment coordinate

So we have a two stage situation (vertex -> fragment, fragment -> framebuffer) that will require new types and several new operators.

In the abstract, these are the objects we will construct:

setup :=
    stencil on | off
    & depth test on | off
    & depth comparison
    & blendfunc
    & etc.

vertices := {x y z w}
vertex-attribs := {a b c ...}
triangles := {vertices flat-vertex-attribs bary-vertex-attribs}

fragmentf :=
        FRAGCOORD
    |   flat-attrib <attrib-index>
    |   bary-attrib <attrib-index>
    |   f(fragmentf)

depthf := fragmentf | none
attrbuffer := {setup depthf triangles fragmentf defaultbuffer}

In this system, all shader applications that use the same triangles may share the same vertex shader; all shader applications that use the same setup, the same triangles, the same depth function and the same defaultbuffer range may share the same drawcall; all shader applications that use the same depth function, and share all fragment functions among multiple drawcalls may share the same fragment shader. Defaultbuffers that are only used by one site may be moved and mutated in-place, otherwise they need to be duplicated.

We end up with these new instructions:

  • VERTICES vertex-list flat-attrib-list bary-attrib-list produces an abstract grouping of triangles. vertex-list is a LIST of equal-sized vertex buffers with keys 0..3 to map from x to w respectively. flat-attrib-list and bary-attrib-list are LIST nodes of attribute buffers of the same size as the vertex buffers, with distinct, user defined keys. Either or both attrib lists may be UNDEFINED.
  • RASTER setup-list vertices depthf produces a triangle rasterization pipeline, without any framebuffer settings. setup-list is a LIST of keys. There will be keys to setup depth testing, stencil testing and blending. vertices is a VERTICES operation. depthf is an optional expression that determines the depth value to be used for testing, and which will also be written to framebuffers, it may be UNDEFINED.
  • SHADE raster fragmentf defaultbuffer produces a framebuffer in which triangles have been rasterized i.e. it performs the actual shading. raster is a RASTER operation. fragmentf is an expression that determines the attribute to be written, and defaultbuffer is a 2d buffer that contains the values to be used for every pixel that is not shaded. All SHADE operators that source the same RASTER operation and the same defaultbuffer range will be grouped into the same draw call, with all fragment expressions merged into the same shader program. All draw calls that indirectly use the same depthf, and share all fragment functions, may share the same fragment shader.

These operators may be sourced by fragment and depth expressions:

  • FRAGCOORD index produces a fragment coordinate. index is an integer in the range 0..3 to map from x to w.
  • FLATATTR key produces a non-interpolated attribute of a fragment. key is one of the keys passed to the flat-attrib-list arguments of a VERTICES operation.
  • BARYATTR key produces a barycentric interpolated attribute of a fragment. key is one of the keys passed to the bary-attrib-list argument of a VERTICES operation.
  • DISCARD produces a discarded fragment. When the expression evaluates to this value, no fragment will be generated.

This should be enough for most rasterization tasks. We also have additional optimization opportunities by collecting the attribute keys actually sourced by a draw call, and intersecting them with the attribute lists to truncate unused keys (and also find keys that aren't defined). Ordering of list keys will also be necessary to group vertex setups.

28.6.2021

After musing about differences between CPU and GPU SIMD semantics, I think I have a better understanding now of how to decide what kind of work goes to CPU/GPU and how to design the FIL (functional IL) accordingly.

We need a semiscalar architecture. That is, we type operators to determine vector width and whether it is known at compile time or runtime, but instructions can for the most part be generated as if we're describing a single kernel.

Architecture    | CPU (Ryzen 5) | GPU (RX 480) | Raspi 4  |
----------------+---------------+--------------+----------+
Wavefront       | 8¹ * 2        | 16 * 4       | 4²       |
CUs             | 6             | 36           | 4        |
Effective lanes | 96            | 2304         | 16       |
Clock           | 3200 MHz      | 1120 MHz     | 1500 MHz |
GFLOPS          | ~224          | 5161         | 13.5     |
Memsize         | 64 GB         | 8 GB         | 2-8 GB   |
Latency         | 0.5ns         | 10ns         | 0.5ns    |
Bandwidth       | 20 GB/s       | 256 GB/s     | 4 GB/s   |
PCIe Transfer³  | 15 GB/s       | 15 GB/s      | N/A      |
Peripheral I/O  | Yes           | No           | Yes      |
Stack access    | Yes           | No           | Yes      |
IR              | LLVM          | SPIR-V       | LLVM     |
Scatter/Gather  | No / Some     | Yes          | Yes²     |
Exec Masking    | No / Some     | Yes          | Yes²     |
Shuffle Lanes   | Yes           | Group        | Yes²     |

¹ AVX/AVX2 instruction set for x86_64
² SVE instruction set for aarch64 / ARMv8
³ PCIe 3.0 x 16

Within the FIL, all values are vectors of 32 bit words (though we could provide subword strides). Vector size can either be known or unknown, in which case we need to track the source variable for the vector size.

To discern which operations to run on the GPU we need to group all operators that share the same vector size. Constant vector sizes can be dispatched from the CPU, variable sizes need to use the indirect dispatch feature, where we need a small compute shader to prepare the buffer, and then fire the dispatch command from the CPU.

There's a transfer cost for CPU -> GPU in terms of high latency and low bandwidth which can be amortized by async transfer, and not needing to wait for the GPU to send anything back before we send more. In the case that we need a talkback channel, we need a separate memory site that is allowed to run with an arbitrary frames of latency, and which becomes visible as an event.

Almost all outputs are on the CPU except three: Framebuffer, GPUOutput and CPUEvent. The same goes for inputs (of which GPUEvent receives values sent to CPUEvent), except for GPUInput (which is fed by GPUOutput).

To sort which instructions go where, we need to tag operator scope visiting the DAG topologically from the root. Sourcing a GPU input from the CPU produces a compiler error. When a CPU input is sourced from a GPU output, we need to figure out the boundary operator. This could be done with explicit hinting, a GPU x operator that transfers a vector to the GPU so that depending operators can continue there.

Instead of a MAP operator, we create parallelizable workloads from a DIM1 x, DIM2 x y or DIM3 x y z range operator, which conceptually represents three u32 vectors of size x, x * y or x * y * z, where each vector must be extracted with GETX, GETY and GETZ, respectively. The vectors identify their coordinate within the range, and can be subsequently transformed.
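
Conceptually (Python sketch, assuming a row-major mapping from linear index to coordinate, which is my assumption and not fixed yet), DIM2 and its extraction operators behave like this:

def dim2(x, y):                 # a workload of x * y elements, each knowing its 2D coordinate
    return [(i % x, i // x) for i in range(x * y)]

def getx(r): return [c[0] for c in r]
def gety(r): return [c[1] for c in r]

r = dim2(3, 2)
print(getx(r))                  # [0, 1, 2, 0, 1, 2]
print(gety(r))                  # [0, 0, 0, 1, 1, 1]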

27.6.2021

It is easiest to imagine a program cycle as a process that reads memory, transforms it, and produces a new memory state. Since we have two system architectures, CPU and GPU, with separate processing and memory, VMEM (virtual main memory) and GPURAM (video memory), a program needs two separate loop kernel functions, each with two inputs and two outputs:

IN local: read memory locally written in the last cycle
OUT local: write memory to be locally read in the next cycle
IN remote: read memory remotely written in a previous¹ cycle
OUT remote: write memory to be remotely read in a future¹ cycle

¹ latency may be greater than one frame

In addition, both loop kernels use an append buffer with cycle lifetime for scratchpad memory, and the CPU loop kernel maintains a hidden connection to peripherals, receiving and sending system events.

GPURAM access is usually compartmentalized for GPU shaders, but compute shaders have the ability to write memory. We must nevertheless find a way to distribute GPU work efficiently.

26.6.2021

Worked out loopless & branchless conversion of integers to strings. The general formula for the number of digits required to represent an unsigned N-bit number in a U-ary numeral system is ceil(N / log2(U)).
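
As a quick sanity check of the formula (Python):

from math import ceil, log2

def max_digits(N, U):           # digits needed for any unsigned N-bit number in base U
    return ceil(N / log2(U))

print(max_digits(32, 10))       # 10 -> 4294967295 has 10 decimal digits
print(max_digits(64, 16))       # 16 -> a 64-bit value needs at most 16 hex digits
print(max_digits(64, 10))       # 20 -> 18446744073709551615 has 20 decimal digits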

In other news, I'm figuring out how to make piece tables malloc free, pool the memory and reduce overhead. The idea is that we maintain each global as a flat main buffer for the program, representing its latest revision. In addition, to build the next revision, we keep an append buffer as scratchpad memory that is cleared prior to each frame and allows multithreaded allocations via atomic add. This append buffer stores computed intermediate values as well as new piece tables themselves, as one cycle of the program is running. A piece table slice then references either memory in the main buffer (from the latest revision), or memory in the append buffer, i.e. for the next revision.

When a frame is completed i.e. the current revision committed, the final piece table is used to copy memory back to the flat memory buffer, where identity slices can be skipped i.e. slices whose offset in the piece table is the same as their offset in the main buffer. Because editing may swap out memory in the main buffer, slices can have a complex cyclical dependency (slice A needs memory that slice B overwrites, slice B needs memory overwritten by slice C, slice C needs memory overwritten by slice A), and a simple memcpy doesn't cut it. The stupidest solution here is to use double buffering, but this doubles the required main memory, and it also requires applying changes from the last frame as well, since each double buffer is two frames old.

The next best solution I can think of is to qualify slices as pointing back or equal/forward. equal/forward pointing slices are caused by unchanged memory and memory compaction respectively, and this we want to encourage and make fast. backward pointing slices are making copies. We could simply memcpy forward pointing slices from front to back of the buffer, as the memory we copy from has not been invalidated yet, but we are going to invalidate the memory of backward pointing slices. So the first step is, we collect all backward pointing regions, sort them by offset and merge connected regions. In the next step we make copies of those connected regions into the append buffer and create a map. Now we build the new buffer, front to back, copying the forward pointing slices directly, whereas backward pointing slices source the copies.
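
A rough sketch of that commit step (Python, assuming the new revision's piece table is given in destination order as (dst, src, length, in_append) tuples over bytearrays, and that the new revision is no larger than the main buffer):

def commit(main, append, slices):
    # 1. collect the source regions of backward pointing slices (src < dst, main buffer only)
    regions = sorted((src, src + length) for dst, src, length, in_append in slices
                     if not in_append and src < dst)
    # 2. merge connected regions and copy them out before they can be invalidated
    saved = []                                   # (begin, end, copy of main[begin:end])
    for begin, end in regions:
        if saved and begin <= saved[-1][1]:
            begin, end = saved[-1][0], max(saved[-1][1], end)
            saved[-1] = (begin, end, main[begin:end])
        else:
            saved.append((begin, end, main[begin:end]))
    # 3. build the new buffer front to back; forward/equal slices still find their source
    #    intact, backward slices read the saved copies, append slices read the scratchpad
    for dst, src, length, in_append in slices:
        if in_append:
            main[dst:dst + length] = append[src:src + length]
        elif src >= dst:                         # equal/forward pointing: plain copy
            main[dst:dst + length] = main[src:src + length]
        else:                                    # backward pointing: source the saved copy
            begin, end, copy = next(r for r in saved if r[0] <= src and src + length <= r[1])
            main[dst:dst + length] = copy[src - begin:src - begin + length]
    return main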

Alternatively, we could weigh volume of forward vs backward pointing slices and use temporary values for the smaller volume, not favoring either technique.

25.6.2021

Trying to figure out how to create transparent interop across CPU and GPU, I'm considering a semantic metamodel for Scopes, in which we provide functional building blocks that generate code at the latest moment, when a root node needs to manifest an actual value. Scopes' typechecker would be used to build the functional graph (as Scopes types), without generating instructions, and then custom graph handlers would generate the equivalent code in question.

The benefit of this is the ability to do whole function analysis before we begin, and to generate cross-architecture, boilerplate & resource init/teardown code from the same representation. Similar to register allocation, we may track operation lifetimes and recycle existing resources to reduce waste.

24.6.2021 (3)

I wonder if it's useful for analysis and optimization to entirely remove all interior loops of a program description, in favor of one exterior loop. I came up with a variant of a flow diagram that I call Downward-Directed Monocyclic Graph, which is similar to a braid word, but permits merging and splitting of edges. In the DDMG, all edges must strictly point downward, and where an upward edge must be enforced due to a cyclical relationship, the vertex requiring this edge is instead split into two halves arranged at top and bottom of the graph, with all out-edges emerging at the top, and all in-edges converging at the bottom. So conceptually, the graph is drawn on a cylindrical surface. Due to this arrangement, each edge is only traversed once per cycle.

The DDMG representation helps to see connected components, chronological data dependence, synchronization points, lifetimes, as well as opportunities for asynchronous parallelization and loop decoupling. All horizontally alignable nodes in the DDMG can be evaluated simultaneously.

Translating the lessons learned from the DDMG to SSA representation, we arrange all reentrant basic blocks horizontally at the top, and all basic blocks breaking to those reentrant blocks horizontally at the bottom. This arrangement is enforced by the rule that, with exception of the topmost basic blocks (which may source any value), the phi nodes of a block must always be pointing up, and, with exception of the bottommost basic blocks (which may only break to topmost basic blocks), the break instructions must be pointing down. Logically, all basic blocks inbetween will be conditional blocks.

Further restrictions to our DDMG compatible SSA are that

  • The SSA doesn't have recursive functions or functions with reentrant basic blocks i.e. functions can not have any loops.
  • With exception of the topmost basic block, arguments in basic blocks may only depend on instructions assigned further up. That means that even if a block is cyclically dominated by a block further down, it may not use any of its variables. The correct fix for this situation is to rotate the order of strongly-connected blocks within the loop in order to fix the edge direction.

To further improve reasoning over cycles in the SSA, we may further require that

  • Instead of phi nodes, basic blocks are equipped with arguments, which are passed by break statements. This removes confusing dependencies of topmost phi nodes to values defined further down.
  • Values must never be sourced from a preceding dominating basic block, but explicitly passed as block arguments through all basic blocks in which they will be required. This simplifies lambda lifting (turning blocks into functions), rearranging and folding of common blocks, lifetime analysis (as values not passed on become inaccessible), and most importantly, removes all dependencies of topmost phi nodes to any values defined further down.

24.6.2021 (2)

It's amazing how a pure functional description of a program is complete, and yet implementing it efficiently requires asynchronous shoveling of memory between five different locations in hardware (not counting peripherals), with each pathway having different interface semantics:

Pathway                 | I/O Semantics                               | Interface       | R/W Speed*   | Latency*
------------------------+---------------------------------------------+-----------------+--------------+-----------
   network¹² <->   VMEM | recv()              <-> send()              | socket.h        | 12/1 MB/s    | 10ms-800ms
      disk¹² <->   VMEM | mmap / fread()      <-> fwrite()            | stdio.h, mmap.h | 500/200 MB/s | 1ms-4ms
peripheral¹² <->   VMEM | various                                     | various         | 1MB-1GB/s    | 0.125ms
    GPURAM¹² <->   VMEM | DMA download        <-> DMA upload          | GPU API         | 20 GB/s³     | 1500ns
    GPURAM¹² <-> GPURAM | various                                     | GPU API         | 320 GB/s     | 10ns
       stack <->   VMEM | malloc() & memcpy() <-> alloca() & memcpy() | ASM & memory.h  | 20 GB/s      | 0.5ns
      VMEM¹² <->   VMEM | memcpy()                                    | memory.h        | 20 GB/s      | 0.5ns
    register <->   VMEM | load                <-> store               | ASM             | 20 GB/s      | 0.5ns

¹ requires resource management
² requires or supports asynchronous I/O
³ bottlenecked by slower participant
* values are anecdotal/ballpark

There are only three reasons to shovel data around: storage, processing or interfacing with the user. Each of the locations gives us access to different features:

Location   | Lifetime*    | Persistence | Processor  | Op/s* 
-----------+--------------+-------------+------------+------------
network    | Any          | Any         | Any Remote | Any
disk       | Years        | Yes         | Any Local  | Any
peripheral | Milliseconds | No          | User       | < 1
GPURAM     | Hours        | No          | GPU        | 1-10 TFLOPS
RAM        | Hours        | No          | CPU        | 1-3 GHz
register   | Nanoseconds  | No          | CPU        | 1-10 GFLOPS

* values are anecdotal/ballpark

Also made some notes on braiding programs.

24.6.2021

Idea for exponential interval arithmetic: store intervals not as [min..max] value, but as [floor(log2(min)) .. ceil(log2(max))], i.e. as exponent bits. 6:6 bits are enough to describe min/max for fixed point 64 bit value, 8:8 bits can describe intervals between 2^-128..2^127.

It's also possible to e.g. use a full 32 bit integer to store 2 bits, minbit and maxbit overlapping. v & -v extracts the minbit, v - minbit gets the maxbit.
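
In Python, assuming at most two bits are set:

v = (1 << 3) | (1 << 9)         # interval spanning exponents 3..9
minbit = v & -v                 # lowest set bit
maxbit = v - minbit             # highest set bit, since at most two bits are set
print(minbit, maxbit)           # 8 512
# caveat: with a single (overlapping) bit set, maxbit comes out as 0 and means "same as minbit"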

But is it practical? probably not.

19.6.2021

Got a replacement MSI x470 mobo and a Ryzen 5 1600. It's about the same setup as the last one, but with two additional cores, so that's nice for compiling C++.

Used the time on the Raspi 4 to fix Scopes issues with the aarch64 ABI, and now all test cases run. I also installed Windows 10 on a VM and went to work on the latest issues with Scopes, and now all problems are fixed there too, and we support LLVM 12. Ain't that nice?

13.6.2021

My motherboard died overnight, and I'm using my Raspberry Pi 4 as desktop for the time being.

It occurred to me that, with extensions, a piece table can pretty much describe the entire program. All it requires is to allow procedural slices, which are lazily executed when the byte value for a particular table cell is required. But this also means in part that the size of a slice (and thus the offset of following slices) is dependent on input values. And even the offset into the source can be a variable.

What I like about this representation though is that slice processing is canonically parallelized (compared to sequential immutable operations), that data size can be computed independent from data content, and that non-procedural indirections can be folded. We have data and code in one single representation.

12.6.2021

It appears I have pretty good chances of getting my "typeless" LLVM/SPIR-V-friendly functional DAG IL to work. Every value is just a flat array of bits (a "bitring"). Vectorization works via stride(value,bitwidth); i.e. neg(stride(257,8)) is stride(65535,8). Tuple-ish access is get(stride(x, wordsize),wordoffset,wordcount).

Every instruction is functional and four 32-bit words wide. The IL is homoiconic, in that the IL itself can be represented as a flat array of instructions.

There are no system level pointers or references in this model, apart from the ones implemented by indirect use of get. Composition works by joining bitrings into greater bitrings. How temporaries are stored is an implementation/optimization detail. The code generator may opt to use register values, vectors, stack arrays, and, for larger structures, piece tables as a more versatile version of slices.

Here's the instruction list so far, by category, without any further documentation.

CONSTANTS
=========

UNDEFINED
CONST bitwidth lo32 hi32

VECTORIZATION
=============

STRIDE x bitwidth

COMPOSITION
===========

GET v wordoffset wordcount
PUT v wordoffset element
SHUFFLE v1 v2 mask
LEN v

FLOW
====

SELECT cond true-expr false-expr
IF cond true-expr

INTEGER MATH
============

ADD v1 v2
SUB v1 v2
MUL v1 v2
UDIV v1 v2
SDIV v1 v2
UREM v1 v2
SREM v1 v2

SHL v1 v2
USHR v1 v2
SSHR v1 v2
AND v1 v2
OR v1 v2
XOR v1 v2

EQ v1 v2
NE v1 v2
UGT v1 v2
UGE v1 v2
ULT v1 v2
ULE v1 v2
SGT v1 v2
SGE v1 v2
SLT v1 v2
SLE v1 v2

FLOAT MATH
==========

FADD v1 v2
FSUB v1 v2
FMUL v1 v2
FDIV v1 v2
FREM v1 v2
FNEG v

FOEQ v1 v2
FONE v1 v2
FOGT v1 v2
FOGE v1 v2
FOLT v1 v2
FOLE v1 v2

FUEQ v1 v2
FUNE v1 v2
FUGT v1 v2
FUGE v1 v2
FULT v1 v2
FULE v1 v2

FO v1 v2
FU v1 v2

CONVERSION
==========

TRUNC v bitwidth
UEXT v bitwidth
SEXT v bitwidth
FTOF v bitwidth
FTOU v bitwidth
FTOS v bitwidth
UTOF v bitwidth
STOF v bitwidth

FUNCTIONS
=========

PARAMS closure-level
FN closure-level expr
CALL fn args

REPETITION
==========

FOR init-args repeat-fn next-fn
MAP count-vector map-fn

PROGRAM
=======

MAIN fn

A program must only have one main function. This function is evaluated once per iteration, with an array of input values provided, and is expected to produce an array of output values in return.

Input values and output values have a fixed binary format, starting with a header of flags indicating which fields are active, followed by a table of contents for the enabled flags, and then the actual content.

9.6.2021

It is possible to simplify conditionals that depend on the evaluation of other conditional values, so that each branch depends on a root condition only. This requires specialization and expansion of conditional terms, which also gives us opportunities to fold constant expressions.

Take this example, in which both branches of d as well as the conditional c depend on conditional value b which depends on unknown conditional x, preventing the folding of conditional constants in a:

# pure functional
a = x?-1:1
b = (a + 1)
c = (b == 0)
d = c?f(b):g(b)

# scoped imperative form
a =
    if x
        -1
    else
        1
b = (a + 1)
c = (b == 0)
if c
    f(b)
else
    g(b)

After applying these transformation rules

[1] f(u?a:b) -> u?f(a):f(b)
[2] u?f(u):g(u) -> u?f(true):g(false)
[3] (u?a:b)?c:d -> u?(a?c:d):(b?c:d)

we arrive at this program, in which d only depends on the simple condition x, and its previously exterior dependencies now diverge within the branch:

# after, pure functional
a1 = -1 # under x = true
a0 = 1 # under x = false
b1 = a1 + 1 = 0 # under x = true
b0 = a0 + 1 = 2 # under x = false
c1 = b1 == 0 = true # under x = true
c0 = b0 == 0 = false # under x = false
d1 = c1?f(b1):g(b1) = f(0) # under x = true
d0 = c0?f(b0):g(b0) = g(2) # under x = false
d = x?d1:d0

# final reduced program
d = x?f(0):g(2)

I'm also trying to figure out an elegant way to elevate subexpressions shared by multiple branches into a shared scope (i.e. a dominating block) without having to change the pure functional structure of the AST (also required by loop-invariant code motion). It's something that only comes into play when the SSA code is generated, but at that time, we need to have the right order for what goes into which branch. Take this simple example, which is presented here in SSA form but is in fact part of a DAG whose topological order is implicit:

a = f()
b = x?u(a):v(a)

Without elevating the subexpression a, we'd generate this code:

b =
    if x
        a = f()
        u(a)
    else
        a = f()
        v(a)

But the code we want to generate is:

a = f()
b =
    if x
        u(a)
    else
        v(a)

How do we discover & reflect this property in our DAG? The first thing we may think of is to propagate and merge the conditions in which an expression appears, top down:

a = f()         # x|!x = true
u' = u(a)       # x
v' = v(a)       # !x
b = x?u':v'

If we decomplicate conditions as earlier described, our branches would all be tree shaped, and each condition would dominate all its subexpressions, leading to scope description in the form of simple ordered boolean products such as a & !b & !c & d & ..., which could also be read as a unique path descriptor true_branch[a][true][b][false][c][false][d][true] ... leading us to the block in which the expression belongs. In this setup, b from the example above would most certainly be the root node, because through our transformation all conditional expressions bubbled upwards. Following the topology of the graph then, we would be able to generate this:

a = f() # belongs in path true_branch
if x
    u' = u(a) # `if x` is implicitly generated by u's dependency on it
else
    v' = v(a) # we insert v' into the preexisting else-block of x

But we still want to be able to perform lambda lifting, and that in turn would again make conditions more complex. Besides, some common subexpressions are shared by loops, and for those, conditional paths wouldn't really work (without an extension). So what is the most elegant, useful way here?

The only thing I can think of right now, is to introduce explicit dependency nodes whose job is to do nothing but to alter the topological processing order. A dep(source, other...) node must be used in place of the expression it depends on, and can chain one (or more?) other nodes to the processing, like this:

a = f()         # x|!x = true
u' = u(a)       # x
v' = v(a)       # !x
b = x?u':v'
b' = dep(b, a) # forces a to be evaluated early

A dep node holds a weak reference on the second argument and should be elided when it holds the only (other) edge to the resource, as there is only one consumer to it. It must also be the first user of that node - if it depends on a node that has already been visited earlier in the topological order, it is useless and can also be elided. The question remains how we best discover such dependencies.

For loop-invariant expressions, we tag loop arguments as loop-variant, and then propagate this status depth-first along their dependencies, updating a set of the earliest untagged node(s), in which untagged nodes outrank the untagged nodes they depend on as we move up the DAG.

For shared conditional expressions cond?a:b, we tag branch a breadth-first as of branch cond, and then the second one as of branch !cond. Where we discover new nodes already tagged for a branch we are not currently tagging for, we record this node as a dep candidate. Once we're done with the condition, we rewrite the node to dep(cond?a:b, candidates...). It is possible that a third branch later discovers a node we have already added a dep for, but that will simply create an earlier dep node further up in the DAG. For our branch tags, we also maintain a hierarchy, so that we can find the parent branch tag of a branch.

We do not necessarily have to store these dep nodes in the tree, but can keep them as a reverse node -> {node,branch_index} map that we're able to refresh at any time; here, the common shared node is the key, and the conditional node & branch that depends on it is the value. When a conditional node has already been recorded, we find a shared parent condition of our branches (the lowest common ancestor of a binary tree). With a parent list, this is simple: build a set from all parents of branch A, then follow the parents of branch B until we find the first parent in the set.
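
A small sketch of that lookup (Python; assumes both branches share a common root so the walk terminates):

def lowest_common_ancestor(parents, a, b):
    ancestors = set()
    node = a
    while node is not None:      # collect a and all of its parents
        ancestors.add(node)
        node = parents.get(node)
    node = b
    while node not in ancestors: # walk up from b until we hit that set
        node = parents.get(node)
    return node

# toy hierarchy: condition x with branches x.true / x.false, and y nested under x.true
parents = {"x": None, "x.true": "x", "x.false": "x", "y": "x.true"}
print(lowest_common_ancestor(parents, "y", "x.false"))   # x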

8.6.2021

Here's a generalized solution for a ray-polyhedron intersection test for regular convex polyhedra with opposing parallel planes (cube, octahedron, etc. but not tetrahedron):

    vec3 ro, rd; // provided ray origin ro and ray direction rd
    vec3 n[N]; // provided plane normals of the polyhedron, one per pair of parallel planes
    float r; // provided plane radius of polyhedron; typically 1.0
    
    float near = -inf;
    float far = inf;
    for (int i = 0; i < N; ++i) { // consider unrolling and vectorizing this loop for few planes
        float p = -dot(ro, n[i]); // n is one of the polyhedron's plane normals
        float d = dot(rd, n[i]);
        float sd = sign(d) * r;
        float tmin = (p - sd) / d;
        float tmax = (p + sd) / d;
        near = max(near, tmin); // standard intersection procedure for convex bodies
        far = min(far, tmax);
    }
    // true when an intersection exists
    bool hit = near <= far && far >= 0.0;

5.6.2021

There appears to be no closed form solution for a*x + b = cos(x). That's really sad, although not surprising, considering that equations that have multiple roots are in general difficult to solve. If I understood complex numbers better, I could perhaps do more here.

The greater problem that I wanted to solve is this: consider a tangent line a*x + b on cos(t) over the interval t = [1.5 .. 2]*pi. The gradient a of the line is the derivative -sin(t). In the interval [0 .. 2]*pi, every tangent line -sin(t)*x + cos(t) + sin(t)*t intersects cos(x) in two points: at our known tangent position t, and at an earlier point in the interval s = [0 .. 1.5]*pi, s < t. If we had a solution for a*s + b = cos(s) for this interval, we could solve the problem. For a normalized u = [0..1] so that t = PI*(3 + 1*u)/2, the solution appears to be close to s ~ acos(u)*3, but lags behind until u ~ 0.96870276, and then overtakes the result. acos(pow(u,0.96))*3 produces a result of greater precision, but still is not exact.

1.6.2021

The quadratic function

f(x) = A*x² + B*x + C

can be bounded, over the unit interval x = [0..1], with a wedge interval

g(x) = x0 + x1*x ± e*x

so that

x0 = C
x1 = B + A*0.5
e = abs(A*0.5)

that is, the interval of g'(x) is B + [min(A,0),max(A,0)]
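
A quick numeric check (Python), assuming the bound is meant over the unit interval x = [0..1]:

A, B, C = 3.0, -2.0, 0.5
x0, x1, e = C, B + A * 0.5, abs(A * 0.5)
for i in range(101):
    x = i / 100.0
    f = A * x * x + B * x + C
    center = x0 + x1 * x
    assert center - e * x - 1e-9 <= f <= center + e * x + 1e-9
print("wedge bound holds on [0, 1]")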

30.5.2021

Some notes on variability within topological ordering of operations:

Consider the setup A -> B -> C -> D, A -> E -> D. There are three possible topological orders here, A B C E D, A E B C D and A B E C D. In other words, E has freedom to move closer to A or D, or choose equal distance to either. This describes an optimization problem: which order is best?

It appears that the answer depends on the temporary memory required to keep A and E alive. If A has more "mass" than E, then E should process early to shorten the lifetime of A. If A is lighter than E, then E should process late. However, in a parallel setup, we might prefer to always execute E early: if A is heavier than E, it's the right choice anyway, and if A is lighter than E, it stands to reason that E is not only heavy but also costlier to compute, and so should start its work early.

I drew a diagram to describe this variability.

The second problem is topologically ordering nodes that depend on a composite resource so that simple read operations on the composite are all performed and executed before the composite is transformed, because the last transform of a composite can be performed as a move, and therefore be converted to a mutation. A cheap way to solve this, without having to explicitly track out-edges, could be to first compute a propagating cost for every node, where move-friendly transformations cost more, and then ordering parallel edge traversal by cost.

I'm also thinking about how to accelerate bidirectional adjacency queries without having to do a lot of patching when changing things. Some kind of sparse adjacency structure would be useful. If we restrict operators so they have at most one output, and at most four inputs (of which the widest are cond c t f and set c k v), then these are the queries we are interested in:

  1. How many sources does a node have (0..4)
  2. What are the sources of a node (zero to four sources)
  3. How many sinks does a node have (unlimited)
  4. What are the sinks of that node, and the indices within those sinks (set of (vertex,0..4) pairs)

We answer these questions with the following structures:

  1. We maintain a definition map vertex-id -> const|(vector vertex-id (0..4)), where the size of the vector is the number of sources.
  2. We use the definition map to iterate the sources.
  3. We maintain a usage map source-id -> (sink-id -> indexmask). The size of the retrieved map is the number of unique sinks. indexmask tags in bits the arguments that source a vertex.
  4. We use the usage map to iterate the sinks.

Changes in the DAG then entail this:

  • Add a new vertex by allocating a new or recycled free id, update the definition map, then update the usage map for its sources.
  • Remove a vertex by updating the usage maps for its sources, as well as updating the definitions from its own usage entry, removing its definition, then freeing its id.
  • Swapping out the definition of a vertex then requires reading out the previous definition to update the usage maps of sources, followed by the definition swap. The users of this definition do not need to be updated.
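
A minimal sketch of the two maps and the vertex operations (Python dicts standing in for the sparse structures; the names are illustrative):

definitions = {}   # vertex-id -> tuple of source vertex-ids (0..4 of them)
usages = {}        # source-id  -> {sink-id: indexmask}

def add_vertex(vid, sources):
    definitions[vid] = tuple(sources)
    for index, src in enumerate(sources):
        usages.setdefault(src, {})
        usages[src][vid] = usages[src].get(vid, 0) | (1 << index)

def remove_vertex(vid):
    for index, src in enumerate(definitions.pop(vid, ())):
        mask = usages[src][vid] & ~(1 << index)
        if mask:
            usages[src][vid] = mask
        else:
            del usages[src][vid]
    usages.pop(vid, None)  # note: sinks of vid must have been removed or redefined first

# queries 1-4 from the list above
add_vertex("a", []); add_vertex("b", []); add_vertex("c", ["a", "b", "a"])
print(len(definitions["c"]))   # 3 sources
print(definitions["c"])        # ('a', 'b', 'a')
print(len(usages["a"]))        # 1 unique sink
print(usages["a"])             # {'c': 5}, i.e. bits 0 and 2 set: c sources 'a' as arguments 0 and 2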

29.5.2021

Some writings about bit rings

27.5.2021

To canonicalize a function, we must define the function header so that two functions of equal definition produce the same expression, regardless of argument names or order. We must reparametrize parameters to appear in topological order, and remove unused parameters, which requires modifications at the call site, see also 13.4.2021 B.

I propose this S-expression syntax:

parameter def: (<num_params> . <parent parameter def>)
parameter: (builtin.param <index> [<level>=0])

with types, the definition

fn (a : i32, b : i32, c : bool)
    fn (x : f32, y : f32)
        \ a b c x y

becomes

fn (i32 i32 bool)
    fn (f32 f32 (i32 i32 bool))
        builtin.param 0
        builtin.param 1
        builtin.param 2
        builtin.param 0 1
        builtin.param 1 1

When a closure doesn't capture any parent variables, it becomes a root function.

24.5.2021

Model for a Call-Continuation VM:

In addition to our atomic types none, symbol, string, number, bool and map, we only need two structures:

struct Closure
    function : symbol # abstract "address" of the function
    context : map     # closure context

struct Instruction
    cont : closure    # function to continue with
    return : closure  # return function
    monad : map       # global "mutable" environment
    args : map        # arguments to function

Conceptually, the program consists then of a single switch block inside a loop that "executes" the right function depending on the currently selected symbol.

loop (instruction = init-instruction)
    switch instruction.cont.function
    case SYMBOL-CONSTANT
        Instruction <cont> <return> <monad> <args>
    ...
    default
        error "illegal address"

This simple single-threaded setup is capable of simulating a stack, call-cc, function local and global mutation, can revert to any previous state and fully contains the complete state of the system at every iteration. All intrinsic instructions and forms are also cases of the switch block. All values are immutable. Reference counting suffices to perform memory management outside the monad. For monad-held memory, an analog to free can be used, as well as borrow checking, reverting to an earlier monad, or indiscriminately removing all keys but a select few.

A notation for an instruction in the CCVM could be:

name: target-name|target-expr arg-expr ... [-> return-name ctx-expr ...]
                              |            |               |
                              |            |               +-- mapped to c0..cN
                              |            +-- mapped to ret
                              +-- mapped to x0..xN            
c0..cN: context values
x0..xN: argument values
ret: continuation
labels specified as call arguments bind to the active context
label.<value> can be used to access values from previously visited labels

This fibonacci function

fn fib (n)
    cond (< n 2) n
        +
            fib (- n 1)
            fib (- n 2)

would then be translated as follows

fib: cmplt x0 2 -> _1           # (< n 2)
_1: condbr x0 _3 _4             # cond (< n 2) ...
_3: fib.ret fib.x0              # return n
_4: sub fib.x0 1 -> _5          # (- n 1)
_5: sub fib.x0 2 -> _6          # (- n 2)
_6: fib _5.x0 -> _7             # fib (- n 1)
_7: fib _6.x0 -> _8             # fib (- n 2)
_8: add _7.x0 x0 -> fib.ret     # ret (+ (fib (- n 1)) (fib (- n 2)))

which after lifetime optimization (walking instructions in reverse topological order, moving non-local references to callers and replacing them with context arguments) would look like this

fib: cmplt x0 2 -> _1 x0 ret    # (< n 2) | n ret
_1: condbr x0 _3 _4 -> () c0 c1 # cond (< n 2) ... | n ret
_3: c1 c0                       # return n
_4: sub c0 1 -> _5 c0 c1        # (- n 1) | n ret
_5: sub c0 2 -> _6 x0 c1        # (- n 2) | (- n 1) ret
_6: fib c0 -> _7 x0 c1          # fib (- n 1) | (- n 2) ret
_7: fib c0 -> _8 x0 c1          # fib (- n 2) | (fib (- n 1)) ret
_8: add c0 x0 -> c1             # ret (+ (fib (- n 1)) (fib (- n 2)))

and after typing would appear like this

fib(number): cmplt x0 2 -> _1 x0 ret
_1(bool){number closure}: condbr x0 _3 _4 -> () c0 c1
_3(){number closure}: c1 c0
_4(){number closure}: sub c0 1 -> _5 c0 c1
_5(number){number closure}: sub c0 2 -> _6 x0 c1
_6(number){number closure}: fib c0 -> _7 x0 c1
_7(number){number closure}: fib c0 -> _8 x0 c1
_8(number){number closure}: add c0 x0 -> c1

22.5.2021

A Twitter conversation this morning inspired me once more to think about introducing a database-like model for operating on composites to Tukan. Tukan presently uses stem "cells" to perform composite operations; a cell is an associative map, which can act as an array, a tuple or a dictionary. But specialization of cells is not so straightforward, and bringing nested structures of cells into the same format (e.g. ensuring that a cell represents an array of 4-vectors of floats) is also convoluted.

What could help then is an abstract specialization that forces the user to think a little more about structure, but not too much. Databases are complete structures with a fixed hierarchy of elements at every level. At the leaf level, we have the elemental types: booleans, numbers, text and references (which I'll get to). We can join elements of different types into rows, which are equivalent to tuples. Each element must be associated with a descriptive key, which is a pair of a field symbol and the element type. Rows then can be joined into a table, as an array of rows, where they can be addressed by index. The keys of those rows are joined to columns at the orthogonal axis. Elements can only share a common column when their keys are all the same. The rows in a table must all conceptually share the same columns, so for polymorphic rows, some of these columns will have holes where a row doesn't provide an element that fits. Finally, we can join tables together into a database, in which each table is associated with a symbol.

References in a database refer to rows within the same table or a different table in the same database; they can not refer to rows within different databases. References within a single database can be managed together, and fixed accordingly, when a referenced row is removed or its order changes. Different databases can only loosely reference each other, and these references may break.

What makes the concept of databases useful is that we can establish and interrogate a schema across nested datastructures as they are constructed, as well as maintain a flat perspective on all data present in the program. We can statically analyze queries performed on it and maintain auxiliary datastructures that make these queries fast. We can represent changes between these databases as a list of operations that can be used as a changeset.

The question is: can we edit databases in a pure functional style? Conceptually, we only need these operations:

  • new(db, table-symbol) -> (db, id) which creates the table if it doesn't yet exist, and allocates an empty row within it. The returned value associates the changed db, the table, and the row within it.
  • set(db, id, key, value) -> (db, id) which updates an entry in an existing table. The column is added when the key is new, then the value is associated with the respective row and column. If the value is none, it is cleared. If all values for a column are cleared, the column is removed.
  • del(db, id) -> db which removes a row from a database and returns the new db. If the row was the last entry of a table, the table is removed as well.
  • get(db, id, key) -> value returns the value associated with the row and column or none if it doesn't exist.
  • find(db, table-symbol | cursor, key, value) -> cursor returns a set of ids whose key matches value.
  • first(db, table-symbol | cursor) -> (id, cursor) returns the first row of the set or none and a new cursor without the first row.

The types encountered are:

  • Our elemental types
  • db, which is an opaque transaction on our db (see below).
  • id, which is an opaque reference to table symbol and row reference. This is the only type that can be passed as value to set.
  • cursor, which is an opaque reference to a table symbol and an ordered set of rows.

A chain of transformations via new/set/del becomes a prospective transaction. When the program iteration ends, all transaction lifetimes collapse but one (or none). This remaining one is the definite transaction that is applied to the db before the next iteration begins.

The db transaction handle could become a transparent monad, so that operations in the system appear like mutations.
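
A minimal sketch of the first few operations (Python; copy-on-write dicts stand in for the db value, and column bookkeeping, del/find/first and cursors are omitted):

def db_new(db, table):
    tables = dict(db); rows = dict(tables.get(table, {}))
    rowid = max(rows, default=-1) + 1            # allocate an empty row
    rows[rowid] = {}; tables[table] = rows
    return tables, (table, rowid)                # (changed db, opaque id)

def db_set(db, ident, key, value):
    table, rowid = ident
    tables = dict(db); rows = dict(tables[table]); row = dict(rows[rowid])
    if value is None: row.pop(key, None)         # none clears the entry
    else: row[key] = value
    rows[rowid] = row; tables[table] = rows
    return tables, ident

def db_get(db, ident, key):
    table, rowid = ident
    return db.get(table, {}).get(rowid, {}).get(key)

db0 = {}
db1, pt = db_new(db0, "points")
db2, pt = db_set(db1, pt, "x", 1.0)
db3, pt = db_set(db2, pt, "y", 2.0)
print(db_get(db3, pt, "y"), db_get(db1, pt, "y"))   # 2.0 None -- older db values stay untouched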

What troubles me is that rows are not being kept alive by other values holding a reference on them, and so we will need to collect garbage by analyzing the tables and columns that the program is accessing. When a column of a table is never queried, we can drop the column and elide all insertions for it. When a table is never accessed, we can drop the table and elide all insertions for it.

Another way to go about this is to make a few smaller modifications to cells as they presently are:

  • A row cell's keys are interpreted as columns; values that are cells must be references to row cells too. A row cell is associated with a single table, and cannot belong to multiple tables simultaneously (or can it?)
  • A table cell must only hold row cells, and have only indexed keys. It has a descriptor of all columns that exist within it.

We can, through static analysis, categorize cells as being of either type.

19.5.2021

An interesting commutative property of symmetric 2-dimensional maps:

map[a][b] = map[b][a]

Any map lookup could therefore be rewritten to use a canonical ordering, or to optimize access. This is particularly interesting for namespace[object][key] vs namespace[key][object].

Something else I've been pondering for a while: function inversion using Monte Carlo methods as procedural generation. Take, for example z = x + y. If we invert this function, starting from a desired result, e.g. z = 1, there are many possible solutions for x and y, but constrained to x + y = 1; some random valid solutions are e.g. (5,-4), (0.25,0.75), (-0.5,1.5). z = abs(x) becomes a generator choosing from the set {-x,x}. z = sqrt(x²+y²) produces random points on the edge of a circle of radius z. z = x < 5 produces numbers smaller than five or greater-equal five, depending on z. And so on. It's actually pretty much how particular distributions are achieved, which is often by taking the inverse of a desired integral, but users rarely have the option of generating samples from a forward specification, i.e. random solutions that satisfy the provided constraints. I wonder what we could do with such a parametrization.

Here's a naive way we could e.g. process a version of z = sqrt(x²+y²):

# forward graph
%0 = x² [%0 >= 0]
%1 = y² [%1 >= 0]
%2 = %0 + %1 [%2 >= 0]
z = sqrt(%2) [z >= 0]

# backwards graph
%2 = z² [%2 >= 0]
%1 = random(0, %2) [%0 >= 0 -> (%2 - %1) >= 0 -> %2 >= %1; %1 >= 0]
%0 = %2 - %1
y = sqrt(%1)*randsign()
x = sqrt(%0)*randsign()

Unfortunately, there is no simple arithmetic to this, as we effectively undertake root finding, and many equations will require a non-trivial solver.
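
Still, the simple case above transcribes directly (Python sketch; note the result is a valid point on the circle, not a uniformly distributed one):

import random

def inverse_length2(z):
    t2 = z * z                            # %2 = z²
    t1 = random.uniform(0.0, t2)          # %1 = random(0, %2)
    t0 = t2 - t1                          # %0 = %2 - %1
    y = (t1 ** 0.5) * random.choice((-1.0, 1.0))
    x = (t0 ** 0.5) * random.choice((-1.0, 1.0))
    return x, y

x, y = inverse_length2(1.0)
print(x, y, (x * x + y * y) ** 0.5)       # the last value is ~1.0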

12.5.2021

Bootstrapping the Tukan compiler is particularly difficult because of the high level nature of the language, and this time I'd like to get to a self-hosting version (as far as that is possible) as soon as possible.

For this, I would also like to use this opportunity to complete the futamura projections.

Here's my plan to get to a complete setup:

  1. Design a Sandboxed, Safe and Scopes-compatible IL that we can easily translate to Scopes IL, henceforth called the Tukan IL or TIL.
  2. Write a TIL -> Scopes IL translator.
  3. Design the Tukan language (TukanLang or TL) and write a bootstrap interpreter for it in Scopes (TL on Scopes). It only has to be complete enough to run the specializer.
  4. Write a specializer in TL that takes a TL program and a set of constants for it and outputs TIL (TL on TL).
  5. Run the specializer through the bootstrap interpreter, specializing the specializer for the specializer (TL on TL on TL -> TIL), yielding a native compiler-generator.
  6. Write the full TL interpreter in TL (TL on TL).
  7. Run the compiler-generator on the TL interpreter, yielding a TL -> TIL compiler (TIL).

The Tukan IL is specified with s-lists and omits these elements from the Scopes language:

  • Pointer types and reference qualifier, reftoptr, ptrtoref, inttoptr, ptrtoint, getelementref, getelementptr
  • Any numerical type that's not u32, i32, f32
  • Typenames and unique types, view, lose, dupe
  • Packed tuples
  • Vector sizes greater than 4
  • All memory intrinsics and mutation: alloca, malloc, free, load, store
  • inline, label, merge, loop, if, switch
  • do, processing order is not prescriptive
  • let, the s-list is a DAG
  • extern
  • Recursive calls
  • All reflective instructions, including typeof
  • Variadic arguments, quoting

These elements will be new:

  • table as replacement for switch, in the form of table <value> <const1> <then1> ... <default>
  • cond as replacement for if, accepting exactly three arguments
  • fold as a replacement for loop, which works the same, including break and repeat, but requires a constant iteration limit
  • map which launches and joins threads, and requires a constant thread limit
  • Loopback slots and bind to store values
  • input and output system I/O interface

2.5.2021

Rule 110 can be partially specialized; when only two bits are known, five patterns can still be solved:

01?->1
0?1->1
?00->0
?01->1
?10->1
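
A quick Python check, enumerating the neighborhoods with one unknown bit and keeping those whose successor is already determined; it reproduces exactly the five patterns above:

# bit i of 110 is the successor of the 3-bit neighborhood with value i
RULE = 110

def successor(a, b, c):
    return (RULE >> (a * 4 + b * 2 + c)) & 1

for hole in range(3):
    for b0 in (0, 1):
        for b1 in (0, 1):
            known = iter((b0, b1))
            pattern = ["?" if i == hole else str(next(known)) for i in range(3)]
            outputs = {successor(*(w if ch == "?" else int(ch) for ch in pattern)) for w in (0, 1)}
            if len(outputs) == 1:
                print("".join(pattern), "->", outputs.pop())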

13.4.2021 B

For function optimization it makes much more sense to first treat all user functions as macros and expand them, then build actual functions from the resulting graph for compression, but that can become quite memory intensive.

The horrors and blessings of deduplication. These functions are all the same function and only differ in parameter order. The canonical version can be obtained by reparametrizing the function so parameters are always declared in the order they are processed:

# before canonicalization
fn f1 (0 1 2) (f (f 0 1) 2)
fn f2 (0 1 2) (f (f 0 2) 1)
fn f3 (0 1 2) (f (f 1 2) 0)
fn f4 (0 1 2) (f (f 1 0) 2)
fn f5 (0 1 2) (f (f 2 0) 1)
fn f6 (0 1 2) (f (f 2 1) 0)

# after canonicalization
fn f1 (0 1 2) (f (f 0 1) 2)
inline f2 (0 1 2) (f1 0 2 1)
inline f3 (0 1 2) (f1 1 2 0)
inline f4 (0 1 2) (f1 1 0 2)
inline f5 (0 1 2) (f1 2 0 1)
inline f6 (0 1 2) (f1 2 1 0)

Idea for stack mini-GC

13.4.2021

Exploring the notion of what a type actually is, I've been thinking about the fundamentals of partial evaluation, namely data encoding. The most basic data that we can feed a program is a sequence of bits, a constant such as 111010110111100110100010101. We assume that the sequence has a known number of bits, and the program can process all of them to produce a new constant. Now, partial evaluation in this context can only mean that we want to keep some of the bits unknown, e.g. using a variable such as ??1010110????00110??00??101. The program will likely not be able to produce a constant, but will instead also produce a similar variable stream of known and unknown bits; but we can partially evaluate the parts of the program that only depend on the constant bits in the variable. This leaves us with a so-called residual program, specialized for the provided variable.

We may know even less about the variable, or a little more. In addition to some of the bits being unknown, we might not know how many bits exactly are unknown. We might have a lower bound, an upper bound on each fragment, or both. For some bits, we might know a set of possible values, or a set of impossible values. Some bits may depend on the state of other bits. The more we know about our unknowns, the more assumptions we can make, and the better we can partially evaluate the function.

But let us put those complexities aside for a moment and think about a simple encoding for our variable stream. We can encode our variable only with bits. A bit can't represent a symbol such as ?, only 0 or 1. Therefore we need to invent a special bit sequence to mean ?, and distinguish it from bit sequences that look the same way. The most compact way to do this is to make 1 ambiguous, to mean either ? or 1, depending on which bit follows. So the sequence 10 means ?, and 11 means 1. ??1010110????00110??00??101 then becomes 101011011011110101010100011110101000101011011. A problem here is that the length of the new stream depends on how many variables and ones appear in the original stream. It can have the same length as the original stream (when it contains only zeroes), or be up to twice as long (when it contains only ones and variables). Let us try a different encoding that always doubles the stream length, so that we can deduce the length of the original stream by halving it. Now 0 becomes 00, 1 becomes 01 and ? becomes 10:

??1010110????00110??00??101                                 # original variable
 ? ? 1 0 1 0 1 1 0 ? ? ? ? 0 0 1 1 0 ? ? 0 0 ? ? 1 0 1      # spaced by 1
101001000100010100101010100000010100101000001010010001      # insert control bits

This gives us an interleaved form, in which each even bit signifies whether the following bit is constant or variable. We can also separate our control stream and our constant stream, which allows for better compression, as the control stream can likely be reused for multiple inputs:

??1010110????00110??00??101     # original variable
110000000111100000110011000     # control bits
001010110000000110000000101     # constant bits

Note that the variable bits in the variable are now unused. As our control bits preserve the original length of the constant and the positions of its constant and variable bits, we could compact our constant bits down to the residual information:

110000000111100000110011000     # control bits
  1010110    00110  00  101     # constant bits
10101100011000101               # compacted constant bits
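
The same split as a small Python sketch (again, the names are mine):

def split_variable(variable):
    # hypothetical sketch: separate the control stream, compact the constant stream
    control = "".join("1" if sym == "?" else "0" for sym in variable)
    compact = "".join(sym for sym in variable if sym != "?")
    return control, compact

control, compact = split_variable("??1010110????00110??00??101")
print(control)   # 110000000111100000110011000
print(compact)   # 10101100011000101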

For the control bits, we notice that we're likely going to have long redundant sequences of zeroes and ones, and so it appears useful here to use a run-length encoding. We'll use bytes for this. Bits 1-7 encode the count of bits, from 1 to 128, and bit 0 encodes whether the value is constant or variable. In other words, variable counts encode as 2 * x - 1 (odd numbers), and constant counts as 2 * x - 2 (even numbers).

10101100011000101                                                           # compact constant bits
110000000111100000110011000                                                 # control bits
00000011 00001100 00000111 00001000 00000011 00000010 00000011 00000100     # RLE control bytes

Note that our control stream got a lot longer as a result, which means it's not packing as well as it used to. To ensure that our encoding always stays at most twice as long as the original variable, we enforce that the variable's unknowns must have 8-bit alignment, and we count bytes, not bits.

??1010110????00110??00??101                                                 # original variable
???????? 01010110 ???????? 00000110 ???????? 00000000 ???????? 00000101     # variable, padded to bytes
?? 56 ?? 06 ?? 00 ?? 05                                                     # same byte sequence, in hexadecimal numbers
01 00 01 00 01 00 01 00                                                     # control bytes

Our control sequence is then also shorter in comparison, even though we don't get any benefits from the RLE compression:

56 06 00 05                 # compact constant bytes
01 00 01 00 01 00 01 00     # control bytes
01 00 01 00 01 00 01 00     # RLE encoding (happens to be the same in this case)

We can improve the encoding further by making use of the fact that we're often going to alternate between constant and variable bytes. Encoding this quality implicitly, we start off by assuming that the first bytes in the sequence are constant, and we encode in each control byte the number of sequential constant / variable bytes, using all bits. For our example we then need to start our control byte sequence with a 0, to indicate zero leading constants. A constant or variable sequence greater than 255 would have to be encoded as 255 0 (x - 255), meaning that we need one extra byte to signify an empty alternating sequence, but by then we have skipped at least 255 bytes. And with wider control words, this will occur rarely, if ever.

56 06 00 05                     # compact constant bytes
01 00 01 00 01 00 01 00         # RLE encoded control bytes
00 01 01 01 01 01 01 01 01      # RLE with implicit alternation

We now have a 13 byte stream describing an 8 byte variable, and the length of the variable can be computed from summing all numbers of the control stream. We can furthermore use prefix sums to find for any index its location in the control stream.

00 01 01 01 01 01 01 01 01      # control bytes
00 00 01 02 03 04 05 06 07      # sum table sum(x)=control(x-1)+sum(x-1)

Using the example above, we can find index 4 with a binary search over the sum table: it lands at index 5 of the control byte sequence, which is odd, designating a variable run of 01 bytes.
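
A Python sketch of that lookup, using the control stream from above:

from bisect import bisect_right

control = [0, 1, 1, 1, 1, 1, 1, 1, 1]   # run lengths, constants first, alternating
sums = [0]
for c in control:
    sums.append(sums[-1] + c)           # sums[i] = first byte index covered by run i

def locate(index):
    run = bisect_right(sums, index) - 1     # last run whose start is <= index
    kind = "variable" if run % 2 == 1 else "constant"
    return run, kind, control[run]

print(locate(4))   # (5, 'variable', 1)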

Remember when we said we would like to be able to specify an unknown number of variable bits? This is possible by turning the control bytes themselves into a variable. For example 04 ?? 04 describes a stream of at least 8 constant bytes in length that has an unknown number of variable bytes in between. It is also possible to specify an unknown number of constant bytes, but that means that the residual constant buffer now also becomes a variable, which causes a cascade of encodings. Take this example:

?? 04 04           # V: unknown number of constant bytes, 4 variable bytes, 4 constant bytes
??... aa bb cc dd  # C: conceptual residual constant buffer

00 01 02           # VV: control bytes of V: zero constant, one variable, two constant
04 04              # VC: residual constant buffer VC of V

?? 00 04           # CV: control bytes of C: unknown number of constants, zero variable, four constant
aa bb cc dd        # CC: residual constants of C

00 01 02           # CVV: control bytes of CV: zero constant, one variable, two constant
00 04              # CVC: residual constants of CV

Leaving us with five constant streams required to describe V and C:

# V:
  00 01 02           # VV: control bytes of V: zero constant, one variable, two constant
  04 04              # VC: residual constant buffer VC of V
# C:
  00 01 02           # CVV: control bytes of CV: zero constant, one variable, two constant
  00 04              # CVC: residual constants of CV
  aa bb cc dd        # CC: residual constants of C

Using a recursive notation, in which xx yy ... designates a constant byte stream and (x : y) a variable consisting of a control byte stream and a residual constant byte stream, the encoding for (?? 04 04 : ??... aa bb cc dd) is then ((00 01 02 : 04 04) : ((00 01 02 : 00 04) : aa bb cc dd)).

8.4.2021

Thinking about what data structure to use for qualifier maps (which are immutable), I've been considering sorted arrays, flat hash maps and hash tries so far. 8 to 16 slot hash tries are a good compromise between storage cost and lookup speed, but the cells waste a lot of memory on low loads, and I wonder if they deduplicate that well.

I'm having success experimenting with a dense hashtable where slot selection happens by most significant bits, thus bucketing entries, and colliding entries are stored as a linked list in each slot (i.e. cons cells). When the lists are kept ordered, and the hashtable size equals the number of entries rounded to the next highest power of 2, then the hashtable is canonical, i.e. no matter how a map is constructed, if it has the same content, it has the same layout. Furthermore, the ordering allows us to reuse 50% of each list on average. In fact, rehashing becomes quite simple, which I'll get to.

I recorded some statistics for inserting the same 1158 keys into hashtables of increasing width, and it appears that far below a width of log2(count), with loads far above 1, we get to average insertion and retrieval speeds that outmatch a binary tree, while keeping the map dense.

Rehashing is as simple as doubling the table width, which increases bit significance by 1 and splits each slot t(x) into two slots t'(2*x) and t'(2*x+1). The list stored at the slot must be split up into two lists at the boundary of the new significant bit, which are then assigned to the two new slots. And that's it. Compaction is the same process in reverse, requiring us to join two subsequent lists.
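
A minimal Python sketch of the whole scheme, under my own assumptions (plain lists stand in for the ordered cons lists, and bare hashes stand in for entries):

import random

HASH_BITS = 32

def slot_of(h, width):
    # width is a power of two; bucket by the top log2(width) bits of the hash
    return h >> (HASH_BITS - (width.bit_length() - 1))

def build(hashes, width):
    table = [[] for _ in range(width)]
    for h in sorted(hashes):            # ordered insertion keeps every slot list canonical
        table[slot_of(h, width)].append(h)
    return table

def rehash_double(table):
    # doubling the width splits each ordered slot list in two at the boundary
    # of the newly significant hash bit
    new_width = len(table) * 2
    new_table = [[] for _ in range(new_width)]
    for entries in table:
        for h in entries:
            new_table[slot_of(h, new_width)].append(h)
    return new_table

hashes = [random.getrandbits(HASH_BITS) for _ in range(10)]
width = 1 << (len(hashes) - 1).bit_length()     # entry count rounded up to a power of two
table = build(hashes, width)
random.shuffle(hashes)
assert build(hashes, width) == table            # same content, same layout
assert rehash_double(table) == build(hashes, width * 2)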

Merging tables is similarly straightforward and can be performed in linear order, as hashes are ordered.

slots 1 keys 1158 load 1158.0 log2 10.17742
insert avg 287.392914
best 1158:u32 worst 1158:u32 median 580:u32 average 579.5

slots 4 keys 1158 load 289.5 log2 10.17742
insert avg 73.604492
best 281:u32 worst 299:u32 median 145:u32 average 145.323837

slots 16 keys 1158 load 72.375 log2 10.17742
insert avg 19.810017
best 57:u32 worst 96:u32 median 37:u32 average 37.215027

slots 64 keys 1158 load 18.09375 log2 10.17742
insert avg 5.729706
best 5:u32 worst 35:u32 median 10:u32 average 10.249568

slots 256 keys 1158 load 4.523438 log2 10.17742
insert avg 2.204663
best 1:u32 worst 13:u32 median 3:u32 average 3.33506

slots 1024 keys 1158 load 1.130859 log2 10.17742
insert avg 1.312608
best 1:u32 worst 6:u32 median 1:u32 average 1.634715

slots 2048 keys 1158 load 0.56543 log2 10.17742
insert avg 1.170121
best 1:u32 worst 4:u32 median 1:u32 average 1.338515

With small modifications, the representation is also suited for emulating sparse arrays. When using a hash function that only truncates redundant trailing zeroes, the lookups will be sequential. Descending ordering of colliding values will ensure that appending is O(1). Rehashing will have to be performed when the maximum integer key requires an additional bit, which also changes the hash function. When the maximum key is removed, it is also possible a bit needs to be removed from the hash function; To find out how many, the next highest key has to be found, which is as easy as picking the next element from the same slot, or the first element of the next lower slot.

The sparse array can also be optimized for linear traversal by using ascending ordering; otherwise traversal requires unzipping elements of each slot.

2.4.2021

Protocol time. We've got a content-agnostic DAG system, now it's time to build the new IL on top of it. Let's see what data structure features we want:

  • Containers: maps, tuples. There are two versions here: a.) write-efficient, where modifications avoid duplicating the entire structure, but access is O(log n), and b.) read-efficient, where access is O(1) but memory usage is wasteful. Both have their applications, and can be mixed.
  • Strongly typed constants as well as polymorphic ones for the specialization stage.
  • Type introspection. Types can be kept in a channel separate from the data to keep it polymorphic (and profit from deduplication), so that functions not specialized for a particular type can use the information to read and generate new data (as well as update the type annotation). All interpreter constants could be represented like this, and we can specialize data merely by removing the type channel.

That leaves us with types as the "untyped" information. Not only do types need to be of implicit type, it also has to be possible to describe types with types. These are the hardware types that we need to be able to describe, limited by the types supported by GPUs:

  • 16-bit, 32-bit and 64-bit signed and unsigned integers.
  • 16-bit, 32-bit and 64-bit IEEE754 reals.
  • Logsized vectors of integers or reals up to length 4. GPUs mostly do not really care about vectors, but they become relevant when accessing images.
  • Sized and unsized arrays of arbitrary element type. Unsized arrays can only be used as last elements of tuples, on the root level and as pointer element types.
  • Tuples of arbitrary element type.
  • Images of dimension 2, 3 or 4, and numerical or vector type.
  • Pointers of arbitrary element type. Not really supported on GPU, but we can translate every pointer to either a logtile index or a GPU resource.
  • Function signatures. Functions take multiple arguments and return single arguments.

For more abstract, generalizing types we also need:

  • (Tagged) unions, which allow us to describe the situation of multiple possible return types.
  • Some way to describe recursive data structures. A DAG doesn't make this immediately obvious. I am considering an analog to the loop statement, in which the initializer binds the terminator to a recursive variable, and the body then uses it somewhere. A mutually recursive declaration, for example, would use a definition like recursive (a b) (tuple (b | ...) b) (tuple (a | ...) a).

For the specializer / interpreter stage, we need, at least, the typical data structures required to describe our AST.

  • Strings. Similar to SPIR-V, these are encoded as zero-terminated arrays of 8-bit characters, packed four per word. An (array uint32) hardware type would obviously be insufficient to allow presentation and debugging of strings on screen.
  • Symbols. Use the same hardware format as strings, but to be resolved to a bound value.
  • Builtins. These are symbols with a different type signature, to prevent them from being resolved again. They represent intrinsic instructions / non-reducible forms.
  • S-Lists. Tuples do this job just as well. By classic Scheme rules, we'd interpret the first element of a tuple to be a function, and the remaining elements as arguments the operator is applied to. The lists do not need to be polymorphic, the type signature does that job well enough.

If we are to use a structural type system, which I would very much like because it deduplicates better, then the question is how we distinguish strings from symbols and builtins, for example. What typically sets types of the same storage type apart are methods and properties that modify their behavior. For this, we would need a decorator that equips hardware types with maps, similar to how typenames work in Scopes, but without any guarantee of uniqueness, which isn't necessary here anyway. But that means we need attribute maps, ideally order independent (so insertion order doesn't matter), at the type level. But since those attribute maps are required to boot the type system, they themselves can't use keys or values whose types require attribute maps, i.e. we can't describe symbols as qualified uint32 arrays, because we need symbols in the map; and even if the map specializes the key type as symbolic always, values need types too - for this, we need a type signature for the map itself. When we qualify a value type, we can't use this map as a value for the qualifier. So the string type, for example, cannot use a string value to give itself a readable name - it will have to use a regular uint32 array. With these restrictions in place though, we can still carefully hoist up our type system.

1.4.2021

Now that almost all basic DAG services are implemented, save for marshaling, I'm still a little bothered by the GC potentially having to move a lot of memory, because it's stored in contiguous memory. And as we have refcounting, a GC isn't even strictly necessary. Logtiles could be individually allocated and hold each other as shared pointers. Each logtile would contain its actual memory, its ref bitmap or a list of its reference indices, a hash, and a set of shared pointers to logtiles. We still use the same addressing scheme. To resolve an address, we would have to resolve a map of addresses of tiles. If the address is a subregion of a tile, we will need to resolve the map multiple times, and save an alias for future lookups. There are drawbacks and benefits to using individual tiles.

The benefits:

  • Tiles are immutable, so the information we store in a tile does not have to be changed ever again.
  • We are able to deduplicate not only tile addresses, but also tile storage.
  • Refcounting removes tiles from memory the moment they are no longer required. The GC is reduced to tracking weak references and compacting the address space, which involves rebuilding the tile map. That means garbage collection is much faster.
  • Refcounts allow us to query the incidence of a tile.
  • Storing a set of referenced tiles per tile allows us to scan the DAG's topology much quicker, but it can produce overhead where a tile consists of nothing but references.
  • There is no buffer overflow, and therefore there's no array that requires a transparent reallocation / copy, which is something that can happen at random when allocating new tiles (although it only happens when the memory is exceeded for the first time, and it happens exponentially rarely).
  • The system is able to entertain the concept of a "work in progress" tile whose memory isn't immutable yet. The tile has to be explicitly committed in order to be able to reference it. But the tile could also be discarded and replaced with an existing reference. That makes tile commits transactional. But that is mostly API design, and could also be applied to the contiguous version.

The drawbacks:

  • Simple serialization doesn't exist anymore. We can't simply dump an array to disk or the clipboard, nor can we store it in a mmap.
  • Lookups are more involved. With contiguous arrays, address resolution consists of unmasking and bitshifting the offset, then adding it to the array pointer to get the actual address. Individual tiles require hashing the tile and looking it up in a hashmap, which requires adding the hash to the hashmap pointer and performing a comparison inside a loop; possibly repeatedly if there are collisions (though we could keep the map extra sparse). For subreferences inside a larger tile, we need to do multiple lookups ascending the hierarchy and offset on success.
  • We don't benefit from CPU caching through linear access of the array.
  • The representation isn't GPU friendly at all. The lookups are prohibitively expensive alone. Tiles would have to be at least aggregated in buckets to become manageable, and that takes a lot of benefit out of the way tiles are managed.

In my opinion, the drawbacks kill the solution.

A remaining point is marshaling. The DAG buffer can be sent over the network or stored in a file as-is, but file synchronization over multiple processes should use LMDB as an intermediary, and that would be a good opportunity to deduplicate multiple DAGs and reuse as much data as we can. A way to do that could be to save tiles individually, as a hash to tile mapping, and then store the actual DAG as an array of hashes as the root structure, i.e. it becomes its own tile; or - simply allow the root tile to hold the entire DAG, which of course complicates loading. To deduplicate as much as possible, we could write tiles so that references are resolved to the hashes they point to. As a hash requires 32 bytes, we should use an auxiliary table that maps a running 32-bit index to a hash, and then translate references accordingly.

31.3.2021

The garbage collector is now implemented in the way I described in the last post and this time, it works. I am very happy about that.

I switched the deduplicator out for the incremental method. The deduplicator is pretty cool, it runs up to log2(n) hashing jobs for n words simultaneously, commits new tiles as they're finished, and rewrites references as it passes them. With this technique, we build a growing dictionary of hashed tiles and are able to deduplicate only the latest committed appended memory. The deduplication method for the last IL was just to hash a node and find an existing node with the same hash, but this one deduplicates all memory references, so one could implement a tile engine on top of it, dimension doesn't matter.

I also had a free bit left in references, which I used to implement weak references. The GC skips marking weakly referenced regions, and they are rewritten to zero on compaction. Deduplication, which is memoization, produces surprising behavior here, because as long as there's any user for a region containing the weak reference target as a subregion, the weak reference will not clear. It is however possible to break and control the pattern in which this happens, for example by building reference chains whose terminal item is a typestring.

The next step is to implement the actual IL on top of the agnostic DAG.

29.3.2021

Well. The deduplication mechanism is implemented. I ended up using an even simpler method.

  1. Iterate and hashmap the largest unique logregions (using SHA256 hashes), while also aliasing addresses (unique addresses are aliased to themselves). Then repeatedly subdivide unique regions and repeat the same steps with those, until we arrive at unit size. That's sub O(n log n).

  2. Scan content for references (using the ref bitmap) and rewrite each reference according to the alias map. The hashmap is no longer necessary for this step. The alias map will not contain addresses of subregions of redundant blocks, but we can search upwards in the alias map until we find an alias for the parent block, and recalculate its new offset and try again. Also sub O(n log n).

The range/refcount-based garbage collector I mentioned in an earlier post however doesn't work at all; I forgot the case where two ranges share the same start offset but not the same end, so one start bit is set, but two end bits are set, and that messes up the refcounting irrecoverably. So we need a different approach.

Instead of an indices bitmap, we use a mapping of region to boolean, where the boolean is true if the region is a leaf. When a region is marked, all its parent regions are mapped to false if not already mapped, and if none of its parent regions are mapped as true, the region is mapped to true. We begin by marking the root node and the zero node (0:1). We define a sweep line starting at the end of the buffer. We use the map to find the deepest rightmost marked logregion that is intersecting or touching the sweep line and mark all references that it contains. Then we move the sweep line to the front of the region and repeat. This ensures that we process all regions in topological order, exactly once. We can therefore also record all logregions that we have visited.

The number of iterations required to descend the tree can be reduced by continuing with the left neighbor of the last processed block. If the block isn't tagged, we ascend the hierarchy until we find one that is, then descend to find the rightmost subregion mapped true that's not past the sweep line. The walk is over when no more marked blocks are found.

For compaction, we walk our recorded logregions in reverse (so they're in the correct order), and move the allocations back - which can be done in-place. We store an alias mapping from old logregion to new logregion. For every reference we encounter, we search the aliases for a mapping; If there is no direct mapping, we try to find the first mapped parent region and recompute the local offset of the reference. We also move the reference bits.

28.3.2021

I wrote a deduplication algorithm for logregions that is a little wasteful on space, but at least it works, and complexity wise it's not too bad.

Iterating the underlying buffer, we append offsets to the array of actively tracked offsets.

After each append, we check our actively tracked offsets for sizes that have reached a log2 size in this iteration (for every offset, that happens at least once immediately after appending, because 1 is a log2 size). The size can be computed by subtracting the offset from the global offset. If the offset's logregion is complete, i.e. it covers its maximum possible size (restricted by the alignment of the offset), we remove it from the array of actively tracked offsets. We also compute the hash for the logregion that the offset has reached (which can be done incrementally), and insert it into a map of hashes to offsets if the hash hasn't been mapped to a previous offset yet. Otherwise we only replace the existing mapping with the new one if the existing offset is complete and its size is smaller than our current offset's reached logregion size, or the new offset is incomplete and the sizes match.

The map of hashes to offsets can then be queried by references. The referenced logregion can be hashed and the hash searched in the map to find an existing copy.

At any point, we're only dealing with log2(max_buffer_size) active tiles.

We could save a lot of memory by less speculative building of the tree; We record all incremental hashes of references as we encounter them going back to front; Then only track offsets that fit those incremental hashes.

26.3.2021

Notes on storing binary trees in arrays, with implicit, pointerless addressing:

-0--------------
--11------------
----2222--------
--------33333333
-011222233333333 type A: best for frequent sibling traversal

----------------
--------0-------
----1-------1---
--2---2---2---2-
-3-3-3-3-3-3-3-3
-323132303231323 type B: tree order equivalent to sorted array

increase depth:
a.) at bottom

    type A: append new level
        -011222233333333 -> -0112222333333334444444444444444
                            ----------------
    type B: shift right & space apart by 1 element
        -323132303231323 -> -4342434143424340434243414342434
                              - - - - - - - - - - - - - - -        
b.) at top (e.g. as in a merkle tree expansion)

    type A: pad out each level individually
        -122333344444444 -> -0112222333333334444444444444444
                              - --  ----    --------
    type B: append new right hand side
        -434243414342434 -> -4342434143424340434243414342434
                             ---------------
                             
Functions:

type A:
root_index() = 1
left(idx) = idx << 1
sibling(idx) = idx ^ 1
parent(idx) = idx >> 1

type B:
root_index() = array_count >> 1
radius(idx) = idx & -idx 
left(idx) = idx - (radius(idx) >> 1)
sibling(idx) = idx ^ (radius(idx) << 1)
parent(idx) = (idx | (radius(idx) << 1)) ^ radius(idx)

While the traversal arithmetic of Type A is slightly simpler and the tree structure has good CPU caching behavior, Type B is easier to extend and rearrange, as all subnodes of a branch are stored in contiguous memory.
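
A small Python sanity check of the type B functions against the 16-element diagram (right() is my own addition, derived by symmetry):

def radius(idx):    return idx & -idx
def left(idx):      return idx - (radius(idx) >> 1)
def right(idx):     return idx + (radius(idx) >> 1)   # not listed above, follows by symmetry
def sibling(idx):   return idx ^ (radius(idx) << 1)
def parent(idx):    return (idx | (radius(idx) << 1)) ^ radius(idx)

root = 16 >> 1                      # root_index() for a 16-element array
assert (left(root), right(root)) == (4, 12)
assert parent(4) == root and parent(12) == root
assert sibling(4) == 12 and sibling(12) == 4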

Another fun idea that I have no use for right now is an encoding of 2**n sized ranges that allows addressing 2GB of memory with a single 32-bit value, provided the address is size-aligned. In this format, we encode size and offset (which must be a multiple of size) as ((offset << 1) | size). Then size(addr) = addr & -addr and offset(addr) = (addr ^ size(addr)) >> 1. 0 can still be used as a null value. If ranges must be at least 16-byte aligned, then 32GB can be addressed this way.
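
A quick Python sketch of that packing (the helper names and values are arbitrary):

def pack_range(offset, size):
    # size must be a power of two, offset a multiple of size
    assert size > 0 and size & (size - 1) == 0 and offset % size == 0
    return (offset << 1) | size

def range_size(addr):   return addr & -addr
def range_offset(addr): return (addr ^ range_size(addr)) >> 1

addr = pack_range(0x400, 64)
assert range_size(addr) == 64 and range_offset(addr) == 0x400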

Another version of this packs a range as 27:5 bits, where 27 bits store the offset, 5 bits store the exponent of the size. This allows us byte-aligned access to 128 MB of memory; 16-byte alignment increases the address range back up to 2GB. The 64-bit version uses 58:6, and addresses 2 ** 18 TB with byte-precision. Overall, 64-bit is more fun. We can allow arbitrary sizes, and choose a trade-off between size and offset. 32:32, the carpenter's choice, addresses 4GB and allows up to 4GB sizes. 40:24 addresses 1TB and caps size at 16MB. And then there's also the option to make the trade-off variable. In o:s:6, 6 bits designate the width of o (1 .. 57), whereas the width of s is defined as 58 - width(o). The offset range stays the same, but the resolution changes in form of greater alignment. Up to 256 bytes could be placed with byte precision across 1024TB, whereas 4096 bytes would require 16 byte alignment. Alternatively, we could also allow byte-aligned offsets, but require sizes to be aligned.

25.3.2021

Having finished the tagging part of the GC for our bit-tagged flat cons array, I'm still going through use cases in my head, and it doesn't sit right with me that references can't be ranged, and are implicitly terminated by the data they are referencing. I would like to use a representation that makes it easier to construct (nested) Piece tables, and for that I'm reconsidering the idea I had yesterday, about dropping the separator bit.

That would leave us with the bitarray tagging references, which is still required. The four big questions underlying the design then are: How do references work then? How do we perform GC marking? How do we perform compaction? How do we perform deduplication?

  • References become slices. Not wanting to extend word size to 64-bit, as that creates problems on the GPU, and wanting to give slices the ability to be of arbitrary size, and to address a range beyond 2 ** 16 words, we would require that a reference consists of two consecutive unsigned words: either the beginning of a range and its exclusive end, or the beginning and its size.
  • Marking: we need two bitarrays with each bit mapping to a word. The first bitarray marks slice starts, the second marks slice ends. Sweeping front to back, starting at the root, we tag for each reference encountered its start and end position in both bitarrays respectively. Iterating tagged bits backwards, we find regions that are in use by incrementing a counter when the end of a region is encountered, and decrementing it on range starts. This also handles partially overlapping regions correctly. As long as the counter is greater than zero, the region is in use, and the tagged reference bits inside are able to mark additional regions further down.
  • Compaction: this time we iterate range events, recorded during marking, from front to back, once again using the counter to find marked regions. Regions where the counter is zero can be discarded; Active regions are copied to a fresh buffer, to create a contiguous buffer of used regions. Each region event also marks a reference target; we rewrite that location to point to the new position. Each encountered reference must be updated during copy, by looking up the rewritten reference targets in the old buffer at both beginning and end of range. We could also perform compaction in-place, by writing new addresses to a temporary hashmap instead, which maps old slice boundaries to new ones.
  • Deduplication: this is a tricky one, because if done non-naively, it's basically good data compression: for any slice whose contents already exist as a sequence somewhere earlier in the buffer (and must also match the reference bit pattern), we can use that substring instead. The simple way is to assign a hash to each referenced range, and reuse references sharing the same hash. It misses out on many opportunities, but does the main job of deduplication: reducing the cost of region comparisons by abbreviating them to a reference comparison. For a better job, a dictionary similar to a suffix tree could possibly help here. We build the tree as we walk back to front; encountering a slice reference, we already know the maximum search depth. We try to find the substring in the tree, and will either find the substring itself, or an earlier location.

Apart from references doubling in size, which is probably unavoidable, I like the design and the more linear elegance of marking & compaction in comparison to the little dance marking does with the current implementation, where it walks from right to left, but on each step, a little to the right again.

24.3.2021

Invariants are nice, but variable arguments are not an invariant. This way they add an additional level of complexity to storing nodes in a linear array. What if we stored cons cells instead? Each array element is a tuple of two fields at and next (veterans may know them as car and cdr). With fully featured cons cells, both are of variant type, but for our purposes, only the first one needs to be variant, and the second one is a reference to another cons cell. We use unsigned ids, so index 0 will designate NIL, or an empty cell.

How do we encode the variant type? If we encode it with references, then we can reuse a cell multiple times, but front to back sweeps won't be able to read cell contents and perform dependency tagging correctly. So we need to put it in the cell. In order to simplify pointer arithmetic, the cell size should remain a power of two. Say we encode the type of at in the upper bits of next. In practice, sweeps only need to consider whether at is a reference or not. So we need only 1 bit of next to store this information, still leaving us with 1 << ((bitcountof word_size) - 1) units of addressing space.

As the array only knows cons cells and we want to keep this the only type of memory, strings will also have to be stored this way. The overhead in storing a string as a sequence of cons cells would then be 1:2, i.e. twice as many bytes needed to store a string. On the other hand, we might gain memory through tail deduplication. Multiple strings ending in the same characters on aligned offsets will reuse the same chain of cells. But how many times is that the case? The only way to decrease the overhead is to change the ratio of at size to next size; say each at can store three words, then the overhead ratio is 1:3; but then we lose the property that (countof cells) == (countof elements_in_list), and we need to store that information some other way.

Let's consider another encoding, in which we write an unbroken sequence of cons cells as a sequence of at fields, where the next field is implicitly pointing to the field that follows. This requires us to be able to address the sequence at any starting point, needing the ability to recognize when the sequence is ending. As all bits of each at field are required, that's impossible. We would need a second, auxiliary array, which holds two tag bits per entry: bit 0 acts as separator, indicating whether the field is the last field in the sequence, bit 1 indicates whether the field is a reference (so we can perform our content-agnostic sweeps). For 32-bit words, that gives us a fixed overhead of 1:16, and for 64-bit words, 1:32; good CPU cache behavior for both. One 64 byte cache line of flags allows to find separators and references in 256 words, without even looking at the content. If we separate those two bits into individual arrays, then a cache line can enumerate tags for 512 words. Reading bits in 64-bit chunks, counting elements is as easy as using findlsb and some bit twiddling.
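
A Python sketch of that scan, assuming the separator bits live in a separate array of 64-bit words; the helper name is mine:

def next_separator(sep_words, start):
    # field index of the next set separator bit at or after `start`, scanning
    # 64-bit chunks; (x & -x).bit_length() - 1 plays the role of findlsb
    word, bit = divmod(start, 64)
    chunk = sep_words[word] >> bit
    while chunk == 0:
        word, bit = word + 1, 0
        chunk = sep_words[word]
    return word * 64 + bit + (chunk & -chunk).bit_length() - 1

sep = [(1 << 3) | (1 << 10)]            # separators at field indices 3 and 10
assert next_separator(sep, 0) == 3
assert next_separator(sep, 4) == 10     # the sequence starting at field 4 has 10 - 4 + 1 = 7 fields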

One important thing I left out in this bit-tagged tuple representation is the presence of a next field. I would leave this to the implementation of types and protocols on top, but by convention we could require that every terminator bit indicates a next field. We are in any case unable to encode a zero element tuple. We only need one anyway, and it would be at index 0, with its field pointing to itself. The presence of an implicit next field allows the system to compress tuples that end in the same tail, but which complicates iteration and counting. Therefore I would leave this to implementations. Any protocols that benefit from packing need to implement it themselves.

Could we do without the separator bit? Yes, if each reference was bracketed i.e. a range. This is beneficial also for the reason that we don't always have to count. Then garbage collection could tag used words by following ranges. It would cause a reduction of the addressing space though, and some overhead from unused space, up to 1:2. The naive way to deal with this is to use 64-bit words and split ranges into 32:32 fields. But if we required sizes to be a power of 2, we would only need 5 bits to encode the size of the bracket, and had 1 << 27 addressable words left (536 MB). If the exact size of the content mattered, it could be encoded separately, or extracted a different way: for example, the exact length of a zero-terminated string could be determined by searching from the tail end in reverse, or from the middle to the tail, which will require f/2 iterations on average for a f-word partition. If the string were guaranteed to contain no zeroes, we could even employ binary search and find the length in log2(n-1) iterations. But such are protocol considerations.

Content-agnostic deduplication changes a little: with separator bits, sweeping from back to front, we can find duplicates by incrementally hashing tuples from their tail ends and recording the hashes per word in an auxiliary array; then deduplicating references to the same hashes, preferring the index into the longest continuous sequence. The hashes could be imperfect, requiring us to perform a comparison to be sure. Without separator bits, we will see tail ends only from references. It makes more sense here to hash ranges as they are being addressed, and deduplicate multiple references to the same content. If we build a map of ranges being subsets of other ranges, we could once again prefer ranges that are subsets of the largest continuous range. Perhaps content-agnostic deduplication isn't worth it and we should leave it up to protocol. We'll see.

Content-agnostic garbage collection also undergoes a change: with separator bits, we can merely tag the first word of a reference to indicate everything up to the next separator as used. Without them, we need to tag every word in a range.

Could we do without the reference bit? This will require garbage collection to work heuristically, similar to the Boehm GC, wherein we assume any word to be a potential reference (or ranged reference), and tag the respective target as used. This causes a lot of false tags, but might still work sufficiently well. Deduplication, on the other hand, will no longer work, because we can't rewrite references that we're not sure are references.

23.3.2021

I'm in the process of rewriting the optimization process for the IL, and I want to make it stackless, more memory efficient, but also extensible. Some wild ideas:

  • Keeping all nodes in a single u32 array, including their arguments, is doable. Ids are then synonymous with the node offset within the structure.
  • Each node's header must then include the number of arguments that it allocates. A reference to the root node must be kept.
  • Reverse iteration of the array requires a node footer that holds the offset to the previous header, or we merge the footer with the header of the next item, and create a sort of "doubly linked list with deltas". Limiting the maximum number of parameters to 256 (which also limits strings to 1024 bytes), we can then keep 8 bits for the number of preceding arguments, 8 bits for the number of following arguments, and 16 bits for the opcode. Or 8 bits for an opcode, 12 bits for 2x arguments. 64-bit words would leave us with more room ofc.
  • Any node can be truncated if need be without having to move other nodes around.
  • The tail node currently written to the array can be incrementally extended with parameters.
  • Making temporary sidechannel arrays for nodes gets harder. That is a big argument against keeping it all in one array.
  • 256 bits (32 bytes) can tag 1KB of memory.

The following ideas are also suited for (topologically ordered) arrays of nodes that keep their own operand memory:

  • 256 bits (32 bytes) can tag 256 nodes.
  • Garbage collection can be done by following root node references back to front and creating a tag map, then copying tagged nodes to a new buffer, front to back, while using the old buffer to write the new address.
  • A sidechannel that stores for each node the smallest index of all its dependencies provides a heuristic limit to iterating the dependencies of a particular node back to front, using the tag method to find actually connected nodes, and also provides a heuristic to estimate if a particular node could be a dependency (with false positives, but no false negatives).
  • A similar sidechannel that stores for each node the largest index of all its dependants provides this heuristic in the other direction, front to back. Once again using the tag method, we can find connected nodes in that range, and estimate if a particular node could be a dependant.
  • Drop/Apply operations that require descending (back to front), then ascending (front to back) connected nodes could also be replaced by a repeated front to back sweep that e.g. changes (replace (x y) (f a ...)) to (f (replace (x y) a) ...), moving replace nodes further down the tree on each iteration. This is mostly useful when there are many other rewrite operations being done in the sweep, including garbage collection.

22.3.2021

Writing down the rewriting patterns to optimize functional trees for efficient execution.

simplification:
if(true,x,y) -> x, if(false,x,y) -> y
if(x,c,c) -> c
if(x,true,false) -> x

atomic conditions:
if(if(a,b,c),d,e) -> if(a,if(b,d,e),if(c,d,e))

constant folding:
if(x,f(x),g(x)) -> if(x,f(true),g(false))

boolean logic to branches:
and(a,b) -> if(bool(a),b,a)
or(a,b) -> if(bool(a),a,b)
not(x) -> if(bool(x),false,true)

bubble up shared conditions (some examples):
all(if(a,x,z),if(b,y,z)) -> if(and(a,b), all(x,y), z)

push down paired instructions:
any(if(a,x,y),if(b,z,w)) -> if(a,any(x,if(b,z,w)),any(y,if(b,z,w)))
any(x,if(b,z,w)) -> if(b,any(x,z),any(x,w))

push down calls of unknown functions:
if(x,f,g)(a,b,...) -> if(x,f(a,b,...),g(a,b,...))
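
A tiny Python sketch of the first group of rules, over a hypothetical tuple-based term representation:

def simplify_if(term):
    # only handles ("if", cond, then, else) terms; anything else passes through
    if not (isinstance(term, tuple) and term and term[0] == "if"):
        return term
    _, c, t, f = term
    if c is True:                return t   # if(true,x,y) -> x
    if c is False:               return f   # if(false,x,y) -> y
    if t == f:                   return t   # if(x,c,c) -> c
    if t is True and f is False: return c   # if(x,true,false) -> x
    return term

assert simplify_if(("if", True, "x", "y")) == "x"
assert simplify_if(("if", "a", "z", "z")) == "z"
assert simplify_if(("if", "a", True, False)) == "a"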

19.3.2021

The BDD selector pattern (c f t) which is analog to the C function c?t:f can be extended to support tags and by extension, switch cases:

(c (i1 = x1) ... (iN = xN))

Where c is a conditional tag, i1 to iN are tag constants describing cases and x1 to xN are expressions to be selected based on which constant c matches. The tag type of c can be deduced from case constants.

The selector pattern can then be transformed into the switch pattern through defining booleans as tags with two values false and true, and rewriting the pattern so that

(c f t) -> (c (false = f) (true = t))

The application of conditionals to cases then is also performable as substitution

(c (i1 = f1(c)) ... (iN = fN(c))) -> (c (i1 = f1(i1)) ... (iN = fN(iN)))

Switches with complex conditions can be similarly rewritten as selectors

((c (i1 = x1) ... (iN = xN)) (j1 = y1) ... (jN = yN)) ->
    (c (i1 = (x1 (j1 = y1) ... (jN = yN))) ... (iN = (xN (j1 = y1) ... (jN = yN))))

And we can also reduce identity switches as well as switches that all lead to the same result

(c (i1 = i1) ... (iN = iN)) -> c
(c (i1 = x) ... (iN = x)) -> x

Now, extending switches to support any integer, it would be useful to first define integers as ordered sets of constants and ranges.

[i1 .. iN]

defines an inclusive interval from constant i1 to constant iN, where i1 <= iN. An 8-bit unsigned integer would then be defined as [0 .. 255], a boolean as [0 .. 1], and [x .. x] is equivalent to an integer constant x. Tags and integers can then be composed to canonical types k1 | ... | kN where kX in ([i1 .. iL], i) with ordered non-overlapping cases.

The switch pattern then requires that case constants are specified in the same non-overlapping order as the type - in fact, the conditional type can, again, be deduced from the specified cases. We also allow compositions.

(c (i1|... = x1) ... (...|iN = xN))

All operations but the substitution operation remain as they are. The substitution operation requires that we are able to replace variables with cases, which are subtypes, i.e. we must permit types as conditions, and apply them just as we apply constants.

This requires rewriting switches to reduce the number of possible cases according to the input condition, by intersecting the input with cases and removing the cases with empty sets. Some examples:

# definite selection
c=1 (c (0 = a) (1 = b) (2 = c)) -> b
# can only reduce expression and alter type, but we can still solve, see further down
c=[0 .. 10] (c ([-256 .. -1] = a) (0 = 0) ([1 .. 126]|[128 .. 255] = 1)) -> (c (0 = 0) ([1 .. 10] = 1))
# illegal value in interval
c=[0 .. 10] (c ([0 .. 8] = a) ([10 .. 20] = b)) -> error: unhandled case 9
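
A Python sketch of the intersection step, using lists of inclusive intervals; the coverage check that raises the "unhandled case" error is left out:

def intersect(a, b):
    # both arguments are lists of inclusive (lo, hi) intervals
    out = []
    for lo1, hi1 in a:
        for lo2, hi2 in b:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo <= hi:
                out.append((lo, hi))
    return out

def reduce_switch(cond_type, cases):
    # cases: list of (interval list, expression); empty intersections are dropped
    reduced = []
    for intervals, expr in cases:
        hit = intersect(cond_type, intervals)
        if hit:
            reduced.append((hit, expr))
    return reduced

cond = [(0, 10)]
cases = [([(-256, -1)], "a"), ([(0, 0)], 0), ([(1, 126), (128, 255)], 1)]
print(reduce_switch(cond, cases))   # [([(0, 0)], 0), ([(1, 10)], 1)]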

A new discovery is that the presence of types causes constraints to bubble up

(c ([-256 .. 255] = (c ([0 .. 8] = a) ([16 .. 32] = b)))) -> (c ([0 .. 8]|[16 .. 32] = (c ([0 .. 8] = a) ([16 .. 32] = b))))

Direct links like the one above can trivially be further simplified

(c (-1 = q) ([0 .. 8]|[16 .. 32] = (c ([0 .. 8] = a) ([16 .. 32] = b)))) -> (c (-1 = q) ([0 .. 8] = a) ([16 .. 32] = b))

But bubbling up cases separated by other conditions requires a new rule. Take this example:

(c 
    (0 = a) 
    ([1 .. 2] = 
        (f
            (0 = d) 
            (1 =
                (c
                    (2 = j)))
            (2 = 
                (c 
                    (1 = h)
                    (2 = i))))))
->
(c 
    (0 = a) 
    ([1 .. 2] = 
        (c
            (2 =
                (f
                    (0 = d) 
                    (1 = j)                    
                    (2 = i)))))) 
->
(c 
    (0 = a) 
    (2 = 
        (f
            (0 = d) 
            (2 = i))))

Effectively, any appearance of a switch depending on the substitution variable c in a subexpression must invert the relationship and bubble c upwards, intersecting all possible states, which will allow us to specialize all subexpressions, removing c from subconditions. It may still appear as a terminal result, but will be inert.

A switch should also bubble multiple subswitches with the same condition if they have conflicting cases. This will reduce and also fix the expression.

Lastly, in this system, (c ([1 .. 5] = x)) and (c (1 = x)) both are meaningful expressions because they serve to constrain the type of c.

13.3.2021

Converting option types to branches. In this system, every value is an option type, and thus is conceptually conditional. Expressions take options and return options, thus propagating the condition. At certain points, conditions are merged:

  • Arguments passed to an expression. In this case, the expression only evaluates to a result when all options are valid, and none otherwise (AND relation between all conditions).
  • Arguments passed to a merge operation. In this case, the first argument that is not none becomes the result, otherwise the result is none (OR relation between all conditions).

This requires a boolean compile time arithmetic for conditions whose final result tells us in which branch(es) the result will end up, if new branches are to be created, and which branches they depend on.

We'll need these abstract compile time operations to describe branches:

  • ?expr ("the branch of expr") describes the branch that depends on the value of expr, and thus defines a new branch. expr must be of a type that can be evaluated to true or false in the context of branching.
  • ?true describes the branch that is always active i.e. the root scope.
  • ?false describes a branch that is always inactive and can only be the result of a compile time evaluation.
  • ?a & ?b: a branch that is only active when the branches of a and b are active.
  • ?a | ?b: a branch that is active when either branch of a or b is active.
  • !?a: a branch that is active when the branch of a is inactive.
  • branchof(x) -> y is a macro that retrieves the branch expression y on which the evaluation of x depends.

Every expression thus performs branch arithmetic (I simplified the case blocks from the last post so I can reason better about this):

  • input(InputEnum) selects branch ?InputEnum. InputEnum is a system I/O channel whose option is set at runtime by the host system. branchof(InputEnum) will always evaluate to ?true.
  • if(condition) selects branch branchof(condition) & ?condition as condition is a boolean known only at runtime that may already depend on a previous branch.
  • else(expr) selects branch !branchof(expr). Depending on expressions (the earliest one being else(if(condition))) ensures that topologically, negated branches depend on plain branches.
  • merge(a,b,...) selects branch branchof(a) | branchof(b) | ...
  • generic expression op(a,b,...) selects branch branchof(a) & branchof(b) & ...

Our problem of assigning branches to expressions has then become a problem of bringing boolean expressions into canonical form. For the canonical form, we choose sums of products of optionally inverted branches:

branchof(expr) = ((!)?x11 & ... & (!)?x1n) | ... | ((!)?xm1 & ... & (!)?xmn)

For the code generation, we require that products are ordered by lowest id first, and that sums are ordered by lowest starting id first. Then a product term is equivalent to a conditional stack of branches, and a sum term is equivalent to instancing the expression in multiple unrelated branches.

We also simplify and canonicalize expressions:

  • !?true is ?false, !?false is ?true
  • ?x & ?true is ?x, ?x & ?false is ?false
  • ?x & ?x is ?x, ?x & !?x is ?false, !?x & !?x is !?x
  • ?x | ?true is ?true, ?x | ?false is ?x
  • ?x | ?x is ?x, ?x | !?x is ?true, !?x | !?x is !?x
  • (?x & ...) | ?x is ?x
  • (?x & ?y) | (?x & !?y) is ?x
  • (?x | ?y) & ?z is (?x & ?z) | (?y & ?z)
  • !(?x & ?y) is !?x | !?y
  • !(?x | ?y) is !?x & !?y
  • (?x | ?y) & (?z | ?w) is (?x & ?z) | (?x & ?w) | (?y & ?z) | (?y & ?w)
  • !((?x & ?y) | (?z & ?w)) is (!?x & !?z) | (!?x & !?w) | (!?y & !?z) | (!?y & !?w)
  • And so on. I'm surely going to find a few more looking at results.
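
A hypothetical Python sketch of this canonicalization, representing a product as a frozenset of (name, polarity) literals and an expression as a frozenset of products; the helper names are mine, and the simplification is limited to what set semantics, contradiction pruning, and absorption give for free:

from itertools import product

TRUE  = frozenset({frozenset()})    # one empty product
FALSE = frozenset()                 # the empty sum

def lit(name, pol=True):
    return frozenset({frozenset({(name, pol)})})

def prune(products):
    # drop contradictory products (?x & !?x) and absorbed products ((?x & ...) | ?x)
    ps = {p for p in products if not any((n, not pol) in p for n, pol in p)}
    return frozenset(p for p in ps if not any(q < p for q in ps))

def b_and(a, b):
    return prune(p | q for p, q in product(a, b))

def b_or(a, b):
    return prune(a | b)

def b_not(a):
    out = TRUE
    for p in a:    # De Morgan: each product inverts to a sum of negated literals
        out = b_and(out, frozenset(frozenset({(n, not pol)}) for n, pol in p))
    return out

x, y = lit("x"), lit("y")
assert b_and(x, TRUE) == x and b_and(x, b_not(x)) == FALSE
assert b_or(b_and(x, y), x) == x                        # (?x & ?y) | ?x -> ?x
assert b_not(b_and(x, y)) == b_or(b_not(x), b_not(y))   # !(?x & ?y) -> !?x | !?y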

As sums require instancing expressions to multiple branches, we might want to reduce duplication. This can be achieved by lifting sums into a new branch built from boolean or expressions executed at runtime, so that merge(if(a),if(b)) becomes if(or(a, b)).

Likewise, it is possible to drop boolean not expressions to rewrite if(not(x)) as else(if(x)), as well as boolean products, so that if(and(a, b)) becomes merge(if(a),if(b)), useful for when branches ?a and ?b exist already.

In this system, switch cases are implicitly specified as if(x == case1), ..., if (x == caseN) whereas default cases are equivalent to else(merge(if(x == case1), ..., if (x == caseN))). It will require pattern detection to reconstruct switch cases for the code generator, but this saves the programmer from having to think about it.

In a secondary pass, branches may propagate backwards, from sink to source, as expressions may depend on conditional subexpressions that only appear in certain branches, but have been defined unconditionally. This optimization is particularly useful when all dependent branch expressions are single products sharing one or multiple factors, and allows us to provide the classical post-select pattern select(cond, a, b), implemented as merge(then(if(cond),a),then(else(if(cond)),b)), which moves all terms of a and b into if(cond)-dependent branches.

6.3.2021

To translate our functional dataflow DAG to nested SSA, we need to figure out how to create control flow from flow control, i.e. how to sort conditional flow into nested conditional blocks. We provide four forms that we can map to control flow structures in SSA:

  • case(lhs, rhs) evaluates to none when the comparison of integer or boolean values lhs == rhs is false (terminating all dependent processing) and otherwise returns a bang. rhs must be a constant. Multiple forms using the same dynamic condition will control a shared switch instruction. Dependent expressions will become part of the case basic block for rhs.
  • default(lhs) is active when no case() instruction for lhs has succeeded. It is used to generate default cases for switch instructions that share the same lhs argument.
  • then(lhs, rhs) ignores the content of lhs, but evaluates to rhs when lhs is active, otherwise none. It can be used to propagate the conditional status of a value (such as a bang generated by case or default) to another (typically a constant). It generates no instruction.
  • merge(value, ...) merges the control flow of multiple conditional values of same type from different branches and begins a new basic block that is dominated by their shared scope. It likely generates a phi instruction.

Through case and default, simple conditional branching can also be implemented, where if(condition[, value]) is equivalent to case(condition, true[, value]), and else maps to the default operator.

Every node can in principle depend on only one (the most recent) flow control operation, and in the first step, we must annotate for each node the conditional node it belongs to, propagating from nodes to their dependent nodes. A conditional node dependency can only be a case node, a default node, a bang (indicating always-on) or a none (indicating always-off). case and default both return bangs that depend on them, while the conditional nodes themselves depend on the parent conditions. then takes the conditional node of lhs and applies it to rhs, where the conditional node of rhs must be a conditional node of lhs. merge requires that all its inputs' conditional nodes share the same parent, and propagates this parent as conditional node, effectively going up the hierarchy by one or more levels.

When a node sources separate input values with different conditions of which at least one condition is not a parent of or equal to the others, it will never be activated and thus its condition becomes none.

When nodes are translated to instructions, a scope is then mapped to each unique case and default node (as well as a top level scope that maps to bang). Instructions are then generated in topological order to the respective scopes based on the conditional nodes they depend on, where the topology must ensure that unconditional dependencies of conditional expressions are evaluated before their conditional branches are built, i.e. switch blocks can only be appended to a scope when all shared expressions their subscopes depend on have been generated, and when every branch's return value is known, which is when all merge nodes for a conditional set have been encountered.

In addition to tracking each node's conditional dependency, it also needs an order number to indicate which basic block belonging to the conditional scope we are generating into (as e.g. the main scope is left to evaluate conditionally, then returned to, then left again).

Take this example

# an m that is only evaluated when k == 2
let m = (then (case k 2) m)
merge
    then (case k 0) "zero"
    then (case k 1) "one"
    then (case m true) "two A"
    then (default m) "two B"       
    then (default k) "?"

which produces this simple DAG SSA with conditional annotation and reordering

%k [0:]
%2 [0:] = case %k 0
%7 [0:%2] = then %2 "zero"
%3 [0:] = case %k 1
%8 [0:%3] = then %3 "one"
%0 [0:] = case %k 2
%1 [0:%0] = then %0 %m
%4 [0:%0] = case %1 true
%9 [0:%0, %4] = then %4 "two A"
%5 [0:%0] = default %1
%10 [0:%0, %5] = then %5 "two B"       
%12 [1:%0] = merge %9 %10
%6 [0:] = default %k
%11 [0:%6] = then %6 "?"
%13 [2:] = merge %7 %8 %12 %11

sorting nodes into blocks:

%k [0:]
%2 [0:] = case %k 0
    %7 [0:%2] = then %2 "zero"
%3 [0:] = case %k 1
    %8 [0:%3] = then %3 "one"
%0 [0:] = case %k 2
    %1 [0:%0] = then %0 %m
    %4 [0:%0] = case %1 true
        %9 [0:%0, %4] = then %4 "two A"
    %5 [0:%0] = default %1
        %10 [0:%0, %5] = then %5 "two B"       
    %12 [1:%0] = merge %9 %10
%6 [0:] = default %k
    %11 [0:%6] = then %6 "?"
%13 [2:] = merge %7 %8 %12 %11

3.3.2021

I'm growing increasingly fond of in-memory formats that can be marshaled without stream processing, but it so rarely happens that the fates align so that what's good for storage is also good for processing. Here's one that is inspired by SPIR-V, can be used to describe any DAG, and permits sequential processing directly from the stream.

We describe our graph as vertices (nodes) stored in an array of words, where each word is of type uint32. Each vertex entry begins, minimal RIFF style, with two packed uint16 integers, the first describing the number of input edges that follow (allowing us to deal with a variable number of edges (arguments) and skip unknown vertex formats), the second specifying an implementation dependent vertex type that tells us the format of those edges and the function of the vertex node.

Note: How many bits are allocated to size and type is variable and subject to size of possible parameter space. The argument count can be left out if all vertices have the same number of edges or the number of edges solely depends on the vertex type and there are no unknown vertex types. The vertex type can be left out when all vertices have the same type.

Edges / arguments may be constants or references, depending on the vertex / node type schema, which we must have knowledge of. For constants, values larger than a uint32 word are spread over multiple words, while values smaller than a uint32 are padded, with the exception of strings. For strings, four characters are fit per word, the string is zero terminated, and unused characters in the last word are padded out with zeroes.

For references in a DAG, an edge sources a vertex by its index (counting from 1), where 0 is reserved to specify the null or undefined case, where needed. Furthermore, and this is where we diverge from SPIR-V, an argument may only source nodes earlier encountered, i.e. whose index is smaller than the index of the node we are currently describing. This produces a topological ordering in which we can guarantee that all dependencies of a node have been processed before the node is encountered.

Chronological processing of nodes is then as simple as visiting the array from front to back. Tasks like lifetime analysis or search will often require reverse iteration, which needs a mapping from index to word offset within the array.
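
A hypothetical Python sketch of such a stream; the exact field packing and the helper names are my own choices, not part of any spec:

def emit_vertex(words, offsets, vtype, args):
    offsets.append(len(words))                  # 1-based index -> word offset mapping
    words.append((len(args) & 0xffff) | (vtype << 16))
    words.extend(args)
    return len(offsets)                         # the new vertex's 1-based index

def visit(words):
    offset = 0
    while offset < len(words):
        count, vtype = words[offset] & 0xffff, words[offset] >> 16
        yield vtype, words[offset + 1 : offset + 1 + count]
        offset += 1 + count

words, offsets = [], []
a = emit_vertex(words, offsets, 1, [42])        # e.g. a constant vertex
b = emit_vertex(words, offsets, 2, [a, a])      # a vertex referencing it twice
for vtype, args in visit(words):
    print(vtype, args)                          # 1 [42]  /  2 [1, 1]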

The most interesting bit about this format is that temporary metadata for nodes required during processing (e.g. reference count) can be made available on sidechannel arrays which are addressable with the same indices.

2.3.2021

These are the specs of our IL so far:

  • Pure functional dataflow
  • I/O at endpoints
  • Strong types
  • Statically compiled
  • All forms eagerly evaluated
  • Control flow becomes flow control, supported through propagating option types, needs only four tokens: none, on, on? and any

These are the open points I still need to document:

  • Structural typing: types are compared by their definition. A unique type is made by creating a unique definition.
  • Abstract types: the typechecker starts by assuming a superset of all supported types, and decides which datastructures to use based on how data is altered throughout the functional graph. This requires topological iteration of the graph from both sides.
  • Closures: we can trivially support compile time closures, which must always be inlined, but I haven't thought about how to support runtime closures yet.
  • Loop forms: we support parallel and sequential processing, but the semantics aren't entirely clear yet.
  • Packing: to reduce generated code size, repeating patterns should be packed back into assembly functions.
  • Lexical scope reconstruction: code generation requires sorting operations into nested blocks. Loops will be easy but I haven't thought too much about nesting conditional flow correctly.

1.3.2021

Push logic can be converted to pure functional logic simply by specifying that every expression result is of an Option type T?, and that a simple function invocation is implicitly gated by a constraint that defines the result to be active if all arguments are active.
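
A minimal sketch of that implicit gating rule, using a hypothetical (active, value) pair to stand in for the option type T?:

    #include <stdbool.h>

    typedef struct { bool active; int value; } OptInt;

    /* an ordinary function lifted over option arguments: the result is active
       only if all arguments are active; otherwise nothing is pushed */
    static OptInt opt_add(OptInt a, OptInt b) {
        OptInt r = { a.active && b.active, 0 };
        if (r.active) r.value = a.value + b.value;
        return r;
    }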

Constant expression folding then allows us to build conditional dependencies, so that for every graph node we can specify that node N is active if its parent node P is in-/active (the relationship can be inverted). If P in turn depends on parent node Q, then N depends on Q, and so on. This way we can propagate states forward and sort nodes into sets where all nodes of a set share the same activity conditions and therefore belong in the same scope, which allows the compiler (which needs to generate an SSA form) to sort nodes into basic blocks.

The question is if true and false still need to be distinct values separate from the option type, or if they can be perfectly absorbed by specifying that any non-empty value equals true, and any empty value equals false. What is the result of none != bang then? The problem here is that because the left hand side is empty, by standard rules, nothing can be computed, and the result is also none. Likewise, none == none as well as none != none will also result in none, which makes empty values behave more like NaN floats (minus the actual NaN check) than false values.

But then we get to the magical not x operator, which, by standard rules, is unusual: not bang is none, while not none should also be none, but that renders the operator useless. or is the only operator which we have specified to select the first active input, so we could try building not out of x and none or bang, but that also won't work. let true false = (branch x); false would be the correct way to build a not operator, in which not none equals bang.

But not (a == b) is supposed to equal (a != b), which works when both a and b are active, but when either of the two values is empty, then not (a == b) will evaluate to bang whereas (a != b) will evaluate to none. So all comparison operators with equality checks need to treat none as false; so that (none == none) and (none != bang) will both evaluate to bang. It causes comparison instructions to turn dataflow back on, even though inputs are off, and that's confusing.

So we'll need separate boolean values after all, and so we can have four somewhat related values: true, false, bang (which are all active options) and none (which is the empty value of the option type).

This means that the signature of branch also changes a little, from (bang | none) (bang | none) <- (bang | none) to (bang | none) (bang | none) <- (bool | none). branch is now used to transmit active flow to one of two outputs, whereas the boolean value controls which of the two will be active, and an empty input causes both to be inactive. It is the fabled extra state of the handheld game in the Tao of Programming:

A Master Programmer passed a novice programmer one day.

The Master noted the novice's preoccupation with a hand-held computer game.

"Excuse me," he said, "may I examine it?"

The novice bolted to attention and handed the device to the Master. "I see that the device claims to have three levels of play: Easy, Medium, and Hard," said the Master. "Yet every such device has another level of play, where the device seeks not to conquer the human, nor to be conquered by the human."

"Pray, Great Master," implored the novice, "how does one find this mysterious setting?"

The Master dropped the device to the ground and crushed it with his heel. Suddenly the novice was enlightened.

This change to branch then ensures that branch works by the same rules as any ordinary function, and can not magically turn signal flow back on.

But do we need to change or? We already defined the binary operators | and & in addition to and and or, which operate on booleans. and already works by standard rules. But or is special: by standard rules, true or none should result in none (because an argument is empty) but it results in true. Likewise, false or none should result in none (because the first parameter is false), but it results in false, whereas false or true will result in true. It is also all too easy to accidentally turn flow back on if the conditional bang isn't sourced in all subexpressions: let doit = (branch x); ((doit and f)) or (g) would always invoke g if doit were empty; whereas if or worked by standard rules, an inactive doit would cancel the entire operation.

Under push logic, lazily evaluated and and or are impossible anyway. So these operators simply don't exist in this form.

Instead, I'd propose an any operator which merges the flow of multiple branches, and always forwards the first active argument.

With the existence of real booleans, branch can be further reduced to an operator that converts bool to bang: on x would bang if x is true, and otherwise do nothing. Then let yes no = (branch x) can be translated to let yes no = (on x) (on (not x)). While the branch form is much closer to condbr, this allows us to separate conditional flow at first and fuse conditionals during optimization. The typical use of on would be to initiate conditional flow and forward a value: (on x) and y, so at least syntactically it makes sense to fuse the extra parameter and turn the form into on x y, which reads as: if x is true, push y.

With these alterations, it would be possible to build traditional functional flow completely without on and any, using only eager forms of select, and and or. But it would require the use of flow control to avoid sending I/O commands, where the null state for a value isn't sufficient. The compiler will also use flow control to generate code for I/O input events, so that the user can always build flow for the "there is data" case, which is then optimized out.

One more requirement remains, which is the ability to turn on flow when an input is empty. The old form of branch could do this, but the new one can't. on ((any optional-input always-input) == always-input) would do the trick, provided optional-input never carries the same value. A query version of on is thinkable, which converts flow back into a boolean: on? x will always be active, and return true if x is active, and false if x is empty, effectively recovering the argument passed to on. Then on (not (on? x)) can be used to perform operations that depend on inactive input arguments.

So these are all building blocks of flow control and explicit lazy evaluation: the none constant (representing an inactive input), and the special forms on, on? and any. It is up to the compiler to isolate basic blocks and to generate condbr, switch and phi instructions from this information. For SPIR-V, we also need to construct valid lexical scope, and for this reason Scopes generates nested SSA as a preprocessing step.

I haven't thought about loops much yet, and there are surely a few corner cases to consider here.

28.2.2021

Continuing to think about push logic, which can be modeled with elements from the functional world, but has some idiosyncrasies that need to be addressed separately.

First, let's reiterate the basic boolean conversion rule from the last post, which is that we do not send boolean values through the system. Instead we treat any data that has been pushed as true (allowing us to also push void data, which Pure Data calls a bang), and expressions that never evaluate to anything as not pushing, so dependent expressions aren't triggered. A false is therefore never sent, and all boolean operators therefore perform operations on active and inactive flow of arbitrary data, rather than boolean values.

In functional programming, conditionals are implicitly lazily evaluated: if a then b else c only executes the expressions b and c depending on whether a evaluates to true or false. But with push logic, if a then b else c becomes a dataflow gate that exchanges the flow of a for b, otherwise for c, and in this case, all the stations that b and c have passed before have already executed their transformations.

(If we imagined dataflow as the physical processing of beads, then this gate would indeed stop the processing of b or c, but only after all previous stations have been passed once and the bead lanes are full. In our evaluation model however, we only ever feed constants and system parameters as single beads into this highly structured pachinko machine, and at the end of each loop, we are back to an empty setup.)

To build a push logic gate that performs lazy evaluation, we need an operator o1 o0 <- i that takes only one input i, and outputs flow to two independent streams o1 and o0, where o1 flows if i flows, and otherwise o0 flows. Whichever computations then need to be executed conditionally source either o1 or o0.

For compatibility with pull logic, we also use the general rule for classical functions that they only push a result when all inputs are satisfied, and all their outputs are available simultaneously i.e. they can't control flow at all. An expression that would then, for example, compute f(o0, o1) would never push a result, because either of the sources is always off. A programmer used to SSA form would throw their hands up in horror at the sight of expressions dominated by neither basic block, yet the expression is well defined from the perspective of the push system, it just doesn't do anything. (But the UI would do well to warn of such expressions and display them as inert)

For this reason, we also need a way to join flow, when the data returned by either branch is to be merged into a single flow. It turns out that the logical or operator is a perfect candidate, because it selects whichever input is pushing, and ignores the other. Therefore o0 or o1 will join two branches and, in SSA terms, act as a phi node.

Logically, any expression that depends on a conditional argument becomes part of its conditional set. An or operator has the ability to merge multiple conditional sets into a single one, but sets can contain subsets, and therefore it would take as many or operations as if operations to return back to a flow without any conditions.

It turns out that in push logic, and and or are able to replace (eager evaluated) if entirely: if a then b else c then becomes a and b or c.
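
A rough sketch of these eager operators in C, using a hypothetical option pair to stand in for flowing/non-flowing values; note that b and c have already been computed, only their flow is being gated:

    #include <stdbool.h>

    typedef struct { bool active; int value; } Opt;

    static const Opt OPT_NONE = { false, 0 };

    /* "a and b" pushes b only when both inputs are active */
    static Opt opt_and(Opt a, Opt b) { return (a.active && b.active) ? b : OPT_NONE; }

    /* "a or b" forwards whichever input is active, preferring the first */
    static Opt opt_or(Opt a, Opt b) { return a.active ? a : b; }

    /* eager "if a then b else c" becomes (a and b) or c */
    static Opt opt_if(Opt a, Opt b, Opt c) { return opt_or(opt_and(a, b), c); }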

Expressions can be sourced multiple times. When a result is pushed, it reaches each dependent expression simultaneously. But expressions can only source one result per argument, which is already defined by the rule that arguments are assigned by labeled edges. Even when arguments are modeled as vertices, as in Pure Data, we don't know how to resolve the conflict of two values arriving. Countless different operations are thinkable: prefer the greater one, add both together, concatenate them, etc. Pure Data effectively leaves this to the implementation of the individual operator, which creates dozens of special cases and can be quite confusing. If Tukan supported connecting multiple outputs to a single input, we should have the datatypes define how to merge a set. Though at this point I feel inclined to just say one input per output, and make the merge explicit. We'll see.

Moving on to polymorphism. In principle, we can use the idea of tagged unions to permit all expressions to be polymorphic. This particularly concerns or, since if itself only outputs bangs. The problem with polymorphic expressions is that they force code generation of branched behavior for dependent operations, and it would be much more practical to continue with separate branches up until the latest possible moment. Yet tagged unions give us the conceptual ability to apply one generic expression to separate branches, and it is still possible for the compiler to instantiate long, separate paths for different tags, i.e. merge as many instructions as it can into a single switch block, and fold tag checks where they appear.

Another problem with polymorphism is that SPIR-V compute shaders don't support unions. We may require that at least the storage has the same layout, which would translate directly, or we construct structs that reserve memory for all options, which is bulky but works, or we use a hybrid version where identical layouts fold into the same struct members, and otherwise they don't.

In closing, let's look at how the observations of today change the echo program example from the last post:

# example for an echo program

main
    # main declares its outputs. names starting in "io." are special system outputs,
      and only permitted in the main loop.
      
    # alias a node
      a comparison operator pushes an empty symbol when it is true (a "bang").
      io.readline only pushes when a line was received.
    let exit? = (io.readline == "\n")
    
    # signal io.exit when banged
    io.exit = exit?
    
    # constants are autopushed per iteration, but we only need to push this one once,
        for which we use the `io.setup` input, which bangs at startup, and then never again.
    io.prompt =
        # if io.setup bangs, send "> ", otherwise nothing happens
        io.setup and "> "
    # send characters to output for this iteration
    let yes no = (branch exit?)
    io.stdout =
        or
            yes and "exiting...\n"
            no and io.readline

Note that in the above program, it is possible that io.readline is not pushing; which means exit? is not pushing; which means branch exit? pushes no; no and io.readline will not be pushing because io.readline is not pushing; the result is that no data is pushed to io.stdout, which means no stream writing is triggered on the system side. This way it is possible for the compiler to generate different code paths for different input events, separate or combined, and to deterministically optimize out instructions that do not matter in context.

27.2.2021

The Tukan execution model. The basic idea here was to invent a program model where the semantics didn't distinguish between CPU and GPU code. The program would be described as a dependency graph, and hence its compiler would be a dataflow solver. The solver decides which operations are suitable to run on the GPU and generates all intermediate buffers and graphics API instructions. I should mention that seeing the work done on Futhark has been greatly encouraging in trying to design this system, and Pure Data is also highly relevant to the design.

Being a dependency graph, the program can't contain any cycles, so recursions in both functions and datastructures are off the table. Compute shaders aren't running on stack machines, so absence of functional recursion also happens to be GPU friendly. State machines with cyclic flow paths are impossible too, but highly desirable. The solution is to allow specifying loops with isolated interior dependency graphs, in which we permit feeding output data back as input data using loop arguments (very similar to the loop form of Scopes). A node that is the output dependency of a loop may only depend on a loop argument from the same or a parent loop. But maintaining state on the top level is still an open problem, which the next paragraph will also address.

We also want to ensure that a program can be interrupted on a per-frame basis so that state can be captured and processes paused/resumed and even reverted to an earlier state or a state imported from a different machine. In other words, all program state is always marshalable, virtualizable and sandboxable. The benefits of this constraint are undeniable. The solution for this was to model the entire program as the body of a loop, which fits the message loop concept of interactive applications pretty well already. I/O interaction with the system happens outside this loop, with requests being output and results being input. This includes interfacing with user-controlled mutable memory, which is written to with outputs and read from with inputs as well. This allows the entire program definition to be pure and stateless.

Multithreading and data synchronization are transparent features that should not need to be controlled explicitly. Flow analysis and performance estimation heuristics allow the compiler to discern which loops may run in parallel, and which may benefit from massive parallelization.

I feel that I should start with an interpreter for research, and through that understand better which parts can be specialized. The interpreter isn't suitable for production and can't service the GPU either, but to write a compiler for a yet unexplored and only half-understood semantic model is too wasteful, as the model surely is going to be improved with usage.

The IL is also not supposed to have a parseable text-based language, as editing will be visual, but it is easier to start off with a DSL in Scopes with which I can explore its programming before the translation to GUI.

Here is a proposal.

# example for an echo program

main
    # note that we generally use push logic, which means that operators
      have the ability to not 'return' a result at all, ending the entire evaluation chain.
      
      main declares its outputs. names starting in "io." are special system outputs,
      and only permitted in the main loop.
      
    # alias a node
      a comparison operator pushes an empty symbol when it is true (a "bang").
      io.readline only pushes when a line was received.
    let exit? = (io.readline == "\n")
    
    # signal io.exit when banged
    io.exit = exit?
    
    # constants are autopushed per iteration, but we only need to push this one once,
        for which we use the `io.setup` input, which bangs at startup, and then never again.
    io.prompt =
        # `?` pushes its second argument when the first argument is received, otherwise the third, if specified
            the third argument can only be sent when it is proven that the first argument will never be received.
        ? io.setup "> "
    io.stdout =
        # the `not` operator is equivalent to (? x none bang)
        ? (not exit?)        
            io.readline
            # none is an empty symbol that can not be sent
            none

26.2.2021

A long time ago, I had a fun idea for cache-style memory management: record newly allocated and touched memory at front of a queue. When the memory budget is spent, blocks from the tail are evicted to make space (equivalent to LRU policy). It was a joke at the time, but it turns out that there's some sense to it. At first it sounds absurd, right? The app doesn't access a buffer for a while, and it's gone.

Turns out, lost buffers are no longer an issue when their contents can easily be reconstructed. There are two ways to do this: reconstruct it procedurally, or backup the buffer to disk on eviction, and load it back when requested. With well defined rules to reconstruct lost buffers, application dependent cache can then also be refreshed simply by evicting buffers early. Buffers could also specify pointers that are loaded lazily i.e. when the dependent content is actually needed. Serializing the whole app is as easy as backing up the remaining entries in the queue to disk.

An offline garbage collector would still be required to track and clear unused memory on disk, but it can run entirely parallel to the application.

On concurrency: We need to make sure that a buffer doesn't get evicted while we're still operating on it. The simplest idea I can think of is to equip it with an atomic lock for the time that it's in use, effectively pinning it, which is something we might also want to do explicitly. It wouldn't be very comfortable to use in C, but in Scopes we could make it nice.

There will be an added cost on pointer dereferencing which suggests we should make use of indices as much as we can and use pointers only to retrieve base addresses.

The suggestion came up to use plain memory maps for serialization, but if I'm not mistaken we then lose the ability to decouple memory from disk or network representation, to put codecs in the cache pipeline, to provide asynchronous loading and to disable writeback from buffers selectively. For starters, LMDB is a useful layer that provides content mapping, allocation management, deduplication, storage expansion, MRSW synchronization, transactions and crash recovery, all features that mmaps don't have out of the box. I can't imagine myself writing a custom solution at this point, but perhaps it turns out that mmaps are worth it after all.

I do like about mmaps that caching becomes transparent, and streaming from disk is an implicit feature. But we're still left dealing with temporary memory that we explicitly don't want serialized; the network layer and any form of lazy computation would be a separate system too. It sounds like a system that would start out simple and then end up requiring exactly the same mechanisms as the original manual alternative.

Anyway, first idea for an implementation. We have a fixed array of buffer descriptors whose address never changes and a matching queue. Buffer descriptors hold an optional constructor function pointer (which performs the loading I/O), a destructor function pointer (which performs the eviction I/O), the buffer size (so we can calculate budget costs), its atomic mapping refcount (so we can pin the buffer), auxiliary data for the constructor function (where to source the content; typically a hash or uid in DB, socket index, filepath etc.), its index in the queue and the actual content pointer (which is null if not yet loaded). Cache buffers never point inside other cache buffers; instead, they pass and resolve indices to buffer descriptors, which we use as handles. The queue is implemented as a ringbuffer with atomic write > read indices. Each queue element holds a handle whose content has been loaded.

When we define a new cache buffer, we either reuse a free slot for the descriptor or append. When a handle is mapped and the content pointer is zero, the constructor function is executed (which is able to create additional handles) and returns a new address, which is assigned to the content pointer, the mapping refcount is increased; when the handle isn't in the queue yet, the total cache size is incremented by the buffer size. When the handle is then being unmapped, the mapping refcount is decreased and the handle is moved to the front of the queue. When there is no more space in front of the queue or the total cache size exceeds the budget, handles whose mapping refcount is zero are evicted from the back of the queue until either constraint has been satisfied. When a handle is evicted, its destructor function is executed and content pointer nulled. Read and write index of the queue's ringbuffer both skip over mapped handles. The read index also skips null indices, which can be caused by moved or explicitly evicted queue elements.
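
A rough single-threaded sketch of the descriptor and the map/unmap path; names are mine, the atomic refcount and the ringbuffer bookkeeping are left out, and enqueue_front/evict_from_tail are assumed helpers:

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t Handle;                      /* index into the descriptor array */

    typedef struct BufferDesc {
        void  *(*ctor)(struct BufferDesc *);      /* loading I/O; returns the content pointer */
        void   (*dtor)(struct BufferDesc *);      /* eviction I/O */
        size_t   size;                            /* counts against the cache budget */
        uint32_t refcount;                        /* >0 pins the buffer (atomic in practice) */
        uint64_t aux;                             /* e.g. DB uid, socket index, filepath id */
        uint32_t queue_index;                     /* position in the queue, or ~0u if absent */
        void    *content;                         /* NULL when evicted or not yet loaded */
    } BufferDesc;

    extern BufferDesc g_descs[];                  /* fixed array; addresses never change */
    extern size_t     g_cache_size, g_cache_budget;

    void enqueue_front(Handle h);                 /* assumed: (re)inserts h at the queue front */
    void evict_from_tail(void);                   /* assumed: pops an unpinned tail handle, runs its dtor */

    void *cache_map(Handle h) {
        BufferDesc *d = &g_descs[h];
        if (!d->content) {                        /* cold: run the constructor */
            d->content = d->ctor(d);
            if (d->queue_index == ~0u)            /* not in the queue yet */
                g_cache_size += d->size;
        }
        d->refcount += 1;                         /* pin while in use */
        return d->content;
    }

    void cache_unmap(Handle h) {
        BufferDesc *d = &g_descs[h];
        d->refcount -= 1;
        enqueue_front(h);                         /* most recently used goes to the front */
        while (g_cache_size > g_cache_budget)     /* shrink back under budget; assumes
                                                     enough unpinned handles at the tail */
            evict_from_tail();
    }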

A command queue could be used to perform asynchronous mapping of handles.

25.2.2021

Continuing from the last post: thoughts about common structure types in the memory DB, as I'm calling it now.

To start off, we need an entry point into the DB. The simplest way is to link a bootstrap key to the uid of a root structure from which all data is accessed, comparable to the root directory of a file system. (In some ways, our database is comparable to a file system, but files and folders are fused into a single structure, there is no mandatory metadata, and changes do not overwrite data.)

A single root key has the added advantage that the entire state of the DB is attached to a single value, and keeping historical records is as easy as having the new root record link to the previous one.

I've isolated three interesting generic datastructures that are suitable for the DB: for maps and vectors, two separate tree structures of complexity O(log n) which always produce a canonical image regardless of the order in which edit ops are executed; and for string and big array editing, piece tables, where the complexity is sublinear, although piece tables have no canonical image.

It's also important to keep in mind that LMDB itself already uses an O(log n) structure, and caching is the only way to optimize access times. With immutable records, caching is particularly useful, and we would benefit from standardizing the way caching works early on, so there is a minimal guiding principle that covers all use cases.

I'm increasingly fascinated with piece tables; they could make a good basis for an extension of array datastructures, including strings, for the following reasons:

  • Array structures are well understood and CPU as well as GPU cache friendly.
  • Random access is O(k) for k pieces (O(log k) with precomputed auxiliary information and binary search), while bidirectional iteration from a cursor is O(1).
  • Permits decoupling of index-shifting element insertion and removal from updating the array, at the cost of O(k) for the piece table when the edit is not at an existing boundary.
  • Overhead grows with number of edits, not with size of array.
  • Lookups require only one indirection.
  • Appending is the cheapest operation.
  • Piece table can also be edited as a mutable linked list.
  • Operates on immutable data: a piece table describes an alteration of an original array without changing it.
  • Is invertible i.e. from a previous buffer and a piece table describing the next buffer, we can make a next buffer and a piece table describing the previous buffer. This allows us to make access into the current version of a buffer fast, while compacting previous versions. We can build implicit undo/redo features as well as versioning.
  • Canonical hashes are possible by hashing through the piece table. Two piece tables with different layout can be compared to produce the same string, and a cheaper piece table can be swapped for a more expensive one.
  • Is nestable; a piece table can describe alterations of another piece table, and so on. This allows us to repeatedly compact changes.
  • Piece tables can be partially or fully "baked" into flat arrays, and then operated on by a new piece table. Therefore, table size can be traded for content size, and access can be updated depending on load; when random access becomes too costly, pieces can be merged and create a new buffer. In this way, piece tables contain their own cache.
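
A minimal sketch of the read side in C, assuming pieces reference either the unchanged original buffer or an append-only staging buffer:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t which;      /* 0 = original buffer, 1 = staging (append) buffer */
        size_t  offset;     /* start within the source buffer */
        size_t  length;     /* number of elements covered by this piece */
    } Piece;

    typedef struct {
        const uint8_t *source[2];   /* [0] original, [1] staging buffer */
        const Piece   *pieces;      /* pieces in document order */
        size_t         count;
    } PieceTable;

    /* random access is O(k) in the number of pieces; a cursor that remembers
       the current piece makes sequential iteration O(1) per element */
    uint8_t pt_at(const PieceTable *pt, size_t index) {
        for (size_t i = 0; i < pt->count; ++i) {
            const Piece *p = &pt->pieces[i];
            if (index < p->length)
                return pt->source[p->which][p->offset + index];
            index -= p->length;
        }
        return 0; /* out of range */
    }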

Some thoughts on hashing in general: Hashing big arrays is expensive, as the hash requires a walk through the entire array. For a partially changing array, it is better to organize a loglevel hashing structure, in which we hash in e.g. 512 byte chunks; an 8 MB buffer then requires a buffer of 16384 hashes (512K of SHA256), which in turn requires a 32K buffer, which requires a 2K buffer, which requires a 128 byte buffer, that can then be hashed to a single hash. Every in-chunk edit then requires four hierarchical hash updates, where at least 1664 bytes are hashed and altered altogether, rather than the original full 8 MB buffer.
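
A sketch of the update path for one edited chunk; hash_chunk is a placeholder standing in for SHA256 (not a real API), and the level layout follows the 8 MB example above:

    #include <stdint.h>
    #include <stddef.h>

    enum { CHUNK = 512, HASH = 32 };

    /* placeholder: hashes len bytes of src into a 32-byte digest at dst */
    void hash_chunk(uint8_t dst[HASH], const uint8_t *src, size_t len);

    /* levels[0] is the data buffer, levels[1..depth] are the hash buffers
       (512 KB, 32 KB, 2 KB, 128 B for the 8 MB example); sizes[] holds their
       byte sizes. After editing inside one 512-byte chunk, only the covering
       chunk of each level is rehashed, plus the final root hash. */
    void rehash_after_edit(uint8_t **levels, const size_t *sizes, int depth,
                           size_t edit_offset, uint8_t root[HASH]) {
        size_t chunk_index = edit_offset / CHUNK;
        for (int l = 0; l < depth; ++l) {
            size_t off = chunk_index * CHUNK;
            size_t len = sizes[l] - off < CHUNK ? sizes[l] - off : CHUNK;
            hash_chunk(levels[l + 1] + chunk_index * HASH, levels[l] + off, len);
            chunk_index /= (CHUNK / HASH);   /* 16 hashes fit into one chunk of the next level */
        }
        hash_chunk(root, levels[depth], sizes[depth]); /* hash the last tiny buffer to the root */
    }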

It appears strongly necessary to perform such updates asynchronously. A piece table makes this easier, as it performs alterations in a staging buffer; a worker thread can use the table to serialize the new content to disk, or apply it to mutate a memory structure, and identify the areas that need rehashing, then swap out the piece table upon completion, or restart the work if more changes have been applied in the meantime.

Also, I see several key features of CPU caching repeated in the RAM <-> disk (<-> network) system we need to build here:

  • References initially remain thunks until a load is performed, which unmarshals additional data from the DB and unthunks in-memory references, which creates stalling.
  • Support for read-ahead can help with stalling, in which we asynchronously dereference in advance, and guarantee to unthunk at least n MB of memory, similar to the guarantee of CPU cache line size.
  • Because pool size is limited (for example, our budget is 2GB of main/GPU memory), and data can also be altered from a separate process, we need to be able to drop least accessed and invalidated data and "re-thunk" references to it, which means dropping information in a threadsafe way.
  • Pool data can be mutated, but mutations must be written back to the DB asynchronously, as versioned updates to persistent data structures. Piece tables (or a comparable datastructure) make it possible to cheaply resize buffers and shift buffer contents, as well as record which alterations took place. Change-bits for paged memory are a useful alternative as well.
  • A lot of this appears to be already supported by the OS through memory maps, but we need the LMDB layer to provide interprocess safety, protect against data loss and support compression as well as compaction. There's also no network layer on the OS level.

Another way to organize immutable edits (that do not move content around) is to use selective double (or n-tuple) buffering of pages with changebits: say we have a 8 MB source buffer of 16384 pages, each 512 bytes. We want to mutate elements in this buffer, but without interfering with existing threads still reading from it. So we create a new buffer of the same size (8 MB), copy over pages to the new buffer on write, and mark in an additional change flag buffer (2048 bytes to flag 16384 pages) which pages have been updated. The new buffer can even be larger, as long as the change flag buffer matches the new size; in this case, all new pages are flagged. Lookups then access the new buffer where flagged, and otherwise use the old buffer. A worker thread can then eventually fill all holes in the new buffer and swap the two buffers. It's reasonably efficient. Writes need to copy pages on first change, reads are only becoming difficult when they cross page boundaries.
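
A sketch of the lookup and the copy-on-write step, with 512-byte pages and one change bit per page (so 2048 bytes of flags for 16384 pages); names are illustrative:

    #include <stdint.h>
    #include <string.h>

    enum { PAGE = 512 };

    typedef struct {
        uint8_t *old_buf;     /* unchanged source buffer, still readable by other threads */
        uint8_t *new_buf;     /* same size (or larger); only flagged pages are valid */
        uint8_t *changed;     /* one bit per page of the new buffer */
        size_t   size;
    } PagedEdit;

    static int page_changed(const PagedEdit *e, size_t page) {
        return (e->changed[page >> 3] >> (page & 7)) & 1;
    }

    /* reads pick the new buffer where flagged, otherwise the old one */
    uint8_t read_byte(const PagedEdit *e, size_t i) {
        size_t page = i / PAGE;
        return page_changed(e, page) ? e->new_buf[i] : e->old_buf[i];
    }

    /* writes copy the page on first change, then flag it */
    void write_byte(PagedEdit *e, size_t i, uint8_t v) {
        size_t page = i / PAGE;
        if (!page_changed(e, page)) {
            memcpy(e->new_buf + page * PAGE, e->old_buf + page * PAGE, PAGE);
            e->changed[page >> 3] |= (uint8_t)(1u << (page & 7));
        }
        e->new_buf[i] = v;
    }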

You see that I'm thinking a lot about how to provide a general persistence layer that allows the user (the user being Tukan or Nowhere, which also needs to version data) to "just forget" that we're synchronizing data in the background and try to keep all operations non-destructive, while not incurring penalties that severely impair realtime performance. The hard part is figuring out which features go into the system level, and which features can be implemented in userland. My intuition tells me that we might support piece tables as a userland abstraction. But performant persistent data structures and memory management are something we need as a basic service. Treating RAM like another cache level, and evicting cache following an LRU pattern, we can treat garbage collection as an offline solution meant to save disk space. I described pointer descriptors in the previous entry as an implementation detail for that.

When working with data in hot paths, semi-persistent data structures are what we need: lots of unversioned mutations of hot data, but a point at which we compactly snapshot the program state to save it from further mutations.

Another idea for a datastructure, a growing array without copy-on-resize: the array is partitioned into optional pow2 sized buckets, of which each bucket is twice as big as the one before. Then index I into the array translates to bucket_no = findmsb(I+1), bucket_offset = I+1-(1<<bucket_no), bucket_size = 1<<bucket_no. Iterating the array cursor is as easy as if (++bucket_offset == bucket_size) { bucket_offset -= bucket_size; bucket_size *= 2; bucket_ptr = buckets[++bucket_no]; }. Buckets are allocated when needed; the only thing that would need to be resized here would be the bucket pointer array (which is an O(log n) step); I would start with the first bucket holding at least 64 bytes. It's a good append structure; what's less simple are memcpy ops into and out of it. With a fixed-size bucket array, it is also a good structure for single-writer multiple-reader multithreading.
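
The index translation written out in C; findmsb here is the highest set bit, and the sketch uses the pure formula (bucket 0 holds a single element) rather than the 64-byte minimum mentioned above:

    #include <stdint.h>
    #include <stddef.h>

    /* position of the most significant set bit; x must be non-zero */
    static uint32_t findmsb(uint32_t x) { return 31u - (uint32_t)__builtin_clz(x); }

    typedef struct {
        void   *buckets[32];   /* bucket b holds (1 << b) elements; allocated on demand */
        size_t  elem_size;
    } GrowArray;

    /* index -> (bucket, offset); buckets double in size, so growing never copies */
    void *grow_array_at(GrowArray *a, uint32_t i) {
        uint32_t bucket_no     = findmsb(i + 1);
        uint32_t bucket_offset = (i + 1) - (1u << bucket_no);
        return (char *)a->buckets[bucket_no] + (size_t)bucket_offset * a->elem_size;
    }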

I'm also beginning to see the general value of having worker threads that "bake" write-friendly but complex data structures into fast readable arrays. Readers are hybrid, in that they can read the complex version, but also the fast one when available. It's like JIT, but for data structures.

22.2.2021

Continuing about the on-disk format, specifically pointer descriptors.

I use a key-value store like LMDB. The store is meant to house programs but also userdata, in other words, we don't know entirely what users are going to store in it, but we want to make it as compact as possible, and provide an introspection guarantee, which makes it possible to automate and validate data conversion and schema updates. This guarantee is realized with structural type descriptors.

On the first level, we just hash buffers to keys for mapping. Buffers are immutable, i.e. once created, they are not altered. Every piece of data inserted into the store uses its hash as key.

We support pointers, but pointers are not memory addresses; they are secure, collision-free hash addresses (such as SHA256). We can also compact pointer sizes by mapping hashes to incrementing unique ids of smaller size (32 or 64-bit if need be), using uids as keys and addresses, and maintaining an auxiliary table of hash keys to uids for cold lookup. This reduces pointer size to 12.5%-25%.

As far as marshaling is concerned, the first level only requires a description of pointers in structures so it can aggregate and enumerate all dependencies of a block. A descriptor, stored as a hierarchy of separate nested buffers, would define a size or stride (so the number of elements in an arrayed buffer is its size divided by stride) and then list, in order, relative pointer offsets and the uids of their descriptors. Computing a buffer's universal hash would require hashing all data besides pointers verbatim, and resolving pointer uids to hash addresses before hashing. With this information only, data compaction and packaging as well as pointer schema changes (i.e. going from 32 to 64 bit) are supported. In addition, any nested structures appear flat in the descriptor, as it only cares about pointers, so processing would be fast.

The pointer descriptor is also sufficient to support content agnostic breadcrumb / zipper logic for immutable editing: as we descend into nested structures and make a change, the new hierarchy can be assembled merely from reading descriptors, making buffer copies and fixing addresses, without knowing anything else about the buffers in question.

The principle is portable beyond LMDB: a stream format might replace uids with stream offsets. In memory, uids could instead be pointers.

To make the system even more compact, we could require all pointers to always appear first in a buffer, which saves specifying relative offsets and allows payloads to remain consistently spaced: all it requires for a description is a stride and a list of descriptor sub-uids, one per pointer field appearing in the buffer. Apart from read-ahead caching concerns (note we still respect interleaving pointers with array stride), there is no reason why pointers must be scattered among fields within a structure. Note: There is one, which is composition of nested structs - reordering the fields means instructions like getelementptr are no longer straightforward.

On a separate channel, this reduces the descriptor format to a stride and a sequence of subsequent descriptor uids; but when the description becomes part of the buffers themselves, we would only require two fields: the stride size and the number of pointers. Setting a max stride of 64 KB and a maximum of 256 pointer fields per element, our descriptor fits snugly within a single DWORD, with 1 auxiliary byte to spare. When the descriptor is zero, the buffer does not contain any pointers, and the stride no longer applies.
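
A sketch of packing the inline descriptor into one DWORD; the exact bit layout (16-bit stride, 8-bit pointer count, one spare byte) is my assumption:

    #include <stdint.h>

    /* descriptor == 0 means: no pointers, and the stride does not apply */
    static uint32_t pack_descriptor(uint32_t stride, uint32_t num_pointers) {
        /* stride < 64 KB; up to 256 pointer fields would need a count-minus-one
           encoding, here we simply assume num_pointers <= 255 */
        return (stride & 0xffffu) | ((num_pointers & 0xffu) << 16);
    }

    static uint32_t descriptor_stride(uint32_t d)       { return d & 0xffffu; }
    static uint32_t descriptor_num_pointers(uint32_t d) { return (d >> 16) & 0xffu; }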

This concludes the pointer descriptor format.

What do we need on the next levels?

  1. A way to describe programs that operate on these buffers
  2. A way to recognize and update buffer schema (i.e. adding a new field to structs), as well as searching for buffers by type description
  3. A description allowing to visualize contents in an element inspector GUI

Residual idea for the type descriptor: Pointers aren't specified explicitly; instead we implicitly enforce every substructure whose size is greater than a hash key (32 bytes) to be referenced as a separate piece of data. This guarantees that no struct or array field is greater than 32 bytes, still allows packing vectors, and encourages data deduplication.

16.2.2021

The format of the AST is still not entirely nailed down. I prototyped a few solutions but I'm never happy. I want to start out with a generic interchange data format that has the following in-memory properties:

  • Homoiconic (can be manipulated by the programs that it describes)
  • Introspective (self-documenting schema, diffable/patchable, marshalable)
  • Versioned (supports undo/redo and rollback history through persistent data structure)
  • GPU Compatible (compute shaders must be able to read it; ability to edit would be perfect)
  • Compact (binary, good alignment, no redundancies, little overhead, packs well, fast to read, fast to edit)
  • Multithreaded & Massively Parallelizable
  • Acyclic (represents a directed acyclic graph in the most abstract, so there's always a linear topological order, which simplifies many algorithms)

Here are the puzzle pieces I've covered so far:

  • Homoiconicity is in principle easy to realize, but in this case we need GPU homoiconicity, which means compute shaders should be able to process and generate code, perhaps even massively parallel. That introduces some limitations; Processing must be stackless. References must be indices. A columnar memory format would fit the array structures that GPU APIs favor.
  • Introspection is implicit in JSON / XML / S-Expressions, but they carry significant overhead in memory, and still need an additional schema to enforce a particular structure. Much better are packed arrays and a descriptor that uses a structural type system to tell generic data readers about relations, alignments, sizes and content format, and can be hash-compared by specialized readers with fixed layout requirements. The greater difficulty here is that the descriptor requires a bootstrap format that should ideally also be describable with said descriptor, and also remain extensible and backwards compatible (so existing specialized readers can deal with schema changes). Columnar memory can be extended with new columns without disturbing old readers.
  • Versioning through immutable tree editing. I previously used a prefix tree representation that requires path copying, and thus read and write access is O(log n). It is also strongly dependent on a stack, which, as mentioned before, isn't nice for the GPU; it carries a lot of overhead in terms of unused slots, and it isn't compact enough for efficient reading. So this needs to be completely redone.

15.2.2021

On modeling the self-hosting editor: what if I merged the flex layout visitor with typechecking, or at least transferred its lessons? Single-pass typechecking performs a depth-first tree walk of expressions. Flex layouting first performs a depth-first tree walk of widgets to collect required sizes, then a breadth-first tree walk to assign final rectangles. Considering that layouting must be performed in two passes, I could leverage this two-pass structure for a more complex two-pass typechecker as well.

What can we do with a two-pass typechecker? We can do data structure layouting for a layout agnostic language. What does "layout agnostic" mean? It could mean that e.g. all composite structures, like tuples or arrays, are initially partial sets, whose field order isn't determined yet (unless explicitly set). In the first depth-first pass, we validate argument types and compute expression types as usual, but the types we use are abstract. In the second breadth-first pass, we decide on the actual layout of data structures and their specific hardware types, based on the usage we have seen.

Following the idea of merging typechecker and flex layouter, it would then only be consequential to treat the AST as both code and self-modifying markup tree. We know the hamfisted inverse approach, where e.g. XML is expanded with programming features, but to my knowledge nobody has started from an AST and added markup features. In visualization, we would naturally present the code structure, but let markup attributes change how it is presented, and, in classical Scheme style, allow code to be evaluated (analog to macro expansion) in order to generate new markup on the fly. With syntax functions, we also enable the structure to read and rewrite itself. This allows the AST to build UI to modify parts of itself.

11.2.2021

A plan ripens to create a semantic model in which vectors and arrays are the only composite data structures, and pointers are replaced with indices. Reading this, you will notice that it resembles databases, but as a systems programming approach.

A struct can be emulated by a set of arrays, of which each array represents a field. In other words, we store structs in rows, and fields in columns, in a column-first memory layout. The "struct" then becomes a struct-of-arrays, acting as a descriptor of its field columns. We call this a table.

For the system to be complete, column elements only need to be of the following types: integer, real, index (extensible with dynamic type, range and sparse index cache), and vector of integer, real or index.

This allows us to keep multiple struct definitions that share some, but not all of their field columns, and thus build venn diagrams of table definitions, a feature that traditional structs do not afford. For this it is necessary to maintain same row order in all overlapping columns, and so all columns connected through definitions must contain the same number of rows; we call this superset of connected columns a bank. A bank contains, in addition to all columns, a definition of all individual tables i.e. subsets of columns.

As SoA users already know, not only are columns able to use optimal alignment for each field, but flag columns can also be bitpacked. A single-flag column would pack at ratio 1:8.

A pointer field can be converted to an index column, which stores, as metadata, a single pointer to the table that each index applies to. An index column can depend on itself, which allows the implementation of linked lists.

A union can be emulated through an index column depending on a bank, of which a submask in the index defines which columns the index applies to, and thus defines the dynamic table type of the index. To keep the submask compact, column sets are enumerated, so that we can maintain a large number of columns (over 64), and yet are able to e.g. use 8 bits to enumerate 256 different tables.

Arrays of variable length (for example strings) must be stored element-wise into a single column, optionally using a null element as sequence terminator, and indices are used to address them. In this case, an index must be expanded to become a ranged index, so that offset and size can be known, and data compaction becomes possible later. This setup also makes it possible to handle strings on the GPU.
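
A sketch of a ranged index column addressing a shared character column; names are illustrative:

    #include <stdint.h>

    typedef struct { uint32_t offset; uint32_t size; } RangedIndex;

    /* two columns of a bank: an element-wise character column, plus a ranged
       index column with one row per entity; rows are addressed by index, not pointer */
    typedef struct {
        const char        *chars;
        const RangedIndex *name;
        uint32_t           rows;
    } NameColumn;

    /* returns a view into the character column; both columns stay plain arrays,
       so this layout remains GPU friendly */
    static const char *name_at(const NameColumn *t, uint32_t row, uint32_t *size) {
        RangedIndex r = t->name[row];
        *size = r.size;
        return t->chars + r.offset;
    }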

Adding a row to a bank is as simple as appending it. As indices are decoupled from memory locations, banks can reallocate their sizes to grow without any pointer updates being necessary (apart from the single column-to-bank/table indirections, which are pointers to pointers).

Removing rows out-of-order is trickier, as removal produces holes, which makes processing more difficult. There are multiple ways to deal with this:

  • Formalize a way to repeatedly swap the tail row with the row to be removed, then decrement the row count; this also requires updating all dependent index columns, which can become O(n²) costly if one does not want to maintain a linked list of dependencies.
  • Flag removed rows as invalid, then skip invalid rows in processing; regularly use compaction to free memory, and update dependent indices in the compaction process.

Memory could also be saved by treating a column, table or bank as sparse. The easiest way is to maintain an index map. Locally, we keep rows compact, and sparse indices are mapped to compact indices i.e. via hashmap or bisection. A sparse index caches its last known compact index and uses it for subsequent resolution, and the cached compact index is verified prior to each lookup (or as a separate fixup step), by checking that the row at the cached compact index is within bounds and has the right sparse index key. If there is a mismatch, then the new position of the key can only be lower than the old one, and we can bisect within that region or perform a linear search according to a discrepancy heuristic.
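
A sketch of the verify-then-fall-back lookup, using a plain linear backward search in place of the bisection/heuristic; names are illustrative:

    #include <stdint.h>

    typedef struct {
        const uint64_t *keys;    /* sparse index key stored per compact row */
        uint32_t        rows;    /* number of compact rows currently in the column */
    } SparseMap;

    typedef struct {
        uint64_t key;            /* the sparse index */
        uint32_t cached;         /* last known compact index */
    } SparseIndex;

    uint32_t resolve(SparseIndex *si, const SparseMap *m) {
        if (m->rows == 0) return UINT32_MAX;
        uint32_t i = si->cached < m->rows ? si->cached : m->rows - 1;
        while (m->keys[i] != si->key && i > 0)
            --i;                              /* the key can only have moved towards the front */
        if (m->keys[i] == si->key) {
            si->cached = i;                   /* refresh the cache for the next lookup */
            return i;
        }
        return UINT32_MAX;                    /* key no longer present */
    }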

A reverse lookup, i.e. querying which indices point to a particular row, is trivial when there's only one incoming edge. In any case, a column C keeps a list of columns that depend on it (which may include itself), as well as, for each depending column D, an auxiliary column first(C) holding the first index within D that points to this row. For multiple incoming edges, each depending column D must then also maintain an auxiliary column next(D) that builds a linked list of local indices all depending on the same target. Then a reverse lookup can be realized with two nested loops: the outer loop iterates all depending columns, the inner loop iterates all depending rows in these columns. This is a complex but doable setup.

28.1.2021

For my data structures, I'm increasingly beginning to think about GPU compatibility. For this reason, I don't want to use pointers in data structures anymore. Other advantages:

  • we get to choose the bitwidth of our indices, and an index can serve to access multiple memory locations in a SoAoS configuration.
  • data structures can be serialized / marshalled without a translation table.

The condition is that we now need to keep all objects in pools. Pools can be

  • self-referential, in which case it would be useful to maintain a left-indexing rule, in which every element index is greater than the indices of the elements it is pointing to, so that we at least keep an in-array DAG guarantee that allows us to split and compact pools with ease as well as play nice with (massive) multithreading. Example: Per pixel linked list.
  • referencing indices in one or more pools. Indices can hold category or hierarchy bits that allow them to point to more than one other pool. This opens a possible avenue for auto-building types for indices, according to analysis of pool dependencies.
  • Both.

Special situations occur when we

  • Need to delete pool elements: Moving elements from the tail into the gap breaks existing references and the left-indexing rule. Reallocating gaps easily breaks the left-indexing rule. Compaction by GC seems the most useful method, as it also covers the next bullet point.
  • Run out of pool space: We can asynchronously check that the pool always holds enough space to accommodate the max number of additional items we are able to add in the time it takes to set up and switch to a bigger pool. As soon as the boundary is broken, we can insert a processing step that transfers pool memory to a bigger pool.

7.1.2021

I'm working out how to formalize insertions/deletions in immutable composite structures. A common element is, as one descends, to record traversed nodes and their position as a path of breadcrumbs that can be re-composed when the new structure is built.

This is usually done with recursion, so that the information is saved implicitly in the stack, but it can also be done stackless, with a local dynamic array.

In order to avoid repeated copies of the same structure, it is better to perform multiple edits in one step, which requires retaining a parent element until all subelements have been covered. Exploring individual subelements can be safely threaded.

I'm thinking that the only composite datastructure should be a n-dimensional array (where n is in the range 1..3).
