I was hoping the work on Bangra wouldn't take much longer, but I was wrong. Bootstrapping a static compiler is really really hard. Absolutely none of my original ideas for it have panned out. All ideas that came instead where actually better. In the process of building Bangra, I gained even more useful insights:
Your hoist interpreter doesn't have to be written in a performant language, but should instead be written in a language that makes writing this interpreter particularly comfortable and reduces the work you have to bring into it, while still offering a foreign function interface. Compactness and maintainability matter over performance.
I was struggling to support continuation passing style with garbage collection in C++ in a compact, comfortable and debuggable way. The line count had swollen to about 8000 lines with no end in sight.
So, after doing some tests, I rewrote the interpreter in Lua within a week, to make use of LuaJIT's tail call optimization (permitting endless forward calling) and GC. The result has roughly 5000 lines. The language I worked on before, "None", also runs on LuaJIT, but the difference is that None compiled to LuaJIT bytecode. As a result, many Lua-specific idiosyncracies bled into None, such as the type system, Lua tables, API functions, the 60 upvalue/200 locals limit, or indexing starting at 1 instead of 0 by convention. Bangra runs in a bootstrap interpreter written in Lua to isolate it from Lua internals and keep control over the design.
What I have now is an interpreter running in an interpreter. That means Bangra runs about
4xtimes as slow where
xis an overhead constant incurred every time a process is virtualized. But it's fine, because we'll Futamura the shit out of it.
Writing a Cheney on the MTA interpreter by hand is laborious and renders the resulting code unmaintainable.
I don't want to get too much into this right now, but there are a bunch of devils in the details. Suffice to say, once you have allocated some memory, you don't ever want to move it again, especially when other memory depends on it. Also, CMTA has no way of tracking dead memory, so a special solution is required here too if you need to invoke custom destructors. I put a week of work aside to explore this avenue and it was a waste of time.
Church encoding helps reducing your type system to the absolute minimum required to permit the language to bootstrap itself.
The C++ implementation made extensive use of tuples (unnamed structs) to pass values between functions. I removed tuple types from the Lua port as soon as I realized that when you have closures, you already have tuples. In continuation-passing style, you can encode any tuple
(x0, x1, ..., xn)as a closure
fn (f) (f x0 x1 ... xn)where
fis a function that receives all tuple values. If
fdoesn't know or care how many elements the tuple stores, it can take a variable number of arguments. Furthermore, if the language supports multiple return values (and Bangra does), the closure can be simplified to
fn () (return x0 x1 ... xn), where
returnis an implicit return function provided by the caller.
The same goes for enums and the inevitable switch cases that use them, which leads us to the next point.
Intrinsic compare functions in conjunction with continuations provide a great way to decimate the number of cases to consider when doing comparisons between two values.
This is an original solution I haven't seen elsewhere yet, and which I'm particularly proud of. The challenge is to allow types to overload relational operators, preferably through a single function, while also keeping the dispatcher to these operators simple.
C++, Python and Lua for example require a minimum number of overloaded operators to consider the interface complete. If you specify operators for
>(greater), you can use negation to deduce
<=(less-equal), and through deduction, all other operators from the set
== != > >= < <=. Because some types offer fast operations for different cases, or have special values that don't compare (such as
NaNfor real types), the dispatcher that translates what the interface provides to what the user expects has to consider every permutation.
Composed from these operators, C (like many other languages) offers another approach to comparisons, mostly used in conjunction with sorting algorithms, a function whose C signature resembles
Ordering compare(T a, T b), where
Orderingis an enumerator (told you this would come up) with one of the outcomes
Greater. Here we have a single function dispatching each relational operator to an equivalent enumerator, and on the user side a switch case that is able to turn each enumerator value back into code flow.
compareis composable; we can comfortably build a
comparefor tuples from
comparefunctions for its elements.
So why not attempt to invert this relationship and, rather than composing
comparefunctions from relational operators, compose our relational operators from intrinsic type-dependent
comparefunctions instead? There are however two obstacles to pass before this solution works.
The first obstacle is that
compareonly covers ordered values, but we also need to compare unordered values such as structs (and values of unrelated type), or partially ordered values such as types (which can be unrelated or supertypes of each other) and reals (where a comparison with the
NaNtype is always unordered). To that end we add a fourth result to
Ordering, which is
Unordered, and covers all these cases where the result is neither
Greater. Special thanks here to @korenchkin, who provided me with the mathematical idea of ordering and its accompanying technical terms.
The second obstacle is the paradox of comparing the result of a comparison. In order to compute
(compare a b) == Equal, we need to expand it to
(compare (compare a b) Equal) == Equal, which expands to
(compare (compare (compare a b) Equal) Equal) == Equaland I think you get the picture.
We solve Zeno's Paradox of Comparisons by reaching once more for church encoding and turning
compareinto a four-way branch function of the format
fn branch-compare (a b f-equal f-unordered f-less f-greater)that compares
band, depending on the outcome, invokes one of the four continuations. Then the relational operators can be implemented with one-liners.
a >= bfor example expands to
branch-compare a b return-true raise-error return-false return-true.
There are additional upsides to this method. Relational operators often appear in conjunction with branches. Since
>=already created four branches, any branch that depends on the returned booleans can be folded into a direct invocation of
branch-comparehas no direct equivalent on hardware (as far as I know), but can be lowered without much difficulty depending on atomic type and the number of unique outgoing paths.
I wanted to be done with the compiler by February so we could prepare the March release of NOWHERE, but was struggling with winter depression and the fact that things took way too long to move forward in C++. Really, every time I touch that damn language, pain and misery become my permanent companions. Now that the interpreter runs in Lua, I progress much faster with my agenda, but there are still a few things to do before I can assemble the next alpha. Here's a laundry list:
The few Bangra tests that I have need to be fixed to match the modified parsing rules and the reduced type system. The FFI needs to be ported before step 4, along with the clang parser (Both still exists only as C++ implementation).
A partial evaluator / specializer needs to be written (in Bangra) before Bangra functions can be trivially compiled. It takes a closure and optional parameters to pass to that closure's entry point (of which each parameter can either be constant, typed-value or any-value) and returns either a constant (in case all parameters are constant and the function has no side effects) or a new function with all constants and closure values inlined and evaluated, all captured closures eliminated, and all parameters typed (where possible). The result has to be clean enough to use it for the next steps.
All Bangra functions are already stored in IL, so I don't foresee any logistic problems here. I have never written a partial evaluator before but conceptually, I understand the problem well. Nevertheless I expect surprises.
A lifetime tracker needs to be written (in Bangra). It annotates the IL with information on what values are required in each continuation, how they should be allocated and when they are dead. I know next to nothing about lifetime tracking and don't even know my requirements here. I just know the generated machine code won't depend on a GC, so we'll probably need metainformation that tells us where and when to use
Up until a few weeks ago I had my doubts that steps 2. and 3. were even possible, considering that I hadn't yet seen a project based on aggressive deep type inference which did this, but then I discovered Carp and that lifted my spirits. I had productive talks with Erik Svedäng, head dev of Carp and also game developer (in fact, like Bangra, Carp is written with game development in mind), and I hope I can learn more from him on these subjects.
A LLVM IR generator needs to be written (as described in this paper) that takes specialized IL and turns it into LLVM IR (using the LLVM C library). LLVM can then compile the IR and we have successfully translated Bangra code to an executable format. This is where execution of Bangra code speeds up to
1xfor the first time!
Analog to 4., a SPIR-V generator needs to be written that translates specialized IL into SPIR-V. The format is very similar to LLVM IR, and I know at least one helper library designed to assist with generating SPIR-V. Now, with the power of Vulkan, we can magically run Bangra functions on the GPU as well as the CPU, so shader functions and gameplay functions can source the same math library.
The Liminal game engine needs to be ported from None to Bangra. The languages are largely similar so this is mostly typing work.
Now the codebase is big enough to port NOWHERE's implicit function renderer and add the new CSG features. Then the world renderer works again.
From here, gamedev can commence and I can work on procedurally generating game assets and wiring them up with logic. Textures, models and scenery, sound, audio sequencing.
Inbetween 6. to 8., there's going to be a problem of a rapidly growing codebase that takes longer and longer to load. The interpreter already takes 1.5 seconds to bootstrap Bangra. Macro expansion in particular is very costly. A quick fix is to cache the generated IL on disk (which is what None did), and that takes the writing of some serialization routines. IL caching can't however cover for the LLVM compiler which I'm told is not the fastest one on the block. I can also try to devise some caching here as well, but these are not pretty solutions.
What follows is the pretty solution as outlined by the Futamura projections, and I'll try to restrain myself to implementing it after the alpha has been released:
The interpreter needs to be ported from Lua to Bangra so Bangra can run and most importantly specialize itself. Since both languages are very similar, this will be relatively straightforward, but some types internal to Lua will need to be reimplemented in Bangra at this point up until the required functionality, prominently strings and hashtables (Liminal has a great hashtable implementation).
Now we have an interpreter (Bangra in Bangra) running in an interpreter (Bangra in Lua) running in an interpreter (LuaJIT). Would we actually run anything of value here, execution would slow down to
8xtimes as slow as a native execution. (We could also run the interpreter repeatedly recursively and bring everything to a crawl, which is more limited by available memory than computing power.)
Using the functionality developed in 2-4, we can now specialize the Bangra interpreter for a given bit of Bangra source code, and yield a binary (which embodies the interpreter running the code processing arbitrary user input into arbitrary output). e.g. specializing the interpreter for NOWHERE source code would yield a NOWHERE executable. The translation process is rather slow as it still runs through the interpreter. But we can do even one better.
We proceed to specialize the Bangra interpreter for the specializer and yield an executable (embodying the interpreter running the specializer processing arbitrary user code into binaries that work as described in 11), yielding the specialization process as an executable that can take NOWHERE source code and yield a NOWHERE executable. This is also known as a compiler. Now compilation is also fast.
Not necessarily required, but a nice bonus: we can specialize the specializer specializing the specializer. You read that correctly. You get an executable embodying the specializer specializing arbitrary interpreters into binaries that work as described in 12. In simple words: you put in an interpreter for a new language (where the interpreter itself is written in Bangra), and a compiler for that language falls out. There is a great step by step explanation of how that works on this blog.
And that's it for now.
Hooboy, this Bangra business is taking way longer than anticipated. Bootstrapping a static compiler is _hard_. Nearly none of my original ideas for it have panned out. On the upside, the ideas that came instead were actually better. In the process of building Bangra, I gained several useful insights:
If you want your hoist language part (in this instance, C++) as small as possible, with all important bits written directly in your new language (in this instance, Bangra), the smallest thing you can write to get there is an interpreter. Going straight to any form of static compilation without a dynamic language stage is too hard.
Don't bother with memory management when you are bootstrapping. Just allocate and never free anything. It will be fine. You can always write a GC later.
Your bootstrap language has to be comfortable before you write anything of value in it; Your actual compiler can only be written once the language is convenient. If you write it in a minimal IR, like I originally planned, you'll end up with unserviceable bloat. This is where the interpreter stage helps.
The compiler will need to interface with libc and LLVM, so a C FFI is most opportune if you don't want to blow up the hoist part further; Doing FFI through LLVM itself is too bulky. libffi is a better choice: it's tiny in both interface and implementation and comes with alignment info for all primitive C types.
LLVM IR is not a good intermediate representation for your own language. You need a mid-level IR that fits your model better. The Swift developers published an interesting talk last year to that end, and Rust announced their mid-level IR only this April. For the interpreter, I settled for something far more minimalistic. See next point.
It is possible to compile closures as zero-cost abstractions. I was made aware of a fantastic paper from last year, which introduces a new intermediate form that unites concepts from both CPS and SSA forms, called CFF (control flow form). CFF formulates each statement as a continuation (a lambda function that never returns) with a single function application as its body. Continuations passed as arguments turn into closures. The paper then introduces lambda mangling, particularly lambda lifting and lambda dropping, to eleminate closures. A final transform can then translate CFF to LLVM IR for machine code generation.
To this end, I picked CFF as intermediate form. It turned out CFF is also ridiculously easy to interpret: the eval-apply loop for CFF requires less than 100 lines of C++.
If Bangra is supposed to interface with C without friction, and the compiler written in Bangra is supposed to compile itself, the data model of the language should accomodate those goals. That means the interpreter should not deal in dynamic object types (CPython uses the
PyObjecttype for this), that require conversion to/from ffi compatible forms when interfacing with C functions. Instead, the interpreter should deal with C values, structs and arrays directly, so that no conversion is necessary.
But where do we store the types that an interpreter requires in order to make binary blobs inspectable? The answer is: fat pointers. Bangra's fat pointer is a dynamic pointer that stores a type reference along with its target in order to support interpreter features that depend on inspection: type checking, overloaded operators, bounds checking, element access and garbage collection.
The more of your hoist language data structures are written in C, the less work you'll have to put into bindings. Using libffi and fat pointers, the interpreter can access all memory directly.
Fat pointers turned out to be a fantastic solution for several of my problems, and my only regret is that I had not learned about them sooner. I am still in the process of transitioning the interpreter to this new format, along with the introduction of libffi for FFI calls instead of LLVM.
This is by the way the densest problem I've worked on so far. The interpreter hovers around only 6k lines of C++, but I've agonized over these lines in a way I never had to before.
I hope that the interpreter work is completed somewhere in the next week. Then I'll start work on the compiler, which will be needed for what's next: the new version of the engine uses Vulkan instead of GL by default, and the Bangra compiler will be able to target SPIR-V directly; it's just a matter of translating CFF to SPIR instead of LLVM IR. Once that's done, I hope I'll never have to deal with generating GLSL code again (except for the mac port), and that will significantly speed up the next development phase.
The realtime CSG renderer is practically done and waiting in a drawer to be taken out and put into use once the Vulkan backend stands. I can't wait for the moment when the plan finally comes together.
Looks like I've been slacking on my habit of regularly updating this devlog, probably mostly because I was ashamed that in the past three months I've been procrastinating on game engine work, and instead worked on perfecting our programming environment after I had a hard time writing a new visual editor for live programming, which was supposed to help with procedural audio design for NOWHERE, for which the ability to tweak parameters in real time is a must in order to be able to, ironically, progress swiftly. And there I was, shaving the yak again. And what a beautifully naked yak it has become.
My trouble started when I began to realize that keeping source in Lua tables was at odds with needing to render the UI as fast as possible, using dear imgui (Oh - I also wrote comfy bindings for imgui. So there's that.), as Lua doesn't really give you a high performance interface to its data structures. So it would have been either calling dynamically scripted functions from statically compiled code, or using the Lua C API to iterate and edit tables, which turned out to be a pretty awful experience. Once again, abstraction becomes the enemy. There's no direct access to the data structures, instead you're supposed to communicate with the VM using an awkward stack machine interface that renders all code unreadable, and doesn't lend itself well to abstraction due to constantly moving stack indices and a significant overhead associated with parameter checking.
I can't remember how I got to the conclusion so I must have intuitively felt that I had to fork Terra (as LLLua) in order to be able to fix a few long standing issues, primarily the way too difficult process of building win32 versions (didn't manage to build a single working version) but also the inability to annotate struct alignment for better auto-vectorized code generation, hooking initialization mechanisms in the AST, support for latest LLVM, throwing out code I didn't need (like the Lua language extensions).
In the process of maintaining LLLua, I had to concede that I don't understand LLVM and Clang very well; both C++ API's are incredibly complex, and one had to spend some time with the architecture to get a better understanding of how it worked.
So I put a week aside for an experiment, a small toy project which I felt would be an ideal way to get to know the LLVM infrastructure. The goal was to write a cousin of None, a minimalistic statically typed Scheme-like, implemented in a single C++ source file, that would leverage the low-level MCJIT capabilities of LLVM in order to create a self-expanding language. I nicknamed it "Bang", after the big bang which created a universe from the smallest possible point.
Based on my experiences I expected the experiment to fail; There was a reason Scheme was a dynamic language, so I argued, and good metaprogramming would therefore need some kind of Terra-like two tier system: a dynamic VM like Lua would bootstrap a compiler using a language toolkit like LLVM.
But the cul-de-sac moment never came. Instead, the undertaking proved to be frustratingly fruitful, and it became impossible to quit working on it because I kept succeeding, and the system exposed attributes early on that None lacked.
With the help of MSYS2, Bang proved to be incredibly easy to build on Windows. So easy, actually, that I almost look forward to booting Windows to make a new binary. I don't even use a build system; Just a Bash script for Linux and a batch file for Windows. Bang uses the LLVM/clang distribution on Linux both as its sole dependency, and as its compiler. On Windows, compiling and linking had to be done with gcc.
Bang has recently been renamed to Bangra. Its release number is now at 0.2, there are binaries and documentation online and I wrote an introduction that gives a rationale for its existence. The only language Bangra supports at the moment is the more low level IR language, which has a restrictive type system, but due to its hygienic macro support, even coding at this level is beginning to become convenient.
I'm currently in the process of implementing a new version of Nonelang on top of it, as Bangra's default high level language. When I'm done, the Liminal game engine will be ported and the Nonelang runtime retired. Then it will be possible again to do Windows builds on a casual level, and a big hurdle to releasing the next NOWHERE alpha is overcome.
It's time to document what I've learned so far, before I dive in and tackle the next topic (which is diffuse indirect lighting). Let's start with a video of what the renderer looks like right now. And here's a video with slightly more geometry.
It turned out that the point renderer as I envisioned it was simply not feasible for dynamic scenes. Trying to solve geometry over multiple frames produced all sorts of reprojection woes that I don't want to get into right now.
The current renderer, which I'll stick to, is able to dynamically render up to 64K solid convex superprimitives per frame per viewport, raytracing only, no triangles. Everything is movable and morphable, there are no static elements, which is perfect for outer space and highly dynamic scenes with rearranging geometry.
The primitives (which in the engine are called brushes) are of a single type that I engineered for this application, sdUberprim. Uberprims can represent boxes, spheres, cylinders, capped cones, capsules, pellets, and smoothly morph between those shapes. (My ideal superprimitive would be a tetrahedron or hexahedron with individually weightable edge smoothness, but I couldn't figure out a distance function for those. Any takers?)
On my GTX 460, the rendering time ranges from 4-10ms for a single 1280x720 viewport. Depth complexity is handled reasonably well. These two worst case test scenes give a good sense of where the visual fidelity limit is. There's no material system yet, and the timing is for a scene without shadow mapping, so 30 FPS seems a reasonable target for this GPU. Current gen GPU's with three times the compute capability of the GTX 460 should have considerably less trouble.
I had to let go of smooth and subtractive CSG to get to this performance level, and the primitives must be convex, because the abysmally slow sphere tracing has been swapped for Newton-Raphson iteration, which can iterate towards convex surfaces in less than 8 steps, and even more importantly, is able to provide correct conservative bounds (the inner bound dissolves for features thinner in depth than the cone width, which is not so desirable. Any takers?).
(smooth) CSG using intersections is still possible with Newton-Raphson, as the intersections of two or more convex primitives are also convex, but these operations can not be performed directly as viewport objects are accumulated, but have to be performed in separate registers first, and I couldn't figure out a good data model for that yet.
If you would like to implement a similar system, here's roughly how the renderer works:
- Conservatively rasterize all brushes as oriented bounding boxes into a Z-buffer of 1/16 viewport resolution, and raymarch each fragment to cull the surface further and provide a relatively accurate max-estimate on Z distance.
- Rasterize the same brushes again, with identical parameters, but this time the brush index of each raymarched fragment is written into a per-pixel linked list (I call it an OBB texture), again at 1/16 viewport resolution. To avoid writing too many fragments per tile, the generated Z-Buffer texture from step 1 is queried. If the raymarched distance lies behind the Z-Buffer distance, the fragment is occluded and will not be written.
- Initialize a brush index texture from all tiles in the OBB texture that only have one fragment. Those tiles are ultimately resolved and will not change again. The work remains to refine the remaining tiles, as those contain overlapping contours of two or more brushes.
- Initialize another sparse linked list (I call it the contour buffer) from all tiles in the OBB texture that store more than one fragment. The contour buffer is an array of linked lists, one array entry per multi-fragment tile, pointing to a list of brush indices and min-ray distances (see implementation notes) for this tile.
- Increase the brush index texture resolution to 1/4 size of the viewport. We're now going to fill in some of the remaining empty tiles at the contours.
- Using a compute shader, subdivide every multi-fragment tile in the 1/16 contour buffer into 4x4 smaller tiles of which each tile is now 4x the viewport pixel size, and store these tiles in a new 1/4 contour buffer. For each of those tiles two raymarching passes are needed: first, find the nearest interior bound of all overlapping brushes, then discard all brushes behind that bound. If only one brush remains, write the brush index directly to the brush index texture. Otherwise, add the tile to the 1/4 contour buffer for further refinement.
- Increase the brush index texture resolution to full viewport size. We're now going to fill in the remaining empty contour pixels.
- Refine the contour buffer once more (as in step 6), from 1/4 to full resolution, with some modifications. Only a single pass is required, finding the brush with the nearest exterior. The result is written to the brush index texture. We now have a completed brush index texture.
- For each fragment of the brush index texture, raymarch the brush at that pixel coordinate and use central differences or automatic differentiation (which is only slightly more efficient) to compute a high quality normal. The normal is written to a normal buffer, and the depth written to a depth texture.
- Apply deferred lighting to your hearts content. Season to taste with salt and pepper. Serve hot.
The efficiency of this method stems from the implied early occlusion culling and the refinement of contours only, which approximate curves with infinitely small thickness. N contour tiles are refined to 4N tiles on average, while the number of overall pixels increases by factor 16.
Some implementation notes:
- When rasterizing the bounding boxes, all vertices are shared. Each vertex of a box is connected to three edges. If a vertex has two contour edges (a contour edge joins a forward- and a backfacing triangle, thanks Marco Salvi for pointing that out), it will need to be adjusted. The edges are shifted outside and the vertex is mapped to the intersection of those edges. See Conservative Rasterization for more details. The second method is relevant here.
- The distance function is solved towards
f(t) - bias*t / (min(screenwidth,screenheight) * sqrt(2)) = 0where f(t) is a distance function, t is the ray scalar, and bias is either 1 or -1 depending on whether we want the outer or inner shell of the primitive. Subtracing the linear term effectively narrows the distance to the ray as it progresses, turning the ray into a widening cone with a spherical cap. See this shadertoy for a reference implementation.
- The per-pixel linked list is very well described in the presentation OIT and Indirect Illumination using DX11 Linked Lists - AMD. Our fragments are not sorted, though the implementation would probably benefit from it. Cyril Crassin wrote about a version that reduces bandwidth pressure in his blog, but for our use case that made no difference, as we were primarily compute bound.
- When refining the contour buffer, fragments have to be added and linked up before knowing how many brushes would actually pass the depth test, and another contour tile would have to be added. If it turns out that the contour tile is not needed (as only one fragment has been added), then the fragment is orphaned. This is in practice no problem and carries very little overhead.
- Convergence can be sped up by storing the computed ray distance with each fragment in the OBB texture and contour buffer, and passing that ray distance down in the pipeline to the final depth resolve to reduce the number of raymarching iterations.
As mentioned I tried raytracing again, RevAA, then back to SDF and then developed a piecewise polynomial evaluator for rays to speed up convergence, but it's simply too slow. I wrote a bit about this here. To sum it up, when evaluating complex distance fields, every iteration hurts. Just one to two iterations per frame are okay.
So I tried the next thing, based on an idea that Timothy Lottes had, about using particles to represent the scene, and am experimenting with points converging on a distance field scene along rays. Here is a video of the technique in action. Performance on my GTX 460 is fairly good (a steady 1-2ms, because the work is completely uniform, regardless of scene complexity), while image quality is terrible, but now I'm optimizing the picture rather than performance (could be worse: I used to work on both at the same time), which feels like the right domain to think about.
Several avenues of improvement remain: a) Attempt to no longer limit lifetime of points and instead do a perfect detection of when points are truly dead. b) Seed points not at random locations but only to fill cracks. c) Seed points not at near plane, but closer to depth at surface.
Some more details to these directions:
- These points are recycled so far: backfacing points, points that leave screenspace. These points need to be recycled as well: points covered by the Z-Buffer, points in regions with too high point density. It all hints towards maintaining points not as a list of coordinates, but directly on the framebuffer (on a grid), because this way the grid automatically culls backfacing points, points out of screenspace, points covered by zbuffer and surfaces with high point density. Storing a point per pixel implies all those things. (However it is a big problem that I can only write 32-bit atomic-min values in GLSL, which limits the amount of information I can store with a single pixel. I will need two passes: one where points scatter their new depth values, and one where the same points evaluate if they've won and if so, write additional information for that pixel. Ho hum.)
- Seeding new points to fill cracks requires either a histopyramid on the Z-Buffer to enumerate empty pixels, or, when points are stored in the framebuffer, this can be part of the iteration pass.
- Seeding new points closer to depth at surface to reduce convergence time: not sure what to do here. Using the (unprojected) previous Z-Buffer seems okay most of the time unless stuff fades in from the sides, or new objects appear. The danger lies in picking a starting depth that is behind a distance field volume, effectively clipping that volume from view.
Happy new year everyone! We just whooshed past our original deadline, which was the end of 2015, aaand I'm still doing R&D.
What transpired? Reducing my tetrahedral mesh to a triangle mesh in 2D still left me with an approach that felt icky to work with. I still didn't know how to do painlessly procedurally generate structures with triangles. What I could work with, however, were lumps of classical game objects as I used in the last 2D prototype (a small demo movie here), but I thought that would bring back the old object vs. terrain dichtonomy, and I always wanted a single system to do everything, as the world is also not split into living and dead things, but everything can fluidly transition from resting to animate and back. Splitting a game into world and objects was a performance hack that ended up with the creation of two systems that are incompatible to each other, and make transitions hard.
Then Media Molecule started demoing their Dreams project (just look at this awesome demo and how easy they can create complex topology with it). I had occasional talks with Alex Evans about the technology they were using, and we shared our R&D woes in the past years. A few years back I had dabbled with similar systems for NOWHERE. I had thought that their approach would not work for the huge continuous planetary landscapes that I was envisioning. But as my dream of tetmesh backed giant realtime transforming asteroids was collapsing under the weight of what computer consumer hardware can realistically do in 2015 (it's still a great idea - for the computers of 2030!), and admitting that I was unable to explain to myself how I would actually teach a program to generate and alter shapes this way, I took another look at Alex' SIGGRAPH slides and repondered on the intersections of our projects.
Then it dawned on me.
The first thought was that since composition of operations was trivial with Media Molecule's smooth CSG approach, it would also be trivial to write procedural generators for it: smooth-add and -subtract a bunch of boxes and ellipsoids, and evaluate the resulting shape as a point cloud. Then compose the whole world out of these objects. An asteroid for example would then be simply a huge blob made from many medium-sized rocks clumped together. Digging a tunnel would be removing rocks from the structure. It would have the added benefit of keeping surface irregularity; any dug tunnel would still look rocky. And because we have smooth CSG, we can blend structures and make them seem more continuous instead of mashing up discrete elements.
That brought me to the second, and even more important realization: I was approaching the abandonment of the terrain/object dichtonomy from the wrong angle. All this time I thought that the way forward would be to understand objects as a special case of terrain, that any game model is a sort of mini-terrain - one volumetric body geometrically built out of a connected graph of tetrahedral cells with different attributes. As this was how the physical world worked: as a fluid laplacian field where everything was connected. The wave-like way of seeing the world.
Sidenote: It turns out that while graphs are the lowest common denominator of all data structures (you can describe any data structure as a graph), computers are exceptionally awful at parallelizing any form of graph work. It takes a lot of research to make these things fast. It's much better to work with trees, as trees only branch - they don't merge again. It's even better to work with linked lists, as lists don't branch - they just skip. It's absolutely excellent to work with arrays, as arrays neither branch or skip, and can just be traversed linearly, a job that parallelizes easily. GPUs want arrays. It is possible to describe a tree in form of a list, and a list in form of an array, and most certainly can a graph be described as a tree, but these reduced data sets are lossy and therefore immutable. Triangle meshes make editing bearable by keeping a few connectivities invariant (e.g. every triangle edge is adjacent to only two triangles), but graph editing incurs a constant chasing of pointers (implying random access all over memory) that is detrimental to performance of contemporary computer architectures.
Anyway. It did not occur to me that the terrain / object dichtonomy could also be approached from the other side: that terrains can be described as superset of objects. Our smallest atom is no longer a vertex, a singularity, but a geometric primitive that has surface and volume - the particle-like way of seeing the world. Any terrain can be composed from primitives, any game model is a composite of primitives anyway. We've been doing this for years already, in our rigid body physics systems. Assuming that we can find a simple way to smooth the surface gradient to get rid of hard corners that invariably form where primitives overlap, the result can look just as convincing - and Smooth-CSG offers just this approach.
Philosophically, it makes no sense for a computer game to attempt to do a full physical simulation. This way, we end up with an abstraction that is so occupied with every atom of a tree, we lose all semantic understanding of the tree. Traditionally, games prefer to organize their world semantically - the world not as we experience it, but as we make sense of it, and how we communicate with it. Every game object's appearance exists merely to convey its semantical points: the color of leaves of a tree may communicate the tree's genetic heritage, the season we are in, the quality of the soil it is embedded in, the cleanliness of the air that it breathes, the presence of parasites and how much sunlight it is getting. A physical simulation attempts to recreate a system where the semantic understanding is an emergent effect. A game takes shortcuts to explicitly recreate emergent effects without the systems that produce them, and so discards a wide range of possible emergent side effects for the efficient mapping of the few relationships it cares about. It is an edited, condensed version of reality forced by the evil ever-present constraints of computer performance :-)
"That is all nice and well", you may interject impatiently, "but what have you actually done in the past weeks?". How rude of you - but also how right you are!
The first thing I did was attempt to recreate the distance field evaluator that Alex described in his slides, first on the CPU. I started by assembling some of the algorithms I would need, such as a 3D hilbert curve encoder / decoder (here's a video of the 2D version). Then with the help of @rianflo (who runs the truenext forum on Slack, a place where developers interested in writing the next generation of dynamic world games meet to talk business). I wrote a few max-norm distance functions for popular primitives such as the box and the ellipsoid, and did some initial spatial bisection tests in 2D (picture). Then I wrote and experimented with a GPU backed 3D point renderer (a beautiful picture here) and added hilbert curve ordering (picture).
The last CPU-based version of the evaluator randomly spammed space with objects (picture), and then I.. I concluded that the whole thing was already getting too complex for dynamic scenes and threw it away.
That's not entirely how it happened. In the middle of it it occurred to me that my old friend Interval Arithmetic (IA) could help describing more primitives than max-norm, which has no composable arithmetic rules, would allow (in fact they often lead to identical results). Then it transpired that, since IA operates well in distorted domains, I could perhaps do the evaluation directly in window space and use a Z-buffer to cull hidden and occluded surfaces early (2D prototype picture). I prototyped a few more IA operators (picture) and rendered a beautiful testing scene here, which convinced me of the practicality of working with special arithmetic rather than distance norms.
It turned out that Alex had also written about a frustum renderer in his slides, but this one operated on trees of volume textures, while I went straight to dynamic surface evaluation. Alex helped me a lot by explaining the minutiae of his approach, which allowed me to make more informed decisions. I had to learn how to write compute shaders at the same time, and I'm still not entirely fluid in it, and am forced to figure a lot of things for the first time.
I wrote a first prototype of a GPU version (without CSG evaluator, just a simple implicit surface for testing). Here's a video of how the implicit surface frustum rasterizer (ISFR) converges towards a solution. A problem that I wrestled with was the behavior of evaluation close to grazing surfaces. Many surface samples can become relevant for the same pixel, and the evaluator tends to do 3-10 times more work than necessary on the finest resolution. With optimal scenes, the frame time can go as low as 1.5ms, while it climbs to 20-30ms at grazing angles. Anyway, here's a picture of some beautiful smooth-min CSG rendered with the ISFR.
Another issue with the ISFR is the required log2(N) stream expansion steps, as the frustum space has to be recursively bisected, and, to be honest, Interval Arithmetic is not very good at dealing with rotations and affine (shearing) transformations in general. Raytracing with IA has been done before but seemed particularly wasteful, as this effectively constructs a box around the ray, and reports positive for any surface inside that box, often neccessitating a bisection even when the ray is not even close to the surface.
So I bit the bullet and looked into the more complex and confounding successor to Interval Arithmetic, Affine Arithmetic (AA). That is where I am today. There is a beautiful paper on raytracing and voxelizing implicit surfaces with revised affine arithmetic (RevAA) that I am following currently. I have written a prototype on shadertoy that demonstrates how effective RevAA can be in finding zero-plane intersections between a ray (which remains error-free under rotation) and an implicit function.
Classic intervals describe a box around the range (effectively offering us a binary choice between "interval contains surface" and "interval does not contain surface"), and there is no other way to refine the result further but to bisect it into two parts and continue bisecting until the interval reaches our desired resolution. Not so the revised affine interval. Because each affine interval is a parallelogram (see shadertoy prototype), not only can it tell us whether it potentially intersects the surface, it can also prune the range further in the same step, giving us a far smaller search range to bisect. In the ideal case where the function is linear (as when evaluating the faces of a box), the resulting intersection width can even be zero, terminating our search after a single step!
This means that when evaluating surfaces with RevAA, we only pay for surface complexity (high curvature) and depth complexity (many layers of surface behind each other). Unlike with IA, grazing angles of near-flat surfaces are "free" because of the affine component and the resulting pruning efficiency. Each affine computation is costlier than classic IA, but we need less of them, netting a win (see the benchmarks in the paper).
I am now inclined to reconsider raytracing, as using RevAA to do spatial refinement doesn't offer the same ease of pruning, and on top requires two more affine components. In 1D ray evaluation (where the ray travels on f(x)=0), the range of intersection describes a simple interval (a line). In 2D evaluation we measure the intersection of a zonohedron with the f(x,y) = Z = 0 plane (which is our ray lifted by one dimension), so the range of intersection describes anything from a line to a parallelogram to a hexagon. Limited to searching a rectangular domain, we can only refine a bounding rectangle, greatly diminishing the benefit of pruning. You can imagine how bad the 3D case looks, where we do the intersection in 4D.
These are the problems I'm currently facing, and which I attempt to tackle in the next prototype:
- Raytracing has notoriously bad performance even on GPU, because of data and execution divergence. On the upside, we're evaluating primitives instead of triangles (good ol' POVray style), so we get to cover much more surface with much less data. There is a paper on raytracing with affine arithmetic on the GPU which I intend to study more closely.
- Without a refinement hierarchy for rays, we can only refine depth information at the target resolution, with a linked list of ranges and culled CSG operators per pixel. Starting with a lower resolution as in this paper and refining when close enough to surface should help, but needs testing.
- I'm still lacking a bunch of affine approximation operators necessary for CSG, particularly min / max / abs. min/max are easy to do once an approximation for abs() exists, as
min(a,b) = (a + b - abs(a - b))/2. Fryazinov's paper gives a step by step explanation on how to construct new operators, and that's what I'm studying right now.
Many additions to the 2D prototype, primarily figuring out the basic mode of movement, which is now a sort of flat version of first person controls; WSAD control lateral movement, while horizontal mouse movement rotates. I'm planning to test a depth-adjustable crosshair for vertical mouse movement. I started out with fish-like swimming controls, and while that was interesting, I think the basic mode of transport should offer high precision, as most of your movement in early life has high locality. There will be many other combinable forms of movement suitable for different activities later on, including the tether-based transportation we had in the last alpha.
The next bit to do is one of the most important mechanics in the game; In NOWHERE, your avatar is identical with your inventory. What you have is what you are. Therefore a way to manage your appearance is needed. I call it "self-assembly mode". In this mode, bits and pieces floating by are held by your avatars electromagnetic telekinesis abilities, can be rotated and attached to your body, and assigned to keyboard keys and controller buttons. It's a sort of diegetic version of the Besiege interface. This way, editing appearance, special abilities, loadout, role, character stats, inventory is all folded into a single, infinitely expandable system. There will be no additional skill tree. One look at a Nowherian and you can tell what their primary profession is, based on the bits and pieces they are made of.
I also need to reintroduce the GL rendering pipeline so the screenshots become more presentable again (and the next alpha doesn't look too terrible). Right now I mostly use NanoVG to render debug graphics.
Lots of improvements to the engine codebase as of late, primarily obvious modernizations and ports from the old Python-based codebase:
- Found two bugs in Terra, #134 and #132, and received a fix within a few days. @zdevito is awesome.
- A new entity component system (in Liminal, it's called an Actor Role system) that facilitates adding new properties and behaviors to actors in the game using object composition.
- A massive parallelization infrastructure based on the concept of Fibers, inspired by this GDC talk by Naughty Dog developer Christian Gyrling. Kinks are not entirely ironed out, but what I see so far is very promising.
- A full implementation of the PCG family of random number generators.
- A new PCG-based Random class inspired by Python's Random that produces random values with different distribution levels in 1D. 2D and 3D.
- Hash maps now attempt to maintain a 25%-75% load for improved cache coherence and faster iteration (great esp if you only have a few entries), and do in-place rehashing, avoiding memory reallocations.
- Extensions and performance improvements to the vector math library.
- Beginnings of a rotation math library with support for versors, anglevectors and axis angle computations.
- A port of the color math shader library to static, with various color space conversion routines.
- New implementations for Queue and Stack (Queue also works as a ringbuffer).
- Alt+Enter toggles fullscreen in glmain-based apps.
Business wise, in order to avoid confusion among our founders, we talked about splitting off the 2D part into a separate prequel, working title "NOWHERE 2D" or "NOWHERE Flatland" ;-) The 3D version will still be completed as promised, but will have to be delayed because, as it turns out, directly developing in 3D without solving most of your gameplay in 2D first is so hard, I would wage that with what we know now, a 3D version takes the time of doing a 2D version, squared, if the 2D version isn't worked out first. I was more than frustrated with that realization, but then remembered that even GTA started out as a top-down 2D clone of the APB arcade game. Knowing how to do small iterations on your work is key, and sometimes you have to go three steps back in order to go six steps forward. Or something like that.
Anyway, the idea is that each founder gets a NOWHERE 2D download channel in his humble bundle account, with a separate alpha, and eventually both versions will be developed alongside, with the same engine. I hope that this way we can make up to our founders for the extended duration of development, and have a more productive time doing it. At this point I'd like to say that we have amazing founders who have so far been more than patient and understanding with how things have been going, and I'd love nothing more than make all of them happy. One thing we'll never do is abandon the project. To Sylvia and me, it's still the most interesting thing to do in the world. As the saying goes, "It does not matter how slowly you go as long as you do not stop."
I finished the new input mapping system today, which allows to bind arbitrary input events to logical events. A declaration of a mapping is as easy as
input-mapping <name> <option> ...
An example declaration is:
input-mapping left ; if input is an axis, negate the value flip ; set an event handler for button presses on-press on-activate ; specify a list of default bindings defaults $ Input.axis-leftx Input.key-left Input.key-a Input.button-dpad-left
The system allows to register handlers for released buttons and value changes, and to set press/release threshold limits and controller id filters. The bindings can at any point be cleared and re-assigned. The generated event handling code is completely inlined.
I'm happy with how it works now, as the basic setup is open to future extensions that don't force big changes in the interface.
After struggling to implement the position based dynamics system I bit the bullet and postponed most world physics features in favor of working on gameplay, controls and AI. I went back to a simpler 2D physics system, Chipmunk 2D, and am now reimplementing player movement controls and the body editor. At this point, considering the pressure we're under to deliver, I lost all interest in doing more world engine research and just want to get it done. 3D and malleable worlds will have to be done at a later point when our funding level is better and we have regained the trust of our supporters.
The None programming language is now reasonably complete and I work with it every day.
Mesh refinement is working nicely now, and I wrote an article on how it works exactly for anyone who wants to implement it as well.
We were away four days to help my father-in-law move. I thought a lot about how to do overloaded names in None in a generic way and arrived at this solution (this is the test case for this feature), but could still not figure out how to deal with the corner cases of declaring method names this way.
In the end, I unified the way methods are called in None (both dynamic and static part). Before, there were three ways to do it:
(--> object (method-name ...) ...)which allows to chain method calls, but it turned out I hardly ever used the chaining feature.
(object --> (method-name ...)), the infix version, which was also cumbersome to use.
(object.method-name ...)which only works with specially rigged classes and descriptors in the dynamic language, and specially declared struct functions in the static language. That means all imported Lua objects, including strings, can't be method-called with this notation.
There was a bit of Twitter conversation about this topic with Michael Zinn, and this was why I tried the overloaded
name solution (comparable to how Haskell and similar functional languages do it), which
effectively would allow something like
(method-name object ...) and make
methods indistinguishable from regular functions. Alas, this
approach doesn't work well with dynamic languages (or rather creates more
confusion than it's worth), and so I ultimately settled for this notation:
(.method-name object ...)
The notation is recognized by both the static and dynamic list hook and translated into method calls. I removed the old notations and fixed all tests and dependent code, and now it's all neatly uniform. Should have done it like this from the start, good that I caught it in time.
Everything is now fun again. I also experimented with building a Voronoi diagram from the self-optimizing delaunay mesh. It seems that six neighbors (three inner, three outer vertices) allow to build a reasonable map, but there are often situations where this is not the case, because the mesh is not fully conforming. Considering the set of the triangle vertices full 1-ring neighborhood would give a complete map, but be more computationally expensive.
I also began work on the Position Based Dynamics physics system for the self-optimizing mesh, but in order to be able to play around with it more and devise interesting test cases, I need to cut down turn-around times and work on an editor for it, which brings me to a painful old problem, designing a good UI system.
I already wrote about this earlier in the devlog (7.9.2015), but I haven't progressed very well on this item so far. I guess it's time now to get it done, so I can move on and have more fun playing with physics.
Earliest footage of the position based dynamics physics system animating a bunch of points. Mesh refinement struggles keeping up when there are more than four points on screen, eventually a flip fails and the system collapses, don't know why yet.
Short update: I went back to the trimesh prototype and fixed all the mesh
operations that broke by porting to newer syntax. Next up is finally
implementing that purely gather-based approach I talked about
last month. It was
useful to devise a template for handle types (
detect parameter mistakes early. I also wrote a general
Mesh.validate function that checks if all links within the mesh are coherent and no triangle is degenerate, that allowed to fix functions very quickly.
Some things were added as part of fixing the prototype:
- None has a
float==comparison function now for doubles and floats that checks equality within half epsilon (the precision limit of each data type). Linear Algebra programmers know how important that one is.
crossfunctions for Liminal's GLSL-style vector types (it's growing more GLM-like by the day)
liminal.hashmodule that provides a generics-based wrapper around CityHash64 for both static and lua (hashes nearly anything you throw at it).
And two small things about None:
The globals fix that I proposed has been implemented (with a few changes in the concept) and works beautifully.
__globalcan be used to register new global names in the translated module scope-wise, and also redirect global names to a name that holds a table within the file.
usinghas been changed to directly require a module name. The macro loads the module during expansion and pastes
__globalstatements for the exported symbols into the source code. That means that
usinghas no further runtime cost than retrieving a name from the dictionary now, and therefore performs much better.
It could still be possible to devise a
usingthat resolves table attributes in local scopes, either by passing the attributes you want to access explicitly, or the name of an attribute list (effectively a sort of type) that is available at compile time.
I didn't change
?yet, I might never do it or do it on a lazy afternoon. I can imagine that the Lisp-style
ifmight trip up a lot of people (it still trips me up), so I still feel that would be the right move. What do you think?
What happened! Another 10 days gone!
Some work on Liminal (the game engine):
- A new finite state machine syntax that makes it easier to tackle tasks like AI logic and game flow.
- A function list construct to allow modules to hook events using the observer pattern, facilitating separation of interest. Function lists are inlined statically and so need to be complete before the first call. That makes them fast, but also disallows runtime addition/removal of hooks.
Lots of features and fixes for None (the programming language):
- lots of language bugfixes
- rudimentary online documentation!
- a comfortable naked notation that allows to write S-expressions entirely without nested parentheses.
- speeded up loading time from ~2 seconds to ~200ms (on my machine at least) by caching generated bytecode. None needs the
lfsmodule (LuaFileSystem) in its path to be able to generate the necessary timestamps, otherwise cached bytecode will never be loaded.
- To allow for bytecode caching, all occurences of
meta-dothat have side effects other than altering the active translation context had to be removed because they can't be saved with the bytecode; That means all
syntax-*-globalfunctions operate in the script context, and newly registered syntax only becomes available after execution of the script. Also,
syntax-exportno longer allows to export the active syntax table, but has to be called in the script context with a table created by that script. That fix is usually quickly applied, but syntax tests typically need to be moved out of the module that defines them.
nilis now named
null, because None lends other keywords from C (
var), and therefore
nulljust fits better.
- further improvements to tagging expressions with origin, and traceback information
- the parser stores columns for each expression now.
ast-list-unpackfor unpacking expressions with patterns
syntax-rulesno longer takes extra arguments for literals and hygienic variables. Instead, a literal can be defined inline using the
_:literalprefix, and hygienic names can be prefixed as
$:tempvarthe first time they appear in the template.
syntax-rulesalso uses pseudoquotes now, allowing to
unquote-splicedynamic expressions into patterns and templates.
switchmacro as an abbreviation to
enummacro working similar to C's
enumthat also stores the reverse lookup in the generated table and the enumerator count (last enumerator + 1)
tryhas an alternate notation friendlier for naked mode
Two more changes to None that I want to see through next:
ifworks like its Scheme and Lisp counterparts, but is in a modern sense much closer to C's ternary conditional operator cond?a:b.
condhas an unsual syntax for people used to classic
if. Therefore I'll experiment with renaming
?and adding an
ifsyntax that is convenient to use in naked mode and follows the structure
(if <condition> (then ...) (elseif <condition>) (else ...)to avoid surprising novices who come from other languages (it also trips me up when I switch between None and other languages).
As part of getting bytecode caching to work, I separated
core.ninto stages and found that it was difficult to figure out what macros weren't active yet because None interprets non-macro expressions as function calls, and symbols that it doesn't see in the scope as lookups into the global namespace -- but failing lookups lie dormant in functions and only become visible when they are executed. What to do? It occurred to me that global symbols typically don't change in scripts and are all expected to be present before the script is loaded, so for the debugging session I taught the translator to check unknown symbols against the active global table and report unknowns. It was only a problem where the module itself introduced new global symbols and used them immediately, so I hardcoded a whitelist for that purpose.
This worked incredibly well, so here's my idea for a new rule: the translator registers all global symbols in the tree its working on and prepends a presence check to the generated bytecode. This way, even cached bytecode can detect changes in the global table early (minus changes in macros, because these are already expanded). A script can use the new
__ignorespecial form at any point in the source file to add symbols to the whitelist of globals that should not be checked.
var-globaland similar macros are upgraded to implicitly use
__ignore, because that is always implied.
This should fix the typo errors at runtime that Python, Lua and many other script languages are notorious for (an example here).
Of course this means that
__set-scopemust no longer be possible (as it locally hooks the global table) and the application of
usingmust be changed to a static approach.
usingmust directly reference the module to be imported, the translator then checks this module for additional global symbols and implicitly resolves them.
Went on another yak shaving binge for the past ten days and woke up in a puddle of my own commits. Lots of exciting things happened for None's infrastructure, where by exciting I mean lots of bugfixing and red tape removal, which heightens my enthusiasm for using None to write more game related code.
At the end of this post I also write something about our current financial situation.
Let's see if I can retrace my steps, in chronological order:
- Found that most of the static DSL expressions skipped nil arguments thanks to Lua not being able to store nil values in tables. Fixed.
- Made it so that GDB actually displays the proper function filename and line number for anonymous terra-generated functions in stack traces, but then found out that this crashes LLVM because it relies on the anonymous name for ... something? Argh.
Shviller from the #nonelang IRC channel (we are on freenode, feel invited
if you'd like to program in None) suggested that None could use
https://github.com/franko/luajit-lang-toolkit to directly generate bytecode,
allowing to skip one extra parsing step and to directly annotate bytecode
elements with correct line numbers. Until then, None scanned backwards through
the generated code to find the associated line number from the original source,
and that was not only slow and memory hungry (as each translated source needed
to be kept in memory), it was also faulty and imprecise. I wanted the static
DSL and the GLSL DSL AST nodes to be annotated from the lua stack using
debug.getinfo(), but with the current method, the result was slow and unreliable.
I went through a list of what needed to be done first before the bytecode generator could work. I describe the steps in chronological order, in present tense:
The None translator
eval_none.nis written in S-Lua, which is my version of Lua in symbolic expressions. S-Lua is the intermediate form the None translator uses for code generation, which is then translated to a Lua AST tree (by another generator,
eval_lua.lua), which is then translated to actual Lua by yet another generator,
format_lua_ast.lua. I thought back then that it would be smart to write the None translator in S-Lua, to have a test case that proved S-Lua works, but S-Lua ended up really not fun to write, and is yet another language someone has to remember and think about.
Therefore I decide that before I do any more work on the translator to do the bytecode translation, S-Lua will have to go. I do one last translation of S-Lua back to regular Lua, and end up with a nicely formatted
In the next step, I make all fundamental macro expanders in the translator part of the-language-bootstrapping-itself and translate them to
core.n. A bit tricky because they need to be written before the actual macro system is in place -- in a way, they are the macro system. That part comes now first in
core.n, and it uses the macro-less verbose special forms that the translator understands.
The outcome is that
eval_none.luais now a lot smaller than before, and therefore there's less to refactor for the eventual bytecode transition, which I expect to be annoying and cumbersome, because I know nothing about the library I'm about to use.
For some reason I get irked by the fact that method accessors
(object --> (method ...))are annoying to use in None. I "fixed" this issue for dynamic classes before, by introducing Python's descriptor classes to customize member access. Members can now bind self to a function on access, and that allows to call methods as
(object.method ...)as one would expect.
Can I do the same thing for static structs? Thanks to terra's metamethods, I can indeed, and now structs can also define methods that use the simple accessor; LLVM's code generator is very efficient and completely optimizes out the intermediate construction of a struct that holds the instance reference.
I work on better error reporting while I translate the TriMesh prototype to use the new method syntax. At that point I realize that static AST nodes will never get annotated properly with the present system, and I return back to the task of realizing the bytecode translator.
It turns out that luajit-lang-toolkit is relatively simple, almost fun to use and I refactor
eval_none.luato output luajit AST nodes instead of S-Lua in a day. It's amazing how quickly things are working again.
I also find out that the LuaJIT bytecode has no problem with symbolic names that are considered illegal by Lua's parser.
my-f*ing-amazing-variableis a legal name to use there, and will even be honored by LuaJIT's debug output. Lovely, but I'm not going to make use of that fact yet.
The S-Lua -> Lua AST generator is no longer required, neither is the AST -> Lua code generator, so I remove them both from the project. Good riddance.
Lots of fixes to get coherent line number annotation all the way from the parsed symbolic expressions to generated bytecode and machine code.
__pragmaspecial form that allows to turn compiler features on and off for debugging. Very helpful for figuring out why new features (in this case it's inlining pure Lua expressions to avoid too many temporary variables) break production code.
__pragmaalso allows to dump generated Lua AST nodes, equivalent lua code, expanded symbolic expressions, and the linenumber and filename that AST nodes will be tagged with.
The static DSL tags each and every expression with a filename and linenumber, which I don't have to do anymore now. I add a rather elaborate visitor pattern that patches each and every terra AST node with the correct anchor, and that turns out to be pretty elaborate work. Terra has too many AST nodes.
One odd bug is that line numbers passed to the bytecode seem to be modulated by 256. I guess that this is because LuaJIT uses the
lastlinebytecode tags to pack linenumbers into 8-, 16- and 32-bit numbers, and indeed, once I generally define
lastline = firstline + 65536, the problem is gone. It's a hack but I'm not going to expand the parser now so that it always supplies a valid lastline. I hate the parser.
10 new tests have now been added as part of debugging new issues in the prototype's codebase.
So there we go, bytecode translator all done. While I'm at it, I do some more cleanup:
When I wrote the GLSL translator, I refactored None's macro expander to serve any domain specific compiler (swap out the syntax table completely, use
__escapeto stop the macro expander from expanding any further, etc.). The static translator however still uses its own macro expander, and that means all linenumber propagation work that I did, I would have to do here again.
So I refactor
static.nto use the internal macro expander, and now it's all splendid. It still has its own "compiler" stage that translates the expanded forms to terra AST node constructors, but given that the static "compiler" does next to nothing, I wonder if I couldn't completely merge that stage with the macro expander.
Which brings us to the present day. The TriMesh prototype currently renders no content (but at least the hit testing works), so I'll have to fix that next up, and then I hope I can finally work on the game editor.
Now on to financial matters.
It appears we are running out of money again. The long waiting time for new content has eroded the number of new founders, and one of our long time patreons had to reduce his monthly support because he's also in a financial pickle right now. We also used to borrow a small complementary amount of money from my mum each month and are about to reach the cap next month, so we will also run out of that funding pot by end of October. Sylvia is still waiting on payment from one of her commissions (this was before Sylvia began to require advances), but it is very likely they screwed us and we're not going to see a cent. There are currently no new commisions in sight for her, but of course that can change any day.
That means we have to think about how to get fresh funds before September runs out. I've considered starting a Patreon for None/game engine development exclusively, so we are at least getting support for that part of the work (which is not what Nowhere players are necessarily interested in, although they damn well should ;-)), but Patreon only allows one page per account, and managing that one account is already hard work for Sylvia, so we'll have to merge the tool development part with the page and make it about more things than one. We'll have to think about how to best present this.
We also need something new to show so we can justify doing a new video and invite everyone to check out the new alpha. Sylvia has done lots of new concept art that could go into a new video, we can finally exchange that old dreadful logo, and replace "Psychedelic RPG" with a much more fitting "Alien Life Simulator". The fundraising video is nearly 2 years old. Back then we gave away the end game without mentioning many of the intermediate steps, and I believe people would be more inclined to trust us if there were a more detailed plan of action, and we would be more clearly on our game being developed as a service, not as a one trick pony.
It also becomes clear that we will not be able to deliver on our full promise by the end of 2015, but I want to do a release anyway with what we get to work by that date. We don't want anyone to feel disappointed or exploited. The plan is to ship what's there, then continue on a "sequel" (which is really just a new major version) and keep our founders on board until they get what they paid for.
I can not believe how fresh our concept for Nowhere still is. The pitch has been out for several years and no one has stolen it from us, and I guess that's because it really is that ambitious and forward thinking. We have ideas for art, sound and gameplay design as well as programming, modding and player service which are quite different from how most game development is done today, and I would love to see all our goals realized in exactly the way we envision them.
I don't mind if it takes ten years to get there, but we need to compromise on the progress on the way and reintroduce regular releases of alphas, betas and "masters" to keep the company operational without incurring additional debt.
I wanted to hold off new alphas until we had a game model we could definitely stick to, but I see now that I have to get over my perfectionism and, as the mantra goes, release early, release often.
Lots of optimizations on the trimesh prototype. Streamlined the structs and interfaces, translated the remaining shaders to the new GLSL compiler (that went remarkably well, just had to add a missing
discard node). Now I'm back to thinking how I can add more map editing features to the prototype, and the challenge of writing UI annoys the hell out of me. I currently have no GUI framework I'm happy with. I considered embedding ocornuts C-wrapped IMGUI, but I don't like how it's set up. IMGUI attempts to provide a ready-to-use library of widgets, very much the classical approach of UI frameworks. The problem with that system is, that while it makes it easy for users to slap together a bunch of preset widgets, they're typically completely on their own when they want custom widgets - and they always want custom widgets. My OUI library has a minimalistic abstraction, which I like a lot, because it is agnostic to theming and allows to build any sort of widget from a handful of configurable rectangles, a very Scheme-y design.
One could say: if IMGUI is the GL 2.0 (with lots of extensions) of immediate UI, then OUI is Vulkan (less presets, more combinatory opportunities) ;-) but OUI has a design flaw, which is that the way items are cached between frames (to maintain coherence during stateful operations) leads to the UI losing track if the order in the UI layout changes, and all in all, I had heavy optimization issues with this touch-everything-in-one-frame approach. It's awfully difficult to both track and cache operations when your UI objects are all supposed to be ephemeral.
Therefore I have begun work on a OUI2 (implemented in None), which is going to use a retained mode again. The vision I have for it is to become a minimalistic system to facilitate a browser-like DOM (something that I tried to write with OUI before). I enjoyed designing webpages in the browser more than working with any UI library, but I despised the shit stacking and the legacy baggage. Considering that symbolic expressions drive everything in my engine, I believe this is the perfect abstraction for GUI / data visualization / websites etc as well. Where Chrome has HTML/CSS/JS syntax, I'm eventually going to have S-expressions with just three different domain specific semantics (of which the JS part is already done).
The principle of items (the lowest abstraction of a widget) and the layout algorithms (box model, word wrapping) are kept, but the items will remain persistent in a pool. I'm also going to make changes to event handling. OUI will have an outgoing event queue (& associated callbacks) for abstract item events (hot/cold, clicked, focus/blur; drag, drop; etc.), and there will be no more polling of item state, as that technique is frame-dependent and typically ends up being lossy. The modifications open the door for implementing future optimizations like partial layout refreshes and draw surface caching.
The trimesh map editor is OUI2's first test case. Considering how many GUI systems I've used and written before, I hope that I'll end up with a lean minimal UI system that remains expressive and resourceful for years to come.
I managed to get the GLSL compiler to a working state, and the trimesh subdivision shader (which is fairly complex and uses three stages) is now working perfectly; https://gist.github.com/paniq/aff6395684c22aa978d1 is translated to https://gist.github.com/paniq/ad0783d4c61a5ba3f72a. It's now going to be comfortable to write and interface with new shaders. I still need to add a few small tweaks and improvements, plus a new testing mode that allows to directly test shader functions on the GPU and read out their return values. After that, I can start working on a faster Delaunay transform shader.
Not so much happened the past 10 days; We visited Sylvia's dad and then fought against the looming heatwave. Our studio has a mobile climate unit that eats a lot of power and makes the room halfway bearable. The heat is distracting and it's difficult for me to think about the current problem that I have, which is transforming a triangle mesh into a constrained Delaunay triangulation on the GPU, but without compute or atomics. I rubberducked the problem within this article, but the write-up got so long that I saved it here instead: https://gist.github.com/paniq/e09a9f6c5b8b9aa1b835
Lots of things happened the past two weeks, along with my birthday. I took three days off to catch up on recent games.
Conspire had matured enough that it was time to single it out into its own repository. Oh, and I also renamed it. It's now called "None", which is a backronym. The website (which currently just redirects to Bitbucket) is at http://nonelang.org. The language should run on Linux, OSX and Windows. There are a dozen test cases now that demo aspects of None.
There are too many small changes to None to mention here, best check the commit log: https://bitbucket.org/duangle/none/commits/all - The most striking improvement is that the largest parts of macro expansion are now accessible as hooks from within the language, scoped, so that the DSL can be completely swapped, and global macros overridden. The static language still uses its own expression parser, but that is no longer necessary. Using the built-in macro expension has the benefit that the error reporting facilities can be reused. Expressions that cause trouble are included in the traceback.
After all these upgrades, I thought that it might finally be doable to try a GLSL backend, and that's what I'm sitting on now, the shader DSL. The general idea here is that you can compose your shader functions along with your dynamic and static functions (semantically, syntactically it's almost identical).
The GLSL code generator implicitly figures out the necessary dependencies to generate vertex, fragment and geometry shaders (compute to be added later). This becomes most convenient when dealing with shader
out variables that connect the stages with each other, vertex streams and framebuffers; In the shader DSL, they're called
shared variables. The code generator tags these using a simple heuristic: when a function sets a shared variable, it's an
out for this shader stage, otherwise it's an
in. If the shared variable is neither read nor written to, it is not declared. This way, a shared variable only needs to be declared once, and is then included in generated code where needed.
There's also support for declaring overloaded external functions that are either built-in or residing in native GLSL source code. The code generator does a type inference pass, matches applicable overloads, enables the extensions and includes the source code required to make it work. To avoid notation fatigue (a condition that I just made up), type inference is also used to figure out when a numerical constant is a float, and when it's an int, with strict implicit conversion rules: an int can be upgraded to float, but not the other way around. Overloads first try a perfect match, then an implicitly casted one.
Then there's the implicit goodness of None's powerful macro system: dynamic functions can build permutated shader functions from templates and quasiquotation. Shader functions can directly inline constants declared in dynamic code. The system is no longer hermetically sealed off from the rest of the codebase.
I hope these extensions will make writing shaders frictionless enough to make me want to write more of them. I have the largest part of the code generator done, and am finishing up on the top level declarations, namely uniforms and uniform blocks. My first test case is the shader program I use in my 2D remeshing prototype, which is a fairly complex shader with three stages (the code is here https://gist.github.com/paniq/e5b5ebd4797065f33a57)
Once that works (should take this week at most), all hard language work is done and I can return to continue work on the prototype.
More features for Conspire!
- Removed the Conspire 1.0 language from the loader; It's no longer supported.
rmlc-prefix has been changed to
static-. The RMLC DSL is now simply called the static DSL.
assertis now a macro; The second argument is only evaluated when the assertion fails. There's also an
assert-exceptfor unit testing against code that is expected to raise an error.
usingnow also works in the static DSL.
- There is now initial support for transparent struct-of-array types, inspired
by Jonathan Blow's SOA feature in Jai, via the
column-arrayarray type constructor.
- arithmetic with GLSL-style vector types only worked in the static DSL, now also works in dynamic code.
- vector types now support GLSL-style compound constructors in the static DSL, and can be initialized from arrays and SSE vector types.
- The symmetrical GLSL-style matrix types are now implemented.
- added Python-style
dprintis like print, but outputting the module name in which it was called.
- Because the dot notation makes member access convenient, built-in Lua modules are now in their own namespaces.
=#operator comparable to
reprallow introspection into tables and types in form of formatted symbolic expressions.
- Better support for handling Lua VM varargs: The
...argument is now passed unfiltered and argument unpacking is now a special form (
__unpack). Lua's select() function is available as
__select, allowing to enumerate varargs.
$operator now serves both as table constructor and namespace for table operations. It's a little jQuery-like.
- Keeping naming Schemes more familiar for programmers of other languages:
length-ofgot renamed to a Pythonesque
- To make pointer arithmetic more readable to people coming from C/C++,
*operators will expand to
^*if used as unary operators.
What... what happened? I spent all week writing all sorts of improvements and conveniences for the language and forgot the time because it's too much fun. Let's see.
- Updated Terra to latest build and LLVM 3.5. A few fixes were needed to deal with changes in the interface.
- Ported the gl, vecmath and pool modules to the new conspire.
exceptsyntax sugar for xpcall.
- Added GLSL-style vector types with support for permutated accessors, arithmetic operators and SIMD code generation. It's really quite convenient to work with, but the matrix classes aren't completely done yet.
*can be used as symbol prefixes, as shortcuts for
(^*). Makes de-/referencing more C-like.
alias-castsyntax sugar, similar to C++'s
- a very generic
usingmechanism that allows to hook scoped lookups for variables, inspired by Jonathan Blow's extension to C++'s
usingin his Jai language. I believe my implementation is generic enough that it allows extending the language with new lookup strategies from within.
The dynamic GLSL front-end was fun until I ran into a tough problem: how to import the function declarations contained in existing GLSL shaders (and I have a lot of those). So I'll stick with the previous GLSL shader generator for now.
Therefore: the yak is shaved! I ported the trimesh prototype I talked about on 10.6.2015 to conspire2, everything is ready for a thorough refactoring now, and work on the main game can continue.
What are the features of conspire2 so far?
- Scriptable compiler, can optionally be deployed to users.
- Runs on top of Terra, so provides all the same runtime services.
- Dynamic code runs on LuaJIT runtime.
- Statically typed code is compiled and optimized by LLVM.
- Native support for importing C libraries and Lua modules.
- Can export standalone executables.
- Memory management is explicit for static code, garbage collected for dynamic code.
- Syntax inspired by Scheme.
- AST uses S-expressions exclusively.
- AST is designed to be symmetric: code can be read, automatically altered and saved back to disk.
- Interpreter can be extended with new keywords and language constructs from within the language, up to completely new domain specific languages.
- Macros can be hygienic or non-hygienic.
- Dynamic and static functions use the same syntactic building blocks.
- Support for infix operators in S-expressions.
- Support for symbolic prefixes (0x, 0b, etc.)
- All sorts of nice syntax forms to make life easier.
So, my branching problem got fixed with an alternative solution suggested on the mailing list (https://mailman.stanford.edu/pipermail/terralang/2015-June/000441.html ). I'm happy to report that the RMLC (runtime machine language compiler) works without much further changes, and is completely expression based. Writing some fast assembly now got super convenient.
I also added a bunch of macros for struct declarations and dynamic class definitions, based on the 30log library.
I'm currently sitting on a GLSL compiler with a syntax analog to the RMLC compiler, and have just completed a markup tree -> C-like code generator implementation to that end. The markup tree will be generated from a GLSL AST which in turn is constructed from a semantic parser.
This way it will be possible to mix dynamic functions, static functions and shader functions in the same context, which all use more or less the same syntax. Different shader main functions can use the same dynamic declarations yet generate completely different material shaders from them. The GLSL system will then also be able to use Conspire's module import system, convert struct declarations into uniform structs and other niceties. I hope writing shaders will then no longer be a total PITA.
Turns out the bug report was necessary, I submitted it here: https://github.com/zdevito/terra/issues/99
As long as this isn't fixed, I'm not quite sure how to proceed with
branching support. I can't code the ternary conditional as a regular
if-block because those are statements; I would have to first
declare a variable, then set this variable in the
then or the
else block, but the type of the variable has to be negotiated
between what the returning expression expects and what the two
branches contain. Also, what to do with situations where the
final result is thrown away?
void can be passed as a value
for these situations, but it would help to know beforehand that
these expressions only matter as statements.
It looks like I'll have to revert the RMLC code generator to the
putdest structure I use for the Lua translator. Sad, because
I thought terra's expression quoting system was flexible enough. Great,
because this gives me more control. If generators know what an
expressions destination will be (return value, variable or discard),
and more importantly, the destinations type, they can annotate
temporary variables correctly and generate different code.
(if cond-expr then-expr else-expr) can produce a
ternary conditional expression when the target is a variable or
return value, enforcing both branches to return types compatible
to the target, or a regular
if statement block when the return
value is discarded.
Terra and the LLVM compiler both optimize out excessive use of temporary variables, so generating code in this fashion should be fine, except for immutable struct types declared on the stack, where temporary values might evoke unnecessary copy constructors. I'll have to do some testing first to know what happens here.
I added the new RMLC DSL (RMLC = runtime machine language compiler),
which is designed to be a full analog of the dynamic language DSL,
only with static typing added. This way, it is easy to switch a
static for a dynamic function or vice versa. Terra makes it easy to
write code that is completely expression-based, thanks to the
quote a in b end form, but I've been running into an evaluation
terralib.select, the ternary conditional operator, where
quoted statements get executed regardless. I'm currently building
a minimum reproduce to test with the latest version and submit to
the Terra project if necessary.
Before I could start re-adding the LLVM DSL I had to re-implement a
bunch of syntax forms to make the subsequent task a bit more
comfortable. In the process a few more special forms also got added.
which makes writing syntax rules easier),
variant of if-elseif-else blocks),
(import-from) (which automates
the task of declaring local variables and initializing them with
values read from a table, sort of the counterpiece of
(var-unpack) which imports variables from arrays, and
(foreach), with support for auto-unpacking iterator results.
The first thing I did today was write an infix parser that rewrites
(func 1 + 2 * 3 arg2 4 + 5 arg3) as
(func (+ 1 (* 2 3)) arg2 (+ 4 5) arg3). While the most popular
operators are already bootstrapped, code can register new infix
operators that will be translated to regular s-expressions. Infix
operators are always treated as binary. Unary prefix and postfix
operators are not supported.
Unlike in Lua or C, operators and arguments need to be properly
spaced apart because the s-expression parser doesn't know anything about
operators, so e.g.
a+4 is read as a single token,
a + 4
is read as three tokens.
After running into some semantic difficulties (it is simply easier
for syntax rewriters to assume that an expression list guarantees
one expression per token), I decided that infix
operators always need to be declared in dedicated parentheses; So
(func 1 + 2 3 + 4 + 5) must be written as
(func (1 + 2) (3 + 4 + 5)).
Variables are now running on the stack again, and
an ad-hoc table.
backquote have been ported.
The next step is to re-add the LLVM / terra DSL. I'm thinking of a
style that is transparent to the main language, with types added e.g.
(var name 0) becomes
(var name : int 0),
(function name (arg1 arg2) expr) becomes
(function name (arg1 : int arg2 : int) : int expr)
(function name (arg1 arg2) : (int <- int int) expr).
The past few days I continued work on the new Conspire interpreter, which is coming along nicely. The special forms expanded to 12, a number that is continuously in flux. I'm still in the process of bootstrapping all language features in the core header, and today I managed to move syntax expansion to its own pre-pass before compiling.
There's one central language feature I'm having trouble implementing, which is hygienic macros for syntax-rules. I tried parsing and rewriting the templates, but it's difficult doing this without replicating the syntax expander.
The other feature I'll need to change is the way scope variables are captured; I'll move this back to the Lua stack, mapping generated symbols to the actual names to support free naming. Currently, the values are directly written and read from lua tables, which complicates the code and may cause problems with garbage collection; I'll implement the feature I did this for (locals) differently; (locals) will simply expand to a table mapping all variables captured in the current scope.
I took a week off from Twitter and that seems to have helped with the productivity. I forced myself to look at the code I have to refactor and realized that the Scheme-inspired language that I developed for game programming needed a few fixes in its semantics so it's fun again.
The first version was very closely built along the principles of Lisp and Scheme, and I made some trade-offs to be able to translate code to Lua, and it turns out a) I don't like (let) and (letrec) much, because they make variable declaration in functions semantically different from variable declaration in modules b) piggybacking variables on Lua's stack (with the drawback of not being able to use the full Lisp character set for variables) for speed wasn't really needed, because I write nearly all game code in the terra DSL, and so I can make scopes a little more Pythonic instead (supporting easier injection and extraction of variables) and c) The terra DSL part has become the heart of the implementation but runs on C-like semantics that are sufficiently different from the Scheme one that switching between both languages becomes jarring, so they need to meet in the middle.
I wrote a list of changes I want to do in the implementation (it's at https://gist.github.com/paniq/34525b55cd4ae8744961), then made a new translation unit to fit the specs. I'm very happy with how it turned out, it only needs 8 special forms, 6 constants and 5 builtin functions to bootstrap the rest of the language.
Conspire supports multiple translation units. The previous revision still works without any changes, so I don't have to port anything right away, and because transforming the AST is simple, it's going to be easy to translate some of the code I already have.
That should make it a lot less annoying to refactor the triangle mesh implementation, and I can think about more new language feature like support for SOA (struct of array).
It appears that many of my problems stem from the inability to rubberduck them with a colleague; When I was still working at a game studio, I could meet my colleagues in the hallway and bounce my ideas off them - someone I'd deem competent who would listen to my stream of thoughts and give it a seal of approval, giving me enough confidence to go down that road -- or enlighten me with an alternative approach that leads to the same end.
Nowadays, I mostly work alone, and while Twitter has been helpful many many times in providing a rubberduck surrogate, it appears that often my problems are too complex to describe in few tweets in such a way that they invite others for collaboration or judgement. When everyone who is competent believes that I already know what I'm doing but does not tell me so, I feel like I'm on the wrong path. I get the sense that my friends have their doubts but keep them to themselves, and I procrastinate, sitting in a mental waiting room, waiting for my next shot of enthusiasm.
Therefore, in the hopes of catalyzing thought and confidence, I decided to start a devlog within a gist, which I will attempt to update each day. The devlog is to be seen as a precursor to a possible blog article; thoughts are less ordered and focused, documented approaches can be completely pointless in the end.
It is possible that some of the things written here might ultimately turn into a blog post; I usually begin planning one when enough people express their interest for a topic.
So, with that out of the way, what's going on? I'm currently in the process of writing a 2D prototype of the NOWHERE terrain and physics system as I imagine it to work; following the ideas of this paper, http://matthias-mueller-fischer.ch/publications/airMeshesPreprint.pdf the world will be completely embedded within a huge square 2D triangle mesh, tesselating both air and solid materials. The mesh will host collision detection, rigid and softbody physics, most likely using new position based dynamics techniques as documented in http://www.researchgate.net/profile/Jan_Bender/publication/274940214_Position-Based_Simulation_Methods_in_Computer_Graphics/links/552cc4a40cf29b22c9c466df.pdf and I also want to experiment with 2D light propagation and wind current simulation later on.
What I have so far is a 2D mesh class with support for generating the beginning quad that bounds the simulation space, splitting faces, splitting edges, flipping edges and an optimization pass that flips all edges where it improves the quality of the mesh.
There's also a testing and editing front-end that renders the mesh straight from a buffer texture using a GL geometry shader, allowing me to dynamically split faces and edges, flip edges and initiate an optimization pass -- all this happens on the CPU right now and will continue to do so until a move to GPU is really necessary.
What is missing right now is the physics and collision system and the associated auto-mesh-optimization step. For this to happen, I believe I need a double-buffering scheme for the mesh. Currently, vertices and faces are stored in two separate pools, and I suppose those would then become four pools. Each pool would store the individual attributes in separate arrays (struct-of-array), so buffers could be flipped array-wise, not pool-wise; this also allows me to upload only the absolutely relevant data to the GPU.
The methods for the mesh would all have to support reading from a front- and writing to a backbuffer target which can be freely set.
For the opimization via edge flips, it appears I'll have to operate on the backbuffer completely, as operations update opposite half-edges (I call them "twins"), and the operation can't be parallelized. It probably can, but I can't think of or seem to find an efficient way to do this. So it will have to do for now.
When the double-buffering is implemented, I can probably start writing a primitive physics loop with some colliding bodies.