More notes on the Lua Garbage Collector in the NodeMCU implementation

@TerryE · Last active October 13, 2021

Background

Lua employs automatic memory management for its objects. The Lua garbage collector (LGC) uses an incremental mark and sweep (M&S) strategy, in contrast to the reference counting strategy used by Python, etc. The LGC is quite complex, and its detailed strategy and algorithms have been changed by the Lua development team in each of the Lua 5.x versions.

As is common practice for many other FLOSS source repositories, Lua's primary program documentation is the Lua source itself. The Lua code is largely clear and readable, and broadly achieves this self-documenting goal. However, in my view the LGC is an exception here, and lgc.c could do with a code cleanup to make it more transparently understandable, as many aspects of its implementation are clear as mud. There is a sparsity of other technical notes, specifications, etc. relating to the LGC, and much of what does exist relates to old Lua versions, so other than the Lua source itself the relevant background public LGC documentation is really limited to the Lua 5.3 Reference Manual (LRM), specifically LRM §2.5 and the sections it references. A presentation, Garbage Collection in Lua, given by Roberto Ierusalimschy at one of the Lua workshops is also useful, though this predates Lua 5.3. In this note I will assume that the reader has read and is familiar with the LRM. Here I want to focus on the Lua 5.3 LGC implementation, and specifically how the 5.3 LGC works within the NodeMCU fork.

The NodeMCU environment is a Lua 5.3 fork intended for embedded IoT applications using the ESP8266 and ESP32 series processors. These processors employ a modified Harvard architecture that can execute code out of programmable flash memory (effectively acting as ROM for normal operation), but with limited RAM available to the application (typically ~40Kb and ~150Kb of free RAM heap for the respective series). IoT Lua applications also usually require real-time or near real-time response characteristics.

NodeMCU implements a flash memory LFS storage class where compiled modules and their associated Lua objects are treated as read-only from the perspective of the Lua runtime. Whilst the Lua Virtual Machine (LVM) uses these RO objects within its address space (in the same way as all other objects in the Lua collectable object classes), these are permanent and immutable. Normal RAM collectable objects can refer to RO ones, but RO objects can only refer to other RO ones. In practice all ROM objects are entirely removed from the LGC process, except that any RO object encountered in an LGC traverse is treated as a dead end and that branch of the traverse terminates there. Otherwise the LGC operates as standard for the collection of RW resources.

As part of LFS, NodeMCU also adds a second string table, the ROstrt, to the runtime environment for short RO strings. The LVM maintains integrity across this ROstrt and the standard strt for RAM-based short strings created during execution; this is simple and low-cost, with the LVM string resolution process extending across both the RAM and ROM string tables. Hence any strings already in the ROM string table can be reused, avoiding the need to add these as additional entries in the RAM table. The LGC itself only operates on the RAM-based strt. This approach both significantly reduces the size of this table and removes a lot of strings from LGC scan operations.

In earlier Lua versions (prior to 5.1), the LGC was a single atomic operation that was executed as needed: LVM execution was halted for the duration of the M&S process, and this could cause significant unplanned latencies in the running application. The 5.3 LGC instead runs as a coroutine, interleaving its execution with that of the LVM executing the application; each yields in turn to the other at defined points during execution (though the Lua library still provides a collectgarbage('collect') function to allow the application to do a full synchronous and atomic garbage collection). The overall LGC robustly maintains data integrity across a range of use cases and implementations, but the fine details lie outside this note. What I do want to explore here are some characteristics of this coroutine interaction, and how this impacts the overall performance. Before doing this, I think it useful to review what are (and what are not) Lua collectable objects.

Lua Objects

The basic atomic unit used to store a value in Lua is a TValue structure. The ESP processors use a 32-bit address and data bus, so by default NodeMCU uses 32-bit encoding for both integer and float types, with a TValue being 8 bytes long. The Lua types boolean, nil and number are stored by value in the TValue and are thus not collectable. The remaining types function, string, table, thread and userdata are known as collectable objects; in these cases the TValue contains a reference to the object rather than an immediate value, incidentally enabling multiple Lua values to point to the same object.

Collectable objects, with the exception of strings, can reference other collectable objects. For example, a table can include objects as keys or values; a function can reference objects as upvalues, and when it is called any locals will be allocated as TValues on the thread's stack. This web of references creates a directed graph. Any object that has been created in the LVM is reachable if it is referenced by another object that is reachable. The Lua runtime treats three objects as special in that they are deemed always to be reachable: the global environment _G, the main thread, and the Lua registry. These objects therefore root this directed graph.

Note that one of the main jobs of the Lua Registry is to enable services such as timers and network objects to root otherwise unreachable values such as Lua callback functions; this prevents the LGC from collecting them. (One of our biggest sources of leakage is when such services do not use finalisers to clean up such registry entries correctly.)
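
To make this concrete, here is a minimal hedged sketch of the pattern using the NodeMCU tmr module (this assumes the standard tmr object API; the exact rooting mechanics are internal):

local t = tmr.create()
t:alarm(5000, tmr.ALARM_AUTO, function() print('tick') end)
-- the anonymous callback is now rooted via the registry, so the LGC cannot
-- collect it (or anything it references) until the service drops its reference:
t:unregister()   -- releases the registry entry; the callback is now collectable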

Also note that any object may well be reachable by multiple paths; however, any object that is not reachable is classified as unreachable. This difference between reachable and referenced is subtle but important: it is quite possible to create a set of objects with circular references such that every object is referenced, yet none in the circuit is reachable. This doesn't matter to the LGC: once an object is unreachable, it is in effect dead and its resources will be reclaimed. (In Python, for example, the application would need explicitly to break the circle for the runtime to collect them.)
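
A minimal sketch of this in plain Lua 5.3: the two tables below reference each other, so each is referenced, but once the locals go out of scope neither is reachable and both are collected:

do
  local a, b = {}, {}
  a.partner, b.partner = b, a   -- circular references: each table references the other
end                             -- a and b are now out of scope: the cycle is unreachable
collectgarbage('collect')       -- both tables are collected despite the cycle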

The job of the LGC is to return the memory resources associated with unreachable objects back to the heap for reuse. There are two complications to this rule:

  • Weak Tables. (See LRM 2.5.2 – Weak Tables). These implement a concept of tables with weak references. This complicates the internal processing rules of the LGC, but this can all be wrapped up in the "magic happens" category.
  • Finalisers. (See LRM 2.5.1 – Garbage-Collection Metamethods). These are created when an object has an associated __gc metamethod which is bound to a Lua or callable C function. These complicate LGC processing.
    • The __gc metamethod can only be fired once the LGC has determined that its associated object is itself unreachable. The function must be executed within the LVM, so this object and any objects that it references must be reachable for this all to work correctly. Hence, the LGC must first resurrect this dependent object hierarchy before calling the metamethod. This function is also free to resurrect the object or any of its dependents by explicitly creating a reference through another reachable object; if it doesn't, then the LGC will free the object's memory. Any further cleanup of the dependent objects is deferred to the next LGC pass. The function can also create yet more new objects. (A sketch follows this list.)
    • Of course __gc metamethods won't be fired for reachable objects, so the number of objects with active finalisers will typically be small or zero for most LGC passes.
    • The consequence of all this is that the LGC must carefully order and schedule any finalisers. This magic is all done as an epilogue to the collection cycle, and because finalisers can increase memory resources, any Emergency LGC pass will ignore any objects that include finalisers; these can only be collected during normal LGC passes.
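
The following sketch (plain Lua 5.3, illustrative only) shows a finaliser resurrecting its object by creating a new reference through the reachable table keep:

local keep = {}
local obj = setmetatable({}, { __gc = function(o)
  keep[#keep + 1] = o          -- resurrect: o is reachable again via 'keep'
end })
obj = nil                      -- the table is now unreachable
collectgarbage('collect')      -- the LGC determines it is dead, so __gc fires inside the LVM
assert(keep[1] ~= nil)         -- the object survived this collection cycle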

This all makes deletion of unreachable objects a fraught area. The 5.3 LGC adopts a somewhat conservative approach to this: it is safe to delete any object that has remained unreachable for two LGC passes. Hence unreachable objects can persist for one extra LGC cycle before the LGC recovers the associated memory resources.

The LVM creates objects as needed during execution, and it can also overwrite or remove a reference to any object in a TValue. What it can't do is delete the object itself, so the number of collectable objects can only increase during execution. What the LVM can do is yield control to the LGC, allowing the LGC to identify unreachable objects and delete them in order to free up heap.

Memory performance can be critical on an IoT device given its limited RAM, so it is worth covering the memory use and scope of objects in more detail:

  • Local values. Any file, function or do scope can define one or more local variables. Each is allocated a TValue slot in the function's stack frame. These local TValues aren't individually collectable, but any of them can reference a collectable object; all of the TValues in the call frame are discarded on exit from the function / scope.
  • Upvalues. Lua is a lexically scoped language, so any function (or do scope) can refer to and use a local variable declared in an outer scope. The do scoping uses a TValue in the current call frame, but the function scoping is more nuanced: the scope of the upvalue is not the function call but rather the closure (see following), with each closure object including a TValue vector sized at one slot per upvalue. The value itself is implemented in one of two ways. If the outer function is still a parent in the call chain, then its TValue call frame still exists on the stack and the upvalue is a reference back to the true local value. However, in Lua a closure can also remain in scope after the function that created it has returned; in this case the true local value is no longer valid on the stack (see the following example), but any upvalues must still remain valid even though they are no longer referenced on the stack. Such upvalues are closed off and relocated out of the stack via another piece of Lua magic. Access to, allocation of, and destruction of both locals and upvalues is fast and cheap, and doesn't involve the LGC.
  -- counter.lua: returns a pair of closures sharing the single 'count' upvalue
  local count = 0
  return {incr = function() local c = count + 1; count = c; return c; end,
          decr = function() local c = count - 1; count = c; return c; end}
  • Strings. In Lua, strings are divided into two categories: short (40 bytes or less) and long. Short strings are resolved and interned using the strt and ROstrt; resolution is based on a simple hash lookup (or two lookups in the case of RO strings), though most string references are made directly through a TValue link. (Long strings are typically used for computed strings, and in practice the chances of identical long strings being generated are small, so it makes sense not to intern these and so avoid hash table churn.)
    • RAM strt entries are typically created by string operations such as concatenation or string library calls. In a typical IoT application, such TString values are usually quite short; the header adds a 16 byte overhead so malloc units are typically in the 32-64 byte range. The strt hash vector itself is resized dynamically on a doubling basis with 4 bytes per slot. LVM execution can only increase TString storage during execution, with the LGC doing any trimming as strings are no longer reachable in the executing application.
    • Most of the strings used are compiled constant strings that are stored in LFS; this moves the bulk of string management out of strt and eliminates the related string LGC churn. The saving typically far outweighs the increased resolution overhead of the extra hash lookup needed for the split across strt and ROstrt. (See the sketch below.)
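  The effect of interning can be demonstrated from Lua itself; this is a minimal sketch, and works whether "pin7" is resolved from strt or ROstrt:
  local s1 = 'pin' .. 7                 -- runtime concatenation resolves/creates "pin7"
  local s2 = string.format('pin%d', 7)  -- a second route to the same short string
  print(s1 == s2)                       -- true: interned short strings make this a pointer compare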
  • Functions. Functions are somewhat confusing because as well as the Lua variable type there is also a hidden implementation type that isn't exposed programmatically: the Proto, which is a loadtime construct. Consider the above example: loading this file creates 3 compiled functions or Protos: one for the main function implicit in compiling the file, and two for the incr and decr anonymous functions. These are essentially compile-time constructs that contain only metadata: how to execute the code; any constants it uses; debug metadata on variable names, etc.; sizing data on the numbers of stack slots needed, parameters, upvalues, locals, constants and nested functions. The actual function values themselves are only created at runtime by the LVM executing the CLOSURE opcode; for example, requiring the example above creates the main closure, and evaluating the return creates two more closures for the two methods. Each closure construct includes an upvalue vector (with one entry in this case). Closures are subsequently removed by the LGC using standard scoping rules, so cnt = node.LFS.counter() will create a Lua function value and execute its main function to return two more. The subsequent LGC passes will collect this now unreachable main closure; cnt.incr and cnt.decr still reference the other two. Executing either will create the function's call frame on the stack and run it. The LGC collects any dead Protos under normal live scoping rules, though this doesn't take place for (RO) LFS functions.
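  Following the node.LFS.counter() usage above, a hedged usage sketch of the counter example (this assumes the file has been compiled into LFS as the module counter):
  local cnt = node.LFS.counter()  -- runs the main Proto; returns the table of two closures
  print(cnt.incr())               --> 1
  print(cnt.incr())               --> 2
  print(cnt.decr())               --> 1 (both closures share the single 'count' upvalue)
  cnt = nil                       -- the closures and their shared upvalue become unreachable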
  • Tables. Tables comprise a header record and optional array and node vectors to hold the TValues for elements. The array vector is for list-type arrays where the keys are consecutive integers starting at 1 (for example {'Red', 'Green', 'Blue'}), with a TValue for each slot. The node hash is for keyed lookup (for example {Red = 1, Green = 2, Blue = 3}), and each slot comprises 3 × TValue + an integer (40 bytes in NodeMCU). These vectors are sized and resized on a doubling basis, with the first non-zero size being 4, so for example the table {re = 4.0, im = 0.0} will require 2 malloc records of 28 + 160 bytes.
    • The # operator scans the array vector for the first nil entry (up to its current size).
    • Setting an array element to nil functionally deletes it (that is, the Lua next and pairs functions will skip it during enumeration). Likewise, removing a table entry sets its value to nil. (See the sketch after this list.)
    • Certain operations can trigger resizing of an array. New array and node vectors are sized based on the count of non-nil elements, so these vectors can grow as needed as well as shrink (e.g. during LGC array finalisation); during resizing the contents are copied across from old to new before the old vectors are deleted. This is a relatively expensive runtime operation. Note that even if the LGC shrinks a table, the heap use will still need to increase temporarily during resizing.
    • Lua application objects are implemented by a table with a metatable, so using object oriented code (especially where objects are nested hierarchically) carries a relatively high memory and GC overhead.
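    A short sketch of the nil-deletion rule mentioned above (the slot size figures are the NodeMCU 5.3 numbers quoted earlier):
    local colours = {Red = 1, Green = 2, Blue = 3}   -- node vector of 4 × 40-byte slots
    colours.Green = nil                              -- functionally deletes the entry
    for k, v in pairs(colours) do print(k, v) end    -- Green is skipped in the enumeration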
  • Userdata. A userdata value represents a block of raw memory, and as such has no predefined operations in Lua (except assignment and identity test). However userdata can have an associated metatable, allowing C API functions to operate on it. The LGC does little other than the standard M&S detection of death, and the invocation of the __gc finalizer.
  • Threads. There are few threads in an IoT application (often only the main thread), so the size of the Thread structure has little impact. However, each thread has a dynamically sized vector of TValues associated with it: the thread's stack. If call nesting gets too deep then the stack can grow considerably.

LGC performance

Consider a typical LFS application using its nominal 40Kb heap mainly for Lua variables and the dynamically sized vectors that some Lua types use. What are the biggest consumers of RAM?

  • Aggregating Tables. Constructing any table with a large number of slots is problematic. For example, a keyed table with 256 entries requires a contiguous allocation of 10Kb RAM. Adding one more entry requires an extra 20Kb RAM for the node vector to be rehashed into before the original 10Kb vector can be freed. (See the sketch after this list.)
  • Complex Table (or Object) Hierarchies. I recently reviewed a code example where the developer was running out of RAM; the system config was maintained in a reasonably complex hierarchy involving some 20 tables: most had fewer than 4 entries, but 6 had over 8. This is all trivial for a PC application, but initialising this on an ESP8266 would use 10Kb, a quarter of the available heap.
  • Incremental Aggregation of Strings. I often come across "clear coding" examples where there is a long list of rec = rec .. 'some more' statements, or a for loop around an equivalent indexed expression. This practice generates a large number of 'use-once' intermediate strings that will be dead before the function exits.
  • Deep Recursion levels. As discussed above.
  • Sloppy descoping of functions and class Tables / Userdata. Object values that end up in _G or the Lua registry are doubly pernicious, because the LGC treats these roots as always reachable and so does not collect any resources referenced (directly or indirectly) through them.
  • RAM based compiled code. Not using LFS for application code.
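
As an illustration of the first bullet above, this sketch (sizes assume the ~40 bytes per node slot quoted earlier) shows why adding a single entry to a 256-entry keyed table is so expensive:

local cfg = {}
for i = 1, 256 do cfg['key' .. i] = i end  -- node vector: 256 slots, ~10Kb contiguous
cfg.oneMore = true                         -- triggers a rehash: a new 512-slot (~20Kb)
                                           -- vector is allocated and filled before the
                                           -- old 10Kb vector can be freed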

Heavy use of dynamically compiled strings and lots of small array use can generate lots of small object churn and fragment the heap space, but the LGC will be effective in returning dead object resources to the heap. In contrast, Lua uses a doubling / halving strategy to resize any vector-based resource (stacks, array hashes, etc.) because the resizing operation is itself pretty expensive. Whereas the Lua runtime can use the Emergency LGC to recover successfully from memory exhaustion caused by small object churn, this will rarely reduce heap fragmentation enough to allow the allocation of a larger vector, so the <failed alloc, Emergency LGC, restart alloc> strategy will still typically fail and throw a Lua out-of-memory error. Larger arrays of anything are bad news.

Another issue that impacts ESP IoT implementations is that other system drivers and services use the same allocator and address space as the Lua runtime. These run asynchronous functions that need to allocate RAM for buffers etc. If Lua runs out of RAM, the emergency LGC may scrub enough dead objects to restart the LVM at the failed allocation; however, if a service such as WiFi runs out of RAM, then the WiFi and IP stacks can fail.

NodeMCU developers seem to love attempting to implement web servers and other end-user interfaces on an ESP target. These typically require lots of the above and are therefore continually beset by memory exhaustion issues. Compiling programs into RAM can also be very problematic, and the need to write them as 'ephemeral' functions (that is, in such a way as to allow dead Protos to be correctly processed by the LGC) can significantly complicate the source implementation. So my general practice is:

  • Use LFS for all application code. About the only time that I use loading into RAM is for individual testing of new source modules.
  • Use locals and outer scope access through upvals when practical.
  • Avoid creating dynamic strings, and use string library calls that are optimised to minimise "use once" working strings (e.g. table.concat(), string.format()); see the sketch after this list.
  • Avoid using arrays; keep them as small as practical; use indexed arrays in preference to keyed ones.
  • Avoid using objects with finalisers.
  • Always use proper tail calls (see LRM §3.4.10 – Function Calls) where appropriate to minimise stack growth.
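
As an example of the string advice above, the following sketch replaces incremental rec = rec .. part concatenation with a single table.concat(); the names records and format_rec are hypothetical stand-ins:

local parts = {}
for _, r in ipairs(records) do         -- 'records' is a hypothetical source list
  parts[#parts + 1] = format_rec(r)    -- 'format_rec' is a hypothetical formatter
end
local rec = table.concat(parts, '\n')  -- one pass, one final string; no use-once intermediates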

What I mean by 'avoid' in this context is: don't use these blindly; be aware of the potential consequences, and balance the benefits in terms of functionality and code maintainability against the consequential memory and LGC overheads.

The exact timing of a full LGC pass is hard to estimate because the mark phase is essentially a recursive tree-walk of all reachable objects, albeit with short-circuits where a visited object is already marked. The outcome of the mark pass is to sentence all objects into a set of lists. These are then swept in a set of sub-phases, a consequence of the complexities introduced by the coroutining of the LVM and LGC and the support for finalisers that will be executed within the LVM.

The main point to observe is that this is predominantly an O(N) process; that is, the runtime is mainly proportional to the number of collectable objects. For an LFS-based ESP8266 application with a typical mix of mostly string and array objects, dividing the average size of such smallish objects into, say, 50% of the 40Kb RAM (we need room for the thread stack and OS service resources) gives a ballpark figure of perhaps 500 objects, maybe within a factor of two either way depending on coding practices, but still a good ballpark. Clearly the use of LFS is a major factor here: running code from RAM will result in Protos and code string constants eating ½-¾ of the RAM, dropping this limit 2-4 fold. Moving to the ESP32 with its larger RAM could increase this number 4-fold. If the LGC makes 3 runs over a list of 500 objects to do a full collection, then this is an expensive runtime operation with the LVM locked out for its duration; this can cause significant problems for latency-dependent applications.

The need to mitigate this latency was the rationale behind the introduction of a step-based incremental LGC algorithm, where the quantum of LGC time in any single step is significantly less than a full collection cycle. This is implemented by coroutining the LVM and LGC with outer and inner collection cycles:

  • Full GC cycle. This is the complete M&S cycle. At the end of the cycle the LGC is in a 'paused' state, and the LVM then executes continuously (without yielding to the LGC), consuming heap, until a restart criterion is met; this sets the LGC back to a running state where steps can automatically interleave with LVM execution.
  • GC Step. Each full cycle is sub-divided into a set of steps, with LVM execution being interleaved between each step. This means that:
    • The LGC process must be interruptable at the step boundary and internally retain enough state to be able to resume after the LVM yields control back to the LGC to execute the next step.
    • Any changes to reachability made by the executing LVM must not invalidate the integrity constraints of the LGC. The marking strategy uses a variation on Dijkstra's 3-colour white / grey / black scheme to facilitate this, and some LVM internal primitives include barriers and marking cascades to ensure integrity; however, for the purposes of this discussion the details can be treated as "magic happens" and are largely irrelevant. The overall M&S process comprises a number of logical phases; most of these are resumable under these integrity rules and can thus be split across steps. However, a couple of phases need to be atomic because the marking structures are in an indeterminate state from an LVM perspective, and these must be entirely contained within a single LGC step.

Two parameters can be adjusted for tuning how the LGC coroutines with the LVM:

  • setpause. This determines when the LGC is restarted after being paused. The idea is that restart occurs after the heap used has grown by a tuneable percentage. This might be appropriate to a VM-based memory model, but there is nothing to be gained for a RAM-constrained IoT application, as any pause will invariably increase both the peak heap consumption and the LGC latencies introduced in the steps immediately following resumption. It makes a lot more sense to run the stepper continuously. Rather than changing the fundamental algorithm, this is most easily realised by setting the default pause value to 100. With this setting the LGC will never pause, and will continuously interleave LVM execution and LGC collection steps.
  • setstepmul. This step multiplier controls the finer-grain interleaving of the LVM and LGC. A global state field, GCdebt (the GC debt), accumulates the number of bytes allocated since the last adjustment by the LGC. This debt saw-tooths between the LGC and LVM executions: the stepping process subtracts from the debt (driving it negative) as it recovers memory, and returns control to the LVM when a sufficient debt margin has been achieved; the LVM memory allocator then adds to it as execution allocates memory resources. After the LVM executes any opcode that could allocate memory, it checks whether the debt has flipped positive, and if so it yields back to the LGC. The threshold for 'end of step collection' is scaled by stepmul / 200, so setting the multiplier to 200 creates a unit scaling, meaning that on average one step will recover at roughly double the overall rate of allocation. Setting it to 300, say, will mean that the stepper runs more frequently and recovers memory more aggressively; individual steps will be shorter, but the aggregate overhead of the LGC will be larger. Dropping it to 150, say, will have the opposite effect, making the LGC a bit more sluggish but the aggregate LGC execution overhead smaller.
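
On this basis, a plausible one-time tuning sketch for a RAM-constrained NodeMCU application might be:

collectgarbage('setpause', 100)     -- never pause: LGC steps interleave continuously with the LVM
collectgarbage('setstepmul', 200)   -- unit scaling: collect at roughly double the allocation rate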

As well as the automatic running of the LGC, the Lua collectgarbage() library function can be used to control the stepping process under application control, depending on its first parameter:

  • "collect": (default option) perform a full GC cycle. As full GC can involve a material quantum of processing delay, I suggest that full GC steps should only be used with care, especially if the application has low latency goals (e.g. it uses gpio triggers). A good time to use full GC is at the end of one-time activities such as application initiation to flush out all of the now unreachable / dead values only used in initialisation. Another option is to do a full GC as the epilogue to any Lua callback; this will leave the heap clean for the next callback execution.
  • "step": performs a garbage-collection step. Returns true if the step finished a complete GC cycle. The optional n argument will force the step collector to perform as if nKb had been allocated by the LVM. A good alternative strategy to using a full GC epilogue to a callback, is to collectgarbage('step',4) say. This has a cheaper cost and trims the heap for the next callback execution; this leaves a good debt so delaying the next automatic step. Note that executing a step automatically restarts the collector.
  • "stop" and "restart": stop / restart automatic execution of the garbage collector. In general these should be avoided as stopping the LGC will quickly cause memory exhaustion if not restarted. An example exception to this is when the application is doing some time-critical GPIO diddling; here, time critical code should be bracketted by a stop / restart or (step + stop) / step pair if the code is doing any operations that could trigger an automatic step (string creation, new array assignment, Lua function calls), though of course it is sensible to avoid these anyway during time critical code.
  • "setpause" and "setstepmul": as discussed above, leave to defaults or tweak once during application initialisation.
  • "count" and "isrunning"": returns the total (Kb) memory in use by Lua, and a boolean for whether the collector is running. I am not sure how useful these are since node.heap() is more important and the LGC will always be running with the new default pause value.
@jmattsson commented
Fabulous write-up Terry, thanks a bunch!
I'd love to see this added to the NodeMCU docs proper as well - it's too valuable to be sitting out here on its lonesome :)

Here are some comments on typos and such that I noticed on my read-through:

> LFS.counter

Spell it out as node.LFS.get('counter')?

> allowing C API function to operate on the it

on the it

> but the LGC with be effective in returning deed object

with, deed

> this will rarely result in reducing heap to fragmentation

to

> and other users interfaces

users

> that can run withing the LVM

withing

> and the LVM is left to execute continuously consume heap

missing 'and'?

> white / grey /black

spacing

> the details of all this can be treated as "magic "happens"

surplus "

@TerryE (Author) commented Oct 12, 2021

@jmattsson I've made those changes plus a bit more tidy up.

> I'd love to see this added to the NodeMCU docs proper as well - it's too valuable to be sitting out here on its lonesome :)

I am of a similar mind, TBH. However, I'd also welcome the views of @nwf, @marcelstoer and possibly active Lua devs such as @mk-pmb. One concern is that there is perhaps some detail here that is really only of interest to committers who are working on Lua internals. Possibly the best place for some of this content is in the FAQ. In a similar vein, a lot of the LFS documentation was really a confidence measure for other committers whilst it was being developed. This also falls under the "tl;dr; I am happy to accept 'magic happens'; just tell me what I need to know" category.

This being said: forget NodeMCU and IoT; there's nothing like this out there to help general Lua developers.

@mk-pmb commented Oct 13, 2021

I'm sorry, it will probably take months until I can dive deeply into anything MCU-related. Feel free to remind me end of January or in February.
