@TerryE
Last active Apr 25, 2020
Lua 5.3 port to NodeMCU Firmware

Background and Objectives

The NodeMCU firmware is currently based on the Lua 5.1.4 core with the eLua patch and other NodeMCU specific enhancements and optimisations ("Lua51"). This paper discusses the rebaselining of NodeMCU to the latest production Lua version 5.3.5 ("Lua53"). Our goals in this upgrade were:

  • NodeMCU should offer a current Lua version, 5.3.5, that is as functionally complete as practical.

  • Lua53 will adopt a minimum-change strategy against the standard Lua source code base; that is, changes to the VM and runtime system will only be made where there is a compelling reason. For example, Lua53 preserves some valuable NodeMCU enhancements, such as Lua VM support for executing constant Lua program and data directly from Flash ROM. This frees up RAM for application use and so mitigates the RAM limitations of ESP-class IoT devices.

  • NodeMCU will provide a clear and stable migration path for both existing hardware libraries and ESP Lua applications being migrated from Lua51 to Lua53.

  • The Lua53 implementation will provide a common code base for ESP8266 and ESP32 architectures. (The current Lua51 implementation has historically been forked, with variant code bases for the two architectures.)

Specific Design Decisions

  • The NodeMCU C module API is built on the standard Lua C API that is common across the Lua51 and Lua53 build environments, but again with the limited changes needed to reflect our IoT extensions. Note that Lua 5.3 introduced some core functional and C API changes; however, the use of the standard Lua 5.3 compatibility modes largely hides these changes, though modules may optionally make use of the LUA_VERSION_NUM define should version-specific code variants be needed.

  • The historic NodeMCU C module API (following eLua precedent) added some extensions that somewhat compromised the orthogonal design principles of the standard Lua API; these are that modules should only access the Lua runtime via the lua_XXX macros and calls exported through the lua.h header (or the wrapped helper luaL_XXXX versions exported through the lauxlib.h header). Such inconsistencies will be removed from the existing NodeMCU API and modules, so that all modules can be compiled and executed within either the Lua51 or the Lua53 environment.

  • Lua publishes Reference Manuals (LRM) for the Language specification, core libraries and C APIs. The Lua53 implementation will include a supplemental Reference Manual to document the NodeMCU extensions to the core libraries and C APIs. As these APIs are unified across Lua51 and Lua53, this also provides a common reference that can be used for developing both Lua51 and Lua53 modules.

  • The two Lua code bases will be maintained within a common Git branch (in parallel NodeMCU sub-directories app/lua and app/lua53), with an optional make parameter LUA=53 selecting a build with Lua based on app/lua53, thus generating a Lua 5.3 firmware image. (At a later stage once Lua53 is proven and stable, we will swap the default to Lua53 and move the Lua51 tree into frozen support.)

  • Many of the important features of the eLua changes to Lua51 used by NodeMCU have now been incorporated into core Lua53 and can continue to be used 'out of the box'. Other NodeMCU LFS, ROTable and LCD functionality has been rewritten for NodeMCU, so the Lua53 code base no longer uses the eLua patch.

  • Lua53 will ultimately support three build targets that correspond to the ESP8266, ESP32 and native host targets using a common Lua53 source directory. The ESP build targets generate a firmware image for the corresponding ESP chip class, and the host target generates a host based luac.cross executable. This last can either be built standalone or as a sub-make of the ESP builds.

  • The Lua53 host build luac.cross executable will continue to extend the standard functionality by adding support for LFS image compilation and by including a Lua runtime execution environment that can be invoked with the -e option. An optional make target also adds the Lua Test Suite to this environment to enable use of this test suite.

  • Lua 5.3 introduces the concept of subtypes which are used for Numbers and Functions, and Lua53 follows this model adding a ROTables subtype for Tables.

    • Lua numbers have separate Integer and Floating point subtypes. There is therefore no advantage in having separate Integer and Floating point build variants, so Lua53 ignores the LUA_NUMBER_INTEGRAL build option. However it will provide an option to use 32 or 64 bit numeric values, with floating point numbers being stored as single or double precision respectively, as well as the current hybrid where integers are 32-bit and floating point numbers are double precision. Hence 32-bit integer-only applications will have similar memory use and runtime performance to existing Lua51 Integer builds.

    • Lua tables have separate RWTable and ROTable subtypes. The Lua53 implementation of this subtyping will be backported to Lua51 where these are separate types, so that the table access API is the same for both Lua51 and Lua53.

    • Lua Functions have separate Lua, C and lightweight C subtypes, with this last being a special case where the C function has no upvals and so it doesn't need an associated Closure structure. It is also essentially the same TValue type as our lightweight functions, but there is no need for any explicit API support for it.

  • Lua 5.3 also introduces a small number of other but significant Lua language and core library / API changes. Our Lua53 implementation does not limit these, though we have enabled the appropriate compatibility modes to limit the impact at a Lua developer level. These are discussed in further detail in the following compatibility sections.

  • Many standard OS services aren't available on an embedded IoT device, so Lua53 will follow the Lua51 precedent by omitting the os library for target builds, as the relevant functionality is largely replaced by the node library.

  • Flash based code execution can incur runtime performance impacts if not mitigated. Some small code changes are required as the current GCC toolchain for the Xtensa processors doesn't handle all of these mitigations during code generation. For example, as the hardware only supports aligned word access to flash memory, 'hot' byte constant accesses have been encapsulated to remove non-aligned exceptions at a cost of one extra Xtensa instruction per access; the remaining byte accesses to flash use a software exception handler.

  • NodeMCU employs a single threaded event loop model (somewhat akin to Node.js), and this is supported by some task extensions to the C API that facilitate use of a callback mechanism.

Detailed Implementation Notes

We follow two broad principles (1) everything other than the Lua source directory is common, and (2) only change the Lua core when there is a compelling reason. The following sub-sections describe these issues / changes in detail.

Build variants and includes

Lua53 supports three build targets which correspond to:

  • The ESP8266 (and derivative ESP8285) architectures use the Espressif non-OS SDK and its GCC Xtensa tool-chain. The macros LUA_USE_ESP8266 and LUA_USE_ESP are defined for this target.

  • The ESP32 architecture using the Espressif IDF and its GCC Xtensa toolchain. The macros LUA_USE_ESP32 and LUA_USE_ESP are defined for this target.

  • A host architecture using the standard host C toolchain. The macro LUA_USE_HOST is defined for this target. We currently support any POSIX environment that supports the GCC toolchain and Windows builds using native MSVC, WSL, Cygwin and MinGW.

LUA_USE_HOST and LUA_USE_ESP are in effect mutually exclusive: LUA_USE_ESP is defined for a target firmware build, and LUA_USE_HOST for a build of the host-based luac.cross executable for the host environment. Note that LUA_USE_HOST also defines LUA_CROSS_COMPILER as used in Lua51.

Our Lua51 source has been recently migrated to use newlib conformant headers and C runtime calls. An example of this is that the standard SDK headers supplied by Espressif include "c_string.h" and use c_strcmp() as the string comparison function. NodeMCU source files use <string.h> and strcmp().

Caution: The NodeMCU source is compliant rather than fully conformant with the standard headers such as <string.h>; that is the current subset of these APIs used by the code will successfully compile, link and execute as an image, but code additions which attempt to use extra functions defined in the APIs might not; this at least minimises the need to change standard source code to compile and run on the ESP8266.

As with Lua51, Lua53 heavily customises the linit.c, lua.c and luac.c files because of the demands of an embedded runtime environment. Given the amount of change, these are stripped of the functionally dead code.

TString types and implementation

Lua 5.3 makes a significant modification to the treatment of strings, dividing them into two separate subtypes based on the string length (at LUAI_MAXSHORTLEN, 40 in the current implementation). This decision reflects two empirical observations based on a broad range of practical Lua applications: the longer the string, the less likely the application is to recreate it independently; and the cost of ensuring uniqueness increases linearly with the length of the string. Hence Lua53 now treats the string type differently:

  • Short Strings are stored uniquely using the strt and ROstrt string table as discussed below. Two short TStrings are identical if and only if their addresses are the same.

  • Long Strings are created and copied by reference, but are not guaranteed to be stored uniquely.

Since short strings are stored uniquely, identity comparison is based on comparing their TString addresses. For long TStrings identity comparison is a little more complex:

  • They are identical if their addresses are the same
  • They are different if their lengths are different.
  • Failing these short circuits, a full memcmp() must be carried out.
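The comparison rules above can be sketched in a few lines of C. This is a minimal illustration only: DemoString and demo_str_equal() are hypothetical names, not the VM's actual TString layout or comparison functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the VM's TString header. */
typedef struct DemoString {
  int is_short;        /* 1 = short (interned), 0 = long */
  size_t len;          /* length in bytes */
  const char *bytes;   /* string content */
} DemoString;

/* Returns 1 when a and b hold the same string value, else 0. */
static int demo_str_equal(const DemoString *a, const DemoString *b) {
  if (a == b) return 1;                      /* same address: always identical */
  if (a->is_short || b->is_short) return 0;  /* interned: distinct address means distinct value */
  if (a->len != b->len) return 0;            /* long strings: cheap length short circuit */
  return memcmp(a->bytes, b->bytes, a->len) == 0;  /* last resort: full compare */
}
```

Only the final line touches the string content, so the common cases cost a pointer compare or a length compare.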

Lua GC of both types is essentially the same, excepting that collection of long strings does not need to update the strt.

Note that for real applications, identical long strings are rarely generated other than by copy-reference, and hence in general the runtime savings exceed the small chance of storage duplication. Also note that this and other sub-typing is hidden at the Lua C API level and is handled privately inside the Lua VM implementation.

Whilst running Lua applications make heavy use of TStrings, the Lua VM itself makes little use of TStrings and typically pushes any string literals as CStrings. The Lua53 VM introduced a key cache to avoid the runtime cost of hashing the string and doing the strt lookup for this type of CString constant. The NodeMCU implementation shares this key cache with ROTable field resolution.

Short strings in the standard Lua VM are bound at runtime into the RAM-based strt string table. Our LFS implementation adds a second LFS-based readonly ROstrt string table that is created during LFS image load and then referenced on subsequent CPU restarts. The Lua VM and C API resolve each new short string first against the strt, then the ROstrt string table, before adding any unresolved strings into the strt. Hence short strings are interned across these two strt and ROstrt string tables. Thus any runtime references to strings already in the LFS ROstrt do not create additional entries in RAM. So applications are free to include dummy resource functions (such as dummy_strings.lua in lua_examples/lfs) to preload additional strings into ROstrt and thus avoid needing RAM for these. Such functions don't need to be called; simple inclusion in the LFS build is sufficient.
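The two-table resolution order can be illustrated with a small sketch. All names here (demo_strt, demo_rostrt, demo_intern) are hypothetical, and the linear scans stand in for the hashed lookups that a real string table would use.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_RAM_MAX 16

/* Hypothetical read-only table, standing in for the LFS ROstrt. */
static const char *const demo_rostrt[] = { "print", "node", "tmr" };
#define DEMO_RO_COUNT (sizeof demo_rostrt / sizeof demo_rostrt[0])

/* Hypothetical RAM table, standing in for strt. */
static const char *demo_strt[DEMO_RAM_MAX];
static size_t demo_strt_count = 0;

/* Resolve s against the RAM table, then the RO table; only add to RAM if both miss. */
static const char *demo_intern(const char *s) {
  size_t i;
  for (i = 0; i < demo_strt_count; i++)
    if (strcmp(demo_strt[i], s) == 0) return demo_strt[i];
  for (i = 0; i < DEMO_RO_COUNT; i++)
    if (strcmp(demo_rostrt[i], s) == 0) return demo_rostrt[i];  /* no RAM used */
  if (demo_strt_count == DEMO_RAM_MAX) return NULL;             /* demo table full */
  demo_strt[demo_strt_count] = s;   /* a real VM would allocate a new TString here */
  return demo_strt[demo_strt_count++];
}
```

In this sketch, interning "node" returns the read-only entry and leaves demo_strt_count unchanged, which is the RAM saving that preloading strings into the LFS image relies on.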

An LFS section below discusses further implementation details.

ROTables

The ROTables concept was introduced in eLua, with the ROTable format designed to be declared in C source and so compiled statically into the firmware at build-time, rather than taking up RAM. This essential functionality has been preserved across both Lua51 and Lua53. At an API level ROTables are handled as a table subtype within the Lua VM except that:

  • ROTables are declared statically in C code.
  • Only a subset of key and value types is supported for ROTables.
  • Attempting to write to a ROTable field will raise an error.
  • The C API provides a method to push a ROTable reference direct to the Lua stack, but other than this, the Lua API to read ROTables and Tables is the same.

We have now completely replaced the eLua implementation for Lua53, with this implementation backported to Lua51. Tables are now declared using LROT macros, with the LROT_END() macro also generating a ROTable structure, which is a variant of the standard Table header and links to the luaR_entry vector declared using the various LROT_XXXXENTRY() macros. This new implementation has some major advantages:

  • RWTables and ROTables are separate subtypes of Table, and so only minor code changes are needed within ltable.c to implement this, with the implementation now effectively hidden from the rest of the runtime and any library modules; this has enabled us to remove most of the ROTable code patches added by eLua.

  • The luaR_entry vector is a linear list, so (unlike a standard RAM Table) a ROTable has no associated hash table for fast key lookup. However we have introduced a unified ROTable key cache to provide direct access into ROTable entries with a typical hit rate over 99% (key cache misses still require a linear key scan), and so average ROTable access is only slightly slower than RAM Table access, unlike the eLua implementation which was extremely slow.

  • The ROTable structure variants are not GC collectable and so the next field is set to the marker constant ((GCObject *) 1), and (since ROTables can only refer to other RO objects) this allows the Lua GC to short-circuit GC sweeps across such RO nodes. The ROTable structure variant also drops unused fields to save space, and again this is handled internally within ltable.c.

  • As all tables have a header record that includes a valid flags field, the fasttm() optimisations now work for both ROTables and Tables.

The same Lua51 ROTables functionality and limitations also apply to Lua53 in order to minimise migration impact for C module libraries:

  • ROTables can only have string keys and a limited set of Lua value types (Numeric, Light CFunc, Light UserData, ROTable, string and Nil). In Lua 5.3 Integer and Float are now separate numeric subtypes, so LROT_INTENTRY() takes an integer value. The new LROT_FLOATENTRY() is used for non-integer values. This isn't a migration issue as the ROTables declared in our modules make almost no use of floating point constants; the only one currently used is math.PI.

    • For 5.1 builds, LROT_FLOATENTRY() is a synonym of LROT_NUMENTRY().
    • For 5.3 builds, LROT_NUMENTRY() is a synonym of LROT_INTENTRY().
  • Some ordering limitations apply: luaR_entry vectors can be unordered except for any metafields. Any entry with a key name starting with "_" must be ordered and placed at the start of the vector.

  • The LROT_BEGIN() and LROT_END() take the same three parameters. (These are ignored in the case of the LROT_BEGIN() macro, but by convention these are the same to facilitate the begin / end pairing).

    • The first field is the table name.
    • The second field is used to reference the ROTable's metatable (or NULL if it doesn't have one).
    • The third field is 0 unless the table is a metatable, in which case it is a bit mask used to define the fasttm() flags field. This must match any metafield entries for metafield lookup to work correctly.

Proto Structures

Standard Lua 5.3 contains a new peephole optimisation relating to closures: the Proto structure now contains one RW field pointing to the last closure created, and the GC adopts a lazy approach to recovering these closures. When a new closure is created, if the old one exists and the upvals are the same then it is reused instead of creating a new one. This optimises the use case where a function closure is embedded in a do loop, so the higher-cost closure creation is done once rather than n times.

This reduces runtime at the cost of RAM overhead. However for RAM limited IoTs this change introduced two major issues: first, LFS relies on Protos being read-only and this RW cache field breaks this assumption; second, closures can now exist past their lifetime, and this delays their GC. Memory constrained NodeMCU applications rely on the fact that dead closed upvals can be GCed once the closure is complete. This optimisation changes this behaviour. Not good.

Lua53 removes this optimisation for all prototypes.

Locale support

Standard Lua 5.3 introduces localisation support. NodeMCU Lua53 disables this because the IoT implementation doesn't have the appropriate OS support.

Memory Optimisations

Various Lua structures have double fields which are align(8) by default. There is no reason or performance benefit for doing align(8) on ESPs so all Lua code is compiled with the -fpack-struct=4 option.
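The effect of 4-byte packing on a structure holding a double can be seen with a small host-side sketch. It uses #pragma pack, which GCC, Clang and MSVC all accept, to mimic on a single struct what -fpack-struct=4 does globally; the struct names are hypothetical and the sizes assume a 4-byte int32_t and an 8-byte double.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler aligns the double to its natural boundary,
   which typically inserts 4 bytes of padding after tag on a 64-bit host. */
struct demo_default { int32_t tag; double value; };

/* Same fields with a 4-byte maximum alignment: no padding after tag. */
#pragma pack(push, 4)
struct demo_packed { int32_t tag; double value; };
#pragma pack(pop)
```

With the packed layout the double sits at offset 4 and the structure occupies 12 bytes; the default layout is never smaller than that.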

Lua53 also reimplements the Lua51 LCD (Lua Compact Debug) patch. This replaces the sizecode int vector giving line info with a packed byte array that is typically 15-30× smaller. See the LCD whitepaper for more information on this algorithm.

Unaligned exception avoidance

By default the GCC compiler emits a l8ui instruction to access byte fields on the ESP8266 and ESP32 Xtensa processors. This instruction will generate an unaligned fetch exception when this byte field is in Flash memory (as will accessing short fields). These exceptions are handled by emulating the instruction in software using an unaligned access handler; this allows execution to continue albeit with the runtime cost of handling the exception in software. We wish to avoid the performance hit of executing this handler for such exceptions.

lobject.h defines a new GET_BYTE_FN(name,t,wo,bo) macro. In the case of host targets this macro generates the normal field access, but in the case of Xtensa targets these macros define a static inline access function for each field. Use of these functions at the default -O2 optimisation level causes the code generator to emit a pair of l32i.n + extui instructions replacing the single l8ui instruction. This has the cost of an extra instruction execution for accessing RAM data, but also removes the 200+ clock overhead of the software exception handler in the case of flash memory accesses.
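The idea behind these access functions can be sketched in portable C: load the aligned 32-bit word containing the field, then shift and mask out the byte. This is not the actual GET_BYTE_FN expansion; DemoHeader, demo_get_byte() and the demo_getXXX macros are hypothetical, reading wo and bo as a word offset and a byte offset within that word is an assumption, and the shift assumes a little-endian target such as the Xtensa.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header whose first 32-bit word packs four byte-wide fields. */
typedef struct DemoHeader {
  uint8_t tt;
  uint8_t marked;
  uint8_t flags;
  uint8_t extra;
  uint32_t len;
} DemoHeader;

/* Fetch the aligned word at word offset wo, then extract byte bo from it.
   Assumes little-endian byte order within the word. */
static inline uint8_t demo_get_byte(const void *base, size_t wo, unsigned bo) {
  uint32_t word;
  memcpy(&word, (const uint8_t *)base + 4 * wo, sizeof word);  /* one aligned word load */
  return (uint8_t)((word >> (8 * bo)) & 0xffu);                /* shift and mask */
}

#define demo_gettt(o)     demo_get_byte((o), 0, 0)
#define demo_getmarked(o) demo_get_byte((o), 0, 1)
#define demo_getflags(o)  demo_get_byte((o), 0, 2)
```

Because the load is always a whole aligned word, the same code works whether the structure lives in RAM or in a memory region that only permits word access.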

There are 9 byte fields in the GCObject, TString, Proto, ROTable structures that can either be statically compiled as const struct into libraries or generated by the Lua cross compiler into the LFS region, and the GET_BYTE_FN macro has been used to create access macros for these fields, and read references of the form (o)->tt (for example) have been recoded using the access macro form gettt(o). There are 44 such changed access references in the source which together represent perhaps 99% of potential sources of this software exception within the Lua VM.

The access macro hasn't been used where access is guarded by a conditional that implies the field is in a RAM structure and therefore the l8ui instruction is executed correctly in hardware. Another exclusion is in modules such as lcode.c which are only used in compilation, and where the additional runtime penalty is acceptable.

A wider review of const char initialisers and -S asm output from the compiler confirms that there are few other cases of character loads of constant data, largely because inline character constants such as '@' are loaded into a register as an immediate parameter to a movi.n instruction. Ditto use of short fields.

Modulus and division operation avoidance

The Lua runtime uses the modulus (%) and divide (/) operators in a number of computations. This isn't an issue for most uses where the divisor is an integer power of 2, since the gcc optimiser substitutes a fast machine code equivalent which typically executes 1-4 inline Xtensa instructions (ditto for many constant multiplies). The compiler will also fold any used in constant expressions to avoid runtime evaluation. However the ESP Xtensa CPU doesn't implement modulus and divide operations in hardware, so these generate a call to a subroutine such as _udivsi3() which typically involves 500 instructions or so to evaluate. A couple of frequent uses have been replaced. (I have ensured that such uses are space delimited, so searching for " % " will locate these. grep -P " (%|/) (?!(2|4|8|16))" app/lua53/*.[hc] will list them off.)
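As a minimal sketch of the kind of substitution involved (the helper names are hypothetical and this is not the specific code that was changed): when a divisor is known to be a power of two but is only available at runtime, an unsigned modulus can be written as a mask so that no division routine is called.

```c
#include <stdint.h>

/* Valid only when size is a power of two: x % size == x & (size - 1). */
static inline uint32_t demo_mod_pow2(uint32_t x, uint32_t size) {
  return x & (size - 1u);
}

/* Guard that callers can use to check the power-of-two precondition. */
static inline int demo_is_pow2(uint32_t n) {
  return n != 0 && (n & (n - 1u)) == 0;
}
```

This only applies to unsigned operands; signed modulus has different semantics for negative values.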

Key cache

Standard Lua 5.3 introduced a string key cache for constant Cstring to TString lookup. In parallel NodeMCU Lua51 also introduced a lookaside cache for ROTable fields access. In practice this provides single probe access for over 99% of key hit accesses to ROTable entries.
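A lookaside cache of this kind can be sketched as a small direct-mapped table in front of the linear scan. This is an illustration under stated assumptions, not the NodeMCU implementation: the DemoEntry layout, the slot hash, the cache size and the use of interned key pointers as cache tags are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical read-only table entry; tables end with a {NULL, 0} sentinel. */
typedef struct DemoEntry { const char *key; int value; } DemoEntry;

#define DEMO_CACHE_SLOTS 32   /* must be a power of two */

typedef struct DemoCacheLine {
  const DemoEntry *table;   /* which read-only table this line refers to */
  const char *key;          /* interned key pointer used as the tag */
  size_t index;             /* position of the key within that table */
} DemoCacheLine;

static DemoCacheLine demo_cache[DEMO_CACHE_SLOTS];
static unsigned demo_scans = 0;   /* counts slow-path linear scans */

static size_t demo_slot(const DemoEntry *t, const char *key) {
  uintptr_t h = (uintptr_t)t ^ ((uintptr_t)key >> 2);
  return (size_t)(h & (DEMO_CACHE_SLOTS - 1));
}

/* Returns the entry for key, or NULL if the table has no such key. */
static const DemoEntry *demo_lookup(const DemoEntry *t, const char *key) {
  DemoCacheLine *line = &demo_cache[demo_slot(t, key)];
  size_t i;
  if (line->table == t && line->key == key)
    return &t[line->index];               /* hit: a single probe */
  demo_scans++;                           /* miss: fall back to the linear scan */
  for (i = 0; t[i].key != NULL; i++) {
    if (strcmp(t[i].key, key) == 0) {
      line->table = t; line->key = key; line->index = i;
      return &t[i];
    }
  }
  return NULL;
}
```

Repeated lookups of the same key in the same table are then served from the cache line without rescanning the entry vector.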

In Lua53 these two caching functions (for CString and ROTable key lookup) have been unified into a common Key cache to provide both caching functions with the runtime overhead of a single cache table in RAM. Folding these two lookups into a single Key cache isn't ideal, but given our limited RAM this allows the cache use to be rebalanced at runtime, reflecting the relative use of CString and ROTable key lookups.

Flash image generation and loading

The current Lua51 app/lua implementation has two variants for dumping and loading Lua bytecode: (1) ldump.c + lundump.c; (2) lflashimg.c + lflash.c. In Lua53, these have been unified into a single load / unload mechanism. However, this mechanism must facilitate sequential loading into flash storage, which is straightforward with some small changes to the standard internal ordering of the LC file format. The reason for this is that any Proto can embed other Proto definitions internally, creating a Proto hierarchy. The standard Lua dump algorithm dumps some Proto header components, then recurses into any sub-Protos before completing the wrapping Proto dump. As a result each Proto's resources get interleaved with those of its subordinate Proto hierarchy. This means that resources get written to RAM non-serially, which is bad news for writing serially to the LFS region.

The NodeMCU Lua53 dump reorders the proto hierarchy tree walk, so that resources of the lowest protos in the hierarchy are loaded first:

dump_proto(p)
  foreach subp in p
    dump_proto(subp)
  dump proto content
end

This results in any proto references now being backwards references to protos that are already loaded, and this in turn enables the Proto resources to be allocated as sequential contiguous allocation units, so the same code can be used for loading LCs into RAM and into LFS.
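The reordered walk is a post-order traversal, which the following C sketch makes concrete. DemoProto and demo_dump_proto() are hypothetical stand-ins; instead of writing bytes, the function just records the order in which protos would be emitted.

```c
#include <stddef.h>

#define DEMO_MAX_SUB 4

/* Hypothetical, minimal stand-in for a Proto: an id plus nested sub-protos. */
typedef struct DemoProto {
  int id;
  size_t nsub;
  const struct DemoProto *sub[DEMO_MAX_SUB];
} DemoProto;

/* Appends proto ids to out in the order they would be written.
   n is the number of ids already in out; the new count is returned. */
static size_t demo_dump_proto(const DemoProto *p, int *out, size_t n) {
  size_t i;
  for (i = 0; i < p->nsub; i++)
    n = demo_dump_proto(p->sub[i], out, n);  /* sub-protos first */
  out[n++] = p->id;                          /* then this proto's own content */
  return n;
}
```

For a top-level proto 1 containing proto 2, which in turn contains protos 3 and 4, the recorded order is 3, 4, 2, 1, so every reference from a parent points at something already written.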

The standard Lua 5.3 dump format embeds string constants in each proto as a len+byte string definition. NodeMCU needs to separate the collection of strings into an ROstrt for LFS loading, and this requires an extra processing pass either on dump or load. Doing a preliminary Proto scan to collect the strings used, and then dumping these as a prologue, makes the load process on the ESP a single pass and avoids any need for string resolution tables in the ESP's RAM. The extra memory resources needed for this two-pass dump aren't a material issue in a PC environment.

Changes to lundump.c facilitate the addition of LFS mode. Writing to flash uses a record oriented write-once API. Once the flash cache has been flushed when updating the LFS region, this data can be directly accessed using the memory-mapped RO flash window, so the resources are written directly to Flash without any allocation in RAM.

Both ldump.c and lundump.c are compiled into both the ESP firmware and the host-based luac cross compiler. Both the host and ESP targets use the same integer and float formats (e.g. 32 bit, 32-bit IEEE) which simplifies loading and unloading. However one complication is that the host luac.cross application might be compiled on either a 32 or 64 bit environment and must therefore accommodate either 4 or 8 byte address constants. This is not an issue with the compiled Lua format since this uses the grammatical structure of the file format to derive resource relationships, rather than offsets or pointers.

We also have a requirement to generate binary compatible absolute LFS images for linking into firmware builds. The host mode is tweaked to achieve this. In this case the write buffer function returns the correct absolute ESP addresses, which are 32-bit; this doesn't cause any execution issue in luac.cross since these addresses are never used for access within luac.cross. On 64-bit execution environments, it also repacks the Proto and other record formats on copy by discarding the top 32-bits of any address reference.

Handling embedded integers in the dump format

A typical dump contains a lot of integer fields, not only for Integer constants, but also for repeat counts and lengths. Most of these integers are small, so rather than using a fixed 4-byte field in the file stream all integers are unsigned and represented by a big-endian multi-byte encoding, 7 bits per byte, with the high-bit used as a continuation flag. This means that integers 0..127 encode in 1 byte, 128..16,383 in 2 etc. This multi-byte scheme has minimal overhead but reduces the size of typical .lc and .img files by 10%; the extra processing is minimal and costs less than reading that extra 10% of bytes from the file system.

A separate dump type is used for negative integer constants, where a negative constant x is stored as the non-negative value -(x+1). Note that endianness isn't an issue since the stream is processed byte-wise, but using big-endian simplifies the load algorithm.
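The encoding described in this section can be sketched as a pair of encode / decode helpers. The function names are hypothetical and this is not the code in ldump.c or lundump.c; it only follows the stated rules (7 bits per byte, most significant group first, high bit set on every byte except the last).

```c
#include <stddef.h>
#include <stdint.h>

/* Writes x as big-endian 7-bit groups; every byte except the last has the
   high bit set. Returns the number of bytes written (at most 5 for 32 bits). */
static size_t demo_encode_uint(uint32_t x, uint8_t *out) {
  uint8_t tmp[5];
  size_t n = 0, i;
  do { tmp[n++] = (uint8_t)(x & 0x7fu); x >>= 7; } while (x != 0);  /* low group first */
  for (i = 0; i < n; i++) {
    uint8_t b = tmp[n - 1 - i];                      /* emit the high group first */
    out[i] = (i + 1 < n) ? (uint8_t)(b | 0x80u) : b; /* flag all but the last byte */
  }
  return n;
}

/* Reads one encoded value and reports how many bytes were consumed. */
static uint32_t demo_decode_uint(const uint8_t *in, size_t *used) {
  uint32_t x = 0;
  size_t i = 0;
  uint8_t b;
  do { b = in[i++]; x = (x << 7) | (uint32_t)(b & 0x7fu); } while (b & 0x80u);
  *used = i;
  return x;
}

/* Negative constants: store the non-negative value -(x+1), invert on load. */
static uint32_t demo_neg_to_stored(int32_t x) { return (uint32_t)(-(x + 1)); }  /* needs x < 0 */
static int32_t demo_stored_to_neg(uint32_t s) { return -(int32_t)s - 1; }
```

The decoder is a simple shift-and-or loop over the byte stream, which is where the big-endian group order pays off: no length needs to be known in advance.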

Handling LFS-based strings

The dump function for an individual Proto hierarchy for loading as an .lc file follows the standard convention of embedding strings inline as a <len><byte sequence>. Any LFS image contains a dump of all of the strings used in the LFS image Protos as an "all-strings" prologue; the Protos are then dumped into the image with string references using an index into the all-strings header. This approach enables a fast one-pass algorithm for loading the LFS image; it is also a compact encoding strategy as string references typically use 1 or 2 byte integer offset in the image file.

One complication here is that the standard Lua runtime start-up adds a set of special fixed strings to the strt that are also tagged to prevent GC. This could cause problems with the LFS image if any of these constants is used in the code. To remove this conflict the LFS image loader always automatically includes these fixed strings in the ROstrt. (This also moves an extra ~2Kb string constants from RAM to Flash as a side-effect.) These fixed strings are omitted from the "all-strings" prologue, even though the code itself can still use them. The llex.c and ltm.c initialisers loop over internal char * lists to register these fixed strings. NodeMCU adds a couple of access methods to llex.c and ltm.c to enable the dump and load functions to process these lists and resolve strings against them.

Handling LFS top level functions

Lua functions in standard Lua 5.1 are represented by two variant Closure headers (for C and Lua functions). In the case of Lua functions with upvals, the internal Protos can validly be bound to multiple function instances. eLua and Lua 5.3 introduced the concept of lightweight C functions as a separate function subtype that doesn't require a Closure record. Note that a function variable in Lua exists as a TValue referencing either the C function address or a Closure record; this Closure is not the same as the CallInfo records which are chained to track the current call chain and stack usage.

Whilst lightweight C functions can be declared statically as TValues in ROTables, there isn't a corresponding mechanism for declaring a ROTable containing LFS functions. This is because a Lua function TValue can only be created at runtime by executing a CLOSURE opcode within the Lua VM. Our Lua51 implementation avoids this issue by generating a top level Lua dispatch function that does the equivalent of emitting if name == "moduleN" then return moduleN end for each entry, and this takes 4 Lua opcodes per module entry. This lookup has an O(N) cost which becomes non-trivial as N grows large, and so Lua51 has a somewhat arbitrary limit of 50 for the maximum number of modules in an LFS image.

In the Lua53 LFS implementation the undump loader appends a ROTable to the LFS region which contains a set of entries "module name" = Proto_address. These table values aren't directly accessible via Lua but the NodeMCU C function that does LFS lookup can still retrieve the required Proto address, execute the CLOSURE and return the corresponding TValue. Since this approach uses the standard table access API, it is a lot more efficient than the 4×N opcode if-chain implementation.

Garbage collection

Lua51 includes the eLua emergency GC, plus the various EGC tuning parameters that seem to be rarely used. The default setting (which most users use) is node.egc.ALWAYS which triggers a full GC before every memory allocation so the VM spends maybe 90% of its time doing full GC sweeps.

Standard Lua 5.3 has adopted the eLua EGC but without the EGC tuning parameters. (I have raised a separate GitHub issue to discuss this.) We extend the EGC with the functional equivalent of the ON_MEM_LIMIT setting with a negative parameter, that is, only trigger the EGC when less than a preset amount of free heap is left. The runtime spends far less time in the GC and code will run perhaps 5× faster. Since we will only support one EGC mode, we don't need to track this setting in G(L).
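The difference between collecting before every allocation and collecting only below a free-heap threshold can be shown with a toy allocator. Everything here is hypothetical (DemoHeap, the 4096-byte recovery figure, the function names); it models the trigger policy only, not the Lua collector.

```c
#include <stddef.h>

typedef struct DemoHeap {
  size_t free_bytes;   /* heap currently available */
  size_t gc_limit;     /* run the emergency GC only when free heap is below this */
  unsigned full_gcs;   /* number of full collections performed */
} DemoHeap;

/* Pretend collection: counts the sweep and recovers a fixed amount of heap. */
static void demo_full_gc(DemoHeap *h) {
  h->full_gcs++;
  h->free_bytes += 4096;
}

/* Returns 1 on success, 0 when out of memory even after any collection. */
static int demo_alloc(DemoHeap *h, size_t n) {
  if (h->free_bytes < h->gc_limit)   /* low-memory condition: collect first */
    demo_full_gc(h);
  if (h->free_bytes < n) return 0;
  h->free_bytes -= n;
  return 1;
}
```

With plenty of free heap, allocations proceed with no collections at all; the full sweep only starts once free memory drops under the limit.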

Panic Handling

Standard Lua includes a throw / catch framework for handling errors. (This has been slightly modified to enable yielding to work across C API calls, but this modification can be ignored for the discussion of Panic handling.) All calls to Lua execution are handled by ldo.c through one of two mechanisms:

  • All protected calls are handled via luaD_rawrunprotected() which links its C stack frame into the struct lua_longjmp chain updating the head pointer at L->errorJmp. Any luaD_throw() will longjmp up to this entry in the C stack, hence as long as there is at least one protected call in the call chain, the C call stack can be properly unrolled to the correct frame.

  • If no protected calls are on the Lua call stack, then L->errorJmp will be null and there is no established C stack level to unroll to. In this case the luaD_throw() will directly call the at_panic() handler. Since there is no valid stack frame to unroll to and execution cannot safely continue, the only safe next step is to abort, which in our case restarts the processor.
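The two paths can be modelled with setjmp / longjmp in a few lines. This is a simplified sketch of the mechanism, not the ldo.c source: the Demo names are hypothetical, and the unprotected path simply records the panic and returns where real firmware would restart the processor.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for struct lua_longjmp: one node per protected call. */
typedef struct DemoJmp {
  struct DemoJmp *previous;
  jmp_buf b;
  volatile int status;   /* written after setjmp, so it must be volatile */
} DemoJmp;

typedef struct DemoState {
  DemoJmp *errorJmp;     /* head of the protected-call chain, NULL if none */
  int panicked;          /* set when an error had no frame to unroll to */
} DemoState;

static void demo_throw(DemoState *L, int errcode) {
  if (L->errorJmp) {               /* protected: unroll to the nearest frame */
    L->errorJmp->status = errcode;
    longjmp(L->errorJmp->b, 1);
  }
  L->panicked = errcode;           /* unprotected: panic path */
}

static int demo_rawrunprotected(DemoState *L, void (*f)(DemoState *)) {
  DemoJmp lj;
  lj.status = 0;
  lj.previous = L->errorJmp;       /* link this frame into the chain */
  L->errorJmp = &lj;
  if (setjmp(lj.b) == 0)
    f(L);
  L->errorJmp = lj.previous;       /* unlink on normal return or after a throw */
  return lj.status;
}

/* Example callee that always raises error code 2. */
static void demo_failing(DemoState *L) { demo_throw(L, 2); }
```

Calling demo_failing() through demo_rawrunprotected() returns the error code to the caller, whereas calling it directly with an empty chain takes the panic path.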

Any Lua calls directly initiated through the lua.c interpreter loop or through luac.cross are protected. However NodeMCU applications can also establish C callbacks which are called directly by the SDK / event dispatcher. The current practice is that these invoke their associated Lua CB using an unprotected call and hence the only safe option is to restart the processor on error. The Lua53 changes have introduced a new luaL_pcallx() call variant as a NodeMCU extension. This new call is designed to be used within library CBs that execute Lua CB functions, and is argument compatible with lua_call(), except that in the case of caught errors it will also return a negative call status. It establishes a default error handler which is invoked at the erroring stack level to provide a stack trace.

This handler posts a task to a panic error handler (with the error string as an upval) before returning control to the invoking routine. If the Lua registry entry onerror exists and is set to a function, then the handler calls this with the error string as an argument; otherwise it calls the standard print function. This function can return false, in which case the handler exits; otherwise it restarts the processor. The application can use node.setonerror() to override the default "always restart" action if wanted (for example to write an error to a logfile or to a network syslog before restarting). Note that print returns nil and this has the effect of printing the full error traceback before restarting the processor.

Currently all (bar 1) of the cases of such Lua callbacks within the NodeMCU C modules use a simple lua_call(), with the result that any runtime error executes a panic on error and reboots the processor. By replacing these calls with luaL_pcallx(), control is always returned to the C routine, and a later posted task can report the error. Note that substituting library uses of lua_call() by luaL_pcallx() does change processing paths in the case of thrown errors. If the library CB function immediately returns control to the SDK/event scheduler after the call, then this is the correct behaviour. However, in a few cases, the routine performs post-call clean-up and this must adapt its logic depending on the return status.

The luac.cross execution environment

As with Lua51, the Lua53 host-build luac.cross executable will extend the standard functionality by adding support for LFS image compilation and will also include a Lua runtime execution environment that can be invoked with the -e option. This environment was added primarily to facilitate on-host testing (albeit with some limitations) of NodeMCU.

The make target TEST=1 also adds the Lua Test Suite to the luac.cross -e execution environment to enable this test support. Due to NodeMCU extensions some changes were required to the test suite, so app/lua53/host/tests includes the version regressed against our current build. Note that I am planning to add some variant capability to the ESP target firmware build in the future.

Enabling the test suite also disables some compiler optimisations and hence increases the size of compiled Lua files, so this test option is not enabled by default in the luac.cross make.

The test configuration has some variations from the standard suite:

  • NodeMCU lua and luac.cross do not support dynamic loading and the related dynamic loading tests are omitted.
  • The tests adopt the Lua compatibility modes implemented in our builds.
  • The standard Lua VM supports the initiation of multiple VM environments and this feature is used in some tests. Our firmware supports multiple Lua threads but only one lua_newstate() instance. So the host luac.cross make with the TEST=1 option set also supports multiple VM environments.

This execution environment also emulates LFS loading and execution using the -F option to load an LFS image before running the -e script. On POSIX environments this allocates the LFS region using a kernel extension, a page-aligned allocator, and also uses the kernel API to turn off write access to this region except during the simulated write-to-flash operations. In this way unintended writes to the LFS region throw a H/W exception in a manner parallel to the ESP environment.
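A minimal sketch of this write-protection technique on a POSIX host follows. It assumes mmap() and mprotect() as the kernel APIs; the text above does not name the calls that luac.cross actually uses, and flash_region, flash_region_init() and flash_region_write() are hypothetical names for illustration only.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for the emulated flash (LFS) region. */
typedef struct {
    unsigned char *base;   /* page-aligned start of the region */
    size_t size;           /* size rounded up to whole pages */
} flash_region;

/* Allocate a page-aligned region that is read-only from the start. */
int flash_region_init(flash_region *r, size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    size_t pg = (size_t)page;
    size_t rounded = (size + pg - 1) / pg * pg;
    void *p = mmap(NULL, rounded, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    r->base = p;
    r->size = rounded;
    return 0;
}

/* Simulated write to flash: open the write window, copy, close it again.
 * Any store outside this function faults, much as a stray write to
 * flash-mapped memory would on the ESP. */
int flash_region_write(flash_region *r, size_t off,
                       const void *src, size_t len) {
    if (off > r->size || len > r->size - off) return -1;
    if (mprotect(r->base, r->size, PROT_READ | PROT_WRITE) != 0) return -1;
    memcpy(r->base + off, src, len);
    return mprotect(r->base, r->size, PROT_READ);
}
```

Reads from r.base always succeed, while a direct store such as r.base[0] = 0 outside flash_region_write() raises SIGSEGV on the host.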

API Compatibility for NodeMCU modules

The Lua public API has largely been preserved across both Lua versions. A difference analysis of the two APIs, and in particular of the lua.h and lauxlib.h headers which contain the public API as documented in the LRM 5.1 and 5.3, shows that the differences can be grouped into the following categories:

  • NodeMCU features that we will be adding to Lua53 as part of this migration.

  • Differences (additions / removals / API changes) that we are not using in our modules and which can therefore be effectively ignored for the purposes of migration.

  • Source differences which can be encapsulated through common macros will be removed by updating module code to use this common macro set. In a very small number of cases module functionality will be recoded to employ this common API base.

Both the LRM and PiL make quite clear that the public API for C modules is as documented in lua.h and all its definitions start with lua_. This API strives for economy and orthogonality. The supplementary functions provided by the auxiliary library (auxlib) access Lua services and functions through the lua.h interface and without other reference to the internals of Lua; this is exposed through lauxlib.h and all its definitions start with luaL_.

There are significant changes to internal APIs as exposed in the other "private" headers within the Lua source directory, and so any code using these APIs may fail to work across the two versions.

One thing that this analysis has underlined is that we've been lax about how we allow our modules to be implemented. All existing modules have been modified to use only the public API. If any new or changed module requires any of the 'internal' Lua headers to compile, then it is implemented incorrectly.

Lua Language and Library Compatibility for NodeMCU Lua modules

For the immediate future we will be supporting builds based on both language variants, so Lua module writers should either:

  • Avoid using Lua 5.3 language features and implement their module in the common subset (this is currently our preferred approach);
  • Or explicitly state any language constraints and include a test for _VERSION == 'Lua 5.3' (or 'Lua 5.1') in the module startup and explicitly error if incompatible.

Other Implementation Notes

  • Use of Linker Magic. Lua51 introduced a set of linker-aware macros to allow NodeMCU C library modules to be marshalled by the GNU linker for firmware builds; Lua53 target builds maintain these to ensure cross-version compatibility. However, the lua53 luac.cross build does all library marshalling in linit.c and this removes the need to try to emulate this strategy on the diverse host toolchains that we support for compiling luac.cross.

  • Host / ESP interoperability. Our strategy is to build the firmware for the ESP target and luac.cross in the same make process. This requires the host to use a little endian ANSI floating point host architecture such as x86, AMD64 or ARM, so that the LC binary formats are compatible. This is not a material constraint in practice.

  • Emergency GC. The NodeMCU Lua VM takes a more aggressive stance than the standard Lua version on triggering a GC sweep on heap exhaustion. This is because we run in a small RAM size environment. This means that any resource allocation within the Lua API can trigger a GC sweep which can call __gc metamethods, which in turn can require the stack to be resized.

  • Enforcing LUA_CORE. Some of the Lua header files (e.g. lua.h and lauxlib.h) provide a public C API for the Lua runtime. These provide a consistent cross-version API. The remaining headers (e.g. lstring.h) are intended to be internal to the runtime implementation and these have significant differences between Lua 5.1 and 5.3; these should not be included in C library modules. Both Lua51 and Lua53 have a concept of Lua core files, and these set the LUA_CORE define. In order to enforce limited access to the 'private' internal APIs, #ifdef LUA_CORE guards have been added to all such Lua headers, effectively hiding them from application library access.
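The LUA_CORE guard in the last bullet follows a familiar preprocessor pattern, sketched below in a single file. The "header" contents are inlined and entirely hypothetical (private_helper(), PRIVATE_API_VISIBLE and private_api_visible() are illustrative names, not taken from the Lua sources); only the use of LUA_CORE as the guarding define comes from the text above.

```c
/* --- contents of a hypothetical private header, inlined for the sketch --- */
#ifdef LUA_CORE
/* Seen only by core files, which define LUA_CORE before including headers. */
int private_helper(int x);
#define PRIVATE_API_VISIBLE 1
#else
/* A library module gets no declarations at all, so any use of
 * private_helper() in it fails to compile against this header. */
#define PRIVATE_API_VISIBLE 0
#endif
/* --- end of hypothetical header --- */

/* Reports whether this translation unit was given the private API. */
int private_api_visible(void) {
    return PRIVATE_API_VISIBLE;
}
```

Compiled as an ordinary library module (no LUA_CORE), private_api_visible() returns 0; adding -DLUA_CORE to the compiler flags flips it to 1.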

Detailed changes from standard Lua 5.3 core for NodeMCU Lua53

  • The Lua type LUA_TTABLE now has subtypes LUA_TTBLRAM and LUA_TTBLROF, with handling of these subtypes following the model adopted for strings being split into short and long subtypes. In general the variant coding for table subtypes is managed as low as possible in the ltable.c routine. The new ROTable is a subset of Table that only includes the fields used in ROTables.

  • The new string cache added in Lua 5.3 is replaced by a unified key cache used for both string and ROTable entry caching. The hash algorithm is now prime multiplier based to allow the use of a modulo of 2^n to avoid the need for an expensive software modulus calculation during cache lookup.

  • The byte-field access macros getXXX(o) replace (o)->XXX read accesses for lu_byte fields in record types that could be in constant (flash-based) memory.

  • lua.c is a complete reimplementation that more closely follows the current NodeMCU 5.1 lua.c implementation. This is a result of architectural drivers arising from its context and from its being initiated within the startup sequence of the IoT embedded runtime.

    1. Processing is based on a single threaded event loop model. The Lua interactive mode processes input lines from a stdin pipe. This must be handled on a line by line basis, and other Lua tasks can interleave any multiline processing, so the standard doREPL approach doesn't work.
    2. Most OS services and environment processing are not supported, so much of the standard functionality is irrelevant and is stripped out for simplicity.
    3. stderr and stdout redirection aren't offered as an SDK service, so this is handled in the baselib print function and errors are sent to print. General error reporting on XTENSA builds is not directed to stderr, but instead the error string is posted as an upval to the C closure error reporter helper. This then calls the Lua error reporter as a separate task.
  • lapi.c and lauxlib.c implement the API and interface changes listed below.

  • ldblib.c now contains a complete debug implementation (Note that debug.debug for firmware builds is still TODO as this would require take-over of stdin).

  • lfunc.c does not implement caching of the last closure in Proto records.

  • lmathlib.c has fewer functions disabled so the math library is now pretty complete.

  • lobject.c includes a fast implementation of luaO_ceillog2() using an asm("nsau %0, %1;") instruction which is a lot faster and avoids the need for a byte lookup table.

  • The lstate.c seed algorithm is fixed rather than using randomisation (as the latter would break LFS).

  • ltest.c has its Memcontrol structure extended to include a doubly linked list. This allows H/W watchpoints to be set on dangling blocks in host gdb to work out what blocks aren't being GCed. Note that the test suite has disabled tests which aren't appropriate (e.g. dynamic loading) and that the build will compile in the Lua Test suite if LUA_ENABLE_TEST and LUA_USE_HOST are defined -- that is for test luac.cross builds only.

  • Locales support has been removed from lvm.c.

  • luaconf.h has been reworked to reflect the IoT implementation.

  • Dynamically loaded libraries aren't supported under the non-OS SDK or RTOS so this functionality has been removed from lauxlib.c, ldo.c, and loadlib.c.

  • LCD style compressed line info is implemented as standard through lcode.c and ldebug.c.

  • On XTENSA builds all ROM modules and base functions are in the ROTable ROM. The metatable of _G is _G and both __index and ROM point to this ROTable. On host luac.cross build variants ROM contains only the ROM modules, but its meta __index points to a separate baselib ROTable. This means that both variants can resolve both the base Lua functions and ROM libraries through the global environment, _G. However, the luac.cross builds can be linked using standard linker defaults without any GCC, MSVC, etc. botches.
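The luaO_ceillog2() bullet in the list above relies on a count-leading-zeros style instruction. A portable sketch of the same arithmetic is given below, on the assumption that nsau yields the number of leading zero bits of a 32-bit value (32 for zero); count_leading_zeros32() and ceil_log2() are hypothetical names, and the loop merely stands in for the single Xtensa instruction.

```c
#include <stdint.h>

/* Portable stand-in for a count-leading-zeros instruction:
 * number of zero bits above the most significant set bit, 32 for x == 0. */
unsigned count_leading_zeros32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* ceil(log2(x)) for x >= 1, with no lookup table:
 * x - 1 needs (32 - clz(x - 1)) bits, which is exactly the smallest n
 * such that 2^n >= x. The result is not meaningful for x == 0. */
int ceil_log2(uint32_t x) {
    return (int)(32u - count_leading_zeros32(x - 1u));
}
```

For example, ceil_log2(1) is 0, ceil_log2(5) is 3 and ceil_log2(1024) is 10, which matches rounding a requested table size up to the next power of two.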

Standard C API and Interface Changes

The API follows the convention of the Lua architecture that subtypes are not in general exposed to the C API so, for example, there is no concept of function vs lightweight function when testing a Lua TValue type; these are all of type "function". Ditto for access to and testing of ROTables; for access these are simply tables that are read-only.

However because ROTables can be declared inline in a Library's C module, there are some extra API calls to bind these static structures at runtime:

  • lua_pushrotable, void lua_pushrotable(lua_State *L, const ROTable *p) can be used to push a ROTable onto the Lua stack.

  • lua_createrotable, void lua_createrotable(lua_State *L, ROTable *t, const ROTable_entry *e, ROTable *mt) is used to create a ROTable header for the specified ROTable_entry vector. This is only required for linker-marshalled entry vectors as the ROTable is normally generated by the LROT macro declarations.

  • luaL_rometatable, int luaL_rometatable(lua_State *L, const char* tname, const ROTable *p) shorthand for lua_pushrotable() and luaL_newmetatable(); used to associate a ROTable metatable with the entry tname in the Lua registry.

Other NodeMCU extensions are:

  • lua_getstate, lua_State *lua_getstate(void). Returns the L0 state. Used in C module callbacks to call the Lua VM.

  • lua_freeheap, int lua_freeheap(void) returns the amount of free heap.

  • lua_gc has an extra parameter option LUA_GCSETMEMLIMIT that is used to set the EGC memory limit.

  • lua_getstrings, int lua_getstrings(lua_State *L, int opt) Debug utility to return a table of strings in the specified string table (0 = RAM, 1 = LFS ROM)

  • luaL_pcallx, int luaL_pcallx(lua_State *L, int narg, int nres). This is designed as a plug-in replacement for lua_call(L, n, m) used to call Lua CBs in the module event routines. Unlike lua_call(), this does a protected call and returns a call status. In the case of the called routine throwing an error, instead of the VM panicking and restarting, the error handler collects a full traceback and posts a separate task with this error string as an upval. The error reporter then calls the user's reporter function, which can then print or log the error, and continue or restart as required.

  • luaL_posttask, int luaL_posttask(lua_State* L, int prio) Post the task popped from the stack at the specified priority.

In order to unify the coding for C Library files for execution in both the Lua51 and Lua53 environments, these changes have also been regressed back into the Lua 5.1 code base.

@nwf

nwf commented Jul 29, 2019

Thank you for the exquisitely detailed writeup and all the hard work that's gone into it! I don't think I have anything useful to add. :)

@jmattsson

A+

That Proto change is nasty - is/was it easy to disable that optimisation? And if not, is it something we should consider pushing upstream?

You don't mention anything about the horrible memory bug we hit in 5.1 and what level of effort will be needed on 5.3 to close those holes. Is that relevant here, or not worth discussing?

On the integer vs float support side, it's not clear whether you're advocating for having a single 32<->64 setting that covers both types, or whether it's possible to choose each one individually. For example, I'd imagine 32bit integer + 64bit double would be a sweet default spot - nice native hardware ints where you can get away with it, but full double available for where you can't.

100% agree with EGC support for negative memlimit setting.

@TerryE
Author

TerryE commented Jul 29, 2019

@jmattsson

That Proto change is nasty ...

The way it works is a hint. The GC takes a lazy approach to GCing any Closure linked in this Proto field. The VM only coerces the closure if it exists and the upvals are the same. If not it creates a new closure as before. My preference would be to add a luaconfig define to make this Proto caching optional and to disable it for NodeMCU. The behaviour with closures would then be as with our Lua51, as this is a far better approach for low memory usecases such as in IoT devices.

In terms of types vs subtypes, the general principle here seems to be that applications should only work using Lua types and that subtyping is an internal optimisation used within the VM. This makes a lot of sense to me. I need to look at the conversion and arithmetic rules when operating with arithmetic subtypes and I'll get back to you on this.

This hiding of subtypes where practical makes a great deal of sense to me. Why should we be coding "if this is a Table or a ROTable" in our modules? You can also implement RO tables at an application level, so the only difference between a Table and a ROTable in use is that writing to a ROTable will throw an error anyway, so IMO applications and the C API should only be interested in whether the variable on the stack is a table or not.

Ditto and doubly so for the lightweight function type. The easiest way to unify this handling is to remove the subtype entirely. This makes the VM code and the application simpler to code and faster to execute. I could easily back this change into Lua51 but this is better done as a separate patch so you can cherry pick this back into dev-esp32.

I will raise an issue on this. Watch for the poke.

@TerryE
Author

TerryE commented Aug 14, 2019

@jmattsson @nwf, this is a working control document for me and so I keep my local copy updated with progress and decisions. I've just uploaded a current snapshot which has quite a few changes from the last iteration. Any final copy needs a proper proofing edit to remove errors and unneeded duplication.

@TerryE
Author

TerryE commented Apr 21, 2020

I've just done a major refresh to this doc to reflect the current status. We will be merging some version of this into the docs tree as a Lua53 Whitepaper.
