We're working on a new Lua VM for Roblox and also introducing optional type checking to Lua (based on a combination of type inference and type annotations - the latter require extensions to the syntax). This page summarizes frequently asked questions.
Why not just use LuaJIT?
We obviously know about LuaJIT; it's a fantastic project, and really what inspired us to go down this route - it provided an existence proof that Lua can be much faster. Our primary performance target is a wide set of platforms, many of which (iOS, Xbox) don't allow JIT per se - but LuaJIT has a very fast interpreter. So - use it, we're done? Well...
LuaJIT is a large, almost complete rewrite of the Lua VM. Over the years we have made a set of changes aimed at improving sandboxing in the VM - isolating individual scripts from each other, making sure scripts can't interact with the "outside" world in uncontrolled ways, etc. These changes would need to be retargeted to an unfamiliar codebase; additionally, we'd have to do a large study of the VM's safety. Being a complete rewrite also means it would be very dangerous for us to deploy - we like to take small, measured steps towards future goals.
LuaJIT is, unfortunately, not really maintained anymore. If we went down that route, we'd effectively have to own the codebase end to end - which is fine, but it means we don't get the assumed advantage ("you can benefit from the work others are doing"); we only get to benefit from the work that already exists.
The LuaJIT interpreter is written in somewhat custom assembly per architecture, with many compile-time branches in the code; changing this code is much harder than changing average C code - and we need to be able to change it freely going forward.
LuaJIT uses NaN tagging for object storage. This is the straw that broke the camel's back. NaN tagging is problematic for us for two reasons: it prevents us from introducing optimized "native" support for a float3 type in the VM, which is key to performance in some code our users want to run fast; and it runs into issues on 64-bit platforms, specifically some AArch64 variants - so it's somewhat dangerous going forward.
The performance of calling Roblox APIs is important to us; we can't use FFI provided by LuaJIT because of the semi-dynamic structure of our APIs, and we have some extensions to the VM to accelerate calls to our APIs that are hard to backport.
Really, it boils down to this - if we started with LuaJIT as a baseline 10 years ago, we'd get to solve all of these issues over time, find interesting and exciting ways to extend LuaJIT to fit our needs better. But we started with Lua, and ended up in a place where it's better for us to continue along this path.
What performance level are you targeting?
Our current goal is to be "much faster" than our current version of Lua on a set of benchmarks, including ones that are "contained" within the VM and ones that interact heavily with Roblox APIs. The satellite goal is to get close to the LuaJIT interpreter on benchmarks contained within the VM; the hope is that with that, plus other Roblox-specific performance improvements, we'll be better off compared to a theoretical future where we had integrated LuaJIT.
Are you going to use JIT?
Maybe, on platforms that permit it. One other exciting possibility is to leverage type information. By the time we're fully done with the interpreter, we're expecting to have solid support for types - at which point we can figure out the soundness boundaries in the system and invest in optimizations that remove type checks from the internals of type-safe code and leave them at the boundaries. Together with a JIT this can be pretty powerful, especially given that these type checks can be "strong": instead of deoptimizing when hitting a type mismatch, we can abort execution, which makes for leaner code and fewer restrictions on the JIT compiler.
Of course this is an unexplored area so we may also fail to extract performance this way. Time will tell!
Are you going to open-source this?
We don't know! There are a couple of barriers to open-sourcing this work. For example, writing a decompiler is trivial when you have constantly up-to-date compiler source and full bytecode structure and documentation; our developer community will benefit from a period of time during which we can fight exploiters who implement decompilers to analyze the client-side scripts that developers are writing.
This is not off the table - but no immediate plans either. Again, time will tell.
What about GC?
We are currently using the vanilla Lua 5.1 GC with a fast small-block allocator. The never-officially-released GC design for LuaJIT 3.0 looks interesting; we'll likely experiment in this area once we're done with the interpreter itself.
We're also planning to reduce GC pressure in other ways: the aforementioned native float3 support will help a lot, since float3 is the source of a lot of generated garbage right now; we're planning to eliminate closure allocations in some cases (this has complex interactions with getfenv, but we're willing to risk some breakage if the performance gains are there); we're considering reworking upvalue handling to be more allocation-friendly; and we may or may not investigate escape analysis etc. once we have inlining fully working.
Is your interpreter written in assembly?
Nope! It's mostly portable C. This was a big concern when we started this work - there's a famous post by Mike Pall (http://lua-users.org/lists/lua-l/2011-02/msg00742.html) where he talks about the challenges of writing fast interpreters in C: a single indirect branch from the switch() dispatch, a complex CFG that makes register allocation hard, too few x86 registers for a C compiler to keep the VM state in, etc.
We were very happy to discover that on 3 out of the 4 architectures we care about (x64, ARM, AArch64), clang does an admirable job if your code is written carefully and uses features like __builtin_expect to guide code generation. We get minimal or no register spilling, very few cases where the codegen isn't what we expect, and pretty good levels of performance overall. We could push this a bit further by rewriting in assembly - with the benefit of always having a portable C implementation to fall back on if we hit issues, so we wouldn't have to write an assembly interpreter for every architecture we care about! - but we went in expecting devastating codegen, and it's actually fine. There has apparently been a lot of progress in the ~8 years since that post was written!
Now, MSVC on x64 does a worse job than clang - it's not our primary performance target, but this may push us into building at least the VM with clang. MSVC on x86 is pretty sad, with lots of register spilling and none of the C extensions we need to make the VM run fast. It's still substantially faster than Lua, but also substantially slower than the LuaJIT interpreter. We could use clang there as well, or transition to x64 for the majority of users, who run a 64-bit OS anyhow (we currently ship a 64-bit editor but a 32-bit client).
Did you have to write a new compiler as well?
Yup! We had to write our own parser because we need to support several features that need an AST representation of the source code:
- We have a set of linting passes that try to find common mistakes such as unknown globals and promote good style such as using locals when possible
- We support autocomplete that tries to understand the intent behind the untyped Lua code and convert it into "probable, if imprecise, type information"
- We are working on actual rigorous type inference and checking (with syntactic extensions for type annotations)
Since we had to write a parser anyway, instead of compiling from source to bytecode as we parse (as Lua and LuaJIT do), we implemented a more traditional AST -> bytecode compiler. This allows us to implement high-level optimizations such as "deep" constant folding across function boundaries, local function inlining, smarter register allocation, etc.
In the absence of type information, we are currently limited in the optimizations we can do because of the possible side effects of many operations - most operations in Lua can result in arbitrary Lua code being called - but once we have type information, we hope to be able to do even more high-level optimizations.