Self-Hosting in SpiderMonkey
Implementing part or all of a language's runtime or compiler in that very language is called self-hosting, and it isn't a new invention by any means. Examples of languages with (partly) self-hosted compilers include C/C++, Pascal and Rust. Partially self-hosted VMs exist, among others, for Lisps of all ages, Python, Ruby, Java, .NET, ActionScript 3, and JS. In fact, the V8 VM that executes JS in Chromium and Node.js has been partially self-hosted for a number of years now (maybe even from the outset, I don't know).
We do this for various reasons. Code that is implemented in managed, memory-safe languages like JS is immune to a range of common security-critical bugs, such as buffer overflows and allocation errors. That, however, only holds insofar as the VM executing that code doesn't contain bugs which allow attacks of this sort. By self-hosting parts of the VM, we shrink the amount of native code in which such bugs can hide, and so reduce the risk of that happening.
JS being a higher-level language than C++ has additional advantages, however: new features can be implemented faster, with fewer lines of code and, because of the reduced risk of security-critical bugs, they are easier and quicker to review.
And finally, we hope to make the project more approachable to new contributors. There are many different things we can/should be/are doing towards that goal, but self-hosting certainly is one of them.
But but, slow!
In a word: no. Lots of code just isn't performance-critical. Whether something takes one nanosecond or a few of them doesn't matter in a great many cases. Additionally, SpiderMonkey contains a baseline compiler and IonMonkey, two JIT compilers that can generate code that's efficient and efficient-er. IonMonkey, especially, is able to create very fast code.
That leaves those cases where every little bit of performance is important. Turns out that some of those get a lot faster once you sprinkle a bit of self-hosting onto them.
Take iterating over lists, or converting and filtering their elements: In JS, Array instances have methods for all these tasks: forEach, map, and filter or every, respectively. These methods all take a callback as an argument. They then iterate over the list and invoke that callback during each step of the iteration. When implemented natively, execution has to switch between compiled C++ code and interpreted or JITted JS code for each step in that iteration. For various reasons, such a context switch is fairly expensive, so doing it a lot is undesirable. By self-hosting the iteration, we get rid of it completely. Staying in the same execution context and within the same language allows us to have additional nice things. In the perfect case (which isn't always, but often, achievable in practice) the JIT compiler can inline the callback and create highly efficient machine code for the combined code of the looping function and the callback.
In our testing, Array.prototype.forEach can be about as fast as a hand-rolled for (var i = 0; i < length; i++) loop. That is, it can be as fast as possible within the given bounds of SpiderMonkey.
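To make the idea concrete, here is a minimal sketch of what an iteration method written in JS itself might look like. It is not SpiderMonkey's actual self-hosted source: the name selfHostedForEach is made up for illustration, and the sketch uses callback.call for brevity, which real self-hosted code has to avoid for the security reasons discussed further down.

```javascript
// Illustrative sketch only, not SpiderMonkey's real self-hosted code.
// selfHostedForEach is a hypothetical name.
function selfHostedForEach(list, callback, thisArg) {
  if (typeof callback !== "function") {
    throw new TypeError("callback must be a function");
  }
  // Read the length once and coerce it to an unsigned 32-bit integer.
  var length = list.length >>> 0;
  for (var i = 0; i < length; i++) {
    // Skip holes in sparse arrays, as Array.prototype.forEach does.
    if (i in list) {
      // Loop and callback are both plain JS, so there is no
      // C++-to-JS transition per element and a JIT may inline the callback.
      callback.call(thisArg, list[i], i, list);
    }
  }
}

var doubled = [];
selfHostedForEach([1, 2, 3], function (value) {
  doubled.push(value * 2);
});
// doubled now holds 2, 4, 6
```

Because the loop and the callback live in the same language, the engine can treat the pair as one unit of JS code to optimize.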
Smart as you are, you figured out that some of the Array extras have been converted to self-hosted implementations. That is, in fact, all the self-hosted code that made it into Firefox 21.
More exciting things are happening with that infrastructure, however: The first major project to use it is our implementation of ECMA-402, which provides thorough internationalization support for JS. Norbert Lindenberg, the editor of that spec, did our implementation, using JS for all parts where that was feasible.
The second big project that extensively uses self-hosting is the experimental ParallelArrays project. Niko Matsakis from Mozilla Research gives a thorough account of how ParallelArrays are implemented in a number of blog posts.
Easy: Stuff some JS code into a char array, interpret it during startup and hook functions from it up in the right places. Then, spend a few months and dozens of patches dwarfing the initial one to make things secure and performant. Rely on smarter colleagues for the latter part, and a short nine months later, you go to production.
All joking aside, there really isn't that much to self-hosting support.
We have to make sure that our builtins' operations can't be changed by potentially malicious user code. For example, Array.prototype.forEach takes an optional second argument that specifies the this value with which to invoke the callback given as the first argument. If we were to implement this using Function.prototype.call, user code could change forEach's behavior by replacing Function.prototype.call. Thus, all self-hosted code is interpreted in an environment that is detached from all user code. This environment is implemented as a global object that contains all the natively implemented builtins, plus some intrinsics. That is the term we're using for native functions only available to self-hosted code. For example, the internationalization code uses the ICU project's library to do much of its heavy lifting. ICU's C bindings need to be used in native code, so we have intrinsics that expose their functionality to self-hosted code.
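The problem is easy to reproduce in ordinary JS. The following sketch is an assumption-laden illustration, not how SpiderMonkey does it: naiveForEach and isolatedForEach are hypothetical names, and capturing Reflect.apply early merely stands in for the isolation that the detached self-hosting global provides.

```javascript
// Captured before any "user code" runs. This only approximates the
// detached self-hosting global described above.
var pristineApply = Reflect.apply;

// Looks up Function.prototype.call every time it invokes the callback.
function naiveForEach(list, callback, thisArg) {
  for (var i = 0; i < list.length; i++) {
    callback.call(thisArg, list[i], i, list);
  }
}

// Uses only the reference captured up front.
function isolatedForEach(list, callback, thisArg) {
  for (var i = 0; i < list.length; i++) {
    pristineApply(callback, thisArg, [list[i], i, list]);
  }
}

var naiveSeen = [];
var isolatedSeen = [];
var originalCall = Function.prototype.call;
try {
  // "User code" replaces Function.prototype.call with a no-op.
  Function.prototype.call = function () {};
  naiveForEach([1, 2, 3], function (v) { naiveSeen.push(v); });
  isolatedForEach([1, 2, 3], function (v) { isolatedSeen.push(v); });
} finally {
  // Always restore the real method.
  Function.prototype.call = originalCall;
}
// naiveSeen stays empty: the callback was never actually invoked.
// isolatedSeen holds 1, 2, 3: the tampering had no effect.
```

The naive version silently stops working once user code interferes, while the version that never touches user-visible state behaves as specified.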
Another thing we have to work around is that all builtin functions are specced as not being constructible (except for those that should, in fact, be used as constructors) and not having a prototype. Hence, self-hosted functions by default exhibit these properties and have to be marked as constructible and have a prototype assigned to them if required. This is done using the intrinsic
Skipping some details here, let's get to the last interesting aspect of our implementation: lazy function cloning. See, all self-hosted functions (and their supporting object structures) start their life in the above-mentioned self-hosting global. To actually be used, however, they have to reside in the global of the JS code that's using them. The easy thing to do would be to just copy them all over during creation of each global object. However, given that Firefox creates anywhere from a hundred to several hundred global objects right during startup (with the exact number depending on how many tabs are being restored and how many add-ons are installed), that would get boring, and memory-intensive, real fast. So what do we do instead? We create lazy functions: mostly-empty shells that contain just enough information to enable filling them with the proper code once they're needed. As a function's script contents are only needed for executing it, that happens upon first invocation of each of these functions, so if any one of them isn't ever called in a global, it won't be cloned into it.
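The shape of that mechanism can be sketched in a few lines of plain JS. Everything here is a simplified stand-in: selfHostingGlobal, makeLazyFunction, createGlobal and cloneCount are hypothetical names, and a factory call plays the role of the engine cloning a function's script into another global.

```javascript
// Counts how often a function body is actually "cloned" into a global.
var cloneCount = 0;

// Stand-in for the self-hosting global: each entry is a factory that
// produces a fresh copy of the function.
var selfHostingGlobal = {
  double: function () { return function double(x) { return x * 2; }; },
  square: function () { return function square(x) { return x * x; }; }
};

// A lazy shell knows only the name of the function it represents.
function makeLazyFunction(name) {
  var realFunction = null; // filled in on first invocation
  return function lazyShell() {
    if (realFunction === null) {
      realFunction = selfHostingGlobal[name]();
      cloneCount++;
    }
    return realFunction.apply(this, arguments);
  };
}

// Creating a "global" is cheap: it only installs shells.
function createGlobal() {
  var global = {};
  for (var name in selfHostingGlobal) {
    global[name] = makeLazyFunction(name);
  }
  return global;
}

var globalA = createGlobal();
var globalB = createGlobal();
globalA.double(21); // first call clones double into globalA
globalA.double(4);  // second call reuses the clone
// cloneCount is 1: square was never cloned, and globalB holds no clones at all.
```

Creating a hundred such globals costs a hundred sets of small shells, and the real function bodies are paid for only where they are used.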
As it turns out, this particular bit of the infrastructure is useful for other things, too: we're very close to landing a change to the parser that causes all scripts to only be parsed fully when and if they're executed. This work uses the same mechanism, and can be tracked in bug 678037.
Cool, so now what?
Glad you asked. Apart from self-hostedly implementing as much of the new functionality specified in ES6 as we can, we hope to be able to convert increasing amounts of our existing codebase, too. Obvious candidates are all the library functions that exist on our builtins. Ideally, though, we'd go beyond that and self-host parts of the core engine itself. Some day, we might have some or most of the interpreter written in JS. Or the parser.
I, for one, welcome our new self-hosting overlords.