Perf: a new streaming API, based on msgpack
===========================================

We use a number of persistence mechanisms in our codebase, one of which is storing stringified JSON to disk. The most notable example is session persistence (as used by session restore). We use JSON here because it's a very developer-friendly format to deal with; it's just JavaScript, and being able to persist JS data structures directly makes it all the more practical. However, there are a few downsides to using JSON-to-disk as a full-blown persistence mechanism.

  1. JSON is not streamable; there is (of course) no limit to how large a blob of JSON may grow before it's persisted. When you store large datasets (think: metrics, browser session data), reading them back into memory requires a JSON.parse(blob), which blocks the main thread for O(size-of-blob)[1] time (see the sketch after this list). In other words: there is no off-main-thread, chunked deserialization mechanism for JSON.
  2. JSON is not compact; there is no built-in method to compress serialized JSON data to reduce bandwidth across network and/or process endpoints. In practice, this means that compression layers are added before data persistence (think: tar+gz before writing to disk) or data transmission (think: gzip for HTTP transmission). The downside here is that each persistence and/or transport layer needs to implement its own compression mechanism, and compression is performed only after all the data has arrived, or before all the data is sent.
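To make the first downside concrete, here is a minimal, self-contained sketch (not tied to any Gecko API) of the main-thread cost of parsing one big blob; the record count is an arbitrary stand-in for a large session file:

// Build an artificially large JSON blob and time the blocking parse.
// 1e6 records is an arbitrary stand-in for a big session file.
let records = [];
for (let i = 0; i < 1e6; i++) {
  records.push({ id: i, url: "http://example.com/" + i });
}
let blob = JSON.stringify(records);

let start = Date.now();
JSON.parse(blob);  // nothing else can run on this thread until this returns
dump("parsed " + blob.length + " bytes in " + (Date.now() - start) + " ms\n");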

It would be a step forward if browser processes could restore a session as soon as chunks of data come in, parsing and using each chunk right away without having to wait for the whole blob. The same goes for loading a huge list of add-on search results, for workers that yield partial results via postMessage(), and so on.
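To illustrate, here is what a session-restore consumer could look like on top of the API proposed below. Msgpack.Resource is the hypothetical module this proposal introduces, and restoreWindow() is a made-up helper standing in for the real session-restore machinery:

// Hypothetical: restore each window as soon as its record arrives,
// instead of waiting for one giant JSON.parse of the whole session file.
let res = Msgpack.Resource("file://sessionstore.bak");
res.on("msg", function(aWindowRecord) {
  // Each msgpack message carries one window's worth of session data.
  restoreWindow(aWindowRecord);  // made-up helper, for illustration only
});
res.read();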

At this point I'd like to introduce MessagePack (msgpack). Quoted from the website, http://msgpack.org:

MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error codes) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves.
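Those two claims follow directly from the wire format. Here is a rough sketch, written from the published msgpack spec (https://github.com/msgpack/msgpack/blob/master/spec.md), that handles only these two cases and throws for everything else:

// Integers 0..127 are a single "positive fixint" byte, and strings up
// to 31 bytes get a one-byte "fixstr" header (0xa0 | length).
function encodeSmall(value) {
  if (Number.isInteger(value) && value >= 0 && value <= 127) {
    return new Uint8Array([value]);  // positive fixint: the int *is* the byte
  }
  if (typeof value == "string") {
    let utf8 = new TextEncoder().encode(value);
    if (utf8.length <= 31) {
      let out = new Uint8Array(1 + utf8.length);
      out[0] = 0xa0 | utf8.length;   // fixstr: one-byte header
      out.set(utf8, 1);
      return out;
    }
  }
  throw new Error("only small ints and short strings in this sketch");
}

encodeSmall(5);     // [0x05]             -- one byte total
encodeSmall("ok");  // [0xa2, 0x6f, 0x6b] -- header byte + two UTF-8 bytes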

My proposal is to add a msgpack XPCOM component and possibly a helper JavaScript module with the following API:

Components.utils.import("resource://gre/modules/Msgpack.jsm");
Components.utils.import("resource://gre/modules/Task.jsm");

// Construct a new future resource.
let res = Msgpack.Resource("file://foo.baz");
res.on("msg", function(aMsg) {
  dump("Receiving a chunk of data as a JS object: " +
       typeof aMsg + "\n");
});
res.on("error", function(aError) {
  dump("There was an error whilst reading the resource: " +
       aError.message + "\n");
});
// Start reading the resource with the default options.
// I can imagine several options to be available.
res.read();

// Let's start writing. The yields below need a generator
// context, supplied here by Task.jsm:
Task.spawn(function* () {
  let isWritable = yield res.isWritable();
  if (!isWritable) {
    throw new Error("Yuck! We can't write to this resource!");
  }

  let err = yield res.write(largeJSONBlobOrByteArray);
  if (err) {
    throw err;
  }
});

The most important things to take away from this example are:

  1. The API should be entirely asynchronous.
  2. Reading a resource is done in chunks, so that each time the I/O layer has a message available, it's delivered to the consumer right away (a sketch of such an incremental decoder follows this list). This continuous stream of data allows, for example, the UI to update instantly.
  3. The MessagePack parser is written in C/C++ and exposed to JS via IDL bindings (and possibly DOM bindings in the future). The JavaScript module is a wrapper that provides a convenient API to work with.
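As a rough illustration of point 2, here is a minimal incremental decoder sketch, again limited to the fixint/fixstr subset from the encoding sketch above. A real implementation would live in the C/C++ parser; this just shows the buffering behaviour:

// Returns a feed() function: call it with byte chunks as they arrive
// from I/O; onMsg fires once per *complete* decoded message.
function makeFeed(onMsg) {
  let pending = new Uint8Array(0);
  return function feed(chunk) {
    // Append the new chunk to whatever was left over from the last call.
    let buf = new Uint8Array(pending.length + chunk.length);
    buf.set(pending, 0);
    buf.set(chunk, pending.length);
    let pos = 0;
    while (pos < buf.length) {
      let byte = buf[pos];
      if (byte <= 0x7f) {                        // positive fixint: one byte
        onMsg(byte);
        pos += 1;
      } else if (byte >= 0xa0 && byte <= 0xbf) { // fixstr: 0xa0 | length
        let len = byte & 0x1f;
        if (pos + 1 + len > buf.length) {
          break;                                 // incomplete: wait for more
        }
        onMsg(new TextDecoder().decode(buf.subarray(pos + 1, pos + 1 + len)));
        pos += 1 + len;
      } else {
        throw new Error("type not handled in this sketch");
      }
    }
    pending = buf.subarray(pos);                 // stash any partial tail
  };
}

let feed = makeFeed(function(aMsg) { dump("msg: " + aMsg + "\n"); });
feed(new Uint8Array([0x05, 0xa2, 0x6f]));  // emits 5; the string is incomplete
feed(new Uint8Array([0x6b]));              // completes and emits "ok"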

This idea is also meant to spark a discussion about how we deal with data streams that will continue to grow in size versus application responsiveness.

So, what do you think?

Mike de Boer, June 2013.

[1] time spent on parsing vs. size depends on many factors and differs per implementation.

@mikedeboer (author) commented:

Another IETF standard proposal worth reading is http://tools.ietf.org/html/draft-hallambaker-jsonbcd-00, but that one does not have any known implementations and thus has not been widely adopted.
