Last active July 29, 2016 17:06
ideas for a modular database


We've had great success building modular database stuff on top of leveldb with node, but as I have learnt more about databases it's become apparent to me that the idea of a modular database would be better implemented at a slightly lower level.

LevelDB provides a sorted key:value store, and because of the sorted property, many things can be implemented on top of it. For example, for replication, or for consistent materialized views, we often need a write-ahead log. This can easily be implemented via a batch write to level, writing the log into a section of the leveldb key space which is treated as append only.

The problem with this is that you are writing a log on top of a merge tree, which is itself built on top of a log. Double-handling the data like this decreases maximum performance somewhat (improving software performance is like losing weight, not building muscles).

What would be better is to have access to building blocks that operate one level lower.


LogDB would, at the lowest level, expose an append-only log; this could have a subset of the levelup api.

//append is like batch, but you do not get to choose the keys,
//which would be either an incrementing sequence or a timestamp.
logdb.append(values, cb)
//query the value at a particular sequence id.
logdb.get(seq, cb)
//streaming range from the log, with lt,gt,lte,gte,reverse,limit,live options
logdb.range(opts) => ReadableStream
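To make the api above concrete, here is a minimal in-memory sketch of it. The names (append, get, range) come from the listing; everything else is an assumption — a real implementation would persist to disk and return an actual ReadableStream, whereas this one returns an array for simplicity.

```javascript
// minimal in-memory sketch of the logdb api (illustrative only)
function createLogDb () {
  var log = [] // sequence s is stored at log[s - 1]
  return {
    // like batch, but the store assigns the keys: an incrementing sequence
    append: function (values, cb) {
      var first = log.length + 1
      for (var i = 0; i < values.length; i++) log.push(values[i])
      cb(null, { first: first, last: log.length })
    },
    // query the value at a particular sequence id
    get: function (seq, cb) {
      if (seq < 1 || seq > log.length) return cb(new Error('not found: ' + seq))
      cb(null, log[seq - 1])
    },
    // simplified range: returns an array instead of a ReadableStream
    range: function (opts) {
      opts = opts || {}
      var gte = opts.gte || (opts.gt != null ? opts.gt + 1 : 1)
      var lte = opts.lte || (opts.lt != null ? opts.lt - 1 : log.length)
      var out = []
      for (var s = gte; s <= lte; s++) out.push({ seq: s, value: log[s - 1] })
      if (opts.reverse) out.reverse()
      if (opts.limit) out = out.slice(0, opts.limit)
      return out
    }
  }
}
```

Note that because the store chooses the keys, append can never conflict — that's what makes the log the natural primitive to build on.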

This could then be hooked up to views (such as a log structured merge tree, or others) to create a key:value or other style of database.

The views would take a stream (or batch) of entries, write them to their own internal data structures, and expose whatever read apis to those data structures they choose. Writes to each view would be async, and acknowledged after they have succeeded. You'd just have to keep track of what sequence number each view is currently in sync with.

Via an option, the callback to each write to the database could be delayed until all (or a set of) views had acknowledged that write, and/or when reading from a view you could ask it to wait until it is in sync with a given sequence.
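The two ideas above — each view tracking the sequence it's in sync with, and reads that wait for a given sequence — could be sketched like this. All the names here are assumptions for illustration; `reduce` stands in for whatever internal data structure the view maintains.

```javascript
// sketch of a view that tracks its sequence and supports read-after-seq
function createView (reduce) {
  var state = {}
  var since = 0     // highest sequence this view has processed
  var waiting = []  // readers blocked until the view catches up
  return {
    // apply one log entry; acknowledge once it has succeeded
    write: function (seq, value, cb) {
      state = reduce(state, value)
      since = seq
      // release any readers whose target sequence has been reached
      waiting = waiting.filter(function (w) {
        if (w.seq <= since) { w.cb(null, state); return false }
        return true
      })
      cb(null, since) // the ack tells the log what this view is in sync with
    },
    // read, optionally waiting until the view is in sync with opts.seq
    read: function (opts, cb) {
      if (!opts.seq || opts.seq <= since) return cb(null, state)
      waiting.push({ seq: opts.seq, cb: cb })
    },
    since: function () { return since }
  }
}
```

The delayed-write-callback option is just the mirror image: the log collects these acks and calls the writer back once the views it cares about have caught up.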

It wouldn't matter if the views weren't very durable, because the log would be. You'd just reinitialize the views at startup. For example, you could have a view that was purely in memory, and you'd just pipe the entire history back through it at startup. This may make startup slow, so you could serialize the view's state at various points. (Again, it wouldn't need to be after every write; it's a trade-off between startup time and the time required to perform a write.)

This is just how most databases already work anyway. It's just that this one would be built from modular components, and it would be easy to extend, and since the point of extension is close to the natural structure of the database, this extension would be pretty efficient.

One striking consequence of this design is that if the log was immutable, you could easily rebuild the database to any point in the log, much like in Datomic.


What would this log actually look like? Writes to the log could be variable sized, and contain pointers back to previous items so that it's possible to perform binary search over the log. Alternatively, the log could have fixed-size entries, giving you random access to the log, with each entry holding a pointer to the value for that write, which would be stored in another file (or files).

If the values are largish (e.g. photos) then the second method would work well. Otherwise, the first method might be better. Maybe you'd use a combination of both: put the key name in the variable-sized log, but store the value in another file. This is a trade-off and both would probably be worth implementing.

Indeed, each log design might be useful for views as well.

//the variable sized log might look like this:

[sequence (8 bytes), length of record, [pointers to previous records], value]

//the number of pointers would be a function of the sequence, to form a binary tree.
//you'd only need log(n) pointers, on average.
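One way to derive the pointers from the sequence — an assumption about the scheme the note gestures at, not something the note specifies — is a skip-list-like layout: the record at sequence s points back to s - 2^k for every power of two 2^k that divides s. Any record is then reachable from the head of the log in O(log n) hops, which is what makes binary search over the log possible.

```javascript
// back-pointers for the record at `seq` (skip-list-like scheme, assumed)
function backPointers (seq) {
  var ptrs = []
  // step through 1, 2, 4, ... while the power of two divides seq
  for (var step = 1; seq - step >= 0 && seq % step === 0; step *= 2) {
    ptrs.push(seq - step)
  }
  return ptrs
}
```

So record 8 points back to 7, 6, 4 and 0, while an odd record like 5 only points to 4 — most records carry very few pointers, and only every 2^k-th record carries k of them.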

The fixed-size log would be simpler:

// by storing the values in multiple files instead of one,
// large writes wouldn't need to block each other.
[sequence (8 bytes), file (2 bytes), offset (8 bytes), length (8 bytes)]

//since the offset and length is stored in the log file, the value file would
//just be the values concatenated!
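The layout above can be encoded directly with node's Buffer — a sketch, with the field widths taken from the listing (8 + 2 + 8 + 8 = 26 bytes per entry), so entry n always lives at byte offset 26 * n in the log file and can be fetched with a single positional read.

```javascript
// fixed-size log entry: sequence (8), file (2), offset (8), length (8)
var ENTRY_SIZE = 26

function encodeEntry (e) {
  var buf = Buffer.alloc(ENTRY_SIZE)
  buf.writeBigUInt64BE(BigInt(e.seq), 0)
  buf.writeUInt16BE(e.file, 8)
  buf.writeBigUInt64BE(BigInt(e.offset), 10)
  buf.writeBigUInt64BE(BigInt(e.length), 18)
  return buf
}

function decodeEntry (buf) {
  return {
    seq: Number(buf.readBigUInt64BE(0)),
    file: buf.readUInt16BE(8),
    offset: Number(buf.readBigUInt64BE(10)),
    length: Number(buf.readBigUInt64BE(18))
  }
}
```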


There are lots of views that you may want to implement, and this is where the potential efficiency gain would be possible. Instead of using leveldb as a one-size-fits-all foundation, you could select structures more suited to the requirements of your application. Some indexes may need to be strictly consistent (like the user who currently owns a resource) and some only eventually consistent (like a full text index).

You could also take advantage of in memory structures as views, which could be extremely fast, if the right data structure was selected.

This design would be particularly good at time-bucketed aggregations, since the input to the view would naturally be in that order.
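For instance, because entries arrive in write order, a time-bucketed view only ever touches the latest bucket — there's no random access into old buckets. A sketch, assuming each value carries a `ts` timestamp in milliseconds (my assumption, not something the note specifies):

```javascript
// time-bucketed count view; input arrives in log (i.e. time) order
function createBucketView (bucketMs) {
  var buckets = {} // bucket start time -> count
  return {
    write: function (value) {
      var b = Math.floor(value.ts / bucketMs) * bucketMs
      buckets[b] = (buckets[b] || 0) + 1
    },
    counts: function () { return buckets }
  }
}
```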

By syncing the view on read instead of on write (or in the background, async to the write) you could trade off read and write performance, which you can't do conveniently with level.
