A distributed, decentralized immutable block store for dune-cache

Objective

The aim of this project is to create a system to store data in the form of immutable, content addressable blocks that can be used as a backend for the dune-cache-daemon.

The envisioned characteristics are:

  • Distributed: the storage, bandwidth and computational load can be spread over multiple machines with limited overhead: most operations run in constant time, and those that cannot are at worst logarithmic in the total number of machines.
  • Highly available: the system can be configured to serve multiple replicas of blocks to prevent data loss and to increase throughput.
  • Decentralized: no machine in the cluster needs a special "coordinator" status, making it trivial to add and remove machines, avoiding scalability bottlenecks on a master and freeing the user from the operational load of tending to specific machines (i.e. no pets, only cattle).

Jane Street-specific scenario

To accommodate Hydra's needs, we would concentrate on a simple overlay and block store implementation with the following characteristics:

  • Low number of machines (in our context, low means dozens): we can use omniscient algorithms, where every node knows the location of any block and the path to every other machine.
  • No replication: the stored information being purely a cache, the physical loss of a machine can entail the loss of the data it owns.
  • No churn: no monitoring takes place, as failing block requests are non-fatal and can be purged a posteriori in case of failure.

V1 Architecture

The proposed architecture is similar to what was successfully achieved at Infinit, except greatly simplified because we only concern ourselves with immutable blocks.

Backend

Blocks

Data blocks are stored in a global space where their address is the hash of their contents. Blocks being immutable, this address never changes and can be used to verify the block contents.
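To make the addressing scheme concrete, here is a minimal sketch in OCaml, assuming the digestif library for SHA-256; the function names are illustrative and not the actual dune-cache API.

```ocaml
(* Sketch: content-addressed blocks, assuming the digestif library.
   Names are illustrative, not an existing dune-cache interface. *)

(* The address of a block is the hex-encoded hash of its contents. *)
let address_of_block (contents : string) : string =
  Digestif.SHA256.(to_hex (digest_string contents))

(* Blocks being immutable, a reader can verify what it received by
   re-hashing it and comparing with the address it asked for. *)
let verify ~(address : string) (contents : string) : bool =
  String.equal address (address_of_block contents)
```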

Overlay network

Machines that take part in a cluster are connected together through an overlay network. This class of algorithms has been around for some time with protocols such as Chord, Kelips, Kademlia (EDonkey), BitTorrent, etc. The roles of the overlay network are (see the sketch after this list):

  • Discovery: new nodes joining the cluster via one member node can discover other member nodes and be discovered by them.
  • Routing: find a network path to contact any member node.
  • Monitoring: Detect other nodes becoming faulty.
  • Distributed hashtable: given any block address, determine which nodes possess a copy of that block.
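A hypothetical OCaml signature capturing these four roles might look as follows; it is a sketch to pin down the vocabulary, not an existing API, and the type and function names are assumptions.

```ocaml
(* Hypothetical signature for an overlay network, covering the four
   roles listed above. A sketch, not an existing dune-cache API. *)
module type OVERLAY = sig
  type node                        (* a member machine *)
  type address = string            (* hex-encoded block hash *)

  val join : seed:node -> unit     (* discovery: join via one member *)
  val route : node -> Unix.sockaddr option
                                   (* routing: how to reach a member *)
  val alive : node -> bool         (* monitoring: is the member healthy? *)
  val owners : address -> node list
                                   (* DHT: who holds a copy of this block? *)
end
```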

For the V1 release we will use a very simple overlay network and rely on a complete-graph connection between machines: every machine knows every other machine and the route to it. This is possible because of the low number of machines and the absence of NAT/firewalling between them in the case of Jane Street.

We will also use sharding for block distribution: the address space is divided equally between all the machines and block addresses are statistically evenly distributed over all nodes. This has zero performance overhead and is trivial to implement. It does not react well to churn however, as adding or removing a machine reshapes the whole space. In that case, the block store can simply drop the blocks it is no longer supposed to own, as data loss is not an issue and adding or removing machines should be a rather rare operation. Moreover, the fraction of blocks to discard is (number of new machines / total number of machines), which should be reasonable in general.
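A minimal sketch of the deterministic sharding rule, assuming a statically-known list of machines; the hex-prefix trick is one possible way to map a uniform hash onto a node index.

```ocaml
(* Sketch of deterministic sharding: the owner of a block is derived
   from a prefix of its hex address, modulo the number of machines.
   [nodes] is an illustrative, statically-known list of members. *)
let owner_of_address ~(nodes : string array) (address : string) : string =
  (* Interpret the first 8 hex digits of the hash as an integer; the
     hash being uniform, this spreads blocks evenly across machines. *)
  let prefix = int_of_string ("0x" ^ String.sub address 0 8) in
  nodes.(prefix mod Array.length nodes)
```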

Block store

Each node allocates some space to store blocks locally. These can be stored directly in the filesystem initially, but other storage media could be used later, e.g. raw block devices, which avoid filesystem overhead when storing block-like, immutable data. The block store responds to requests to store, retrieve or delete blocks, and ensures enough copies of each block are live by replicating them if the DHT notifies it that a copy has been lost.

For the V1 release, we will store the blocks compressed on the local filesystem cache and keep replication at 1: the cost of replicating blocks vastly exceeds the cost of losing blocks in the very unlikely case a machine becomes faulty.
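A sketch of what the local filesystem layout could look like, assuming a two-level prefix directory scheme to avoid huge flat directories; compression (e.g. through a zlib binding) is left out of the sketch, and all names are illustrative.

```ocaml
(* Sketch of the local block store: blocks live under [root] in a
   directory keyed by the first two hex digits of their address.
   Compression is omitted here. *)
let block_path ~root address =
  Filename.concat (Filename.concat root (String.sub address 0 2)) address

let store_block ~root ~address contents =
  let path = block_path ~root address in
  (* Create the prefix directory if needed (the root is assumed to
     exist), then write atomically via a temporary file and rename. *)
  let dir = Filename.dirname path in
  if not (Sys.file_exists dir) then Unix.mkdir dir 0o755;
  let tmp = path ^ ".tmp" in
  let oc = open_out_bin tmp in
  output_string oc contents;
  close_out oc;
  Sys.rename tmp path

let retrieve_block ~root ~address =
  let path = block_path ~root address in
  if Sys.file_exists path then (
    let ic = open_in_bin path in
    let len = in_channel_length ic in
    let contents = really_input_string ic len in
    close_in ic;
    Some contents)
  else None
```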

Frontend

The distributed cache will be available to cache clients via an HTTP API. See the Distributed store artifact transfer protocol RFC for the initial discussion. For V1 we are proposing to use a simple HTTP scheme.

For the V1 release, sharding makes it very easy to compute the address of the machine storing a given block, so frontends can connect directly to the correct block store to send their requests. As the address space is randomized, requests will be distributed evenly across all the available machines in the cluster, greatly improving throughput.
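For illustration, a frontend could route requests as in the sketch below, reusing the owner_of_address sketch from the sharding section; the host names, port and /blocks/&lt;address&gt; path are assumptions, not a fixed protocol.

```ocaml
(* Sketch of frontend routing: with deterministic sharding there is no
   lookup step, the client computes the owning machine and talks to it
   directly over HTTP. URL scheme is an illustrative assumption. *)
let block_url ~(nodes : string array) (address : string) : string =
  let host = owner_of_address ~nodes address in
  Printf.sprintf "http://%s:8080/blocks/%s" host address

(* e.g. a GET on [block_url ~nodes address] fetches the block and a PUT
   stores it; requests spread evenly because addresses are uniform. *)
```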

Additional concerns

Garbage collection

The system, per se, makes no assumptions about block contents and considers them as opaque blobs. It can however easily provide meta-information, such as the last time a block was fetched. It would be easy to crawl the cache in a client process and remove blocks that have not been accessed for, say, a week. An issue with client-side triggered GC is that it won't be aware of the specific storage constraints of each node.
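As an illustration of such a sweep, here is a sketch that walks the on-disk layout from the block store sketch above and removes blocks unused for a given number of days; it assumes the filesystem records access times (atime), which is not guaranteed on every mount.

```ocaml
(* Sketch of a client-side GC sweep: delete blocks whose last access
   is older than [max_age_days]. Relies on atime being recorded. *)
let gc_sweep ~root ~max_age_days =
  let cutoff = Unix.time () -. (float_of_int max_age_days *. 86400.) in
  Array.iter
    (fun prefix ->
      let dir = Filename.concat root prefix in
      if Sys.is_directory dir then
        Array.iter
          (fun block ->
            let path = Filename.concat dir block in
            if (Unix.stat path).Unix.st_atime < cutoff then Sys.remove path)
          (Sys.readdir dir))
    (Sys.readdir root)
```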

Access control

The recommended way to implement access control in such a system would be through cryptography. Each block can be encrypted with a symmetric cipher, and thus only be accessible with the right key. While this implies some additional computation on the client and server sides, it has a multitude of advantages:

  • Storage security: cluster machines need not concern themselves with security; they can store and transmit blocks to any member, as access control is intrinsic to the data.
  • Zero-overhead protocol: fetching a block is just a GET, with no additional authentication or access-control check. The protocol itself need not be encrypted, whereas if blobs were transmitted in the clear it would have to be, defeating the point of skipping block decryption.
  • Scalability: there is no centralized access-control server acting as a bottleneck that each node must consult before sending a block back.
  • Software simplicity and flexibility: the storage system can rely on cryptographic keys only, instead of implementing and being tied to a specific access protocol.

These cryptographic keys could be managed via existing authentication mechanisms such as Kerberos. Only clients will have to deal with that integration.
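A hypothetical signature for this client-side encryption layer is sketched below; the concrete cipher (e.g. AES-GCM via a library such as mirage-crypto) and the Kerberos integration are deliberately left abstract, and every name here is an assumption.

```ocaml
(* Hypothetical signature for client-side block encryption: the store
   only ever sees ciphertext, so storage nodes need no access-control
   logic. Cipher and key management are left abstract. *)
module type BLOCK_CRYPTO = sig
  type key

  (* Obtained by the client from its authentication system. *)
  val key_of_credentials : string -> key

  (* Encrypt before storing; the block address can be the hash of the
     ciphertext, so content addressing works unchanged. *)
  val encrypt : key -> plaintext:string -> string
  val decrypt : key -> ciphertext:string -> string option
end
```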

Pluggable modular design

The described architecture is very modular; this design enables different kinds of storage to be created, with different performance / flexibility / complexity trade-offs. With a full-blown DHT algorithm, one gets a BitTorrent-like storage that scales to thousands of machines with high churn. With an overlay network that stores block ownership in a Raft log on three masters, one can emulate the exact same architecture as etcd, for instance. With a preset number of machines and deterministic sharding, one gets a zero-overhead storage system, at the cost of not being able to easily add/remove machines.
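One way this pluggability could be expressed in OCaml is through functors over the overlay and store signatures, as in the sketch below; the BLOCK_STORE signature and the Frontend behaviour are illustrative assumptions, and OVERLAY refers to the earlier sketch.

```ocaml
(* Sketch of the pluggable design as functors: the frontend is written
   once against abstract signatures, and the overlay / block store
   implementations (full DHT, Raft-backed, static sharding) are chosen
   when the system is assembled. *)
module type BLOCK_STORE = sig
  val store : address:string -> string -> unit
  val retrieve : address:string -> string option
end

module Frontend (O : OVERLAY) (S : BLOCK_STORE) = struct
  (* Serve a block if we hold it locally, otherwise report which
     members own it; the routing policy is deliberately simplistic. *)
  let get address =
    match S.retrieve ~address with
    | Some contents -> `Found contents
    | None -> `Redirect (O.owners address)
end
```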

The design of the V1 release is compatible with these more complex DHT implementations if we decide to use them in later releases.

Questions

  • Does "one block" = "one file", or can a given file be split between several blocks living on different machines?

In V1, one block is one file. Later on we can have B-trees for large files (using something like irmin-chunk). We can also decide not to cache a file when it is too large.

  • How can we compute statistics on the stored blocks?

In V1, use a client which will have a full, structured view of everything. Note: we don't want that statistics process to change the last-access time of the blocks and thus have an impact on the GC.

In the next version we could store statistics on the blocks themselves so that each block store daemon could report its own partial stats.

  • Regarding garbage collection, will this be baked into the system or will we need a separate process that performs it?

As stated above, given the limitations of client-side triggered GC, every machine will have to perform GC independently as they might have different storage constraints. So every block store daemon will have a built-in GC which will monitor the available disk space and run an asynchronous GC pass when necessary. It will use the last access time to discard old blocks (and maybe also consider the size of blocks and the time it took to produce them).

  • When a machine is full, what happens? Does it start rejecting new storage instructions?

In V1 the GC should ensure that a node is never full. If that happens, it will force a GC slice anyway.

  • What happens for metadata which is not content-addressable?

That should be fine as we trust the builders. But we should have a way to give feedback if a collision happens (e.g. when a rule is not reproducible).

  • What about access-control?

Not in V1. But later, symmetric encryption should be OK if connected with Kerberos somehow. Jérémie needs to check with the security team what kind of scheme is acceptable, as we don't think that doing a Kerberos check on every block request would work.

  • What happens if there is an error?

Someone should be notified. Sending an email? Maybe not in V1.

  • What about Irmin?

The block store daemon is an Irmin backend and could re-use some already existing libraries (such as the HTTP API or chunking if needed). But it doesn't use the Irmin data model, as the cached data is already immutable and has a simple data model. So it will not link with the Irmin core libraries.
