A distributed, decentralized immutable block store for dune-cache

Objective

The aim of this project is to create a system to store data in the form of immutable, content addressable blocks that can be used as a backend for the dune-cache-daemon.

The envisioned characteristics are:

  • Distributed: the storage, bandwidth and computational load can be spread over multiple machines with limited overhead: most operations run in constant time, and those that cannot are at worst logarithmic in the total number of machines.
  • Highly available: the system can be configured to serve multiple replicas of blocks to prevent data loss and to increase throughput.
  • Decentralized: no machine in the cluster needs a special "coordinator" status, making it trivial to add and remove machines, avoiding scalability bottlenecks on a master and freeing the user from the operational load of tending to specific machines (i.e. no pets, only cattle).

Jane Street-specific scenario

To accommodate Hydra's needs, we would concentrate on a simple overlay and block store implementation with the following characteristics:

  • Low number of machines (in our context, low means dozens): we can use omniscient algorithms, where every node knows the location of any block and the path to every other machine.
  • No replication: the stored information being purely a cache, the physical loss of a machine can entail the loss of the data it owns.
  • No churn: no monitoring takes place, as failing block requests are non-fatal and can be purged a posteriori in case of failure.

V1 Architecture

The proposed architecture is similar to what was successfully achieved at Infinit, except greatly simplified because we only concern ourselves with immutable blocks.

Backend

Blocks

Data blocks are stored in a global space where their address is the hash of their contents. Blocks being immutable, this address never changes and can be used to verify the block contents.
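To make the addressing scheme concrete, here is a minimal sketch in OCaml, assuming the digestif library for SHA-256; the function names are illustrative and not the actual dune-cache API.

```ocaml
(* Sketch: content-addressed blocks, assuming the digestif library.
   Names are illustrative, not an existing dune-cache interface. *)

(* The address of a block is the hex-encoded hash of its contents. *)
let address_of_block (contents : string) : string =
  Digestif.SHA256.(to_hex (digest_string contents))

(* Blocks being immutable, a reader can verify what it received by
   re-hashing it and comparing with the address it asked for. *)
let verify ~(address : string) (contents : string) : bool =
  String.equal address (address_of_block contents)
```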

Overlay network

Machines that take part in a cluster are connected together through an overlay network. This class of algorithms has been around for some time with protocols such as Chord, Kelips, Kademlia (EDonkey), BitTorrent, etc. The roles of the overlay network are (see the sketch after this list):

  • Discovery: new nodes joining the cluster via one member node can discover other member nodes and be discovered by them.
  • Routing: find a network path to contact any member node.
  • Monitoring: Detect other nodes becoming faulty.
  • Distributed hashtable: given any block address, determine which nodes possess a copy of that block.
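A hypothetical OCaml signature capturing these four roles might look as follows; it is a sketch to pin down the vocabulary, not an existing API, and the type and function names are assumptions.

```ocaml
(* Hypothetical signature for an overlay network, covering the four
   roles listed above. A sketch, not an existing dune-cache API. *)
module type OVERLAY = sig
  type node                        (* a member machine *)
  type address = string            (* hex-encoded block hash *)

  val join : seed:node -> unit     (* discovery: join via one member *)
  val route : node -> Unix.sockaddr option
                                   (* routing: how to reach a member *)
  val alive : node -> bool         (* monitoring: is the member healthy? *)
  val owners : address -> node list
                                   (* DHT: who holds a copy of this block? *)
end
```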

For the V1 release we will use a very simple overlay network and rely on a complete-graph connection between machines: every machine knows every other machine and the route to it. This is possible because of the low number of machines and the absence of NAT/firewalling between them in the case of Jane Street.

We will also use sharding for block distribution: the address space is divided equally between all the machines and block addresses are statistically evenly distributed over all nodes. This has zero performance overhead and is trivial to implement. It does not react well to churn however, as adding or removing a machine reshapes the whole space. In that case, the block store can simply drop the blocks it is no longer supposed to own, as data loss is not an issue and adding or removing machines should be a rather rare operation. Moreover, the fraction of blocks to discard is (number of new machines / total number of machines), which should be reasonable in general.
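A minimal sketch of the deterministic sharding rule, assuming a statically-known list of machines; the hex-prefix trick is one possible way to map a uniform hash onto a node index.

```ocaml
(* Sketch of deterministic sharding: the owner of a block is derived
   from a prefix of its hex address, modulo the number of machines.
   [nodes] is an illustrative, statically-known list of members. *)
let owner_of_address ~(nodes : string array) (address : string) : string =
  (* Interpret the first 8 hex digits of the hash as an integer; the
     hash being uniform, this spreads blocks evenly across machines. *)
  let prefix = int_of_string ("0x" ^ String.sub address 0 8) in
  nodes.(prefix mod Array.length nodes)
```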

Block store

Each node allocates some space to store blocks locally. These can be stored directly in the filesystem initially, but other storage media could be used later, e.g. raw block devices, which avoid filesystem overhead when storing block-like, immutable data. The block store responds to requests to store, retrieve or delete blocks, and ensures enough copies of each block are live by replicating them if the DHT notifies it that a copy has been lost.

For the V1 release, we will store the blocks compressed on the local filesystem cache and keep replication at 1: the cost of replicating blocks vastly exceeds the cost of losing blocks in the very unlikely case a machine becomes faulty.
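A sketch of what the local filesystem layout could look like, assuming a two-level prefix directory scheme to avoid huge flat directories; compression (e.g. through a zlib binding) is left out of the sketch, and all names are illustrative.

```ocaml
(* Sketch of the local block store: blocks live under [root] in a
   directory keyed by the first two hex digits of their address.
   Compression is omitted here. *)
let block_path ~root address =
  Filename.concat (Filename.concat root (String.sub address 0 2)) address

let store_block ~root ~address contents =
  let path = block_path ~root address in
  (* Create the prefix directory if needed (the root is assumed to
     exist), then write atomically via a temporary file and rename. *)
  let dir = Filename.dirname path in
  if not (Sys.file_exists dir) then Unix.mkdir dir 0o755;
  let tmp = path ^ ".tmp" in
  let oc = open_out_bin tmp in
  output_string oc contents;
  close_out oc;
  Sys.rename tmp path

let retrieve_block ~root ~address =
  let path = block_path ~root address in
  if Sys.file_exists path then (
    let ic = open_in_bin path in
    let len = in_channel_length ic in
    let contents = really_input_string ic len in
    close_in ic;
    Some contents)
  else None
```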

Frontend

The distributed cache will be available to cache clients via an HTTP API. See the Distributed store artifact transfer protocol RFC for the initial discussion. For V1 we are proposing to use a simple HTTP scheme.

For the V1 release, sharding makes it very easy to compute the address of the machine storing a given block, so frontends can connect directly to the correct block store to send their requests. As the address space is randomized, requests will be distributed evenly across all the available machines in the cluster, greatly improving throughput.
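For illustration, a frontend could route requests as in the sketch below, reusing the owner_of_address sketch from the sharding section; the host names, port and /blocks/&lt;address&gt; path are assumptions, not a fixed protocol.

```ocaml
(* Sketch of frontend routing: with deterministic sharding there is no
   lookup step, the client computes the owning machine and talks to it
   directly over HTTP. URL scheme is an illustrative assumption. *)
let block_url ~(nodes : string array) (address : string) : string =
  let host = owner_of_address ~nodes address in
  Printf.sprintf "http://%s:8080/blocks/%s" host address

(* e.g. a GET on [block_url ~nodes address] fetches the block and a PUT
   stores it; requests spread evenly because addresses are uniform. *)
```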

Additional concerns

Garbage collection

The system, per se, makes no assumptions about block contents and considers them as opaque blobs. It can however easily provide meta-information, such as the last time a block was fetched. It would be easy to crawl the cache in a client process and remove blocks that have not been accessed for, say, a week. An issue with client-side triggered GC is that it won't be aware of the specific storage constraints of each node.
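As an illustration of such a sweep, here is a sketch that walks the on-disk layout from the block store sketch above and removes blocks unused for a given number of days; it assumes the filesystem records access times (atime), which is not guaranteed on every mount.

```ocaml
(* Sketch of a client-side GC sweep: delete blocks whose last access
   is older than [max_age_days]. Relies on atime being recorded. *)
let gc_sweep ~root ~max_age_days =
  let cutoff = Unix.time () -. (float_of_int max_age_days *. 86400.) in
  Array.iter
    (fun prefix ->
      let dir = Filename.concat root prefix in
      if Sys.is_directory dir then
        Array.iter
          (fun block ->
            let path = Filename.concat dir block in
            if (Unix.stat path).Unix.st_atime < cutoff then Sys.remove path)
          (Sys.readdir dir))
    (Sys.readdir root)
```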

Access control

The recommended way to implement access control in such a system would be through cryptography. Each block can be encrypted with a symmetric cipher, and thus only be accessible with the right key. While this implies some additional computation on the client and server sides, it has a multitude of advantages:

  • Storage security: cluster machines need not concern themselves with security; they can store and transmit blocks to any member, as access control is intrinsic to the data.
  • Zero-overhead protocol: fetching a block is just a GET, with no additional authentication or access-control check. The protocol itself need not be encrypted, whereas if blobs were transmitted in the clear it would have to be, defeating the point of skipping block decryption.
  • Scalability: there is no centralized access-control server acting as a bottleneck that each node must consult before sending a block back.
  • Software simplicity and flexibility: the storage system can rely on cryptographic keys only, instead of implementing and being tied to a specific access protocol.

These cryptographic keys could be managed via existing authentication mechanisms such as Kerberos. Only clients will have to deal with that integration.
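A hypothetical signature for this client-side encryption layer is sketched below; the concrete cipher (e.g. AES-GCM via a library such as mirage-crypto) and the Kerberos integration are deliberately left abstract, and every name here is an assumption.

```ocaml
(* Hypothetical signature for client-side block encryption: the store
   only ever sees ciphertext, so storage nodes need no access-control
   logic. Cipher and key management are left abstract. *)
module type BLOCK_CRYPTO = sig
  type key

  (* Obtained by the client from its authentication system. *)
  val key_of_credentials : string -> key

  (* Encrypt before storing; the block address can be the hash of the
     ciphertext, so content addressing works unchanged. *)
  val encrypt : key -> plaintext:string -> string
  val decrypt : key -> ciphertext:string -> string option
end
```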

Pluggable modular design

The described architecture is very modular; this design enables different kinds of storage to be created, with different performance / flexibility / complexity trade-offs. With a full-blown DHT algorithm, one gets a BitTorrent-like storage that scales to thousands of machines with high churn. With an overlay network that stores block ownership in a Raft log on three masters, one can emulate the exact same architecture as etcd, for instance. With a preset number of machines and deterministic sharding, one gets a zero-overhead storage system, at the cost of not being able to easily add/remove machines.
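One way this pluggability could be expressed in OCaml is through functors over the overlay and store signatures, as in the sketch below; the BLOCK_STORE signature and the Frontend behaviour are illustrative assumptions, and OVERLAY refers to the earlier sketch.

```ocaml
(* Sketch of the pluggable design as functors: the frontend is written
   once against abstract signatures, and the overlay / block store
   implementations (full DHT, Raft-backed, static sharding) are chosen
   when the system is assembled. *)
module type BLOCK_STORE = sig
  val store : address:string -> string -> unit
  val retrieve : address:string -> string option
end

module Frontend (O : OVERLAY) (S : BLOCK_STORE) = struct
  (* Serve a block if we hold it locally, otherwise report which
     members own it; the routing policy is deliberately simplistic. *)
  let get address =
    match S.retrieve ~address with
    | Some contents -> `Found contents
    | None -> `Redirect (O.owners address)
end
```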

The design of the V1 release is compatible with these more complex DHT implementations if we decide to use them in later releases.

Questions

  • Does "one block" = "one file", or can a given file be split between several blocks living on different machines?

In V1, one block is one file. Later on we can have B-trees for large files (using something like irmin-chunk). We can also decide not to cache a file when it is too large.

  • How can we compute statistics on the stored blocks?

In V1, use a client which will have a full, structured view of everything. Note: we don't want that statistics process to change the last-access time of the blocks and thus have an impact on the GC.

In the next version we could store statistics on the blocks themselves so that each block store daemon could report its own partial stats.

  • Regarding garbage collection, will this be baked into the system or will we need a separate process that performs it?

As stated above, given the limitations of client-side triggered GC, every machine will have to perform GC independently as they might have different storage constraints. So every block store daemon will have a built-in GC which will monitor the available disk space and run an asynchronous GC pass when necessary. It will use the last access time to discard old blocks (and maybe also consider the size of blocks and the time it took to produce them).

  • When a machine is full, what happens? Does it start rejecting new storage instructions?

In V1 the GC should ensure that a node is never full. If that happens, it will force a GC slice anyway.

  • What happens for metadata which is not content-addressable?

That should be fine as we trust the builders. But we should have a way to give feedback if a collision happens (e.g. when a rule is not reproducible).

  • What about access-control?

Not in V1. But later, symmetric encryption should be OK if connected with Kerberos somehow. Jérémie needs to check with the security team what kind of scheme is acceptable, as we don't think that doing a Kerberos check on every block request would work.

  • What happens if there is an error?

Someone should be notified. Sending an email? Maybe not in V1.

  • What about Irmin?

The block store daemon is an Irmin backend and could re-use some already existing libraries (such as the HTTP API or chunking if needed). But it doesn't use the Irmin data model, as the cached data is already immutable and has a simple data model. So it will not link with the Irmin core libraries.
