
Design notes on leyline's default replication mechanism

First things first: this replication mechanism is not mandatory for interoperability of leyline-based applications (leyline builds on https://github.com/AljoschaMeyer/leyline-core). How a program got its data is fairly irrelevant; it only matters that the data is in the correct format. So this design is just one of many possible replication mechanisms. But realistically speaking, the default mechanism will be used by the vast majority of peers, and may end up as a de-facto standard for interoperable data replication. Few people will build their own carrier-pigeon-based system.

The replication mechanism is fully decentralized and conceptually divided into five distinct layers:

  • creation of bidirectional, encrypted communication channels between two peers
  • a peer sampling service building a random-ish overlay among all leyline peers
  • a topology layer to prioritize connections between peers that want to exchange data
  • efficient exchange of leyline-core logs
  • efficient encoding of partial subscriptions based on leyline-specific information

These layers should be self-contained, so that they could be reused in different contexts. Then again, letting some assumptions leak through those layers might improve efficiency, so full genericity is not the primary goal.

Connections

The default replication mechanism works over the internet. All communication occurs directly between two peers (i.e. no routing-layer multicast) and should be bidirectional and encrypted if both peers want it (also reliable, back-pressured and heart-beatable, but those are implementation details, not fundamental properties). Additionally, congestion control is a must. Peers need to exchange network addresses, possibly of different types (e.g. ip4, ip6, ipX-behind-a-NAT, webrtc, cjdns, onion, local), so some form of multiformat is necessary. Connections can be plain, authenticated or encrypted (requiring encryption might exclude peers in jurisdictions where encryption is illegal, and a peer that insists on encryption is free to reject unencrypted connections). A sketch of such an address type follows.
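
A minimal sketch of what such a multiformat address type could look like, in Rust; the variants and payloads are illustrative assumptions, not part of any leyline spec:

```rust
// Hypothetical sketch of a multiformat address type; the variants and
// payloads are illustrative assumptions, not part of any leyline spec.
use std::net::{SocketAddrV4, SocketAddrV6};

enum MultiAddress {
    Ip4(SocketAddrV4),
    Ip6(SocketAddrV6),
    /// An address behind a NAT, reachable with hole-punching
    /// assistance from a relay.
    BehindNat { relay: Box<MultiAddress>, inner: Box<MultiAddress> },
    WebRtc(String),
    Cjdns([u8; 16]),
    Onion(String),
    Local(String),
}

/// The three connection modes mentioned above.
enum ConnectionMode {
    Plain,
    Authenticated,
    Encrypted,
}
```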

Peers are identified by a public key. Public keys also form the basis of connection establishment. Just as with ssb, if you don't have the private key corresponding to some public key, it should be impossible to find out the public key of peers trying to connect to that public key. The connection-establishing handshake might also allow the responder to request a proof of work, to defend against denial-of-service attacks.

NATs may prevent two peers from directly connecting to each other, but we assume that if A is connected to B and B to C, then A can connect to C (via hole-punching assistance from B if necessary).

Like ssb, it might make sense to support alt-nets based on a network key.

This will probably end up as tcp with STUN-style NAT traversal for most address types. That's far from optimal (I'd love to go for dccp over ipsec), but it's the most realistic/pragmatic choice.

Peer Sampling Service (PSS)

Since we can't force everyone to use the same pss protocol, there is instead a set of rpcs suitable for implementing different policies. The default implementation will be based on SPRAY. Possible primitives (sketched as an rpc enum after the list):

  • ping (for latency estimation)
  • view push
  • view push-pull
  • random walks (see here for more details than you ever wanted)
  • a joining procedure where the initial peer forwards the new peer's address via random walks
    • for peers behind a NAT, those walks can have length at most 1
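
A sketch of these primitives as an rpc enum, with purely illustrative names:

```rust
// A sketch of the pss primitives as an rpc enum; all names are
// illustrative assumptions.

/// Placeholder: a multiaddress plus gossip metadata, sketched below.
struct PeerRecord;

enum PssRpc {
    /// For latency estimation.
    Ping,
    Pong,
    /// One-way exchange: push a subset of the sender's view.
    ViewPush(Vec<PeerRecord>),
    /// Two-way exchange: the receiver answers with a subset of its own view.
    ViewPushPull(Vec<PeerRecord>),
    /// A random walk; `ttl` is decremented at each hop and the walk
    /// ends when it reaches zero.
    RandomWalk { ttl: u8, payload: Vec<u8> },
    /// Joining: the contacted peer forwards the new peer's record via
    /// random walks (of length at most 1 if the new peer is NATed).
    Join(PeerRecord),
}
```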

When sending sets (or possibly multisets) of multiaddresses, some additional information is included (see the sketch after this list):

  • age
  • trust (a nibble indicating how much the sender trusts this peer)
  • shyness (a nibble indicating how far the peer wants its address to spread)
  • arbitrary length-delimited data (to support a topology-building framework like VICINITY)
    • sign this?
    • cap the maximum amount of data, and design higher layers so that they can omit data to fit into the maximum
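
A sketch of such a gossiped record; packing the two nibbles into a single byte is an assumption for illustration:

```rust
// A sketch of the per-address gossip record; packing the two nibbles
// into a single byte is an assumption for illustration.
struct PeerRecord {
    /// A serialized multiaddress.
    address: Vec<u8>,
    /// Rounds since this record entered the gossip.
    age: u16,
    /// High nibble: trust, low nibble: shyness.
    trust_and_shyness: u8,
    /// Arbitrary length-delimited data (e.g. a VICINITY profile),
    /// capped at some maximum size.
    data: Vec<u8>,
}

impl PeerRecord {
    fn trust(&self) -> u8 { self.trust_and_shyness >> 4 }
    fn shyness(&self) -> u8 { self.trust_and_shyness & 0x0f }
}
```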

The pss is not secure against adversaries. No currently known secure pss is suitable for leyline (PuppetCast relies on centralized, trusted authorities, Brahms is defeated by sybil attacks and we don't want to attach proof-of-work to log creation). Instead, leyline will rely on trust, favoring connections to well-known peers. The trust nibble that is gossiped also allows estimating the threat posed by unknown peers. The shyness nibble is nothing but a friendly request not to spread a peer's address to peers that don't match a certain trust threshold. Malicious peers can of course ignore this (and even tamper with the trust nibble), so it provides no hard privacy guarantees whatsoever.

Adversaries can manipulate the self-adaptive view sizes of a SPRAY-like pss, so implementations should gracefully deal with this (a minimal sketch follows the list) by

  • setting a maximum view size (which they should do anyways since computational resources are finite)
  • setting a minimum view size (depending on resource constraints and an educated guess of the network size)
  • (and optionally:) generously maintaining larger views than absolutely necessary if they can spare the resources
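
A minimal sketch of that clamping, with MIN_VIEW and MAX_VIEW as deployment-specific assumptions:

```rust
// A minimal sketch of defensively clamping the adaptive view size;
// MIN_VIEW and MAX_VIEW are deployment-specific assumptions.
const MIN_VIEW: usize = 8;
const MAX_VIEW: usize = 128;

fn effective_view_size(adaptive: usize, spare_capacity: usize) -> usize {
    // Optionally be generous: grow the view by whatever can be spared.
    (adaptive + spare_capacity).clamp(MIN_VIEW, MAX_VIEW)
}
```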

Heterogeneous addresses, and some peers only supporting connection establishment to a subset of those, can negatively affect the randomness of the provided peer samples. Leyline mostly shrugs and accepts this. The trust-based bias in connection establishment already skews the randomness anyways. Topology construction and replication benefit from higher-quality randomness, but we expect the system to still work reasonably well (see the next section for details and a possible workaround).

A note on bootstrapping and network partitions: Since leyline wants to avoid central authorities, it doesn't come with bootstrapping servers. Peers are expected to somehow compile a list of addresses they can use to (re-)join the swarm. Even connected peers can occasionally contact these well-known entry points, allowing the swarm to heal from network partitions. Since leyline aims at eventually-consistent replication, temporary network partitions are not a problem (in fact, offline usage can be seen as a network partition). This also greatly reduces the impact of malicious nodes attacking at the pss layer. Combined with the web-of-trust, we expect the system to be sufficiently robust against attacks. Most users probably know and trust some other users personally, so the system should still be able to run in a reduced capacity under large-scale attacks, by restricting connections to only highly-trusted peers.

Some more half-baked thoughts:

  • what about relaying messages over a chain of peers if a direct connection cannot be established?
    • relevant when NAT-traversal fails or if two peers don't support any common connection type
    • also, connection establishment is fairly expensive (four-way handshake, potentially NAT traversal), so relaying might be more efficient in some cases
  • do we always send all multiaddresses, even if we don't support them?
  • caps on message sizes
  • what about connections that are only used for data replication but not for view shuffling? Those are resilient against attacks on the pss layer while still allowing data exchange with untrusted peers.
  • evicting a peer from the view does not mean that the connection needs to be terminated immediately (related to above point in that this results in non-pss connections)
  • active view and passive view (à la HyParView)?
  • peers may choose to not cooperate with certain (blocked) ids, how does this interact with adaptive view sizes? Should the pss be aware of this as a first-class concept? Might as well respect that choice since we can't enforce uniform handling of all ids anyways.

Topology and choice of replication peers

The pss provides more-or-less random peers; the topology layer is responsible for prioritizing those that we can replicate with. This will use VICINITY. We can't prescribe a single SELECT function, but we can reasonably assume that some loose transitivity holds, since every peer will look for replication partners.

Analogously to the pss layer, there are some decisions to be made regarding the handling of connections that were evicted from a view, peers with whom to replicate data but that are ignored for topology building, blocking, and the impact of malicious peers (who can falsify their own profiles as well as other peers' profiles; the latter could be prevented by signing profiles).

The target topology will probably emerge from all peers trying to find other peers interested in the same data. The "value" of a new connection thus depends on the existing neighbourhood of each peer (see the sketch below). It might be worthwhile not to be too greedy with this, as greediness might conflict with transitivity.
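
A hedged sketch of what such a neighbourhood-aware SELECT heuristic could look like, abstracting subscriptions as integers (all names are assumptions): score a candidate by shared interests, discounting interests that existing neighbours already cover.

```rust
// A hedged sketch of a neighbourhood-aware SELECT heuristic,
// abstracting subscriptions as integers; all names are assumptions.
use std::collections::HashSet;

fn candidate_score(
    our_interests: &HashSet<u64>,
    candidate: &HashSet<u64>,
    neighbours: &[HashSet<u64>],
) -> f64 {
    our_interests
        .iter()
        .copied()
        .filter(|i| candidate.contains(i))
        .map(|i| {
            // Diminishing returns: an interest already covered by k
            // neighbours only contributes 1 / (k + 1).
            let k = neighbours.iter().filter(|n| n.contains(&i)).count();
            1.0 / (k as f64 + 1.0)
        })
        .sum()
}
```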

A structured overlay like RINGCAST is probably not feasible (although attempting to build one certainly won't hurt anyone).

Over this topology, peers can then replicate data. It would be desirable if peers didn't need to relay log entries they aren't interested in themselves, but that increases the risk of partitions where two groups of peers are interested in the same logs, but can't replicate because they are connected only via peers not interested in those logs. Due to the randomness of the pss and VICINITY, such partitions become connected eventually, but an explicit mechanism to bridge these partitions could speed that process up. In a setting where most peers only interact with highly trusted peers, this problem becomes more pronounced, as the randomness of new connections is very low. Any mechanism that bridges those "interest partitions" should respect the shyness parameter of the pss. There's also the question of whether those partitions should be bridged by relaying replication data or by creating direct connections.

The actual replication can use the same hybrid of eager push and lazy push as employed by PLUMTREE to achieve a good tradeoff between reliability, bandwidth usage and latency.
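
A sketch of the message types such a PLUMTREE-style hybrid needs; names are assumptions:

```rust
// A sketch of the message types a PLUMTREE-style hybrid needs;
// names are assumptions.
enum GossipMessage {
    /// Eager push: the full entry, sent along tree edges.
    Entry(Vec<u8>),
    /// Lazy push: only a hash, sent along the remaining edges.
    IHave([u8; 32]),
    /// Sent in response to an IHave for an entry we are missing,
    /// promoting that edge back into the tree.
    Graft([u8; 32]),
    /// Sent when a duplicate entry arrives, demoting the edge to lazy.
    Prune,
}
```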

Possibly part of this abstraction layer: searching for peers that replicate (certain parts of) a certain log, and searching for something by its hash. Although I think the latter (and maybe also the former) are actually different problems from topology construction and should thus be considered independently. Then again, it could be useful to indicate in the VICINITY profile that you want to find some log with higher priority (e.g. because the application layer just requested the target of a specific cypherlink).

Replicating leyline-core entries

Here are a few different, sensible replication requests, some of which are special cases of the others (sketched as an enum after the list):

  • request an entry by sequence number
  • request a range of entries delimited by a start and an end sequence number
  • request all entries newer than a sequence number
  • request the newest n entries
  • all of the above while also requesting the corresponding certificate pools (from now on we'll generally assume that is the usual case)
  • all of the above but replacing sequence numbers with hashes
  • unions of any of the above
  • request a full log
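
These request shapes could be encoded along the following lines; `Id` stands for either a sequence number or an entry hash, covering the hash-based variants, and all names are assumptions:

```rust
// The request shapes above as an enum sketch; `Id` is either a
// sequence number or an entry hash. All names are assumptions.
enum LogRequest<Id> {
    Single(Id),
    Range { start: Id, end: Id },
    NewerThan(Id),
    NewestN(u64),
    Union(Vec<LogRequest<Id>>),
    FullLog,
}

/// Whether the corresponding certificate pools are wanted too
/// (assumed above to be the usual case).
struct Request<Id> {
    shape: LogRequest<Id>,
    with_certificate_pools: bool,
}
```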

When a request is for more than one entry, the server should not send the same entry multiple times, even if it occurs in, for example, multiple certificate pools. Open-ended requests can send new entries over time as they become available to the server. Avoiding duplicates for those also means tracking some state across time, but that state can hopefully be encoded compactly (see the next paragraph).

When requesting entries from a feed of which some entries are already available, it would be nice to compactly encode that information and attach it to the request, so that no superfluous data is sent. In general it is not necessary to encode all available entries; it is sufficient to encode all (or most) entries that are part of the certificate pools of the requested data. This can be done by transmitting numbers and specifying what kind of function to apply to those numbers to yield a set of sequence numbers that do not need to be transmitted. Those functions would be very similar to the computation of certificate pools in the first place. I haven't designed a concrete set of functions to use, but I'm confident that this approach works and allows describing large feed subsets in very little space (which is possible because we don't need to encode arbitrary subsets of the numbers between 0 and 2^64, but rather subsequences of certificate pools, of which there are far fewer).
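
To illustrate the idea (not a concrete design), the encoding could look roughly like this, with `CertPoolOf` as a hypothetical stand-in for the yet-to-be-designed functions:

```rust
// Purely illustrative: the client sends small descriptors, the server
// expands them into sequence numbers it may omit. `CertPoolOf` is a
// hypothetical stand-in for the yet-to-be-designed functions.
enum HaveSet {
    /// "I have the whole certificate pool of entry n."
    CertPoolOf(u64),
    /// "I have exactly these entries." (kept short)
    Explicit(Vec<u64>),
    Union(Vec<HaveSet>),
}

/// Assumed helper mirroring leyline-core's certificate pool computation.
fn certificate_pool(_n: u64) -> Vec<u64> {
    unimplemented!()
}

fn expand(have: &HaveSet) -> Vec<u64> {
    match have {
        HaveSet::CertPoolOf(n) => certificate_pool(*n),
        HaveSet::Explicit(v) => v.clone(),
        HaveSet::Union(hs) => hs.iter().flat_map(expand).collect(),
    }
}
```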

Aside from not transmitting entries in the first place, there's also the option of transmitting only parts of the entries and letting the client reconstruct them. This is possible to different degrees for different requests; some opportunities (a sketch of a corresponding flag set follows the list):

  • if the client knows the sequence numbers it will receive, they can be completely omitted
  • backlinks to entries the client already stores (or that have been sent previously as part of the answer to the request) can be omitted
    • this is another reason why it is valuable for a client to compactly tell the server about entries it already has
  • payload hashes can be omitted if the payload itself is also transferred
  • since most (or even all) entries should begin with a zero tag, the tag can frequently be omitted as well
  • signatures can never be omitted
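
A sketch of the omission negotiation as a simple flag set; field names are assumptions:

```rust
// A sketch of the omission negotiation as a simple flag set;
// field names are assumptions.
#[derive(Default)]
struct AcceptedOmissions {
    /// Sequence numbers the client can derive from the request shape.
    seqnums: bool,
    /// Backlinks to entries already stored or previously sent.
    known_backlinks: bool,
    /// Payload hashes, when the payload itself is transferred anyway.
    payload_hashes: bool,
    /// The (usually zero) tag.
    zero_tags: bool,
    // No flag for signatures: they can never be omitted.
}
```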

Some of these omissions require more computational effort to reconstruct than others, so clients should be able to specify which omissions they accept.

It might make sense to distinguish between sending verified and unverified entries. Somewhat related: peers might want to track the quality of service provided by other peers (amount of data sent, latencies, amount of invalid data). They could use this information to deprioritize connections, and the topology layer and the pss could even be extended with this information to propagate knowledge about low-quality peers. Then again, this discriminates against resource-constrained peers. Generosity and solidarity may be better reactions than shunning and sanctioning.

Leyline-specific replication

Beyond the sequence number based replication of leyline-core, leyline might provide additional information (comparable to e.g. the "type" or "timestamp" attributes of ssb messages). These will be chosen in a way that leads to compact representation of content-based partial requests (like "All entries of type foo"). It should be possible to combine them with the leyline-core based requests (e.g. "The 20 newest entries of type foo").

Some sketches of possible things that could be used for filtering (a code sketch of the type-based filter follows the list):

  • claimed timestamp (not sure whether timestamps should make it into leyline, and if they do, they'll be optional)
  • type (conceptually, the type could be a list of integers (remember that the set of utf8 strings is a subset of the set of integers), e.g. ["githubclone", "issue", "comment"], and you could request "Every type whose first entry is 'githubclone'", or "Every type whose first entry is 'githubclone' and whose second entry is 'issue'", and so on. Or even "Every type whose first entry is lexicographically sorted between 'github' and 'gitfoo'".)
  • topics (unique identifiers, each message could specify any number of topics it belongs to)
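
A sketch of the type-based filtering, modelling type components as u64 for simplicity (per the note above, utf8 strings can be encoded as integers); all names are assumptions:

```rust
// A sketch of type-based filtering, modelling type components as u64
// for simplicity; all names are assumptions.
enum Component {
    Exact(u64),
    /// Matches any component in the (inclusive) lexicographic range.
    Between(u64, u64),
}

/// "Every type whose first entry is 'githubclone'" is a one-element
/// filter; appending another `Exact` narrows it to e.g. issues.
fn matches(filter: &[Component], ty: &[u64]) -> bool {
    filter.len() <= ty.len()
        && filter.iter().zip(ty).all(|(c, v)| match c {
            Component::Exact(x) => x == v,
            Component::Between(lo, hi) => lo <= v && v <= hi,
        })
}
```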

More abstractly, partial requests could be viewed as (unions of) ranges over a totally ordered (and possibly dense) universe. Note that this is totally stolen from SUB-2-SUB and it might even make sense to reuse its topology construction.

With these ranges over some universe, it might be possible that a peer can provide entries from a subrange but not the full range. Depending on how leyline defines the universe, this might lead to tricky situations where the client doesn't know whether the subranges it got add up to the full range. This knowledge, however, is necessary to yield the data to the application layer. This might mean that peers should only answer a range request if they indeed have all the data in that range. On the other hand, it feels unintuitive to not send data the client wants.

Another relevant aspect here: is it possible to tell a peer that you request some range but already have all the data from a certain subrange, so that this data isn't sent redundantly? In some cases this might be achieved by splitting the range into multiple subranges that do not overlap the subranges already available. Defining the universe in a way that makes this always possible could be difficult, and it also results in larger overhead in the encoding of this information. It seems like a lot of these decisions will come down to finding a compromise between sufficiently expressive filter criteria and efficiently encodable filter criteria. This will be fun.

This kind of information will not only be used for requests, but will also make up large parts of the VICINITY profiles. Since those are of limited size, a peer likely cannot include encodings of all its subscriptions. Instead, the peer will need to decide on subsets (whether heuristically or randomly is of course up to the implementation and can't be prescribed). A peer could even initiate replication based on a VICINITY profile, saving round trips (though this violates the layering of these abstractions). It might be advantageous for both the peer and the overall network topology to send different such profiles to different peers.

Another interesting decision regarding the vicinity profile is whether to include the newest sequence number matching the subscription that the peer already has (note that this is another thing that becomes more complicated if partial responses are allowed). This could indicate to other peers whether they would actually benefit from exchanging information. If you receive a profile that indicates that the peer has newer entries for something you are interested in, that's a strong incentive to start replicating with that peer. If the profile is at the same state as your own data, then replication becomes uninteresting (but still the overall topology quality likely increases if you keep the connection). If the peer is not up to date, there's little incentive for a selfish peer to start replicating, but replication is clearly beneficial from a global point of view (and again, the quality of the topology rises). So in some sense, there's no clear value in adding the newest sequence data, but on the other hand it's very cheap and it might prove useful to resource-constrained peers.

Leyline-core supports partial log subscription (unlike ssb), so it makes sense that this section explores how to leverage this. But in some sense, partial subscriptions are rather selfish: instead of donating your resources to replicate a fellow human's data, you only replicate the bare minimum you need for yourself. Then again, for resource-constrained peers this might be the only sensible option for participating in the network. Full subscription can be seen and implemented as a special case of partial subscription, but there are actually some mechanisms that can make full subscriptions more efficient than partial subscriptions, namely data structures for approximate membership queries (AMQs) such as bloom filters or quotient filters. All fully subscribed logs can be put into an AMQ and then sent to a peer, which is much more efficient than an uncompressed encoding (this is especially useful for space-constrained VICINITY profiles). The peer can then respond with the newest sequence number for all feeds in the AMQ that it is interested in. This results in negligible cost for false positives.
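
A minimal bloom filter sketch for such an AMQ; the sizing and the double-hashing scheme are assumptions:

```rust
// A minimal bloom filter sketch for the fully-subscribed-logs AMQ;
// the sizing and the double-hashing scheme are assumptions.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Bloom {
    bits: Vec<u64>, // m bits, packed into words
    k: u32,         // number of hash functions
}

impl Bloom {
    fn new(m_bits: usize, k: u32) -> Self {
        Bloom { bits: vec![0; (m_bits + 63) / 64], k }
    }

    /// Kirsch-Mitzenmacher double hashing: index i is h1 + i * h2.
    fn indices(&self, log_id: &[u8]) -> Vec<usize> {
        let mut h = DefaultHasher::new();
        log_id.hash(&mut h);
        let h1 = h.finish();
        0u64.hash(&mut h);
        let h2 = h.finish() | 1;
        let m = self.bits.len() as u64 * 64;
        (0..self.k as u64)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % m) as usize)
            .collect()
    }

    fn insert(&mut self, log_id: &[u8]) {
        for i in self.indices(log_id) {
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }

    fn contains(&self, log_id: &[u8]) -> bool {
        self.indices(log_id)
            .iter()
            .all(|&i| self.bits[i / 64] & (1u64 << (i % 64)) != 0)
    }
}
```

A peer would insert the ids of all its fully subscribed logs and send the packed bits; m_bits and k tune the false positive rate.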

AMQs are not suitable for range queries, since the receiver can't possibly check for all subranges of the ranges it serves whether they are in the AMQ. Even for range queries though, AMQs can still be used to trade bandwidth for latency: first send an AMQ containing all relevant logs, wait for a response indicating which of those logs the other peer has, then send only the range queries regarding those logs.

If there exists a probabilistic, space-saving set representation that preserves the hierarchical containment relation of ranges over a universe, then this would be useful for leyline. Unfortunately I'm not aware of any such data structure.


Addendum: Range subscriptions as described above have a problem: the client can't know whether any elements inside the range are missing. To fix this, every entry needs to include the sequence number of the previous element of the range. That severely restricts how much can be done (ranges need to be known in advance), although nested types with a prevnum per layer sound reasonably expressive while still fairly efficient (linear space overhead in the depth, but with a good constant; a sketch follows). I'm unsure whether the idea of topics can be salvaged (on the plus side, I haven't been sure whether those were a good idea in the first place). Funnily enough, mandatory monotonic timestamps don't suffer from this problem, since they are isomorphic to sequence numbers with respect to their total orders.
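
A sketch of the per-layer prevnum idea; names are assumptions:

```rust
// A sketch of "a prevnum per layer": each type component carries the
// sequence number of the previous entry matching the type prefix up to
// and including that component, so a subscriber to any prefix can
// detect gaps. Names are assumptions.
struct TypePath {
    /// E.g. [(githubclone, 41), (issue, 38), (comment, 12)], where the
    /// second element of each pair is the previous matching seqnum.
    components: Vec<(u64, u64)>,
}
```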
