@mmalex (created December 11, 2013 19:45)
the redis feedback thread has been interesting, and I share many of antirez's 'tastes' when it comes to wanting to create pragmatic software that is 'robust enough' for many use cases. OTOH I do think that redis is in an interesting place regarding how its users view (and potentially misinterpret) its data 'promises', especially in cluster.
I just wanted to give an example in case it is a useful datapoint in the spectrum being discussed for redis (pure memory store vs evolving into a disk cluster that might or might not make CP claims...)
TL;DR difference from redis cluster: I basically wanted not to have the distinction of master vs slave, and all the complexity that brings (having to do elections, split brain, etc) and instead build everything around merging between symmetrical peers - ie every node is equal, and they merge their differences on (re)connection. I think it's a really powerful paradigm, and not one that I've seen antirez talk much about with respect to redis. if you're interested in more detail, read on.
--
the system I'm describing was used to ship LittleBigPlanet 2 (and at the same time replace LBP 1) servers, and is still in use; it runs over ~20 large amazon instances and has a db size of maybe 100 gigs plus a few terabytes of binary blob data stored elsewhere. it's an in-memory data-structure store that shares a lot with early redis, but clustered; I started it before redis was public, so that wasn't an option at the time.
basics: it's a big hash table style K-V store, sharded like redis cluster based on bucketing on hash of K (tho I have only 64 buckets, not 16k). nodes serve arbitrary (sysop-configured) subsets of the 64 buckets. keys are strings, values are dictionaries, I think redis calls them sets, {"a":2,"b":3, etc}; these are often 1 element big, but some are large (~50m entries). different from redis, the system can build & maintain, lazily, multiple 'secondary indices' aka sort orders over these sets, so you can sort by key, value, and even some more complex things that are not relevant here.
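a rough sketch of that bucketing scheme (python; the hash choice and names are illustrative, not the actual server code):

```python
# bucketing keys into 64 buckets by hash; a node serves a configured subset.
# md5 and the names here are placeholder choices, not from the real system.
import hashlib

NUM_BUCKETS = 64  # 64 buckets here, vs 16k hash slots in redis cluster

def bucket_of(key: str) -> int:
    h = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % NUM_BUCKETS

# a node's (sysop-configured) bucket subset, as a plain set for illustration
node_buckets = {0, 1, 2, 3, 17, 42}

def node_serves(key: str) -> bool:
    return bucket_of(key) in node_buckets
```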
like redis, each node uses an append-only file to log local changes (not EBS, just local disk) with periodic complete memory dump checkpoints (using the fork trick) which are written to EBS & eventually S3. the log is split at each checkpoint so you can reboot from the last checkpoint + latest log.
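the recovery path that implies looks roughly like this (hypothetical sketch; the file layout and record format are assumptions, not the real system):

```python
# recover by loading the last full checkpoint, then replaying the append-only
# log written since that checkpoint. one json record per logged write.
import json

def recover(checkpoint_path: str, log_path: str) -> dict:
    with open(checkpoint_path) as f:
        db = json.load(f)  # complete memory dump at checkpoint time
    with open(log_path) as f:
        for line in f:     # replay writes logged after the checkpoint
            op = json.loads(line)
            if op["type"] == "set":
                db.setdefault(op["key"], {})[op["subkey"]] = op["value"]
            elif op["type"] == "del":
                db.get(op["key"], {}).pop(op["subkey"], None)
    return db
```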
now I just want to mention some differences in choices about the replication and distributed side of things. a lot of this is my taste and also specific use case.
I hate the complexity that the idea of a master and a slave brings; eg the need to elect masters, and split brain. instead the whole distributed system is built on the idea that all nodes are peers, and everything is built on merging. under partition, clients can read stale data, but the window for losing writes is very small (non-0, I admit!) since once acknowledged, it's within one event-loop cycle of being fsynced; once on disk, the write will eventually be replicated to another node once the partition is fixed; crucially, there is never a case where a node's data is 'discarded' because it was wrongly considered a slave or master - there is no such thing; when two nodes connect, they effectively merge their entire databases.
the way I do the merge is that two peers iterate through the entire database (keyscan), 1024 key/value-sets at a time, comparing hashes (merkle tree style). thus they can skip quite fast through areas of the database they agree on. as soon as there is disagreement, they retry that block for 512 keys, then 256, etc, until they get to a small enough block that the entire key/value data is sent (not hashes), and then they can merge the results. the scan is written to happen in the background, and both sides can accept writes during the scan (if you want to configure it that way).
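to make that concrete, here's a simplified in-memory sketch of the scan (the real thing is incremental and runs in the background; the threshold and helper names are mine):

```python
# merkle-style anti-entropy: compare block hashes, halve the block on
# disagreement, and only ship full key/value data for small blocks.
import hashlib

FULL_SYNC_THRESHOLD = 64  # assumption: below this, send real data, not hashes

def block_hash(db: dict, keys: list) -> bytes:
    h = hashlib.sha1()
    for k in keys:
        h.update(k.encode())
        h.update(repr(sorted(db.get(k, {}).items())).encode())
    return h.digest()

def sync_block(local: dict, remote: dict, keys: list, merge_key):
    if block_hash(local, keys) == block_hash(remote, keys):
        return                      # peers agree on this block: skip it fast
    if len(keys) <= FULL_SYNC_THRESHOLD:
        for k in keys:              # small enough: merge the actual data
            merge_key(local, remote, k)
        return
    mid = len(keys) // 2            # disagreement: halve and retry each half
    sync_block(local, remote, keys[:mid], merge_key)
    sync_block(local, remote, keys[mid:], merge_key)

# driver: walk the keyscan 1024 keys at a time, as described above
def sync_all(local: dict, remote: dict, merge_key, block=1024):
    keys = sorted(set(local) | set(remote))
    for i in range(0, len(keys), block):
        sync_block(local, remote, keys[i:i + block], merge_key)
```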
re merge algos: I chose timestamped values (local server based ms clocks), which was good enough for me in terms of 'newest wins'; I am absolutely fine with this - I force the user of the db to design specifically to avoid hundreds of clients fighting for the same key. that's one of the constraints that is 'written large' in the design; *in practice*, I found it rare (in fact, 0) to need to handle extreme contention for a key. so ms clocks was never a problem.
however for tie breaks (to keep things deterministic) 'highest value wins';
since the main value type is a set, they are unioned (delete is a timestamped tombstone).
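put together, the per-key merge rule is roughly this (a sketch; the (value, timestamp, deleted) layout is an assumption for illustration):

```python
# per subkey: newest timestamp wins, highest value breaks ties, and deletes
# are timestamped tombstones, so merging two replicas is just a keyed union.

def merge_entry(a, b):
    # entries are (value, timestamp_ms, deleted) tuples; None means absent
    if a is None:
        return b
    if b is None:
        return a
    if a[1] != b[1]:
        return a if a[1] > b[1] else b   # newest wins
    return a if a[0] >= b[0] else b      # tie break: highest value wins

def merge_sets(local: dict, remote: dict) -> dict:
    merged = {}
    for subkey in set(local) | set(remote):   # union; tombstones survive
        merged[subkey] = merge_entry(local.get(subkey), remote.get(subkey))
    return merged
```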
this simple merge semantic, while braindead, is extremely easy to reason about, and it turns out, because I have efficient set cardinality operations, things like counters (which are hard to merge) are better expressed as sets of things-to-count; sets are easy to merge.
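a tiny illustration of that (my example, not from the actual system): count distinct things by putting them in a set, merge by union, and read the count off as the cardinality:

```python
# counters as sets of things-to-count: union merges trivially, and the
# count is just the cardinality of the merged set.
replica_a = {"player:17", "player:23", "player:42"}
replica_b = {"player:23", "player:99"}

merged = replica_a | replica_b   # merging two replicas is a plain union
print(len(merged))               # -> 4, the merged count
```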
the simple timestamp thing worked because I chose a data design such that there were never many competing writers needing to hammer a single key. what I'm saying is, I laid out the super simple merge semantics above, and since they were so simple to reason about, it was possible to find ways to store all kinds of data in that style, sometimes requiring a bit more client-work than a fancier merge might do, but keeping things simple to reason about. it was a really pragmatic balance between simple distributed db design & putting a tiny bit of constraint on the 'client' (but no fancy error checking required - just careful 'schema' design)
basically, I find merge (in particular in a world without masters & slaves) to be a very powerful tool for reducing the window for data loss. I think it might be a good fit for some version of redis cluster, and it imposes (IMHO) constraints on the user of the system that are easier to reason about & use than other 'kinda C' operations like WAIT.
@mmalex commented Dec 11, 2013:

argh I think github ate my comment.
the nice thing about opt-in (as antirez calls it) approaches is you can modularly build more complex things where the designers of client and db collaborate. eg, to fake a hi-freq counter that merges correctly in the above 'newest / highest wins' scheme, you can abuse the commutative nature of sets that get merged via union:
each node increments 'mycounter' which is a set, the subkey is hostname. each subkey has only one writer (replicated so many readers). the total of the 'mycounter' is the total of all the subkeys (implemented eg via a lua script in redis world). this merges nicely because sets merge nicely.
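a sketch of that per-host counter (python rather than a redis lua script; hostnames and helper names are illustrative):

```python
# 'mycounter' is a set/hash keyed by hostname; each host only ever writes its
# own subkey, the total is the sum over subkeys, and merging takes the max
# per subkey (each per-host count only grows).
import socket

def incr(counter: dict, by: int = 1):
    host = socket.gethostname()              # the single writer for this subkey
    counter[host] = counter.get(host, 0) + by

def total(counter: dict) -> int:
    return sum(counter.values())             # total = sum of per-host subkeys

def merge(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}
```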

@mmalex commented Dec 11, 2013:

...the above turns out to be a form of CRDT counter called a G-COUNTER. thanks @pchapuis! https://twitter.com/pchapuis/status/410907582232944640

@catwell commented Dec 11, 2013:

To clarify, what you call Set is a Hash in Redis (and this is really confusing :p). Also, to merge, you take the max for each key. (I am pchapuis on Twitter)
