jlouis/gist:2c5cf7ab6bc9653df0f4

## gistfile1.txt
— [2014-09-22] [GH/jlouis/dht_bt] Thoughts on QuickChecking a DHT

	I am currently trying to get the BitTorrent DHT, which is Kademlia based into shape. I have code for all of it, but it is tied
	deeply into the way we did stuff in BitTorrent and I want a way to do this outside of eTorrent, because a world-global DHT
	for Erlang nodes are generally applicable and useful. So this is where we stand. The BitTorrent DHT has two important properties

	* Very high resilience against node failure
	* Slow query rates

	For a global DHT, these are exactly the properties you are after. We don't care that queries will take a bit of time, as long as they
	are resistent to error in other ways.

	The correctness of such a DHT relies on extensive testing. My way of testing is to build QuickCheck models. At least for code I write
	in my spare time at my own leisure. So here goes.

	— General considerations:
	A complete model would essentially be a simulator of a world. That is, our System-under-test (SUT) lives inside a simulator where we
	can inject faulty scenarios and have the system fail in reproducible ways. This attacks the “distributed” part of being a DHT. However,
	the problem with such a model is the complexity buildup. We don't want a model with this kind of complexity early on in a software
	project. In fact, there has to be simpler models capturing simpler problems with DHT software for much less of the work involved.

	— Metrics:
	In Kademlia, each node is mapped into a space S on which we have a distance metric d : (S * S) → double(). The properties of a
	metric must be present:

	reflexivity: d(s, s) = 0 forall s in S
	symmetry: d(x, y) = d(y, x) forall x,y in S
	triangle-inequality: d(x, y) + d(y, z) ≥ d(x, y) forall x,y,z in S

	The XOR operation has this property, but it is good to verify this. It is actually peculiar that the triangle inequality holds. XOR
	takes operations from a 160 bit integer space into an integer, which are embedded in the real numbers. So it is valid as a metric.
	But interestingly it turns out the triangle inequality is true for this.

	— A single-node DHT:
	The next phase is to observe that the simplest possible DHT is a single-node DHT. For such a DHT, a collapse happens and we
	are left with a hash table—no distribution in sight. For such a hash table, we should be able to carry out standard DHT operations
	where we store data into the DHT, query data out again. And we should get the data we stored.

	While not testing the distributed features, it does validate our DHT operates as it should when storing data locally.

	— A single-value DHT
	Another simplifying assumption is a DHT consisting of several nodes, but only stores a single value. Such a DHT is simpler
	to comprehend and understand. For a node, a store is either such that the node itself is closest, or there are another node
	which is closer in which case the store should be forwarded.

	For query, the same holds true. It is either a local query or a forwardable query.

	— Cache correctness
	In Kademlia, you keep a "gossip" cache of other nodes you know about. Since you can't store every node, you look at the
	bitwise prefix of a node compared to your own. For each prefix count of equal bits, you store up to 8 foreign nodes. This means
	you know qualitatively more about close nodes than about far away nodes. It also limits the total node count to 160 * 8 = 1280.

	This is excellent since this is low enough we can construct real-world simulations of such a small number.

	It is important to write a cache test which tests important properties of the node cache. But the actual properties we are seeking
	are hiding themselves for me currently.

	— What properties do we even want?
	What is a correct hash table? We want to check that if enough nodes collaborate then we will eventually find stored data if we
	search for it. And we want to make sure that storing data is actually stored.

	But the last property doesn't hold! If we store data at foreign nodes and all of those nodes dies just thereafter, then the data
	is lost. Forever. We rely on this to be highly unlikely to happen. So the state of the universe is stochastic, and we rely on it being
	highly unlikely that bad things happens. For a test though, we have to be able to express safety properties exactly. So maybe these
	are not the properties we are looking for.

	Convergence could be a nice property. If I ask for nodes closer to an ID, then the node will never respond with nodes farther from
	the ID.

	Hunch: there are a lot of simple properties which describe DHT systems. But it is rather hard to find a single property which uniquely
	describes the system as a whole. So a DHT can be described as a set of simple properties the system must have.
	— [2014-09-22] [GH/jlouis/dht_bt] Thoughts on QuickChecking a DHT

	I am currently trying to get the BitTorrent DHT, which is Kademlia based into shape. I have code for all of it, but it is tied
	deeply into the way we did stuff in BitTorrent and I want a way to do this outside of eTorrent, because a world-global DHT
	for Erlang nodes are generally applicable and useful. So this is where we stand. The BitTorrent DHT has two important properties

	* Very high resilience against node failure
	* Slow query rates

	For a global DHT, these are exactly the properties you are after. We don't care that queries will take a bit of time, as long as they
	are resistent to error in other ways.

	The correctness of such a DHT relies on extensive testing. My way of testing is to build QuickCheck models. At least for code I write
	in my spare time at my own leisure. So here goes.

	— General considerations:
	A complete model would essentially be a simulator of a world. That is, our System-under-test (SUT) lives inside a simulator where we
	can inject faulty scenarios and have the system fail in reproducible ways. This attacks the “distributed” part of being a DHT. However,
	the problem with such a model is the complexity buildup. We don't want a model with this kind of complexity early on in a software
	project. In fact, there has to be simpler models capturing simpler problems with DHT software for much less of the work involved.

	— Metrics:
	In Kademlia, each node is mapped into a space S on which we have a distance metric d : (S * S) → double(). The properties of a
	metric must be present:

	reflexivity: d(s, s) = 0 forall s in S
	symmetry: d(x, y) = d(y, x) forall x,y in S
	triangle-inequality: d(x, y) + d(y, z) ≥ d(x, y) forall x,y,z in S

	The XOR operation has this property, but it is good to verify this. It is actually peculiar that the triangle inequality holds. XOR
	takes operations from a 160 bit integer space into an integer, which are embedded in the real numbers. So it is valid as a metric.
	But interestingly it turns out the triangle inequality is true for this.

	— A single-node DHT:
	The next phase is to observe that the simplest possible DHT is a single-node DHT. For such a DHT, a collapse happens and we
	are left with a hash table—no distribution in sight. For such a hash table, we should be able to carry out standard DHT operations
	where we store data into the DHT, query data out again. And we should get the data we stored.

	While not testing the distributed features, it does validate our DHT operates as it should when storing data locally.

	— A single-value DHT
	Another simplifying assumption is a DHT consisting of several nodes, but only stores a single value. Such a DHT is simpler
	to comprehend and understand. For a node, a store is either such that the node itself is closest, or there are another node
	which is closer in which case the store should be forwarded.

	For query, the same holds true. It is either a local query or a forwardable query.

	— Cache correctness
	In Kademlia, you keep a "gossip" cache of other nodes you know about. Since you can't store every node, you look at the
	bitwise prefix of a node compared to your own. For each prefix count of equal bits, you store up to 8 foreign nodes. This means
	you know qualitatively more about close nodes than about far away nodes. It also limits the total node count to 160 * 8 = 1280.

	This is excellent since this is low enough we can construct real-world simulations of such a small number.

	It is important to write a cache test which tests important properties of the node cache. But the actual properties we are seeking
	are hiding themselves for me currently.

	— What properties do we even want?
	What is a correct hash table? We want to check that if enough nodes collaborate then we will eventually find stored data if we
	search for it. And we want to make sure that storing data is actually stored.

	But the last property doesn't hold! If we store data at foreign nodes and all of those nodes dies just thereafter, then the data
	is lost. Forever. We rely on this to be highly unlikely to happen. So the state of the universe is stochastic, and we rely on it being
	highly unlikely that bad things happens. For a test though, we have to be able to express safety properties exactly. So maybe these
	are not the properties we are looking for.

	Convergence could be a nice property. If I ask for nodes closer to an ID, then the node will never respond with nodes farther from
	the ID.

	Hunch: there are a lot of simple properties which describe DHT systems. But it is rather hard to find a single property which uniquely
	describes the system as a whole. So a DHT can be described as a set of simple properties the system must have.