I've put together these notes while reading about DHTs in depth and learning how libtorrent's implementation, based on the Kademlia paper, actually works.
400,000,000,000 (400 billion stars, roughly the estimate for our galaxy): that's a 4 followed by 11 zeros.
The number of atoms in the universe is estimated to be around 10^82.
A DHT with 160-bit keys can hold 2^160 possible numbers, which is around 10^48.
The Kademlia DHT can be used to represent and find that many objects in at most O(log n) hops, in a distributed fashion. Each hop roughly halves the XOR distance to the target, so even if you had a DHT with 2^160 participants (nodes) in it, you'd only have to contact (in the worst case) on the order of log2(2^160) = 160 of those nodes to find other nodes, or objects stored by those nodes. That's pretty remarkable when you consider that there could be 1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 nodes.
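A quick sanity check of those numbers (plain Python, nothing DHT-specific):

import math

key_space = 2 ** 160             # number of possible 160-bit IDs
print(key_space)                 # 1461501637330902918203684832716283019655932542976
print(format(key_space, ".2e"))  # ~1.46e+48, i.e. on the order of 10^48
print(math.log2(key_space))      # 160.0 -> worst-case hops for a lookup in a network that large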
A DHT is a wonderful distributed system for finding things in a huge space of possibilities. It can be used to find other participants, or to store representations of objects that can later be looked up by any other participant on the network, and it all works as long as every participant follows a simple set of rules. After learning what I've learned, I now think of DHTs more as a distributed routing technology than a distributed storage technology; I'd actually call them DRTs, Distributed Routing Tables.
A "peer" is a client/server listening on a TCP port that implements the BitTorrent protocol. A "node" is a client/server listening on a UDP port implementing the distributed hash table protocol.
The DHT is composed of nodes, and stores the location of peers.
BitTorrent clients include a DHT node, which is used to contact other nodes in the DHT to get the location of peers to download from using the BitTorrent protocol.
NODES have a unique identifier called the "node ID": a random 160-bit number which can be used to calculate a "distance" between any two nodes (the XOR distance, as specified by the Kademlia paper).
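To make the XOR distance concrete, here's a minimal sketch in Python, treating node IDs as the 20-byte strings the DHT actually sends over the wire:

import os

def xor_distance(id_a: bytes, id_b: bytes) -> int:
    # Kademlia XOR metric: interpret both 20-byte IDs as big-endian
    # integers and XOR them. A smaller result means "closer".
    assert len(id_a) == len(id_b) == 20
    return int.from_bytes(id_a, "big") ^ int.from_bytes(id_b, "big")

node_a = os.urandom(20)   # two random 160-bit node IDs
node_b = os.urandom(20)
print(xor_distance(node_a, node_a))  # 0 -> a node is at distance 0 from itself
print(xor_distance(node_a, node_b))  # some huge integer < 2**160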
The routing table keeps a list of known GOOD nodes. A node is considered good if it has responded to one of our queries within the last 15 minutes, or if it has queried us within the last 15 minutes. After 15 minutes of inactivity, a node becomes questionable. If a node fails to respond to several queries in a row, it becomes bad and is removed from the routing table.
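A minimal sketch of that bookkeeping (the 15-minute window is the one described above; the two-failure threshold is just an assumption for illustration, libtorrent has its own policy):

import time

GOOD_WINDOW = 15 * 60  # seconds: responded to us, or queried us, within the last 15 minutes

class NodeContact:
    def __init__(self, node_id: bytes, ip: str, port: int):
        self.node_id, self.ip, self.port = node_id, ip, port
        self.last_active = time.monotonic()   # last time it answered us or queried us
        self.failed_queries = 0

    def mark_active(self) -> None:
        self.last_active = time.monotonic()
        self.failed_queries = 0

    def mark_failed(self) -> None:
        self.failed_queries += 1

    def status(self, max_fails: int = 2) -> str:
        if self.failed_queries >= max_fails:
            return "bad"            # candidate for removal from the routing table
        if time.monotonic() - self.last_active <= GOOD_WINDOW:
            return "good"
        return "questionable"       # ping it before trusting it again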
The routing table consists of lists of up to K nodes each, where K is a system-wide constant. In the case of BitTorrent's DHT, K=8.
The lists group nodes by their most significant bits. If our key space is made of 160-bit keys, we can have at most 160 lists; these lists are known as buckets.
So if K=8 and the key length is 160 bits, in theory we could have up to 1,280 nodes (160 × 8) in our routing table. In practice this number is never reached: the buckets closer to our own ID cover ever smaller slices of the key space, so each split leaves fewer and fewer candidate nodes that can be found and added to the K-sized group for that bit.
The routing table consists of buckets of size K; currently K = 8 in libtorrent.
Each bucket represents a subrange of the key space. When the table starts out empty, it has 2 buckets:
BUCKET[0 .. 2^159)
BUCKET[2^159 .. 2^160)
Your node ID will fall into one of these two buckets, where it will sit by itself at the start. Once the first K-1 nodes have been added to our home bucket (the one our own node ID lives in), the next node to arrive will most likely force that bucket to be split, since the bucket is now considered full with K nodes.
If a bucket is already full of good nodes, no more nodes can be added to it, unless our own node ID falls within the range of the bucket we're trying to add the new node to; in that case the bucket is split instead of the new node being rejected.
Again, splits occur only to accommodate new nodes that fall within the range covering our own node; otherwise these nodes are simply ignored. (The Kademlia Wikipedia entry recommends keeping them in a replacement cache, so they can be inserted into the buckets when bad nodes are purged out.)
These splits separate nodes by their most significant N bits into N buckets (lists). This means that if our node IDs are 160 bits long, we end up with at most 160 buckets.
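One way to picture which bucket a given node ends up in: take the XOR distance between our ID and theirs, and look at the position of its highest set bit. This is only a sketch of the usual "one bucket per leading distance bit" layout; libtorrent splits buckets lazily as described above, but the mapping is the same idea:

import os

def bucket_index(our_id: bytes, other_id: bytes) -> int:
    # Bucket i covers XOR distances in [2**i, 2**(i+1)),
    # i.e. nodes whose IDs share the top (159 - i) bits with ours.
    distance = int.from_bytes(our_id, "big") ^ int.from_bytes(other_id, "big")
    if distance == 0:
        raise ValueError("our own ID doesn't go into the routing table")
    return distance.bit_length() - 1   # 0 .. 159

me, other = os.urandom(20), os.urandom(20)
print(bucket_index(me, other))  # almost always 158 or 159: random IDs land far away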
Buckets have to keep track of when they were last updated (that is, when a node was added, refreshed or replaced), in a "last refreshed" variable.
If a node doesn't respond to a ping, it's recommended to ping it once again to make sure it's really down.
If a bucket hasn't been refreshed in 15 minutes, it's refreshed by picking a random ID in its range and performing a "find_node" lookup on it.
Nodes that are able to receive incoming requests usually don't need to refresh their buckets that often, since the operations triggered by those incoming requests do the refreshing for them. Nodes behind NATs that prevent them from receiving unsolicited connections will have to perform these refreshes periodically to keep a healthy routing table in place.
A refresh consists of a node lookup for a randomly generated node ID that falls within the range covered by that k-bucket.
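A sketch of how such a refresh target can be generated, assuming the same bucket layout as the sketch above (bucket i covers XOR distances in [2^i, 2^(i+1))):

import os

def random_refresh_target(our_id: bytes, bucket_idx: int) -> bytes:
    # Pick a distance whose highest set bit is `bucket_idx`, then XOR it back
    # onto our own ID to obtain a node ID that lies inside that bucket's range.
    distance = (1 << bucket_idx) | (int.from_bytes(os.urandom(20), "big") % (1 << bucket_idx))
    return (int.from_bytes(our_id, "big") ^ distance).to_bytes(20, "big")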
KRPC, the DHT's wire protocol, is a simple RPC mechanism consisting of bencoded dictionaries sent over UDP.
Contacts are encoded as follows:
- Peers: byte[6] consisting of the IP and PORT in network byte order.
- DHT Nodes: byte[26] consisting of the Node ID (byte[20]) and IP:PORT (byte[6]) in network byte order.
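A small sketch of those two encodings in Python (IPv4 only, which is what the MainLine DHT uses):

import socket
import struct

def encode_peer(ip: str, port: int) -> bytes:
    # 6 bytes: 4-byte IPv4 address + 2-byte port, both in network byte order
    return socket.inet_aton(ip) + struct.pack("!H", port)

def decode_peer(data: bytes) -> tuple:
    ip = socket.inet_ntoa(data[:4])
    (port,) = struct.unpack("!H", data[4:6])
    return ip, port

def decode_node(data: bytes) -> tuple:
    # 26 bytes: 20-byte node ID followed by compact peer info
    return (data[:20],) + decode_peer(data[20:26])

def decode_nodes(blob: bytes) -> list:
    # A "nodes" value is just N of these 26-byte entries back to back
    return [decode_node(blob[i:i + 26]) for i in range(0, len(blob), 26)]

print(decode_peer(encode_peer("203.0.113.7", 6881)))  # ('203.0.113.7', 6881)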
A KRPC message is a single dictionary with two keys common to every message and additional keys depending on the type of message. {"t":txId, "y":msgType}
Every message has a key "t" with a string value representing a transaction ID. This transaction ID is generated by the querying node and is echoed in the response, so responses can be correlated with multiple outstanding queries to the same node. The transaction ID should be encoded as a short string of binary data; typically 2 octets are enough, as they cover 2^16 outstanding queries.
The other key contained in every KRPC message is "y" with a single character value describing the type of message.
The value of the "y" key is one of "q" for query, "r" for response, "e" for error.
When a "y=q"
query is performed, it goes along 2 other keys "q=<method name>"
and "a=<dict with named Arguments>"
When a "y=r"
response is performed, it goes along with an "r=<dict with Return values>"
When a "y=e"
error is sent, it goes alogn with a "e=<list>"
, the list's first element is an integer with an error code (201, 202, 203, 204)
, the second element contains an error message.
libtorrent adds/interprets a "v" key, for the name and version of the client being used. It's a 4-character string: 2 characters for the client name and 2 for the version. Currently known names are "UT" (uTorrent), "LT" (libtorrent), "GR" (GetRight) and "MP" (MooPolice).
All queries have an "id"
key, its value is the querying node's ID.
A response will also have an "id"
key, but its value will be the responding node's ID.
Let's start with the PING query. It's a query, so it must have a "q" key whose value is the method name, "ping":
{
"t": <transaction id>, # String(byte[2])
"y": "q",
"q": "ping",
"a": {"id":<other node id> /** String(byte[20]) */ }
}
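On the wire these dictionaries are bencoded rather than written out like this. A minimal bencoder (a sketch; it only handles the types KRPC needs) is enough to see the ping query above as actual bytes; the transaction ID "aa" and node ID are placeholders:

def bencode(value) -> bytes:
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode()
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # dictionary keys must be byte strings and must appear in sorted order
        items = sorted((k.encode() if isinstance(k, str) else k, v) for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

ping = {"t": "aa", "y": "q", "q": "ping", "a": {"id": b"abcdefghij0123456789"}}
print(bencode(ping))
# b'd1:ad2:id20:abcdefghij0123456789e1:q4:ping1:t2:aa1:y1:qe'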
Responses have "y=r"
and the "r=<response dictionary>"
the response would be:
{
"t": <same transaction id of querying ping>,
"y": "r",
"r": {"id":<reponding node id>},
}
FIND_NODE (remember, NODES are DHT nodes):
Given a node's ID, you can try to find its contact information with this message. The querying arguments dictionary looks as follows:
"a": {"id":"querying node id",
"target":"target node id"}
The response looks as follows:
"r" : {"id": "target node id",
"nodes": "8 closest nodes found encoded as 26bytes each"}
Example:
Query:
{"t":"aa",
"y":"q",
"q":"find_node",
"a": {"id":"abcdefghij0123456789",
"target":"mnopqrstuvwxyz123456"}
}
Response:
{"t":"aa",
"y":"r",
"r": {"id":"0123456789abcdefghij", "nodes": "def456..."}
}
GET_PEERS
Given a torrent's infohash, find the associated peers. Remember, NODE != PEER. So for this message, we perform a query as follows:
Query:
{"t":"bb",
"y":"q",
"q":"get_peers",
"a":{"id":"abcdefghij0123456789",
"info_hash":"20 byte torrent info hash"}
}
If the queried node knows of peers for the torrent, it includes them in a list under the "values" key of the "r" response dictionary.
Otherwise, if it doesn't know of any peers, it includes the closest nodes from its own routing table under a "nodes" key (just like a "find_node" response).
It also includes a token value, which is required for any future announce_peer message; this token is replaced every 10 minutes or so (a sketch of one way to generate such tokens follows the response examples below).
Response:
{"t":"bb",
"y":"r",
"r": {"id":"abcdefghij0123456789",
"token":"aoeusnth",
"values": ["axje.u", "idhtnm"]}
}
or, in case it didn't find any peers:
Response:
{"t":"bb",
"y":"r",
"r": {"id":"abcdefghij0123456789",
"token":"aoeusnth",
"nodes": "def456..."}
}
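How that token is generated is up to the responding node. BEP 5 describes the reference implementation as hashing the requester's IP together with a secret that changes every five minutes, so that only the node that asked can later announce from that address. A sketch along those lines; the secret size and the "keep one old secret" detail are choices made here for illustration:

import hashlib
import os
import time

SECRET_LIFETIME = 5 * 60              # rotate the secret every five minutes
_secrets = [os.urandom(16)]           # current secret first, previous one second
_last_rotation = time.monotonic()

def _secrets_now() -> list:
    global _last_rotation
    if time.monotonic() - _last_rotation > SECRET_LIFETIME:
        _secrets.insert(0, os.urandom(16))
        del _secrets[2:]              # keeping one old secret keeps tokens valid ~10 minutes
        _last_rotation = time.monotonic()
    return _secrets

def make_token(requester_ip: str) -> bytes:
    return hashlib.sha1(requester_ip.encode() + _secrets_now()[0]).digest()

def token_is_valid(requester_ip: str, token: bytes) -> bool:
    return any(hashlib.sha1(requester_ip.encode() + s).digest() == token
               for s in _secrets_now())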
If you have the torrent and you're about to join the swarm, you should send a "q":"announce_peer" message. It includes the info_hash and the last token you received when you queried get_peers for that torrent earlier. There's some other information in it as well, such as which port your peer can be reached on, but it's not that relevant for our exploratory purposes.
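Still, for completeness, BEP 5 defines the announce_peer arguments as the querying node's id, the info_hash, the port our peer is listening on, and the token from the earlier get_peers response. It would look something like this (values are placeholders):
Query:
{"t":"cc",
 "y":"q",
 "q":"announce_peer",
 "a": {"id":"abcdefghij0123456789",
       "info_hash":"20 byte torrent info hash",
       "port": 6881,
       "token":"aoeusnth"}
}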
So that's pretty much it for the basic messages allowed by the MainLine DHT, but libtorrent's implementation has extended this with other messages.
libtorrent will always include a "nodes" key in the "r" dictionary, even if it also found peers associated with the info hash being queried, so it will send both "values" (the peers found) and "nodes" (the DHT nodes it queried). libtorrent will keep traversing until it has found 8 nodes, and it will announce to all of them when it's time for the "announce_peer" message.
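A sketch of the bookkeeping behind that traversal; the network I/O is left out, this only shows the "keep the 8 closest by XOR distance" part:

def k_closest(candidates, target: bytes, k: int = 8):
    # candidates: iterable of (node_id, ip, port) tuples gathered from "nodes" replies
    t = int.from_bytes(target, "big")
    return sorted(candidates, key=lambda c: int.from_bytes(c[0], "big") ^ t)[:k]

# Traversal outline:
#   shortlist = k_closest(bootstrap_contacts, info_hash)
#   while the shortlist contains nodes we haven't queried yet:
#       send get_peers to them, collect "values", merge their "nodes" into candidates
#       shortlist = k_closest(candidates, info_hash)
#   finally, announce_peer to the nodes left in the shortlist (libtorrent announces to all 8)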
It is recommended, for forward compatibility, that any unknown query containing either an "info_hash" or a "target" argument be interpreted as a "find_node" query, so that clients which don't understand the new message can still return useful nodes and the lookup can make progress, instead of the message simply being ignored.
This is the part that interests me the most, and after I explain it I'll make a follow-up video explaining why I'm still scratching my head over how to use it.
The idea of this extension is to implement messages that allow us to store and retrieve arbitrary binary data.
It supports storing IMMUTABLE items. Once stored, they can't be changed. The key of such items is the SHA-1 hash of the data being stored.
It also supports storing MUTABLE items; in this case the key is the public key used to sign the data.
The messages proposed are "q":"put" and "q":"get". These messages are similar to "get_peers" and "announce_peer".
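For immutable items the target key can be computed locally: as far as I understand the proposal, it's the SHA-1 hash of the bencoded form of the value being stored, and the message flow mirrors get_peers / announce_peer (a "get" hands back a write token that the "put" must echo). A rough sketch, with the message shapes only as comments, since the exact argument names come from the proposal and not from anything shown above:

import hashlib

def immutable_target(value: bytes) -> bytes:
    # For a plain byte string, the bencoded form is just b"<length>:<bytes>"
    bencoded = b"%d:%s" % (len(value), value)
    return hashlib.sha1(bencoded).digest()

value = b"Hello DHT"
print(immutable_target(value).hex())   # the 20-byte key anyone can later "get" this value by

# Message shapes (sketch):
#   get: {"y":"q", "q":"get", "a": {"id": <our node id>, "target": <immutable_target(value)>}}
#   put: {"y":"q", "q":"put", "a": {"id": <our node id>, "v": <value>, "token": <token from get>}}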