Merkle tree for Voldemort
This document describes the Merkle tree (hash tree) design for Voldemort.

A Merkle tree is a data structure with which two nodes can sync up quickly by exchanging little information; for more details, refer to the Dynamo paper [1]. As a fun project, I am working on implementing a Merkle tree for Voldemort, and I would like to get a review from other Voldemort developers.
Quick summary of the Hash-tree structure:
Each node will maintain a disk-based binary tree structure. The number of leaf nodes is fixed and configurable. Each leaf node corresponds to a segment. When a (key, value) pair comes to Voldemort, (key, digest(value)) will be added to one of the segments. Keys are distributed among the segments based on a hash function (a sketch follows the example below).
For example, a segment looks like the following:

segid 1 -> (k1, d1)
           (k3, d3)
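To make the key-to-segment mapping concrete, here is a minimal sketch in Java; the class name and the choice of hash function are illustrative assumptions, not taken from the actual patch:

import java.util.Arrays;

public class SegmentMapper {

    private final int numSegments; // fixed, configurable number of leaf segments

    public SegmentMapper(int numSegments) {
        this.numSegments = numSegments;
    }

    // Map a key to a segment id based on its hash; masking the sign bit keeps
    // the result non-negative even for an Integer.MIN_VALUE hash code.
    public int segmentId(byte[] key) {
        return (Arrays.hashCode(key) & Integer.MAX_VALUE) % numSegments;
    }
}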
Now we have a set of segments. For each segment, a segment hash is calculated by digesting the concatenation of the sorted (key, digest) pairs in that segment. As a first step, all the leaf nodes will be updated with their segment hashes.
          0
      1       2
    3   4   5   6
For example, in the above binary tree, (3,4,5,6) are the leaf nodes, and they correspond to segments (0,1,2,3). After calculating the segment hash for each of the segments (0,1,2,3), we will have (3, segment-hash), ..., (6, segment-hash). Hashes for the internal nodes will then be calculated level by level in the binary tree, all the way up to the root node. In the above tree, node 1 will hold the hash of nodes (3,4).
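A minimal sketch of the two hashing steps, assuming SHA-1 as stated under the implementation details below; the class and method names are illustrative:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.SortedMap;

public class TreeHasher {

    // Segment hash: digest the concatenation of the sorted (key, digest) pairs.
    public static byte[] segmentHash(SortedMap<String, byte[]> segment)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        for (Map.Entry<String, byte[]> entry : segment.entrySet()) {
            sha1.update(entry.getKey().getBytes());
            sha1.update(entry.getValue());
        }
        return sha1.digest();
    }

    // Internal node hash: digest the concatenation of the children's hashes,
    // computed level by level all the way up to the root.
    public static byte[] internalHash(byte[] left, byte[] right)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(left);
        sha1.update(right);
        return sha1.digest();
    }
}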
Hash-tree storage:
The hash tree will subscribe to all key additions/deletions on the node. I don't want to maintain the entire hash tree in main memory: assume we have 1 billion (key, value) pairs on a node and each pair takes about 20 bytes; that costs roughly 20 GB. I am planning to use LevelDB as the backend for storing the segment data as well as the segment hashes. The hash tree will run in a background thread so that additions/deletions of key-value pairs are non-blocking, which avoids any increase in latency for Voldemort.
Maintaining the Hash-tree structure:
On every addition/deletion of a key-value pair, the corresponding segment will be updated in storage, so segments are always up to date. We then want to update the internal nodes' hashes to match the segments. This involves taking the digest of each segment, which is a costly task, so I don't want to update the internal nodes' hashes on every segment update. Instead, on every update of a segment, a dirty marker will be set for that segment. A background thread reads the dirty segment markers and updates the internal nodes' hashes at regular intervals.
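A minimal sketch of the dirty-marker bookkeeping; the class name and the use of a BitSet are assumptions:

import java.util.BitSet;

public class DirtySegments {

    private final BitSet dirty = new BitSet();

    // Called on every key addition/deletion, right after the segment is updated.
    public synchronized void markDirty(int segmentId) {
        dirty.set(segmentId);
    }

    // Called by the background thread at regular intervals: atomically take the
    // current set of dirty segments so their hashes, and the internal nodes
    // above them, can be recomputed.
    public synchronized BitSet clearAndGetDirty() {
        BitSet snapshot = (BitSet) dirty.clone();
        dirty.clear();
        return snapshot;
    }
}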
I assume all changes to Voldemort are made through some internal API and that no direct changes are made to the backend. If that is not the case, the hash tree will not have a consistent view of the node's data. If that can happen, we need a job that rebuilds the entire hash tree at regular intervals, discarding the segment data.
Since in Voldemort the same node can maintain many partitions, we need a hash tree per partition. Otherwise, during a sync between two nodes, the root nodes will not have the same hash even though the nodes hold the same data for a particular partition.
Hash-tree sync:
The hash tree will run a sync thread, which will initiate connections with the other nodes backing up the partition and figure out the differences. At the end of the difference calculation, it will know the differing segments, the segments missing on the remote node, and the segments missing on the local node. For each differing segment, the sync thread will check the (key, digest) pairs on the local node against the remote node, and will add/update/delete keys based on that. For each segment missing on the remote node, the local node will send out all of its keys to the remote node. For each segment missing on the local node, the local node will ask the remote node to delete all the keys in that segment.
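A minimal sketch of how the differing segments could be found by walking both trees from the root, with node numbering as in the diagram above; the RemoteTree interface is a hypothetical stand-in for the Thrift client, and handling of missing segments is omitted:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TreeDiff {

    public interface RemoteTree {
        byte[] hashOf(int nodeId); // fetch the remote hash for one tree node
    }

    // Nodes are numbered as in the diagram: children of node i are 2i+1 and 2i+2.
    public static List<Integer> findDifferingLeaves(byte[][] localHashes,
                                                    RemoteTree remote,
                                                    int nodeId,
                                                    int firstLeafId) {
        List<Integer> differing = new ArrayList<>();
        if (Arrays.equals(localHashes[nodeId], remote.hashOf(nodeId)))
            return differing; // subtrees match, prune this whole branch
        if (nodeId >= firstLeafId) {
            differing.add(nodeId); // a differing leaf means a differing segment
        } else {
            differing.addAll(findDifferingLeaves(localHashes, remote, 2 * nodeId + 1, firstLeafId));
            differing.addAll(findDifferingLeaves(localHashes, remote, 2 * nodeId + 2, firstLeafId));
        }
        return differing;
    }
}

Pruning whole subtrees whose hashes match is what keeps the exchanged information small.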
A complication arises when we initiate a sync with a remote node that has just started for the first time, which results in many missing segments on the remote node. To avoid this, the hash tree will maintain a timestamp of when the segment hashes were last updated, and if the timestamp is not within the last few minutes, the sync against the remote node will be skipped.
I have used Thrift for the RPC communication between the two nodes.
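The actual interface is defined in Thrift; as a hypothetical Java view, the operations the sync thread needs from the remote node might look like this (all names and signatures are illustrative):

import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

public interface HashTreeSyncService {

    // Fetch the hashes of the given tree nodes, starting from the root.
    Map<Integer, ByteBuffer> getSegmentHashes(int treeId, List<Integer> nodeIds);

    // Fetch the sorted (key, digest) pairs of one differing segment.
    Map<ByteBuffer, ByteBuffer> getSegment(int treeId, int segmentId);

    // Push keys that are missing or stale on the remote node.
    void putKeys(int treeId, Map<ByteBuffer, ByteBuffer> keyValuePairs);

    // Ask the remote node to delete all keys in a segment it should not have.
    void deleteKeys(int treeId, List<ByteBuffer> keys);
}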
LevelDB entries:
Segment data will have the following format:

Key                             Value
['S'|tree-id|segment-id|key] -> digest

Here 'S' is a marker denoting segment data; (marker, tree-id, segment-id, key) are the components of the composite key. The marker costs 1 byte, the tree-id 4 bytes, and the segment-id 4 bytes; the key can be of arbitrary length. The digest will cost about 20 bytes (SHA-1 digest).

Segment hashes will have the following format:

['H'|tree-id|node-id] -> segment hash
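A minimal sketch of building these composite keys; the helper names are illustrative:

import java.nio.ByteBuffer;

public class LevelDbKeys {

    // ['S'|tree-id|segment-id|key]: 1-byte marker + 4 + 4 bytes of ids + raw key.
    public static byte[] segmentDataKey(int treeId, int segmentId, byte[] key) {
        return ByteBuffer.allocate(1 + 4 + 4 + key.length)
                         .put((byte) 'S')
                         .putInt(treeId)
                         .putInt(segmentId)
                         .put(key)
                         .array();
    }

    // ['H'|tree-id|node-id]: 1-byte marker + 4 + 4 bytes.
    public static byte[] segmentHashKey(int treeId, int nodeId) {
        return ByteBuffer.allocate(1 + 4 + 4)
                         .put((byte) 'H')
                         .putInt(treeId)
                         .putInt(nodeId)
                         .array();
    }
}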
Implementation details:
1) The SHA-1 algorithm is used to digest the values and to generate the digests of the internal nodes in the tree.
2) LevelDB is used as permanent storage. Calls to the storage go through a non-blocking class, so the storage API layer will not see any latency increase from enabling hash trees in the Voldemort server (see the sketch after this list).
3) An in-memory DB is used only for unit testing.
4) Segment hashes for the internal nodes are not updated on each addition/removal of a key; rather, they are updated at regular intervals.
5) A background job will completely rebuild the hash tree at longer intervals. This prevents divergence between the actual storage and the hash tree, which can happen if updates are made directly to the storage without forwarding the calls to the hash tree.
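A minimal sketch of the non-blocking call path from point 2; the class and method names are assumptions:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class NonBlockingHashTree {

    private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();

    public NonBlockingHashTree() {
        Thread worker = new Thread(() -> {
            try {
                while (true)
                    pending.take().run(); // apply queued updates to LevelDB
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called from Voldemort's storage path; enqueues the work and returns immediately.
    public void hPut(byte[] key, byte[] digest) {
        pending.offer(() -> { /* write (key, digest) to its segment, mark it dirty */ });
    }

    public void hRemove(byte[] key) {
        pending.offer(() -> { /* delete key from its segment, mark it dirty */ });
    }
}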
Links:

[1] http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf