I've come up with an interesting design for a hash map that, instead of bucket lists, recursively embeds more hash maps, with a new key at each level.
Each hash map is very simple: it has two slots, and hash keys are 1 bit. The original key is hashed to a 32-bit integer, and each bit of that integer is the 1-bit key for one level of the recursive hash map.
In C, my structure looks like:
struct hash_tree_node {
    void *key;
    void *value;
    struct hash_tree_node *left;
    struct hash_tree_node *right;
};
In other words, this is shaped exactly like a binary tree.
Algorithm for setting a value, starting at the root node (a C sketch follows the list):
- If the slot is free or its key matches ours, use it.
- Otherwise:
  - if the hash ends in 1, recurse on the left child (shift the hash right 1 bit to get the new final bit)
  - else recurse on the right child (shifting the hash the same way)
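Here's a minimal sketch of that set operation in C. I'm assuming a hash32() integer hash (the name is a placeholder), pointer equality as the key match, and plain calloc allocation; in my real systems the nodes come from the VM heap instead, and the count bookkeeping from the top node (introduced below) is omitted here.

#include <stdint.h>
#include <stdlib.h>

uint32_t hash32(void *key); /* placeholder: any good 32-bit integer hash */

void hash_tree_set(struct hash_tree_node **slot, void *key, void *value)
{
    uint32_t h = hash32(key);
    while (*slot) {
        struct hash_tree_node *n = *slot;
        if (!n->key || n->key == key) {        /* free slot (or tombstone), or our key */
            n->key = key;
            n->value = value;
            return;
        }
        slot = (h & 1) ? &n->left : &n->right; /* low bit picks the child */
        h >>= 1;                               /* expose the next level's bit */
    }
    *slot = calloc(1, sizeof **slot);          /* path ended in NULL: grow the tree */
    (*slot)->key = key;
    (*slot)->value = value;
}

Taking the address of the root pointer lets the same loop fill an empty root and grow new leaves without any special cases.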
As you can see, finding a free slot is O(log n) in the expected case. Each hash key gives you its own pseudo-random search path through the binary tree.
I optimize a little with a special top node that records a running total of all value nodes, for quick length checks.
struct hash_tree {
    struct hash_tree_node *root;
    uintptr_t count;
};
Reading a value from the hash also costs O(log n), since we must walk the tree using the hashed bits.
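A sketch of the lookup, reusing hash32() and the structs from above:

void *hash_tree_get(struct hash_tree *t, void *key)
{
    uint32_t h = hash32(key);
    struct hash_tree_node *n = t->root;
    while (n) {
        if (n->key == key)                /* pointer equality stands in for key match */
            return n->value;
        n = (h & 1) ? n->left : n->right; /* follow the same bit path as set */
        h >>= 1;
    }
    return NULL;                          /* ran off the path: not present */
}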
Deleting a value has similar cost. Once found, we simply set the key to NULL, and the set algorithm knows it can reuse this slot instead of looking deeper.
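Delete is the same walk, ending in a tombstone rather than an unlink (again a sketch on the same assumed helpers):

void hash_tree_delete(struct hash_tree *t, void *key)
{
    uint32_t h = hash32(key);
    struct hash_tree_node *n = t->root;
    while (n) {
        if (n->key == key) {
            n->key = NULL;                /* tombstone: set can reuse this slot */
            n->value = NULL;
            t->count--;
            return;
        }
        n = (h & 1) ? n->left : n->right;
        h >>= 1;
    }
}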
Iterating over this can be done without recursion or an explicit stack by using Morris traversal.
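For example, an in-order Morris traversal that skips tombstoned slots. Note that it temporarily rewrites right pointers to thread the tree, so nothing else may read the tree while an iteration is in flight (sketch):

void hash_tree_each(struct hash_tree_node *cur,
                    void (*fn)(void *key, void *value))
{
    while (cur) {
        if (!cur->left) {
            if (cur->key)
                fn(cur->key, cur->value);   /* visit, skipping tombstones */
            cur = cur->right;
        } else {
            /* find the in-order predecessor: rightmost node of the left subtree */
            struct hash_tree_node *pre = cur->left;
            while (pre->right && pre->right != cur)
                pre = pre->right;
            if (!pre->right) {
                pre->right = cur;           /* thread back to cur, descend left */
                cur = cur->left;
            } else {
                pre->right = NULL;          /* second visit: remove the thread */
                if (cur->key)
                    fn(cur->key, cur->value);
                cur = cur->right;
            }
        }
    }
}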
Is there a name for this structure?
How does it compare to traditional hash maps and balanced binary trees?
Note that while this is shaped like a binary tree, there is no balancing. Nodes can be empty and reused later. Assuming a good integer hash for the bitfield used as the path, the tree should stay balanced in expectation.
Since my bit path is 32 bits long, if you happened to store more than 4 billion unique keys, the paths below depth 32 would degenerate into linear chains. But there would be 4,294,967,296 of them, so it shouldn't matter unless you went way past that count.
Of course, if you're storing that many keys in my simple hash/tree, you're probably using it wrong. You could simply use a 64-bit hash key instead to prevent the poor performance for trees deeper than 32 levels. More likely, you need to rethink your solution further back.
I'm sure this isn't the best hash map purely from a CPU performance perspective. My use cases are on very memory-constrained systems where I often already have native data structures for cons pairs (which can be used to build the tree).
For example, in one system I have a tiny embedded scripting language that has exactly one native data structure: the cons pair. It's a fixed tuple of length two that can hold two arbitrary values. Using this, I can construct the nodes out of nested pairs (sketched below).
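One possible encoding (a sketch; the cons/car/cdr C API and the value type are placeholders for whatever the VM really exposes) packs each node into three pairs, ((key . value) . (left . right)):

typedef void *value;                      /* placeholder for the VM's tagged value type */
extern value cons(value a, value d);      /* assumed VM API, names hypothetical */
extern value car(value pair);
extern value cdr(value pair);
#define NIL ((value)0)

/* node = ((key . value) . (left . right)) */
value node_new(value key, value val) { return cons(cons(key, val), cons(NIL, NIL)); }
value node_key(value n)              { return car(car(n)); }
value node_value(value n)            { return cdr(car(n)); }
value node_left(value n)             { return car(cdr(n)); }
value node_right(value n)            { return cdr(cdr(n)); }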
Since all the garbage-collected data structures are based on this tuple type, I can use very compact heap space with fixed-width cells and simple GC tracking (usually mark-sweep).
I write high-level functions in C that know how to implement this algorithm on top of the simple cons-pair system, to reduce the number of jumps between script and VM space. The special top node I wrote in C can be done in these cons pairs as well (sketched below).
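Continuing the sketch above, the top node is just one more pair, (count . root), with the count stored as a tagged integer (fixnum() is another placeholder):

extern value fixnum(long n);          /* assumed tagged-integer constructor */

/* tree = (count . root) */
value tree_new(void)      { return cons(fixnum(0), NIL); }
value tree_count(value t) { return car(t); }
value tree_root(value t)  { return cdr(t); }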
Then, if I can fit in a few bits to tag the tuples with types, this can be typed as a hash map, and the VM won't expose the internals to the script by default. It will look like a powerful new primitive, but really be built out of simple primitives under the hood.