@creationix
Last active April 2, 2024 20:16
I've come up with an interesting design for a hash map that, instead of bucket lists, recursively embeds more hash maps, with a new key for each level.

Each hash map is super simple, with two slots and 1-bit hash keys. The original key is first hashed to a 32-bit integer, and each bit in that number serves as the key for one level of the recursive hash map.

In C, my structure looks like:

struct hash_tree_node {
  void *key;
  void *value;
  struct hash_tree_node *left;
  struct hash_tree_node *right;
};

In other words, this is shaped exactly like a binary tree.

Algorithm for setting a value, starting at the root node:

  • If the slot is free or its key matches ours, use it.
  • Otherwise:
    • If the hash ends in 1, recurse on the left child (shift the hash right 1 bit to expose the new final bit).
    • Else recurse on the right child (shifting the hash the same way).
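To make the steps above concrete, here's a minimal sketch in C. The gist doesn't specify a hash function or allocator, so `hash_key` is a hypothetical integer mixer and nodes come from `calloc`; any good 32-bit hash would do.

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

struct hash_tree_node {
  void *key;
  void *value;
  struct hash_tree_node *left;
  struct hash_tree_node *right;
};

/* Hypothetical 32-bit integer hash (not from the gist); any good
 * bit mixer over the key works. */
static uint32_t hash_key(void *key) {
  uint32_t h = (uint32_t)(uintptr_t)key;
  h ^= h >> 16;
  h *= 0x45d9f3bu;
  h ^= h >> 16;
  return h;
}

/* Set: walk the pseudo-random path given by the hash bits until a
 * free or matching slot turns up, allocating nodes as needed. */
static void hash_set(struct hash_tree_node **slot, void *key, void *value) {
  uint32_t bits = hash_key(key);
  for (;;) {
    struct hash_tree_node *node = *slot;
    if (!node) {
      node = calloc(1, sizeof(*node));
      *slot = node;
    }
    if (!node->key || node->key == key) {
      node->key = key;
      node->value = value;
      return;
    }
    /* Low bit chooses the child; shift right to expose the next bit. */
    slot = (bits & 1) ? &node->left : &node->right;
    bits >>= 1;
  }
}
```

Note that the loop carries a pointer-to-slot rather than a pointer-to-node, so the same code handles allocating the root and allocating a missing child.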

As you can see, finding a free slot is O(log n). Each hash key gives you a unique pseudo-random search path to traverse the binary tree.

I optimize a little with a special top node that records a running total of all value nodes for quick length checks.

struct hash_tree {
  struct hash_tree_node *root;
  uintptr_t count;
};

Reading a value from the hash also costs O(log n), as we must walk the tree using the hashed bits.

Deleting a value has similar cost. Once found, we simply set the key to null, and the set algorithm knows it can reuse this slot instead of looking deeper.
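Lookup and this simple delete can be sketched under the same assumptions (the stand-in `hash_key` mixer from above, which must match the one used when setting):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

struct hash_tree_node {
  void *key;
  void *value;
  struct hash_tree_node *left;
  struct hash_tree_node *right;
};

/* Stand-in 32-bit integer hash; must match the one used by set. */
static uint32_t hash_key(void *key) {
  uint32_t h = (uint32_t)(uintptr_t)key;
  h ^= h >> 16;
  h *= 0x45d9f3bu;
  h ^= h >> 16;
  return h;
}

/* Get: follow the same bit path the set algorithm used. */
static void *hash_get(struct hash_tree_node *node, void *key) {
  uint32_t bits = hash_key(key);
  while (node) {
    if (node->key == key) return node->value;
    node = (bits & 1) ? node->left : node->right;
    bits >>= 1;
  }
  return NULL;
}

/* Delete: find the node the same way, then clear its key so the
 * set algorithm can reuse the slot. Returns 1 if the key was found. */
static int hash_delete(struct hash_tree_node *node, void *key) {
  uint32_t bits = hash_key(key);
  while (node) {
    if (node->key == key) {
      node->key = NULL;
      node->value = NULL;
      return 1;
    }
    node = (bits & 1) ? node->left : node->right;
    bits >>= 1;
  }
  return 0;
}
```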

Iterating over the tree can be done without recursion or a stack by using Morris traversal.


Is there a name for this structure?

How does it compare to traditional hash maps and balanced binary trees?

Note that while this is shaped like a binary tree, there is no balancing. Nodes can be empty and reused later. Assuming a good integer hash for the bitfield used for the path, the tree should naturally stay balanced.

Since my bit path is 32 bits long, if you happened to store more than 4 billion unique keys in this, the search paths would degenerate into linear chains past depth 32. But there will be 4,294,967,296 of those chains, so it shouldn't matter unless you went way past this value.

Of course, if you're storing that many keys in my simple hash/tree, you're probably using it wrong. You could simply use a 64-bit hash instead to prevent the poor performance for trees deeper than 32 levels. More likely, you need to rethink your solution further back.

@creationix
Author

I'm sure this isn't the best hash map purely from a CPU performance perspective. My use cases are on very memory constrained systems where I often already have native data structures for cons pairs (which can be used to build the tree).

For example, in one system, I have a tiny embedded scripting language that has exactly one native data structure. The cons pair. It's a fixed tuple of length two that can hold two arbitrary values. Using this I can construct these nodes in the following manner:

node = ((key,value),(left,right))

Since all the garbage collected data structures are based on this tuple type, I can use very compact heap space with fixed width and simple GC tracking (usually using mark-sweep).

I write high-level functions in C that know how to implement this algorithm on top of the simple cons pair system to reduce the number of jumps between script and vm space. The special top node I wrote in C can be done in these cons pairs as:

hash = (len, root)

Then if I can fit in a few bits to tag the tuples with types, this can be typed as a hashmap and the vm won't expose the internals to the script by default. It will look like a new powerful primitive, but really be built out of simple primitives under the hood.

@creationix
Author

When I say this is compact, I mean it's really compact. For microcontrollers with between 5 and 100 KB of free RAM, my tuples are often just 4 bytes each (including allocation metadata, type tags, GC flags, etc). Each value added to the hash adds between 2 and 3 nodes (8-12 bytes).

Compare with Lua, where an empty hash table is around 40 bytes, plus 40 bytes for each value index (allocated in powers of 2). (If your indexes are only integers, it's only 16 bytes each, but still allocated in powers of 2.)

@nojvek

nojvek commented Apr 20, 2020

If I understand correctly, your hashmap is basically a trie of keys, right?

so you have key = hash(val), and key is a uint32

@creationix
Author

I believe so. Not sure I understood the trie concept back when I wrote this.

@dragoncoder047

For example, in one system, I have a tiny embedded scripting language that has exactly one native data structure. The cons pair. It's a fixed tuple of length two that can hold two arbitrary values. Using this I can construct these nodes in the following manner:

node = ((key,value),(left,right))

I found this -- very smart idea. I would implement it the same way with cons pairs.

However, there is a problem when you want to delete elements. Picture an empty hashmap: the first element you set always lands in the root node. Now add another element with hash X, delete the root element (by setting its key and value to nil), and then insert X again. X gets placed into the now-free root node, but X's original node still exists deeper in the tree. The search algorithm always finds the root copy first, so this shadow node is invisible and doesn't appear to be a problem right now. Except when you later try to delete X by finding its node and setting key and value to nil: the search algorithm then falls through to the original X node, so X appears to still exist even though it was supposed to be deleted.

So, when either setting or deleting an element from the map, add this: you cannot return as soon as the first target node is found, because there may be shadow nodes deeper along the hash search path, and these need to be cleared as well.
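The fix described here might look like the following sketch (`hash_key` is the same stand-in mixer assumed earlier, not code from the gist): instead of returning at the first match, keep walking the full hash path and clear every node holding the key.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

struct hash_tree_node {
  void *key;
  void *value;
  struct hash_tree_node *left;
  struct hash_tree_node *right;
};

/* Stand-in 32-bit integer hash, as elsewhere in this sketch. */
static uint32_t hash_key(void *key) {
  uint32_t h = (uint32_t)(uintptr_t)key;
  h ^= h >> 16;
  h *= 0x45d9f3bu;
  h ^= h >> 16;
  return h;
}

/* Shadow-aware delete: clear every node on the key's hash path that
 * holds the key, since a reinserted key can leave a stale copy
 * deeper down after a slot is freed and reused. */
static void hash_delete_all(struct hash_tree_node *node, void *key) {
  uint32_t bits = hash_key(key);
  while (node) {
    if (node->key == key) {
      node->key = NULL;
      node->value = NULL;
      /* no early return: shadow copies may sit deeper on this path */
    }
    node = (bits & 1) ? node->left : node->right;
    bits >>= 1;
  }
}
```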
