Skip to content

Instantly share code, notes, and snippets.

Last active Jul 28, 2022
What would you like to do?
// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions,
void grow(Table *table) {
u64 exp = 64 - table->shift;
// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
u64 *keys = table->keys - (1ull << exp);
assert(keys >= table->start);
// We rely on "MSB hashing" where the home slot index corresponding to a hash value k is k >> shift. By decrementing
// the shift amount we're exposing a new LSB in the slot index but the MSBs of the index are the same but shifted.
// The net result of this is that items in the table will spread out but stay sorted.
u64 shift = --table->shift;
// The scatter-based rehashing assumes the lower part of the table is pre-cleared to EMPTY. A proper implementation
// should probably do blocked clearing in the grow function itself (outside of the branchless inner loop) in
// blocks sized as a fraction of the L1 cache.
for (u64 *src = table->keys, *dest = keys; src != table->end; src++, dest++) {
u64 k = *src;
*src = EMPTY;
// The conditional is unnecessary if EMPTY = 0 but EMPTY = ~0 is useful as a search sentinel for Robin Hood.
// An alternative benefit of using EMPTY = 0 is that you can rely on the kernel's virtual memory manager to
// pre-zero pages which synergizes with reserving a worst-case virtual memory range and growing in place.
// That eats more memory bandwidth (vs blocked clearing which only operates on soon-to-be-hot cache lines)
// but a lot of it will be done asynchronously so the pre-zeroing bandwidth is spread out over time. And
// if you're already getting pre-zeroed pages directly from the kernel (as opposed to recycling memory in
// user space) you might as well take advantage of it. Still, you can memset an L1 cache line (64 bytes)
// per cycle with AVX2 instructions nowadays, so a nonzero EMPTY sentinel is nearly free if you do
// blocked clearing--simplifying the search loop is worth it.
u64 i = k != EMPTY ? k >> shift : 0; // Branchless
u64 *p = &keys[i];
// With Robin Hood linear probing we already have the items in the right order and we just need to leave
// gaps in the right places. Two cases: Either append to the current chain or jump the gap to the home slot.
dest = p >= dest ? p : dest; // Branchless
*dest = k;
table->keys = keys;
// By the way, the out-of-place version of this algorithm should be ideal for multi-threaded rehashing. Contiguous
// source ranges map to contiguous destination ranges, so you can use the "split evenly and sync across the boundary"
// strategy to assign independent source/destination ranges to workers. In a concurrent hash table this would typically
// be done via a helper scheme where threads that want to insert while a rehash is ongoing will help finish the job
// by grabbing some outstanding ranges to work on. This can be made wait-free if you use fetch-and-add to grab work;
// instead of suffering the usual amortization spike on one thread, you're spreading it out across multiple threads,
// so no inserter thread can be starved as could be the case in a merely lock-free algorithm. Unfortunately, while
// rehashing a Robin Hood table looks to be ideally suited for multi threading, the same cannot be said for sorted
// inserts. With classical linear probing you can do an insert with a compare-and-swap into the first empty slot at
// the end of a probe chain. Robin Hood inserts require you to shift slots in a chain to make room. It's doable but
// it's inherently harder and more costly (I think you'd need to shift from the end of the chain and leave temporary
// duplicates with a flag bit). But the good news is that this style of multi-threaded contiguous range rehashing should
// be adaptable to classical linear probing with MSB hashing. You don't have the same strong monotonicity invariant
// as with Robin Hood linear probing but you do have monotonicity across gap-separated chains (i.e. items in different
// chains are ordered correctly even if items within the same chain may be out of order), which should be enough
// for the range partitioning.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment