@pervognsen
Last active March 12, 2024 20:23
// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

// Declarations inferred from the usage below (the gist only shows grow()):
// the table reserves a worst-case range [start, end); keys points at the
// first slot of the currently active table and moves down toward start on
// each in-place grow.
#define EMPTY (~0ull) // all-ones sentinel; see the comments in the loop

typedef struct Table {
    u64 *start; // base of the reserved worst-case allocation
    u64 *keys;  // active slot array; home slot of hash k is keys[k >> shift]
    u64 *end;   // one past the last slot, including the overflow area
    u64 shift;  // the table has 2^(64 - shift) home slots
} Table;

void grow(Table *table) {
    u64 exp = 64 - table->shift;
    // We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
    u64 *keys = table->keys - (1ull << exp);
    assert(keys >= table->start);
    // We rely on "MSB hashing" where the home slot index corresponding to a hash value k is k >> shift. Decrementing
    // the shift amount exposes a new LSB in the slot index; the upper bits of the new index are just the old index.
    // The net result is that items in the table spread out but stay sorted.
    u64 shift = --table->shift;
    // The scatter-based rehashing assumes the lower part of the table is pre-cleared to EMPTY. A proper implementation
    // should probably do blocked clearing in the grow function itself (outside of the branchless inner loop) in
    // blocks sized as a fraction of the L1 cache.
    for (u64 *src = table->keys, *dest = keys; src != table->end; src++, dest++) {
        u64 k = *src;
        *src = EMPTY;
        // The conditional is unnecessary if EMPTY = 0 but EMPTY = ~0 is useful as a search sentinel for Robin Hood.
        // An alternative benefit of using EMPTY = 0 is that you can rely on the kernel's virtual memory manager to
        // pre-zero pages which synergizes with reserving a worst-case virtual memory range and growing in place.
        // That eats more memory bandwidth (vs blocked clearing which only operates on soon-to-be-hot cache lines)
        // but a lot of it will be done asynchronously so the pre-zeroing bandwidth is spread out over time. And
        // if you're already getting pre-zeroed pages directly from the kernel (as opposed to recycling memory in
        // user space) you might as well take advantage of it. Still, you can memset an L1 cache line (64 bytes)
        // per cycle with AVX2 instructions nowadays, so a nonzero EMPTY sentinel is nearly free if you do
        // blocked clearing--simplifying the search loop is worth it.
        u64 i = k != EMPTY ? k >> shift : 0; // Branchless
        u64 *p = &keys[i];
        // With Robin Hood linear probing we already have the items in the right order and we just need to leave
        // gaps in the right places. Two cases: Either append to the current chain or jump the gap to the home slot.
        dest = p >= dest ? p : dest; // Branchless
        *dest = k;
    }
    table->keys = keys;
    // By the way, the out-of-place version of this algorithm should be ideal for multi-threaded rehashing. Contiguous
    // source ranges map to contiguous destination ranges, so you can use the "split evenly and sync across the boundary"
    // strategy to assign independent source/destination ranges to workers. In a concurrent hash table this would typically
    // be done via a helper scheme where threads that want to insert while a rehash is ongoing will help finish the job
    // by grabbing some outstanding ranges to work on. This can be made wait-free if you use fetch-and-add to grab work;
    // instead of suffering the usual amortization spike on one thread, you're spreading it out across multiple threads,
    // so no inserter thread can be starved as could be the case in a merely lock-free algorithm. Unfortunately, while
    // rehashing a Robin Hood table looks to be ideally suited for multithreading, the same cannot be said for sorted
    // inserts. With classical linear probing you can do an insert with a compare-and-swap into the first empty slot at
    // the end of a probe chain. Robin Hood inserts require you to shift slots in a chain to make room. It's doable but
    // it's inherently harder and more costly (I think you'd need to shift from the end of the chain and leave temporary
    // duplicates with a flag bit). But the good news is that this style of multi-threaded contiguous range rehashing should
    // be adaptable to classical linear probing with MSB hashing. You don't have the same strong monotonicity invariant
    // as with Robin Hood linear probing but you do have monotonicity across gap-separated chains (i.e. items in different
    // chains are ordered correctly even if items within the same chain may be out of order), which should be enough
    // for the range partitioning.
}