Per Vognsen pervognsen

## simd_bucket_hash_discrim.c
// The two sweetspots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
// a cacheline-sized bucket and still have one leftover byte for metadata.

// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag

## mark_coalesce_gc.c
// This is another take on the mark-compact collector from https://gist.github.com/pervognsen/7fe51bef8977cb249ac4c6f830f818a5
// To avoid having to do global compaction, our object indices will have two parts: a block index and a block offset.
// Within a block we have the same linear older-to-newer ordering by offset. But now blocks are allowed to have different ages.
// The block ages are defined by their position in a linked list: There's oldest_block and newest_block indices and then
// previous_block[block_index] for each block. This enables newest-to-oldest block iteration and the linked-list structure
// means that we can free an empty block by unlinking it. When a block is reused, it becomes the newest_block. Now, instead
// of only compacting within a block we will actually be coalescing across an age range of blocks. By doing so, we will usually
// be able to empty out entire blocks from the newer part of that age range, so they can be reused. This should have very similar
// performance characteri

## linear_gc.c
// Linear-scan mark-compact collector for causal data structures, i.e. the reference graph is acyclic and the objects
// are ordered by age in the heap (which happens if you use a linear allocator) so they are automatically topologically sorted.
// This algorithm has very high memory-level parallelism which can be exploited by modern out-of-order processors, and should
// be memory throughput limited.
void collect(void) {
    // Initialize marks from roots.
    memset(marks, 0, num_nodes * sizeof(uint32_t));
    int newest = 0, oldest = num_nodes;
    for (int i = 0; i < num_roots; i++) {
        marks[roots[i]] = 1;

## self_validating_handles.c
struct Object {
    Key key; // The key is any piece of data that uniquely identifies the object.
    // ...
};

struct Handle {
    Key key;
    Index index; // This caches a speculative table index for an object with the corresponding key.
};

## graph_separators.txt
A grid graph is a directed graph where the vertices are arranged in k rows each
containing k vertices, where k is a positive integer. Let v[i,j] denote the jth
vertex in the ith row. Let s = v[1,1]. A grid graph has the following vertices and edges:

V = {v[i,j] | 1 <= i <= k and 1 <= j <= k}

E = {(v[i,j], v[i,j+1]) | 1 <= i <= k and 1 <= j < k} union
    {(v[i,j], v[i,j-1]) | 1 <= i <= k and 1 < j <= k} union
    {(v[i,j], v[i+1,j]) | 1 <= i < k and 1 <= j <= k}

## add_tail_call.c
ValueRef add(ValueRef left, ValueRef right) {
    // Canonical commutation: x + y => y + x (constants have negative pos and move to the left)
    if (left.pos > right.pos) SWAP(left, right);
    switch (left.op) {
    case NEG:
        // Strength reduction: (-x) + y => y - x
        return sub(right, getleft(left));
    case NUM:
        // Strength reduction: 0 + x => x
        if (getnum(left) == 0) return right;

## perfold.c
// Heavily based on ideas from https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_opt_fold.c
// The most fundamental deviation is that I eschew the big hash table and the lj_opt_fold()
// trampoline for direct tail calls. The biggest problem with a trampoline is that you lose
// the control flow context. Another problem is that there's too much short-term round-tripping
// of data through memory. It's also easier to do ad-hoc sharing between rules with my approach.
// From what I can tell, it also isn't possible to do general reassociation with LJ's fold engine
// since that requires non-tail recursion, so LJ does cases like (x + n1) + n2 => x + (n1 + n2)
// but not (x + n1) + (y + n2) => x + (y + (n1 + n2)) which is common in address generation. The
// code below has some not-so-obvious micro-optimizations for register passing and calling conventions,
// e.g. the unary_cse/binary_cse parameter order, the use of long fields in ValueRef.

## compress_hash_displace.py
# Hash, displace, and compress: http://cmph.sourceforge.net/papers/esa09.pdf
# This is expected linear time for any seeded hash function that acts like a random hash function (universality isn't enough).
# (Actually, the code as written is O(n log n) when targeting 100% load. It's O(n) when targeting any smaller load factor.)
# You can make keys_per_bucket higher than the default of 4 but construction time will start to increase dramatically.
# The paper this is based on compresses the seeds (so the fact that the algorithm tries seeds in increasing order is important)
# which brings the representation size close to the information-theoretical minimum. I don't do any of that here, but it could
# be done as a postprocess.
def make_perfect_hash(keys, load_factor=1.0, keys_per_bucket=4, rhash=murmurhash, max_seed=1000000):
    m = int(len(keys) / load_factor)
    r = int(len(keys) / keys_per_bucket)

## clang_arm64_repro.c
// The buffer is terminated with "\n\0"
int count(const char *p) {
    int n = 0;
    while (*p) {
        while (*p != '\n') p++;
        n++;
        while (*p == '\n') p++;
    }
    return n;
}

## perfect_hash_search.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

const char *keywords[] = {
    "completed_in",
    "contributors",
    "contributors_enabled",
    "coordinates",
	// The two sweetspots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
	// a cacheline-sized bucket and still have one leftover byte for metadata.

	// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
	// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
	// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
	// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
	// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
	// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
	// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag
	// This is another take on the mark-compact collector from https://gist.github.com/pervognsen/7fe51bef8977cb249ac4c6f830f818a5
	// To avoid having to do global compaction, our object indices will have two parts: a block index and a block offset.
	// Within a block we have the same linear older-to-newer ordering by offset. But now blocks are allowed to have different ages.
	// The block ages are defined by their position in a linked list: There's oldest_block and newest_block indices and then
	// previous_block[block_index] for each block. This enables newest-to-oldest block iteration and the linked-list structure
	// means that we can free an empty block by unlinking it. When a block is reused, it becomes the newest_block. Now, instead
	// of only compacting within a block we will actually be coalescing across an age range of blocks. By doing so, we will usually
	// be able to empty out entire blocks from the newer part of that age range, so they can be reused. This should have very similar
	// performance characteri
	// Linear-scan mark-compact collector for causal data structures, i.e. the reference graph is acyclic and the objects
	// are ordered by age in the heap (which happens if you use a linear allocator) so they are automatically topologically sorted.
	// This algorithm has very high memory-level parallelism which can be exploited by modern out-of-order processors, and should
	// be memory throughput limited.
	void collect(void) {
	// Initialize marks from roots.
	memset(marks, 0, num_nodes * sizeof(uint32_t));
	int newest = 0, oldest = num_nodes;
	for (int i = 0; i < num_roots; i++) {
	marks[roots[i]] = 1;
	struct Object {
	Key key; // The key is any piece of data that uniquely identifies the object.
	// ...
	};

	struct Handle {
	Key key;
	Index index; // This caches a speculative table index for an object with the corresponding key.
	};
	A grid graph is a directed graph where the vertices are arranged in k rows each
	containing k vertices, where k is a positive integer. Let v[i,j] denote the jth
	vertex in the ith row. Let s = v[1,1]. A grid graph has the following vertices and edges:

	V = {v[i,j] \| 1 <= i <= k and 1 <= j <= k}

	E = {(v[i,j], v[i,j+1]) \| 1 <= i <= k and 1 <= j < k} union
	{(v[i,j], v[i,j-1]) \| 1 <= i <= k and 1 < j <= k} union
	{(v[i,j], v[i+1,j]) \| 1 <= i < k and 1 <= j <= k}
	ValueRef add(ValueRef left, ValueRef right) {
	// Canonical commutation: x + y => y + x (constants have negative pos and move to the left)
	if (left.pos > right.pos) SWAP(left, right);
	switch (left.op) {
	case NEG:
	// Strength reduction: (-x) + y => y - x
	return sub(right, getleft(left));
	case NUM:
	// Strength reduction: 0 + x => x
	if (getnum(left) == 0) return right;
	// Heavily based on ideas from https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_opt_fold.c
	// The most fundamental deviation is that I eschew the big hash table and the lj_opt_fold()
	// trampoline for direct tail calls. The biggest problem with a trampoline is that you lose
	// the control flow context. Another problem is that there's too much short-term round-tripping
	// of data through memory. It's also easier to do ad-hoc sharing between rules with my approach.
	// From what I can tell, it also isn't possible to do general reassociation with LJ's fold engine
	// since that requires non-tail recursion, so LJ does cases like (x + n1) + n2 => x + (n1 + n2)
	// but not (x + n1) + (y + n2) => x + (y + (n1 + n2)) which is common in address generation. The
	// code below has some not-so-obvious micro-optimizations for register passing and calling conventions,
	// e.g. the unary_cse/binary_cse parameter order, the use of long fields in ValueRef.
	# Hash, displace, and compress: http://cmph.sourceforge.net/papers/esa09.pdf
	# This is expected linear time for any seeded hash function that acts like a random hash function (universality isn't enough).
	# (Actually, the code as written is O(n log n) when targeting 100% load. It's O(n) when targeting any smaller load factor.)
	# You can make keys_per_bucket higher than the default of 4 but construction time will start to increase dramatically.
	# The paper this is based on compresses the seeds (so the fact that the algorithm tries seeds in increasing order is important)
	# which brings the representation size close to the information-theoretical minimum. I don't do any of that here, but it could
	# be done as a postprocess.
	def make_perfect_hash(keys, load_factor=1.0, keys_per_bucket=4, rhash=murmurhash, max_seed=1000000):
	m = int(len(keys) / load_factor)
	r = int(len(keys) / keys_per_bucket)
	// The buffer is terminated with "\n\0"
	int count(const char *p) {
	int n = 0;
	while (*p) {
	while (*p != '\n') p++;
	n++;
	while (*p == '\n') p++;
	}
	return n;
	}
	#include <stdlib.h>
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>

	const char *keywords[] = {
	"completed_in",
	"contributors",
	"contributors_enabled",
	"coordinates",