Multi-dimensional array views for systems programmers

As C programmers, most of us think of pointer arithmetic for multi-dimensional arrays in a nested way:

The address for a 1-dimensional array is base + x. The address for a 2-dimensional array is base + x + y*x_size for row-major layout and base + y + x*y_size for column-major layout. The address for a 3-dimensional array is base + x + (y + z*y_size)*x_size for row-column-major layout (x varies fastest, z slowest). And so on.
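
The nesting generalizes cleanly if you keep one stride per dimension and make indexing a dot product of indices and strides. Here is a minimal sketch of such a view; the type and field names are illustrative, not from the note above.

#include <stddef.h>

// A strided 3-dimensional view. For a dense row-major array with x varying fastest,
// stride_x = 1, stride_y = x_size, and stride_z = x_size * y_size (all in elements).
typedef struct {
    float *base;
    ptrdiff_t stride_x, stride_y, stride_z;
} View3;

static inline float *view3_at(View3 v, ptrdiff_t x, ptrdiff_t y, ptrdiff_t z) {
    return v.base + x * v.stride_x + y * v.stride_y + z * v.stride_z;
}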

struct Object {
Key key; // The key is any piece of data that uniquely identifies the object.
// ...
};
struct Handle {
Key key;
Index index; // This caches a speculative table index for an object with the corresponding key.
};
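// A minimal sketch of the speculative lookup the cached index enables. It assumes Key compares with ==
// and that objects live in a flat array; lookup_by_key is a hypothetical slow-path helper, not part of the above.
struct Object *handle_lookup(struct Handle *handle, struct Object *objects, Index num_objects) {
    Index i = handle->index;
    if (i < num_objects && objects[i].key == handle->key) {
        return &objects[i]; // fast path: the cached index still points at the right object
    }
    // Slow path: the cached index is stale; redo the lookup by key and refresh the cache.
    i = lookup_by_key(objects, num_objects, handle->key); // hypothetical; returns num_objects on a miss
    handle->index = i;
    return i < num_objects ? &objects[i] : 0;
}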
// x64 encoding
enum Reg {
RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
R8, R9, R10, R11, R12, R13, R14, R15,
};
enum XmmReg {
XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7,
XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15,
};
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
typedef struct ion_type_t ion_type_t;
typedef struct ion_value_t ion_value_t;
typedef struct ion_state_t ion_state_t;
typedef struct ion_instruction_t ion_instruction_t;
typedef void (*ion_operation_t)(ion_state_t *, ion_value_t, ion_value_t, ion_value_t *);
// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
void grow(Table *table) {
u64 exp = 64 - table->shift;
// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
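// Illustration only: a minimal lookup under the sorted-gap-array view described above. It assumes the
// table stores 64-bit keys in table->keys with 0 as the empty sentinel; the real layout may differ.
// The ideal slot is key >> shift (MSB bucketing), keys are sorted along each contiguous run, and probes
// run past the nominal end into the overflow area instead of wrapping.
u64 *lookup(Table *table, u64 key) {
    for (u64 *slot = table->keys + (key >> table->shift); ; slot++) {
        if (*slot == key) return slot;           // found it
        if (*slot == 0 || *slot > key) return 0; // hit a gap or a larger key: it can't be further right
    }
}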
void gen_mul(Value *dest, Value *src) {
if (isconst(dest)) {
gen_swap(dest, src); // this doesn't generate instructions and just swaps descriptors ("value renaming")
}
val_to_reg(dest); // the post-condition is that dest is allocated to a register
if (isconst(src) && isimm32(src->ival)) {
if (src->ival == 0) {
int_to_val(dest, 0);
} else if (src->ival == 1) {
// do nothing
typedef struct {
// These fields are internal state and not considered part of the public interface.
Node *parent;
int next_index;
// This field is public and valid to read after a call to next that returns true.
Node *child;
} Iter;
Iter iter_children(Node *node) {
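// Sketch of the intended usage. next isn't shown in this excerpt; it's assumed to advance the iterator
// and return nonzero while another child exists, and visit stands in for whatever you do per child.
void visit_children(Node *node) {
    for (Iter it = iter_children(node); next(&it); ) {
        visit(it.child); // child is valid to read here because next returned true
    }
}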
struct AbstractMatrix {
int m; // number of rows
int n; // number of columns
// Pack block at ib, jb of size mb, nb into dest in row-major format.
virtual void pack_rowmajor(int ib, int jb, int mb, int nb, float *dest) const = 0;
// Unpack row-major matrix from src into block at ib, jb of size mb, nb.
virtual void unpack_rowmajor(int ib, int jb, int mb, int nb, const float *src) = 0;
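// A plain-C illustration of what a pack_rowmajor implementation typically does for a dense matrix stored
// with a leading dimension; the StridedMatrix type is an assumption, not part of the interface above.
typedef struct {
    float *data;
    int m, n;    // rows, columns
    int stride;  // elements between the starts of consecutive rows (>= n)
} StridedMatrix;

// Copy the mb-by-nb block with top-left corner (ib, jb) into dest, row by row, so dest
// ends up as a contiguous row-major mb x nb matrix.
void pack_rowmajor(const StridedMatrix *a, int ib, int jb, int mb, int nb, float *dest) {
    for (int i = 0; i < mb; i++) {
        for (int j = 0; j < nb; j++) {
            dest[i * nb + j] = a->data[(ib + i) * a->stride + (jb + j)];
        }
    }
}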
// The two sweetspots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
// a cacheline-sized bucket and still have one leftover byte for metadata.
// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag
// the same scheme scales down: 0 for an empty slot, 1..14 for exact lengths, and 15 for length 15 or longer.
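// A minimal sketch of the tag-probe datapath for 8-bit tags in the style described above; the bucket
// layout and names are illustrative, not taken from Swiss tables or F14.
#include <emmintrin.h> // SSE2
#include <stdint.h>

// One cacheline-sized bucket: 16 one-byte tags followed by the remaining 24 bits of each 32-bit key.
typedef struct {
    uint8_t tags[16];
    uint8_t rest[16][3];
} Bucket;

// Returns a bitmask with bit i set if tags[i] matches tag; the caller then verifies the remaining
// key bits only for the (usually zero or one) set bits.
static inline uint32_t match_tags(const Bucket *bucket, uint8_t tag) {
    __m128i tags = _mm_loadu_si128((const __m128i *)bucket->tags);
    __m128i eq = _mm_cmpeq_epi8(tags, _mm_set1_epi8((char)tag));
    return (uint32_t)_mm_movemask_epi8(eq);
}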
// Length-segregated string tables for length < 16. You use a separate overflow table for length >= 16.
// By segregating like this you can pack the string data in the table itself tightly without any padding. The datapath
// is uniform and efficient for all lengths < 16 by using unaligned 16-byte SIMD loads/compares and masking off the length prefix.
// One of the benefits of packing string data tightly for each length table is that you can afford to reduce the load factor
// on shorter length tables without hurting space utilization too much. This can push hole-in-one rates into the 95% range without
// too much of a negative impact on cache utilization.
// Since get() takes a vector register as an argument with the key, you want to shape the upstream code so the string to be queried
// is naturally in a vector. For example, in an optimized identifier lexer you should already have a SIMD fast path
// for identifiers of length < 16, so the key is sitting in a vector register by the time you call get().
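// A sketch of the uniform length < 16 compare described above: one unaligned 16-byte load of the slot,
// a bytewise compare against the query vector, and a mask that only requires the first len lanes to match.
// Names are illustrative; the real get() in this design takes the key already in a vector register.
#include <emmintrin.h>
#include <stdint.h>

static inline int equal_prefix(__m128i query, const char *slot, uint32_t len) {
    __m128i data = _mm_loadu_si128((const __m128i *)slot);
    uint32_t eq = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(query, data));
    uint32_t mask = (1u << len) - 1; // low len bits, len < 16
    return (eq & mask) == mask;      // ignore whatever neighboring bytes the load dragged in
}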