Per Vognsen pervognsen

## multidimensional_array_views.md

      
              1 file
            
          
              0 forks
            
          
              1 comment
            
          
              39 stars
            
          
                pervognsen
                / multidimensional_array_views.md
            
            
              Last active
              March 24, 2024 02:09
            
          
    Multi-dimensional array views for systems programmers

As C programmers, most of us think of pointer arithmetic for multi-dimensional arrays in a nested way:
The address for a 1-dimensional array is base + x.
The address for a 2-dimensional array is base + x + y*x_size for row-major layout and base + y + x*y_size for column-major layout.
The address for a 3-dimensional array is base + x + (y + z*y_size)*x_size for row-column-major layout.
And so on.

  
## self_validating_handles.c
struct Object {
    Key key; // The key is any piece of data that uniquely identifies the object.
    // ...
};

struct Handle {
    Key key;
    Index index; // This caches a speculative table index for an object with the corresponding key.
};

## asm_x64.c
// x64 encoding

enum Reg {
    RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
    R8,  R9,  R10, R11, R12, R13, R14, R15,
};

enum XmmReg {
    XMM0, XMM1, XMM2,  XMM3,  XMM4,  XMM5,  XMM6,  XMM7,
    XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15,

## ion.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct ion_type_t ion_type_t;
typedef struct ion_value_t ion_value_t;
typedef struct ion_state_t ion_state_t;
typedef struct ion_instruction_t ion_instruction_t;

typedef void (*ion_operation_t)(ion_state_t *, ion_value_t, ion_value_t, ion_value_t *);

## rh_grow.c
// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
void grow(Table *table) {
	u64 exp = 64 - table->shift;
	// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).

## gen_mul.c
void gen_mul(Value *dest, Value *src) {
    if (isconst(dest)) {
        gen_swap(dest, src); // this doesn't generate instructions and just swaps descriptors ("value renaming")
    }
    val_to_reg(dest); // the post-condition is that dest is allocated to a register
    if (isconst(src) && isimm32(src->ival)) {
        if (src->ival == 0) {
            int_to_val(dest, 0);
        } else if (src->ival == 1) {
            // do nothing

## nodeiter.c
typedef struct {
    // These fields are internal state and not considered part of the public interface.
    Node *parent;
    int next_index;

    // This field is public and valid to read after a call to next that returns true.
    Node *child;
} Iter;

Iter iter_children(Node *node) {

## abstract_matrix.cpp
struct AbstractMatrix {
    int m; // number of rows
    int n; // number of columns

    // Pack block at ib, jb of size mb, nb into dest in row-major format.
    virtual void pack_rowmajor(int ib, int jb, int mb, int nb, float *dest) const = 0;

    // Unpack row-major matrix from src into block at ib, jb of size mb, nb.
    virtual void unpack_rowmajor(int ib, int jb, int mb, int nb, const float *src) = 0;


## simd_bucket_hash_discrim.c
// The two sweetspots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
// a cacheline-sized bucket and still have one leftover byte for metadata.

// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag

## segregated_tables.c
// Length-segregated string tables for length < 16. You use a separate overflow table for length >= 16.
// By segregating like this you can pack the string data in the table itself tightly without any padding. The datapath
// is uniform and efficient for all lengths < 16 by using unaligned 16-byte SIMD loads/compares and masking off the length prefix.

// One of the benefits of packing string data tightly for each length table is that you can afford to reduce the load factor
// on shorter length tables without hurting space utilization too much. This can push hole-in-one rates into the 95% range without
// too much of a negative impact on cache utilization.

// Since get() takes a vector register as an argument with the key, you want to shape the upstream code so the string to be queried
// is naturally in a vector. For example, in an optimized identifier lexer you should already have a SIMD fast path for length < 16
	struct Object {
	Key key; // The key is any piece of data that uniquely identifies the object.
	// ...
	};

	struct Handle {
	Key key;
	Index index; // This caches a speculative table index for an object with the corresponding key.
	};
	// x64 encoding

	enum Reg {
	RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
	R8, R9, R10, R11, R12, R13, R14, R15,
	};

	enum XmmReg {
	XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7,
	XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15,
	#include <stdio.h>
	#include <stdint.h>
	#include <stdlib.h>

	typedef struct ion_type_t ion_type_t;
	typedef struct ion_value_t ion_value_t;
	typedef struct ion_state_t ion_state_t;
	typedef struct ion_instruction_t ion_instruction_t;

	typedef void (ion_operation_t)(ion_state_t , ion_value_t, ion_value_t, ion_value_t *);
	// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
	// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
	// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
	// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
	// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
	// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
	// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
	void grow(Table *table) {
	u64 exp = 64 - table->shift;
	// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
	void gen_mul(Value dest, Value src) {
	if (isconst(dest)) {
	gen_swap(dest, src); // this doesn't generate instructions and just swaps descriptors ("value renaming")
	}
	val_to_reg(dest); // the post-condition is that dest is allocated to a register
	if (isconst(src) && isimm32(src->ival)) {
	if (src->ival == 0) {
	int_to_val(dest, 0);
	} else if (src->ival == 1) {
	// do nothing
	typedef struct {
	// These fields are internal state and not considered part of the public interface.
	Node *parent;
	int next_index;

	// This field is public and valid to read after a call to next that returns true.
	Node *child;
	} Iter;

	Iter iter_children(Node *node) {
	struct AbstractMatrix {
	int m; // number of rows
	int n; // number of columns

	// Pack block at ib, jb of size mb, nb into dest in row-major format.
	virtual void pack_rowmajor(int ib, int jb, int mb, int nb, float *dest) const = 0;

	// Unpack row-major matrix from src into block at ib, jb of size mb, nb.
	virtual void unpack_rowmajor(int ib, int jb, int mb, int nb, const float *src) = 0;
	// The two sweetspots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
	// a cacheline-sized bucket and still have one leftover byte for metadata.

	// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
	// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
	// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
	// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
	// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
	// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
	// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag
	// Length-segregated string tables for length < 16. You use a separate overflow table for length >= 16.
	// By segregating like this you can pack the string data in the table itself tightly without any padding. The datapath
	// is uniform and efficient for all lengths < 16 by using unaligned 16-byte SIMD loads/compares and masking off the length prefix.

	// One of the benefits of packing string data tightly for each length table is that you can afford to reduce the load factor
	// on shorter length tables without hurting space utilization too much. This can push hole-in-one rates into the 95% range without
	// too much of a negative impact on cache utilization.

	// Since get() takes a vector register as an argument with the key, you want to shape the upstream code so the string to be queried
	// is naturally in a vector. For example, in an optimized identifier lexer you should already have a SIMD fast path for length < 16