void gen_mul(Value *dest, Value *src) {
    if (isconst(dest)) {
        gen_swap(dest, src); // this doesn't generate instructions and just swaps descriptors ("value renaming")
    }
    val_to_reg(dest); // the post-condition is that dest is allocated to a register
    if (isconst(src) && isimm32(src->ival)) {
        if (src->ival == 0) {
            int_to_val(dest, 0);
        } else if (src->ival == 1) {
            // do nothing

typedef struct {
    // These fields are internal state and not considered part of the public interface.
    Node *parent;
    int next_index;
    // This field is public and valid to read after a call to next that returns true.
    Node *child;
} Iter;
Iter iter_children(Node *node) {

struct AbstractMatrix {
    int m; // number of rows
    int n; // number of columns

    // Pack block at ib, jb of size mb, nb into dest in row-major format.
    virtual void pack_rowmajor(int ib, int jb, int mb, int nb, float *dest) const = 0;

    // Unpack row-major matrix from src into block at ib, jb of size mb, nb.
    virtual void unpack_rowmajor(int ib, int jb, int mb, int nb, const float *src) = 0;
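
To make the packing contract concrete, here is a small sketch of my own (not part of the original interface), assuming a plain row-major m x n array a with leading dimension n:

void pack_rowmajor(const float *a, int n, int ib, int jb, int mb, int nb, float *dest) {
    // Copy the mb x nb block whose top-left corner is at (ib, jb) into dest, row by row.
    for (int i = 0; i < mb; i++)
        for (int j = 0; j < nb; j++)
            *dest++ = a[(ib + i) * n + (jb + j)];
}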

// The two sweet spots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
// a cacheline-sized bucket and still have one leftover byte for metadata.
// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag
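
As a concrete illustration of the 4-bit-tag arithmetic above (my sketch, not from the original file), the layout fills a 64-byte cacheline exactly:

#include <stdint.h>

typedef struct {
    uint32_t keys[14]; // 14 x 4 = 56 bytes of keys
    uint8_t tags[7];   // 14 4-bit tags packed two per byte
    uint8_t meta;      // the one leftover byte of metadata
} Bucket;              // 56 + 7 + 1 = 64 bytes, one cacheline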

// Length-segregated string tables for length < 16. You use a separate overflow table for length >= 16.
// By segregating like this you can pack the string data in the table itself tightly without any padding. The datapath
// is uniform and efficient for all lengths < 16 by using unaligned 16-byte SIMD loads/compares and masking off the length prefix.
// One of the benefits of packing string data tightly for each length table is that you can afford to reduce the load factor
// on shorter length tables without hurting space utilization too much. This can push hole-in-one rates into the 95% range without
// too much of a negative impact on cache utilization.
// Since get() takes the key in a vector register as an argument, you want to shape the upstream code so the string to be queried
// is naturally in a vector. For example, in an optimized identifier lexer you should already have a SIMD fast path for length < 16

Take a simple expression grammar G:
E = n | (E + E)
A language is a set of strings. The grammar corresponds to an operator mapping languages to languages:
G(X) = {n} union {(e1 + e2) | e1 in X, e2 in X}
G's language L is the least fixed point of this operator, which is essentially the limit of G(X) starting with X = {}:
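X0 = {}, X1 = G(X0) = {n}, X2 = G(X1) = {n, (n + n)},
X3 = G(X2) = {n, (n + n), (n + (n + n)), ((n + n) + n), ((n + n) + (n + n))}, and so on.
L is the union of all the Xk: every fully parenthesized sum of n's.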

// Heavily based on ideas from https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_opt_fold.c
// The most fundamental deviation is that I eschew the big hash table and the lj_opt_fold()
// trampoline for direct tail calls. The biggest problem with a trampoline is that you lose
// the control flow context. Another problem is that there's too much short-term round-tripping
// of data through memory. It's also easier to do ad-hoc sharing between rules with my approach.
// From what I can tell, it also isn't possible to do general reassociation with LJ's fold engine
// since that requires non-tail recursion, so LJ does cases like (x + n1) + n2 => x + (n1 + n2)
// but not (x + n1) + (y + n2) => x + (y + (n1 + n2)) which is common in address generation. The
// code below has some not-so-obvious micro-optimizations for register passing and calling conventions,
// e.g. the unary_cse/binary_cse parameter order, the use of long fields in ValueRef.
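
A toy sketch of why direct calls beat a trampoline for reassociation (mine, not the code below): fold_add can recurse on its own operands, keeping control-flow context that a table-dispatched trampoline loses.

typedef struct Node { char op; long val; struct Node *l, *r; } Node; // op is '+' or 'k' (const)

static Node *mk(char op, long val, Node *l, Node *r) {
    static Node pool[1024]; static int n;
    Node *x = &pool[n++]; *x = (Node){op, val, l, r}; return x;
}

static Node *fold_add(Node *x, Node *y) {
    if (x->op == 'k' && y->op == 'k') return mk('k', x->val + y->val, 0, 0);
    if (x->op == '+' && x->r->op == 'k' && y->op == 'k') // (x + n1) + n2 => x + (n1 + n2)
        return fold_add(x->l, mk('k', x->r->val + y->val, 0, 0));
    if (x->op == '+' && x->r->op == 'k' && y->op == '+' && y->r->op == 'k')
        // (x + n1) + (y + n2) => x + (y + (n1 + n2)): the inner call makes this non-tail recursion.
        return fold_add(x->l, fold_add(y->l, mk('k', x->r->val + y->val, 0, 0)));
    return mk('+', 0, x, y);
}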

# I happened to be looking at some of Cranelift's code, and I noticed that their constant-time dominates()
# check was using a somewhat more ad-hoc version of a hidden gem from the data structures literature called the
# parenthesis representation for trees. As far as I know, this was invented by Jacobson in his 1989 paper
# Space-Efficient Static Trees and Graphs. I first learned about it from the slightly later paper by Munro and Raman
# called Succinct Representations of Balanced Parentheses and Static Trees. I figured I'd give it an extremely
# quick intro and then show how it leads to a (slightly better) version of Cranelift's algorithm.
#
# This parenthesis representation of trees is surprisingly versatile, but its most striking feature is that
# it lets us query the ancestor relationship between two nodes in a tree in constant time, with a few instructions.
# And the idea is extremely simple and intuitive if you just draw the right kind of picture.
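
A quick illustrative sketch (mine, not Cranelift's code): do a DFS, record the positions of each node's opening and closing parenthesis, and the ancestor query becomes interval containment.

#include <stdbool.h>

typedef struct { int open, close; } Paren; // positions of '(' and ')' in the DFS parenthesis string

// a is an ancestor of b (or equal to it) iff a's parentheses enclose b's.
static bool is_ancestor(Paren a, Paren b) {
    return a.open <= b.open && b.close <= a.close;
}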

pervognsen / shift_dfa.md
Shift-based DFAs

A traditional table-based DFA implementation looks like this:

uint8_t table[NUM_STATES][256];

uint8_t run(const uint8_t *start, const uint8_t *end, uint8_t state) {
    for (const uint8_t *s = start; s != end; s++)
        state = table[state][*s];
    return state;
}
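
For contrast, here is a sketch of the shift-based idea (my reconstruction, assuming at most 10 states so every state fits in a 6-bit field): each input byte's whole transition row is packed into one 64-bit word, states are bit offsets, and each step is just a load and a variable shift.

uint64_t table[256]; // the 6-bit field at offset s of table[c] holds the next state (itself a bit offset) for state s on byte c

uint64_t run(const uint8_t *start, const uint8_t *end, uint64_t state) {
    for (const uint8_t *s = start; s != end; s++)
        state = table[*s] >> (state & 63);
    return state & 63;
}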

pervognsen / rad.py
# Reverse-mode automatic differentiation

import math

# d(-x) = -dx
def func_neg(x):
    return -x, [-1]

# d(x + y) = dx + dy
def func_add(x, y):
    return x + y, [1, 1]