I'm sick right now and probably have a fever, so this may make little sense or be completely incorrect. For years, I've been tossing this strange idea around in my head. The goals are:

  • Make associative and sparse data structures cheap and fast.
  • Make memory management trivially simple (no two-step process of adjusting data segment, then allocating from within it; no per-process address spaces).
  • Make memory allocation time very consistent, if not constant.

Note that I have at best a basic understanding of most of the topics here and this is probably pathologically naive. This is not a blog post; it's the kind of thing that I'd send to a friend who knows more than me about these topics.

SYSTEM MODEL

All code is executed through a VM within the kernel. The VM provides process isolation; hardware process isolation (protection rings) is disabled.

I assume single-core, partly because it's just a thought experiment, and partly because I've never dealt with multiple cores in assembly. I don't consider virtual memory at all because it would complicate the experiment.

Except for DMA buffers etc. needed by the kernel, all allocation is done in fixed-size units. I assume ARMv8 whenever I talk about hardware details or reference instructions. You'll need this ARMv8 instruction manual to look up instructions: http://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf.

MEMORY MODEL

Memory is addressed in 64-bit words organized into 128-bit blocks. All blocks are aligned on multiples of 128 bits. This is because crit bit trees, which we'll get to in a moment, can fit nicely into 128-bit nodes. The blocks could easily be larger or smaller, or even of a couple of mixed sizes that fit together nicely. I chose a single size because it makes reasoning about the system easier.

56 bits of memory are addressable, leaving 8 bits in each pointer available for other uses (e.g., for crit bit tree metadata). Happily, ARMv8's top-byte-ignore feature means the high 8 bits of pointers are ignored during load/store operations, so no extra cycles are burned untagging pointers before using them. 56 bits gives 64 PiB of addressable memory.
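As a rough sketch of the tagging scheme in C (names are hypothetical; on hardware without top-byte-ignore you'd mask before dereferencing):

    #include <stdint.h>

    /* 56-bit address plus an 8-bit tag in the top byte. With ARMv8's
       top-byte-ignore, a tagged pointer can be dereferenced as-is. */
    #define ADDR_MASK ((1ULL << 56) - 1)

    static inline uint64_t tag_ptr(uint64_t addr, uint8_t tag) {
        return (addr & ADDR_MASK) | ((uint64_t)tag << 56);
    }

    static inline uint8_t  ptr_tag(uint64_t p)  { return (uint8_t)(p >> 56); }
    static inline uint64_t ptr_addr(uint64_t p) { return p & ADDR_MASK; }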

Almost all memory is immutable. Mutation will be achieved with simple references to immutable values, where the reference exists outside of the normal crit bit-based system described below.

Because all allocation is in a single fixed size, aligned to that size, fragmentation doesn't exist. A list of free addresses could be maintained, with allocation simply pulling the first one off the list. (Although that would be inefficient; a much better scheme is described below.)

DATA TYPE REPRESENTATION

A maximum allocation size of 128 bits changes the relative costs of primitive data types. Machine integers (64-bit) and floats (64-bit double) are still fast, but they incur 50% waste since the second word of the block isn't used.

The general purpose integer and decimal types provided by the VM are of arbitrary precision. This is clearly The Right Thing for almost all programming tasks. For arbitrary-precision integers, I'd expect blocks to be treated as a pair of {64 bits of integer data, 64 bit pointer to the next segment}. A similar scheme works for decimals and strings.
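A minimal sketch of one such block in C (the choice to chain from most to least significant segment, and to put the sign in the tag byte, are my assumptions, not settled design):

    #include <stdint.h>

    /* One 128-bit allocation unit of an arbitrary-precision integer:
       64 bits of magnitude plus a tagged pointer to the next, less
       significant, segment. A zero next-pointer ends the chain; the
       sign could live in the pointer's spare tag byte. */
    typedef struct bigint_block {
        uint64_t limb;   /* 64 bits of integer data */
        uint64_t next;   /* tagged 56-bit pointer to the next segment, or 0 */
    } bigint_block;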

Many collection types get implemented on top of crit bit trees, as described by D. J. Bernstein. Most notably that includes hashes, but even standard arrays can be implemented using crit bit trees (although I don't know how they'd compare to a more straightforward binary tree based implementation).

All collection types are persistent (immutable). If a programming language uses persistent data types in today's systems, it will end up with a huge number of small allocations for intermediate nodes. That will almost certainly lead to a custom allocator in the language implementation. If most programming is done in such a language, why not promote this fine-grained memory model to be system-wide?

[Figure: block layouts for the string, integer, decimal, and tree types.]

In general, basic data types will tend to incur a doubling of memory usage vs. traditional systems. Organizing a 64-bit machine into 128-bit blocks, with no larger allocations possible, makes this size doubling unavoidable.

CRIT BIT TREES

A crit bit tree node has three components:

  • A "critical bit location", which is the index of the bit where this node pivots.
  • A "left" pointer for when the bit is 0.
  • A "right" pointer for when the bit is 1.

Imagine that we want to store 0001b and 0010b in a crit bit tree. Numbering bits from the most significant (so bit 1 is the leftmost), the first bit where they differ is bit 3, so the critical bit location will be 3. The left pointer will point to a terminal node containing only 0001b, since it has a 0 in bit 3. The right pointer will point to a terminal node containing only 0010b.

For more details on crit bit trees and their high performance for common operations, see: http://cr.yp.to/critbit.html.
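Finding the critical bit of two distinct keys is itself cheap: XOR them and count leading zeros. A sketch for 64-bit keys, using GCC/Clang's __builtin_clzll and the MSB-first numbering from the example above:

    #include <stdint.h>

    /* Returns the 1-based index (counting from the most significant bit)
       of the first bit where two keys differ. Undefined if a == b. */
    static inline unsigned crit_bit(uint64_t a, uint64_t b) {
        return (unsigned)__builtin_clzll(a ^ b) + 1;
    }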

In this system, all crit bit tree nodes are 128 bits (surprise). They're structured as:

Word 1:
	- 1 bit: is left pointer a node or data?
	- 1 bit: is right pointer a node or data?
	- 6 bits unused
	- 56-bit left pointer
Word 2:
	- 6-bit critbit location
	- 2 bits unused
	- 56-bit right pointer

ARMv8 ignores the top eight bits of a pointer (leaving them free for metadata), and it has single-instruction bitfield extracts (UBFX and friends). I didn't work out the details, but I expect that the assembly to extract the crit bit location and navigate based on it will be very fast, even with it packed into the same word as the right pointer. If it's not, the block size can always be increased, or multiple block sizes can exist (which is what I'd expect to see in practice).
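A sketch of that layout and one step of navigation in C (the field positions are my reading of the layout above, and MSB-first key numbering is assumed):

    #include <stdint.h>

    #define PTR56(w) ((w) & ((1ULL << 56) - 1))

    typedef struct critbit_node {
        uint64_t w1;  /* [63] left-is-node, [62] right-is-node, [55:0] left ptr */
        uint64_t w2;  /* [63:58] critbit location, [55:0] right ptr             */
    } critbit_node;

    /* Follow one node: test the critical bit of `key`, return the chosen
       pointer, and report whether it's another node or terminal data. */
    static inline uint64_t critbit_step(const critbit_node *n, uint64_t key,
                                        int *is_node) {
        unsigned bit   = (unsigned)(n->w2 >> 58);       /* 6-bit location */
        int      right = (int)((key >> (63 - bit)) & 1);
        *is_node = right ? (int)((n->w1 >> 62) & 1) : (int)(n->w1 >> 63);
        return right ? PTR56(n->w2) : PTR56(n->w1);
    }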

FREE LIST

The free list is stored as a 64-way tree. Level 1, the root, is a single word: each of its 64 bits indicates whether any free blocks exist in the corresponding word of level 2. Level 2 is 64 words (4,096 bits) wide. This continues downward, each bit in a word indicating whether free blocks exist in the corresponding word of the next level: level 3 is 4,096 (2^12) words (2^18 bits) wide, level 4 is 2^18 words (2^24 bits) wide, and so on. Level 6 is 2^30 words (2^36 bits) wide: one bit for each 128-bit block in 1 TiB of allocatable memory.

Because all allocation is in aligned 128-bit blocks, one bit per block is enough, so a 6-level free list tracks 2^36 blocks, or 1 TiB, of allocatable memory. That number isn't special; expanding beyond it just requires adding another level to the tree, multiplying capacity by 64 (to 64 TiB).

Level 6 of the free node tree is 2^30 words, or 8 GiB, in size: 1/128th (about 0.8%) of the 1 TiB of allocatable memory it tracks. The total free list overhead is a bit more than that because of the other five levels, which together add only about 1/64th of level 6's size. There's no additional metadata (as is present in most allocators), so roughly 0.8% is the entire overhead. In a smaller memory space, the unused sections of the tree won't be present, so the overhead stays at about the same ratio.

A "1" bit in levels 1-5 indicates the presence of at least one free block in the corresponding word in the next level. In level 6 (the last level), a "1" bit indicates that the corresponding address in the allocatable pool is free. Conceptually, level 6 has a single bit for each allocatable block in memory.

You should think about each level as representing a certain subset of memory. The 64-bit root (level 1) represents all allocatable memory. The 64 words at level 2 each represent 1/64th of it. Etc.

For example, if the "1" bits found at the six successive levels are at bit positions 10, 11, 12, 13, 14, and 15 of their respective words, the index of the free block is:

(10 * 64 ^ 5) + (11 * 64 ^ 4) + (12 * 64 ^ 3) + (13 * 64 ^ 2) + (14 * 64 ^ 1) + (15 * 64 ^ 0)
= (10 * 1073741824) + (11 * 16777216) + (12 * 262144) + (13 * 4096) + (14 * 64) + (15)
= 10,925,167,503

This is the index of the target block. To convert it into a pointer, we multiply by 2 to convert from 128-bit blocks to 64-bit words, then add the allocatable pool's base address (arbitrarily chosen here to be 1000000).

1000000 + 10,925,167,503 * 2
= 21,851,335,006

This is the raw word address of the free block.

ALLOCATION COST

Allocation requires traversing the tree from the root down. At each level, a 1 bit must be found, indicating the index into the next level where we can continue on to eventually find a free block. Here's some pseudo ARMv8 assembly for finding a free block. Assume that all "loops" in pseudo code here will be unrolled.

Do this six times, starting with level=1 and index=0 and incrementing level by 1 each time:
    MADD to compute `addr` = the memory address of the tree entry, using `index` and the constant base address for this level
    LDR from `addr` into `node`
    CLZ to find the first "1" bit of `node`; store it as the new `index`

The CLZ in the last iteration gives us the final six bits of a free block's index. That's 6 iterations * 3 instructions = 18 instructions. The last CLZ could be counted as part of computing the block's actual address (covered below), in which case the search itself costs 17 instructions.
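The same search in C, for concreteness (the level_base layout is hypothetical, and GCC/Clang's __builtin_clzll stands in for CLZ):

    #include <stdint.h>

    extern uint64_t *level_base[7];   /* level n (1..6) is 64^(n-1) words */

    /* Walk the tree from the root down; returns a free block's 36-bit
       index, or -1 if the root word is zero (memory exhausted). */
    static int64_t find_free_block(void) {
        uint64_t index = 0;
        for (int level = 1; level <= 6; level++) {     /* unrolled in practice */
            uint64_t node = level_base[level][index];  /* MADD + LDR */
            if (node == 0) return -1;                  /* only possible at the root */
            unsigned bit = (unsigned)__builtin_clzll(node);  /* CLZ */
            index = (index << 6) | bit;
        }
        return (int64_t)index;
    }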

The block must now be marked as allocated. This is a continuation of the above, so all of the tree nodes and addresses are still in registers, as is index.

Do this six times, working from level 6 back up toward the root:
    BFI: set a '0' at position `index` of this level's node
    STR the node back to its original address
    CBNZ out of this whole process if the tree node that we just stored isn't 0

The CBNZ ensures that we stop as soon as we store a tree node that still has subtrees marked as having free blocks. Marking higher-level nodes as full when they actually contain free blocks would permanently leak memory.

The total cost here can be up to 18 instructions, with six branches. However, levels 1-4 will rarely need updating: a single bit at level 4 covers 4,096 allocatable blocks, so frequent bit-clearing at that level or nearer the root means that memory is almost full.
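In C, the marking pass looks like this (continuing the hypothetical layout from the search sketch; bits are numbered MSB-first to match CLZ):

    #include <stdint.h>

    extern uint64_t *level_base[7];   /* as in the search sketch */

    /* Clear the leaf bit for block `idx`, then clear parent bits only
       while the word we just stored went to zero (the CBNZ early exit). */
    static void mark_allocated(uint64_t idx) {
        for (int level = 6; level >= 1; level--) {
            unsigned bit = (unsigned)(idx & 63);
            idx >>= 6;                              /* word index at this level */
            uint64_t *word = &level_base[level][idx];
            *word &= ~(1ULL << (63 - bit));         /* BFI + STR */
            if (*word != 0) return;                 /* CBNZ: free blocks remain */
        }
    }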

Together, searching for a block and updating the tree take at most 36 instructions, but I expect the vast majority of allocations to take 21-24 because of the early exit via CBNZ. There's no bookkeeping other than remembering which blocks are and aren't allocated, so this is the whole cost.

The cost of computing the actual address from the free node in the tree isn't included here; it will be a few instructions.

FREE COST

Freeing is slightly more complicated, although that's probably because I'm tired by now and failing to see a simpler solution.

Do this six times, starting with `addr` holding the index of the block to be freed and working from level 6 up toward the root:
    AND to get the bottom 6 bits of `addr`; store them in `index`
    LSR `addr` by 6 bits to convert it to an index within the next level up
    LDR the tree node for this level (from `addr` and this level's base address)
    TBNZ to jump out of this process if bit `index` is already a 1 (and by extension we can assume that the corresponding bits in the parent levels are 1)
    BFI: set a '1' at position `index`
    STR the node back
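Or in C, mirroring the search and marking sketches:

    #include <stdint.h>

    extern uint64_t *level_base[7];   /* as in the allocation sketches */

    /* Set the leaf bit for block `idx`, then set parent bits upward,
       stopping at the first bit that is already 1 (the TBNZ early exit). */
    static void mark_free(uint64_t idx) {
        for (int level = 6; level >= 1; level--) {
            unsigned bit = (unsigned)(idx & 63);       /* AND */
            idx >>= 6;                                 /* LSR: up one level */
            uint64_t *word = &level_base[level][idx];  /* LDR */
            uint64_t mask = 1ULL << (63 - bit);
            if (*word & mask) return;                  /* TBNZ: already marked */
            *word |= mask;                             /* BFI + STR */
        }
    }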

There's unfortunate weirdness both here and in the allocation code, since they deal with block indexes but we ultimately want pointers. This could be smoothed out by (1) storing pointers directly in the tree instead of block indexes, and (2) beginning the allocatable memory at address 0 (but I'm not sure whether that's possible). If both of those are done, translating from a level 6 bit to a pointer requires only shifts: the level 6 word index times 64, plus the bit position, doubled to get a word address.

OPTIMIZATIONS AND EXTENSIONS

Considering the massive number of allocations that persistent data structures would cause, it may make sense to dedicate a few of ARMv8's 31 general-purpose registers to allocation (remember, we're considering single-core only). Level 1 (the root) of the free tree can be stored in a register to avoid a load when searching for free blocks. The most recently freed level 6 block can be stored in a register for reuse, skipping the standard allocation process entirely.

Alternatively, we can hold a whole level 6 tree entry in a register and simply hand out its free blocks until it's exhausted, at which point the standard algorithm fetches a new one. This will have the unfortunate side effect of making allocation cost less predictable as memory fills up and refills become more frequent.
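A sketch of that fast path in C (register residency becomes two statics here; find_free_block and mark_allocated are the earlier sketches, and __builtin_clzll again stands in for CLZ):

    #include <stdint.h>

    extern uint64_t *level_base[7];
    int64_t find_free_block(void);      /* search sketch above  */
    void    mark_allocated(uint64_t);   /* marking sketch above */

    static uint64_t cached_word;  /* free bits left in the word being drained */
    static uint64_t cached_idx;   /* that word's index within level 6         */

    static int64_t alloc_fast(void) {
        if (cached_word == 0) {                    /* exhausted: slow path */
            int64_t blk = find_free_block();
            if (blk < 0) return -1;
            cached_idx  = (uint64_t)blk >> 6;
            cached_word = level_base[6][cached_idx];
        }
        unsigned bit = (unsigned)__builtin_clzll(cached_word);
        uint64_t blk = (cached_idx << 6) | bit;
        cached_word &= ~(1ULL << (63 - bit));      /* consume from the cache */
        mark_allocated(blk);                       /* keep the tree exact; a real
                                                      system might batch these  */
        return (int64_t)blk;
    }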

Because all code executes through a VM in the kernel, there's no chance of application code tampering with these memory tracking registers. If they turn out not to pull their weight (or are impractical due to a multi-core system), similar optimizations can be done in memory.

Allocating multiple blocks at once will be very frequent (map, for example, will need to allocate an entire duplicate collection). Optimizing for that may provide a performance improvement, but the two unrolled loops during allocation are only three instructions per level, so reintroducing loop overhead might negate much of the savings.

If the entire memory space were content-addressable, then we could avoid ever allocating a duplicate object. I don't have any sense for whether that would pay for itself in terms of space, or justify the time it would take, but it's an interesting question. This content-addressed optimization is what set me to thinking about this whole scheme years ago.

Reminder: I'm sick and just thinking aloud; this might be wrong.
