A new and fast(er) register allocator for the Cranelift backend

Overview

Cranelift is integrated with the Rust compiler through the rustc_codegen_cranelift project, where it serves as a backend for rustc in place of the default LLVM backend. The primary goal of this project is to improve compilation time by compiling much faster than LLVM. The compiled code will be less efficient, which is the expected trade-off: the current register allocator, regalloc2, produces good-quality code, but that quality isn't needed for typical debug builds.

Current Design of regalloc2

The in-depth design of the current allocation pipeline is well described: it covers the input form (SSA IR), instructions, operands, and a series of algorithms that compute an allocation, using structures like heaps and maps to keep track of liveness. The main highlight is that RA2 spends a lot of time (almost 50%) just building the live-range data structures for the main allocation loop. This is quite expensive: while it yields near-optimal runtime performance, it significantly slows down compilation.

Alternative approach to handle live ranges

Before jumping into the approach, a short recap: liveness information tells us whether some register is "live" at a given point of the program, i.e. whether it contains a value that may be used at a later point. The typical linear scan algorithm assumes these computations have already been performed, which requires a full pass beforehand. As mentioned earlier, this typically involves data structures that quickly let us fetch the closest interval, with the points maintained in increasing order (logarithmic time for most operations).
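As a rough illustration of the bookkeeping this pass produces (a sketch only, not regalloc2's actual structures), the intervals could be kept in an ordered map so that the closest range at or after a program point is found in logarithmic time:

use std::collections::BTreeMap;

/// Sketch: live ranges keyed by their start point, so the nearest
/// interval at or after a given program point is an O(log n) lookup.
struct LiveRanges {
    by_start: BTreeMap<u32, u32>, // start point -> end point
}

impl LiveRanges {
    fn next_range_from(&self, point: u32) -> Option<(u32, u32)> {
        self.by_start.range(point..).next().map(|(&s, &e)| (s, e))
    }
}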

Allocation in reverse

We can instead consider ranges in reverse order (begin is end and vice versa); this lets us solve the problem in a single pass without spending a lot of time on bookkeeping. There's a nice blog post which demonstrates this approach quite well (LuaJIT also uses it). To summarise, this is what we can do:

  • We can use a simple and cheap Least Recently Used (LRU) cache that steers us away from reusing recently used registers when we must free one, thereby reducing register spilling.
  • For the register allocator itself, a struct can be created with the dependent fields (allocations, registers, memory and others) needed to handle the various operations.

To handle multiple basic blocks and branching we need a Control Flow Graph (CFG). The current allocator used by Cranelift already has a trait, defined by the regalloc client, that provides access to its machine instructions/CFG, so we don't need to build one from scratch and can adapt it for our new allocator.

We also need an option/flag to enable this allocator with minimal changes in Cranelift.

Implementation

Since the overall infrastructure of the current allocator is already well defined, we do not need to build every sub-part from scratch, saving us a ton of time. Let's break down the entire process:

Setting the option to choose the allocator

/// Options for allocation.
#[derive(Clone, Copy, Debug, Default)]
pub struct RegallocOptions {
    // ...
    
    /// Faster allocator with sub-optimal run-time as trade-off.
    pub ssra_alloc: bool,
}

By passing this flag, we can opt into the faster version while reusing the existing tests, although we do need additional tests for the specifics of this allocator.
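Hypothetically, the top-level entry point could then dispatch on this flag; the ssra module name and the internal call shapes below are illustrative, not regalloc2's actual internals:

/// Illustrative dispatch: pick the allocator implementation based on
/// the option set by the embedder.
pub fn run<F: Function>(
    func: &F,
    env: &MachineEnv,
    opts: &RegallocOptions,
) -> Result<Output, RegAllocError> {
    if opts.ssra_alloc {
        ssra::run(func, env, opts) // new single-pass reverse allocator
    } else {
        ion::run(func, env, opts) // existing backtracking allocator
    }
}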

Data Structures

Currently, a priority queue built on BinaryHeap, together with a BTreeMap, is used to maintain the live ranges and to support various other fast lookups:

#[derive(Clone, Debug)]
pub struct PrioQueue {
    pub heap: alloc::collections::BinaryHeap<PrioQueueEntry>,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct PrioQueueEntry {
    pub prio: u32,
    pub bundle: LiveBundleIndex,
    pub reg_hint: PReg,
}

In our case, we would implement an LRU cache to keep track of the least recently used registers, while still reusing many other structs. Instead of the prio field, we could potentially assign a random value here for fuzz testing:

pub struct Lru<const N: usize> {
    /// Nodes of the doubly linked list, one per physical register,
    /// stored in a fixed-size array.
    pub data: [LruNode; N],
    /// Index of the most recently used entry.
    pub head: usize,
}

pub struct LruNode {
    pub reg_hint: PReg,

    /// Indices into the backing array, forming a doubly linked list.
    pub next: usize,
    pub prev: usize,
}

As seen above, we use the next and prev fields as usize indices rather than a Box<LruNode>: boxing would allocate on the heap, which isn't as CPU-cache-friendly as a constant-size array whose length we can pass as a const generic parameter.
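The core operations over this array could then look like the following sketch (the method names poke and evict are illustrative; head is treated as the most recently used entry of a circular doubly linked list):

impl<const N: usize> Lru<N> {
    /// Mark register `i` as just used by moving it to the front.
    pub fn poke(&mut self, i: usize) {
        if i == self.head {
            return;
        }
        // Unlink `i` from its current position.
        let (prev, next) = (self.data[i].prev, self.data[i].next);
        self.data[prev].next = next;
        self.data[next].prev = prev;
        // Splice `i` in just before the current head.
        let tail = self.data[self.head].prev;
        self.data[i].next = self.head;
        self.data[i].prev = tail;
        self.data[tail].next = i;
        self.data[self.head].prev = i;
        self.head = i;
    }

    /// Return the least recently used register, to be reused (its old
    /// contents get spilled); it becomes the most recently used entry.
    pub fn evict(&mut self) -> usize {
        let lru = self.data[self.head].prev;
        self.head = lru;
        lru
    }
}

The Env struct that drives the allocator would then hold this LRU in place of the priority queue: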

pub struct Env<'a, F: Function> {
    /// ...
    pub allocation_queue: Lru<32>, // 32 is illustrative: one node per allocatable register
    /// ...
}

(Similar replacements would be done for each of the structs, as above.)

Alternatively, many widely available crates exist for the LRU part (Rust's crate ecosystem is one of its powerful features), although native code does have benefits depending on the scenario.
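For instance, the third-party lru crate could serve as an off-the-shelf baseline (the key/value types below are placeholders):

use lru::LruCache;
use std::num::NonZeroUsize;

fn lru_demo() {
    // One entry per physical register; 8 is an arbitrary example size.
    let mut cache: LruCache<u32, ()> = LruCache::new(NonZeroUsize::new(8).unwrap());
    cache.put(3, ()); // register 3 was just used
    // When full, `put` evicts the least recently used entry automatically,
    // and `pop_lru` can be used to evict one explicitly.
}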

We also wouldn't need ordered or unordered structures for handling the registers: since each VReg and PReg can be fetched by its index, lookups are essentially reduced to constant time via direct Vec access, which is even faster than hashing.
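A sketch of such a flat lookup (the PerVReg type is hypothetical; VReg::vreg() is regalloc2's integer index accessor):

use regalloc2::VReg;

/// Per-vreg metadata stored densely, indexed by the vreg's integer index.
struct PerVReg {
    in_reg: Vec<Option<u8>>, // e.g. which physical register holds it, if any
}

impl PerVReg {
    fn get(&self, v: VReg) -> Option<u8> {
        self.in_reg[v.vreg()] // O(1) array access, no hashing
    }
}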

Spilling, Multiple Basic Blocks and Branching

/// Get the index of the entry block.
fn entry_block(&self) -> Block;

/// Provide the range of instruction indices contained in each block.
fn block_insns(&self, block: Block) -> InstRange;

/// Get CFG successors for a given block.
fn block_succs(&self, block: Block) -> &[Block];

The Function trait will help us retrieve the blocks, their instructions and the successor blocks. If any live range extends beyond the boundary of the current block, we spill when we run out of registers. Once we have processed the instructions in the current block, the previously inserted reload code is picked up in the next block.
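A sketch of driving the scan with these trait methods (the loop body is elided and the iteration details are illustrative):

use regalloc2::{Block, Function, Inst};

fn scan_blocks<F: Function>(func: &F) {
    for b in 0..func.num_blocks() {
        let block = Block::new(b);
        let range = func.block_insns(block);
        // Walk the block's instructions bottom-up.
        for i in (range.first().index()..=range.last().index()).rev() {
            let _inst = Inst::new(i);
            // ... allocate defs, then uses, of `_inst` here ...
        }
    }
}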

For example:

let a: i32 = 2;
let b: i32 = 3;

if a > b {
    let c: i32 = // <[some operation between a and b] (or) [only a] (or) [only b]>;
} else {
    let c: i32 = // <[some operation between a and b] (or) [only a] (or) [only b]>;
}

As seen above, the if block introduces the branching, and the set of live values differs depending on which branch is taken. Initially we would simply spill all registers to the stack at block boundaries; within each individual basic block the reverse-scanning approach works fine since we remain in the same scope.

Pseudo-code for the Algorithm

ReverseRegisterAllocation():
    Initialize an empty LRU cache C
    Initialize an empty stack S
    Initialize a counter for register IDs
    
    // Traverse variables in reverse order of last usage
    For each variable v in reverse order of its last usage:
        If v is already in cache C:
            Move v to the front of the cache (most recently used)
        Else:
            If cache C is not full:
                Add v to cache C and assign it a register ID
            Else:
                // Cache is full, need to spill
                Pop the least recently used variable u from cache C
                Push u onto stack S
                Add v to cache C and assign it the same register ID as u
    
    // Assign stack slots to spilled variables
    While stack S is not empty:
        Pop a variable u from stack S
        Assign u a stack slot

    // Output register allocation
    For each variable v:
        Output the register or stack slot assigned to v

The initial idea is to just spill everything to the stack at block boundaries. This effectively assumes that every variable's live range spans the whole program, and the LRU structure is then used to choose which registers should be freed.
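To make this concrete, here is a minimal self-contained Rust sketch of the reverse scan over a single block (all types and names here are simplified stand-ins, not the eventual regalloc2 integration):

#[derive(Debug)]
enum Decision {
    /// Variable `var` occupies physical register `preg` at this use.
    UseInReg { var: usize, preg: usize },
    /// `victim` is demoted to a spill slot above this point; a real
    /// allocator would insert the corresponding load/store edits here.
    Evict { victim: usize, slot: usize },
}

const NUM_REGS: usize = 4; // illustrative register count

/// `uses` lists variable IDs in program order; we scan them in reverse.
/// The LRU is a plain Vec here for brevity (front = most recently used).
fn reverse_allocate(uses: &[usize], num_vars: usize) -> Vec<Decision> {
    let mut lru: Vec<usize> = Vec::new();
    let mut in_reg: Vec<Option<usize>> = vec![None; num_vars];
    let mut next_slot = 0;
    let mut decisions = Vec::new();

    for &v in uses.iter().rev() {
        let preg = if let Some(p) = in_reg[v] {
            // Already resident: refresh its LRU position below.
            let pos = lru.iter().position(|&u| u == v).unwrap();
            lru.remove(pos);
            p
        } else if lru.len() < NUM_REGS {
            lru.len() // registers 0..len are occupied, so `len` is free
        } else {
            // Out of registers: evict the least recently used variable.
            let victim = lru.pop().unwrap();
            let p = in_reg[victim].take().unwrap();
            decisions.push(Decision::Evict { victim, slot: next_slot });
            next_slot += 1;
            p
        };
        in_reg[v] = Some(preg);
        lru.insert(0, v); // `v` is now the most recently used
        decisions.push(Decision::UseInReg { var: v, preg });
    }
    decisions
}

fn main() {
    // Uses of five variables in program order.
    let uses = [0, 1, 2, 3, 4, 0, 1];
    for d in reverse_allocate(&uses, 5) {
        println!("{:?}", d);
    }
}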

Timeline

Here's an approximate timeline, considering the community bonding period from May 1 - May 27:

  • May 27 - July 8 (Mid-Term Evaluation) => We should have the initial codebase ready at this point, with support for switching between the allocation techniques, and be able to process input correctly. While the spilling process may not be the most optimal here, we should be able to handle most cases, and simply spill to the stack in places where we have yet to make a smart decision.

  • July 12 - August 19 (Final Evaluation) => At this point our implementation should be ready. Since we are focusing on improving compilation time, testing and handling of corner cases would be done, along with other potential optimisations and heuristics; fuzz testing can also be added where we do not quite have the information to make a decision.

  • September 3 - November 4 => In case of a potential extension, we could improve further, depending on our benchmarks against the existing allocator and on how well our allocation technique responds to different forms of input.
