Skip to content

Instantly share code, notes, and snippets.

@DavidDeSimone
Last active April 29, 2022 12:57
Show Gist options
  • Save DavidDeSimone/400e0f6041d2f4a1eb7938f113a6036f to your computer and use it in GitHub Desktop.
Save DavidDeSimone/400e0f6041d2f4a1eb7938f113a6036f to your computer and use it in GitHub Desktop.
[RFC] Using Rust's HashMap for a LispHashTable, and the start of porting the Lisp GC to Rust.

Feature Name: Using Rust's std::collections::HashMap to implement Lisp_Hash_Table

Start Date: 7/25/2017

RFC PR: remacs/remacs#286

Summary

Emacs maintains it's own Hash Table implementation, represented by a Lisp_Hash_Table struct. We have begun the process of converting some of the C hash table functions into Rust. In doing this, it has been suggested that instead of maintaining our own hash table, we instead leverage Rust's std::collections::HashMap. Attempting to do this brings up two large problems that remacs has yet to discuss, Rust based allocation of LispObjects, and the Rustification (oxidation? heh) of the Lisp GC.

We could simplify our HashTable struct to something like

#[derive(Eq, PartialEq, Serialize, Deserialize, Copy, Clone)]
 enum HashFunction {
     Eq,
     Eql,
     Equal,
     UserFunc(LispObject, LispObject, LispObject),
 }

#[repr(C)]
pub struct LispHashTable {
    header: Lisp_Vectorlike_Header,
    weak: bool,
    is_pure: bool,
    table_test: HashFunction,
    map: HashMap<LispObject, LispObject>,
}

Motivation

We can greatly simplify the implementation of a Lisp Hash Table by leveraging Rust's standard hash map. Functions like 'hash_clear' and 'hash_put' can be greatly simplified, boiled down to 1-2 liners that should be easy for anyone to understand. Evidence also suggests making this switch will give us a sizable performance gain.

Detailed Design

The key idea here is remove the Lisp_Hash_Table type completely. The type 'Lisp_Hash_Table' will be removed from the C layer entirely, and replaced by a Rust defined type 'LispHashTable'. For tagging and identification, LispHashTable will still be #[repr(C)], and will still begin with a vectorlike header. Besides accessing this vectorlike header, a LispHashTable will be an opaque type to the C layer, and the C layer will not directly manipulate the hash table in any way.

The current skeleton of hashtable.rs should be able to easily be preserved, updated for this new structure. We currently have code that looks like:

pub type LispHashTableRef = ExternalPtr<Lisp_Hash_Table>;

impl LispHashTableRef {    
    pub fn allocate() -> LispHashTableRef {
        let vec_ptr =
            allocate_pseudovector!(Lisp_Hash_Table, count, PseudovecType::PVEC_HASH_TABLE);
        LispHashTableRef::new(vec_ptr)
    }
    // ... setters and getters 

}

Now COULD have something that looks like:

#[repr(C)]
pub struct LispHashTable {
    header: Lisp_Vectorlike_Header,
    weak: bool,
    is_pure: bool,
    table_test: HashFunction,
    map: HashMap<LispObject, LispObject>,
}

impl LispHashTable {
     pub fn new() // ...
}

pub type LispHashTableRef = ExternalPtr<LispHashTable>;

impl LispHashTableRef {
     pub fn allocate() -> LispHashTableRef {
     	 let map = LispHashTable::new();
	 let ptr = garbage_collector!().manage_hashtable(map); // New API that will be elaborated on
	 LispHashTableRef::new(ptr)
     }
}

#[lisp_fn]
fn make_hash_table(...) -> LispObject {
   let table = LispHashTableRef::allocate();
   // ...
   
   LispObject::from_hash_table(table)
}

The code above looks sane and straight forward, until we get into alloc::manage_vector. At this point in remacs history, all of our lisp objects have been allocated and managed in the C layer. To use a Rust HashMap, we will need to allocate a Rust object that will be managed by the Lisp GC.

Discussion on Emacs Lisp GC and allocation.

Emacs Lisp has a basic mark and sweep GC, implemented in alloc.c. Like it's name implies, it has two stages, a mark stage where it records if an object is not garbage, and a sweep stage where it frees all non-marked objects.

Adding support for marking should be straight forward. In the mark_object function in alloc.c, we have a case for a Lisp_Hash_Table:

case PVEC_HASH_TABLE:
{
    struct Lisp_Hash_Table *h = (struct Lisp_Hash_Table *) ptr;

    mark_vectorlike (ptr);
    // ...
}

This logic can be delegated to a Rust function called from the C, so now the above becomes

case PVEC_HASH_TABLE:
{
    rust_mark_hash_table(ptr);
}

The sweep is a little more complex. Vectors (and vector-likes) are swept in the sweep_vectors in alloc.c. Without getting to into the weeds on how this function works, we can summarize this function as

void sweep_vectors() {
    for (/* each vector in global record */) {
       if (/* vector not marked */) {
           cleanup_vector(vector);
       }
    }
}

This is actually a very large simplification, as vectors are using a custom block allocator for smaller vectors, while maintaining a list of "large vectors" that have there own rules. However, at a high level we only need to care about the fact that there is a structure of exsisting vectors that will be swept a certain point in the gc flow. I see no reason why we couldn't add something like

void sweep_vectors() {
    for (/* each vector in global record */) {
       if (/* vector not marked */) {
           cleanup_vector(vector);
       }
    }    
}

void sweep() {
   // ...
   sweep_vectors(); 
   rust_sweep(); // Sweeps all unmarked rust allocations. 
   // ...
}

We could maintain a structure of Rust allocated vectors, and when we notice are unmarked, we will drop them. There are a number of ways we can accomplish this, each with trade offs, but I would like to discuss a different approach that I believe will set us up better in the future.

Looking at the marking code, it seems that is is emulating somethign that would fit very well in a Rust trait. My idea is that first, we have a trait along the lines of:

pub trait GCThing: Send + Sync {
    fn mark(&mut self);
    fn unmark(&mut self);
    fn is_marked(&self);
}

For vectors, the C layer uses a block allocator, with blocks of size 4096. It maintains a free list of open slots. If a block is not used, it will be freed. If all blocks are at capacity, a new block will be allocated. We can emulate this in Rust via a BlockAllocator struct.

We would then create an object along the lines of:

struct LispGarbageCollector {
    managed_hashtables: BlockAllocator<LispHashTable>,
    // We could have a similar pattern for other pseudovector types...
}

With an implementation along the lines of:

impl LispGarbageCollector {
    pub fn sweep(&mut self);
    pub fn manage_hashtable(table: LispHashTable) -> ExternalPtr<LispHashTable>; 
    // ... 
}

Rooting

If we look back at a code sample I posted eariler, with this new GC api, it would look something like

impl LispHashTableRef {
     pub fn allocate() -> LispHashTableRef {
     	 let map = LispHashTable::new();
	 let ptr = garbage_collector!().manage_hashtable(map); 
	 LispHashTableRef::new(ptr) 
     }
}

This will be safe because the Lisp GC examines the current call stack for what it (thinks) are LispObjects or pointers to LispObjects.

#[lisp_fn]
fn my_binding(table: LispObject) -> LispObject {
    let object = table.as_hash_table_or_error();
    object.foo();
    Fgarbage_collect();
    object.bar(); // This should be fine.
    // ...
}

Purity

This still requires more investigation, but I believe we will be fine on purity if we use a binary rust serializer like bincode, calling purecopy on all of our dependent objects. We will dump our LispHashMap as binary into pure space, and when you need a copy of a hash table in pure space, we will deserialize.

Drawbacks

This is a lot of complexity! However, I believe that the complexity is stemming from problems that remacs will eventually have to solve. The problems of moving the GC over to Rust, and using Rust data structures aren't going to go away, and we will have to face them eventually. It may not yet be time to face these problems.

Rust's HashMap includes support for custom hashing functions, however emacs HashTables having support for controling the rehash_threshold and rehash_size, which as far as I can tell, are not supported in Rust's HashMap.

Alternatives

We can port the exsisting hashmap code over 1:1 from C to Rust, not using any Rust specific data structures.

Performance

Based on my preliminary tests, I am seeing up to a 30% perf. gain for map insertion, and an almost 8% gain for map fetching.

Unresolved Questions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment