boj/blog_zig_piece_table.md

## blog_zig_piece_table.md

      
    Raw
  

              blog_zig_piece_table.md
            
          
    Introduction

Inspired by the 300k USD Donation Pledged by Mitchell Hashimoto I decided to finally give Zig a look and see what all the fuss is about. I was curious what people meant by the simplicity of using something called "a modern C replacement". Rust has satisfied my low-level programming with safety itch, but people aren't wrong when they say it's a complicated language. What does a simple language look like in 2024?
Naturally, the first step to testing a language is finding a suitable project or domain to tackle. I love command line tools, and while it seems incredibly cliche, I've made it my goal to write a text editor of sorts. A true programmer's trial of ascension.
One of the first problems that needs solved when building a text editor is the matter of how you store and represent the text internally. There's a lot of naive approaches, such as storing each character in an array, or each line in a 2d array. However, those methods can be very inefficient with regards to memory management, make it difficult to introduce concepts like undo/redo, and depending on the implementation, make it hard to scale loading and editing very large files.
After a bit of searching and blog reading I narrowed my goal down to the handful of algorithms typically used in this scenario. I inevitably came across descriptions of a Gap Buffer, as used by Emacs (this blog has really cool animations showing how it works). A data structure called Rope, which is essentially a binary tree. And finally the one I settled on, the Piece Table, also described by the VSCode team as what they ended up reimplementing their buffer in. I also came across this helpful blog post which gave me a notion for how to implement it myself.
I'm not sure if my own implementation is better. It diverges from the idea of having "original" and "add" buffers, and instead is just an append-only ArrayList of Pieces which also include their operation, one of origin, add, delete. In order to get the latest editor state, the entire buffer of actions is replayed from the original point when the file was loaded. Conceptually, this looked a lot like event sourcing to me. This approach should remain reasonably efficient during an editing session, even with thousands of edits.
Goal

An example of what this might look like in practice:
Piece { .op, .string, .start }

Op.origin + Op.add (=new origin) + Op.delete (=new origin):

Piece(Op.origin, "Hello World", 0)
H e l l o   W o r l d  \n
0 1 2 3 4 5 6 7 8 9 10 11

Piece(Op.add, "There ", 6)
H e l l o   T h e r e     W  o  r  l  d  \n
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Piece(Op.delete, "llo The", 2)
H e r e   W o r l d \n
0 1 2 3 4 5 6 7 8 9 10

Op.delete + Op.add ("replace") (=new origin):

Piece(Op.delete, "Here", 0)
Piece(Op.add, "Sup", 0)
S u p   W o r l d \n
0 1 2 3 4 5 6 7 8 9

Zig

Digging around the std library has been a great source for learning and inspiration. There's a handful of Zig-isms that I wouldn't have considered, such as return a full type and its actions in a struct, or using the Self = @This() self referencing.
pub fn PieceTable() type {
    return struct {
        const Self = @This();

        allocator: std.mem.Allocator,
        root: ArrayList(Piece),

        const Piece = struct {
            op: Op,
            string: []const u8,
            start: usize,
            active: bool,

            fn init(op: Op, string: []const u8, start: usize) Piece {
                return Piece{
                    .op = op,
                    .string = string,
                    .start = start,
                    .active = true,
                };
            }
        };

        pub fn init(allocator: Allocator) Self {
            const root = ArrayList(Piece).init(allocator);
            return .{
                .allocator = allocator,
                .root = root,
            };
        }
        
        pub fn deinit(self: *Self) void {
            self.root.deinit();
        }

    }
}
One of the more interesting design decisions taken by Zig is that there are no hidden memory allocations. The developer handles this by passing in one of the default allocators from the standard library, or even a custom memory allocator, as the caller. This post goes into more detail about it. This approach leads to the element of least surprise, and makes it clear where memory is being allocated or not.
Replay

The workhorse logic inside my implementation is the replay() function, which walks through all of the text edits starting from the origin, and manipulates an ArrayList with the relevant slices of text. It's interesting to note that Zig has no concept of a string, which may surprise people coming from higher level languages; all characters are represented as a u8 (or multiple, in the case of unicode).
// Caller owns and must free the returned slice with the same allocator
pub fn replay(self: *Self) ![]const u8 {
    var origin = std.ArrayList(u8).init(self.allocator);
    defer origin.deinit();

    if (self.root.items.len > 0) {
        var i: usize = 0;
        while (i <= self.compute_idx()) : (i += 1) {
            const piece = self.root.items[i];
            switch (piece.op) {
                Op.origin => try origin.appendSlice(piece.string),
                Op.add => try origin.insertSlice(piece.start, piece.string),
                Op.delete => {
                    for (0..piece.string.len) |_| {
                        _ = origin.orderedRemove(piece.start);
                    }
                },
            }
        }
    }

    return origin.toOwnedSlice();
}
Hopefully the code speaks for itself, with the bulk of the logic tied up in the switch in the op codes and ArrayList manipulation. The final representation is built up into a new ArrayList, which the caller must free when they are done. The drawback to this approach is that it does end up requiring textSize * 2 amount of memory; the storage buffers, and the fast-forwarded representation. Perhaps there's a better approach?
Undo / Redo

The final piece to the puzzle was implementing undo/redo. This is achieved by adding an active field to the Piece struct and setting it to true if that is the point the user wishes to be at. The replay() function will then stop at that point when rendering the text out. (Note: As I write this it dawned on me that I forgot to add the "what happens after you introduce a new edit after an undo?" logic. Fun!)
fn compute_idx(self: *Self) usize {
    var i: usize = 0;
    for (0..self.root.items.len) |idx| {
        if (self.root.items[idx].active == false) i += 1 else break;
    }
    return i;
}

pub fn undo(self: *Self) void {
    const idx = self.compute_idx();
    if (idx == 0) return;
    if (self.root.items.len > 0) {
        self.root.items[idx].active = false;
        self.root.items[idx - 1].active = true;
    }
}

pub fn redo(self: *Self) void {
    const idx = self.compute_idx();
    if (idx < self.root.items.len - 1) {
        self.root.items[idx + 1].active = true;
        self.root.items[idx].active = false;
    }
}
There's not much to it. The core functions just manipulate the active status of any given Piece based on its position in the buffer. Trying to go beyond the size of the buffer results in a noop for each action.
Conclusion

Zig has been an enjoyable language to tinker with. The standard library is small. The number of operations to memorize are trivial. There doesn't seem to be much in the way of footguns or gotchas. Overall it seems a lot easier to reason about the code than traditional C. The tooling is actually quite amazing; you can use it to not only compile Zig projects but C projects as well, cross compilation for all major platforms, and ditch old school Makefiles. The project also takes a path similar to Go by supporting decentralized package management, rather than centralized similar to Rust's crates.io and the like.
The language is only on version 0.13 as of this writing and sounds like it still has a ways to go before an official 1.0 release. That said, it seems to be stable enough to use in its current state. The ecosystem isn't all that large compared to more popular languages, however, that isn't entirely unexpected given how long it's been in development.
While there's likely not much of a use case at my day job for such a tool, it feels worth adding to my toolbox on the off chance an opportunity comes along where it makes sense to use it. I look forward to exploring it more!
Notes

For the full implementation, including tests, see: https://github.com/boj/zany/blob/main/src/piece_table.zig