separate_layout_from_content

Why storing accurate editor history requires separating layout from content.

Let us understand programming as writing strings that a program, called a compiler, converts into machine-readable or executable output according to a specification. Now let layout, meaning the purely graphical positioning of the code, be defined solely by additional whitespace and newlines.

Source code formatting ensures stylistic consistency, which improves readability and eases refactoring and writing code. Formatters that allow user-defined layout come with an escape hatch for visualization:

// zig fmt: off
const matrix = {{1, 1, 1},
                {2, 2, 2},
                {3, 3, 3}};
// zig fmt: on

is more readable than

const matrix = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}};

zig fmt as layout change

However, without the zig fmt: off and zig fmt: on lines, the user expects running zig fmt to convert the input

const matrix = {{1, 1, 1},
                {2, 2, 2},
                {3, 3, 3}};

to

const matrix = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}};

There is no semantic reason why an editor should not be able to do the same transformation, provided it stores the Abstract Syntax Tree of the input together with the offsets by which lines 2 and 3 differ from "where the next tokens are expected". For example, the segments const matrix = {{1, 1, 1},, {2, 2, 2}, and {3, 3, 3}}; are laid out identically in both versions. Only the gap between them differs: it is either one newline followed by spaces, or a single space.
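As a minimal sketch of that idea (not an editor implementation), the following Python splits a source string into tokens and the whitespace gaps in front of them, then keeps only the gaps that deviate from the normalized layout. The token pattern and the names split_layout, layout_deviations and restore are hypothetical and only cover the toy matrix syntax used here; a real editor would work on its own token stream or AST.

```python
import re

# Hypothetical token pattern for the toy syntax in this note:
# identifiers, integers, and single-character punctuation.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[{}(),;=]")


def split_layout(source):
    """Return (tokens, gaps); gaps[i] is the whitespace in front of tokens[i]."""
    tokens, gaps, pos = [], [], 0
    for match in TOKEN.finditer(source):
        gap = source[pos:match.start()]
        if gap.strip():
            raise ValueError(f"unrecognized input: {gap!r}")
        gaps.append(gap)
        tokens.append(match.group())
        pos = match.end()
    return tokens, gaps


def layout_deviations(user_source, normalized_source):
    """Map token index -> user whitespace, only where it differs from normalized."""
    user_tokens, user_gaps = split_layout(user_source)
    norm_tokens, norm_gaps = split_layout(normalized_source)
    if user_tokens != norm_tokens:
        raise ValueError("inputs differ in content, not only in layout")
    return {i: u for i, (u, n) in enumerate(zip(user_gaps, norm_gaps)) if u != n}


def restore(normalized_source, deviations):
    """Rebuild the user layout from the normalized text plus stored deviations."""
    tokens, gaps = split_layout(normalized_source)
    return "".join(
        deviations.get(i, gap) + tok for i, (tok, gap) in enumerate(zip(tokens, gaps))
    )


user = (
    "const matrix = {{1, 1, 1},\n"
    + " " * 16 + "{2, 2, 2},\n"
    + " " * 16 + "{3, 3, 3}};"
)
normalized = "const matrix = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}};"

deviations = layout_deviations(user, normalized)
print(deviations)                               # two entries, one per row break
print(restore(normalized, deviations) == user)  # round trip
```

Only the two gaps in front of the second and third row are stored; every other gap is implied by the normalized layout, which is what makes the difference small enough to version.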

Optional braces are more complex, as they require a lookahead to count the number of statements between the {} braces.

if (condition) runFunction();
if (condition)
  runFunction();
// alternative
if (condition) {
  runFunction();
}

Special in Zig

Zig has one specific design aspect that makes it very nice to use but limits formatting performance (in theory). An input with a trailing comma such as

const matrix = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3},};

is formatted to

const matrix = {{1, 1, 1},
                {2, 2, 2},
                {3, 3, 3},};

Thus the formatter needs to traverse the whole bracketed expression, up to its closing bracket, to check for a trailing comma before it can pick the correct layout, which can be slow for huge files.
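A rough sketch of that traversal, written in Python for illustration and assuming the toy syntax of this note (no strings or comments that could contain brackets): the function name has_trailing_comma is hypothetical, and this is not how zig fmt is implemented.

```python
def has_trailing_comma(source, open_index):
    """Scan from the `{` at open_index to its matching `}`.

    Returns (trailing_comma, characters_scanned). Toy assumption: the input
    contains no string literals or comments.
    """
    if source[open_index] != "{":
        raise ValueError("expected '{' at open_index")
    depth = 0
    last_significant = ""  # last non-whitespace character seen so far
    for i in range(open_index, len(source)):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # The layout is only known once the matching bracket is reached.
                return last_significant == ",", i - open_index + 1
        if not ch.isspace():
            last_significant = ch
    raise ValueError("unbalanced brackets")


with_comma = "const matrix = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3},};"
without_comma = "const matrix = {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}};"

print(has_trailing_comma(with_comma, with_comma.index("{")))
print(has_trailing_comma(without_comma, without_comma.index("{")))
```

The second value of the result shows that the whole initializer was read before the answer was available, which is the cost the paragraph above refers to.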

One simple workaround would be to use a comma prefix ({,{1, 1, 1}...{3, 3, 3}}) instead of a postfix ({{1, 1, 1}...{3, 3, 3},}). It is unergonomic to write, however, because you cannot simply append code.

Summary

  1. Comparison against the "normalized layout" is possible, even for sections where the formatter is disabled.
  2. The difference in the whitespace between tokens can be stored and versioned.

The next article will outline the consequences of trying to store these changes efficiently. Use cases are undo files, source code version control, source code reduction, and generally anything that rewrites the AST and needs to take the user's formatting of the source code into account.

matu3ba (author) commented Dec 3, 2022

  1. A rope for undo+redo (with a file format) and cheap indexing (after undo/redo)
    for O(1) read lookup.
    => The complexity comes from Unicode + grapheme interactions, NOT syntax.

See "rope science". Rope implementations inherit this complexity, so adding AST-based editing on top, plus handling of spaces etc. for Unicode, sounds infeasible.
On the other hand, a non-Unicode solution will not gain general acceptance or usage.

matu3ba (author) commented Dec 3, 2022

With that in mind, I decided not to follow up on this idea. Without Unicode support, however, it still looks feasible to me.
