Skip to content

Instantly share code, notes, and snippets.

@Wilfred
Last active January 7, 2020 17:22
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Wilfred/331cdf1762dcc955da88662dbc022c3a to your computer and use it in GitHub Desktop.
Save Wilfred/331cdf1762dcc955da88662dbc022c3a to your computer and use it in GitHub Desktop.
Native Rust Regular Expressions in Remacs

RFC: Use Rust’s regex crate

We want to port Remacs to use a regex crate implemented in Rust. The Rust implementations are highly optimised, and this would simplify the Remacs codebase.

The two major crates are Rust’s regex crate, and the fancy-regex crate.

Background: The Gap Buffer

Text in an Emacs buffer is not contiguous: it uses a gap buffer data structure. A minimal gap buffer implementation looks like this:

struct GapBuffer {
    text: Vec<u8>,
    gap_start: usize,
    gap_end: usize,
}

The actual text is stored in two parts, buffer.text[..gap_start] and buffer.text[gap_end..]. This enables very fast insertion (gap_start++) and deletion (gap_start--) of text at the current cursor position (‘point’ in Emacs terminology).

Moving point and inserting text elsewhere forces the gap to be moved, which is O(N) in the size of text. You can see the gap moving with the following elisp command:

(defun wh/report-gap-buffer ()
  (interactive)
  (message "gap starts at %d and is %d bytes"
           (gap-position) (gap-size)))

Try moving around with/without making changes and call M-x wh/report-gap-buffer.

Searching Gap Buffers Efficiently

The regex crate supports searching &str or &[u8], but does not support searching Iterator<u8>. Since this requires contiguous data, we will need to move the gap to the beginning or end of the buffer to search the whole buffer.

This is O(N/2), and the current Emacs C implementation does not have this performance overhead. In re_match_2_internal (defined in regex.c), Emacs matches a char-at-a-time, and steps over the gap.

We can alleviate this overhead in several ways.

Bounds: Most elisp regex functions take a limit. For example, re-search-forward starts at point and has an optional end limit. If point hasn’t moved since the last edit, we don’t need to search over the gap.

Newlines: It should be straightforward to detect if an elisp regex matches newlines, because . does not match a newline. If the current regex does not contain newlines, we could search each line separately:

  • Search the complete lines in text before the gap.
  • Join the last line before the gap to the first line after the gap, and search it.
  • Search the complete lines in text after the gap.

Regex Feature Compatibility

Emacs’ regex syntax is based on POSIX. It is not as powerful as PCRE, and the pcre2el project shows that it’s possible to convert elisp regexes to PCRE.

The Rust regex crate does not provide all the features of a PCRE implementation. Instead, it is designed to provide regex features that it can search for in linear time (much like the re2 project).

The fancy-regex crate provides a much richer feature set (such as backreferences), leveraging the regex crate as much as possible so simple patterns are still fast.

Emacs lisp regexes have the following features that aren’t available in the regex crate, but are available in fancy-regex:

Backreferences: For example:

"\\(.\\)\\1" ;; Matches any repeated character.

Repeated backreferences: For example:

"\\(.\\)\\{2,3\\}" ;; Matches any character repeated two or three times.

Elisp references also have features that depend on the state of the current buffer:

Syntax classes: For example:

"[[:space:]]" ;; Matches any character that has whitespace syntax
              ;; according to the current syntax table.
"\\s(" ;; Matches an open parenthesis/bracket according to the
       ;; current syntax table.
"\\_<" ;; Matches the beginning of a symbol according to the syntax 
       ;; table.

Case tables: For example:

"[[:upper:]]" ;; Matches any upper case letter according to the
              ;; current case table (unless `case-fold-search' is t).

In this cases, I believe we can transform elisp regexps into regexps acceptable by fancy-regex:

fn convert(elisp_regexp: &str, buffer: BufferRef) -> String

Implementation

We should initially define a function remacs--re-search-forward and get a basic implementation working. We can limit this first version to handling a subset of elisp regexps on unibyte buffers. By providing a separate function, we can compare behaviour in a running Remacs instance.

We can then expand the feature set to bring parity with the C implementation:

  • Full elisp regexp syntax
  • Buffer-sensitive regexps (case, syntax)
  • Honouring narrowing (e.g. re-search-forward only looks at the visible content of the buffer).
  • Honouring case-fold-search

, and benchmark our version. This will require some upstream patches (fancy-regex does not support searching a &[u8] yet).

Unresolved Questions

How do we handle multibyte strings and buffers with different encodings? Specifically, how do we handle the following regexps?

"[[:multibyte:]]"

"[[:graph:]]" "[[:alpha:]]" ;; these are both unicode-aware

How do we handle \=? This matches point.

It’s not clear what a good benchmark would be. One option is to measure syntax highlighting performance using emacsbench.

How do we handle backtracking? Emacs provides a sepaprate set of functions for ‘full POSIX backtracking’, such as posix-search-forward.

@Hi-Angel
Copy link

Hi-Angel commented Jan 7, 2020

Emacs’ regex syntax is based on POSIX. It is not as powerful as PCRE, and the pcre2el project shows that it’s possible to convert elisp regexes to PCRE

Please note that Emacs regexes include ability of executing arbitrary LISP code through \, regex object (or how is it called). Just noting it in case it may cause some troubles for porting effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment