We want to port Remacs to use a regex crate implemented in Rust. The Rust implementations are highly optimised, and this would simplify the Remacs codebase.
The two major crates are Rust’s regex crate, and the fancy-regex crate.
Text in an Emacs buffer is not contiguous: it uses a gap buffer data structure. A minimal gap buffer implementation looks like this:
struct GapBuffer {
text: Vec<u8>,
gap_start: usize,
gap_end: usize,
}
The actual text is stored in two parts, buffer.text[..gap_start]
and
buffer.text[gap_end..]
. This enables very fast insertion (gap_start++
)
and deletion (gap_start--
) of text at the current cursor position
(‘point’ in Emacs terminology).
Moving point and inserting text elsewhere forces the gap to be moved,
which is O(N) in the size of text
. You can see the gap moving with the
following elisp command:
(defun wh/report-gap-buffer ()
(interactive)
(message "gap starts at %d and is %d bytes"
(gap-position) (gap-size)))
Try moving around with/without making changes and call M-x
wh/report-gap-buffer
.
The regex crate supports searching &str
or &[u8]
, but does not support
searching Iterator<u8>. Since this requires contiguous data, we will
need to move the gap to the beginning or end of the buffer to search
the whole buffer.
This is O(N/2), and the current Emacs C implementation does not have
this performance overhead. In re_match_2_internal
(defined in
regex.c), Emacs matches a char-at-a-time, and steps over the gap.
We can alleviate this overhead in several ways.
Bounds: Most elisp regex functions take a limit. For example,
re-search-forward
starts at point and has an optional end limit. If
point hasn’t moved since the last edit, we don’t need to search over
the gap.
Newlines: It should be straightforward to detect if an elisp regex matches
newlines, because .
does not match a newline. If the current regex
does not contain newlines, we could search each line separately:
- Search the complete lines in text before the gap.
- Join the last line before the gap to the first line after the gap, and search it.
- Search the complete lines in text after the gap.
Emacs’ regex syntax is based on POSIX. It is not as powerful as PCRE, and the pcre2el project shows that it’s possible to convert elisp regexes to PCRE.
The Rust regex crate does not provide all the features of a PCRE implementation. Instead, it is designed to provide regex features that it can search for in linear time (much like the re2 project).
The fancy-regex crate provides a much richer feature set (such as backreferences), leveraging the regex crate as much as possible so simple patterns are still fast.
Emacs lisp regexes have the following features that aren’t available in the regex crate, but are available in fancy-regex:
Backreferences: For example:
"\\(.\\)\\1" ;; Matches any repeated character.
Repeated backreferences: For example:
"\\(.\\)\\{2,3\\}" ;; Matches any character repeated two or three times.
Elisp references also have features that depend on the state of the current buffer:
Syntax classes: For example:
"[[:space:]]" ;; Matches any character that has whitespace syntax
;; according to the current syntax table.
"\\s(" ;; Matches an open parenthesis/bracket according to the
;; current syntax table.
"\\_<" ;; Matches the beginning of a symbol according to the syntax
;; table.
Case tables: For example:
"[[:upper:]]" ;; Matches any upper case letter according to the
;; current case table (unless `case-fold-search' is t).
In this cases, I believe we can transform elisp regexps into regexps acceptable by fancy-regex:
fn convert(elisp_regexp: &str, buffer: BufferRef) -> String
We should initially define a function remacs--re-search-forward
and
get a basic implementation working. We can limit this first version to
handling a subset of elisp regexps on unibyte buffers. By providing a
separate function, we can compare behaviour in a running Remacs
instance.
We can then expand the feature set to bring parity with the C implementation:
- Full elisp regexp syntax
- Buffer-sensitive regexps (case, syntax)
- Honouring narrowing (e.g.
re-search-forward
only looks at the visible content of the buffer). - Honouring
case-fold-search
, and benchmark our version. This will require some
upstream patches (fancy-regex does not support searching a &[u8]
yet).
How do we handle multibyte strings and buffers with different encodings? Specifically, how do we handle the following regexps?
"[[:multibyte:]]"
"[[:graph:]]" "[[:alpha:]]" ;; these are both unicode-aware
How do we handle \=
? This matches point.
It’s not clear what a good benchmark would be. One option is to measure syntax highlighting performance using emacsbench.
How do we handle backtracking? Emacs provides a sepaprate set of
functions for ‘full POSIX backtracking’, such as posix-search-forward
.
Please note that Emacs regexes include ability of executing arbitrary LISP code through
\,
regex object (or how is it called). Just noting it in case it may cause some troubles for porting effort.