Wilfred/regex.org

## regex.org

      
    Raw
  

              regex.org
            
          
    RFC: Use Rust’s regex crate

We want to port Remacs to use a regex crate implemented in Rust. The
  Rust implementations are highly optimised, and this would simplify the
  Remacs codebase.
The two major crates are Rust’s regex crate, and the fancy-regex
  crate.
Background: The Gap Buffer

Text in an Emacs buffer is not contiguous: it uses a gap buffer
  data structure. A minimal gap buffer implementation looks like this:
struct GapBuffer {
    text: Vec<u8>,
    gap_start: usize,
    gap_end: usize,
}
The actual text is stored in two parts, buffer.text[..gap_start] and
  buffer.text[gap_end..]. This enables very fast insertion (gap_start++)
  and deletion (gap_start--) of text at the current cursor position
  (‘point’ in Emacs terminology).
Moving point and inserting text elsewhere forces the gap to be moved,
  which is O(N) in the size of text. You can see the gap moving with the
  following elisp command:
(defun wh/report-gap-buffer ()
  (interactive)
  (message "gap starts at %d and is %d bytes"
           (gap-position) (gap-size)))
Try moving around with/without making changes and call M-x
  wh/report-gap-buffer.
Searching Gap Buffers Efficiently

The regex crate supports searching &str or &[u8], but does not support
  searching Iterator<u8>. Since this requires contiguous data, we will
  need to move the gap to the beginning or end of the buffer to search
  the whole buffer.
This is O(N/2), and the current Emacs C implementation does not have
  this performance overhead. In re_match_2_internal (defined in
  regex.c), Emacs matches a char-at-a-time, and steps over the gap.
We can alleviate this overhead in several ways.
Bounds: Most elisp regex functions take a limit. For example,
  re-search-forward starts at point and has an optional end limit. If
  point hasn’t moved since the last edit, we don’t need to search over
  the gap.
Newlines: It should be straightforward to detect if an elisp regex matches
  newlines, because . does not match a newline. If the current regex
  does not contain newlines, we could search each line separately:

  Search the complete lines in text before the gap.
  Join the last line before the gap to the first line after the gap,
    and search it.
  Search the complete lines in text after the gap.

Regex Feature Compatibility

Emacs’ regex syntax is based on POSIX. It is not as powerful as PCRE,
  and the pcre2el project shows that it’s possible to convert elisp
  regexes to PCRE.
The Rust regex crate does not provide all the features of a PCRE
  implementation. Instead, it is designed to provide regex features that
  it can search for in linear time (much like the re2 project).
The fancy-regex crate provides a much richer feature set (such as
  backreferences), leveraging the regex crate as much as possible so
  simple patterns are still fast.
Emacs lisp regexes have the following features that aren’t available
  in the regex crate, but are available in fancy-regex:
Backreferences: For example:
"\\(.\\)\\1" ;; Matches any repeated character.
Repeated backreferences: For example:
"\\(.\\)\\{2,3\\}" ;; Matches any character repeated two or three times.
Elisp references also have features that depend on the state of
  the current buffer:
Syntax classes: For example:
"[[:space:]]" ;; Matches any character that has whitespace syntax
              ;; according to the current syntax table.
"\\s(" ;; Matches an open parenthesis/bracket according to the
       ;; current syntax table.
"\\_<" ;; Matches the beginning of a symbol according to the syntax 
       ;; table.
Case tables: For example:
"[[:upper:]]" ;; Matches any upper case letter according to the
              ;; current case table (unless `case-fold-search' is t).
In this cases, I believe we can transform elisp regexps into regexps
  acceptable by fancy-regex:
fn convert(elisp_regexp: &str, buffer: BufferRef) -> String
Implementation

We should initially define a function remacs--re-search-forward and
  get a basic implementation working. We can limit this first version to
  handling a subset of elisp regexps on unibyte buffers. By providing a
  separate function, we can compare behaviour in a running Remacs
  instance.
We can then expand the feature set to bring parity with the C
  implementation:

  Full elisp regexp syntax
  Buffer-sensitive regexps (case, syntax)
  Honouring narrowing (e.g. re-search-forward only looks at the
    visible content of the buffer).
  Honouring case-fold-search

, and benchmark our version. This will require some
  upstream patches (fancy-regex does not support searching a &[u8] yet).
Unresolved Questions

How do we handle multibyte strings and buffers with
  different encodings? Specifically, how do we handle the following
  regexps?
"[[:multibyte:]]"

"[[:graph:]]" "[[:alpha:]]" ;; these are both unicode-aware
How do we handle \=? This matches point.
It’s not clear what a good benchmark would be. One option is to
  measure syntax highlighting performance using
  emacsbench.
How do we handle backtracking? Emacs provides a sepaprate set of
  functions for ‘full POSIX backtracking’, such as posix-search-forward.