makenowjust/draft.md

## draft.md

      
    Raw
  

              draft.md
            
          
    Introduce cache-based optimization for Regexp matching

Regexp matching cause time-complexity explosion problem as known as ReDoS (https://en.wikipedia.org/wiki/ReDoS). ReDoS has become serious vulnerability in many places in recent years, and Ruby is no exception. The following are the incomplete list of such vulnerability reports:

sinatra/sinatra#1823
lostisland/faraday-net_http#27

These reports have been addressed by fixing the library/software implementation. But, if the language’s Regexp implementation become safe, the vulnerabilty is fundamentally archived.
For a few month, Ruby has implemented a Regexp matching timeout (https://bugs.ruby-lang.org/issues/17837). It is one of the useful methods for preventing ReDoS vulnerability, but it is a problem that setting a correct timeout value is hard. This value is depending on input length, environment, network status, system load, etc. When the value is too small, a system may be broken, but when the value is too large, it is not useful for preventing ReDoS.
Therefore, as a new way to prevent ReDoS, we propose to introduce cache-based optimization for Regexp matching. As CS fundamental knowledge, automaton matching result depends on position on input and state. In addition, matching time explosion is caused for repeating to arrive the same position and state many times. Then, ReDoS can be prevent when pairs of position and state arrived once is recorded (cached). In fact, under such an optimization, it is known as the matching time complexity is linear against input size [1].
[1]: Davis, James C., Francisco Servant, and Dongyoon Lee. "Using selective memoization to defeat regular expression denial of service (ReDoS)." 2021 IEEE symposium on security and privacy (SP). IEEE, 2021. https://www.computer.org/csdl/proceedings-article/sp/2021/893400a543/1oak988ThvO
See the following example.
$ ruby --version
ruby 3.2.0preview2 (2022-09-09 master 35cfc9a3bb) [arm64-darwin21]

$ time ruby -e '/^(a|a)*$/ =~ "a" * 28 + "b"'
ruby -e '/^(a|a)*$/ =~ "a" * 28 + "b"'  8.49s user 0.04s system 98% cpu 8.663 total

$ ./miniruby --version
ruby 3.2.0dev (2022-10-27T13:39:56Z recache bc59b7cc1e) [arm64-darwin21]

$ time ./miniruby -e '/^(a|a)*$/ =~ "a" * 28 + "b"'
./miniruby -e '/^(a|a)*$/ =~ "a" * 28 + "b"'  0.01s user 0.01s system 8% cpu 0.310 total
In this example, using ruby v3.2.0-preview2, matching /^(a|a)*$/ against "a" * 28 + "b" takes 8.6 seconds because matching time of this Regexp takes exponential against "a" * N + "b" form string. But, using the patched version ruby, it takes 0.01 seconds. Incredibly it is 860 times faster, because matching is done in linear time.
By this optimization, the matching time is linear to input size. It sounds secure and good. Unfortunately, when Regexp uses some extension (e.g. look-around, back-reference, subexpression call), the optimization is not applied. Also, the optimization needs a bit more memory for caching. However, we have already investigate that they are not so major problem (See the "Limitation" section).
Implementation

The basic cache implementation is complete at this time and can be found in the following Pull Request.
ruby/ruby#6486
Some tests seem to be failed, but it is no problem! Because the failed tests are for Regexp timeout, optimization works correctly and so they are failed as expected. Of course we need to fix these tests before merging.
Implementation notes:

A null-check on loop causes non-linear behavior, so the field to index the latest null-check item on stack is added to OnigStackType. (https://github.com/ruby/ruby/pull/6486/files#diff-4347460e379cd970ba0b88b4acb215ea60cbae308124675de8811eed377367aaR831)
When the loop is null and this loop has captures, matching behaves monomaniac. To reproduce this behavior, cache in the loop is cleared as necessary. (https://github.com/ruby/ruby/pull/6486/files#diff-c3cfe19efff0cc5181384741386067b81eadb2505b8b9b3f980d10f815037395R3433)
Like flip-flop operator, we hope to drop this if possible. But it still remains backward compatibility.

Limitation

Cache-based optimization is not applied in the following cases:

Regexp using some extensions (back-reference and subexpression call, look-around, atomic, absent operators) are not optimized because it is impossible or hard. However, it may be possible for look-around operators.
A bounded or fixed times repetition nesting in another repetition (e.g. /(a{2.3})*/). It is completly implementation issue, but we believe it is hard to support this case correctly.
Bounded or fixed times repetition is too large (e.g. /(a|b){100000,200000}/). The cache table size is propotional to product of the number of cache points of regex and input size. In this case, since the number of cache points becomes too large, the optimization cannot be applied.

Experiments were conducted to investigate how these limitations are problematic in practice. We used ObjectSpace to collect Regexp and investigate whether they could be optimized and the number of cache points. Regexp were collected from the standard library, Webrick, and Rails. See the following gist for the details (https://gist.github.com/makenowjust/83e1e75a2d7de8b956e93bdac004a06b).
Experiments result is shown in the following table.


Collected from
# Regexp
# non-optimizable
Maximum number of cache points


stdlib
1009
86 (8.52%)
81


Webrick
356
44 (12.36%)
20


Rails
759
74 (7.75%)
27


Total 
 (Duplications are reduced)
1506
118 (7.84%)
81


This result shows that the percentage of non-optimizable Regexp is less than 10%, and the amount of memory used for optimization is about 10 times the length of the string (81/8, for a bit array) at worst in this case. It is considered that a sufficient number of Regexp can be optimized in practice.
Specification

Detailed specification has been fixed yet. We have some ideas and we would like to discuss about it.

When is optimization enabled? Currently, it turns on when the backtrack causes as many as the input length.
How the number of cache points is allowed, and how memory can be allocated? It is not determined for now.
If the above parameters can be specified by users, how are they specified? (via command-line flag, or static / instance parameter like Regexp#.timeout= and Regexp#timeout=)
Unless the input size it too large, availability of optimization can be determined on compile-time. So, we would like to add a new flag to Regexp to determine whether cache is available. It becomes one of the criteria whether Regexp is efficientally executable or not. We believe it helps users. Thus, which letter is preffered for this purpose? l (linear) or r (regular) sounds good, but I am not sure what is the best.

Thank you.
Collected from	# Regexp	# non-optimizable	Maximum number of cache points
stdlib	1009	86 (8.52%)	81
Webrick	356	44 (12.36%)	20
Rails	759	74 (7.75%)	27
Total (Duplications are reduced)	1506	118 (7.84%)	81