Davidebyzero/gist:9090628

Last active November 25, 2020 07:58

Star () You must be signed in to star a gist
Fork () You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/Davidebyzero/9090628.js"></script>
Save Davidebyzero/9090628 to your computer and use it in GitHub Desktop.

Download ZIP

Solutions to teukon's Draft Regex Golf Bonus Levels (SPOILERS! SPOILERS!) and discussion of mathematical regexes

Raw

gistfile1.md

See https://gist.github.com/teukon/9060863

Author

Davidebyzero commented Mar 14, 2014

Yay! Very well done sir.

Thank you kindly.

Here's a debug trace of your Fibonacci regex taking the input of 46368. You need 7-Zip to uncompress it (hopefully you already have that installed). Once uncompressed it is a 30 megabyte text file. (To do: Add line numbers, and show the whole line, with a ^ pointing to the current position.)

The numbers in [] are how many backref captures that group owns. The numbers in {} are positions within the input number. The rest should be self-explanatory.

Edit (as of 20:42 UTC): If you already downloaded this, please re-download. All the (?:,(,(?=,(?! in the group stack trace were off by one nesting level.

BTW, you have some $ outside the nesting level of the backreference preceding it, making my current optimization unable to see that it comes right after the backreference. For example, (?=\3\1*(\1(\2+))$) would perform better as (?=\3\1*(\1(\2+$))) in my current engine.

Here is a debug trace of your regex with minor edits to fix that (17.5 MB text file inside).

teukon commented Mar 14, 2014

Cool. It seems your engine is working on the regex without stripping comments, newlines, or leading/trailing whitespace.

Author

Davidebyzero commented Mar 14, 2014

It strips all that in its internal representation of the regex. To make the debug trace I gave each symbol a pointer to its text representation in the original unprocessed text file.

teukon commented Mar 14, 2014

It strips all that in its internal representation of the regex. To make the debug trace I gave each symbol a pointer to its text representation in the original unprocessed text file.

I see.

BTW, you have some $ outside the nesting level of the backreference preceding it, making my current optimization unable to see that it comes right after the backreference. For example, (?=\3\1*(\1(\2+))$) would perform better as (?=\3\1*(\1(\2+$))) in my current engine.

So is it twice as fast like this? (I noticed it takes about half as many steps). How much faster is it than with pcregrep?

Author

Davidebyzero commented Mar 14, 2014

So is it twice as fast like this? (I noticed it takes about half as many steps). How much faster is it than with pcregrep?

With your unedited regex, it took 201 seconds to go from 0 to 46368. With your edited regex, it took 64.4 seconds.

pcregrep took 270 seconds and perl took 625 seconds. The edit makes no difference to them.

Eventually I'll make my optimization smarter, where it can scan up to shallower nesting depths and know when it's safe to do the optimization. But that'll probably have to take place in a static-analysis stage between parsing and matching.

teukon commented Mar 14, 2014

Oh, it's quite a lot faster then. Nice work.

By the way. I'll be away for a couple of days and won't be responding here. Good luck with your project.

Author

Davidebyzero commented Mar 14, 2014

Oh, it's quite a lot faster then. Nice work.

By the way. I'll be away for a couple of days and won't be responding here. Good luck with your project.

Thank you! And thanks for the heads-up. Have a good time wherever you're going!

Author

Davidebyzero commented Mar 15, 2014

My regex engine now works with all of our ^x*$ domain regexes.

Author

Davidebyzero commented Mar 15, 2014

Added preliminary support for matching actual strings, and it works with "regex for matching multiplication - simple method.txt". It does not yet work with "regex for matching multiplication - factoring method.txt", however; it appears to be completely failing to do the p^s+p^t divide-down loop, and only doing the common-prime-factors test at the bottom.

Edit: Oh, wow. I forgot to trim whitespace inside of what my engine treats as a single symbol. So, for example \ 8 was not treated as \8. (This is actually the standard. My use of whitespace for alignment of backrefs between lines in "regex for matching multiplication - factoring method.txt" is nonstandard, but I like it so I'm going to keep it as an option.)

Author

Davidebyzero commented Mar 15, 2014

"regex for matching multiplication - factoring method.txt" is still giving some false positives, but I think it's because it only worked in the first place due to a bug in PCRE. The bug in the regex is that it effectively assumes the loop that divides down p^s+p^t is greedy-possessive, not just greedy. The bug in PCRE would be that it either doesn't try to backtrack on the number of loop iterations, or doesn't restore backref values to what they were on the earlier iteration when it does backtrack. Maybe it only has this bug when the last iteration of the loop, the one being backtracked through, is zero-length.

Edit: Indeed, it was a bug in PCRE! I ported the test script to use Perl's regex engine, and it had the same false positives. My engine has the correct behavior.

You know a regex engine is progressing well when it helps identify bugs in other engines!

My engine executes the fixed version of "regex for matching multiplication - factoring method.txt" perfectly! And does it faster than Perl. (Even despite the fact that it's actually comparing strings now.) Perl goes up to 25*25=625 in 160 seconds, and my engine does it in 135 seconds. And there are still many more optimizations I can do. (I've only done one, really.)

Author

Davidebyzero commented Mar 16, 2014

Finished implementing the proper backtracking into alternatives in greedy and lazy loops. I think I've got the fully correct behavior now.

Author

Davidebyzero commented Mar 16, 2014

Added an optimization for backref followed by sequence of fixed-repetition backrefs inside a lookahead. For example \1+(?=\2\3{4}\5{3}$). Previously matching this required backtracking all the way from the maximum number of matches of \1 to the point at which it resulted in a match, and if the subsequent terms didn't match it had to subsequently backtrack \1 all the way to the minimum number of matches (1 in this case).

With the optimization it immediately goes forward the correct length and doesn't need to backtrack \1 at all.

This makes "regex for matching multiplication - factoring method.txt" go up to 25*25=625 in 73 seconds instead of 120 seconds (which is in turn down from the previous 135 seconds due to minor optimizations).

Author

Davidebyzero commented Mar 16, 2014

I templated the choice between comparing strings and comparing numbers, so now it's possible to choose between the two at runtime.

With your edited [Fibonacci] regex, it took 64.4 seconds [to go from 0 to 46368]. pcregrep took 270 seconds and perl took 625 seconds.

Now it's down to 48.5 seconds.

With my Fibonacci regex, it takes 12.1 seconds. pcregrep takes 532 seconds (and I don't even want to time it with perl). Interesting — my regex runs much faster than yours under my engine, and your regex runs much faster than mine under PCRE.

Edit: Interesting. The "square root via fourth root" algorithm is actually much slower under my engine than the plain minimal square root. It takes 7.2 seconds for the plain one to go up from sqrt(0)=0 to sqrt(14400)=120, but takes 26.5 seconds for the one that uses 4th root.

teukon commented Mar 17, 2014

You've certainly been busy.

Congratulations on getting the factoring method for multiplication working; that was one devilish regex.

How does your engine do with this regression test?

Author

Davidebyzero commented Mar 17, 2014 •

edited

Loading

Welcome back, teukon.

Congratulations on getting the factoring method for multiplication working; that was one devilish regex.

Thank you :)

How does your engine do with this regression test?

That test isn't all that useful to me. Virtually all the tests either test features I haven't implemented yet (which aren't part of ECMAScript), or tests bugs that Perl or PCRE had but my engine would never have.

What I really want to test is alternatives and the lazy and greedy backtracking of groups, and especially the interaction of all three.

I haven't implemented character classes yet, and \s\S\d\D\w\W, but am about to.

Does the ECMAScript specification say whether ^,$,\b,\B can be followed by a quantifier? This would be useless, of course, but I've noticed Opera supports it, but Firefox and Chrome do not. All three support a quantifier on lookaheads, which is useless too, and I think goes against the specification (as it seems the ECMAScript specification calls lookaheads assertions).

They appear to break the specification in other ways, too. It says "An escape sequence of the form \ followed by a nonzero decimal number n matches the result of the nth set of capturing parentheses (see 15.10.2.11). It is an error if the regular expression has fewer than n capturing parentheses." But all three seem to make \n act as an octal escape in that case; for example, \141 matches a.

teukon commented Mar 18, 2014

That test isn't all that useful to me. Virtually all the tests either test features I haven't implemented yet (which aren't part of ECMAScript), or tests bugs that Perl or PCRE had but my engine would never have.

I see. I wasn't sure how far along you were but your statement that you'd found a bug in PCRE due to the accuracy of your engine in one particular area brought this list to mind.

Does the ECMAScript specification say whether ^,$,\b,\B can be followed by a quantifier?

Yes it does, and they can't. According to the spec., ^, $, \b, and \B are assertions. The constructor only allows quantifiers to follow atoms, not assertions.

I don't know why some browsers are allowing some quantified assertions.

With my Fibonacci regex, it takes 12.1 seconds. pcregrep takes 532 seconds (and I don't even want to time it with perl). Interesting — my regex runs much faster than yours under my engine, and your regex runs much faster than mine under PCRE.

That is interesting. Anyway, well done on getting your engine to be so fast with your Fibonacci regex. Is your engine able to handle very large Fibonacci numbers? If yes, is the engine still quick relative to Perl and PCRE?

teukon commented Mar 18, 2014

They appear to break the specification in other ways, too. It says "An escape sequence of the form \ followed by a nonzero decimal number n matches the result of the nth set of capturing parentheses (see 15.10.2.11). It is an error if the regular expression has fewer than n capturing parentheses." But all three seem to make \n act as an octal escape in that case; for example, \141 matches a.

Ugh! So ugly.

Author

Davidebyzero commented Mar 18, 2014

Well this is weird. In http://regexpal.com/ quantifiers are not allowed after assertions in Firefox, Chrome, or Opera. But in http://regex.alf.nu/ they're allowed after all assertions in Opera, and allowed after lookaheads in all three.

As for all the other bugs in Opera, and the octal escape ugliness in all three browsers, they happen in both sites.

There's another apparent break from the ECMAScript specification. It states that { and } are not valid PatternCharacters. However, in all three browsers, any { or } that is not in the precise form of a quantifier is treated as a literal. For example x{} matches x{}, x{,5} matches x{,5}, and the regexes x{1\}, x\{1}, and x{[1]} all match the string x{1}. (On the other hand, a quantifier that is in the precise form, but in the wrong context, is treated as an error.) The same happens in Perl and PCRE. I think this is a nice feature.

teukon commented Mar 18, 2014

I think this is a nice feature.

We're just going to have to agree to disagree.

Author

Davidebyzero commented Mar 18, 2014

Found another bizarre PCRE bug:

$ echo 'abc def'|pcregrep -o '^.*?\b'
abc
$ echo 'abc def'|pcregrep -o '\babc'
abc
$ echo 'abc def'|pcregrep -o 'abc def\b'
abc def
$ echo 'aaa'|pcregrep -o '^.*?(?=a)'
a
$ echo 'aaa'|pcregrep -o '^.*?(?=aaa)'

$ echo 'aaa'|pcregrep -o '^.*(?=a)'
aa
$ echo 'aaa'|pcregrep -o '^.*?(^|$)'
aaa
$ echo 'aaa'|pcregrep -o '^.*?a'
a

Seems like a lazy search with a minimum count of 0 tries a count of 1 as the first possibility if the match following it is zero-length, only backtracking to a count of 0 for the match if it has to.

Perl does not have this bug:

$ echo 'abc def'|perl -E '@m = <> =~ /^.*?\b/g; print @m[0]'

$ echo 'abc def'|perl -E '@m = <> =~ /^.*\b/g; print @m[0]'
abc def
$ echo 'aaa'|perl -E '@m = <> =~ /^.*?(?=a)/g; print @m[0]'

$ echo 'aaa'|perl -E '@m = <> =~ /^.*(?=a)/g; print @m[0]'
aa
$ echo 'aaa'|perl -E '@m = <> =~ /^.*?(^|$)/g; print @m[0]'

$ echo 'aaa'|perl -E '@m = <> =~ /^.*?a/g; print @m[0]'
a

teukon commented Mar 18, 2014

Weird. I'm surprised that there are so many bugs in common regex engines.

Author

Davidebyzero commented Mar 18, 2014

I implemented character classes :) and of course the first thing I tried was our robust Triples solution. It works perfectly.

teukon commented Mar 19, 2014

I implemented character classes :) and of course the first thing I tried was our robust Triples solution. It works perfectly.

Brilliant.

Author

Davidebyzero commented Mar 22, 2014

Hi teukon!

Well I finally got it releasable and posted my regex engine on github. The name isn't final. There's still some polishing to be done (especially, adding parser error messages), but it is quite usable. Hopefully you'll be able to compile it without any trouble.

teukon commented Mar 22, 2014

Great. I'll put this on my to-do list but I'm currently snowed under with work.

Author

Davidebyzero commented Mar 24, 2014

This gist has gotten very long, so I've started a new one to continue the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment