Skip to content

Instantly share code, notes, and snippets.

@Davidebyzero
Created March 24, 2014 06:36
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Davidebyzero/9735222 to your computer and use it in GitHub Desktop.
Save Davidebyzero/9735222 to your computer and use it in GitHub Desktop.
Continued discussion of Regex Golf Bonus Levels and mathematical regexes (SPOILERS! SPOILERS!)
@Davidebyzero
Copy link
Author

Honestly, I expected a functional solution to be about twice as long as this. You're really done a good job here.

Thank you!

I would attempt to prove this by taking a molecular regex and replacing it with a standard one recursively (following the syntactic definition).

Could you please elaborate on what you mean by "replacing it recursively"? I don't understand what you mean here. I can't think of any general way of replacing any molecular lookahead with an atomic lookahead; as I see it now, I'd have to adapt to the circumstances and adapt the algorithm of the particular regex to accommodate this change, e.g. by making it break up its captures into pieces and doing multiple passes of multiplication with remainders, or refactoring the algorithm so that it has enough room to work.

@teukon
Copy link

teukon commented Apr 8, 2014

Quite naively: I'm thinking the main desire for molecular lookaheads is to have something which is sensitive to the rest of the space, but which doesn't consume space. What I'm thinking is if you have an expression which relies on some group \1 of length m, and matches some subset K of the naturals (or some subset of strings of the form ^x*$ if you prefer), then it should be possible to construct an expression which matches precisely the set of all k - m in the naturals where k is in K. The new expression will depend on exactly how '\1' is referenced in the old expression, which is I guess why you talk about wanting to reason about each expression in turn. However, I imagine that each syntactical element can be reasoned about abstractly.

For example. If we have the expression (?*(x*))(?:[expression A]|[expression B]), where expressions A and B contain no molecular lookaheads and do not reference each other's captures, and we know that expressions A and B (with appropriate group renumbering) are such that (?*(x*))[expression A] and (?(x*))[expression B] are equivalent to (x*)[expression C] and (x*)[expression D] respectively, where expressions C and D contain no molecular lookaheads, then we can reason than (?*(x*))(?:[expression A]|[expression B]) is equivalent to (x*)(?:[expression C]|[expression D]) (with appropriate group renumbering).

I expect similar reasoning can be done for each syntactical element (some easier, some harder). This would expand to recursion on the complexity of the non-molecular part of the expression (along the grammar if you will).

Handling multiple or even nested molecular lookaheads should be easy enough.

The expression within the molecular lookahead is I think much more difficult to manage, but again I would proceed in the manner described above. Some steps will involve breaking one molecular lookahead into several, such as when breaking up a string of terms. I think coping with multiple groups within the same molecular lookahead will be easier than it might at first appear (and should just drop out).

Still, I could be missing something. I've no experience with molecular lookaheads.

@Davidebyzero
Copy link
Author

I expect similar reasoning can be done for each syntactical element (some easier, some harder).

It's not at all obvious to me what any of the other transforms might be, or if such a low-level method is even feasible — but I suspect it isn't, since a conjunction of two syntactical elements, as opposed to a disjunction, can mean something entirely different than either of the elements alone.

But thinking about this topic some more has given me ideas.

Below I will use the term "molecular capture groups" to refer to standard capture groups (as supported by ECMAScript) which are not inside a lookahead relative to where their backrefs are used.

You mentioned subsets of the naturals. It had already occurred to me that if I want to test some set of statements for a subset of values, for example all prime factors of the main number, then instead of capturing the prime factor in a molecular capture group, I could capture an index in a molecular capture group and then translate that index into a prime number inside an atomic lookahead, where 0 -> 2, 1 -> 3, 2 -> 5, 3 -> 7, 4 -> 11, etc. (For the sake of argument it's assumed that translating an index to a prime is possible in a ECMAScript regex; it might not be possible.) But that still consumes some space; up to as much space as the number of unique different numbers you need to capture. Whereas a molecular lookahead would consume no space.

But what occurred to me today is, we could be even more clever (and also stick within the realm of known possibility), and capture two numbers using molecular capture groups. Then, for a maximum cost of consuming M xes, we could subsequently use the pair of backrefs (a,b such that a+b ≤ M) inside an atomic lookahead, to capture a number that can take on (M+1)(M+2)/2 unique values — using a triangular transform, (a+b)(a+b+1)/2+a. We could then use this triangular-ified number directly as a prime (discarding the values that aren't prime). (Triangular numbers grow faster than primes, so this is even more efficient that the index method described in the previous paragraph.) Tetrahedral numbers would be better still — (a+b+c)(a+b+c+1)(a+b+c+2)/6 + (a+b)(a+b+1)/2 + a, where a+b+c ≤ M.

It seems to me this would be of very limited usefulness as far as being a substitute for molecular lookaheads, though. We could take this even farther, but we'll always have to consume some amount of space, which scales along with the number (and we can't even get to the point where it scales exponentially; the best we can do is n-simplex numbers, where n is the number of molecular captures per lookahead-coded number). If the range of numbers captured doesn't have to scale with the main number, then the minimum would be consuming 0 or 1 x, and having a number of unique values equal to the number of capture groups plus one.

Back to the original problem... I think I might have the beginnings of a sketch of a proof (or perhaps just strong indication?) that everything that could be done with molecular lookaheads could be done without them. Something like, divide the starting number by two and capture the remainder (after also dividing all previous captures that are to be used in this scope by two and capturing their halves and remainders). Then you can capture something up to half the original number with just one capture group (or up to the original number with two capture groups), and subsequently do all your calculations modified to pretend they're working on double the remaining number plus the captured remainder. However, there's still the problem of not being able to carry over a new remainder from one iteration of a loop to the next... we'd still have to prove that doesn't make anything impossible that otherwise would be possible.

The above won't work for a pair of captures or more, where a set of statements needs to be tested for every possible pair of values. So this would actually be a situation in which the triangular, tetrahedral, etc. capturing could be useful. Either that, or for a pair of molecular captures, divide the original number by 4 instead of 2, and capture a remainder that can be in the range 0..3 — but it seems that this might have a higher chance of conceivably making some operations impossible than just dividing the number by 2 and having a remainder in the range 0..1.

BTW, I'm pretty sure it's possible to do this idea you talked about a month ago:

I've also been thinking about real numbers. I'd like to handle things like:

x*sqrt(2)=x
xx*sqrt(2)=xx
xxx*sqrt(2)=xxxx
xxxx*sqrt(2)=xxxxx
xxxxx*sqrt(2)=xxxxxxx

Though I think a more fun way to do it would be to take a single number as input, and return as a match the number divided by sqrt(2).

Whether it's possible to do the same thing with, say, π or e, is another question entirely.

@Davidebyzero
Copy link
Author

I've shortened the Abundant numbers regex to 597 characters:

^(?=(?:(?=(xx+?)\1+$)(x+)\2*(?=\2$))+(?!\1+$)(?=(x(x+?))\3*$)(x(x+)))(?=(x(x*)).*(?=\6$)(?=\7*$
)\4\8+$)(?=.*(x((?=\7*$)\3\8+$)))(?=(x*?)(?=\9*$)(x(x*))(?=\12*$)(?=\10+$)\10\13+$)(?=(x(x*))
(?=\14*$)\6\15+$)((?=(x*?(?=\14*$)(?=(x(x+))(?=\15+$)\18*$)((?=\5+$)\12{2}((?=.*(?=\9$)\11{2}x)
|x)|)))(?=(x(x(x*))).*(?=\14$)(?=\22*(?=\22$)(?!\18|(xx+)\25+$))(|(\22+)\27*(?=\27$))(?!((x+)
(?=\29+$)x+)\28*(?!\29+$)\28$)x(x*))(?=.*(?=\30$)(x(x*))(?=(\31*)\24\32*$)\31*$\33)(?=.*(x((?=
\31*$)\22\32+$)))(.*(?=\35$)\17.*$|(?=.*?(?!x\17)(?=\34*$)(x(x*))(?=\37*$)(?=\35+$)\35\38+$)(?=
\37(x*)(?=\14*$)\22\15+$)\39))+$

It's also much faster. The first version took 8.97 hours to go from 0 to 33588; this version took 7.24 minutes.

Edit: Have now tested numbers up to 270000 so far, with no false negatives or positives.

@teukon
Copy link

teukon commented Apr 10, 2014

Nice progress. Under 600 characters and fast! What number did you get up to?

I think I'm really out of the loop when it comes to the molecular lookaheads as I have no experience with them.

The sqrt(2) thing sounds interesting. Is there an easy argument for why you think this can be done?

@Davidebyzero
Copy link
Author

Nice progress. Under 600 characters and fast!

Thanks. :)

I think I'm really out of the loop when it comes to the molecular lookaheads as I have no experience with them.

Well, all the reasoning I've done about them so far is just theoretical, but I suppose I should implement them in RegexMathEngine so we can actually try them out in practice.

The sqrt(2) thing sounds interesting. Is there an easy argument for why you think this can be done?

So we want N/sqrt(2). Basically, we can do calculations in base floor(sqrt(N))+1 with carry.

So say N=1000. Then we work in base 32. The largest digit in base 32 is 31, and we have room to calculate 312=961. (This works if N is a perfect square, too. If N=10000, then we work in base 101, and we can calculate 1002=10000.)

Pretend A,B,C,D,E are base 32 digits (except that E is allowed to exceed the base), where adjacency indicates a multi-digit number, not multiplication. Then we can do (BA)2 = EDC, where C = A2, D = 2×A×B, E = B2. We then take C mod 32, while C/32 is added as a carry to D. Then we do the same for D, adding the carry to E.

So we calculate N2 using this method, and divide the multi-digit result by 2. Then we find a two-digit number which, when squared, is as close as possible to the other multi-digit result we calculated. This gets us sqrt(N2/2) which is what we wanted, so then we convert the two digits into a proper number by multiplying the most-significant digit by the base and adding the least-significant digit.

[Edit]

What number did you get up to?

Up to 1 million with no false negatives or positives, in 107 hours.

Incidentally, a tiny C program using a sieve-initialized sigma-function array goes up to 10 million (enumerating the first 2476737 abundant numbers) in 1.0 seconds, including the time to initialize the array.

@Davidebyzero
Copy link
Author

Okay, I pushed a version of RegexMathEngine that has molecular lookahead implemented. Currently it is always-enabled; in a future commit I'll make it optional, so that it can stick to ECMAScript compatibility by default.

Remember my regex to find the prime(s) of highest multiplicity, at the top of this gist? Well, look how compact it can become using molecular lookahead (119 characters):

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        (?*(x(x+?))\2+$)          (?# \2 = proposed smallest factor of N that has
                                      all the same prime factors as N; \3 = \2-1 )
        (?!                       (?# Assert that \2 has all the same prime factors as N has,
                                      by asserting that there is no prime factor of N, \4,
                                      that is not a factor of \2 )
            (?*
                (x+)\4*(?=\4$)    (?# \4 = iterates through all prime factors of N )
                (?!(xx+)\5+$)     (?# assert that \4 is prime )
            )
            (?!.*(?=\2$)\4+$)
        )
        (?=
            (x(x*))               (?# \6 = N / \2; \7 = \6-1 )
            (?=(\6*)$)            (?# \8 = N - \6 )
            (?=\3+$)              (?# must match if \6 <  \2; otherwise doesn't matter )
            \3\7+$                (?# must match if \6 >= \2; otherwise doesn't matter )
        )
        \8                        (?# Leave \6 as the new N for the next iteration of the loop )
    )*
    (x*)                          (?# \9 = the prime with the highest multiplicity;
                                      if it's a tie, the product of the tied primes )
)
\9

It's also enormously faster. It can process all the numbers from 0 to 4250 in the same amount of time as the regex at the top of this gist processes the numbers from 0 to 500.

BTW, I'm pretty sure I've implemented molecular lookahead correctly, but not having it in any other engine makes it hard to know for sure. If you find it not behaving the way you expect in some regard, please let me know because it may or may not be a bug.

@teukon
Copy link

teukon commented Apr 11, 2014

So say N=1000. Then we work in base 32. The largest digit in base 32 is 31, and we have room to calculate 312=961. (This works if N is a perfect square, too. If N=10000, then we work in base 101, and we can calculate 1002=10000.)

Pretend A,B,C,D,E are base 32 digits (except that E is allowed to exceed the base), where adjacency indicates a multi-digit number, not multiplication. Then we can do (BA)2 = EDC, where C = A2, D = 2×A×B, E = B2. We then take C mod 32, while C/32 is added as a carry to D. Then we do the same for D, adding the carry to E.

Aha, that's a cute technique. I think I'm starting to see what you're getting at.

Remember my regex to find the prime(s) of highest multiplicity, at the top of this gist? Well, look how compact it can become using molecular lookahead:

Yes, it was partly due to this that I didn't expect a solution to abundant numbers to be so short. It is particularly elegant when you are allowed to make use of molecular lookaheads.

BTW, I'm pretty sure I've implemented molecular lookahead correctly, but not having it in any other engine makes it hard to know for sure. If you find it not behaving the way you expect in some regard, please let me know because it may or may not be a bug.

Will do, although I must admit I'm not in the frame of mind to test this at present. I guess if there are some subtle bugs you can claim "reference implementation" whereby all differing engines are wrong by default.

@Davidebyzero
Copy link
Author

How would you like to do a little golf challenge (one that's much simpler than Fibonacci)?

Make a regex that works on the domain of powers of 2, which returns the base 2 logarithm as its match. For example, for ^x{512}$ it should return x{9} as its match, and for ^x$ it should return an empty match.

Here are test scripts for Perl, PCRE, and RegexMathEngine. The RegexMathEngine script requires that you pull the latest version. The test scripts don't verify that the output is correct yet.

My best solution so far has md5sum c55d01ff6313d223df7422b1b6886b03.

@teukon
Copy link

teukon commented Apr 12, 2014

Sounds like fun! I'm too busy to get into this right now though; I have a lot of work to get through by Wednesday.

Still, this seems like it would be a good level for me. I look forward to tackling it.

@Davidebyzero
Copy link
Author

Cool :) I look forward to seeing what you come up with.

Meanwhile I've superseded my previous solution with one having md5sum b31a12cf9a9982d7d3204d198f87ba4f.

@Davidebyzero
Copy link
Author

I've found some errors in this paper on the density of abundant numbers. On "page 142" (page 6 of the PDF) there's a table of the number of non-deficient numbers in a range of intervals. My C program agrees with those up to 108, but finds that there are 247610960 non-deficient numbers under 109, whereas the paper says 247610965. Looks like Deléglise may have used 32-bit math without properly checking for overflow (I'm using 64-bit math).

Also, the table leaves out the number for [1, 106] and accidentally shows the one for [1, 105] next to [1, 106]. And it uses [a, b] to represent open intervals, but [a, b] is the notation for a closed interval — the paper's [1, 108], for example, does not include 108 itself.

@teukon
Copy link

teukon commented Apr 14, 2014

You'd be surprised of the errors that crop up in mathematics papers. I find it a bit of a stretch to describe the closed interval thing an error, but the notation is certainly very common so this is at least clumsy. The 32-bit thing could quite easily be simply wrong.

It's easy to overlook things. For example: Subtraction (182):

^(.*)(.*). \2. \1$

@Davidebyzero
Copy link
Author

^(.*)(.*). \2. \1$

Nice. Figures that the first problem would be the one we didn't get around to revisiting (until now, at least).

BTW, I prefer ^(x*)(.+)- \2= \1$, because it's easier to understand at a glance.

The 32-bit thing could quite easily be simply wrong.

While I have virtually no doubt that it's true that the asymptotic density of abundant numbers is between 0.2474 and 0.2480, the author of the paper may have not correctly proven this. I haven't taken the time to try to puzzle through the math in that paper, but the result involved 100 hours of CPU time and the entire time it may have been using a buggy algorithm.

Probably with modern CPU power it'd be possible to tighten those bounds quite a bit.

What would be much cooler, however, would be to prove that for every n, |An(2) - A(2)| < f(n) for some f(n) that asymptotically approaches zero, where an f(n) for which this is true is found.

@Davidebyzero
Copy link
Author

Here's a direct port of the molecular lookahead version of the "prime(s) of highest multiplicity" regex to a standard ECMAScript regex:

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        (?=(x*)\2(x?))                    (?# \2 = N / 2;  \3 = N % 2 )

        ((x*?)x??)                        (?# {\4, \5} = x\4\5 encoded as a pair to save space )
        (?=                               (?# require that \6 = x\4\5 is a factor of N )
            .*(?=\2\3$)
            (?=(x(\4\5))+(x*))            (?# \6 = x\4\5; \7 = \6-1; \8 = remainder )
            (?=.*(?=\6)\8(x*))            (?# \9 = \6 - \8 )
            \3\9\6*$
        )
        (?!                               (?# Assert that \6 has all the same prime factors as N has, by asserting
                                              that there is no prime factor of N, \12, that is not a factor of \6 )
            ((x*)x?)                      (?# {\10, \11} = x\10\11 encoded as a pair to save space )
            (?=                           (?# require that \12 = x\10\11 is a factor of N )
                .*(?=\2\3$)
                (?=(x\10\11)+(x*))        (?# \12 = x\10\11; \13 = remainder )
                (?=
                    .*(?=\12$)
                    (?!(xx+)\14+$)        (?# assert that \12 is prime )
                    \13(x*)               (?# \15 = \12 - \13 )
                )
                \3\15\12*$
            )
            (?!.*(?=\6$)\12+$)
        )
        ((x*)x?)                          (?# {\16, \17} = \16\17 encoded as a pair to save space )
        (?=                               (?# require that \18 = x\16\17 = N / \6 )
            .*(?=\2\3$)
            (?=(x(\16\17))+(x*))          (?# \18 = x\16\17; \19 = \18-1; \20 = remainder )
            (?=.*(?=\18)\20(x*))          (?# \21 = \18 - \20 )
            (?=\3\21\18*$)

            (?# the following must match if \18 <  \6; otherwise it doesn't matter )
            (?=\18\7*(x*))                (?# \22 = remainder )
            (?=.*(?=\7)\22(x*))           (?# \23 = \7 - \22 )
            (?=\3\23\7*$)

            (?# the following must match if \18 >= \6; otherwise it doesn't matter )
            (?=\6\19*(x*))                (?# \24 = remainder )
            (?=.*(?=\19)\24(x*))          (?# \25 = \19 - \24 )
            (?=\3\25\19*$)
        )
        .*(?=\18$)                        (?# Leave \18 as the new N for the next iteration of the loop )
    )*
    (x*)                                  (?# \26 = the prime with the highest multiplicity;
                                              if it's a tie, the product of the tied primes )
)
\26

And here's an alternative direct port:

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        ((x*?)x??)                        (?# {\2, \3} = x\2\3 encoded as a pair to save space )
        (?=                               (?# require that \4 = x\2\3 is a factor of N )
            (?=(x(\2\3))+(x*))            (?# \4 = x\2\3; \5 = \4-1; \6 = remainder )
            (?=.*(?=\4)\6(x*))            (?# \7 = \4 - \6 )
            .*(?=\2$)
            \7\4*$
        )
        (?!                               (?# Assert that \4 has all the same prime factors as N has, by asserting
                                              that there is no prime factor of N, \10, that is not a factor of \4 )
            ((x*)x?)                      (?# {\8, \9} = x\8\9 encoded as a pair to save space )
            (?=                           (?# require that \10 = x\8\9 is a factor of N )
                (?=(x\8\9)+(x*))          (?# \10 = x\8\9; \11 = remainder )
                (?=
                    .*(?=\10$)
                    (?!(xx+)\12+$)        (?# assert that \10 is prime )
                    \11(x*)               (?# \13 = \10 - \11 )
                )
                .*(?=\2\8$)
                \13\10*$
            )
            (?!.*(?=\4$)\10+$)
        )
        ((x*)x?)                          (?# {\14, \15} = x\14\15 encoded as a pair to save space )
        (?=                               (?# require that \16 = x\14\15 = N / \4 )
            (?=(x(\14\15))+(x*))          (?# \16 = x\14\15; \17 = \16-1; \18 = remainder )
            (?=.*(?=\16)\18(x*))          (?# \19 = \16 - \18 )
            (?=(.*)(?=\2\14$)\19\16*$)    (?# \20 )

            (?# the following must match if \16 <  \4; otherwise it doesn't matter )
            (?=\16\5*(x*))                (?# \21 = remainder )
            (?=.*(?=\5)\21(x*))           (?# \22 = \5 - \21 )
            (?=\20\22\5*$)

            (?# the following must match if \16 >= \4; otherwise it doesn't matter )
            (?=\4\17*(x*))                (?# \23 = remainder )
            (?=.*(?=\17)\23(x*))          (?# \24 = \17 - \23 )
            (?=\20\24\17*$)
        )
        .*(?=\16$)                        (?# Leave \16 as the new N for the next iteration of the loop )
    )*
    (x*)                                  (?# \25 = the prime with the highest multiplicity;
                                              if it's a tie, the product of the tied primes )
)
\25

And the latter, whittled down to 185 characters:

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        (?=
            ((x*?)x??)            (?# {\2, \3} = x\2\3 encoded as a pair to save space;  tail = N - \2 )
            (?=x\3(x(\2\3))+$)    (?# require that \4 = x\2\3 is a factor of N;  \4 = x\2\3; \5 = \4-1 )
            (?!                   (?# Assert that \4 has all the same prime factors as N has, by asserting
                                      that there is no prime factor of N, \8, that is not a factor of \4 )
                ((x*)x?)                    (?# {\6, \7} = x\6\7 encoded as a pair to save space;  tail = N - \2 - \6 )
                (?=                         (?# require that \8 = x\6\7 is a factor of N )
                    (?=(x\6\7)(\8*(x*)))    (?# \8 = x\6\7; \10 = remainder; \9 = N - \2 - \6 - \8 )
                    (?=
                        \9                  (?# tail = \8 )
                        (?!(xx+)\11+$)      (?# assert that \8 is prime )
                        \10(x*)             (?# \12 = \8 - \10 )
                    )
                    .*(?=\2\6$)             (?# tail = \2\6 )
                    \12\8*$
                )
                .*(?=\4$)
                (?!\8+$)
            )
        )
        (?=
            (x(x*))               (?# \13 = N / \4; \14 = \13-1 )
            (?=(\13*)$)           (?# \15 = N - \13 )
            (?=\5+$)              (?# must match if \13 <  \4; otherwise doesn't matter )
            \5\14+$               (?# must match if \13 >= \4; otherwise doesn't matter )
        )
        \15                       (?# Leave \13 as the new N for the next iteration of the loop )
    )*
    (x*)                          (?# \16 = the prime with the highest multiplicity;
                                      if it's a tie, the product of the tied primes )
)
\16

@Davidebyzero
Copy link
Author

I found another way of doing Multiplication that is very very close to being as short:
^(x(x*)).(?=x*=(x(x*))(?=\3*$)(?=\2+$)\2\4*$)\3=\1+$ (52 characters)
compare to the one that has remained uncontested for six weeks:
^(x(x*))\*?(x*)\*?\1=(?=(\1\3)+$)\1(\2\3(?!\4+$))+$ (51 characters)

This new method works by doing division by one of the factors and comparing the quotient with the other factor.

@Davidebyzero
Copy link
Author

Well, I suppose it was inevitable that at some point I'd start finding length optimizations that result in significant slowdown. The Abundant numbers regex is now down to 511 characters:

^(?=(?:(?=(xx+?)\1+$)(x+)\2*(?=\2$))+(?!\1+$)(?=(x(x+?))\3*$)(x(x+)))(?=(x(x*)).*(?=\6$)(?=\7*$
)\4\8+$)(?=.*(x((?=\7*$)\3\8+$)))(?=(x*?)(?=\9*$)(x(x*))(?=\12*$)(?=\10+$)\10\13+$)(?=(x(x*))
(?=\14*$)\6\15+$)((?=(x*?(?=\14+$)(?=\14+?(?=((xx(x*))(?=\15+$)\19*$))(x+).*(?=\14$)\21*(?=\21$
)(?!(xx+)\22*(?!\19+$)\22$)\19+$)((?=\5+$)\12{2}((?=.*(?!\9)\11{2})|x)|)))(?=.*(?=\21)x(x(x*))
(?=\25*$)\20\26*$)(?=.*(x((?=\25*$)\19\26+$)))(.*(?!\27)\17|(?=.*?(?!x\17)(?=\27*$)(x(x*))(?=
\30*$)(?=\28+$)\28\31+$).*(?=\30\18$)))+$

The non-slowed-down version is at 550 characters.

@Davidebyzero
Copy link
Author

I beat our Multiplication score!
^x(x*)\*?(x*)\*?x\1=(?=(x(\1\2))(\3*)\1\4*$)\3*$\5 (50 characters)

Turns out you were right after all when you wrote this:

I can't find a way to improve on 51 characters, but I highly doubt that 51 characters is optimal.

@teukon
Copy link

teukon commented Apr 18, 2014

Nice work! The 50 character solution to Multiplication is particularly pleasing to see.

I'm afraid I'm somewhat losing interest in the regex stuff these days, partly because I'm currently focused on writing a paper. I can see that you're still going strong though.

I might have to pass on your little golf challenge. It's on my to-do list but is lower than one might hope.

Also, it's a bit of a shame that, even with Erling adding the bonus problems to his site, that no-one seems to have taken interest (that I can find). Oh well.

@Davidebyzero
Copy link
Author

Thanks for your honesty.

It's been a privilege, and great fun, playing Regex Tennis with you. And I don't think I will ever understand how you were able to come up with such heavily optimized solutions to Perfect squares and Multiplication so quickly.

I still find this pretty fascinating. I hope I'll still feel that way when/if your interest returns.

I'd really like to know the answer to such questions such as: Is it possible for an ECMAScript regex to calculate prime(n) (where n is the index of a prime) where the answer can be as big as the available space? I know it's possible with O(n2 ln n) space, but can't think of a way to do it with O(n ln n) space (equivalently — is it possible to convert a prime into an index with no extra space available?) And more generally, I'd like to see at least one thing proved impossible to do. Or (more astoundingly) a proof that all computable functions are possible to do in a regex (obviously excluding those whose output can exceed their input).

Oh, and I think you've probably guessed where I'm leading, with that little golf challenge I posed to you. Matching a completely generalized Powers, ^x+\^x*=x+$, is indeed possible. It's funny how hard I thought it'd be a month ago:

Match[ing] statements of the form ^x+\^n=x+$ should be easy, but matching ^n\^x+=x+$ or even ^x+\^x+=x+$, without scratch space, seems like a very hard problem (maybe impossible?). But perhaps there's a way to do it using a similar trick to my original Multiplication.

@teukon
Copy link

teukon commented Apr 18, 2014

It's been a privilege, and great fun, playing Regex Tennis with you. And I don't think I will ever understand how you were able to come up with such heavily optimized solutions to Perfect squares and Multiplication so quickly.

Kind words indeed. I've been equally impressed by some of your work, notably the abundant numbers solution, and the beautiful solution to Dominoes.

And more generally, I'd like to see at least one thing proved impossible to do. Or (more astoundingly) a proof that all computable functions are possible to do in a regex (obviously excluding those whose output can exceed their input).

I expect there are some examples which are provably impossible. It would be nice to have a very simple one but I've run out of ideas.

Oh, and I think you've probably guessed where I'm leading, with that little golf challenge I posed to you. Matching a completely generalized Powers, ^x+^x*=x+$, is indeed possible. It's funny how hard I thought it'd be a month ago:

Yeah. Well done tying up this loose end. I know how you feel though, I remember thinking that square numbers would be incredibly difficult to handle.

This has been fun!

@Davidebyzero
Copy link
Author

The split of Fibonacci regexes between these two designs:

  1. Find a number such that when squared, multiplied by 5, and 4 is added or subtracted, is a perfect square — resulting in finding the Fibonacci that has half the Fibonacci index of the target number
  2. Build up the Fibonacci sequence, up to the one that has one-quarter the index of the target number

... is no longer just about brevity versus speed. If I am not mistaken, only algorithm #2 can be used to return the Fibonacci index as a match. (This can't be done for the Fibonacci numbers 2 and 3, which have indexes 3 and 4 — but this would be easily remedied by turning it into a problem of matching statements of the form ^F_x*=x*$.)

(Incidentally — it would be possible to implement algorithm #1 as going directly to the target number, by using multi-"digit" arithmetic with carry, as I described earlier. But I doubt this would be possible to do in 165 characters or less.)

@Davidebyzero
Copy link
Author

In the interest of tying up another loose end (from Feb 25, 2014):

I had another thought about your powers of 6 problem. Have you considered powers of 3?

I know this can be solved by checking that every factor is a multiple of 3, but I don't think there's a way of doing this that is as short as the loop (imagine: "How the hell are you able to score 181 on Powers of three!?!"). I think I prefer powers of 4 to the other options, but just wanted to share this thought.

A solution to "Powers of 3" that is analogous to the ^(?!(x(xx)+|)\1*$) (18 character) solution of "Powers of 2":
^(?!(x(xxx)+x?|xx|)\1*$) (24 characters)

A solution to "Powers of 3" that is analogous to the ^((x+)(?=\2$))*x$ (17 character) solution of "Powers of 2":
^((x+)\2(?=\2$))*x$ (19 characters)

So yes, the former is not as short as the latter, but it's really close. "Powers of 4", or powers of any non-prime for that matter, would not be solvable at all using the former method.

@Davidebyzero
Copy link
Author

Davidebyzero commented Dec 15, 2018

I've found some errors in this paper on the density of abundant numbers. On "page 142" (page 6 of the PDF) there's a table of the number of non-deficient numbers in a range of intervals. My C program agrees with those up to 108, but finds that there are 247610960 non-deficient numbers under 109, whereas the paper says 247610965. Looks like Deléglise may have used 32-bit math without properly checking for overflow (I'm using 64-bit math).

Also, the table leaves out the number for [1, 106] and accidentally shows the one for [1, 105] next to [1, 106]. And it uses [a, b] to represent open intervals, but [a, b] is the notation for a closed interval — the paper's [1, 108], for example, does not include 108 itself.

So it turns out that even back when I wrote this, there was already a paper that superseded it: On the density of abundant numbers, Kobayashi (2010). It just didn't show up in my searches at the time.

This is an excellent paper. It goes into depth on things I'd wished were in Deléglise's 1998 paper. It derives the time-complexity of computing n digits of the abundant number density to be O(een). It includes the full C++ source code (and describes their workings) of the program that computes bounds on the density; I tried it, and it works! I was able to compute bounds narrowed down to [0.2476167, 0.2476528] in 245 minutes, single-threaded, by passing it N=200000000000000 on the command line. That's weird, because it's almost as narrow as the new bounds established by the paper: [0.2476171, 0.2476475]. I would have thought it would take a lot more computing time to reach that.

Edit: After 23.54 hours of computation with N = 5000000000000000, the program gave a result of [0.24761740, 0.24764507]. That's even tighter than the bounds stated in the paper.
Edit 2: After 40.48 hours of computation with N = 15000000000000000, it gave a result of [0.24761748, 0.24764422].

FWIW, it also gets the closed-open interval right, stating them as [a, b). It does have at least one error though:

any finite union of multiple sets is a periodic set

on page 6. I think he meant "any finite union of multiple periodic sets is a periodic set".

I do wish the paper discussed what limitation the use of native floating point data types (80-bit long double) has on the best bounds achievable by the program. There is some discussion on the matter starting at page 112, but if I understand correctly, this only describes how a problem regarding precision was solved. In scanning through the paper, I didn't find anything explicitly discussing limitations on precision that still apply. So I don't know whether there's a point at which the program will start narrowing the bounds without rigorous mathematical justification.

The same mathematician also co-wrote an article in 2014 about the error term in the count of abundant numbers. This pertains to another thing I said earlier here:

What would be much cooler, however, would be to prove that for every n, |An(2) - A(2)| < f(n) for some f(n) that asymptotically approaches zero, where an f(n) for which this is true is found.

The function they came up with, though, is very very pessimistic (extremely slow to converge, much slower than the density itself seems to converge). It is a start though, I suppose. The article itself is behind a paywall, unfortunately.

@Davidebyzero
Copy link
Author

Davidebyzero commented Dec 15, 2018

I thought of a pretty good analogy for what constructing ECMAScript regexes for functions on the natural numbers is like. I think it's much like doing straightedge and compass geometric construction. There are many things that are impossible to construct with those tools, but that hasn't stopped mathematicians from devoting much time to the subject.

I suspect that only a strict subset of primitive recursive functions are possible but at this point still can't prove it. I'm also pretty sure that with unbounded scratch space, all primitive recursive functions become possible (this would probably be easy to prove).

@Grimy
Copy link

Grimy commented Jan 31, 2019

Sorry to intrude, but there's still one loose end: the log2 golf challenge. I have a length 108 73 66 solution (8cf2ef821f9428e4a611b457ccc57344). What's the length of your b31a12cf9a9982d7d3204d198f87ba4f solution?

@Davidebyzero
Copy link
Author

Hello, Grimy, and welcome!

Wow, that is awesome. My b31a12cf9a9982d7d3204d198f87ba4f is 69 bytes, so you have beaten me! Very well done. I'd like to exchange our regexes, but privately, to maintain the viability of this challenge for a little longer. I'd like to see your earlier versions too!

I recently forked the log2 regex into:

  1. A version that strictly rounds down, which is 93 bytes with md5sum cef4f7a85f9703825c97b01012358d4d, and like the original, returns no-match for zero
  2. A base 10 version that returns the number of digits, which is 112 bytes with md5sum cc065c2c0e212933c0bcb1e3e9717a86 (it's the same as strictly rounded-down logarithm plus one – except with an input of zero, for which both regexes return no-match, because you can't return 1 with an input of 0)
  3. Logarithm in base N (passed as two comma-delimited unary numbers, base first, and including the base and comma in the returned result) at 165 bytes with md5sum df2b6dab8eb3fdf9a6c2d2e4e52257f3 (have not made a strictly-rounding-down or digit-counting version of this yet).

Tying up another loose end, I also wrote a regex to take a single unary number as input, and divide by sqrt(2) (rounding down) and return the result as a match, using an algorithm very similar to the one I sketched above. It uses molecular lookahead. The first fully correct version was 1159 bytes and I got it down to 849. I plan on porting it to pure ECMAScript, but I expect it will roughly double in length.

@Grimy
Copy link

Grimy commented Feb 4, 2019

Thanks!

I am now down to 57 chars on the log2 regex (md5sum c58bef1340954f1bbaf02c5790f33a01, --npcg+ only, --neam independent). I’ll gladly share it if you want.

I haven’t yet tried applying this algorithm to the base 10 or rounding down version of the task, but it should give significant improvements.

EDIT:

  • 76 chars for the version that strictly rounds down (or 74 if I’m allowed to return 0 instead of no match for a 0 input).
  • 97 chars for the base 10 version

@Davidebyzero
Copy link
Author

Davidebyzero commented Feb 7, 2019

Wow! 57 chars. That's incredible. I don't want you to just show it to me right away though; I'd like some time to try to figure it out. Can't work on it right away but I plan on it. I would indeed like to see them, please!

I did have an idea on how to make my floor-logarithm regexes much shorter. It did a little worse than the technique I already used, and frankly I'm very happy about that. The one I'd already used is much more interesting, and the one I just now tried would trivialize it if it did better.

And I'm also quite impressed to see that 76-57=19, a smaller difference than my 93-69=24.

@Davidebyzero
Copy link
Author

This discussion has been continued in the Mathematical Regexes Stack Exchange chat room.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment