Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Continued discussion of Regex Golf Bonus Levels and mathematical regexes (SPOILERS! SPOILERS!)
@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 24, 2014

So, in the quest towards writing an ECMAScript regex that identifies abundant numbers (or perfect numbers), here is one that finds the prime (or primes) with the highest multiplicity. I'm sure you'd come up with something much shorter, but I can't think of any way at present.

It works by repeatedly dividing the number by its smallest factor that is divisible by all the number's current prime factors. It takes 103 seconds to go from 0 to 1000 on the current version of my engine (which still has no optimization for multiplication). Perl takes 110 63.8 seconds to do the same. Looks like I have some work to do, optimizing my engine better.

Counting from 0 to 500, Perl is also faster than my engine at evaluating this regex, taking 11.5 7.55 seconds when mine takes 12.7 seconds.

(?# N = main number )
^
(?=
    (
        (?=
            (x(x*?))                 (?# \2 = smaller factor; \3 = \2 - 1 )
            (?=
                (x*)                 (?# \4 = proposed value such that \2 * \2\4 = N )
                (?=(\2\4)+$)         (?# \5 = \2\4, the larger factor )
                \3
                (?!\5+$)
                (?:
                    \3\4
                    (?!\5+$)
                )*$
            )
            (?# We shall now assert that \2 has all the same prime factors as N; this will take more than one step )
            (?# Step 1: Assert that if \5 is prime, then \2 is divisible by it; do this by asserting that
                either \5 is not prime, or \2 is divisible by \5, which can only happen if \2 == \5, since \2 <= \5 )
            (?=
                .*(?=\5$)
                (?:
                    \2$              (?# either \5 == \2, )
                |
                    (xx+)\6+$        (?# or \5 is not prime )
                )
            )
            (?# Step 2: Assert that any prime factor of \5 is also a factor of \2 )
            (?!
                (x+)                 (?# \7 = proposed prime factor )
                (?=
                    .*(?=\5$)
                    \7+$
                )
                (?=
                    .*(?=\7$)
                    (?!(xx+)\8+$)    (?# assert that \7 is prime )
                )
                (?!
                    .*(?=\2$)
                    \7+$
                )
            )
        )
        .*(?=\5)
    |
        (?=
            (x(x*))                  (?# \9 = smaller factor; \10 = \9 - 1 )
            (?=
                (x*)                 (?# \11 = proposed value such that \9 * \9\11 = N )
                (?=(\9\11)+$)        (?# \12 = \9\11, the larger factor )
                \10
                (?!\12+$)
                (
                    \10\11
                    (?!\12+$)
                )*$
            )
            (?# We shall now assert that \12 has all the same prime factors as N,
                by asserting that any prime factor of \9 is also a factor of \12 )
            (?!
                (x+)                 (?# \14 = proposed prime factor )
                (?=
                    .*(?=\9$)
                    \14+$
                )
                (?=
                    .*(?=\14$)
                    (?!(xx+)\15+$)   (?# assert that \14 is prime )
                )
                (?!
                    .*(?=\12$)
                    \14+$
                )
            )
        )
        .*(?=\9)
    )*
    (.*)                             (?# \16 = the prime with the highest multiplicity; if it's a tie, the product of the tied primes )
)
\16
@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 24, 2014

BTW, I recompiled pcregrep with a much newer version of GCC (4.8.1, whereas I was using 3.4.5 before), and with JIT support (which the previous one I compiled didn't have) and 64-bit instead of 32-bit. It no longer silently gives up on things like large powers of 2, and it's faster than before. It also no longer has this bug:

Point of interest:

^(?=((?=(x*))x)+)\2$

matches ^x$ with Perl but not with PCRE.

i.e., it now matches ^x$ with that regex.

It also now runs your early Fibonacci regex flawlessly (the 483 character md5sum 4d9dc4ee9f8eb70ab497dc613d683fb1 one, on which it had lots of false negatives and positives before).

It still has this bug, however:

Seems like a lazy search with a minimum count of 0 tries a count of 1 as the first possibility if the match following it is zero-length, only backtracking to a count of 0 for the match if it has to.

Note, this is the same version of pcregrep, just compiled with a different compiler.

I've also installed a 64-bit build of Perl (replacing the 32-bit build of the same version) and it's much faster. I've edited the above post accordingly.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 24, 2014

Well I finally got it releasable and posted my regex engine on github. The name isn't final. There's still some polishing to be done (especially, adding parser error messages), but it is quite usable. Hopefully you'll be able to compile it without any trouble.

I finally got around to trying this. I built the code with "make", no problem, but the resultant executable "regex" simply seg faults when I run it. I haven't looked at the code yet. Thoughts?

So, in the quest towards writing an ECMAScript regex that identifies abundant numbers (or perfect numbers), here is one that finds the prime (or primes) with the highest multiplicity. I'm sure you'd come up with something much shorter, but I can't think of any way at present.

Does this mean that you're confident that such a regex exists? Have you discovered the prime with the highest multiplicity to be useful? The last thing I did on the perfect numbers front was to sketch how I thought an algorithm might go; I hadn't convinced myself that the problem could be solved.

I like your algorithm for finding a prime with the highest multiplicity; no superior method leaps to mind. If I understand your code correctly (nice comment structure by the way) the result is the largest factor for which no non-trivial factor has greater multiplicity, but a prime with maximal multiplicity can be trivially extracted from this.

Good luck with your quest! Just as a reminder: "I don't intend to try and build a regex for this one but am happy to discuss potential algorithms. Fibonacci numbers was my limit."

BTW, I recompiled pcregrep with a much newer version of GCC (4.8.1, whereas I was using 3.4.5 before), and with JIT support (which the previous one I compiled didn't have) and 64-bit instead of 32-bit. It no longer silently gives up on things like large powers of 2, and it's faster than before.

Cool.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 24, 2014

I finally got around to trying this. I built the code with "make", no problem, but the resultant executable "regex" simply seg faults when I run it. I haven't looked at the code yet. Thoughts?

Oops, it doesn't currently handle being run without any parameters. Just run it with --help as a parameter.

Since the help doesn't currently give examples:
Numeric mode with ^x*$ as the domain is accomplished using the parameter -nx, -n x, or --num=x.
To test perfect squares from 0 to 10000, you could use the following command lines:
./regex -nx '^((?=(xx+?)\2+$)((?=\2+$)(?=(x+)(\4+$))\5){2})*x?$' -t 0..10000
./regex -nx -f"regex for matching squares - factoring method.txt" -t 0..10000
./regex -nx -f "regex for matching squares - factoring method.txt" -t 0..10000
./regex -nx --file="regex for matching squares - factoring method.txt" -t 0..10000
Or to test a single number:
./regex -nx -f"regex for matching squares - factoring method.txt" -t 15129
echo 15129 | ./regex -nx -f"regex for matching squares - factoring method.txt"
Or to test a string:
echo x | ./regex '^(?=((?=(x*))x)+)\2$'
To test a string and show the exact match:
echo number12345embedded | ./regex -o '\d+'

I like your algorithm for finding a prime with the highest multiplicity; no superior method leaps to mind. If I understand your code correctly (nice comment structure by the way) the result is the largest factor for which no non-trivial factor has greater multiplicity, but a prime with maximal multiplicity can be trivially extracted from this.

Thanks! Yes, I haven't written the extra regex code to extract an actual prime from this, but as you said it'll be trivial.

Does this mean that you're confident that such a regex exists?

Not quite confident per se, but I do think there's a good chance of it.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 24, 2014

Oops, it doesn't currently handle being run without any parameters. Just run it with --help as a parameter.

Thanks, that sorted me out.

I tried running your engine with my Fibonacci numbers regex. In numeric mode it matches 21 but in normal operation it doesn't match. Is this to be expected?

Or to test a string:
echo x | ./regex '^(?=((?=(x*))x)+)\2$'

Ha, it actually matches. Nice.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 24, 2014

I tried running your engine with my Fibonacci numbers regex. In numeric mode it matches 21 but in normal operation it doesn't match. Is this to be expected?

Oops, thanks, this was a bug. I've fixed it and pushed the fix to github. Now it works:

$ perl -E 'say "x" x 21' | regex.exe -f"regex for matching Fibonacci numbers - teukon's.txt" | perl -E '$s=<>; chomp($s); say length($s)'
21
$ regex.exe -nx -f"regex for matching Fibonacci numbers - teukon's.txt" -t 21
21 -> 21
@teukon

This comment has been minimized.

Copy link

teukon commented Mar 24, 2014

That was quick. I've just built your latest version and it appears to be working now.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 25, 2014

Looks like I have some work to do, optimizing my engine better.

Well I did it. I changed the backtracking stack from using a simple linked-list (allocating memory for each individual item pushed onto the stack) to using chunks of contiguous memory. This has made it about twice as fast or better. This is a bigger improvement than I expected!

It now does my Fibonacci regex from 0 to 46368 in 4.66 seconds, for example, whereas it did it in 12.1 seconds before.

Edit:
Checking for false negatives only, my Fibonacci regex went up to 7778742049 successfully, but failed to match 12586269025 and 20365011074. Then it successfully matched 32951280099 and 53316291173. And then failed to match 86267571272 and 139583862445. Not sure at this point if it's the regex's fault or my engine's fault, but I suspect the former.

Checking your Fibonacci regex in the same way, for some reason it starts allocating a crazy amount of memory around 433494437 or so, then crashes after successfully matching everything up to 1134903170 in 57 seconds, due to running out of memory (on my system with 16 GB of RAM). Maybe yours is more susceptible to backtracking? Naw, more likely a bug in my engine. But time to sleep now; I'll figure it out tomorrow.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Oh dear. It seems I'm incapable of writing anything remotely resource friendly.

I hope you're able to solve your false negatives problem. Might it be that your chunk size and loop-breaking condition are at odds with the asymptotic behaviour of the sequence? I think this could explain an eventual "two on, two off" pattern in the false negatives.

If you're having trouble pinning the problem down, I can double-check this aspect of your algorithm with a bit of algebra.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 25, 2014

Oh dear. It seems I'm incapable of writing anything remotely resource friendly.

No, no, that's not it at all. It turned out to be a bug in my engine. I forgot to free the allocated stack chunks when popping them. Since it's a small amount of memory, my regex was able to keep chugging along, but I think there was a coincidence in the execution of your regex which caused some pushing and popping across the threshold of a chunk, resulting in new chunks being allocated very frequently.

(Yep, I couldn't sleep.)

Edit: Pushed the fix onto github. This results in your regex being allowed to continue on and test large numbers (up to 32951280099 so far successfully).

I think this could explain an eventual "two on, two off" pattern in the false negatives.

FWIW, the pattern is holding. It matched 225851433717 and 365435296162, and did not match 591286729879.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Yep, I couldn't sleep.

Sorry to hear that. I guess you're enjoying yourself though.

Edit: Pushed the fix onto github.

Ok. I've updated.

FWIW, the pattern is holding. It matched 225851433717 and 365435296162, and did not match 591286729879.

Then I very much suspect that a bit of algebra and a computationally-trivial tweak to the loop-breaking condition will solve the problem. Would you like me to investigate?

By the way: I'm impressed with the extremely large numbers you're able to test these days.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 25, 2014

By the way: I'm impressed with the extremely large numbers you're able to test these days.

Thanks!

Your regex has gotten up to 591286729879 successfully so far and its process has still got only 11 MB of RAM allocated at peak. (Edit: Now 1548008755920, and 15 MB.)

In my regex the pattern of two on, two off, has held precisely, up to 2504730781961 so far. If you'd like to investigate it algebraically, I'd welcome your help — I doubt I know how to do that. The way I'm going to investigate it is by writing a program that simulates what the regex does.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Your regex has gotten up to 591286729879 successfully so far and its process has still got only 11 MB of RAM allocated at peak.

Nice.

In my regex the pattern of two on, two off, has held precisely, up to 2504730781961 so far. If you'd like to investigate it algebraically, I'd welcome your help — I doubt I know how to do that. The way I'm going to investigate it is by writing a program that simulates what the regex does.

Sure. Which regex precisely are we talking about? "regex for matching Fibonacci numbers - fastest.txt"?

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 25, 2014

Sure. Which regex precisely are we talking about? "regex for matching Fibonacci numbers - fastest.txt"?

That's the one.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Oh, and I just noticed an error in a comment.
(?# \43 = 2*\34 - \32; \44 = 2*\34 - 1 )
should be
(?# \43 = 2*\34 - \32; \44 = 2*\34 - \32 - 1 )
or possibly
(?# \43 = 2*\34 - \32; \44 = \43 - 1 )

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

I've just worked through some numbers and my suspicion turned out to be wrong. Your loop-breaking condition and chunk size mesh very well with the asymptotic behaviour of the sequence. I can post some details if you're interested.

At this point I'm curious to know what the value of \12 is when your regex tackles 12586269025.

Now I'm thinking that the problem is more likely with the engine than the regex. After reviewing your algorithm, I'm wondering if it's tripping up on something to do with \31 (just thinking about the "two on, two off" pattern here).

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Now I'm thinking that the problem is more likely with the engine than the regex. After reviewing your algorithm, I'm wondering if it's tripping up on something to do with \31 (just thinking about the "two on, two off" pattern here).

When working on 12586269025 (= F50), we should have \32 = \23. I'd like to verify this. I'd use --trace but fear the output would be unwieldy for a number this large.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 25, 2014

Interesting, the value of \12 is 233, just as it should be.
http://kingbird.myphotos.cc/Fibonacci-12586269025-trace.7z
I haven't looked at this trace myself yet, except for the first line to see what \12 was. I set it to start tracing only when \21 got a value (i.e., right after the loop finished). I should add this kind of tracing as an option.

Edit: It appears to be choosing the wrong pair to operate on. It picked (46368, 75025) instead of (75025, 121393) as it should have.

It is indeed a bug in my engine, and a really trivial one too. The .*(?=\31$), where \31=4, goes forward by only 232-1 instead of 12586269025-4=12586269021 as it should. I limited quantifier counts to be 32-bit unsigned integers, which is perfectly reasonable (I was thinking there's no good reason to allow anything greater than {4294967294}) but my format for storing quantifiers has the value 232-1 represent the maximum number of matches for * and +, and I made the mistake of having 232-1 be the actual maximum number of matches in some circumstances.

I've just worked through some numbers and my suspicion turned out to be wrong. Your loop-breaking condition and chunk size mesh very well with the asymptotic behaviour of the sequence. I can post some details if you're interested.

Thanks! I am interested. I'm not really sure how to rigorously prove that this algorithm always works.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Edit: It appears to be choosing the wrong pair to operate on. It picked (46368, 75025) instead of (75025, 121393) as it should have.

Strange. My understanding is that \31 should be 4. This is looking more and more like a problem with the engine.

Edit: Good work.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 25, 2014

Thanks! I am interested. I'm not really sure how to rigorously prove that this algorithm always works.

I'm sitting down to dinner right now. I'll post something rudimentary later.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 26, 2014

Here's a rough look at the theoretical behaviour of your algorithm when presented with large Fibonacci numbers.

  • Your algorithm takes a number n and first calculates a block size, b ≈ (8-1n)1/2.
  • Next, a loop finds the smallest Fibonacci number, Fk+1 say, such that Fk+12 > b.
  • Finally, the algorithm asserts that n = F4k+i for some i < 4.

We'd like to establish, given any large Fibonacci number n = F4k+i (i < 4), that Fk+1 is the smallest Fibonacci number greater than b1/2. This amounts to showing that Fk <= ((8-1F4k+i)1/2)1/2 < Fk+1.

A Fibonacci number Ft is approximately equal to 5-1/2φt. This approximation improves as t becomes large. Consequently, it suffices to establish that
5-1/2φk <= (8-15-1/2φ4k+i)1/4 < 5-1/2φk+1.

This simplifies to
φ-i <= 53/28-1 < φ4-i

which is true for all i < 4 because
1 <= 53/28-1 < φ.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 26, 2014

Thanks teukon, that was an excellent explanation.

Also, I pushed the fix for that bug to github, and have been running my regex for a while; it's up to 4052739537881 so far [edited as of 8 hours after this post], with no false negatives.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 26, 2014

Thanks teukon, that was an excellent explanation.

Cheers. This argument is not rigourous but it should make it clear how a rigourous argument might go.

it's up to 4052739537881

4 TB strings! I'm guessing this is in numeric mode.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 26, 2014

I implemented a direct optimization to what I believe is the shortest-length kind of general-purpose multiplication possible (by general-purpose, I mean context-independent, where the only constants are that there needs to be room for the product, and the product needs to be captured).

My Fibonacci regex now goes from 0 to 514229 (testing all numbers) in 36 seconds. Testing only for false negatives, it goes from 0 to 2111485077978050 in 34 seconds.

It got up to 160500643816367088 with no false negatives, in 254 seconds. And then it crashed while working on 259695496911122585, due to running out of memory.

Your Fibonacci regex cleverly uses many different forms of multiplication, depending on what best fits the circumstances... programming my engine to recognize these various things as multiplication will be quite difficult.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 26, 2014

Your Fibonacci regex cleverly uses many different forms of multiplication, depending on what best fits the circumstances... programming my engine to recognize these various things as multiplication will be quite difficult.

I suspected this might be a problem. In particular, the part where I build F2k2, F2kF2k-12, and F2k-12 simultaneously.

    (?=
        (\18\16)                              (?# \22 = b^2 + a^2 = F_{2k-1}. )
        (\18\21\21)                           (?# \23 = b^2 + 2ab = F_{2k}. )
        (?=\23*(.*))\24
        \23*
        (?=(x(?:\23x)*)$)                     (?# \25 = {F_{2k}}^2. )
        \23*
        (?=(\22*)$)                           (?# \26 = F_{2k}F_{2k-1}. )
        \22*
        (x(?:\22x)*)$                         (?# \27 = {F_{2k-1}}^2. )
    )
@teukon

This comment has been minimized.

Copy link

teukon commented Mar 26, 2014

It got up to 160500643816367088 with no false negatives.

I didn't really register the magnitude of this improvement at first. You got up to F84!

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 30, 2014

I got the length-optimized Fibonacci regex down to 165 characters:

^((?=(x(x*)).*(?=x{4}(\2\3+)\4{3}(?=\2*$)\4$)(|x{4})(?=xx(x*)(\6x?))\5(x(x*))
     (?=(\8*)\9+$)(?=\8*$\10)\8*(?=(\2\9+$))(x*)\12)\7\11(\6\11|\12)|x{0,3}|x{5}|x{8}|x{21})$

Pretty-printed version, on github

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 30, 2014

Cool. That is pretty compact.

@ealf

This comment has been minimized.

Copy link

ealf commented Mar 31, 2014

These are so good. Thank you.

I hope you don't mind me linking to these from the main game?

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 31, 2014

Not at all. In fact, I'm thrilled!

Did you try tackling the problems yourself? How does it feel to be on the receiving end for a change?

Thank you very much for the site, it's given me hours of entertainment. Do you use Bitcoin? May I send you a tip?

By the way. Did you think about testing for "non-cheaty" (robust) solutions in the end?

EDIT: It looks like you happened on a draft of my problem set. The final set can be found here: https://gist.github.com/teukon/16b6de379907bb2a3aa4. I cut a few problems, tightened up the solution strings to better guide/test robustness, and changed the order (for a smooth increase in difficulty). I'd probably prefer there to be no link, than for there to be a link to this ugly draft.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 31, 2014

I've just checked out the problems on the site and they appear to be looking good (aside from the fact that this is an old draft).

May I ask what "hard mode" is?

Also, it strikes me that some of these problems would be better presented with a fixed-width font. I suspect Tic-tac-toe and Latin Squares would look much better (Dominoes would probably qualify too). All the other problems are probably better with a proportional font. What do you think? Do you prefer the consistency of using the same font for every problem?

Would you like me to construct a hint string for each problem?

The ECMAScript link on your help page is broken.

@ealf

This comment has been minimized.

Copy link

ealf commented Mar 31, 2014

Thank you very much for the site, it's given me hours of entertainment.

Likewise!

Do you use Bitcoin? May I send you a tip?

I couldn't possibly. Do send me more levels if you have any :-)

By the way. Did you think about testing for "non-cheaty" (robust) solutions in the end?
...
May I ask what "hard mode" is?

That's the one :-) Hard mode deducts 100 points up front and gives them back if you pass a few thousand random test cases.

It looks like you happened on a draft of my problem set. The final set can be found here: https://gist.github.com/teukon/16b6de379907bb2a3aa4. I cut a few problems, tightened up the solution strings to better guide/test robustness, and changed the order (for a smooth increase in difficulty).

Re-imported, thanks.

fixed-width font
ECMAScript link

Done

I'd probably prefer there to be no link, than for there to be a link to this ugly draft.

Would you like me to construct a hint string for each problem?

It now links to the new gist (https://gist.githubusercontent.com/teukon/16b6de379907bb2a3aa4).

I also left the import script in my docroot. If you add hints after each link like so:

  1. Subtraction This level requires dependently typed regexes

and hit http://importlevel.alf.nu/, it will refetch the list (from https://gist.githubusercontent.com/teukon/16b6de379907bb2a3aa4/raw/) and add the hints. If you change any of the hashes, let me know; I'll have to update the hard-mode checker.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 31, 2014

I couldn't possibly. Do send me more levels if you have any :-)

I completely understand. I prefer paying in good levels anyway. I'm not thinking about new content right now but if enough people try and enjoy my set I may well be motivated to make more.

That's the one :-) Hard mode deducts 100 points up front and gives them back if you pass a few thousand random test cases.

Perfect!

It looks like you happened on a draft of my problem set. The final set can be found here: https://gist.github.com/teukon/16b6de379907bb2a3aa4. I cut a few problems, tightened up the solution strings to better guide/test robustness, and changed the order (for a smooth increase in difficulty).

Re-imported, thanks.

Thank you, this means a lot.

fixed-width font
ECMAScript link

Done

That looks much better. Thank you.

I also left the import script in my docroot. If you add hints after each link like so:

  1. Subtraction This level requires dependently typed regexes

and hit http://importlevel.alf.nu/, it will refetch the list (from https://gist.githubusercontent.com/teukon/16b6de379907bb2a3aa4/raw/) and add the hints. If you change any of the hashes, let me know; I'll have to update the hard-mode checker.

Brilliant! I'll give some thought to hints. I'm near enough certain that I'll not be changing the hashes, especially given that you now have a hard mode.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 31, 2014

I've just completed the problems in hard-mode (3729 points). I'm glad that ick$ is correct, and that my 282 point robust solution to Alphabetical is not completely useless.

Davidebyzero will also be pleased with the hard-mode I guess. We worked together very hard to find a robust solution to triples which was shorter than the finite-state machine one.

One comment. Is the treatment of 0 and 1 in "Prime" intentional? Davidebyzero pointed out that my original solution to Prime matched both 0 and 1. I then refined my solution from

^(?!(xx+)\1+$)

to

^(?!(xx+)\1+$)xx

If I pretend that 0 and 1 are prime numbers, I can score 2 more points.

A similar ambiguity exists with 0 in Powers.

Of course, things get very subjective here, so I'm not making any recommendations. I just want to draw this to your attention in light of the new hard mode.

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 31, 2014

I've just added some hints.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 31, 2014

Wow, hello, Erlang!

I like the new look on http://regex.alf.nu/ and the new hard mode you've added. And I am quite pleased to see you've added teukon's bonus levels as a single page.

I do have some bug reports:

  1. Long count does not participate in hard mode. The old non-robust solution gets the same number of points in both modes.
  2. In Long count, on some browsers (actually all of them, on my system), the "1111" at the end of each string wraps to the next line. This doesn't happen at some zoom levels (font sizes) but happens at the default one and many others.
  3. The scoring of Balance has changed. It now gets 160 extra points, in both modes. So the robust 7-level solution gets 443 points instead of 283, and the best exact solution I've found gets 454 points instead of 294.

I was a bit disappointed to see that the maximum palindrome length in "A man, a plan" in hard mode is 7 characters, given that the maximum length of the non-matches is 13 characters.

Dropping "Long count v2" was probably for the best, but it does change the maximum score. I guess we'll just have to live with that. I do think, however, that if you have only one "Long count", its domain in hard mode should be ^.*$.

It would be nice to see a progress indicator, for regexes that are slow to execute. Also, sometimes while typing/editing a regex, there are steps in-between where the regex is extremely slow, and I have to wait for it to finish the long calculation before I can continue typing. I'm not sure if this is possible in Javascript, but it'd be great if you could make typing non-modal, such that typing/editing will terminate a currently-in-progress execution of the currently entered regex.

Oh, and if you beat your non-hard-mode score in a particular level while in hard mode, it should probably update your high score and string for that level in both modes.

Also, I was wondering — in this post, how did andersk know what the high scores list was if it's unpublished?

@teukon

This comment has been minimized.

Copy link

teukon commented Mar 31, 2014

Long count does not participate in hard mode. The old non-robust solution gets the same number of points in both modes.

To clarify:

((.+)0\2+1){8}

scores 256 points in hard mode. Despite the fact that it matches all kinds of crazy things, such as

htach9001,an0anan1 r70 r71i0iii1l_00l_01!0!!!!11011,0,,1-)0-)-)1zrpp_(

I guess this kind of thing is bound to happen. Most problems yield simple, effective hard-mode tests, but Long Count is not among them. Matryoshka and Latin Squares have a similar weakness and I suspect that clever ways of defeating their hard-mode tests can be devised.

I'm simply taking "hard mode" to mean "it's hard to cheat".

@Davidebyzero: Do you have an idea for a simple but effective hard-mode test for Long Count? Maybe something along the lines of string mutations?

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Mar 31, 2014

@Davidebyzero: Do you have an idea for a simple but effective hard-mode test for Long Count? Maybe something along the lines of string mutations?

I'll have to think about that and do some experimentation.

One more note, though:

Primes doesn't participate in hard mode, either. Not that it's as big a deal, since both modes have the same best-scoring string, but perhaps hard-mode should at least test composite numbers from 33 to 70.

^(xxx?|x{5}|x{7}|x{11}(|xx)|x{17}(|xx)|x{23}|x{29}(|xx))$|x{37} - scores 237 points in both modes

Same thing for Powers, with ^(xx?|x{4}|x{8}|x{16}|x{32}|x{64}|x{128}|x{256}|x{512}|x{1024})$ scoring 46 points in both modes, but I would say that's even less of an issue, since both columns go up to 1025.

@ealf

This comment has been minimized.

Copy link

ealf commented Apr 5, 2014

The checker used to run in the ui thread, which was why the hard-mode checker had to give up after a few hundred milliseconds. It now uses a web worker, so it should be responsive even if you accidentally feed it something exponential.

As a bonus it can afford to be a lot more thorough. For the /x+/ problems it now tries all possible inputs by size.

If you don't want it to check large inputs, you can of course do something like

^(xx?|x{4}|x{8}|x{16}|x{32}|x{64}|x{128}|x{256}|x{512}|x{1024}|(?=x{1026})(x*)*y)$

but that's cheating...

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 5, 2014

It now uses a web worker, so it should be responsive even if you accidentally feed it something exponential.

Great. Thanks for this upgrade.

As a bonus it can afford to be a lot more thorough. For the /x+/ problems it now tries all possible inputs by size.

Starting from 1 yes?

^(xx?|x{4}|x{8}|x{16}|x{32}|x{64}|x{128}|x{256}|x{512}|x{1024}|(?=x{1026})(x*)*y)$

Oh, I like this solution. Very nice.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 7, 2014

Regex that matches Abundant numbers:

^(?=(x(x+))\1*(?=\1$)(?!(xx+)\3+$))(?=((?=\1+$)|(\1+)\5*(?=\5$))(?!((x+)(?=\7+$)x+)\6*(?!\7+$)
\6$)(x(x*)))(?=(x(x*)).*(?=\8$)x.*(?=\10*$)\2\11+$)(?=.*(x((?=\10*$)\1\11+$)))(?=(?=\12(x*))
(x*)(?=x\14)(?=\12*$)(x(x*))(?=\16*$)((?=.*(?=\16$)\12)(?=(\16*)\13\17+$).*$\19|(?!.*(?=\16$)
\12)\13+$))(?=(((?=.*(?=\12$)\15{2}x)|x)\16{2}))(?=(x(x*))(?=\22*$)((?=.*(?=\22$)\8)(?=(\22*)
\9\23+$).*$\25|(?!.*(?=\22$)\8)\9+$))((?=.*(?=\22$)\1).*(?=(?=(\22*)\1\23+$)\22*$\27)|(?!.*
(?=\22$)\1).*(?=(?=(\1*)\22\2+$)\1*$\28))((?=\22*(x*))(?=\30(x(x+))(?=\31*$)((?=.*(?=\31$)\22)
(?=(\31*)\23\32+$).*$\34|(?!.*(?=\31$)\22)\23+$))(?=(x(x(x*))).*(?=\22$)(?=\35*(?=\35$)(?!\31)
(?!(xx+)\38+$))((?=\35+$)|(\35+)\40*(?=\40$))(?!((x+)(?=\42+$)x+)\41*(?!\42+$)\41$)x(x*))(?=
((?=.*$\30)\20|\30))(?=.*(?=\43$)(x(x*))(?=(\45*)\37\46*$)\45*$\47)(?=.*(x((?=\45*$)\35\46+$
)))(.*(?=\48$)x\44.*$|(?=.*(?=\44$)(?=\48(x*)).*(?=x\51)(?=\48*$)(x(x*))(?=\52*$)((?=.*(?=\52$
)\48)(?=(\52*)\49\53+$).*$\55|(?!.*(?=\52$)\48)\49+$)).*(?=\52((?=.*(?=\22$)\35)(?=(\22*)\35
\23+$)\22*$\57|(?!.*(?=\22$)\35)(?=(\35*)\22\36+$)\35*$\58))))+$

Pretty-printed, commented version (As usual I've not done any optimization yet. This is the first version that fully works.)

General numerical test scripts for Perl and PCRE

On another note:

Do you think there's anything in domain ^x*$ that is impossible to do with ECMAScript but would be possible with ECMAScript + molecular lookahead? If not, do you think it'd be possible to prove this?

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 7, 2014

WOW!!!

How many numbers have you tested it on?

Do you think there's anything in domain ^x*$ that is impossible to do with ECMAScript but would be possible with ECMAScript + molecular lookahead? If not, do you think it'd be possible to prove this?

I don't know. I'll think about it.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 7, 2014

How many numbers have you tested it on?

Left it running overnight, and it's gotten up to 31100 so far with no false positives or negatives :-)

Before I went to bed, I had tested it up to 2000.

None of the types of multiplication I use in this regex are currently optimized as multiplication by my regex engine. Once I do, I expect it'll be much faster.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 8, 2014

Honestly, I expected a functional solution to be about twice as long as this. You're really done a good job here.

I quick thought about the molecular lookahead problem suggests to me that there is no such partition, that ECMAScript + molecular lookahead is no more expressive than ECMAScript proper for the domain ^x*$. I would attempt to prove this by taking a molecular regex and replacing it with a standard one recursively (following the syntactic definition). Some work would need to be done to establish that it suffices to consider just one molecular lookahead, making just one capture (I'm not 100% on this). There would be a disjunction involved to handle whether or not this capture is "read" at least once.

Sorry this is so vague, but you know me.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 8, 2014

Honestly, I expected a functional solution to be about twice as long as this. You're really done a good job here.

Thank you!

I would attempt to prove this by taking a molecular regex and replacing it with a standard one recursively (following the syntactic definition).

Could you please elaborate on what you mean by "replacing it recursively"? I don't understand what you mean here. I can't think of any general way of replacing any molecular lookahead with an atomic lookahead; as I see it now, I'd have to adapt to the circumstances and adapt the algorithm of the particular regex to accommodate this change, e.g. by making it break up its captures into pieces and doing multiple passes of multiplication with remainders, or refactoring the algorithm so that it has enough room to work.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 8, 2014

Quite naively: I'm thinking the main desire for molecular lookaheads is to have something which is sensitive to the rest of the space, but which doesn't consume space. What I'm thinking is if you have an expression which relies on some group \1 of length m, and matches some subset K of the naturals (or some subset of strings of the form ^x*$ if you prefer), then it should be possible to construct an expression which matches precisely the set of all k - m in the naturals where k is in K. The new expression will depend on exactly how '\1' is referenced in the old expression, which is I guess why you talk about wanting to reason about each expression in turn. However, I imagine that each syntactical element can be reasoned about abstractly.

For example. If we have the expression (?*(x*))(?:[expression A]|[expression B]), where expressions A and B contain no molecular lookaheads and do not reference each other's captures, and we know that expressions A and B (with appropriate group renumbering) are such that (?*(x*))[expression A] and (?(x*))[expression B] are equivalent to (x*)[expression C] and (x*)[expression D] respectively, where expressions C and D contain no molecular lookaheads, then we can reason than (?*(x*))(?:[expression A]|[expression B]) is equivalent to (x*)(?:[expression C]|[expression D]) (with appropriate group renumbering).

I expect similar reasoning can be done for each syntactical element (some easier, some harder). This would expand to recursion on the complexity of the non-molecular part of the expression (along the grammar if you will).

Handling multiple or even nested molecular lookaheads should be easy enough.

The expression within the molecular lookahead is I think much more difficult to manage, but again I would proceed in the manner described above. Some steps will involve breaking one molecular lookahead into several, such as when breaking up a string of terms. I think coping with multiple groups within the same molecular lookahead will be easier than it might at first appear (and should just drop out).

Still, I could be missing something. I've no experience with molecular lookaheads.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 8, 2014

I expect similar reasoning can be done for each syntactical element (some easier, some harder).

It's not at all obvious to me what any of the other transforms might be, or if such a low-level method is even feasible — but I suspect it isn't, since a conjunction of two syntactical elements, as opposed to a disjunction, can mean something entirely different than either of the elements alone.

But thinking about this topic some more has given me ideas.

Below I will use the term "molecular capture groups" to refer to standard capture groups (as supported by ECMAScript) which are not inside a lookahead relative to where their backrefs are used.

You mentioned subsets of the naturals. It had already occurred to me that if I want to test some set of statements for a subset of values, for example all prime factors of the main number, then instead of capturing the prime factor in a molecular capture group, I could capture an index in a molecular capture group and then translate that index into a prime number inside an atomic lookahead, where 0 -> 2, 1 -> 3, 2 -> 5, 3 -> 7, 4 -> 11, etc. (For the sake of argument it's assumed that translating an index to a prime is possible in a ECMAScript regex; it might not be possible.) But that still consumes some space; up to as much space as the number of unique different numbers you need to capture. Whereas a molecular lookahead would consume no space.

But what occurred to me today is, we could be even more clever (and also stick within the realm of known possibility), and capture two numbers using molecular capture groups. Then, for a maximum cost of consuming M xes, we could subsequently use the pair of backrefs (a,b such that a+b ≤ M) inside an atomic lookahead, to capture a number that can take on (M+1)(M+2)/2 unique values — using a triangular transform, (a+b)(a+b+1)/2+a. We could then use this triangular-ified number directly as a prime (discarding the values that aren't prime). (Triangular numbers grow faster than primes, so this is even more efficient that the index method described in the previous paragraph.) Tetrahedral numbers would be better still — (a+b+c)(a+b+c+1)(a+b+c+2)/6 + (a+b)(a+b+1)/2 + a, where a+b+c ≤ M.

It seems to me this would be of very limited usefulness as far as being a substitute for molecular lookaheads, though. We could take this even farther, but we'll always have to consume some amount of space, which scales along with the number (and we can't even get to the point where it scales exponentially; the best we can do is n-simplex numbers, where n is the number of molecular captures per lookahead-coded number). If the range of numbers captured doesn't have to scale with the main number, then the minimum would be consuming 0 or 1 x, and having a number of unique values equal to the number of capture groups plus one.

Back to the original problem... I think I might have the beginnings of a sketch of a proof (or perhaps just strong indication?) that everything that could be done with molecular lookaheads could be done without them. Something like, divide the starting number by two and capture the remainder (after also dividing all previous captures that are to be used in this scope by two and capturing their halves and remainders). Then you can capture something up to half the original number with just one capture group (or up to the original number with two capture groups), and subsequently do all your calculations modified to pretend they're working on double the remaining number plus the captured remainder. However, there's still the problem of not being able to carry over a new remainder from one iteration of a loop to the next... we'd still have to prove that doesn't make anything impossible that otherwise would be possible.

The above won't work for a pair of captures or more, where a set of statements needs to be tested for every possible pair of values. So this would actually be a situation in which the triangular, tetrahedral, etc. capturing could be useful. Either that, or for a pair of molecular captures, divide the original number by 4 instead of 2, and capture a remainder that can be in the range 0..3 — but it seems that this might have a higher chance of conceivably making some operations impossible than just dividing the number by 2 and having a remainder in the range 0..1.

BTW, I'm pretty sure it's possible to do this idea you talked about a month ago:

I've also been thinking about real numbers. I'd like to handle things like:

x*sqrt(2)=x
xx*sqrt(2)=xx
xxx*sqrt(2)=xxxx
xxxx*sqrt(2)=xxxxx
xxxxx*sqrt(2)=xxxxxxx

Though I think a more fun way to do it would be to take a single number as input, and return as a match the number divided by sqrt(2).

Whether it's possible to do the same thing with, say, π or e, is another question entirely.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 10, 2014

I've shortened the Abundant numbers regex to 597 characters:

^(?=(?:(?=(xx+?)\1+$)(x+)\2*(?=\2$))+(?!\1+$)(?=(x(x+?))\3*$)(x(x+)))(?=(x(x*)).*(?=\6$)(?=\7*$
)\4\8+$)(?=.*(x((?=\7*$)\3\8+$)))(?=(x*?)(?=\9*$)(x(x*))(?=\12*$)(?=\10+$)\10\13+$)(?=(x(x*))
(?=\14*$)\6\15+$)((?=(x*?(?=\14*$)(?=(x(x+))(?=\15+$)\18*$)((?=\5+$)\12{2}((?=.*(?=\9$)\11{2}x)
|x)|)))(?=(x(x(x*))).*(?=\14$)(?=\22*(?=\22$)(?!\18|(xx+)\25+$))(|(\22+)\27*(?=\27$))(?!((x+)
(?=\29+$)x+)\28*(?!\29+$)\28$)x(x*))(?=.*(?=\30$)(x(x*))(?=(\31*)\24\32*$)\31*$\33)(?=.*(x((?=
\31*$)\22\32+$)))(.*(?=\35$)\17.*$|(?=.*?(?!x\17)(?=\34*$)(x(x*))(?=\37*$)(?=\35+$)\35\38+$)(?=
\37(x*)(?=\14*$)\22\15+$)\39))+$

It's also much faster. The first version took 8.97 hours to go from 0 to 33588; this version took 7.24 minutes.

Edit: Have now tested numbers up to 270000 so far, with no false negatives or positives.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 10, 2014

Nice progress. Under 600 characters and fast! What number did you get up to?

I think I'm really out of the loop when it comes to the molecular lookaheads as I have no experience with them.

The sqrt(2) thing sounds interesting. Is there an easy argument for why you think this can be done?

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 10, 2014

Nice progress. Under 600 characters and fast!

Thanks. :)

I think I'm really out of the loop when it comes to the molecular lookaheads as I have no experience with them.

Well, all the reasoning I've done about them so far is just theoretical, but I suppose I should implement them in RegexMathEngine so we can actually try them out in practice.

The sqrt(2) thing sounds interesting. Is there an easy argument for why you think this can be done?

So we want N/sqrt(2). Basically, we can do calculations in base floor(sqrt(N))+1 with carry.

So say N=1000. Then we work in base 32. The largest digit in base 32 is 31, and we have room to calculate 312=961. (This works if N is a perfect square, too. If N=10000, then we work in base 101, and we can calculate 1002=10000.)

Pretend A,B,C,D,E are base 32 digits (except that E is allowed to exceed the base), where adjacency indicates a multi-digit number, not multiplication. Then we can do (BA)2 = EDC, where C = A2, D = 2×A×B, E = B2. We then take C mod 32, while C/32 is added as a carry to D. Then we do the same for D, adding the carry to E.

So we calculate N2 using this method, and divide the multi-digit result by 2. Then we find a two-digit number which, when squared, is as close as possible to the other multi-digit result we calculated. This gets us sqrt(N2/2) which is what we wanted, so then we convert the two digits into a proper number by multiplying the most-significant digit by the base and adding the least-significant digit.

[Edit]

What number did you get up to?

Up to 1 million with no false negatives or positives, in 107 hours.

Incidentally, a tiny C program using a sieve-initialized sigma-function array goes up to 10 million (enumerating the first 2476737 abundant numbers) in 1.0 seconds, including the time to initialize the array.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 11, 2014

Okay, I pushed a version of RegexMathEngine that has molecular lookahead implemented. Currently it is always-enabled; in a future commit I'll make it optional, so that it can stick to ECMAScript compatibility by default.

Remember my regex to find the prime(s) of highest multiplicity, at the top of this gist? Well, look how compact it can become using molecular lookahead (119 characters):

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        (?*(x(x+?))\2+$)          (?# \2 = proposed smallest factor of N that has
                                      all the same prime factors as N; \3 = \2-1 )
        (?!                       (?# Assert that \2 has all the same prime factors as N has,
                                      by asserting that there is no prime factor of N, \4,
                                      that is not a factor of \2 )
            (?*
                (x+)\4*(?=\4$)    (?# \4 = iterates through all prime factors of N )
                (?!(xx+)\5+$)     (?# assert that \4 is prime )
            )
            (?!.*(?=\2$)\4+$)
        )
        (?=
            (x(x*))               (?# \6 = N / \2; \7 = \6-1 )
            (?=(\6*)$)            (?# \8 = N - \6 )
            (?=\3+$)              (?# must match if \6 <  \2; otherwise doesn't matter )
            \3\7+$                (?# must match if \6 >= \2; otherwise doesn't matter )
        )
        \8                        (?# Leave \6 as the new N for the next iteration of the loop )
    )*
    (x*)                          (?# \9 = the prime with the highest multiplicity;
                                      if it's a tie, the product of the tied primes )
)
\9

It's also enormously faster. It can process all the numbers from 0 to 4250 in the same amount of time as the regex at the top of this gist processes the numbers from 0 to 500.

BTW, I'm pretty sure I've implemented molecular lookahead correctly, but not having it in any other engine makes it hard to know for sure. If you find it not behaving the way you expect in some regard, please let me know because it may or may not be a bug.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 11, 2014

So say N=1000. Then we work in base 32. The largest digit in base 32 is 31, and we have room to calculate 312=961. (This works if N is a perfect square, too. If N=10000, then we work in base 101, and we can calculate 1002=10000.)

Pretend A,B,C,D,E are base 32 digits (except that E is allowed to exceed the base), where adjacency indicates a multi-digit number, not multiplication. Then we can do (BA)2 = EDC, where C = A2, D = 2×A×B, E = B2. We then take C mod 32, while C/32 is added as a carry to D. Then we do the same for D, adding the carry to E.

Aha, that's a cute technique. I think I'm starting to see what you're getting at.

Remember my regex to find the prime(s) of highest multiplicity, at the top of this gist? Well, look how compact it can become using molecular lookahead:

Yes, it was partly due to this that I didn't expect a solution to abundant numbers to be so short. It is particularly elegant when you are allowed to make use of molecular lookaheads.

BTW, I'm pretty sure I've implemented molecular lookahead correctly, but not having it in any other engine makes it hard to know for sure. If you find it not behaving the way you expect in some regard, please let me know because it may or may not be a bug.

Will do, although I must admit I'm not in the frame of mind to test this at present. I guess if there are some subtle bugs you can claim "reference implementation" whereby all differing engines are wrong by default.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 12, 2014

How would you like to do a little golf challenge (one that's much simpler than Fibonacci)?

Make a regex that works on the domain of powers of 2, which returns the base 2 logarithm as its match. For example, for ^x{512}$ it should return x{9} as its match, and for ^x$ it should return an empty match.

Here are test scripts for Perl, PCRE, and RegexMathEngine. The RegexMathEngine script requires that you pull the latest version. The test scripts don't verify that the output is correct yet.

My best solution so far has md5sum c55d01ff6313d223df7422b1b6886b03.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 12, 2014

Sounds like fun! I'm too busy to get into this right now though; I have a lot of work to get through by Wednesday.

Still, this seems like it would be a good level for me. I look forward to tackling it.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 13, 2014

Cool :) I look forward to seeing what you come up with.

Meanwhile I've superseded my previous solution with one having md5sum b31a12cf9a9982d7d3204d198f87ba4f.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 13, 2014

I've found some errors in this paper on the density of abundant numbers. On "page 142" (page 6 of the PDF) there's a table of the number of non-deficient numbers in a range of intervals. My C program agrees with those up to 108, but finds that there are 247610960 non-deficient numbers under 109, whereas the paper says 247610965. Looks like Deléglise may have used 32-bit math without properly checking for overflow (I'm using 64-bit math).

Also, the table leaves out the number for [1, 106] and accidentally shows the one for [1, 105] next to [1, 106]. And it uses [a, b] to represent open intervals, but [a, b] is the notation for a closed interval — the paper's [1, 108], for example, does not include 108 itself.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 14, 2014

You'd be surprised of the errors that crop up in mathematics papers. I find it a bit of a stretch to describe the closed interval thing an error, but the notation is certainly very common so this is at least clumsy. The 32-bit thing could quite easily be simply wrong.

It's easy to overlook things. For example: Subtraction (182):

^(.*)(.*). \2. \1$
@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 14, 2014

^(.*)(.*). \2. \1$

Nice. Figures that the first problem would be the one we didn't get around to revisiting (until now, at least).

BTW, I prefer ^(x*)(.+)- \2= \1$, because it's easier to understand at a glance.

The 32-bit thing could quite easily be simply wrong.

While I have virtually no doubt that it's true that the asymptotic density of abundant numbers is between 0.2474 and 0.2480, the author of the paper may have not correctly proven this. I haven't taken the time to try to puzzle through the math in that paper, but the result involved 100 hours of CPU time and the entire time it may have been using a buggy algorithm.

Probably with modern CPU power it'd be possible to tighten those bounds quite a bit.

What would be much cooler, however, would be to prove that for every n, |An(2) - A(2)| < f(n) for some f(n) that asymptotically approaches zero, where an f(n) for which this is true is found.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 15, 2014

Here's a direct port of the molecular lookahead version of the "prime(s) of highest multiplicity" regex to a standard ECMAScript regex:

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        (?=(x*)\2(x?))                    (?# \2 = N / 2;  \3 = N % 2 )

        ((x*?)x??)                        (?# {\4, \5} = x\4\5 encoded as a pair to save space )
        (?=                               (?# require that \6 = x\4\5 is a factor of N )
            .*(?=\2\3$)
            (?=(x(\4\5))+(x*))            (?# \6 = x\4\5; \7 = \6-1; \8 = remainder )
            (?=.*(?=\6)\8(x*))            (?# \9 = \6 - \8 )
            \3\9\6*$
        )
        (?!                               (?# Assert that \6 has all the same prime factors as N has, by asserting
                                              that there is no prime factor of N, \12, that is not a factor of \6 )
            ((x*)x?)                      (?# {\10, \11} = x\10\11 encoded as a pair to save space )
            (?=                           (?# require that \12 = x\10\11 is a factor of N )
                .*(?=\2\3$)
                (?=(x\10\11)+(x*))        (?# \12 = x\10\11; \13 = remainder )
                (?=
                    .*(?=\12$)
                    (?!(xx+)\14+$)        (?# assert that \12 is prime )
                    \13(x*)               (?# \15 = \12 - \13 )
                )
                \3\15\12*$
            )
            (?!.*(?=\6$)\12+$)
        )
        ((x*)x?)                          (?# {\16, \17} = \16\17 encoded as a pair to save space )
        (?=                               (?# require that \18 = x\16\17 = N / \6 )
            .*(?=\2\3$)
            (?=(x(\16\17))+(x*))          (?# \18 = x\16\17; \19 = \18-1; \20 = remainder )
            (?=.*(?=\18)\20(x*))          (?# \21 = \18 - \20 )
            (?=\3\21\18*$)

            (?# the following must match if \18 <  \6; otherwise it doesn't matter )
            (?=\18\7*(x*))                (?# \22 = remainder )
            (?=.*(?=\7)\22(x*))           (?# \23 = \7 - \22 )
            (?=\3\23\7*$)

            (?# the following must match if \18 >= \6; otherwise it doesn't matter )
            (?=\6\19*(x*))                (?# \24 = remainder )
            (?=.*(?=\19)\24(x*))          (?# \25 = \19 - \24 )
            (?=\3\25\19*$)
        )
        .*(?=\18$)                        (?# Leave \18 as the new N for the next iteration of the loop )
    )*
    (x*)                                  (?# \26 = the prime with the highest multiplicity;
                                              if it's a tie, the product of the tied primes )
)
\26

And here's an alternative direct port:

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        ((x*?)x??)                        (?# {\2, \3} = x\2\3 encoded as a pair to save space )
        (?=                               (?# require that \4 = x\2\3 is a factor of N )
            (?=(x(\2\3))+(x*))            (?# \4 = x\2\3; \5 = \4-1; \6 = remainder )
            (?=.*(?=\4)\6(x*))            (?# \7 = \4 - \6 )
            .*(?=\2$)
            \7\4*$
        )
        (?!                               (?# Assert that \4 has all the same prime factors as N has, by asserting
                                              that there is no prime factor of N, \10, that is not a factor of \4 )
            ((x*)x?)                      (?# {\8, \9} = x\8\9 encoded as a pair to save space )
            (?=                           (?# require that \10 = x\8\9 is a factor of N )
                (?=(x\8\9)+(x*))          (?# \10 = x\8\9; \11 = remainder )
                (?=
                    .*(?=\10$)
                    (?!(xx+)\12+$)        (?# assert that \10 is prime )
                    \11(x*)               (?# \13 = \10 - \11 )
                )
                .*(?=\2\8$)
                \13\10*$
            )
            (?!.*(?=\4$)\10+$)
        )
        ((x*)x?)                          (?# {\14, \15} = x\14\15 encoded as a pair to save space )
        (?=                               (?# require that \16 = x\14\15 = N / \4 )
            (?=(x(\14\15))+(x*))          (?# \16 = x\14\15; \17 = \16-1; \18 = remainder )
            (?=.*(?=\16)\18(x*))          (?# \19 = \16 - \18 )
            (?=(.*)(?=\2\14$)\19\16*$)    (?# \20 )

            (?# the following must match if \16 <  \4; otherwise it doesn't matter )
            (?=\16\5*(x*))                (?# \21 = remainder )
            (?=.*(?=\5)\21(x*))           (?# \22 = \5 - \21 )
            (?=\20\22\5*$)

            (?# the following must match if \16 >= \4; otherwise it doesn't matter )
            (?=\4\17*(x*))                (?# \23 = remainder )
            (?=.*(?=\17)\23(x*))          (?# \24 = \17 - \23 )
            (?=\20\24\17*$)
        )
        .*(?=\16$)                        (?# Leave \16 as the new N for the next iteration of the loop )
    )*
    (x*)                                  (?# \25 = the prime with the highest multiplicity;
                                              if it's a tie, the product of the tied primes )
)
\25

And the latter, whittled down to 185 characters:

(?# N = current tail number; starts as being the main number )
^
(?=
    (
        (?=
            ((x*?)x??)            (?# {\2, \3} = x\2\3 encoded as a pair to save space;  tail = N - \2 )
            (?=x\3(x(\2\3))+$)    (?# require that \4 = x\2\3 is a factor of N;  \4 = x\2\3; \5 = \4-1 )
            (?!                   (?# Assert that \4 has all the same prime factors as N has, by asserting
                                      that there is no prime factor of N, \8, that is not a factor of \4 )
                ((x*)x?)                    (?# {\6, \7} = x\6\7 encoded as a pair to save space;  tail = N - \2 - \6 )
                (?=                         (?# require that \8 = x\6\7 is a factor of N )
                    (?=(x\6\7)(\8*(x*)))    (?# \8 = x\6\7; \10 = remainder; \9 = N - \2 - \6 - \8 )
                    (?=
                        \9                  (?# tail = \8 )
                        (?!(xx+)\11+$)      (?# assert that \8 is prime )
                        \10(x*)             (?# \12 = \8 - \10 )
                    )
                    .*(?=\2\6$)             (?# tail = \2\6 )
                    \12\8*$
                )
                .*(?=\4$)
                (?!\8+$)
            )
        )
        (?=
            (x(x*))               (?# \13 = N / \4; \14 = \13-1 )
            (?=(\13*)$)           (?# \15 = N - \13 )
            (?=\5+$)              (?# must match if \13 <  \4; otherwise doesn't matter )
            \5\14+$               (?# must match if \13 >= \4; otherwise doesn't matter )
        )
        \15                       (?# Leave \13 as the new N for the next iteration of the loop )
    )*
    (x*)                          (?# \16 = the prime with the highest multiplicity;
                                      if it's a tie, the product of the tied primes )
)
\16
@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 17, 2014

I found another way of doing Multiplication that is very very close to being as short:
^(x(x*)).(?=x*=(x(x*))(?=\3*$)(?=\2+$)\2\4*$)\3=\1+$ (52 characters)
compare to the one that has remained uncontested for six weeks:
^(x(x*))\*?(x*)\*?\1=(?=(\1\3)+$)\1(\2\3(?!\4+$))+$ (51 characters)

This new method works by doing division by one of the factors and comparing the quotient with the other factor.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 17, 2014

Well, I suppose it was inevitable that at some point I'd start finding length optimizations that result in significant slowdown. The Abundant numbers regex is now down to 511 characters:

^(?=(?:(?=(xx+?)\1+$)(x+)\2*(?=\2$))+(?!\1+$)(?=(x(x+?))\3*$)(x(x+)))(?=(x(x*)).*(?=\6$)(?=\7*$
)\4\8+$)(?=.*(x((?=\7*$)\3\8+$)))(?=(x*?)(?=\9*$)(x(x*))(?=\12*$)(?=\10+$)\10\13+$)(?=(x(x*))
(?=\14*$)\6\15+$)((?=(x*?(?=\14+$)(?=\14+?(?=((xx(x*))(?=\15+$)\19*$))(x+).*(?=\14$)\21*(?=\21$
)(?!(xx+)\22*(?!\19+$)\22$)\19+$)((?=\5+$)\12{2}((?=.*(?!\9)\11{2})|x)|)))(?=.*(?=\21)x(x(x*))
(?=\25*$)\20\26*$)(?=.*(x((?=\25*$)\19\26+$)))(.*(?!\27)\17|(?=.*?(?!x\17)(?=\27*$)(x(x*))(?=
\30*$)(?=\28+$)\28\31+$).*(?=\30\18$)))+$

The non-slowed-down version is at 550 characters.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 18, 2014

I beat our Multiplication score!
^x(x*)\*?(x*)\*?x\1=(?=(x(\1\2))(\3*)\1\4*$)\3*$\5 (50 characters)

Turns out you were right after all when you wrote this:

I can't find a way to improve on 51 characters, but I highly doubt that 51 characters is optimal.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 18, 2014

Nice work! The 50 character solution to Multiplication is particularly pleasing to see.

I'm afraid I'm somewhat losing interest in the regex stuff these days, partly because I'm currently focused on writing a paper. I can see that you're still going strong though.

I might have to pass on your little golf challenge. It's on my to-do list but is lower than one might hope.

Also, it's a bit of a shame that, even with Erling adding the bonus problems to his site, that no-one seems to have taken interest (that I can find). Oh well.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 18, 2014

Thanks for your honesty.

It's been a privilege, and great fun, playing Regex Tennis with you. And I don't think I will ever understand how you were able to come up with such heavily optimized solutions to Perfect squares and Multiplication so quickly.

I still find this pretty fascinating. I hope I'll still feel that way when/if your interest returns.

I'd really like to know the answer to such questions such as: Is it possible for an ECMAScript regex to calculate prime(n) (where n is the index of a prime) where the answer can be as big as the available space? I know it's possible with O(n2 ln n) space, but can't think of a way to do it with O(n ln n) space (equivalently — is it possible to convert a prime into an index with no extra space available?) And more generally, I'd like to see at least one thing proved impossible to do. Or (more astoundingly) a proof that all computable functions are possible to do in a regex (obviously excluding those whose output can exceed their input).

Oh, and I think you've probably guessed where I'm leading, with that little golf challenge I posed to you. Matching a completely generalized Powers, ^x+\^x*=x+$, is indeed possible. It's funny how hard I thought it'd be a month ago:

Match[ing] statements of the form ^x+\^n=x+$ should be easy, but matching ^n\^x+=x+$ or even ^x+\^x+=x+$, without scratch space, seems like a very hard problem (maybe impossible?). But perhaps there's a way to do it using a similar trick to my original Multiplication.

@teukon

This comment has been minimized.

Copy link

teukon commented Apr 18, 2014

It's been a privilege, and great fun, playing Regex Tennis with you. And I don't think I will ever understand how you were able to come up with such heavily optimized solutions to Perfect squares and Multiplication so quickly.

Kind words indeed. I've been equally impressed by some of your work, notably the abundant numbers solution, and the beautiful solution to Dominoes.

And more generally, I'd like to see at least one thing proved impossible to do. Or (more astoundingly) a proof that all computable functions are possible to do in a regex (obviously excluding those whose output can exceed their input).

I expect there are some examples which are provably impossible. It would be nice to have a very simple one but I've run out of ideas.

Oh, and I think you've probably guessed where I'm leading, with that little golf challenge I posed to you. Matching a completely generalized Powers, ^x+^x*=x+$, is indeed possible. It's funny how hard I thought it'd be a month ago:

Yeah. Well done tying up this loose end. I know how you feel though, I remember thinking that square numbers would be incredibly difficult to handle.

This has been fun!

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Apr 21, 2014

The split of Fibonacci regexes between these two designs:

  1. Find a number such that when squared, multiplied by 5, and 4 is added or subtracted, is a perfect square — resulting in finding the Fibonacci that has half the Fibonacci index of the target number
  2. Build up the Fibonacci sequence, up to the one that has one-quarter the index of the target number

... is no longer just about brevity versus speed. If I am not mistaken, only algorithm #2 can be used to return the Fibonacci index as a match. (This can't be done for the Fibonacci numbers 2 and 3, which have indexes 3 and 4 — but this would be easily remedied by turning it into a problem of matching statements of the form ^F_x*=x*$.)

(Incidentally — it would be possible to implement algorithm #1 as going directly to the target number, by using multi-"digit" arithmetic with carry, as I described earlier. But I doubt this would be possible to do in 165 characters or less.)

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Dec 6, 2018

In the interest of tying up another loose end (from Feb 25, 2014):

I had another thought about your powers of 6 problem. Have you considered powers of 3?

I know this can be solved by checking that every factor is a multiple of 3, but I don't think there's a way of doing this that is as short as the loop (imagine: "How the hell are you able to score 181 on Powers of three!?!"). I think I prefer powers of 4 to the other options, but just wanted to share this thought.

A solution to "Powers of 3" that is analogous to the ^(?!(x(xx)+|)\1*$) (18 character) solution of "Powers of 2":
^(?!(x(xxx)+x?|xx|)\1*$) (24 characters)

A solution to "Powers of 3" that is analogous to the ^((x+)(?=\2$))*x$ (17 character) solution of "Powers of 2":
^((x+)\2(?=\2$))*x$ (19 characters)

So yes, the former is not as short as the latter, but it's really close. "Powers of 4", or powers of any non-prime for that matter, would not be solvable at all using the former method.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Dec 15, 2018

I've found some errors in this paper on the density of abundant numbers. On "page 142" (page 6 of the PDF) there's a table of the number of non-deficient numbers in a range of intervals. My C program agrees with those up to 108, but finds that there are 247610960 non-deficient numbers under 109, whereas the paper says 247610965. Looks like Deléglise may have used 32-bit math without properly checking for overflow (I'm using 64-bit math).

Also, the table leaves out the number for [1, 106] and accidentally shows the one for [1, 105] next to [1, 106]. And it uses [a, b] to represent open intervals, but [a, b] is the notation for a closed interval — the paper's [1, 108], for example, does not include 108 itself.

So it turns out that even back when I wrote this, there was already a paper that superseded it: On the density of abundant numbers, Kobayashi (2010). It just didn't show up in my searches at the time.

This is an excellent paper. It goes into depth on things I'd wished were in Deléglise's 1998 paper. It derives the time-complexity of computing n digits of the abundant number density to be O(een). It includes the full C++ source code (and describes their workings) of the program that computes bounds on the density; I tried it, and it works! I was able to compute bounds narrowed down to [0.2476167, 0.2476528] in 245 minutes, single-threaded, by passing it N=200000000000000 on the command line. That's weird, because it's almost as narrow as the new bounds established by the paper: [0.2476171, 0.2476475]. I would have thought it would take a lot more computing time to reach that.

Edit: After 23.54 hours of computation with N = 5000000000000000, the program gave a result of [0.24761740, 0.24764507]. That's even tighter than the bounds stated in the paper.
Edit 2: After 40.48 hours of computation with N = 15000000000000000, it gave a result of [0.24761748, 0.24764422].

FWIW, it also gets the closed-open interval right, stating them as [a, b). It does have at least one error though:

any finite union of multiple sets is a periodic set

on page 6. I think he meant "any finite union of multiple periodic sets is a periodic set".

I do wish the paper discussed what limitation the use of native floating point data types (80-bit long double) has on the best bounds achievable by the program. There is some discussion on the matter starting at page 112, but if I understand correctly, this only describes how a problem regarding precision was solved. In scanning through the paper, I didn't find anything explicitly discussing limitations on precision that still apply. So I don't know whether there's a point at which the program will start narrowing the bounds without rigorous mathematical justification.

The same mathematician also co-wrote an article in 2014 about the error term in the count of abundant numbers. This pertains to another thing I said earlier here:

What would be much cooler, however, would be to prove that for every n, |An(2) - A(2)| < f(n) for some f(n) that asymptotically approaches zero, where an f(n) for which this is true is found.

The function they came up with, though, is very very pessimistic (extremely slow to converge, much slower than the density itself seems to converge). It is a start though, I suppose. The article itself is behind a paywall, unfortunately.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Dec 15, 2018

I thought of a pretty good analogy for what constructing ECMAScript regexes for functions on the natural numbers is like. I think it's much like doing straightedge and compass geometric construction. There are many things that are impossible to construct with those tools, but that hasn't stopped mathematicians from devoting much time to the subject.

I suspect that only a strict subset of primitive recursive functions are possible but at this point still can't prove it. I'm also pretty sure that with unbounded scratch space, all primitive recursive functions become possible (this would probably be easy to prove).

@Grimy

This comment has been minimized.

Copy link

Grimy commented Jan 31, 2019

Sorry to intrude, but there's still one loose end: the log2 golf challenge. I have a length 108 73 66 solution (8cf2ef821f9428e4a611b457ccc57344). What's the length of your b31a12cf9a9982d7d3204d198f87ba4f solution?

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Feb 1, 2019

Hello, Grimy, and welcome!

Wow, that is awesome. My b31a12cf9a9982d7d3204d198f87ba4f is 69 bytes, so you have beaten me! Very well done. I'd like to exchange our regexes, but privately, to maintain the viability of this challenge for a little longer. I'd like to see your earlier versions too!

I recently forked the log2 regex into:

  1. A version that strictly rounds down, which is 93 bytes with md5sum cef4f7a85f9703825c97b01012358d4d, and like the original, returns no-match for zero
  2. A base 10 version that returns the number of digits, which is 112 bytes with md5sum cc065c2c0e212933c0bcb1e3e9717a86 (it's the same as strictly rounded-down logarithm plus one – except with an input of zero, for which both regexes return no-match, because you can't return 1 with an input of 0)
  3. Logarithm in base N (passed as two comma-delimited unary numbers, base first, and including the base and comma in the returned result) at 165 bytes with md5sum df2b6dab8eb3fdf9a6c2d2e4e52257f3 (have not made a strictly-rounding-down or digit-counting version of this yet).

Tying up another loose end, I also wrote a regex to take a single unary number as input, and divide by sqrt(2) (rounding down) and return the result as a match, using an algorithm very similar to the one I sketched above. It uses molecular lookahead. The first fully correct version was 1159 bytes and I got it down to 849. I plan on porting it to pure ECMAScript, but I expect it will roughly double in length.

@Grimy

This comment has been minimized.

Copy link

Grimy commented Feb 4, 2019

Thanks!

I am now down to 57 chars on the log2 regex (md5sum c58bef1340954f1bbaf02c5790f33a01, --npcg+ only, --neam independent). I’ll gladly share it if you want.

I haven’t yet tried applying this algorithm to the base 10 or rounding down version of the task, but it should give significant improvements.

EDIT:

  • 76 chars for the version that strictly rounds down (or 74 if I’m allowed to return 0 instead of no match for a 0 input).
  • 97 chars for the base 10 version
@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Feb 7, 2019

Wow! 57 chars. That's incredible. I don't want you to just show it to me right away though; I'd like some time to try to figure it out. Can't work on it right away but I plan on it. I would indeed like to see them, please!

I did have an idea on how to make my floor-logarithm regexes much shorter. It did a little worse than the technique I already used, and frankly I'm very happy about that. The one I'd already used is much more interesting, and the one I just now tried would trivialize it if it did better.

And I'm also quite impressed to see that 76-57=19, a smaller difference than my 93-69=24.

@Davidebyzero

This comment has been minimized.

Copy link
Owner Author

Davidebyzero commented Feb 23, 2019

This discussion has been continued in the Mathematical Regexes Stack Exchange chat room.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.