@rtfeldman
Last active September 7, 2015 15:31

Idea: Regex Literals

Motivation

elm-lang/core/#378 surfaced an interesting tension:

  • An unsafe regex-compiling function like the current regex function can crash at runtime if given invalid syntax. This is a serious problem when you want to support arbitrary regexes coming in from end users.
  • A version that returned Result would neatly handle that case, but would be very inconvenient in the common case where you're hardcoding the regex and know it will definitely compile. You would either have to unsafely extract the Result or else do a lot of unnecessary defensive programming for a case that can't come up.
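
To make the tension concrete, here is a minimal TypeScript sketch of the JavaScript behavior that sits underneath a runtime regex function: constructing a RegExp from an invalid pattern throws a SyntaxError. The Result type and the compileRegex helper are hypothetical names used only for illustration; they are not part of Elm's core library or of JavaScript.

```typescript
// Hypothetical Result type, loosely mirroring Elm's Result.
type Result<E, A> =
  | { tag: "Err"; error: E }
  | { tag: "Ok"; value: A };

// Hypothetical safe compile function: catches the SyntaxError that
// the RegExp constructor throws for an invalid pattern.
function compileRegex(pattern: string): Result<string, RegExp> {
  try {
    return { tag: "Ok", value: new RegExp(pattern) };
  } catch (e) {
    const message = e instanceof Error ? e.message : String(e);
    return { tag: "Err", error: message };
  }
}

// A hardcoded pattern the author knows is valid still has to be unwrapped.
const whitespace = compileRegex("^\\s+$");
const isBlank = whitespace.tag === "Ok" ? whitespace.value.test("   ") : false;

// User-supplied input may be invalid; here, an unterminated character class.
const fromUser = compileRegex("[abc");
```

The unwrapping step for the hardcoded pattern is the inconvenience the second bullet describes, while the Err branch for user input is the safety the first bullet asks for.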

Idea

There's a third option: do what other languages do and offer regex literals. The validity of their syntax can be checked at compile time.

isWhitespace : Regex
isWhitespace =
    /^\s+$/

Given the ability to easily create a Regex that cannot crash, there is no downside to making the regex function return a Result, which neatly solves elm-lang/core/#378.

Benefits

  • Emitting JavaScript RegExp literals would improve performance, according to MDN.
  • Hardcoded regexes can no longer throw runtime exceptions. Although those exceptions typically arrive promptly on startup, like a port error, they might not if the regular expression is instantiated deep in some nested conditionals.
  • Given verified literals, using Result to solve elm-lang/core/#378 has no downside.
  • Syntax highlighters for regex literals can improve source readability. See for example the highlighting in the above snippet.

Drawbacks

  • It's one more feature, increasing language complexity.
  • This feels like a minor pain point so far, so the opportunity cost matters: time spent implementing it is time not spent on other language features.
  • It might be difficult to check JS regexp syntax at compile time with sufficient accuracy to guarantee that it won't fail to compile at runtime in any relevant browsers.

What Other Languages Do

Regular expressions are a common enough tool in industry programming that many languages offer first-class support for them, such as:

@eeue56

eeue56 commented Aug 30, 2015

A solution like this seems sensible to me, leaving regex to handle dynamic regexes and return error messages, and /[abc]/ syntax for static regexes. This syntax would undoubtedly replace usage of regex in most cases and for most users.

If Elm continues using JavaScript's regex engine (and there's no reason not to), then the error messages should be as helpful as the current runtime error - they should not differ at all.

There are some issues with using slashes as the syntax for regexes, for example:

isAlive = dead /= True && isZombie /= True

A naive implementation may try to grab the text between the two slashes, = True && isZombie, as a RegExp, which is valid in JS, but a well-defined syntax will be able to deal with that.
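
The ambiguity can be sketched in TypeScript, assuming a deliberately naive scanner that treats any slash-delimited run on a line as a regex literal. findNaiveRegexLiterals is a hypothetical helper written for this illustration, not part of any Elm tooling.

```typescript
// Hypothetical naive scanner: treats every /.../ run on a line as a regex literal.
function findNaiveRegexLiterals(line: string): string[] {
  const matches = line.match(/\/[^\/\n]+\//g);
  return matches === null ? [] : matches;
}

const elmLine = "isAlive = dead /= True && isZombie /= True";
const found = findNaiveRegexLiterals(elmLine);

// Strip the delimiters to get the text captured between the two /= operators.
const body = (found[0] ?? "").slice(1, -1);

// The captured body is itself a valid JavaScript pattern, so validating the
// pattern alone would not reject the misparse.
const compiles = (() => {
  try {
    new RegExp(body);
    return true;
  } catch (_e) {
    return false;
  }
})();
```

Because the misparsed text compiles, the fix has to come from the grammar (for example, a delimiter that cannot appear in operators) rather than from checking the pattern.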

@rtfeldman
Author

Great point about the slashes. I defaulted to them because they're the most widely used in other languages, so there'd be a copy/pasting benefit - especially when it comes to what is backslash-escaped. (And copy/pasting legit comes up a lot with regexes.)

However, your example makes it pretty clear that any regex literal delimiter that is allowed in infix operators would break in this case. It would have to be something disallowed in variable names.

If backticks stopped being for infixification, they could be recycled for this purpose. That would probably be the easiest and nicest option, especially since backticks rarely get used in regexes, so you'd rarely need to escape them.

isWhitespace : Regex
isWhitespace =
    `^\s+$`

That said, I can't think of anything else that would work.

@eeue56

eeue56 commented Aug 31, 2015

I feel like a more natural solution is to instead define it as a new operator, rather than reusing something already in use (backticks). In Haskell, regex-posix uses the =~ symbol for dealing with RegExp, although not in the same sense as Perl/JS/Ruby's regex literals, and has trickled down into other libraries (like regex-pcre).

Though that being said, if it is to be built in to the compiler for compile-time checks then I guess it makes sense for it to be part of the language, rather than part of Basics. I'm still not fond of backticks - but maybe something else like

isWhitespace =
    <^\s+$>
-- or
isWhitespace =
    [^\s+$]

could be used instead. I prefer the look of the angle brackets, but square brackets conflict with fewer existing things.

@ianbollinger

What about [regex| |]? (Where regex could be shortened to r if it's too verbose.)

@ARM9

ARM9 commented Sep 7, 2015

How about using the clojure style #"regex" or #/regex/? This should avoid any parsing ambiguities since # isn't used for anything yet. Or perhaps r/regex/ (r/abc/ xyz === (/) r abc / xyz?) or r"regex" (bar"hello" === bar "hello" might pose a problem as well). Similarly rregex might be problematic, see http://share-elm.com/sprout/55edadace4b0ff56ab7596d6
