@rbuckton
Last active September 10, 2020 22:31

ECMAScript Regular Expression Language Additions

This is a strawperson for the addition of multiple Regular Expression features popular in various languages and parsers. The primary influences for this proposal come from prior art in the following languages and regular expression engines:

New Regular Expression Flags

  • n - Explicit capture mode. Does not capture unnamed capture groups: (subexpression) is treated like (?:subexpression), but (?<name>subexpression) is treated as normal.

    • Prior Art:
  • x - Ignore pattern whitespace mode. Eliminates whitespace in a regular expression, and enables "x-mode" comments at the end of a line (comments starting with #).
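As a rough illustration of the n flag, the sketch below shows the present-day pattern that a hypothetical /(\d{4})-(?<month>\d{2})/n would be expected to behave like. The sample pattern and input are assumptions made for illustration; the n flag itself is not valid in current ECMAScript, so it appears only in a comment.

```javascript
// Hypothetical n-mode pattern (NOT valid today): /(\d{4})-(?<month>\d{2})/n
// Present-day equivalent: the unnamed group is written as non-capturing,
// while the named group still captures as normal.
const explicitCapture = /(?:\d{4})-(?<month>\d{2})/;

const match = explicitCapture.exec("2020-09");

// Only the named group produces a capture.
console.log(match.length);       // 2 (full match + one named group)
console.log(match.groups.month); // "09"
```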

Groups

  • (?#...comment...) - Inline comments. All content between (?# and the next (non-escaped) ) is eliminated from the pattern.

  • (?imnsx-imnsx) - Enables or disables specific RegExp flags from this position until the end of the current group (its closing )) or the end of the pattern. This is very useful when parsing regular expressions specified in other formats, such as in JSON configuration files or TextMate Language files.

    • Examples:
      • /a(?i)b/ - Matches ab, aB. Does not match: Ab, AB
      • /a(?-i)b/i - Matches ab, Ab. Does not match: aB, AB
    • Prior Art:
  • (?imnsxu-imnsxu:subexpression) - Non-capturing group that enables or disables specific RegExp flags for the provided subexpression. This is very useful when parsing regular expressions specified in other formats, such as in JSON configuration files or TextMate Language files.

    • Examples:
      • /a(?i:b)/ - Matches ab, aB. Does not match: Ab, AB
      • /a(?-i:b)/i - Matches ab, Ab. Does not match: aB, AB
    • Prior Art:
  • (?(expression)yes|no), (?(name)yes|no), (?(number)yes|no) - Conditional matching based on an expression or on a named or numbered backreference. If expression is DecimalDigits, it is treated as a numeric backreference. If expression is the name of an existing capture group, it is treated as a named backreference. For name and number, the condition tests whether the last evaluation of the capture group was a match. Otherwise, expression is treated as a zero-width assertion, equivalent to (?(?=expression)yes|no). The |no part may be omitted, in which case the conditional is treated as (?(expression)yes|).

    • Examples:
      • /\b(?(\d{2}-)\d{2}-\d{7}|\d{3}-\d{2}-\d{4})\b/ - Matches 12-1234567, 123-12-1234.
      • /\b(?<n2>\d{2}-)?(?(n2)\d{7}|\d{3}-\d{2}-\d{4})\b/ - Matches 12-1234567, 123-12-1234.
      • /\b(\d{2}-)?(?(1)\d{7}|\d{3}-\d{2}-\d{4})\b/ - Matches 12-1234567, 123-12-1234.
    • Prior Art:
  • (?<name1-name2>subexpression) - Balancing groups. Deletes a previously-named group (name2) and stores in the current group (name1) the interval between the previous group and the new group. If no name2 group is defined, the match backtracks. Useful for matching balanced parentheses or brackets.

    • Examples:
      • new RegExp(`
            ^                               # Start at beginning of string.
            [^<>]*                          # Match zero or more characters that are not angle brackets.
            (
                ((?<Open><)[^<>]*)+         # Match one or more open angle brackets followed by zero or
                                            # more non-bracket characters.
        
                ((?<Close-Open>>)[^<>]*)+   # Match one or more close angle brackets followed by zero
                                            # or more non-bracket characters. The substring between
                                            # Open and Close is stored in Close, and the previous Open
                                            # match is deleted.
            )*
            (?(Open)(?!))                   # If any Open groups still remain, fail the entire match
                                            # using a zero-width negative lookahead.
            $                               # Stop at end of string.
        `, "x") // Ignore whitespace to improve readability
        Matches: <abc><mno<xyz>>. Does not match: <, >, <<>, <>>
    • Prior Art:
      • .NET (NOTE: basis for this feature proposal)
      • Perl, "Recursive subpattern" bullet (NOTE: different recursion mechanism with similar capabilities)
      • Oniguruma (NOTE: different recursion mechanism with similar capabilities)
  • (?>subexpression) - Atomic groups. Non-capturing group that disables backtracking in the subexpression.
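None of the conditional, balancing, or atomic constructs above exist in ECMAScript today, but their intended behavior can be approximated with current features. The following is a minimal sketch under that assumption, not a reference implementation: the lookahead rewrites are common workarounds, the a(?>bc|b)c pattern is an illustrative example that does not come from the proposal, and isBalanced is a hypothetical helper that only mirrors the accept/reject results listed for the balancing-group example.

```javascript
// Conditional: (?(?=X)yes|no) rewritten as (?:(?=X)yes|(?!X)no).
const conditional = /\b(?:(?=\d{2}-)\d{2}-\d{7}|(?!\d{2}-)\d{3}-\d{2}-\d{4})\b/;

// Atomic group: a(?>bc|b)c approximated as a(?=(bc|b))\1c. A lookahead is
// never re-entered once it has matched, so the alternative chosen inside it
// is final. Note that this workaround introduces an extra capture group.
const atomicLike = /a(?=(bc|b))\1c/;
const backtracking = /a(?:bc|b)c/; // ordinary group, for comparison

// Balancing groups: a depth counter standing in for the Open/Close stack.
// Hypothetical helper, checked only against the examples listed above.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "<") {
      depth++; // push an Open
    } else if (ch === ">" && --depth < 0) {
      return false; // a Close with no Open left to pop
    }
  }
  return depth === 0; // fail if any Open remains
}

console.log(conditional.test("12-1234567"));    // true
console.log(conditional.test("123-12-1234"));   // true
console.log(backtracking.test("abc"));          // true
console.log(atomicLike.test("abc"));            // false
console.log(isBalanced("<abc><mno<xyz>>"));     // true
console.log(isBalanced("<<>"));                 // false
```

The atomic comparison shows the practical difference: the ordinary group gives back "bc" and retries with "b", while the atomic-like form commits to "bc" and fails.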

Backreference Constructs

  • \g<name>, \g<number> - Reexecute the subexpression of the named or numbered capture group at the current position. Allows reusing a capture group's subexpression without rewriting the capture group.
    • Examples:
      • new RegExp(String.raw`
          (?((?!))          # Failing conditional to define reusable groups.
              (?<Year>\d{4})
              (?<Month>\d{2})
              (?<Day>\d{2})
              (?<WeekOfYear>W\d{2})
              (?<DayOfWeek>\d)
              (?<DayOfYear>\d{3})
              (?<CalendarDate>\g<Year>-\g<Month>-\g<Day>)               # YYYY-MM-DD
              (?<WeekDate>\g<Year>-\g<WeekOfYear>-\g<DayOfWeek>)        # YYYY-Www-D
              (?<OrdinalDate>\g<Year>-\g<DayOfYear>)                    # YYYY-DDD
              (?<Date>\g<CalendarDate>|\g<WeekDate>|\g<OrdinalDate>)
          )
          \g<Date>
        `, "x") // Ignore whitespace to improve readability
        Matches: 2020-01-01, 2020-W01-6, 2020-200
    • Prior Art:
      • Oniguruma (NOTE: basis for this feature proposal)
      • Perl, "Recursive subpattern" bullet.
        • NOTE: Perl uses \g to mean the same thing as \k and uses a different syntax for this construct.
        • NOTE: Perl's ability to reuse a capture group seems to be limited to recursion.
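Until something like \g<name> is available, the reuse shown in the date example can be approximated by composing pattern source fragments as strings. The sketch below is one way to do that, assuming the same subexpressions as the example above; the variable names and the ^...$ anchoring are illustrative choices, not part of the proposal.

```javascript
// Reusable fragments, mirroring the named groups in the example above.
const year = String.raw`\d{4}`;
const month = String.raw`\d{2}`;
const day = String.raw`\d{2}`;
const weekOfYear = String.raw`W\d{2}`;
const dayOfWeek = String.raw`\d`;
const dayOfYear = String.raw`\d{3}`;

// Each fragment is spliced in where \g<name> would re-execute the group.
const calendarDate = `${year}-${month}-${day}`;          // YYYY-MM-DD
const weekDate = `${year}-${weekOfYear}-${dayOfWeek}`;   // YYYY-Www-D
const ordinalDate = `${year}-${dayOfYear}`;              // YYYY-DDD

// Anchored here (an assumption) so that only whole strings are accepted.
const date = new RegExp(`^(?:${calendarDate}|${weekDate}|${ordinalDate})$`);

console.log(date.test("2020-01-01")); // true
console.log(date.test("2020-W01-6")); // true
console.log(date.test("2020-200"));   // true
```

Unlike \g<name>, string composition cannot express recursion, since a fragment cannot contain itself.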

Miscellaneous Constructs

  • # comment - x mode comments. Only enabled when the x flag is set. Eliminates # and any text that follows it until the end of the line.
    • Examples:
      • new RegExp(`
          # this line is ignored
          ab
        `, "x")
        Matches ab.
    • Prior Art:
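The effect of x-mode comments can be roughly approximated today by preprocessing the pattern source before it reaches the RegExp constructor. stripExtended below is a hypothetical helper written for this sketch; it deliberately ignores cases a real x-mode implementation would need to handle, such as escaped whitespace, an escaped #, or either appearing inside a character class.

```javascript
// Hypothetical helper: approximates x-mode by rewriting the pattern source.
// Simplification: whitespace and "#" are always stripped, even when escaped
// or inside a character class.
function stripExtended(source) {
  return source
    .split(/\r?\n/)
    .map((line) => line.replace(/#.*$/, "")) // drop "#" comments to end of line
    .join("")
    .replace(/\s+/g, ""); // drop all remaining whitespace
}

const pattern = new RegExp(stripExtended(String.raw`
  # this line is ignored
  ab
`));

console.log(pattern.source);     // "ab"
console.log(pattern.test("ab")); // true
```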