@ErikCorryGoogle
Created January 10, 2017 08:10
I like that names have to be unique. Some regexp flavours allow dupes and unify the storage between them,
but that feels complicated and difficult to spec. For something like this:
/(?<foo>..)((?<foo>..))*/
normally you would reset the capture whenever you iterate the *-loop, but it would be strange to delete the
foo capture on entering the loop, or would it?
By putting the named captures on the match object as properties, you are preventing any future standard from
ever adding a new property to the Match object, ever, since it might conflict with the name of a named capture.
Perhaps it makes more sense to add a .map property to the match of type Map and have string keys on that map.
This also avoids the question of what happens if someone makes a named capture called __proto__ or prototype.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map
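To illustrate the Map suggestion, here is a minimal sketch. The helper `capturesToMap` is hypothetical and not part of any proposal; it only shows that string keys in a Map cannot collide with existing properties or with `__proto__`, whereas plain property assignment can.

```javascript
// Hypothetical helper: collect named captures into a Map instead of
// putting them directly on the match object.
function capturesToMap(names, values) {
  const map = new Map();
  names.forEach((name, i) => map.set(name, values[i]));
  return map;
}

const captureMap = capturesToMap(["__proto__", "index"], ["a", "b"]);
const protoCapture = captureMap.get("__proto__"); // an ordinary string key
const indexCapture = captureMap.get("index");     // cannot clash with match.index

// The same name assigned as a plain property behaves differently:
// the assignment goes through the inherited __proto__ setter, which
// ignores non-object values, so no own property is created.
const plain = {};
plain["__proto__"] = "a";
const hasOwnProto = Object.prototype.hasOwnProperty.call(plain, "__proto__");
```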
You picked the .NET syntax for backreferences \k<name> instead of the Python syntax (?P=name). I think the
Python one was probably better, because JS does not raise a syntax error for unknown alpha escapes, so \k<name>
would previously match the literal string "k<name>", whereas (?P...) would previously cause a syntax error.
If you switch to the Python syntax for backreferences, you should of course do the same for the captures
themselves.
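The compatibility concern can be checked directly. This is a sketch, assuming an engine that follows the web-compatibility (Annex B) regexp grammar; `compiles` is a hypothetical helper introduced only for this example.

```javascript
// Without the u flag and without any named group in the pattern,
// \k is treated as an identity escape, so this pattern matches the
// literal text "k<name>".
const legacyMatch = /\k<name>/.test("k<name>");

// Hypothetical helper: report whether a pattern compiles at all.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

// Python-style groups are a syntax error in JS, so adopting them
// could not have changed the meaning of any existing pattern.
const pythonStyleCompiles = compiles("(?P<name>x)(?P=name)", "");

// In Unicode mode, \k<name> without a matching group is rejected.
const unicodeKCompiles = compiles("\\k<name>", "u");
```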
.NET will number all the unnamed captures first from left to right, then number the named ones from left to
right. Most others number all of them regardless of whether they are named. I think you went with the non-
.NET version, which feels right.
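As a sketch of the left-to-right numbering, assuming an engine that implements named groups with the (?<name>...) syntax and a `groups` object on the match (neither is guaranteed by the text above):

```javascript
// Named and unnamed groups share one numbering, left to right.
const numbered = /(a)(?<mid>b)(c)/.exec("abc");

const first = numbered[1];           // unnamed group
const second = numbered[2];          // the named group, also reachable by index
const third = numbered[3];           // unnamed group
const byName = numbered.groups.mid;  // same capture as numbered[2]
```

Under the .NET scheme described above, the two unnamed groups would be numbered 1 and 2, and the named group would come last.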
@ErikCorryGoogle
Author

  1. I'd like to add something to the match object: string offsets. Something like:

var match = /foo(...)bar/.exec("foo & bar");
match[1]; // This is already " & ".
match.starts[1]; // 3
match.ends[1]; // 6

This doesn't make the match array any heavier - it's just a getter for some information that's already there, but it means that the tags "starts" and "ends" are now also out of bounds for named captures.

The point is that you are giving the whole namespace of the match object to named captures, whereas if you just add a getter called "map" that returns a Map, then you have the freedom to add stuff like this later.
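The clash is easy to reproduce. In this sketch, `copyCapturesOntoMatch` is a hypothetical stand-in for "named captures as properties of the match object"; it uses the `groups` object of current engines only as a convenient source of names and values.

```javascript
// Hypothetical: copy every named capture directly onto the match array.
function copyCapturesOntoMatch(match) {
  return Object.assign(match, match.groups);
}

const original = /(?<index>\d+)/.exec("abc 42");
const offsetBefore = original.index; // numeric offset of the match

const clashed = copyCapturesOntoMatch(original);
const offsetAfter = clashed.index;   // now the captured text, the offset is lost
```

Any property added to the match object in the future, such as the proposed starts and ends, would be exposed to the same overwrite.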

Alternative suggestion: Don't allow capture names that start with "__". This excludes __proto__ and gives a namespace for later extensions. "__*" is a pretty crappy namespace though.

  2. Tying this to Unicode mode (which is quite a big performance hit) seems bad.

@ErikCorryGoogle
Author

My double underscore in quotes turned into "" in the comment above.

@slevithan

slevithan commented Jan 12, 2017

I agree with everything @littledan said. Weakly held on #2 because of the very reasonable counterarguments. Also, backreference start indexes are an important missing feature and I would love to have them.

The precedent already linked to the proposal is helpful. You can see even more precedent for named capture syntax and behavior at http://xregexp.com/syntax/named_capture_comparison/.

Does the proposal address the question of how named captures would be referenced within a string replacement closure? In XRegExp I included what I think is a creative solution for this--replace the string literal provided as the first argument with a String object that has named backreferences as properties--but that feels slightly hacky and I'm very curious what might be better approaches.
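For comparison, one possible answer is to pass the named captures to the replacement function as an extra argument. The sketch below assumes an engine that appends a groups object as the last argument when the pattern has named groups; it is not the XRegExp approach described above.

```javascript
const swapped = "2017-01".replace(
  /(?<year>\d{4})-(?<month>\d{2})/,
  (...args) => {
    // Assumption: with named groups present, the final argument is an
    // object holding one property per named capture.
    const groups = args[args.length - 1];
    return `${groups.month}/${groups.year}`;
  }
);
```

This keeps the positional arguments (match, numbered captures, offset, input string) unchanged, so existing closures keep working.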

@slevithan

> Tying this to Unicode mode (which is quite a big performance hit) seems bad.

Tying stricter syntax errors (that enable introducing many useful, established regex features from other languages) to Unicode mode seems like it might have been bad. This will come up again in the future so it doesn't feel like a strong reason to avoid the \k syntax that has overwhelming consensus.

Even in languages/libraries where both .NET and Python syntax are supported, the Python syntax is typically not encouraged.

@ErikCorryGoogle
Author

At some point we could introduce the /x syntax and that would also let us fix the other syntactic strangeness without forcing /u semantics. My favourite one is, what does /\c./ match?
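For readers who want to try the /\c./ puzzle, here is a sketch under the assumption that the engine follows the web-compatibility (Annex B) grammar, where a \c that is not followed by a letter is read as a literal backslash followed by a literal "c".

```javascript
// "\\c!" is the three-character text: backslash, c, exclamation mark.
const matchesBackslashC = /\c./.test("\\c!");

// Without the backslash in the input there is nothing for the pattern to match.
const matchesPlainC = /\c./.test("c!");

// With the u flag the same pattern is rejected outright.
let unicodeModeThrows = false;
try {
  new RegExp("\\c.", "u");
} catch (e) {
  unicodeModeThrows = e instanceof SyntaxError;
}
```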

@ErikCorryGoogle
Author

Re referencing captures within a string closure, perhaps it's cleaner to just augment "replace" with replaceMatch which takes a single match object like the array RegExp.prototype.exec returns.

I guess you support $ like $1 when the replacement is a string rather than a closure?

@ErikCorryGoogle
Author

Github messing with my replies again. I meant:

I guess you support $<foo> // this is supposed to look like dollar-lessthan-foo-greaterthan
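A sketch of what that looks like next to the numbered form, assuming an engine that understands $<name> in replacement strings when the pattern has named groups:

```javascript
// Numbered references in a replacement string.
const byNumber = "2017-01".replace(/(\d{4})-(\d{2})/, "$2/$1");

// Named references: dollar, less-than, name, greater-than.
const byNameRef = "2017-01".replace(
  /(?<year>\d{4})-(?<month>\d{2})/,
  "$<month>/$<year>"
);
```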

@hashseed

I agree that adding named captures as properties on the match object is a bad idea, for the mentioned reasons.
One solution I could get behind is having a sub-object for the captures. Another way could be to change match object to be a map, so that you could do something like
/(?<foo>abc)/.exec("abc").get("foo")

@schuay

schuay commented Jan 18, 2017

I think using maps in any context here (either as the match object itself, or as the match.groups object) would be a pretty significant performance hit.

+1 for storing captures on a sub-object, where the sub-object is a plain JS object with one property per named capture, and the sub-object exists if and only if the regexp has named captures.
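The sub-object design can be sketched as follows, assuming an engine that exposes it as `match.groups` (the name used earlier in this comment). Creating the object with a null prototype, as such engines do, also sidesteps the `__proto__` question raised at the top of the thread.

```javascript
const withNames = /(?<foo>abc)/.exec("abc");
const withoutNames = /(abc)/.exec("abc");

// One property per named capture on the sub-object.
const fooCapture = withNames.groups.foo;

// No named captures in the pattern: no sub-object.
const groupsWhenUnnamed = withoutNames.groups;

// The sub-object inherits nothing, so capture names cannot collide
// with Object.prototype members.
const groupsPrototype = Object.getPrototypeOf(withNames.groups);
```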
