Skip to content

Instantly share code, notes, and snippets.

@floere
Created March 22, 2012 01:16
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save floere/2154980 to your computer and use it in GitHub Desktop.
Save floere/2154980 to your computer and use it in GitHub Desktop.
A regexp to match named groups in a URL
# I've looked at the case at hand which is:
# foo.bar -> foo: foo.bar
# foo.bar -> foo: foo, format: bar
# This roughly translates to:
# For each part (foo, format), match as many possible subexpressions consisting of multiple word characters or one non-word characters
# (we might say \. explicitly, in this specific case). Do this lazily, except for the last part, since that one needs to gobble up the rest.
# And: If we have two or more parts, join them by a lazy match of a dot (\.?), which is not included in the named group.
# Examples
# 1 part:
#
p "foo".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo" foo:"foo">
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo.bar">
# 2 parts:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo" bar:"bar">
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar.bur">
# 3 parts:
#
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+?)\.?(?<bur>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar" bur:"bur">
# (Note that the last expression is greedy – in the last example: (?<bur>(\w+|\W?)+) <- greedy – while the others are not, to gobble up the rest :) If you don't want this, don't make it greedy)
# So generally, what this means: If you have "pots", like foo, format etc. or let's say a, b, c, d, e, f… these regexps will distribute anything looking like:
# 1.2.3.4.5.6.7
# in these pots. If there's not enough to go around, e.g. with 1,2, then it will only be distributed to the first two pots.
# If there's too much to go around, it depends whether the last expression is greedy or not:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)/) # => foo matched
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => foo.bar gobbled up
@floere
Copy link
Author

floere commented Mar 22, 2012

I'm asking this because:

r = /[\/\.%2E]+/
p "/foo.bar"    .split(r) # => ["", "foo", "bar"]
p "/foo%2Ebar"  .split(r) # => ["", "foo", "bar"]

r = /[\/@]+/
p "/foo@bar"    .split(r) # => ["", "foo", "bar"]
p "/foo.foo@bar".split(r) # => ["", "foo.foo", "bar"]
p "/foo@bar.bar".split(r) # => ["", "foo", "bar.bar"]

Which is very close to the expected behavior, while being relatively simple (to generate as well). Cheers!

@rkh
Copy link

rkh commented Mar 22, 2012

match is used. So you mean we should detect that scenario and "fix" it after matching?

@floere
Copy link
Author

floere commented Mar 22, 2012

I wonder about the usage of match since using split seems a much better fit (in terms of generating the regexps and mental mapping), still assuming that an array of parts is needed as a result.
No, I am suggesting to use split if that is a possibility.

@rkh
Copy link

rkh commented Mar 22, 2012

I don't get yet what the program flow would be like with split.

@floere
Copy link
Author

floere commented Mar 22, 2012

Can you point me to the match in the Sinatra code, please? :) (Then I am able to help much better)

Note: From your table I assumed you actually need an array of the parts as output, which is why I offered the idea of using split.

@floere
Copy link
Author

floere commented Mar 22, 2012

Thanks! I see.

Ok then, match it is:

r = /^\/([^\/?#]+)(?:\.|%2E)+([^\/?#]+)?$/
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

r = /^\/([^\/?#]+)(?:@|%40)+([^\/?#]+)?$/
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

(Both differ from the original in 1 position: ? -> +)

Cheers!

@rkh
Copy link

rkh commented Mar 22, 2012

That doesn't match "/foo".

@floere
Copy link
Author

floere commented Mar 22, 2012

I'm beginning to understand. Also: How about being more constructive? :)

@rkh
Copy link

rkh commented Mar 22, 2012

Haha, I'm sorry, I super thankful for your investigation so far. I just don't know if this is even possible.

@floere
Copy link
Author

floere commented Mar 22, 2012

Yeah, my problem was that I didn't understand what Sinatra was actually doing – but now I do. Source code reading helps :)

@floere
Copy link
Author

floere commented Mar 22, 2012

I believe I got one. On to the next.

r = /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/
p "/foo".match(r)
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

@floere
Copy link
Author

floere commented Mar 22, 2012

The second one is probably

r = /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
p "/foo".match(r)
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

Can you check, @rkh?

@rkh
Copy link

rkh commented Mar 22, 2012

OK, that just means we'll have to do special parsing (i.e. recognize that it's of to have :format? not match the .). Might need to do proper parsing of the pattern then instead of just some replacements + Regexp.new.

@floere
Copy link
Author

floere commented Mar 22, 2012

Unsure about that. The patterns I posted actually mimic a pattern that I've seen above, and which could be summarized as:
Match all in group except the separator character(s) up to the separator character(s), then continue matching all in group except the separator character(s).
E.g. /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/

@floere
Copy link
Author

floere commented Mar 22, 2012

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

@rkh
Copy link

rkh commented Mar 22, 2012

Yes, except that '.' is not a separator character. The issue is, at the moment we do a simple gsub for :format, so we don't know that :format is actually part of :name.?:format? and I got the feeling that in order to solve this properly we'd need a proper parser, as this is not describable with a regexp.

@rkh
Copy link

rkh commented Mar 22, 2012

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

All but the failing ones are actually in there already (that's were I got em from).

@floere
Copy link
Author

floere commented Mar 22, 2012

Good point with the "." not being a separator character. A parser feels like the way to go here, but I had a feeling it might be doable using regexps. I'll have a look at Base#compile.

Thanks, I believe the tests would benefit a lot from being in a tabular form, but that's just me :)

@rkh
Copy link

rkh commented Mar 22, 2012

Yeah, I meant a parser for the pattern -> regexp step, not the request path -> route parsing, I would still use a regexp there. the real question would be how to generate it. I would also like :name(.:format)? to be possible.

@floere
Copy link
Author

floere commented Mar 22, 2012

Yes, I got that! :)

@floere
Copy link
Author

floere commented Mar 22, 2012

Ok, rewriting the challenge as: Find a way to elegantly map the given patterns into their corresponding regexps, such that they work for all given examples.

Full Pattern - Regexp mapping:

/                 | /^\/$/
/foo              | /^\/foo$/
/f\u00F6\u00F6    | /^\/f%C3%B6%C3%B6$/
/:foo             | /^\/([^\/?#]+)$/
/:foo/:bar        | /^\/([^\/?#]+)\/([^\/?#]+)$/
/hello/:person    | /^\/hello\/([^\/?#]+)$/
/?:foo?/?:bar?    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/
/*                | /^\/(.*?)$/
/:foo/*           | /^\/([^\/?#]+)\/(.*?)$/
/test.bar         | /^\/test(?:\.|%2E)bar$/
/test$/           | /^\/test(?:\$|%24)\/$/
/te+st/           | /^\/te(?:\+|%2B)st\/$/
/test(bar)/       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/
/path with spaces | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/
/foo&bar          | /^\/foo(?:&|%26)bar$/
/*/foo/*/*        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/
/:file.:ext       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/
/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/
/:user@?:host?    | /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
/:name.?:format?  | /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/

@floere
Copy link
Author

floere commented Mar 22, 2012

I wouldn't mind working on this, if you don't mind :)

@floere
Copy link
Author

floere commented Mar 23, 2012

Ok, almost got it. Cleaning up the code and preparing for a pull request :)

@floere
Copy link
Author

floere commented Mar 23, 2012

Note: I'm assuming this example should actually be nil. At least it looks like that to me. If I am wrong, please tell me.

"/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/ | /.bar | [.bar, nil]"

@floere
Copy link
Author

floere commented Mar 23, 2012

Let's move this to: sinatra/sinatra#492 Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment