Skip to content

Instantly share code, notes, and snippets.

@floere
Created March 22, 2012 01:16
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save floere/2154980 to your computer and use it in GitHub Desktop.
Save floere/2154980 to your computer and use it in GitHub Desktop.
A regexp to match named groups in a URL
# I've looked at the case at hand which is:
# foo.bar -> foo: foo.bar
# foo.bar -> foo: foo, format: bar
# This roughly translates to:
# For each part (foo, format), match as many possible subexpressions consisting of multiple word characters or one non-word characters
# (we might say \. explicitly, in this specific case). Do this lazily, except for the last part, since that one needs to gobble up the rest.
# And: If we have two or more parts, join them by a lazy match of a dot (\.?), which is not included in the named group.
# Examples
# 1 part:
#
p "foo".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo" foo:"foo">
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo.bar">
# 2 parts:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo" bar:"bar">
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar.bur">
# 3 parts:
#
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+?)\.?(?<bur>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar" bur:"bur">
# (Note that the last expression is greedy – in the last example: (?<bur>(\w+|\W?)+) <- greedy – while the others are not, to gobble up the rest :) If you don't want this, don't make it greedy)
# So generally, what this means: If you have "pots", like foo, format etc. or let's say a, b, c, d, e, f… these regexps will distribute anything looking like:
# 1.2.3.4.5.6.7
# in these pots. If there's not enough to go around, e.g. with 1,2, then it will only be distributed to the first two pots.
# If there's too much to go around, it depends whether the last expression is greedy or not:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)/) # => foo matched
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => foo.bar gobbled up
@floere
Copy link
Author

floere commented Mar 22, 2012

Thanks!

@rkh
Copy link

rkh commented Mar 22, 2012

Working
=======

 Pattern             | Current Regexp                                           | Example                          | Should Be                            | Is Currently                        
---------------------|----------------------------------------------------------|----------------------------------|--------------------------------------|--------------------------------------
 "/"                 | /^\/$/                                                   | "/"                              | []                                   | []                                  
 "/foo"              | /^\/foo$/                                                | "/foo"                           | []                                   | []                                  
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo"                           | ["foo"]                              | ["foo"]                             
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo?"                          | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo/bar"                       | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo%2Fbar"                     | ["foo%2Fbar"]                        | ["foo%2Fbar"]                       
 "/:foo/:bar"        | /^\/([^\/?#]+)\/([^\/?#]+)$/                             | "/foo/bar"                       | ["foo", "bar"]                       | ["foo", "bar"]                      
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/"                              | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo/"                          | nil                                  | nil                                 
 "/f\u00F6\u00F6"    | /^\/f%C3%B6%C3%B6$/                                      | "/f%C3%B6%C3%B6"                 | []                                   | []                                  
 "/hello/:person"    | /^\/hello\/([^\/?#]+)$/                                  | "/hello/Frank"                   | ["Frank"]                            | ["Frank"]                           
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/hello/world"                   | ["hello", "world"]                   | ["hello", "world"]                  
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/hello"                         | ["hello", nil]                       | ["hello", nil]                      
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/"                              | [nil, nil]                           | [nil, nil]                          
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | ""                               | [nil, nil]                           | [nil, nil]                          
 "/*"                | /^\/(.*?)$/                                              | "/"                              | [""]                                 | [""]                                
 "/*"                | /^\/(.*?)$/                                              | "/foo"                           | ["foo"]                              | ["foo"]                             
 "/*"                | /^\/(.*?)$/                                              | "/"                              | [""]                                 | [""]                                
 "/*"                | /^\/(.*?)$/                                              | "/foo/bar"                       | ["foo/bar"]                          | ["foo/bar"]                         
 "/:foo/*"           | /^\/([^\/?#]+)\/(.*?)$/                                  | "/foo/bar/baz"                   | ["foo", "bar/baz"]                   | ["foo", "bar/baz"]                  
 "/:foo/:bar"        | /^\/([^\/?#]+)\/([^\/?#]+)$/                             | "/user@example.com/name"         | ["user@example.com", "name"]         | ["user@example.com", "name"]        
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/pony.jpg"                      | ["pony", "jpg"]                      | ["pony", "jpg"]                     
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/pony%2Ejpg"                    | ["pony", "jpg"]                      | ["pony", "jpg"]                     
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/.jpg"                          | nil                                  | nil                                 
 "/test.bar"         | /^\/test(?:\.|%2E)bar$/                                  | "/test.bar"                      | []                                   | []                                  
 "/test.bar"         | /^\/test(?:\.|%2E)bar$/                                  | "/test0bar"                      | nil                                  | nil                                 
 "/test$/"           | /^\/test(?:\$|%24)\/$/                                   | "/test$/"                        | []                                   | []                                  
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/te+st/"                        | []                                   | []                                  
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/test/"                         | nil                                  | nil                                 
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/teeest/"                       | nil                                  | nil                                 
 "/test(bar)/"       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/                      | "/test(bar)/"                    | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path%20with%20spaces"          | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path%2Bwith%2Bspaces"          | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path+with+spaces"              | []                                   | []                                  
 "/foo&bar"          | /^\/foo(?:&|%26)bar$/                                    | "/foo&bar"                       | []                                   | []                                  
 "/:foo/*"           | /^\/([^\/?#]+)\/(.*?)$/                                  | "/hello%20world/how%20are%20you" | ["hello%20world", "how%20are%20you"] | ["hello%20world", "how%20are%20you"]
 "/*/foo/*/*"        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/                           | "/bar/foo/bling/baz/boom"        | ["bar", "bling", "baz/boom"]         | ["bar", "bling", "baz/boom"]        
 "/*/foo/*/*"        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/                           | "/bar/foo/baz"                   | nil                                  | nil                                 
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo"                           | ["foo", nil]                         | ["foo", nil]                        
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/.bar"                          | [".bar", nil]                        | [".bar", nil]                       

Broken
======

 Pattern             | Current Regexp                                           | Example                          | Should Be                            | Is Currently                        
---------------------|----------------------------------------------------------|----------------------------------|--------------------------------------|--------------------------------------
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo.bar"                       | ["foo", "bar"]                       | ["foo.bar", nil]                    
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo%2Ebar"                     | ["foo", "bar"]                       | ["foo%2Ebar", nil]                  
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo@bar"                       | ["foo", "bar"]                       | ["foo@bar", nil]                    
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo.foo@bar"                   | ["foo.foo", "bar"]                   | ["foo.foo@bar", nil]                
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo@bar.bar"                   | ["foo", "bar.bar"]                   | ["foo@bar.bar", nil]  

@rkh
Copy link

rkh commented Mar 22, 2012

@floere
Copy link
Author

floere commented Mar 22, 2012

Great! Does the result need to be an Array? Is #match used? Why not #split?

@floere
Copy link
Author

floere commented Mar 22, 2012

I'm asking this because:

r = /[\/\.%2E]+/
p "/foo.bar"    .split(r) # => ["", "foo", "bar"]
p "/foo%2Ebar"  .split(r) # => ["", "foo", "bar"]

r = /[\/@]+/
p "/foo@bar"    .split(r) # => ["", "foo", "bar"]
p "/foo.foo@bar".split(r) # => ["", "foo.foo", "bar"]
p "/foo@bar.bar".split(r) # => ["", "foo", "bar.bar"]

Which is very close to the expected behavior, while being relatively simple (to generate as well). Cheers!

@rkh
Copy link

rkh commented Mar 22, 2012

match is used. So you mean we should detect that scenario and "fix" it after matching?

@floere
Copy link
Author

floere commented Mar 22, 2012

I wonder about the usage of match since using split seems a much better fit (in terms of generating the regexps and mental mapping), still assuming that an array of parts is needed as a result.
No, I am suggesting to use split if that is a possibility.

@rkh
Copy link

rkh commented Mar 22, 2012

I don't get yet what the program flow would be like with split.

@floere
Copy link
Author

floere commented Mar 22, 2012

Can you point me to the match in the Sinatra code, please? :) (Then I am able to help much better)

Note: From your table I assumed you actually need an array of the parts as output, which is why I offered the idea of using split.

@floere
Copy link
Author

floere commented Mar 22, 2012

Thanks! I see.

Ok then, match it is:

r = /^\/([^\/?#]+)(?:\.|%2E)+([^\/?#]+)?$/
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

r = /^\/([^\/?#]+)(?:@|%40)+([^\/?#]+)?$/
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

(Both differ from the original in 1 position: ? -> +)

Cheers!

@rkh
Copy link

rkh commented Mar 22, 2012

That doesn't match "/foo".

@floere
Copy link
Author

floere commented Mar 22, 2012

I'm beginning to understand. Also: How about being more constructive? :)

@rkh
Copy link

rkh commented Mar 22, 2012

Haha, I'm sorry, I super thankful for your investigation so far. I just don't know if this is even possible.

@floere
Copy link
Author

floere commented Mar 22, 2012

Yeah, my problem was that I didn't understand what Sinatra was actually doing – but now I do. Source code reading helps :)

@floere
Copy link
Author

floere commented Mar 22, 2012

I believe I got one. On to the next.

r = /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/
p "/foo".match(r)
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

@floere
Copy link
Author

floere commented Mar 22, 2012

The second one is probably

r = /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
p "/foo".match(r)
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

Can you check, @rkh?

@rkh
Copy link

rkh commented Mar 22, 2012

OK, that just means we'll have to do special parsing (i.e. recognize that it's of to have :format? not match the .). Might need to do proper parsing of the pattern then instead of just some replacements + Regexp.new.

@floere
Copy link
Author

floere commented Mar 22, 2012

Unsure about that. The patterns I posted actually mimic a pattern that I've seen above, and which could be summarized as:
Match all in group except the separator character(s) up to the separator character(s), then continue matching all in group except the separator character(s).
E.g. /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/

@floere
Copy link
Author

floere commented Mar 22, 2012

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

@rkh
Copy link

rkh commented Mar 22, 2012

Yes, except that '.' is not a separator character. The issue is, at the moment we do a simple gsub for :format, so we don't know that :format is actually part of :name.?:format? and I got the feeling that in order to solve this properly we'd need a proper parser, as this is not describable with a regexp.

@rkh
Copy link

rkh commented Mar 22, 2012

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

All but the failing ones are actually in there already (that's were I got em from).

@floere
Copy link
Author

floere commented Mar 22, 2012

Good point with the "." not being a separator character. A parser feels like the way to go here, but I had a feeling it might be doable using regexps. I'll have a look at Base#compile.

Thanks, I believe the tests would benefit a lot from being in a tabular form, but that's just me :)

@rkh
Copy link

rkh commented Mar 22, 2012

Yeah, I meant a parser for the pattern -> regexp step, not the request path -> route parsing, I would still use a regexp there. the real question would be how to generate it. I would also like :name(.:format)? to be possible.

@floere
Copy link
Author

floere commented Mar 22, 2012

Yes, I got that! :)

@floere
Copy link
Author

floere commented Mar 22, 2012

Ok, rewriting the challenge as: Find a way to elegantly map the given patterns into their corresponding regexps, such that they work for all given examples.

Full Pattern - Regexp mapping:

/                 | /^\/$/
/foo              | /^\/foo$/
/f\u00F6\u00F6    | /^\/f%C3%B6%C3%B6$/
/:foo             | /^\/([^\/?#]+)$/
/:foo/:bar        | /^\/([^\/?#]+)\/([^\/?#]+)$/
/hello/:person    | /^\/hello\/([^\/?#]+)$/
/?:foo?/?:bar?    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/
/*                | /^\/(.*?)$/
/:foo/*           | /^\/([^\/?#]+)\/(.*?)$/
/test.bar         | /^\/test(?:\.|%2E)bar$/
/test$/           | /^\/test(?:\$|%24)\/$/
/te+st/           | /^\/te(?:\+|%2B)st\/$/
/test(bar)/       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/
/path with spaces | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/
/foo&bar          | /^\/foo(?:&|%26)bar$/
/*/foo/*/*        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/
/:file.:ext       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/
/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/
/:user@?:host?    | /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
/:name.?:format?  | /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/

@floere
Copy link
Author

floere commented Mar 22, 2012

I wouldn't mind working on this, if you don't mind :)

@floere
Copy link
Author

floere commented Mar 23, 2012

Ok, almost got it. Cleaning up the code and preparing for a pull request :)

@floere
Copy link
Author

floere commented Mar 23, 2012

Note: I'm assuming this example should actually be nil. At least it looks like that to me. If I am wrong, please tell me.

"/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/ | /.bar | [.bar, nil]"

@floere
Copy link
Author

floere commented Mar 23, 2012

Let's move this to: sinatra/sinatra#492 Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment