Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
A regexp to match named groups in a URL
# I've looked at the case at hand which is:
# foo.bar -> foo: foo.bar
# foo.bar -> foo: foo, format: bar
# This roughly translates to:
# For each part (foo, format), match as many possible subexpressions consisting of multiple word characters or one non-word characters
# (we might say \. explicitly, in this specific case). Do this lazily, except for the last part, since that one needs to gobble up the rest.
# And: If we have two or more parts, join them by a lazy match of a dot (\.?), which is not included in the named group.
# Examples
# 1 part:
#
p "foo".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo" foo:"foo">
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo.bar">
# 2 parts:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo" bar:"bar">
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar.bur">
# 3 parts:
#
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+?)\.?(?<bur>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar" bur:"bur">
# (Note that the last expression is greedy – in the last example: (?<bur>(\w+|\W?)+) <- greedy – while the others are not, to gobble up the rest :) If you don't want this, don't make it greedy)
# So generally, what this means: If you have "pots", like foo, format etc. or let's say a, b, c, d, e, f… these regexps will distribute anything looking like:
# 1.2.3.4.5.6.7
# in these pots. If there's not enough to go around, e.g. with 1,2, then it will only be distributed to the first two pots.
# If there's too much to go around, it depends whether the last expression is greedy or not:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)/) # => foo matched
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => foo.bar gobbled up
Owner
floere commented Mar 22, 2012

I hope this is understandable. We could be more specific than \W if it's only \. that need to be matched, of course.

rkh commented Mar 22, 2012

No, \W is fine.

rkh commented Mar 22, 2012

Ok, so that causes issues, actually, as splats are non-greedy atm and making (parts of) the normal named patterns non-greedy makes splats eat up everything

rkh commented Mar 22, 2012

ah, wrong explenation, but it still doesn't work

Owner
floere commented Mar 22, 2012

Failing example?

rkh commented Mar 22, 2012

OK, so how do I apply this? The current regexp is /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/.

rkh commented Mar 22, 2012
regexp = /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/
def assert(cond) cond or fail("didn't match") end

assert "/foo" =~ regexp
assert $1 == "foo"

assert "/foo.bar" =~ regexp
assert $1 == "foo"
assert $2 == "bar"

assert "/foo." =~ regexp
assert $1 == "foo"

assert "/.bar" !~ regexp
assert "/foo/bar" !~ regexp
rkh commented Mar 22, 2012

In Sinatra:

diff --git a/test/routing_test.rb b/test/routing_test.rb
index 80b0a00..09983f0 100644
--- a/test/routing_test.rb
+++ b/test/routing_test.rb
@@ -950,6 +950,13 @@ class RoutingTest < Test::Unit::TestCase
     assert_equal 'looks good', body
   end

+  it "matches formats even if the format is potional" do
+    mock_app { p get('/:name.?:format?') { params[:format] } }
+    get '/foo.bar'
+    assert ok?
+    assert_equal "bar", body
+  end
+
   it 'raises an ArgumentError with block arity > 1 and too many values' do
     mock_app do
       get '/:foo/:bar/:baz' do |foo, bar|
Owner
floere commented Mar 22, 2012

I'm trying to understand what the problem is. (I can assume things, but I'd rather ask)

So I am running a few examples:

r = /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/

p '/:foo'.match(r)            # => #<MatchData "/:foo" 1:":foo" 2:nil>
p '/:foo.:format'.match(r)    # => #<MatchData "/:foo.:format" 1:":foo.:format" 2:nil>
p '/:foo/:bar/:baz'.match(r)  # => nil
p '/:name.?:format?'.match(r) # => nil

Is the expectation that all of these are matched in groups? What should be the results in these cases?

Owner
floere commented Mar 22, 2012

Or better… what do you need extracted, actually?

rkh commented Mar 22, 2012

Your second example should behave differently (i.e. result in #<MatchData "/:foo.:format" 1:":foo" 2:":format">).

Owner
floere commented Mar 22, 2012

Ok.

Can we maybe get a huge ass list of various examples going? :) This would help tremendously in the search for a good regexp! What should happen if "/.bar" comes along? Fail?

rkh commented Mar 22, 2012

Yes, :foo should match at least on character. I'll do a list.

Owner
floere commented Mar 22, 2012

Thanks!

rkh commented Mar 22, 2012
Working
=======

 Pattern             | Current Regexp                                           | Example                          | Should Be                            | Is Currently                        
---------------------|----------------------------------------------------------|----------------------------------|--------------------------------------|--------------------------------------
 "/"                 | /^\/$/                                                   | "/"                              | []                                   | []                                  
 "/foo"              | /^\/foo$/                                                | "/foo"                           | []                                   | []                                  
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo"                           | ["foo"]                              | ["foo"]                             
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo?"                          | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo/bar"                       | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo%2Fbar"                     | ["foo%2Fbar"]                        | ["foo%2Fbar"]                       
 "/:foo/:bar"        | /^\/([^\/?#]+)\/([^\/?#]+)$/                             | "/foo/bar"                       | ["foo", "bar"]                       | ["foo", "bar"]                      
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/"                              | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo/"                          | nil                                  | nil                                 
 "/f\u00F6\u00F6"    | /^\/f%C3%B6%C3%B6$/                                      | "/f%C3%B6%C3%B6"                 | []                                   | []                                  
 "/hello/:person"    | /^\/hello\/([^\/?#]+)$/                                  | "/hello/Frank"                   | ["Frank"]                            | ["Frank"]                           
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/hello/world"                   | ["hello", "world"]                   | ["hello", "world"]                  
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/hello"                         | ["hello", nil]                       | ["hello", nil]                      
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/"                              | [nil, nil]                           | [nil, nil]                          
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | ""                               | [nil, nil]                           | [nil, nil]                          
 "/*"                | /^\/(.*?)$/                                              | "/"                              | [""]                                 | [""]                                
 "/*"                | /^\/(.*?)$/                                              | "/foo"                           | ["foo"]                              | ["foo"]                             
 "/*"                | /^\/(.*?)$/                                              | "/"                              | [""]                                 | [""]                                
 "/*"                | /^\/(.*?)$/                                              | "/foo/bar"                       | ["foo/bar"]                          | ["foo/bar"]                         
 "/:foo/*"           | /^\/([^\/?#]+)\/(.*?)$/                                  | "/foo/bar/baz"                   | ["foo", "bar/baz"]                   | ["foo", "bar/baz"]                  
 "/:foo/:bar"        | /^\/([^\/?#]+)\/([^\/?#]+)$/                             | "/user@example.com/name"         | ["user@example.com", "name"]         | ["user@example.com", "name"]        
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/pony.jpg"                      | ["pony", "jpg"]                      | ["pony", "jpg"]                     
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/pony%2Ejpg"                    | ["pony", "jpg"]                      | ["pony", "jpg"]                     
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/.jpg"                          | nil                                  | nil                                 
 "/test.bar"         | /^\/test(?:\.|%2E)bar$/                                  | "/test.bar"                      | []                                   | []                                  
 "/test.bar"         | /^\/test(?:\.|%2E)bar$/                                  | "/test0bar"                      | nil                                  | nil                                 
 "/test$/"           | /^\/test(?:\$|%24)\/$/                                   | "/test$/"                        | []                                   | []                                  
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/te+st/"                        | []                                   | []                                  
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/test/"                         | nil                                  | nil                                 
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/teeest/"                       | nil                                  | nil                                 
 "/test(bar)/"       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/                      | "/test(bar)/"                    | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path%20with%20spaces"          | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path%2Bwith%2Bspaces"          | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path+with+spaces"              | []                                   | []                                  
 "/foo&bar"          | /^\/foo(?:&|%26)bar$/                                    | "/foo&bar"                       | []                                   | []                                  
 "/:foo/*"           | /^\/([^\/?#]+)\/(.*?)$/                                  | "/hello%20world/how%20are%20you" | ["hello%20world", "how%20are%20you"] | ["hello%20world", "how%20are%20you"]
 "/*/foo/*/*"        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/                           | "/bar/foo/bling/baz/boom"        | ["bar", "bling", "baz/boom"]         | ["bar", "bling", "baz/boom"]        
 "/*/foo/*/*"        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/                           | "/bar/foo/baz"                   | nil                                  | nil                                 
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo"                           | ["foo", nil]                         | ["foo", nil]                        
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/.bar"                          | [".bar", nil]                        | [".bar", nil]                       

Broken
======

 Pattern             | Current Regexp                                           | Example                          | Should Be                            | Is Currently                        
---------------------|----------------------------------------------------------|----------------------------------|--------------------------------------|--------------------------------------
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo.bar"                       | ["foo", "bar"]                       | ["foo.bar", nil]                    
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo%2Ebar"                     | ["foo", "bar"]                       | ["foo%2Ebar", nil]                  
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo@bar"                       | ["foo", "bar"]                       | ["foo@bar", nil]                    
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo.foo@bar"                   | ["foo.foo", "bar"]                   | ["foo.foo@bar", nil]                
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo@bar.bar"                   | ["foo", "bar.bar"]                   | ["foo@bar.bar", nil]  
Owner
floere commented Mar 22, 2012

Great! Does the result need to be an Array? Is #match used? Why not #split?

Owner
floere commented Mar 22, 2012

I'm asking this because:

r = /[\/\.%2E]+/
p "/foo.bar"    .split(r) # => ["", "foo", "bar"]
p "/foo%2Ebar"  .split(r) # => ["", "foo", "bar"]

r = /[\/@]+/
p "/foo@bar"    .split(r) # => ["", "foo", "bar"]
p "/foo.foo@bar".split(r) # => ["", "foo.foo", "bar"]
p "/foo@bar.bar".split(r) # => ["", "foo", "bar.bar"]

Which is very close to the expected behavior, while being relatively simple (to generate as well). Cheers!

rkh commented Mar 22, 2012

match is used. So you mean we should detect that scenario and "fix" it after matching?

Owner
floere commented Mar 22, 2012

I wonder about the usage of match since using split seems a much better fit (in terms of generating the regexps and mental mapping), still assuming that an array of parts is needed as a result.
No, I am suggesting to use split if that is a possibility.

rkh commented Mar 22, 2012

I don't get yet what the program flow would be like with split.

Owner
floere commented Mar 22, 2012

Can you point me to the match in the Sinatra code, please? :) (Then I am able to help much better)

Note: From your table I assumed you actually need an array of the parts as output, which is why I offered the idea of using split.

Owner
floere commented Mar 22, 2012

Thanks! I see.

Ok then, match it is:

r = /^\/([^\/?#]+)(?:\.|%2E)+([^\/?#]+)?$/
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

r = /^\/([^\/?#]+)(?:@|%40)+([^\/?#]+)?$/
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

(Both differ from the original in 1 position: ? -> +)

Cheers!

rkh commented Mar 22, 2012

That doesn't match "/foo".

Owner
floere commented Mar 22, 2012

I'm beginning to understand. Also: How about being more constructive? :)

rkh commented Mar 22, 2012

Haha, I'm sorry, I super thankful for your investigation so far. I just don't know if this is even possible.

Owner
floere commented Mar 22, 2012

Yeah, my problem was that I didn't understand what Sinatra was actually doing – but now I do. Source code reading helps :)

Owner
floere commented Mar 22, 2012

I believe I got one. On to the next.

r = /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/
p "/foo".match(r)
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)
Owner
floere commented Mar 22, 2012

The second one is probably

r = /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
p "/foo".match(r)
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

Can you check, @rkh?

rkh commented Mar 22, 2012

OK, that just means we'll have to do special parsing (i.e. recognize that it's of to have :format? not match the .). Might need to do proper parsing of the pattern then instead of just some replacements + Regexp.new.

Owner
floere commented Mar 22, 2012

Unsure about that. The patterns I posted actually mimic a pattern that I've seen above, and which could be summarized as:
Match all in group except the separator character(s) up to the separator character(s), then continue matching all in group except the separator character(s).
E.g. /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/

Owner
floere commented Mar 22, 2012

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

rkh commented Mar 22, 2012

Yes, except that '.' is not a separator character. The issue is, at the moment we do a simple gsub for :format, so we don't know that :format is actually part of :name.?:format? and I got the feeling that in order to solve this properly we'd need a proper parser, as this is not describable with a regexp.

rkh commented Mar 22, 2012

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

All but the failing ones are actually in there already (that's were I got em from).

Owner
floere commented Mar 22, 2012

Good point with the "." not being a separator character. A parser feels like the way to go here, but I had a feeling it might be doable using regexps. I'll have a look at Base#compile.

Thanks, I believe the tests would benefit a lot from being in a tabular form, but that's just me :)

rkh commented Mar 22, 2012

Yeah, I meant a parser for the pattern -> regexp step, not the request path -> route parsing, I would still use a regexp there. the real question would be how to generate it. I would also like :name(.:format)? to be possible.

Owner
floere commented Mar 22, 2012

Yes, I got that! :)

Owner
floere commented Mar 22, 2012

Ok, rewriting the challenge as: Find a way to elegantly map the given patterns into their corresponding regexps, such that they work for all given examples.

Full Pattern - Regexp mapping:

/                 | /^\/$/
/foo              | /^\/foo$/
/f\u00F6\u00F6    | /^\/f%C3%B6%C3%B6$/
/:foo             | /^\/([^\/?#]+)$/
/:foo/:bar        | /^\/([^\/?#]+)\/([^\/?#]+)$/
/hello/:person    | /^\/hello\/([^\/?#]+)$/
/?:foo?/?:bar?    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/
/*                | /^\/(.*?)$/
/:foo/*           | /^\/([^\/?#]+)\/(.*?)$/
/test.bar         | /^\/test(?:\.|%2E)bar$/
/test$/           | /^\/test(?:\$|%24)\/$/
/te+st/           | /^\/te(?:\+|%2B)st\/$/
/test(bar)/       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/
/path with spaces | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/
/foo&bar          | /^\/foo(?:&|%26)bar$/
/*/foo/*/*        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/
/:file.:ext       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/
/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/
/:user@?:host?    | /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
/:name.?:format?  | /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/
Owner
floere commented Mar 22, 2012

I wouldn't mind working on this, if you don't mind :)

Owner
floere commented Mar 23, 2012

Ok, almost got it. Cleaning up the code and preparing for a pull request :)

Owner
floere commented Mar 23, 2012

Note: I'm assuming this example should actually be nil. At least it looks like that to me. If I am wrong, please tell me.

"/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/ | /.bar | [.bar, nil]"
Owner
floere commented Mar 23, 2012

Let's move this to: sinatra/sinatra#492 Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment