
A regexp to match named groups in a URL

sinatra_regexp_suggestion.rb
# I've looked at the case at hand which is:
 
# foo.bar -> foo: foo.bar
# foo.bar -> foo: foo, format: bar
 
# This roughly translates to:
# For each part (foo, format), match as many subexpressions as possible, each consisting of one or more word characters or a single non-word character
# (we might say \. explicitly, in this specific case). Do this lazily, except for the last part, since that one needs to gobble up the rest.
# And: If we have two or more parts, join them with an optional dot (\.?), which is not included in the named groups.
 
# Examples
 
# 1 part:
#
p "foo".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo" foo:"foo">
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo.bar">
 
# 2 parts:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar" foo:"foo" bar:"bar">
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar.bur">
 
# 3 parts:
#
p "foo.bar.bur".match(/(?<foo>(\w+|\W?)+?)\.?(?<bar>(\w+|\W?)+?)\.?(?<bur>(\w+|\W?)+)/) # => #<MatchData "foo.bar.bur" foo:"foo" bar:"bar" bur:"bur">
 
# (Note that the last expression is greedy so that it gobbles up the rest, while the others are lazy. In the last example that is (?<bur>(\w+|\W?)+). If you don't want this, make it lazy as well :)
 
# So generally, what this means: If you have "pots", like foo, format etc. or let's say a, b, c, d, e, f… these regexps will distribute anything looking like:
# 1.2.3.4.5.6.7
# in these pots. If there's not enough to go around, e.g. with 1,2, then it will only be distributed to the first two pots.
 
# If there's too much to go around, it depends whether the last expression is greedy or not:
#
p "foo.bar".match(/(?<foo>(\w+|\W?)+?)/) # => foo matched
p "foo.bar".match(/(?<foo>(\w+|\W?)+)/) # => foo.bar gobbled up

I hope this is understandable. We could be more specific than \W if it's only \. that needs to be matched, of course.
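The rule in the gist can also be generated instead of written by hand. A minimal sketch, assuming a hypothetical helper `pots_regexp` (not part of the gist or of Sinatra) that takes the pot names, makes every group but the last one lazy, and joins the groups with an optional dot:

```ruby
# Hypothetical helper: builds the "pots" regexp from the gist for any
# number of names. Every group except the last is lazy; the last one is
# greedy so that it gobbles up the rest.
def pots_regexp(names)
  part = '(\w+|\W?)+'
  groups = names.each_with_index.map do |name, index|
    lazy = index < names.size - 1 ? '?' : ''
    "(?<#{name}>#{part}#{lazy})"
  end
  # The optional dot sits between the groups, outside of any named group.
  Regexp.new(groups.join('\.?'))
end

match = pots_regexp(%w[foo bar]).match("foo.bar.bur")
p match[:foo] # => "foo"
p match[:bar] # => "bar.bur"
```

For `%w[foo bar]` this produces the same source as the hand-written 2-part example, so the helper only saves typing; it does not change the matching behaviour.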

No, \W is fine.

OK, so that actually causes issues: splats are non-greedy at the moment, and making (parts of) the normal named patterns non-greedy makes splats eat up everything.

Ah, wrong explanation, but it still doesn't work.

Failing example?

OK, so how do I apply this? The current regexp is /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/.

regexp = /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/
def assert(cond) cond or fail("didn't match") end

assert "/foo" =~ regexp
assert $1 == "foo"

assert "/foo.bar" =~ regexp
assert $1 == "foo"
assert $2 == "bar"

assert "/foo." =~ regexp
assert $1 == "foo"

assert "/.bar" !~ regexp
assert "/foo/bar" !~ regexp

In Sinatra:

diff --git a/test/routing_test.rb b/test/routing_test.rb
index 80b0a00..09983f0 100644
--- a/test/routing_test.rb
+++ b/test/routing_test.rb
@@ -950,6 +950,13 @@ class RoutingTest < Test::Unit::TestCase
     assert_equal 'looks good', body
   end

+  it "matches formats even if the format is optional" do
+    mock_app { p get('/:name.?:format?') { params[:format] } }
+    get '/foo.bar'
+    assert ok?
+    assert_equal "bar", body
+  end
+
   it 'raises an ArgumentError with block arity > 1 and too many values' do
     mock_app do
       get '/:foo/:bar/:baz' do |foo, bar|

I'm trying to understand what the problem is. (I can assume things, but I'd rather ask)

So I am running a few examples:

r = /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/

p '/:foo'.match(r)            # => #<MatchData "/:foo" 1:":foo" 2:nil>
p '/:foo.:format'.match(r)    # => #<MatchData "/:foo.:format" 1:":foo.:format" 2:nil>
p '/:foo/:bar/:baz'.match(r)  # => nil
p '/:name.?:format?'.match(r) # => nil

Is the expectation that all of these are matched in groups? What should be the results in these cases?

Or better… what do you need extracted, actually?

Your second example should behave differently (i.e. result in #<MatchData "/:foo.:format" 1:":foo" 2:":format">).

Ok.

Can we maybe get a huge ass list of various examples going? :) This would help tremendously in the search for a good regexp! What should happen if "/.bar" comes along? Fail?

Yes, :foo should match at least one character. I'll do a list.

Working
=======

 Pattern             | Current Regexp                                           | Example                          | Should Be                            | Is Currently                        
---------------------|----------------------------------------------------------|----------------------------------|--------------------------------------|--------------------------------------
 "/"                 | /^\/$/                                                   | "/"                              | []                                   | []                                  
 "/foo"              | /^\/foo$/                                                | "/foo"                           | []                                   | []                                  
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo"                           | ["foo"]                              | ["foo"]                             
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo?"                          | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo/bar"                       | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo%2Fbar"                     | ["foo%2Fbar"]                        | ["foo%2Fbar"]                       
 "/:foo/:bar"        | /^\/([^\/?#]+)\/([^\/?#]+)$/                             | "/foo/bar"                       | ["foo", "bar"]                       | ["foo", "bar"]                      
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/"                              | nil                                  | nil                                 
 "/:foo"             | /^\/([^\/?#]+)$/                                         | "/foo/"                          | nil                                  | nil                                 
 "/f\u00F6\u00F6"    | /^\/f%C3%B6%C3%B6$/                                      | "/f%C3%B6%C3%B6"                 | []                                   | []                                  
 "/hello/:person"    | /^\/hello\/([^\/?#]+)$/                                  | "/hello/Frank"                   | ["Frank"]                            | ["Frank"]                           
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/hello/world"                   | ["hello", "world"]                   | ["hello", "world"]                  
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/hello"                         | ["hello", nil]                       | ["hello", nil]                      
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | "/"                              | [nil, nil]                           | [nil, nil]                          
 "/?:foo?/?:bar?"    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/                         | ""                               | [nil, nil]                           | [nil, nil]                          
 "/*"                | /^\/(.*?)$/                                              | "/"                              | [""]                                 | [""]                                
 "/*"                | /^\/(.*?)$/                                              | "/foo"                           | ["foo"]                              | ["foo"]                             
 "/*"                | /^\/(.*?)$/                                              | "/"                              | [""]                                 | [""]                                
 "/*"                | /^\/(.*?)$/                                              | "/foo/bar"                       | ["foo/bar"]                          | ["foo/bar"]                         
 "/:foo/*"           | /^\/([^\/?#]+)\/(.*?)$/                                  | "/foo/bar/baz"                   | ["foo", "bar/baz"]                   | ["foo", "bar/baz"]                  
 "/:foo/:bar"        | /^\/([^\/?#]+)\/([^\/?#]+)$/                             | "/user@example.com/name"         | ["user@example.com", "name"]         | ["user@example.com", "name"]        
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/pony.jpg"                      | ["pony", "jpg"]                      | ["pony", "jpg"]                     
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/pony%2Ejpg"                    | ["pony", "jpg"]                      | ["pony", "jpg"]                     
 "/:file.:ext"       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/                     | "/.jpg"                          | nil                                  | nil                                 
 "/test.bar"         | /^\/test(?:\.|%2E)bar$/                                  | "/test.bar"                      | []                                   | []                                  
 "/test.bar"         | /^\/test(?:\.|%2E)bar$/                                  | "/test0bar"                      | nil                                  | nil                                 
 "/test$/"           | /^\/test(?:\$|%24)\/$/                                   | "/test$/"                        | []                                   | []                                  
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/te+st/"                        | []                                   | []                                  
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/test/"                         | nil                                  | nil                                 
 "/te+st/"           | /^\/te(?:\+|%2B)st\/$/                                   | "/teeest/"                       | nil                                  | nil                                 
 "/test(bar)/"       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/                      | "/test(bar)/"                    | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path%20with%20spaces"          | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path%2Bwith%2Bspaces"          | []                                   | []                                  
 "/path with spaces" | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/ | "/path+with+spaces"              | []                                   | []                                  
 "/foo&bar"          | /^\/foo(?:&|%26)bar$/                                    | "/foo&bar"                       | []                                   | []                                  
 "/:foo/*"           | /^\/([^\/?#]+)\/(.*?)$/                                  | "/hello%20world/how%20are%20you" | ["hello%20world", "how%20are%20you"] | ["hello%20world", "how%20are%20you"]
 "/*/foo/*/*"        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/                           | "/bar/foo/bling/baz/boom"        | ["bar", "bling", "baz/boom"]         | ["bar", "bling", "baz/boom"]        
 "/*/foo/*/*"        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/                           | "/bar/foo/baz"                   | nil                                  | nil                                 
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo"                           | ["foo", nil]                         | ["foo", nil]                        
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/.bar"                          | [".bar", nil]                        | [".bar", nil]                       

Broken
======

 Pattern             | Current Regexp                                           | Example                          | Should Be                            | Is Currently                        
---------------------|----------------------------------------------------------|----------------------------------|--------------------------------------|--------------------------------------
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo.bar"                       | ["foo", "bar"]                       | ["foo.bar", nil]                    
 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/                   | "/foo%2Ebar"                     | ["foo", "bar"]                       | ["foo%2Ebar", nil]                  
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo@bar"                       | ["foo", "bar"]                       | ["foo@bar", nil]                    
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo.foo@bar"                   | ["foo.foo", "bar"]                   | ["foo.foo@bar", nil]                
 "/:user@?:host?"    | /^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/                    | "/foo@bar.bar"                   | ["foo", "bar.bar"]                   | ["foo@bar.bar", nil]  

Great! Does the result need to be an Array? Is #match used? Why not #split?

I'm asking this because:

r = /[\/\.%2E]+/
p "/foo.bar"    .split(r) # => ["", "foo", "bar"]
p "/foo%2Ebar"  .split(r) # => ["", "foo", "bar"]

r = /[\/@]+/
p "/foo@bar"    .split(r) # => ["", "foo", "bar"]
p "/foo.foo@bar".split(r) # => ["", "foo.foo", "bar"]
p "/foo@bar.bar".split(r) # => ["", "foo", "bar.bar"]

Which is very close to the expected behavior, while being relatively simple (to generate as well). Cheers!

match is used. So you mean we should detect that scenario and "fix" it after matching?

I wonder about the usage of match since using split seems a much better fit (in terms of generating the regexps and mental mapping), still assuming that an array of parts is needed as a result.
No, I am suggesting to use split if that is a possibility.

I don't get yet what the program flow would be like with split.
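For what it's worth, here is a rough sketch of what a split-based flow might look like. Everything in it is an assumption for illustration: the helper name `params_from_split`, the `DOT_SEPARATOR` constant, and the idea that the parameter names are already known from the pattern. The separator is written as an alternation rather than a character class, because a class such as `[\/\.%2E]` matches the single characters `%`, `2` and `E` and would therefore also split "/Ellen".

```ruby
# Hypothetical separator: a slash, a dot, or an encoded dot.
DOT_SEPARATOR = /(?:\/|\.|%2E)+/

# Hypothetical flow: split the path, drop the empty leading part, and
# pair the remaining parts with the parameter names from the pattern.
def params_from_split(path, names, separator = DOT_SEPARATOR)
  parts = path.split(separator).reject(&:empty?)
  names.zip(parts).to_h
end

p params_from_split("/foo.bar", %w[name format])   # => {"name"=>"foo", "format"=>"bar"}
p params_from_split("/foo%2Ebar", %w[name format]) # => {"name"=>"foo", "format"=>"bar"}
p params_from_split("/foo", %w[name format])       # => {"name"=>"foo", "format"=>nil}
```

One limitation of this sketch: split on its own cannot tell whether a path matches the route at all, so a separate check would still be needed before the parts are used.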

Can you point me to the match in the Sinatra code, please? :) (Then I am able to help much better)

Note: From your table I assumed you actually need an array of the parts as output, which is why I offered the idea of using split.

Thanks! I see.

Ok then, match it is:

r = /^\/([^\/?#]+)(?:\.|%2E)+([^\/?#]+)?$/
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

r = /^\/([^\/?#]+)(?:@|%40)+([^\/?#]+)?$/
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

(Both differ from the original in 1 position: ? -> +)

Cheers!

That doesn't match "/foo".

I'm beginning to understand. Also: How about being more constructive? :)

Haha, I'm sorry, I'm super thankful for your investigation so far. I just don't know if this is even possible.

Yeah, my problem was that I didn't understand what Sinatra was actually doing – but now I do. Source code reading helps :)

I believe I got one. On to the next.

r = /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/
p "/foo".match(r)
p "/foo.bar".match(r)
p "/foo%2Ebar".match(r)

The second one is probably

r = /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
p "/foo".match(r)
p "/foo@bar".match(r)
p "/foo.foo@bar".match(r)
p "/foo@bar.bar".match(r)

Can you check, @rkh?
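One thing that might be worth checking in the two regexps above: inside a character class, `%2E` is read as the three single characters `%`, `2` and `E`, so `[^\.%2E\/?#]` also rejects names that contain one of those characters. The sketch below compares that version with a hypothetical alternative that uses a negative lookahead for the encoded dot. The constant names and the `captures` helper are made up for this example, and the alternative lets the second group contain dots, which is an assumption rather than something decided in this thread.

```ruby
# Character-class version from the comment above.
CLASS_BASED = /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/

# Hypothetical alternative: the first group stops at a literal dot and
# at the three-character sequence %2E, but allows %, 2 and E on their own.
LOOKAHEAD_BASED = /^\/((?:(?!%2E)[^.\/?#])+)(?:\.|%2E)?([^\/?#]+)?$/

def captures(regexp, path)
  match = regexp.match(path)
  match && match.captures
end

p captures(CLASS_BASED, "/Ellen")          # => nil
p captures(LOOKAHEAD_BASED, "/Ellen")      # => ["Ellen", nil]
p captures(LOOKAHEAD_BASED, "/foo.bar")    # => ["foo", "bar"]
p captures(LOOKAHEAD_BASED, "/foo%2Ebar")  # => ["foo", "bar"]
p captures(LOOKAHEAD_BASED, "/.bar")       # => nil
```

The same idea would apply to the `@` / `%40` version. Lowercase escapes such as `%2e` are not handled here.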

OK, that just means we'll have to do special parsing (i.e. recognize that it's OK for :format? not to match the .). We might then need to parse the pattern properly instead of just doing some replacements + Regexp.new.

Unsure about that. The patterns I posted actually mimic a pattern that I've seen above, and which could be summarized as:
Match all in group except the separator character(s) up to the separator character(s), then continue matching all in group except the separator character(s).
E.g. /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/

Are these the routing tests? https://github.com/sinatra/sinatra/blob/v1.3.2/test/routing_test.rb Maybe we can add the nice table above to it? :) (If it's not in there already and I've overlooked it)

Yes, except that '.' is not a separator character. The issue is, at the moment we do a simple gsub for :format, so we don't know that :format is actually part of :name.?:format? and I got the feeling that in order to solve this properly we'd need a proper parser, as this is not describable with a regexp.
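To make the gsub problem concrete, here is a simplified sketch of a replacement-based compiler. It is not Sinatra's actual Base#compile; the name `compile_pattern` is hypothetical, and it only handles ASCII patterns without spaces. It shows that each `:name` is replaced on its own, without any knowledge of the characters around it:

```ruby
# Hypothetical, simplified pattern compiler (ASCII only, no spaces).
def compile_pattern(pattern)
  keys = []
  # Step 1: let every special character also match its percent-encoded form.
  source = pattern.gsub(/[^?%\\\/:*\w]/) do |char|
    "(?:#{Regexp.escape(char)}|#{format('%%%02X', char.ord)})"
  end
  # Step 2: replace each :name and each * independently with a capture group.
  source = source.gsub(/:(\w+)|\*/) do
    if $1
      keys << $1
      '([^/?#]+)'
    else
      keys << 'splat'
      '(.*?)'
    end
  end
  [Regexp.new("^#{source}$"), keys]
end

regexp, keys = compile_pattern("/:name.?:format?")
p keys                              # => ["name", "format"]
p regexp.match("/foo.bar").captures # => ["foo.bar", nil]
```

Because the group generated for `:name` is always `([^/?#]+)`, it happily consumes the dot and the format, which is the behaviour listed in the "Broken" table.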

All but the failing ones are actually in there already (that's where I got them from).

Good point with the "." not being a separator character. A parser feels like the way to go here, but I had a feeling it might be doable using regexps. I'll have a look at Base#compile.

Thanks, I believe the tests would benefit a lot from being in a tabular form, but that's just me :)
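A table-driven check could look roughly like the sketch below. The rows are copied from the tables above; the names `ROWS`, `captures_for` and `broken_rows` are made up for this example and are not part of Sinatra's test suite.

```ruby
# Each row: [regexp, example path, expected captures (nil = no match)].
ROWS = [
  [/^\/([^\/?#]+)$/,                       "/foo",         ["foo"]],
  [/^\/([^\/?#]+)$/,                       "/foo/bar",     nil],
  [/^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/,   "/pony.jpg",    ["pony", "jpg"]],
  [/^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/, "/foo.bar",     ["foo", "bar"]],
  [/^\/([^\/?#]+)(?:@|%40)?([^\/?#]+)?$/,  "/foo@bar.bar", ["foo", "bar.bar"]],
]

# The "Is Currently" column: the captures, or nil if there is no match.
def captures_for(regexp, path)
  match = regexp.match(path)
  match && match.captures
end

# Rows where "Should Be" and "Is Currently" differ.
def broken_rows(rows)
  rows.reject { |regexp, path, expected| captures_for(regexp, path) == expected }
end

broken_rows(ROWS).each do |regexp, path, expected|
  puts "#{path}: should be #{expected.inspect}, is #{captures_for(regexp, path).inspect}"
end
```

With these five rows, only the "/foo.bar" and "/foo@bar.bar" examples are reported, in line with the "Broken" table.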

Yeah, I meant a parser for the pattern -> regexp step, not the request path -> route parsing; I would still use a regexp there. The real question would be how to generate it. I would also like :name(.:format)? to be possible.
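As an illustration of what a pattern like `/:name(.:format)?` might compile to, here is one possible target regexp. This is only a sketch under assumptions: the constant name is hypothetical, and nothing here says how Sinatra would or should generate it. The name group is lazy, and the dot and the format sit together in one optional non-capturing group:

```ruby
# Hypothetical target regexp for "/:name(.:format)?".
NAME_WITH_OPTIONAL_FORMAT = /^\/([^\/?#]+?)(?:(?:\.|%2E)([^\/?#]+))?$/

%w[/foo /foo.bar /foo%2Ebar /foo.bar.baz /.bar /foo/bar].each do |path|
  match = NAME_WITH_OPTIONAL_FORMAT.match(path)
  puts "#{path} -> #{(match && match.captures).inspect}"
end
```

Grouping the separator with the format means the lazy name group cannot hand its tail to the format group without a dot in between, which is what goes wrong when the two are optional independently.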

Yes, I got that! :)

Ok, rewriting the challenge as: Find a way to elegantly map the given patterns into their corresponding regexps, such that they work for all given examples.

Full Pattern - Regexp mapping:

/                 | /^\/$/
/foo              | /^\/foo$/
/f\u00F6\u00F6    | /^\/f%C3%B6%C3%B6$/
/:foo             | /^\/([^\/?#]+)$/
/:foo/:bar        | /^\/([^\/?#]+)\/([^\/?#]+)$/
/hello/:person    | /^\/hello\/([^\/?#]+)$/
/?:foo?/?:bar?    | /^\/?([^\/?#]+)?\/?([^\/?#]+)?$/
/*                | /^\/(.*?)$/
/:foo/*           | /^\/([^\/?#]+)\/(.*?)$/
/test.bar         | /^\/test(?:\.|%2E)bar$/
/test$/           | /^\/test(?:\$|%24)\/$/
/te+st/           | /^\/te(?:\+|%2B)st\/$/
/test(bar)/       | /^\/test(?:\(|%28)bar(?:\)|%29)\/$/
/path with spaces | /^\/path(?:%20|(?:\+|%2B))with(?:%20|(?:\+|%2B))spaces$/
/foo&bar          | /^\/foo(?:&|%26)bar$/
/*/foo/*/*        | /^\/(.*?)\/foo\/(.*?)\/(.*?)$/
/:file.:ext       | /^\/([^\/?#]+)(?:\.|%2E)([^\/?#]+)$/
/:name.?:format?  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/
/:user@?:host?    | /^\/([^@%40\/?#]+)(?:@|%40)?([^@%40\/?#]+)?$/
/:name.?:format?  | /^\/([^\.%2E\/?#]+)(?:\.|%2E)?([^\.%2E\/?#]+)?$/

I wouldn't mind working on this, if you don't mind :)

Ok, almost got it. Cleaning up the code and preparing for a pull request :)

Note: I'm assuming this example should actually be nil. At least it looks like that to me. If I am wrong, please tell me.

 "/:name.?:format?"  | /^\/([^\/?#]+)(?:\.|%2E)?([^\/?#]+)?$/ | "/.bar" | [".bar", nil]
