@lengarvey
Last active August 29, 2015 14:03
Demonstrating a naive first-cut implementation of reliable URI parsing under RFC 3986 for Ruby 2.2.0dev

The problem

On Ruby 2.2.0dev, URI.parse has been changed to follow RFC 3986. This changes the semantics of URIs in some subtle ways. Most importantly, square brackets ("[" and "]") and a few other characters must now be percent-encoded, primarily in the query string.

Unfortunately, the implementation on ruby-trunk doesn't provide any matching encoding functionality. I raised a bug explaining this (https://bugs.ruby-lang.org/issues/9990), but here's a quick script to demonstrate the issue:

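require 'uri'

url = "https://bugs.ruby-lang.org/projects/ruby-trunk/issues?set_filter=1&f[]=status_id&op[status_id]=o"
# URI.encode leaves the square brackets untouched ...
puts URI.encode(url)
# ... so on 2.2.0dev the RFC3986 parser rejects the result with URI::InvalidURIError
URI.parse(URI.encode(url))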

See https://gist.github.com/lengarvey/c1d17913f9ea95fd999c for the output of this code.

Currently, URI.escape (aliased as URI.encode) still delegates to DEFAULT_PARSER, which is no longer the default for most operations. That parser doesn't encode square brackets, and URI.parse won't accept the resulting URIs because they aren't escaped properly.
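For individual query keys and values (not whole URIs), the standard library already has encoders that do handle square brackets: URI.encode_www_form_component and URI.encode_www_form. The following is a minimal sketch, assuming form-style encoding is acceptable for the query. It is not a drop-in replacement for URI.escape, since it works on components rather than full URIs and encodes a space as "+" rather than "%20".

```ruby
require 'uri'

# Encode each key and value separately, then join them into a query string.
params = [['set_filter', '1'], ['f[]', 'status_id'], ['op[status_id]', 'o']]

component = URI.encode_www_form_component('op[status_id]')
query     = URI.encode_www_form(params)

puts component # op%5Bstatus_id%5D
puts query     # set_filter=1&f%5B%5D=status_id&op%5Bstatus_id%5D=o

# With the brackets encoded, the full URI contains only characters the parser accepts.
uri = URI.parse("https://bugs.ruby-lang.org/projects/ruby-trunk/issues?#{query}")
puts uri.query
```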

Solution?

The fixed parser in this gist is a naive, first-cut implementation of properly splitting, escaping and parsing URIs. It first attempts to use the existing RFC3986 parsing implementation. If that fails, it performs a non-validating split of the URI, percent-encodes the query string and tries once more. You can see a demonstration of this in irb_output.

What else?

I'm really bad at reading RFCs. They seem to be designed to be incomprehensible, so I'm certain I've missed some characters that should be percent-encoded. This code is just my first cut at fixing this issue in a way I think would work for most people. I'm not sure whether URI.escape could be pointed at #naive_escape, mostly because I don't understand the RFC in enough depth to tell and haven't written enough tests to be confident that it would work.
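As a rough cross-check on which characters need encoding, the query grammar in RFC 3986 (section 3.4 and Appendix A) can be written out directly. The sketch below lists the printable ASCII characters that the grammar does not allow unescaped in a query. The constant names are invented for this example, and "%" is deliberately set aside because it is only valid as the start of a percent-encoded triplet.

```ruby
# query = *( pchar / "/" / "?" )
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
UNRESERVED    = [*'A'..'Z', *'a'..'z', *'0'..'9', '-', '.', '_', '~']
SUB_DELIMS    = %w[! $ & ' ( ) * + , ; =]
QUERY_ALLOWED = UNRESERVED + SUB_DELIMS + [':', '@', '/', '?']

# Printable ASCII, minus "%", which needs separate handling.
printable   = (0x20..0x7E).map(&:chr) - ['%']
must_encode = printable - QUERY_ALLOWED

p must_encode
# [" ", "\"", "#", "<", ">", "[", "\\", "]", "^", "`", "{", "|", "}"]
```

Set against QUERY_RESERVED in the patch, this suggests that " < > \ ^ ` { | } are not yet covered, while / ! ' ( ) * are escaped even though the grammar allows them in a query. The "#" never reaches the query after the split, because it starts the fragment.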

require 'uri'

module URI
  class RFC3986_Parser # :nodoc:
    # Non-validating splitting regular expression for RFC3986
    RFC3986_URI_SPLIT = Regexp.new '\A(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?\z'
    # Characters in the query string that naive_escape percent-encodes
    QUERY_RESERVED = /[\[\] \/!'()\*]/

    # Splits a URI into its five top-level components without validating them.
    # Components that are absent come back as nil.
    def non_validating_split(uri) # :nodoc:
      uri =~ RFC3986_URI_SPLIT
      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9
      [scheme, authority, path, query, fragment]
    end

    # Percent-encodes every byte of str, e.g. "[" becomes "%5B".
    def percent_encode(str) # :nodoc:
      str.each_byte.map { |byte| format('%%%02X', byte) }.join
    end

    # Parses uri strictly first. If that fails, escapes the query string
    # and retries exactly once before re-raising the error.
    def parse(uri, retry_parse = true) # :nodoc:
      scheme, userinfo, host, port,
        registry, path, opaque, query, fragment = self.split(uri)

      if scheme && URI.scheme_list.include?(scheme.upcase)
        URI.scheme_list[scheme.upcase].new(scheme, userinfo, host, port,
                                           registry, path, opaque, query,
                                           fragment, self)
      else
        Generic.new(scheme, userinfo, host, port,
                    registry, path, opaque, query,
                    fragment, self)
      end
    rescue URI::InvalidURIError
      raise unless retry_parse
      parse(naive_escape(uri), false)
    end

    private

    # Only attempts to escape the query string. Components that were absent
    # from the original URI (nil) are left out when it is put back together,
    # and the caller's string is not modified.
    def naive_escape(uri) # :nodoc:
      scheme, authority, path, query, fragment = non_validating_split(uri)

      parts = []
      parts << "#{scheme}:" if scheme
      parts << "//#{authority}" if authority
      parts << path.to_s
      parts << "?#{query.gsub(QUERY_RESERVED) { |match| percent_encode(match) }}" if query
      parts << "##{fragment}" if fragment
      parts.join
    end
  end # class RFC3986_Parser
end # module URI
# This is irb showing the problem
irb(main):001:0> Object::RUBY_VERSION
=> "2.2.0"
irb(main):002:0> URI.parse "http://user:1234@example.com/go/to/widgets?a[b]=1&test=hello world&x=/#hello"
URI::InvalidURIError: bad URI(is not URI?): http://user:1234@example.com/go/to/widgets?a[b]=1&test=hello world&x=/#hello
from /Users/artega/.rubies/ruby-trunk/lib/ruby/2.2.0/uri/rfc3986_parser.rb:47:in `split'
from /Users/artega/.rubies/ruby-trunk/lib/ruby/2.2.0/uri/rfc3986_parser.rb:53:in `parse'
from /Users/artega/.rubies/ruby-trunk/lib/ruby/2.2.0/uri/common.rb:223:in `parse'
from (irb):2
from /Users/artega/.rubies/ruby-trunk/bin/irb:11:in `<main>'
irb(main):003:0> require 'fixed_rfc3986_parser' # my monkey patch to URI::RFC3986_Parser
=> true
irb(main):004:0> URI.parse "http://user:1234@example.com/go/to/widgets?a[b]=1&test=hello world&x=/#hello"
=> #<URI::HTTP:0x007fb9a93bb860 URL:http://user:1234@example.com/go/to/widgets?a%5Bb%5D=1&test=hello%20world&x=%2F#hello>
irb(main):005:0>
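The escaped query in the transcript above can be reproduced in isolation, which is one way to start on tests for the escaping step. This is only a sketch: it copies the QUERY_RESERVED character class from the patch, and escape_query is a hypothetical standalone helper rather than a method of the patched parser.

```ruby
# Same character class as QUERY_RESERVED in the patch.
QUERY_RESERVED = /[\[\] \/!'()\*]/

# Hypothetical helper: percent-encode every byte of each matching character.
# Uses gsub (not gsub!) so the input string is left untouched.
def escape_query(query)
  query.gsub(QUERY_RESERVED) do |match|
    match.each_byte.map { |byte| format('%%%02X', byte) }.join
  end
end

puts escape_query('a[b]=1&test=hello world&x=/')
# a%5Bb%5D=1&test=hello%20world&x=%2F
```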