@codematix
Last active August 29, 2015 14:22
Regular Expressions to match URL components

Documentation

  1. Scheme - The scheme name consists of a sequence of characters beginning with a letter and followed by any combination of letters, digits, +, ., or -. Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters. The scheme name is followed by a colon :.
  2. Host Name - Hostname labels may contain only the ASCII letters a through z (case-insensitively), the digits 0 through 9, and the hyphen -. A hostname may not contain other characters such as the underscore _, although other kinds of DNS names may contain it.
  3. Port Number - A port number is a 16-bit unsigned integer, thus ranging from 0 to 65535.
  4. Path - The path, if present, may begin with a single forward slash /. It may not begin with two slash characters //. The path is a sequence of segments (conceptually similar to directories, though not necessarily representing them) separated by a forward slash /.
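  5. Query String - If present, the query string follows the path after a question mark ? and runs up to the # or the end of the URL. It conventionally holds key=value pairs separated by &.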
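
The rules above can be sketched as small validators. This is a minimal Python sketch, not part of the original patterns: the function names (`is_valid_scheme`, `canonical_scheme`, `is_valid_hostname`, `is_valid_port`, `is_valid_path`) are hypothetical, and the checks cover only the constraints stated in the list.

```python
import re

# Scheme: starts with a letter, then letters, digits, "+", "." or "-".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")
# Hostname label: ASCII letters, digits and "-" only (no underscore).
LABEL_RE = re.compile(r"[A-Za-z0-9-]+")
# Port: plain ASCII digits; the numeric range is checked separately.
PORT_RE = re.compile(r"[0-9]+")


def is_valid_scheme(scheme: str) -> bool:
    return SCHEME_RE.fullmatch(scheme) is not None


def canonical_scheme(scheme: str) -> str:
    # Schemes are case-insensitive; the canonical form is lowercase.
    return scheme.lower()


def is_valid_hostname(hostname: str) -> bool:
    # Every dot-separated label must be non-empty and use allowed characters.
    return all(LABEL_RE.fullmatch(label) for label in hostname.split("."))


def is_valid_port(text: str) -> bool:
    # A 16-bit unsigned integer: 0 through 65535.
    return PORT_RE.fullmatch(text) is not None and int(text) <= 65535


def is_valid_path(path: str) -> bool:
    # A path may start with one "/", but not with "//".
    return not path.startswith("//")
```

Note that a digit-count pattern such as `\d{1,5}` accepts values up to 99999, so the range check in `is_valid_port` is what enforces the 16-bit limit.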

Individual components

  1. protocol - ([\w\.\-\+]+:)
  2. username - ([\w\d\.]+)
  3. password - ([\w\d\.]+)
  4. userinfo - ($username:$password)
  5. hostname - ([a-zA-Z0-9\.\-_]+)
  6. port - (\d{1,5})
  7. host - ($hostname:$port)
  8. authority - ($userinfo@$host)
  9. origin - ($protocol\/{2}$authority)
  10. pathname - (\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)
  11. search - \?([a-zA-Z0-9=%\-_\.\*&;]+)
  12. hash - #([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?
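
Reading the `$name` placeholders as plain textual substitution, the components can be assembled programmatically. The following Python sketch rests on several assumptions: `COMPONENTS`, `resolve` and `URL_RE` are hypothetical names, `userinfo` carries no trailing `@` (so `authority` supplies the single `@`), and the overall shape `origin? pathname search? hash?` is inferred from the assembled versions rather than stated in the list.

```python
import re
from string import Template

# Components in definition order; each one only references earlier names.
COMPONENTS = {
    "protocol": r"([\w\.\-\+]+:)",
    "username": r"([\w\d\.]+)",
    "password": r"([\w\d\.]+)",
    "userinfo": r"($username:$password)",  # no trailing "@" (assumption)
    "hostname": r"([a-zA-Z0-9\.\-_]+)",
    "port": r"(\d{1,5})",
    "host": r"($hostname:$port)",
    "authority": r"($userinfo@$host)",
    "origin": r"($protocol\/{2}$authority)",
    "pathname": r"(\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)",
    "search": r"\?([a-zA-Z0-9=%\-_\.\*&;]+)",  # ";" included, as in the assembled versions
    "hash": r"""#([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?""",
}


def resolve(components):
    """Expand $name placeholders using the components resolved so far."""
    resolved = {}
    for name, pattern in components.items():
        resolved[name] = Template(pattern).substitute(resolved)
    return resolved


RESOLVED = resolve(COMPONENTS)

# Origin, search and hash are optional; the path is always required.
URL_RE = re.compile(
    "{origin}?{pathname}(?:{search})?(?:{hash})?".format(**RESOLVED)
)

match = URL_RE.fullmatch(
    "http://user:pass@www.google.com:8000/foo/bar?st=1&lt=10;#/koo9"
)
print(match.group(2), match.group(8), match.group(9), match.group(10))
```

Because `authority` gets its own capturing group here, the group numbers differ from the hand-written versions. As composed from the list, both the user info and the port are mandatory whenever an origin is present.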

Iterations

Version 1

(([\w\.\-\+]+:)\/{2}(([\w\d\.]+):([\w\d\.]+))@(([a-zA-Z0-9\.\-_]+):(\d{1,5})))?(\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)(?:\?([a-zA-Z0-9=%\-_\.\*&;]+))?(?:#([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?)?

Version 2

(([\w\.\-\+]+:)\/{2}(([\w\d\.]+):([\w\d\.]+))@(([a-zA-Z0-9\.\-_]+)(?::(\d{1,5}))?))?(\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)(?:\?([a-zA-Z0-9=%\-_\.\*&;]+))?(?:#([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?)?

Version 3 - FINAL

(([\w\.\-\+]+:)\/{2}(?:(([\w\d\.]+):([\w\d\.]+))@)?(([a-zA-Z0-9\.\-_]+)(?::(\d{1,5}))?))?(\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)(?:\?([a-zA-Z0-9=%\-_\.\*&;]+))?(?:#([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?)?
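
To see what each iteration changes, the three versions can be run against the same URLs. The Python sketch below is illustrative only: `HEADS`, `TAIL`, `PATTERNS` and `matching_versions` are hypothetical names, the patterns are split into a differing head and a shared tail purely for readability, and Version 3 is written with the `@` inside an optional non-capturing group together with the user info, which is an assumption about the intended behaviour.

```python
import re

# Path, search and hash are identical in all three versions.
TAIL = (
    r"(\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)"
    r"(?:\?([a-zA-Z0-9=%\-_\.\*&;]+))?"
    r"""(?:#([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?)?"""
)

# Only the origin part differs between iterations.
HEADS = {
    # User info and port both required.
    "Version 1": r"(([\w\.\-\+]+:)\/{2}(([\w\d\.]+):([\w\d\.]+))@(([a-zA-Z0-9\.\-_]+):(\d{1,5})))?",
    # Port becomes optional.
    "Version 2": r"(([\w\.\-\+]+:)\/{2}(([\w\d\.]+):([\w\d\.]+))@(([a-zA-Z0-9\.\-_]+)(?::(\d{1,5}))?))?",
    # User info (with its "@") becomes optional as well.
    "Version 3": r"(([\w\.\-\+]+:)\/{2}(?:(([\w\d\.]+):([\w\d\.]+))@)?(([a-zA-Z0-9\.\-_]+)(?::(\d{1,5}))?))?",
}

PATTERNS = {name: re.compile(head + TAIL) for name, head in HEADS.items()}


def matching_versions(url):
    """Return the names of the versions that match the whole URL."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(url)]


for url in (
    "http://user:pass@www.google.com:8000/foo/bar?st=1&lt=10;#/koo9",
    "http://user:pass@www.google.com/foo/bar?st=1&lt=10;#/koo9",
    "http://www.google.com/foo/bar?st=1&lt=10;#/koo9",
):
    print(url, "->", matching_versions(url))
```

`fullmatch` is used rather than `search` because the whole origin group is optional: with `search`, an earlier version would still find a path somewhere inside a URL it cannot otherwise parse.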

Test Cases

Full URL

http://user:pass@www.google.com:8000/foo/bar?st=1&lt=10;#/koo9

Without Port

http://user:pass@www.google.com/foo/bar?st=1&lt=10;#/koo9

Without Port and User Info

http://www.google.com/foo/bar?st=1&lt=10;#/koo9

Without Port, User Info and hash

http://www.google.com/foo/bar?st=1&lt=10;

Without Port, User Info, search and hash

http://www.google.com/foo/bar

Without Port, User Info, path, search and hash

http://www.google.com/

Path + Search + Hash

/foo/bar?st=1&lt=10#foo

Path + Search

/foo/bar?st=1&lt=10

Path Only

/foo/bar
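
The test cases above can be checked mechanically. The following Python sketch is one possible harness, not the author's: `URL_RE`, `GROUP_NAMES`, `parse` and `TEST_CASES` are hypothetical names, the group names are inferred from the component list, and the pattern groups the `@` with the user info (`(?:...@)?`) so that a port without user info is not misread as a password.

```python
import re

URL_RE = re.compile(
    r"(([\w\.\-\+]+:)\/{2}"
    r"(?:(([\w\d\.]+):([\w\d\.]+))@)?"  # "@" grouped with user info (assumption)
    r"(([a-zA-Z0-9\.\-_]+)(?::(\d{1,5}))?))?"
    r"(\/(?:[a-zA-Z0-9\.\-\/\+\%]+)?)"
    r"(?:\?([a-zA-Z0-9=%\-_\.\*&;]+))?"
    r"""(?:#([a-zA-Z0-9\-=,&%;\/\\"'\?]+)?)?"""
)

# One name per capturing group, in order of the opening parenthesis.
GROUP_NAMES = (
    "origin", "protocol", "userinfo", "username", "password",
    "host", "hostname", "port", "pathname", "search", "hash",
)

TEST_CASES = [
    "http://user:pass@www.google.com:8000/foo/bar?st=1&lt=10;#/koo9",
    "http://user:pass@www.google.com/foo/bar?st=1&lt=10;#/koo9",
    "http://www.google.com/foo/bar?st=1&lt=10;#/koo9",
    "http://www.google.com/foo/bar?st=1&lt=10;",
    "http://www.google.com/foo/bar",
    "http://www.google.com/",
    "/foo/bar?st=1&lt=10#foo",
    "/foo/bar?st=1&lt=10",
    "/foo/bar",
]


def parse(url):
    """Return a dict of named components, or None if the URL does not match."""
    match = URL_RE.fullmatch(url)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


for url in TEST_CASES:
    parts = parse(url)
    # Components that did not take part in the match are None; drop them.
    print(url, {k: v for k, v in parts.items() if v is not None})
```

Components that are absent from a URL come back as `None`, so a relative URL such as `/foo/bar` yields only a `pathname`.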
