@lanebpemberton
Last active June 18, 2021 00:36
Describing the structure of a regex designed to match a URL

Matching URLs With Regex

Summary

It can be important to match a string as a URL in form or database validation. For example, a job-finding platform that asks users for their portfolio website could validate the input to make sure it isn't gibberish. The following regex checks whether a string has the general shape of a URL that most popular browsers will recognize: /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})(:\d{1,5})?([/\w .-]*)\/?$/
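As a rough sketch of how the regex could be used for validation, the snippet below wraps it in a small function. The function name `looksLikeUrl` and the sample inputs are assumptions made for this example, and the forward slash inside the path bracket expression is escaped here only as a precaution for tools that are strict about regex literals.

```javascript
// The regex from the summary, written as a JavaScript regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})(:\d{1,5})?([\/\w .-]*)\/?$/;

// Hypothetical helper name, chosen for this example only.
function looksLikeUrl(input) {
  return urlPattern.test(input);
}

console.log(looksLikeUrl("https://www.example.com"));     // true
console.log(looksLikeUrl("example.com:8080/portfolio/")); // true
console.log(looksLikeUrl("not a url"));                   // false
console.log(looksLikeUrl("http://localhost"));            // false: no top level domain
console.log(looksLikeUrl("https://example.com/?q=1"));    // false: query strings are not covered
```

The last two results show that this is a shape check rather than a complete URL validator, so it may reject some URLs a browser would accept.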

Regex Components

I used the MDN reference https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL to break down what the regex is matching. In that reference's terms, this regex matches the scheme, domain name, port, and path to resource. It does not cover the parameters or anchor parts of a URL.

Anchors

I include a beginning '^' and an ending '$' anchor so the regex must match the entire string as a URL, rather than a URL-like substring somewhere inside it. This is consistent with the form and database validation use cases stated earlier.
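To illustrate what the anchors change, the sketch below compares the regex with and without '^' and '$'. The variable names and the sample sentence are assumptions made for this example.

```javascript
// Same pattern twice: once with anchors, once without.
const anchored = /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})(:\d{1,5})?([\/\w .-]*)\/?$/;
const unanchored = /(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})(:\d{1,5})?([\/\w .-]*)\/?/;

const input = "visit example.com today!";

// Without anchors, a URL-like substring anywhere in the input is enough.
console.log(unanchored.test(input)); // true

// With anchors, the whole input has to be the URL.
console.log(anchored.test(input));   // false
```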

Quantifiers

Quantifiers are explained one at a time going through the regex from left to right.

  1. '?' in 'https?' matches zero or one 's' character. This allows both secure (https) and non-secure (http) URLs.
  2. '?' after '(https?:\/\/)' makes the whole scheme group optional, since browsers don't require that part of the URL to be typed in.
  3. '+' in '[\da-z.-]+' matches one or more characters that make up the domain name, including any subdomains.
  4. '{2,6}' in '[a-z.]{2,6}' matches 2 to 6 characters that make up the top level domain.
  5. '{1,5}' in ':\d{1,5}' matches a port of 1 to 5 digits. This fits the length of IANA port numbers (0 to 65535), but it also accepts out-of-range values such as 99999.
  6. '?' after '(:\d{1,5})' allows a port but doesn't require it.
  7. '*' in '[/\w .-]*' matches zero or more characters that make up the path to resource.
  8. '?' in '\/?' allows an optional trailing forward slash.
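The quantifiers above can be tried out in isolation. The sketch below pulls three of them out of the full regex and anchors each fragment so that it has to match the whole test string; the variable names and sample strings are assumptions made for this example.

```javascript
// '?' : zero or one 's'
const scheme = /^https?$/;
console.log(scheme.test("http"));   // true
console.log(scheme.test("https"));  // true
console.log(scheme.test("httpss")); // false

// '{2,6}' : between 2 and 6 characters from the set
const topLevelDomain = /^[a-z.]{2,6}$/;
console.log(topLevelDomain.test("io"));      // true
console.log(topLevelDomain.test("a"));       // false: too short
console.log(topLevelDomain.test("website")); // false: 7 characters

// '{1,5}' : between 1 and 5 digits after the colon
const port = /^:\d{1,5}$/;
console.log(port.test(":80"));     // true
console.log(port.test(":99999"));  // true, although 99999 is above 65535
console.log(port.test(":123456")); // false: 6 digits
```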

Grouping Constructs

There are five main grouping constructs in the regular expression.

  1. (https?:\/\/) matches a scheme section in a URL
  2. ([\da-z.-]+) matches a domain name section in a URL, including any subdomains
  3. ([a-z.]{2,6}) matches a top level domain section in a URL
  4. (:\d{1,5}) matches a port section in a URL
  5. ([/\w .-]*) matches a path to resource section in a URL
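Because each section sits in its own group, the pieces of a matched URL can be read back out. The following is a minimal sketch using `String.prototype.match`; the variable names and the sample URL are assumptions made for this example.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})(:\d{1,5})?([\/\w .-]*)\/?$/;

// Index 0 of the result is the full match, so it is skipped here.
const [, scheme, domain, tld, port, path] =
  "https://docs.example.com:8080/guide/intro".match(urlPattern);

console.log(scheme); // "https://"
console.log(domain); // "docs.example"
console.log(tld);    // "com"
console.log(port);   // ":8080"
console.log(path);   // "/guide/intro"
```

Note that `match` returns null when the string does not match, so real code would check the result before destructuring it. An optional group that is absent, such as the port, comes back as undefined.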

Bracket Expressions

There are three bracket expressions in this regular expression that help match unique parts of a URL. Each one matches a single character from the set it lists; the order of the items inside the brackets doesn't matter.

  1. [\da-z.-] matches a digit, a lowercase letter a through z, a period, or a dash. Inside brackets the '.' is a literal period rather than the metacharacter that matches any single character, so escaping it is optional.
  2. [a-z.] matches a lowercase letter a through z or a period
  3. [/\w .-] matches a forward slash, a word character, a space, a period, or a dash
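A quick way to see how a bracket expression behaves is to test it against single characters. The sketch below does that for the first and third expressions; the variable names and inputs are assumptions made for this example.

```javascript
// One character from the domain set: digit, lowercase letter, period, or dash.
const domainChar = /^[\da-z.-]$/;
console.log(["7", "q", ".", "-"].every((c) => domainChar.test(c))); // true
console.log(domainChar.test("Q")); // false: uppercase is not in the set
console.log(domainChar.test("_")); // false

// The order inside the brackets does not constrain the order in the input.
console.log(/^[\da-z.-]+$/.test("z-9.a")); // true

// Inside brackets, '.' only matches a literal period.
console.log(/^[.]$/.test("x")); // false

// The path set also contains a space.
const pathChar = /^[\/\w .-]$/;
console.log(pathChar.test(" ")); // true
```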

Character Classes

The character classes used in this regex are the digit (\d) and word (\w) metacharacters, along with the lowercase letter range (a-z). \d matches any digit, and \w matches any letter, digit, or underscore. They're used here to match a domain name made of digits and lowercase letters, a top level domain that could include combinations of letters and periods, a port made of digits, and a path to resource with any word characters and forward slashes throughout.
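The shorthand classes can be checked on their own, as in this small sketch. The sample strings are assumptions made for this example.

```javascript
// \d matches digits only.
const digits = /^\d+$/;
console.log(digits.test("8080"));  // true
console.log(digits.test("80a"));   // false

// \w matches letters (either case), digits, and underscore.
const word = /^\w+$/;
console.log(word.test("My_page2")); // true
console.log(word.test("my-page"));  // false: '-' is not a word character

// a-z is a range of lowercase letters, so uppercase input is rejected.
const lower = /^[a-z]+$/;
console.log(lower.test("com")); // true
console.log(lower.test("Com")); // false
```

Since the domain and top level domain parts rely on a-z, the full regex rejects uppercase hosts unless the case-insensitive 'i' flag is added.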

Character Escapes

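Character escapes are used to match the forward slashes in the scheme section ('\/\/'), the literal period between the domain name and the top level domain ('\.'), and a potential trailing forward slash in the URL ('\/?'). Without the backslash, '/' would end a /.../ regex literal and '.' would match any single character.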
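The effect of escaping the period can be seen by comparing two shortened versions of the domain part of the regex. This is a sketch only; the variable names and the input "examplexcom" are assumptions made for this example.

```javascript
// Unescaped: '.' between the two groups matches any single character.
const unescapedDot = /^([\da-z.-]+).([a-z.]{2,6})$/;

// Escaped: '\.' only matches a literal period.
const escapedDot = /^([\da-z.-]+)\.([a-z.]{2,6})$/;

console.log(unescapedDot.test("examplexcom")); // true: some letter stands in for the dot
console.log(escapedDot.test("examplexcom"));   // false
console.log(escapedDot.test("example.com"));   // true
```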

Author

Regex run-through by Lane Pemberton. Let me know if you were able to use my work or improve upon it!
🔗 LinkedIn  :octocat: Github  📧 Email