@nitrotap
Last active April 23, 2022 22:42
Regular Expression Tutorial - URL Validation

Regex for URL validation

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i

The purpose of this tutorial gist is to explain a regex for URL validation. The regex validates a URL whose protocol (http:// or https://) is optional.

The expression begins with the ^ anchor followed by a capture group. The group matches the literal text http, then an optional s, then a colon and two forward slashes (each escaped as \/), which together make up http:// or https://. The whole group is made optional by the ? that follows it.

The second capture group matches the name portion of the website address. It looks for a series of digits, letters, periods, and hyphens, and the + quantifier requires at least one of them. The group must be followed by a literal period, which marks the end of the website name.

The next capture group matches the top-level domain. It accepts the letters a-z and periods, and requires between 2 and 6 characters.

The final capture group matches whatever follows the domain: any series of forward slashes, word characters (letters, digits, and underscores), spaces, periods, and hyphens. The * after the group allows it to repeat any number of times, and \/? then allows an optional trailing /. The $ anchor signifies the end of the string being tested, not the end of the regex. The i flag after the closing / makes the match case-insensitive, so uppercase and lowercase letters are treated as equal.
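As a quick check of the walkthrough above, the regex can be run against a few strings. This is a minimal sketch: the sample URLs and the names urlPattern and urlSamples are made up for illustration.

```javascript
// The URL validation regex from this tutorial, unchanged.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;

// Hypothetical sample inputs, not taken from the original gist.
const urlSamples = [
  "https://example.com",      // protocol present
  "example.com/path/to/page", // protocol omitted
  "HTTP://Example.COM/",      // uppercase is accepted because of the i flag
  "ftp://example.com",        // rejected: only http and https are allowed
  "example",                  // rejected: no period and top-level domain
];

for (const sample of urlSamples) {
  console.log(sample, "->", urlPattern.test(sample));
}
```

The first three samples print true and the last two print false.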

Summary

The regular expression I will be describing validates a URL with or without its protocol. I will be explaining all the properties of the regular expression, including anchors, quantifiers, etc.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i

Regex Components

Anchors

Anchors are a type of character that matches a position in the search string rather than a character in it. Anchors can match a position before, between, or after characters. They allow the regular expression to bind to a certain position within the search string, like the beginning or end of the string. The anchors are ^ for the start of the text and $ for the end of the text. Used together with the m flag, they match the beginning and end of each line as well.

In this URL validation regex, anchors include the starting character, ^, and the ending character, $.

Example: Matching a first word that starts with a capital letter: /^([A-Z])\w+/g In this example, the caret before the capture group means the capital letter must be at the start of the given string. If the given string was "Watches are great!", then the regular expression would match "Watches". If the string was "watches are great!", however, the regular expression would not match.
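The anchor example can be tried directly in JavaScript. This is a small sketch: the multi-line string and the variable names are assumptions added for illustration, and the g flag is dropped in the first pattern because only one match is possible at the start of a string.

```javascript
// ^ pins the match to the start of the string.
const startsWithCapital = /^([A-Z])\w+/;

console.log("Watches are great!".match(startsWithCapital)[0]); // "Watches"
console.log(startsWithCapital.test("watches are great!"));     // false

// With the m flag, ^ also matches at the start of every line.
const anchorLines = "first line\nSecond line\nThird line";
console.log(anchorLines.match(/^[A-Z]\w+/gm)); // [ 'Second', 'Third' ]
```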

Quantifiers

Quantifiers are metacharacters that say how many times the preceding character, bracket expression, or capture group must repeat. They are:
* meaning 0 or more times
+ meaning 1 or more times
? meaning either 0 or 1 time
{n} meaning exactly 'n' times
{min,max} meaning between min and max times (written without a space after the comma)

Quantifiers used in this regex: {2,6}, which limits the top-level domain to between 2 and 6 characters; +, which requires 1 or more of the preceding bracket expression; *, which allows 0 or more of the preceding bracket expression or group; and ?, which makes the preceding character or group optional.

Example: finding four-letter words with /\b\w{4}\b/g. The quantifier is {4} since we are searching for four-letter words. If we changed the quantifier to {3}, it would match all three-letter words instead.
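The quantifier examples can be sketched as follows. The sentence, the tldLength pattern, and the variable names are illustrative assumptions; tldLength only isolates the {2,6} part of the URL regex.

```javascript
const quantifierText = "The quick brown fox jumps over lazy dogs";

// {4} matches exactly four word characters between word boundaries.
console.log(quantifierText.match(/\b\w{4}\b/g)); // [ 'over', 'lazy', 'dogs' ]

// Changing the quantifier to {3} finds the three-letter words instead.
console.log(quantifierText.match(/\b\w{3}\b/g)); // [ 'The', 'fox' ]

// {2,6} accepts between 2 and 6 letters, as in the top-level domain group.
const tldLength = /^[a-z]{2,6}$/;
console.log(tldLength.test("com"));         // true
console.log(tldLength.test("museum"));      // true  (6 letters)
console.log(tldLength.test("photography")); // false (11 letters)
```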

OR Operator

This regex does not use the OR operator, |, which matches either the expression before it or the expression after it.
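For comparison, here is what the OR operator looks like in practice. The pattern below is a hypothetical variant of the protocol group that also accepts ftp; it is not part of the tutorial's regex.

```javascript
// Hypothetical variant: accept http, https, or ftp as the protocol.
const protocolWithFtp = /^(https?|ftp):\/\//i;

console.log(protocolWithFtp.test("ftp://example.com"));   // true
console.log(protocolWithFtp.test("https://example.com")); // true
console.log(protocolWithFtp.test("mailto:someone"));      // false
```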

Character Classes

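This regex uses the shorthand character classes \d (any digit) and \w (any letter, digit, or underscore) inside its bracket expressions. POSIX-style classes like [:alpha:] [:digit:] [:punct:] are not included, and JavaScript does not support them.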
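The shorthand classes \d and \w that appear inside this regex's bracket expressions can be tried on their own. The sample string is made up for illustration.

```javascript
const classSample = "a_b-c 9";

// \w matches letters, digits, and the underscore, but not "-" or a space.
console.log(classSample.match(/\w/g)); // [ 'a', '_', 'b', 'c', '9' ]

// \d matches digits only.
console.log(classSample.match(/\d/g)); // [ '9' ]
```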

Flags

Flags are placed at the end of the regex syntax, following the last /. This regex uses the i flag to make the match case-insensitive.

Flags change how the search is carried out. The most common flags in JavaScript are i, g, m, s, u, and y (newer engines also support d and v):

i ignores case (case-insensitive matching)
g looks for all matches instead of just the first match
m makes ^ and $ match at the start and end of each line (multi-line)
s lets a period also match newline characters (dotAll)
u enables Unicode mode, which treats the pattern as a sequence of Unicode code points
y matches only at the exact position given by the regex's lastIndex (sticky)

Example: Any word starting with a capital letter /([A-Z])\w+/g. Without the i flag, the regular expression only finds words that start with a capital letter. If we add the i flag at the end of the regular expression, /([A-Z])\w+/gi, the [A-Z] set also matches lowercase letters, so the expression matches any word of two or more characters that starts with a letter. The g denotes finding all matches instead of just the first.
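The effect of the g and i flags in the example above can be seen by running the same pattern with different flags. The sample sentence and variable name are made up for illustration.

```javascript
const flagText = "Denver is in Colorado";

// No g flag: only the first match is returned.
console.log(flagText.match(/([A-Z])\w+/)[0]); // "Denver"

// g flag: every capitalized word is returned.
console.log(flagText.match(/([A-Z])\w+/g));   // [ 'Denver', 'Colorado' ]

// g and i flags: [A-Z] now matches lowercase letters too.
console.log(flagText.match(/([A-Z])\w+/gi));  // [ 'Denver', 'is', 'in', 'Colorado' ]
```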

Grouping and Capturing

Grouping and capturing is used to group multiple characters within parentheses and to treat them as a single unit, so that a quantifier can apply to the whole group. The text matched by each group is also captured and can be read back after a match.

The first capture group, (https?:\/\/)?, checks for the http/https protocol by checking for each character: h, t, t, p, s (optional), :, \/ (an escaped /), \/. The final ? means the entire group is also optional.

The second capture group, ([\da-z\.-]+), checks for the website name, including any subdomains.

The third capture group, ([a-z\.]{2,6}), checks for the top-level domain, which this regex limits to between 2 and 6 characters. That limit is stricter than real-world rules: longer top-level domains such as .photography exist, and this regex rejects them.

The last capture group, ([\/\w \.-]*), checks for anything coming after the domain name, like a / indicating a directory, or a word character, space, period, or hyphen.

Example: Finding all capital letters: /([A-Z])/g. The capture group in this example wraps the character set A-Z, so each capital letter in the string is matched and captured.
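To see what each capture group holds, the result of match can be inspected by index. This is a sketch with a made-up URL; the names groupPattern and groupMatch are illustrative.

```javascript
// The URL validation regex from this tutorial, unchanged.
const groupPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;

// Hypothetical input chosen so that every group captures something.
const groupMatch = "https://www.example.com/docs/page.html".match(groupPattern);

console.log(groupMatch[1]); // "https://"        (protocol)
console.log(groupMatch[2]); // "www.example"     (website name with subdomain)
console.log(groupMatch[3]); // "com"             (top-level domain)
console.log(groupMatch[4]); // "/docs/page.html" (everything after the domain)
```

Index 0 of the result holds the whole match, and the numbered entries follow the order of the opening parentheses in the pattern.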

Bracket Expressions

Bracket Expressions match any one character from the set or range of characters listed between the square brackets. Bracket Expressions are the heart of regex; they specify which characters the regex accepts at a given position. They can contain individual characters, ranges such as a-z, and shorthand classes such as \d.

The first bracket expression [\da-z\.-] checks for any digits, with \d, any character a-z with a-z, any "." character with \., and any hyphen with -.

The second bracket expression, [a-z\.], checks for any letter or a period, since this regex assumes the top-level domain contains no digits or other special characters.

The third bracket expression, [\/\w \.-], looks for "/", any word character, a space, a period, or a hyphen.

Example: /([\d])/g This example matches every digit in the search content. The bracket expression [\d] matches a single digit, and the g flag repeats the search so that each digit is found.
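The bracket expressions can be exercised on their own. This is a minimal sketch: the sample strings and the nameChars pattern (the first bracket expression, anchored so it must cover the whole string) are assumptions for illustration.

```javascript
// [\d] matches one digit at a time; g collects every digit.
console.log("Order 66 shipped in 2 days".match(/([\d])/g)); // [ '6', '6', '2' ]

// The first bracket expression from the URL regex, anchored at both ends.
const nameChars = /^[\da-z\.-]+$/i;
console.log(nameChars.test("my-site.example")); // true
console.log(nameChars.test("my_site"));         // false: "_" is not in the set
```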

Greedy and Lazy Match

A greedy quantifier matches as many characters as it can while still letting the overall pattern succeed, while a lazy quantifier matches as few as it can. The quantifiers *, +, ?, and {min,max} are greedy by default, and adding a ? after one (*?, +?, ??, {min,max}?) makes it lazy. Every quantifier in the URL validation regex is greedy. The ? after the protocol capture group (https?:\/\/)? and after the final forward slash \/? is the optional quantifier, not the lazy modifier: each one matches its item once if it is present and zero times otherwise.

Example: Optional Match /(www\.)?/ This matches a www. at the current position if it is there, and matches nothing if it is not. The ? here is still greedy, because it prefers one occurrence over zero.

Example: Lazy Match /<.+?>/ Against the text <b>bold</b>, the lazy .+? stops at the first >, so the match is <b>.

Example: Greedy Match /([A-Z])\w+/g The \w+ in this expression is greedy, so each match extends to the end of the word instead of stopping after one word character. It is the g flag, not greediness, that makes the expression find every word starting with a capital letter.
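The difference between a greedy and a lazy quantifier shows up when the same text is matched both ways. The HTML-like sample string is made up for illustration.

```javascript
const greedyText = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ takes as much as possible, so the match runs to the last ">".
console.log(greedyText.match(/<.+>/)[0]);  // "<b>bold</b> and <i>italic</i>"

// Lazy: .+? takes as little as possible, so the match stops at the first ">".
console.log(greedyText.match(/<.+?>/)[0]); // "<b>"
```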

Back-references

Back-references match the same text as previously matched by a capturing group. They are written as a backslash followed by the group number, such as \1 for the first group. Back-references are not used in this regular expression.
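Although the URL regex has no back-references, a short example shows how one behaves. The doubled-word pattern and sample sentences below are assumptions added for illustration.

```javascript
// \1 must match the exact text that group 1 captured.
const doubledWord = /\b(\w+) \1\b/;

console.log("this is is a test".match(doubledWord)[0]); // "is is"
console.log(doubledWord.test("this is a test"));        // false
```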

Look-ahead and Look-behind

(?=...) look-ahead asserts that the text immediately following the current position matches the pattern after the "=".
(?<=...) look-behind asserts that the text immediately before the current position matches the pattern after the "=".
(?!...) negative look-ahead asserts that the text immediately following the current position does not match the pattern after the "!".
(?<!...) negative look-behind asserts that the text immediately before the current position does not match the pattern after the "!".

This regular expression does not contain any sort of look-ahead or look-behind.
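For reference, the four assertions can be sketched on a sample string. The string and patterns are made up for illustration, and look-behind assumes a JavaScript engine that supports ES2018 or later.

```javascript
const lookText = "Price: $42 and 17 items";

// Look-ahead: digits that are followed by " items".
console.log(lookText.match(/\d+(?= items)/g));      // [ '17' ]

// Look-behind: digits that are preceded by "$".
console.log(lookText.match(/(?<=\$)\d+/g));         // [ '42' ]

// Negative look-ahead: whole numbers not followed by " items".
console.log(lookText.match(/\b\d+\b(?! items)/g));  // [ '42' ]

// Negative look-behind: whole numbers not preceded by "$".
console.log(lookText.match(/(?<!\$)\b\d+/g));       // [ '17' ]
```

In each case the asserted text is checked but not included in the match.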

Author

Hi! I'm @nitrotap. I'm a developer living outside of Denver, CO. I love my cat since she is better than the rest of all the other cats in existence.
