@miss-mad
Last active November 2, 2022 21:47

Regex Tutorial: Matching a URL

What is a Regex?

Regex stands for "regular expression" and is a way to describe and search for patterns within text. It is not language-specific, meaning that most programming languages (JavaScript, Python, Ruby, C++, etc.) support it, although the exact syntax varies slightly between them. These expressions don't have to involve programming at all; they can be useful for general find-and-replace searches in a file. For programming, regex is often used for form or input validation.

There are literal and meta characters:

Literal characters are what we see - what is literally written, like "github.com."

Meta characters are ones that do not depict a single, literal character, but a generalized pattern.
Different categories of meta characters are outlined below.

Summary

This tutorial explains the different parts of one particular regex. This one checks whether the user input looks like a valid URL:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

This regex in particular is interesting to me because I received TA help for validating my URLs in challenge 9, and would like to understand more about how it operates.
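To get a feel for what this pattern accepts before breaking it down, here is a minimal sketch in JavaScript. The variable names and the sample inputs are my own assumptions for illustration; they are not from the original challenge code.

```javascript
// The URL regex from this tutorial, stored in a variable (the name is illustrative).
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs: three URL-like strings and two that should fail.
const samples = [
  'https://github.com',
  'github.com/miss-mad',
  'http://example.co.uk/docs/',
  'not a url',
  'https://',
];

// test() returns true when the whole string fits the pattern.
const results = samples.map((input) => urlRegex.test(input));
console.log(results); // [ true, true, true, false, false ]
```

Note that the pattern only lists lowercase letters, so an input with capital letters in the domain would be rejected unless the i flag is added.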

Regex Components

Literal Notation

Our regex is created with literal notation. This is indicated by the forward slash / / characters that bookend the regex. These are called delimiters.

In JavaScript, regex objects can also be created with the RegExp constructor, which takes the pattern as a string ("" instead of / /). Because the pattern is a string, every backslash must be doubled (\\) so that it survives string escaping.

Our example rewritten with the constructor would be:

const urlMatchRegex = new RegExp('^(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$');
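The difference between the two notations is easiest to see side by side. This is a small sketch with hypothetical variable names and a sample input of my choosing; it also shows what goes wrong when a backslash inside a pattern string is not doubled.

```javascript
// Literal notation: backslashes are written once.
const literalRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Constructor notation: the pattern is a string, so each backslash is doubled.
const constructedRegex = new RegExp('^(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$');

const sample = 'https://github.com'; // hypothetical input
console.log(literalRegex.test(sample));     // true
console.log(constructedRegex.test(sample)); // true

// Pitfall: with a single backslash, the string '\d+' is just 'd+',
// so the regex looks for the letter d instead of digits.
const broken = new RegExp('\d+');
console.log(broken.test('123')); // false
console.log(broken.test('ddd')); // true
```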

Anchors

Both the caret symbol ^ and dollar sign $ are anchors. Notice these also wrap the regex and are located just inside the forward slashes /.

^ This anchor asserts that the match must start at the beginning of the string, so the string has to begin with the characters that follow it.
Basically, it marks the beginning of the input being tested.

$ Similarly, this anchor asserts that the match must end at the end of the string, so the string has to end with the characters that precede it.
It essentially marks the end of the input being tested.

An example of another regex using the same anchors in the same positions (this regex matches an HTML tag):

/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
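A short sketch of what the anchors change, using a made-up pattern and made-up inputs: without anchors the pattern may match anywhere in the string, while with both anchors the whole string has to fit.

```javascript
// Same literal text, with and without anchors (patterns are illustrative).
const unanchored = /github/;
const anchored = /^github$/;

console.log(unanchored.test('my github page')); // true: found somewhere in the string
console.log(anchored.test('my github page'));   // false: the string has extra text
console.log(anchored.test('github'));           // true: the whole string matches
```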

Subexpressions/Grouping Constructs

Subexpressions are denoted with parentheses () and are used to break up sections of the regex. Each pair of parentheses forms one subexpression (also called a group).

Subexpressions look for an exact match unless noted otherwise, unlike bracket expressions (explained later).

In our regex, we have four subexpressions:

  1. (https?:\/\/) (Note that the colon : here is a literal character that matches the ":" in "http://"; it does not separate subexpressions)
  2. ([\da-z\.-]+)
  3. ([a-z\.]{2,6})
  4. ([\/\w \.-]*)

Capturing

Subexpressions can be capturing or non-capturing. Capturing means that the pattern the subexpression finds is remembered for possible reuse or reference later on. If it's non-capturing, it does not do this. A subexpression can be specified to be non-capturing with a question mark + colon ?: at the beginning of the subexpression just inside the first parentheses.

An example involving a section of our regex but without capturing/remembering the "http" or "https": (?:https?)
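To see what "remembered" means in practice, this sketch uses JavaScript's match() on a sample URL of my choosing. The captured text of each group is available by its number; the second pattern is a shortened, hypothetical variant that makes the protocol group non-capturing.

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match() returns an array: index 0 is the full match, 1..4 are the groups.
const match = 'https://github.com/miss-mad'.match(urlRegex);
console.log(match[1]); // 'https://'
console.log(match[2]); // 'github'
console.log(match[3]); // 'com'
console.log(match[4]); // '/miss-mad'

// With (?: ) the protocol is still matched but no longer numbered,
// so the domain name becomes group 1.
const nonCapturing = /^(?:https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})/;
const shortMatch = 'https://github.com'.match(nonCapturing);
console.log(shortMatch[1]); // 'github'
console.log(shortMatch[2]); // 'com'
```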

Escaping

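The backslash character \ "escapes" a character that would otherwise have been interpreted as a meta character, so that it is matched literally instead. (It also works the other way: in front of certain letters, as in \d or \w, it gives an ordinary letter a special meaning.)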

Our regex copied again for the following examples: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

The forward slashes are used as delimiters to bookend the regex in literal notation (as mentioned above). If we use another forward slash / somewhere within the regex, we must escape it so that the computer doesn't think we are ending our regex early. Escaping the forward slash looks like this: \/. This happens four times in our regex.

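The dot . must also be escaped because otherwise, it matches any character except for line breaks such as the newline character \n. Escaping the dot . looks like this: \.. This also happens four times in our regex. (Inside a bracket expression the dot is already treated literally, so escaping it there is optional but harmless.)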
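The effect of escaping the dot can be checked with a small sketch. The two patterns and the inputs below are made up for illustration.

```javascript
// Unescaped dot: matches any single character in that position.
const unescapedDot = /^github.com$/;
// Escaped dot: matches only a literal "." in that position.
const escapedDot = /^github\.com$/;

console.log(unescapedDot.test('githubXcom')); // true: "X" satisfies the wildcard
console.log(escapedDot.test('githubXcom'));   // false: "X" is not a dot
console.log(escapedDot.test('github.com'));   // true
```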

Quantifiers

Quantifiers specify how many times the preceding character, bracket expression, or group must appear for the pattern to match. There are six quantifiers:

* Matches the pattern 0+ times

+ Matches the pattern 1+ times

? Matches the pattern 0 or 1 time

The last three quantifiers are the three ways to set limits:

{ n } Matches the pattern exactly n number of times

{ n, } Matches the pattern n+ number of times (n or more)

{ n, x } Matches the pattern a minimum of n number of times, up to a maximum of x number of times

Our regex copied again for the following examples: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

In our regex, the question mark ? is used three times; the first two apply to the first subexpression (one inside it and one directly after it). In both cases, it's telling us that what the question mark ? follows is optional.

(https?:\/\/)?

When the question mark ? follows the s, that means that the URL may have "http" or it might have "https."

When the question mark ? follows the entire subexpression (https?:\/\/), that means that the URL might not have the protocol section at all, and to still accept the input as a match if that section is omitted.

Later, at the end of the regex, the question mark ? tells us that the forward slash / is optional as well.

Also in our regex, the star * quantifier is used twice. The first star follows the bracket expression [\/\w \.-], meaning that zero or more characters from that set can appear in a row.

Then, the second star * says that this entire group can repeat zero or more times. Together, these star * quantifiers allow optional file directories to be inputted in the URL. Note that because the forward slash is itself inside the bracket expression, the first star alone already matches a path with any number of directories (such as "/a/b/c"), so the second star is redundant here; replacing it with a question mark ? would accept the same URLs. Nesting one star inside another like this can also make the regex very slow on long inputs that almost match.

Next, the plus sign + quantifier is also used. This means that we can have one or more matches of this bracket expression [\da-z\.-].

Lastly, the limiting { n, x } quantifier is used in our regex. {2,6} signifies that we want to match this bracket expression [a-z\.] a minimum of 2 times and a maximum of 6 times. This means that we can have ".com" or a country-specific domain like ".co.uk," and others like this.
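The quantifiers discussed above can be tried in isolation. This is a sketch that pulls three pieces out of our regex and wraps each in anchors; the variable names and the inputs are assumptions made for the example.

```javascript
// ? after "s": zero or one "s".
const protocolWord = /^https?$/;
const protocolResults = ['http', 'https', 'httpss'].map((s) => protocolWord.test(s));
console.log(protocolResults); // [ true, true, false ]

// {2,6}: between 2 and 6 lowercase letters or dots.
const topLevelDomain = /^[a-z\.]{2,6}$/;
const tldResults = ['com', 'co.uk', 'c', 'toolong'].map((s) => topLevelDomain.test(s));
console.log(tldResults); // [ true, true, false, false ]

// +: at least one character from the set, so an empty string fails.
const domainName = /^[\da-z\.-]+$/;
const domainResults = ['my-site1', ''].map((s) => domainName.test(s));
console.log(domainResults); // [ true, false ]
```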

Greedy and Lazy Match

Greedy quantifiers match as many characters as possible while still allowing the overall pattern to match. This is the default behavior for all quantifiers, so no additional symbols are needed to make it greedy.

Lazy quantifiers match as few characters as possible (sometimes called reluctant). To make any quantifier lazy, add a question mark ? after it. Remember that the question mark ? by itself is its own quantifier, so lazy mode only happens when the question mark ? follows another quantifier.

*?
+?
??
{ n }?
{ n, }?
{ n, x }?

A lazy quantifier example: /.+?/ means that this regex searches for any character except the newline character . one or more times + but in a lazy way +? so that it stops as soon as it can. Against the string "abc" it matches only "a", whereas the greedy /.+/ would match all of "abc".
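The difference is clearer when something has to follow the quantifier. In this sketch (the sample text and patterns are my own), the greedy version runs to the last ">" it can reach, while the lazy version stops at the first one.

```javascript
const text = '<a><b>'; // hypothetical sample text

// Greedy: .+ grabs as much as possible, then gives back only what it must.
const greedy = text.match(/<.+>/)[0];
console.log(greedy); // '<a><b>'

// Lazy: .+? grabs as little as possible, stopping at the first ">".
const lazy = text.match(/<.+?>/)[0];
console.log(lazy); // '<a>'

// With nothing after it, a lazy .+? settles for a single character.
const single = 'abc'.match(/.+?/)[0];
console.log(single); // 'a'
```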

Bracket Expressions

Bracket expressions are expressions enclosed in square brackets []. Anything inside [] defines the set of characters we want to match, and a bracket expression on its own matches exactly one character from that set. Ranges can be marked with a hyphen - between two letters or numbers to show the limits of the range we're searching.

Bracket expressions are synonymous with a "positive character group," meaning that these are the characters we want.

The inverse of a positive character group is a "negative character group" and that is one that shows characters we don't want. To make a bracket expression negative, all that needs to be added is a ^ at the beginning of the expression just inside the first square bracket.

Note that within bracket expressions, we want to match any of the characters or character ranges we define, in any order. The search pattern doesn't require the string to match every requirement, just any of them (at least one).

A positive character group example: [a-z0-9_-] which matches a single character that is a lowercase letter between a and z, a digit between 0 and 9, an underscore, or a hyphen. Adding a quantifier, as in [a-z0-9_-]+, matches any combination of those characters.

A negative character group example: [^aeiou] matches a single character that is not a lowercase vowel.

In our regex, we have three bracket expressions:

  1. [\da-z\.-]
  2. [a-z\.]
  3. [\/\w \.-]

Within these bracket expressions, we have two examples of ranges, and both are for the same range. [a-z] This is the same as writing out all letters from a to z between the square brackets []. Similarly, [a-c] is the same as writing [abc].
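Here is a brief sketch of positive groups, negative groups, and ranges in JavaScript. The inputs and variable names are invented for the example.

```javascript
// Positive group with a quantifier: the whole string must use only these characters.
const allowedChars = /^[a-z0-9_-]+$/;
console.log(allowedChars.test('user_name-42')); // true
console.log(allowedChars.test('User Name'));    // false: capital letters and a space

// Negative group: remove every character that is NOT a lowercase vowel.
const vowelsOnly = 'hello world'.replace(/[^aeiou]/g, '');
console.log(vowelsOnly); // 'eoo'

// Range: [a-c] behaves the same as [abc].
const rangeMatches = 'cabbage'.match(/[a-c]/g).join('');
const listMatches = 'cabbage'.match(/[abc]/g).join('');
console.log(rangeMatches, listMatches); // 'cabba' 'cabba'
```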

Character Classes

Character classes define character sets. In other words, bracket expressions, including both positive and negative character groups, are character classes.

Four of the most common character classes are:

. Matches any character whatsoever - except the newline character

\d Matches any numeral digit

\w Matches any alphanumeric character including underscore _ (w stands for "word")

\s Matches a single whitespace character, including tabs and line breaks

Three of these character classes also have an easy inverse:

\D Is the inverse of \d and finds a non-digit character

\W Is the inverse of \w and finds non-alphanumeric, non-underscore characters

\S Is the inverse of \s and finds non-whitespace characters

In our regex, there are two character classes shown in these two code snippets (excluding the bracket expressions as a whole that we just discussed):

  1. [\da-z\.-] (focusing on \d)
  2. [\/\w \.-] (focusing on the \w)

Both bracket expressions house a positive character group. The first matches any digit, any lowercase letter from a to z, a dot (made literal by escaping it), or a hyphen. The second matches a forward slash (made literal by escaping it), any alphanumeric character including underscores, a space, a dot (also made literal by escaping it), or a hyphen.
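The shorthand classes and their inverses can be compared on one sample string. This is only a sketch; the text and variable names are assumptions of mine.

```javascript
const text = 'Room 42, floor_3'; // hypothetical sample text

// \d: every digit, one at a time.
const digits = text.match(/\d/g);
console.log(digits); // [ '4', '2', '3' ]

// \w+: runs of letters, digits, and underscores.
const words = text.match(/\w+/g);
console.log(words); // [ 'Room', '42', 'floor_3' ]

// \s: split on each whitespace character.
const pieces = text.split(/\s/);
console.log(pieces); // [ 'Room', '42,', 'floor_3' ]

// \W: remove everything that is not a word character.
const wordCharsOnly = text.replace(/\W/g, '');
console.log(wordCharsOnly); // 'Room42floor_3'
```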

Matching a URL - Regex Summary

The above categories explain the regex I've selected for matching a URL. Our regex is copied here again for reference and summarized from left to right:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

The delimiter / shows that this is a regex. The caret ^ anchors the match to the beginning of the string. The first subexpression () is a capturing group for the protocol, and it is entirely optional: the letters "http" can be followed by zero or one ? letter s, and the whole "http://" or "https://" can appear zero or one ? times. The second subexpression () contains a bracket expression [] for the domain name of the website. This expression looks for one or more + digits \d, lowercase letters a-z, dots \. or hyphens -. Between the second and third subexpressions is a literal dot \.. The third subexpression () contains another bracket expression [] that looks for a minimum of 2 and a maximum of 6 {2,6} lowercase letters or dots a-z\.. The fourth subexpression () contains a final bracket expression [] that says we can have zero or more * forward slashes \/; letters, numbers, underscores \w; spaces; dots \.; or hyphens -. The second star * quantifier outside of the fourth subexpression indicates that the entire subexpression can repeat zero or more times, which allows file directory paths after the domain name. Then, we can have zero or one ? forward slash \/ at the end of the URL (the trailing slash), which means that it is optional as well. Finally, the dollar sign $ anchors the match to the end of the string, and the final delimiter / closes the regex.

Regex Components - Unused

The below additional categories do not pertain to our regex, but are explained generally in terms of other regexes.

Boundaries

The most common boundary type is a word boundary, backslash b \b. In short, it marks where a word starts or ends. If backslash b \b is placed at the beginning of the word, whatever is to the left of it is not a word character (in other words, not a letter, digit, or underscore): it is either the start of the string or a non-word character such as a space. The opposite is true if the backslash b \b is placed at the end of the word. If placed on both sides of the word, that means the characters on either side of the word are non-word characters (or the string starts or ends there).

For example: \bthe\b would match "the" on its own and the "the" inside "in the ocean," but not "theme." \bthe would match the start of "theme." the\b would match the end of "absinthe."
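The same examples can be checked directly in JavaScript. This is a minimal sketch; the variable names are illustrative.

```javascript
const wholeWord = /\bthe\b/; // boundary on both sides
const wordStart = /\bthe/;   // boundary only before
const wordEnd = /the\b/;     // boundary only after

console.log(wholeWord.test('in the ocean')); // true
console.log(wholeWord.test('theme'));        // false: "m" follows, so no boundary after "the"
console.log(wordStart.test('theme'));        // true
console.log(wordEnd.test('absinthe'));       // true
console.log(wordStart.test('absinthe'));     // false: "n" comes right before "the"
```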

Back-references

Backreferences are used to match the same text again by using a backslash followed by the group number, such as \1 or \2. (The number stands for a group here; this is unrelated to the newline character \n.) The group number is a reference to the order of the capturing groups, counted from left to right. Remember that capturing is just a type of subexpression or grouping construct, and these subexpressions are marked by parentheses ().

Across programming languages, most can support up to 99 capturing groups to backreference. So \99 is valid if there is a 99th subexpression.

In our regex, there are four subexpressions (as outlined in the subexpression section). If we wanted to backreference any of these four, we would put a backslash + the number of that group based on its order in the regex.

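For example, to backreference the first capturing group, we would write \1 just before the dollar sign $ and forward slash / delimiter that marks the end of the regex. /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?\1$/ This is only to show the syntax: when the first group captures a protocol such as "https://", the URL would now have to end with that same text repeated, so this version is not useful for validating URLs.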
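A backreference is easier to appreciate in a pattern where repeating the captured text makes sense. The two patterns below are hypothetical examples of mine, not part of our URL regex: one finds a doubled word, and the other is a simplified check that an HTML closing tag matches its opening tag.

```javascript
// \1 must match exactly what group 1 captured.
const repeatedWord = /^(\w+) \1$/;
console.log(repeatedWord.test('hello hello')); // true
console.log(repeatedWord.test('hello world')); // false

// Simplified tag matcher (illustrative only): the closing tag name
// has to be the same text as the opening tag name.
const simpleTag = /^<([a-z]+)>(.*)<\/\1>$/;
console.log(simpleTag.test('<p>hi</p>'));   // true
console.log(simpleTag.test('<p>hi</div>')); // false
```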

Look-ahead and Look-behind

Lookahead and lookbehind are collectively called "lookaround." A lookaround only checks whether a pattern appears before or after the current position; it does not consume any characters, so the text it checks is not included in the match. We use it to find a match only if our character or character group is preceded by or followed by another pattern. Lookaheads and lookbehinds can also be chained together to look for multiple patterns preceding or following the characters.

Lookarounds can be made negative with an exclamation point !.

X(?=Y) Positive lookahead, meaning look for X if it's followed by Y

X(?!Y) Negative lookahead, meaning look for X if it's not followed by Y

(?<=Y)X Positive lookbehind, meaning look for X if it's after Y

(?<!Y)X Negative lookbehind, meaning look for X if it's not after Y

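An example of a positive lookahead: \d+(?=\s) which looks for one or more digits \d+ only if they are followed by a whitespace character \s. In "5 cents" it matches the "5" (the space is checked but is not part of the match).

An example of a positive lookbehind: (?<=\$)\d+ which looks for one or more digits preceded by a dollar sign $. In "$8" or "$2984613" it matches the digits but not the dollar sign.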
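These two lookarounds can be tried with match(), which shows exactly which characters end up in the result. This is a sketch with sample strings of my choosing; it assumes a JavaScript engine that supports lookbehind, which current versions of Node.js and the major browsers do.

```javascript
// Positive lookahead: digits that are followed by whitespace.
const beforeSpace = '5 cents'.match(/\d+(?=\s)/)[0];
console.log(beforeSpace); // '5' (the space is not included)

// No whitespace after the digit, so there is no match at all.
const noSpace = '5cents'.match(/\d+(?=\s)/);
console.log(noSpace); // null

// Positive lookbehind: digits that come right after a dollar sign.
const afterDollar = 'Total: $2984613'.match(/(?<=\$)\d+/)[0];
console.log(afterDollar); // '2984613' (the dollar sign is not included)
```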

OR Operator

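The OR operator looks like this |. It matches either the expression to its left or the expression to its right. It is often used inside grouping constructs so that the group accepts any one of several alternatives rather than a single exact sequence (similar to how a bracket expression accepts any one of several characters).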

Our regex does not have an OR operator, but this is an example that does (this regex matches a HEX value):

/^#?([a-f0-9]{6}|[a-f0-9]{3})$/
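A quick sketch of how the two alternatives in this HEX regex behave; the sample values are made up for the example.

```javascript
// Either six or three lowercase hex characters, with an optional leading "#".
const hexRegex = /^#?([a-f0-9]{6}|[a-f0-9]{3})$/;

const hexSamples = ['#fff', '#a3c113', 'a3c113', '#ffff', '#ggg'];
const hexResults = hexSamples.map((value) => hexRegex.test(value));
console.log(hexResults); // [ true, true, true, false, false ]
```

The value "#ffff" fails because four characters satisfy neither alternative, and "#ggg" fails because g is outside the a-f range.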

Flags

This regex does not contain any flags. For regexes that do, flags are located at the very end after the ending /. The most common are the g, i, or m flags:

g This flag is for a global search, where the regex finds every match in a string instead of stopping after the first

i This flag is for a case-insensitive search

m This flag is for a multi-line search, where the anchors ^ and $ match at the start and end of each line instead of only at the start and end of the whole string

Our regex does not have any flags, but this is an example that does: /\w+\/(\d+)/gmi (flags can be applied right after one another; in this case, we are telling the search pattern to find all matches, ignore capital/lowercase, and treat the start and end of each line as anchor points).
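The effect of each flag shows up when the same pattern is run with and without it. This is a minimal sketch; the sample strings and variable names are assumptions for illustration.

```javascript
const text = 'Regex REGEX regex'; // hypothetical sample text

// g alone: all matches, but still case-sensitive.
const globalOnly = text.match(/regex/g);
console.log(globalOnly.length); // 1

// g and i together: all matches, ignoring case.
const globalInsensitive = text.match(/regex/gi);
console.log(globalInsensitive.length); // 3

// m: ^ and $ also match at the start and end of each line.
const lines = 'one\ntwo';
console.log(/^\w+$/.test(lines)); // false: the newline breaks a whole-string match
const perLine = lines.match(/^\w+$/gm);
console.log(perLine); // [ 'one', 'two' ]
```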

Author

Thanks for reading and hopefully you have found this helpful! I am an emerging full stack web developer with Georgia Tech. Find more solo projects and collaborations on my GitHub.

Credits

https://www.youtube.com/watch?v=7DG3kCDx53c&list=PLRqwX-V7Uu6YEypLuls7iidwHMdCM6o2w&index=1&ab_channel=TheCodingTrain

https://coding-boot-camp.github.io/full-stack/computer-science/regex-tutorial

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#examples

https://www.markdownguide.org/basic-syntax/

https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

https://javascript.info/regexp-introduction

https://learn.microsoft.com/en-us/dotnet/standard/base-types/quantifiers-in-regular-expressions

https://www.rexegg.com/regex-quantifiers.html

https://javascript.info/regexp-greedy-and-lazy

https://blog.hubspot.com/marketing/parts-url#:~:text=What%20are%20the%20parts%20of,%2Dlevel%20domain%2C%20and%20subdirectory.

https://www.rexegg.com/regex-boundaries.html

https://javascript.info/regexp-backreferences

https://www.regular-expressions.info/backref.html

https://www.rexegg.com/regex-lookarounds.html

https://javascript.info/regexp-lookahead-lookbehind

https://regex101.com/
