@dark40
Last active September 17, 2022 09:20

Regex Tutorial for URL Matching

This is a Regex (Regular Expression) tutorial demonstrating how to match a URL.

Summary

A URL (Uniform Resource Locator), or link, is widely used in daily life. People use URLs for web surfing, while developers use them for routing. This tutorial explains in detail how to use a regex to match a URL.

Here is a peek at what is covered in this tutorial.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

This is a search pattern meant for URL validation. That is, it checks whether a string fulfills the requirements for a URL.

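Table of Contents

  • Anchors
  • Quantifiers
  • Grouping Constructs
  • Bracket Expressions
  • Character Classes
  • The OR Operator
  • Flags
  • Character Escapes
  • Author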

Regex Components

A regex is written as a literal, so the pattern must be wrapped in slash characters /.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
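As a minimal sketch, assuming a JavaScript runtime such as Node.js, the literal can be stored in a variable and applied with the built-in test method. The names urlPattern and isUrl are illustrative only and do not come from the tutorial.

```javascript
// The regex literal: everything between the two slashes is the pattern.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical helper: returns true when the whole string matches the pattern.
function isUrl(candidate) {
  return urlPattern.test(candidate);
}

console.log(isUrl("https://github.com/dark40")); // true
console.log(isUrl("github.com"));                // true
console.log(isUrl("not a url"));                 // false
```

Note that the pattern only lists lowercase letters, so an input such as "GitHub.com" would need the i flag described under Flags.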

Anchors

The characters ^ and $ are both regarded as anchors.

The ^ anchor signifies a string that begins with the characters that follow it. In our regex, ^(https?:\/\/)? means that, if a protocol is present, the string must begin with http:// or https://. The ? mark will be explained later.

The $ anchor signifies a string that ends with the characters that precede it.

So in our "Matching URL" regex, the whole string, from start to end, must match the pattern (https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?
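To see what the anchors contribute, the sketch below compares the tutorial's pattern with a copy that has ^ and $ removed. It assumes JavaScript; the variable names and the sample sentence are made up for illustration.

```javascript
// Same pattern body, with and without the ^ and $ anchors.
const anchored = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
const unanchored = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/;

const sentence = "Visit github.com!";

// Anchored: the whole string must be a URL, so the surrounding text makes it fail.
const wholeStringIsUrl = anchored.test(sentence);
console.log(wholeStringIsUrl); // false

// Unanchored: a URL-like substring anywhere in the string is enough.
const firstHit = unanchored.exec(sentence);
console.log(firstHit[0]); // "github.com"
```

This is why the anchors matter for validation: without them the regex searches inside the string instead of checking the string as a whole.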

Quantifiers

Quantifiers set the limits of the string that your regex matches (or an individual section of the string). They frequently include the minimum and maximum number of characters that your regex is looking for.

  • {2,6} matches the preceding pattern between 2 and 6 times, so [a-z\.]{2,6} matches 2 to 6 characters.

  • More specifically, curly brackets can provide three different ways to set limits for a match (written without spaces inside the braces):

    • {n} - matches the pattern exactly n times
    • {n,} - matches the pattern at least n times
    • {n,x} - matches the pattern at least n times and at most x times
  • * - matches the pattern zero or more times

  • + - matches the pattern one or more times

  • ? - matches the pattern zero or one time

In our case, (https?:\/\/)? means the protocol "http://" or "https://" may appear once or not at all. Inside the group, s? makes the "s" optional.

[\da-z\.-]+ matches one or more characters, for example "boot-camp.github".

[a-z\.]{2,6} matches a domain ending such as "com", "io", or "club".

[\/\w \.-]* matches an optional path such as "/api/home/dashboard".

\/? means the URL may or may not have a / at the end.
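The quantifiers can be tried in isolation. The following sketch assumes JavaScript and uses small helper patterns (tld, protocol, and the sample strings are invented for this example) that each exercise one piece of the URL regex.

```javascript
// {2,6}: between 2 and 6 characters from the class.
const tld = /^[a-z\.]{2,6}$/;
const tldResults = ["io", "club", "x", "website"].map((s) => tld.test(s));
console.log(tldResults); // [ true, true, false, false ]

// ?: the "s" is optional, and so is the whole protocol group.
const protocol = /^(https?:\/\/)?$/;
const protocolResults = ["http://", "https://", "", "httpss://"].map((s) => protocol.test(s));
console.log(protocolResults); // [ true, true, true, false ]

// +: needs at least one character; *: is satisfied by zero characters.
const plusOnEmpty = /^[\da-z\.-]+$/.test("");
const starOnEmpty = /^[\/\w \.-]*$/.test("");
console.log(plusOnEmpty, starOnEmpty); // false true
```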

Grouping Constructs

The primary way to group a section of a regex is by using parentheses (). Each section within parentheses is known as a subexpression.

In our case, we have four groups: (https?:\/\/), ([\da-z\.-]+), ([a-z\.]{2,6}), and ([\/\w \.-]*).
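Besides applying quantifiers to a whole section, parentheses also capture what each section matched. A minimal sketch, assuming JavaScript's String.prototype.match; the destructured names protocol, host, tld, and path are labels chosen for this example.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match() returns the full match at index 0, followed by one entry per group.
const [full, protocol, host, tld, path] = "https://github.com/dark40".match(urlPattern);

console.log(full);     // "https://github.com/dark40"
console.log(protocol); // "https://"
console.log(host);     // "github"
console.log(tld);      // "com"
console.log(path);     // "/dark40"
```

Because [\da-z\.-]+ also accepts dots, a longer host such as "boot-camp.github.io" is split at the last dot, giving "boot-camp.github" and "io".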

Bracket Expressions

Anything inside a set of square brackets [] represents a set of characters, and the expression matches any single character from that set. A hyphen between two characters, as in a-z, defines a range.

In our example,

  • [a-z\.] matches any lowercase letter or the . symbol. Repeated, it can match strings like "yahoo." or "abc.".

  • [\da-z\.-] matches any Arabic numeral digit, any lowercase letter, or the symbols . and -. Repeated, it can match strings like "coding-boot-camp.".

  • [\/\w \.-] matches any alphanumeric character from the basic Latin alphabet, the underscore _, or the symbols /, space, . and -. Repeated, it can match strings like "/regex-tutorial".

You may have noticed that \. appears a lot inside the expression. This is covered under Character Escapes.
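A short sketch, assuming JavaScript, of how a bracket expression behaves on its own and with a quantifier. The pattern and variable names are illustrative.

```javascript
// Without a quantifier, a bracket expression matches exactly one character.
const oneChar = /^[a-z\.]$/;
const oneCharResults = ["a", ".", "ab"].map((s) => oneChar.test(s));
console.log(oneCharResults); // [ true, true, false ]

// With +, the same kind of set can repeat.
const hostChars = /^[\da-z\.-]+$/;
const hostResults = ["coding-boot-camp.", "Coding_Boot_Camp"].map((s) => hostChars.test(s));
console.log(hostResults); // [ true, false ] (uppercase letters and _ are not in the set)

// The path set additionally allows /, _ and uppercase letters through \w.
const pathChars = /^[\/\w \.-]*$/;
const pathResult = pathChars.test("/regex-tutorial");
console.log(pathResult); // true
```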

Character Classes

A character class in regex defines a set of characters. In our case, the following are used.

  • \d - It matches any Arabic numeral digit. This class is equivalent to the bracket expression [0-9].

  • \w - It matches any alphanumeric character from the basic Latin alphabet, including the underscore _. This class is equivalent to the bracket expression [A-Za-z0-9_].

  • It is worth mentioning the difference between . and \.: the . matches any character except the newline character \n, but \. matches the symbol . itself.
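The classes above can be checked one character at a time. This is a small sketch assuming JavaScript; the sample inputs are arbitrary.

```javascript
// \d accepts digits only.
const digitResults = ["7", "a"].map((s) => /^\d$/.test(s));
console.log(digitResults); // [ true, false ]

// \w accepts letters, digits and the underscore, but not the hyphen.
const wordResults = ["a", "Z", "5", "_", "-"].map((s) => /^\w$/.test(s));
console.log(wordResults); // [ true, true, true, true, false ]

// . accepts almost anything; \. accepts only a literal dot.
const anyCharResults = ["x", ".", "\n"].map((s) => /^.$/.test(s));
const literalDotResults = ["x", "."].map((s) => /^\.$/.test(s));
console.log(anyCharResults);    // [ true, true, false ]
console.log(literalDotResults); // [ false, true ]
```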

The OR Operator

The OR operator (|) is not used in our case, but it is quite handy for listing alternatives. The bracket expression [aple] could be written as (a|p|l|e): both match a single character that is a, p, l, or e.

Therefore, with a + quantifier added to either form, "apple", "aple", and "ape" will all match.
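A sketch, assuming JavaScript, that compares a bracket expression with the equivalent OR form, each repeated with + and anchored so the whole word is checked. The names withBrackets, withOr, and knownEnding are made up for this example.

```javascript
const withBrackets = /^[aple]+$/;
const withOr = /^(a|p|l|e)+$/;

const words = ["apple", "aple", "ape", "grape"];
const comparison = words.map((word) => ({
  word,
  brackets: withBrackets.test(word),
  or: withOr.test(word),
}));
console.log(comparison);
// "apple", "aple" and "ape" match both forms;
// "grape" matches neither, because g and r are not among the alternatives.

// | can also separate whole words, which a bracket expression cannot express.
const knownEnding = /^(com|io|club)$/;
const endingResults = ["io", "oi"].map((s) => knownEnding.test(s));
console.log(endingResults); // [ true, false ]
```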

Flags

Flags are placed at the end of a regex, after the second slash, and they define additional functionality or limits for the regex. In our case, flags are not required, but there are three common ones.

  • g - Global search: the regex should be tested against all possible matches in a string.

  • i - Case-insensitive search: case should be ignored while attempting a match in a string.

  • m - Multi-line search: a multi-line input string should be treated as multiple lines.
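The effect of each flag can be observed on a small input. The sketch below assumes JavaScript; the sample text and variable names are invented for illustration.

```javascript
const text = "GitHub.com\ngithub.io";

// i: ignore case.
const caseSensitive = /github/.test("GitHub");
const caseInsensitive = /github/i.test("GitHub");
console.log(caseSensitive, caseInsensitive); // false true

// g: match() returns every match instead of only the first one.
const firstOnly = text.match(/github/i)[0];
const allMatches = text.match(/github/gi);
console.log(firstOnly);  // "GitHub"
console.log(allMatches); // [ "GitHub", "github" ]

// m: ^ and $ also match at the start and end of each line.
const singleLine = /^github\.io$/.test(text);
const multiLine = /^github\.io$/m.test(text);
console.log(singleLine, multiLine); // false true
```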

Character Escapes

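The backslash \ in a regex escapes a character that would otherwise be interpreted as a special character, so that it is matched literally.

For example, \. will search for . and \/ will search for /.

In a nutshell, most special characters, such as ., lose their special significance inside bracket expressions, so the \. inside our brackets could also be written as a plain dot. The backslash \ is an exception here: it keeps its meaning inside brackets, which is why \d and \w still work in [\da-z\.-] and [\/\w \.-].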
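A sketch, assuming JavaScript, of where escaping changes the result and where it does not. All names and sample strings are illustrative. It also shows the RegExp constructor, where the pattern is a string and each backslash therefore has to be doubled.

```javascript
// Outside brackets, an unescaped dot is a wildcard and an escaped dot is literal.
const wildcardDot = /^github.com$/.test("githubXcom");
const literalDot = /^github\.com$/.test("githubXcom");
console.log(wildcardDot, literalDot); // true false

// Inside brackets the dot is already literal, so escaping it changes nothing.
const plain = /^[a-z.]+$/;
const escaped = /^[a-z\.]+$/;
const sameBehaviour = ["abc.def", "abcXdef", "..."].every(
  (s) => plain.test(s) === escaped.test(s)
);
console.log(sameBehaviour); // true

// The backslash keeps its meaning inside brackets: \d still means "any digit".
const digitInBrackets = /^[\da-z]+$/.test("abc123");
console.log(digitInBrackets); // true

// With the RegExp constructor the pattern is a string, so \. is written as "\\.".
const fromString = new RegExp("^github\\.com$");
const constructorResults = ["github.com", "githubXcom"].map((s) => fromString.test(s));
console.log(constructorResults); // [ true, false ]
```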

Author

I am Freddie, the author of this tutorial. You can find my GitHub profile here: https://github.com/dark40
