@kwaters3
Last active October 26, 2023 01:55
URL Matching Regex

Understanding the URL Matching Regex

As a web developer, understanding Regex (regular expressions) is crucial for many tasks, including input validation and text processing. In this tutorial, we will explore a specific regex pattern used to match URLs, which can vary in structure.

A regex is a sequence of characters that defines a specific search pattern. It is useful for finding and manipulating text data.

Summary

We will examine the following regex:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Here's how this string breaks down (we'll explore it in more detail later):

  • The first part of the regex, ^(https?:\/\/)? matches the optional HTTP or HTTPS protocol.

    • The ? denotes that the protocol is optional.
  • The next part, the domain name, ([\da-z\.-]+) matches one or more lowercase letters, digits, dots, and hyphens.

  • Then, after a literal dot (\.), the Top-Level Domain, ([a-z\.]{2,6}) allows lowercase letters and dots, and must be 2 to 6 characters long.

  • Then, the Path, ([\/\w \.-]*)* matches slashes, word characters (letters, digits, and underscores), spaces, dots, and hyphens.

    • The * allows for multiple path segments.
  • Finally, the optional trailing slash, \/? allows URLs to end with or without a slash.
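
As a quick check of the breakdown above, the following JavaScript sketch runs the regex against a few sample strings. The sample inputs and the names urlRegex, samples, and results are illustrative assumptions, not part of the original gist.

```javascript
// The URL-matching regex from this tutorial, written as a JavaScript literal.
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs chosen for illustration.
const samples = [
  "https://www.example.com/docs/index.html", // protocol, domain, TLD, path
  "example.com",                             // no protocol, no path
  "http://example",                          // no dot followed by a TLD
  "ftp://example.com",                       // protocol other than http(s)
];

// test() returns true when the whole string fits the pattern.
const results = samples.map((s) => urlRegex.test(s));
samples.forEach((s, i) => console.log(s, "->", results[i]));
```

With these inputs, the first two strings match and the last two do not.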

Regex Components

In languages such as JavaScript, a regex can be written as a literal, in which case the pattern is wrapped in slash characters / (as shown below):

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
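
For comparison, JavaScript can also build a pattern from a string with the RegExp constructor. The wrapping slashes then disappear, and each backslash has to be doubled. This is only a sketch; the variable names and the sample input are assumptions made for illustration.

```javascript
// Regex literal: the pattern sits between two slashes.
const literal = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Same pattern built from a string: no wrapping slashes, and every
// backslash is doubled so that it survives string parsing.
const constructed = new RegExp(
  "^(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$"
);

// Hypothetical sample input.
const input = "https://example.com/about";
const literalResult = literal.test(input);
const constructedResult = constructed.test(input);
console.log(literalResult, constructedResult);
```

Both forms should accept or reject the same inputs; the literal is usually easier to read when the pattern is fixed.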

Now let's take a look at the components of our URL Matching regex.

Anchors

The characters ^ and $ are both considered to be anchors.

The ^ anchor signifies a string that begins with the characters that follow it.

  • ^ at the beginning of the regex specifies the start of the input (or the start of a line when the m flag is set). It ensures that the regex pattern must begin matching at the very start of the input string. It's a position anchor.

The $ anchor signifies a string that ends with the characters that precede it. Just like the ^ character, it can be preceded by an exact string or a range of possible matches.

  • $ at the end of the regex specifies the end of the input (or the end of a line when the m flag is set). It ensures that the regex pattern must finish matching at the very end of the input string. It's another position anchor.
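
To see what the anchors contribute, the sketch below compares the regex with a copy that has ^ and $ removed. The unanchored copy and the sample sentence are assumptions made purely for illustration.

```javascript
// The tutorial's regex, anchored at both ends.
const anchored = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical copy with ^ and $ removed (for illustration only).
const unanchored = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/;

// A sentence that contains a URL but is not itself a URL.
const text = "visit https://example.com today!";

const anchoredResult = anchored.test(text);     // the whole string must be a URL
const unanchoredResult = unanchored.test(text); // a URL anywhere is enough
console.log(anchoredResult, unanchoredResult);
```

The anchored version rejects the sentence, while the unanchored copy accepts it because a URL appears somewhere inside.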

Quantifiers

Quantifiers set the limits of the string that your regex matches (or an individual section of the string). They frequently include the minimum and maximum number of characters that your regex is looking for.

By default, quantifiers are greedy: they match as many occurrences of a pattern as possible. They include the following:

  • * Matches the pattern zero or more times

  • + Matches the pattern one or more times

  • ? Matches the pattern zero or one time

  • {} Curly brackets can provide three different ways to set limits for a match:

    • { n } Matches the pattern exactly n number of times

    • { n, } Matches the pattern at least n number of times

    • { n, x } Matches the pattern from a minimum of n number of times to a maximum of x number of times

Each of these quantifiers can be followed by the ? symbol to make it lazy, so that it matches as few occurrences as possible.
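
The difference between greedy and lazy matching can be sketched with a deliberately simple pattern. The string "aaaa" and the variable names are assumptions chosen to keep the example small.

```javascript
const text = "aaaa";

// Greedy quantifiers take as much as they can.
const star = text.match(/a*/)[0];           // zero or more
const range = text.match(/a{2,3}/)[0];      // two to three

// Adding ? makes the quantifier lazy: it takes as little as it can.
const plusLazy = text.match(/a+?/)[0];      // one or more, lazy
const rangeLazy = text.match(/a{2,3}?/)[0]; // two to three, lazy

// Exact and open-ended counts, anchored to the whole string.
const exact = /^a{4}$/.test(text);          // exactly four
const atLeast = /^a{5,}$/.test(text);       // at least five

console.log({ star, range, plusLazy, rangeLazy, exact, atLeast });
```

Here the greedy forms capture "aaaa" and "aaa", the lazy forms capture "a" and "aa", the exact count passes, and the "at least five" check fails.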

Let's look at how quantifiers are used in the URL matching regex:

  • ? after (https?:\/\/) makes the protocol (HTTP or HTTPS) optional. It means that the URL can start with "http://", "https://", or no protocol at all.

  • * after ([\/\w \.-]*) allows for multiple path segments. It means that the path part can contain zero or more occurrences of characters like slashes, alphanumeric characters, spaces, dots, and hyphens.

  • ? after \/ makes the trailing slash optional. The trailing slash may or may not be present at the end of the URL.

    • For example, in the regex (https?:\/\/)? the trailing ? makes "http://" or "https://" optional.

    • In ([\/\w \.-]*)* the * allows for zero or more path segments.

    • In \/? the ? makes the trailing slash optional.
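
The optional parts described in this list can be exercised one at a time. The URLs below are hypothetical samples, and the checks object is just a convenient way to label each case.

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Each entry toggles one optional part of the pattern.
const checks = {
  noProtocol: urlRegex.test("example.com"),                  // protocol omitted
  http: urlRegex.test("http://example.com"),                 // "s" omitted
  https: urlRegex.test("https://example.com"),               // "s" present
  trailingSlash: urlRegex.test("https://example.com/"),      // slash at the end
  deepPath: urlRegex.test("https://example.com/blog/post-1/"), // several segments
};

console.log(checks);
```

Every entry should come out true, since each variation only adds or drops a part that the quantifiers mark as optional or repeatable.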

Grouping Constructs

As regular expressions grow more complicated, you may check multiple parts of a string to determine that different sections fulfill different requirements. To break these sections up, you'll need to use grouping constructs.

The primary way you group a section of a regex is by using parentheses (). Each section within parentheses is known as a subexpression.

  • Parentheses ( ) are used to group specific parts of the regex for capturing. They have several important purposes, including:

    • Capturing a specific part of the input for later reference.

    • Applying quantifiers (e.g., *, +, ?) to a group of characters.

    • Creating sub-patterns within a larger pattern.

  • In the given regex, there are multiple groups, such as (https?:\/\/), ([\da-z\.-]+), ([a-z\.]{2,6}), and ([\/\w \.-]*).

    • These groups capture and separate different parts of the URL, like the protocol, domain name, top-level domain, and path.
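
In JavaScript, the captured groups can be read from the array returned by exec(). The sketch below assumes a sample URL and uses hypothetical variable names for the four groups.

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// exec() returns the full match at index 0, then one entry per group.
const match = urlRegex.exec("https://www.example.com/docs/index.html");
const [whole, protocol, domain, tld, path] = match;
console.log({ whole, protocol, domain, tld, path });

// Groups that did not take part in the match come back as undefined.
const bare = urlRegex.exec("example.com");
console.log(bare[1], bare[4]);
```

Note that the domain group holds "www.example" rather than "www.example.com", because the last label is captured separately by the top-level domain group.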

Bracket Expressions

Anything inside a set of square brackets [] represents a set of characters, any one of which we want to match. These patterns are known as bracket expressions, but they are also known as positive character groups, because they outline the characters we want to include.

  • [\da-z\.-] is a bracket expression used to match digits, lowercase letters, dots, and hyphens.

  • In this context, it's part of the domain name component. It means that the domain name can contain characters like letters, digits, dots, and hyphens.

  • For example, it will match domain names like "google.com" or "sub-domain.google.com".
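
The bracket expression can be tried on its own by anchoring it and repeating it with +. The domainChars name and the sample strings are assumptions for this sketch.

```javascript
// Only characters listed in the bracket expression are allowed.
const domainChars = /^[\da-z\.-]+$/;

// Strings made only of digits, lowercase letters, dots, and hyphens.
const accepted = ["google", "sub-domain.google", "web3"].map((s) =>
  domainChars.test(s)
);

// Each of these contains one character outside the set:
// an uppercase letter, an underscore, and a space.
const rejected = ["Google", "my_site", "a b"].map((s) => domainChars.test(s));

console.log(accepted, rejected);
```

The first list should be all true and the second all false, which also shows that the pattern as written does not accept uppercase letters.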

Character Classes

A character class in a regex defines a set of characters, any one of which can occur in an input string to fulfill a match. The bracket expressions outlined previously, including positive and negative character groups, are considered character classes.

Here are some of the other common character classes:

  • . Matches any character except the newline character \n.

  • \d Matches any digit from 0 to 9. In our regex, it appears inside ([\da-z\.-]+).

  • \w Matches any alphanumeric character from the basic Latin alphabet, including the underscore _. In our regex, it appears inside ([\/\w \.-]*).
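
Each of these classes matches exactly one character, which the following sketch checks with single-character inputs. The inputs are arbitrary samples chosen for illustration.

```javascript
// \d: digits only.
const digitOk = /^\d$/.test("7");
const digitNo = /^\d$/.test("a");

// \w: letters, digits, and underscore, but not the hyphen.
const wordOk = /^\w$/.test("_");
const wordNo = /^\w$/.test("-");

// . : any character except a newline.
const dotOk = /^.$/.test("x");
const dotNo = /^.$/.test("\n");

console.log({ digitOk, digitNo, wordOk, wordNo, dotOk, dotNo });
```

Because \w excludes the hyphen, the path group in our regex lists - separately inside its bracket expression.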

The OR Operator

A bracket expression does not require the string to contain every character listed in the pattern; it matches any one of them. For example, [\da-z\.-] matches a digit, a lowercase letter, or one of the two special characters.

The OR operator | works at a larger scale: it allows you to specify alternative patterns.

  • Our regex does not use the | symbol. Instead, https? relies on the ? quantifier to make the s optional, which has the same effect as the alternation (http|https).

    • For example, in this context, https? matches URLs that start with either "http://" or "https://".
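
A short sketch can compare the ? quantifier form with an explicit alternation written with |. Both patterns below are simplified fragments made up for this comparison, not the full URL regex.

```javascript
// Optional "s" expressed with the ? quantifier.
const withQuantifier = /^https?:\/\/$/;

// The same choice expressed with the OR operator.
const withAlternation = /^(http|https):\/\/$/;

const inputs = ["http://", "https://", "ftp://"];
const quantifierResults = inputs.map((s) => withQuantifier.test(s));
const alternationResults = inputs.map((s) => withAlternation.test(s));

console.log(quantifierResults, alternationResults);
```

Both patterns should accept "http://" and "https://" and reject "ftp://". The | operator becomes more useful when the alternatives differ by more than one optional character, for example (http|ftp).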

Flags

Flags are additional options that can be added to the end of a regex to modify its behavior.

Some common flags include:

  • i for case-insensitive matching: letter case is ignored while attempting a match in a string.

  • g for global search: the regex is tested against all possible matches in a string, not just the first.

  • m for multi-line matching: the input string is treated as multiple lines, so ^ and $ match at the start and end of each line.

Flags would come after the closing / and they define additional functionality or limits for the regex.

No flags are specified in this regex.
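
The effect of each flag can be sketched as follows. Adding i to the URL regex is an assumption made for illustration (the original has no flags), and the short domainLike pattern is a hypothetical stand-in used to keep the g and m examples readable.

```javascript
const caseSensitive = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
// Hypothetical variant with the i flag added.
const caseInsensitive = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;

// i: uppercase input only matches once case is ignored.
const shout = "HTTPS://EXAMPLE.COM";
const strictResult = caseSensitive.test(shout);
const relaxedResult = caseInsensitive.test(shout);

// g: match() returns every occurrence instead of only the first.
const firstOnly = "a.com b.org".match(/[a-z]+\.[a-z]{2,6}/)[0];
const allMatches = "a.com b.org".match(/[a-z]+\.[a-z]{2,6}/g);

// m: ^ and $ also match at line breaks.
const singleLine = /^b\.org$/.test("a.com\nb.org");
const multiLine = /^b\.org$/m.test("a.com\nb.org");

console.log({ strictResult, relaxedResult, firstOnly, allMatches, singleLine, multiLine });
```

Since the original regex has no i flag and only lists a-z, it rejects URLs typed in uppercase unless the input is lowercased first or the flag is added.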

Character Escapes

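The backslash \ in a regex escapes a character that would otherwise have a special meaning, so that it is interpreted literally.

Character escapes are used to match specific characters with special meanings in regex.

This regex uses two character escapes: \. matches a literal dot rather than any character, and \/ matches a literal slash without ending the regex literal. Inside a bracket expression the dot is already literal, so escaping it there is harmless but not required. The backslash also introduces the shorthand classes \d and \w.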
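
The effect of escaping the dot and the slash can be sketched with two small patterns. These are simplified fragments written for this example, and the inputs are hypothetical.

```javascript
// Unescaped dot: matches any single character between the two parts.
const looseDot = /^[a-z]+.[a-z]{2,6}$/;
// Escaped dot: only a real "." is accepted.
const literalDot = /^[a-z]+\.[a-z]{2,6}$/;

const looseResult = looseDot.test("example_com");    // "_" slips through
const literalResult = literalDot.test("example_com"); // "_" is rejected
const dottedResult = literalDot.test("example.com");

// Escaped slash: \/ is a plain "/" that does not close the literal.
const slashResult = /^\/docs$/.test("/docs");

console.log({ looseResult, literalResult, dottedResult, slashResult });
```

Without the backslash, the separator between the domain name and the top-level domain would accept any character, not just a dot.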

Author

In this tutorial, we've explored the components of the URL matching regex and each part's role in capturing and validating URLs. Regexes are powerful tools, and mastering them is a valuable skill for any web developer. Continue to learn more complex regex patterns and enhance your journey in web development!

Author: Katie Waters
If you have any questions, please email me at: knickler3@gmail.com
My GitHub page is: kwaters3
