@MaryVPie
Last active September 14, 2021 05:38

Regex Tutorial. Matching a URL.

A regular expression (regex or regexp) is a pattern, written in a standard textual syntax, that describes the text to match. One common use of regexps is matching and searching for URLs. Suppose, for example, that you need to check whether an article contains URL references.

Summary

This guide breaks down the components of an example regex used to match a URL:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Using this regex we can check whether a string is a URL. For example, it matches each of the following:

https://www.bbc.co.uk/bitesize/guides/zp73wmn/revision/1
https://www.newsweek.com/california-recall-risks-becoming-another-disputed-election-larry-elder-talks-voter-fraud-1628290
https://www.wta.org/go-outside
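
As a quick check, the pattern can be run against the sample URLs with JavaScript's RegExp test method. This is a minimal sketch: the variable names and the final counter-example string are assumptions added for illustration and are not part of the original guide.

```javascript
// The pattern from this guide, written as a JavaScript regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

const samples = [
  "https://www.bbc.co.uk/bitesize/guides/zp73wmn/revision/1",
  "https://www.newsweek.com/california-recall-risks-becoming-another-disputed-election-larry-elder-talks-voter-fraud-1628290",
  "https://www.wta.org/go-outside",
  "not a url", // hypothetical counter-example: it contains no dot at all
];

// test() returns true when the whole string satisfies the pattern.
const results = samples.map((sample) => urlPattern.test(sample));

console.log(results); // [ true, true, true, false ]
```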

Here is a step-by-step description of how this regex identifies a URL:

  1. Asserts the start of the string with the ^ anchor.
  2. Looks for the 1st capturing group (https?:\/\/)?, the protocol. Because of the ? quantifier after the group, it may appear 0 or 1 times.
  3. Looks for the 2nd capturing group ([\da-z\.-]+), where + means one or more characters from the bracket expression: a digit (the \d token), a lowercase letter in the range a-z, a dot (\.), or a hyphen (-).
  4. Then it looks for a literal dot \.
  5. Then it looks for the top-level domain, which is the 3rd capturing group ([a-z\.]{2,6}): 2 to 6 characters, each a lowercase letter or a dot.
  6. Then it looks for the 4th capturing group ([\/\w \.-]*)*, which is essentially the path of the URL. In the samples below, the path is everything after the domain:
  https://www.bbc.co.uk/bitesize/guides/zp73wmn/revision/1 (path: /bitesize/guides/zp73wmn/revision/1)
  https://www.newsweek.com/california-recall-risks-becoming-another-disputed-election-larry-elder-talks-voter-fraud-1628290 (path: /california-recall-risks-becoming-another-disputed-election-larry-elder-talks-voter-fraud-1628290)
  https://www.wta.org/go-outside (path: /go-outside)
  7. Allows an optional trailing slash with \/?
  8. Asserts the end of the string with the $ anchor.
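
The steps above can be sketched in JavaScript by assembling the pattern from one piece per step and then reading back what each capturing group matched. The piece names (protocol, host, and so on) are labels chosen for this illustration, not terminology from the regex itself.

```javascript
// One small regex per step; .source gives the pattern text of each piece.
const pieces = [
  /(https?:\/\/)?/,   // optional protocol
  /([\da-z\.-]+)/,    // host name characters, one or more
  /\./,               // literal dot
  /([a-z\.]{2,6})/,   // top-level domain, 2 to 6 characters
  /([\/\w \.-]*)*/,   // path
  /\/?/,              // optional trailing slash
];

// Wrap the joined pieces in the ^ and $ anchors.
const assembled = new RegExp("^" + pieces.map((p) => p.source).join("") + "$");

// exec() returns the whole match followed by the four captured groups.
const match = assembled.exec("https://www.wta.org/go-outside");
const [whole, protocol, host, tld, path] = match;

console.log({ protocol, host, tld, path });
// { protocol: 'https://', host: 'www.wta', tld: 'org', path: '/go-outside' }
```

Note that the 2nd group stops at the last dot before the top-level domain, so it holds www.wta rather than the full host name.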

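Table of Contents

  • Anchors
  • Quantifiers
  • Grouping Constructs
  • Bracket Expressions
  • Character Classes
  • Character Escapes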

Regex Components

Anchors

Anchors specify a position in the string where a match must occur. The regex engine looks for a match at the specified position only. To test text against a regex, you write the pattern in the syntax of your programming language. For example, in JavaScript a regex literal is wrapped in forward slashes /.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Anchor   Description
^        By default, the match must occur at the beginning of the string.
$        By default, the match must occur at the end of the string.
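
The effect of the anchors can be seen in a short JavaScript sketch. The three small patterns at the top are illustrative assumptions that isolate each anchor; the test strings are also made up for this example.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Small patterns (not from the guide) that isolate each anchor.
const startsWithWww = /^www/.test("www.wta.org"); // true: "www" is at the start
const startsWithWta = /^wta/.test("www.wta.org"); // false: "wta" is not at the start
const endsWithOrg = /org$/.test("www.wta.org");   // true: "org" is at the end

// With both anchors in place, the whole string has to be a URL.
const exact = urlPattern.test("https://www.wta.org/go-outside");           // true
const leadingText = urlPattern.test("See https://www.wta.org/go-outside"); // false
const trailingText = urlPattern.test("www.wta.org?");                      // false

console.log({ startsWithWww, startsWithWta, endsWithOrg, exact, leadingText, trailingText });
```

Because of ^ and $, this pattern validates a string that is supposed to be a URL and nothing else; it will not find a URL embedded in a longer sentence.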

Quantifiers

Quantifiers indicate how many times the preceding character, bracket expression, or group must occur for a match. Each quantifier sets the minimum and maximum number of repetitions that the regex is looking for.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Quantifier   Description
*            Matches zero or more times.
+            Matches one or more times.
?            Matches zero or one time.
{2,6}        Matches from 2 to 6 times.
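
Each quantifier can be tried in isolation. The sketch below lifts the quantified pieces out of the URL pattern and wraps them in ^ and $ so that the repetition limits are visible; the test strings are assumptions chosen for illustration.

```javascript
// ? : the "s" in https? may appear zero or one time.
const optionalS = [/^https?$/.test("http"), /^https?$/.test("https"), /^https?$/.test("httpss")];

// + : at least one character from the bracket expression is required.
const oneOrMore = [/^[\da-z\.-]+$/.test("www"), /^[\da-z\.-]+$/.test("")];

// * : zero characters is also acceptable.
const zeroOrMore = [/^[\/\w \.-]*$/.test("/go-outside"), /^[\/\w \.-]*$/.test("")];

// {2,6} : between 2 and 6 characters, inclusive.
const twoToSix = [/^[a-z\.]{2,6}$/.test("a"), /^[a-z\.]{2,6}$/.test("org"), /^[a-z\.]{2,6}$/.test("museums")];

console.log(optionalS);  // [ true, true, false ]
console.log(oneOrMore);  // [ true, false ]
console.log(zeroOrMore); // [ true, true ]
console.log(twoToSix);   // [ false, true, false ]
```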

Grouping Constructs

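Grouping constructs use parentheses () to treat part of a pattern as a single unit. Each section within parentheses is known as a subexpression, and the text it matches is captured so that it can be read back after a match.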

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

For example, the following grouping constructs represent different parts of a URL.

Grouping Construct   Description
(https?:\/\/)        The protocol, http:// or https:// (made optional by the ? that follows the group).
([\da-z\.-]+)        The host name up to the last dot before the top-level domain.
([a-z\.]{2,6})       The top-level domain.
([\/\w \.-]*)        The rest of the URL (the path).
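
The captured groups can be inspected with exec, which returns the whole match at index 0 and the groups at indexes 1 to 4. This is a sketch with made-up variable names; the second input is a hypothetical URL written without a protocol to show what an unmatched optional group looks like.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// All four groups take part in this match.
const bbc = urlPattern.exec("https://www.bbc.co.uk/bitesize");
console.log(bbc.slice(1)); // [ 'https://', 'www.bbc.co', 'uk', '/bitesize' ]

// Without a protocol, the optional 1st group is left undefined.
const bare = urlPattern.exec("www.wta.org/go-outside");
console.log(bare[1]);                   // undefined
console.log(bare[2], bare[3], bare[4]); // www.wta org /go-outside
```

In the first result the 3rd group holds only uk, not co.uk: the 2nd group is greedy and gives back just enough characters for the rest of the pattern to match.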

Bracket Expressions

A bracket expression is a list of characters and/or character classes enclosed in brackets []. It matches a single character that is either listed explicitly or falls within a listed range.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Bracket Expression [\da-z\.-]   Description
\d                              Matches a digit (equivalent to [0-9]).
a-z                             Matches a single character in the range between a and z (lowercase).
\.                              Matches a literal dot (.).
-                               Matches a literal hyphen (-).

Bracket Expression [a-z\.]      Description
a-z                             Matches a single character in the range between a and z (lowercase).
\.                              Matches a literal dot (.).

Bracket Expression [\/\w \.-]   Description
\/                              Matches a literal forward slash (/).
\w                              Matches any word character (equivalent to [a-zA-Z0-9_]).
(space)                         Matches a literal space.
\.                              Matches a literal dot (.).
-                               Matches a literal hyphen (-).
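
A bracket expression always consumes exactly one character, which a short JavaScript sketch can make concrete. The single-character test strings are assumptions for illustration, and the case-insensitive variant at the end is a hypothetical modification (adding the i flag), not part of the original pattern.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Host characters: digits, lowercase letters, dot, hyphen.
const hostChar = /^[\da-z\.-]$/;
const hostChecks = ["7", "w", "-", "W", "_"].map((c) => hostChar.test(c));
console.log(hostChecks); // [ true, true, true, false, false ]

// Path characters: slash, word characters, space, dot, hyphen.
const pathChar = /^[\/\w \.-]$/;
const pathChecks = ["_", "W", " ", "?"].map((c) => pathChar.test(c));
console.log(pathChecks); // [ true, true, true, false ]

// Because a-z is lowercase only, an uppercase host name is rejected...
const upperHost = urlPattern.test("https://WWW.WTA.ORG");
// ...unless the pattern is rebuilt with the case-insensitive i flag (an assumption).
const upperHostIgnoreCase = new RegExp(urlPattern.source, "i").test("https://WWW.WTA.ORG");
console.log(upperHost, upperHostIgnoreCase); // false true
```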

Character Classes

A character class is a special notation that matches any single character from a certain set. There are several kinds of character classes:

  1. Basic classes. For example, the digit class is written as \d and corresponds to any single digit.
Class   Description
\d      "d" is from "digit". A character from 0 to 9.
\s      "s" is from "space". Includes spaces, tabs \t, newlines \n, and a few other rare characters, such as \v, \f, and \r.
\w      "w" is from "word". Either a letter of the Latin alphabet, a digit, or an underscore _. Non-Latin letters (like Cyrillic or Hindi) do not belong to \w.
  2. Inverse classes. Denoted with the same letter, but uppercased.
Inverse class   Description
\D              Non-digit: any character except \d, for instance a letter.
\S              Non-space: any character except \s, for instance a letter.
\W              Non-word character: anything but \w, e.g. a non-Latin letter or a space.
  3. Special class. The dot matches any character except a newline.
Special class   Description
.               Means "any character", but not "the absence of a character". There must be a character to match it.
space itself    A space in a pattern is an ordinary character that matches a literal space. The strings 1-5 and 1 - 5 look nearly identical, but a regexp that does not take the spaces into account will fail to match the second one.

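In our example we use two character classes: the digit class \d and the word class \w. Every dot in the pattern is escaped as \., so it matches a literal dot rather than "any character".

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/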
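
The classes described in this section can be checked one at a time in JavaScript. This is a small sketch; all the test strings are assumptions chosen for illustration.

```javascript
// Basic classes.
const digit = /\d/.test("7");          // true
const space = /\s/.test("\t");         // true: a tab counts as whitespace
const word = /\w/.test("_");           // true: underscore is a word character
const cyrillic = /\w/.test("\u044f");  // false: the Cyrillic letter "ya" is not in \w

// Inverse classes.
const nonDigit = /\D/.test("a");       // true
const nonSpace = /\S/.test(" ");       // false: the only character is a space
const nonWord = /\W/.test(" ");        // true

// The dot needs exactly one character, and that character cannot be a newline.
const dotChecks = [/^.$/.test("a"), /^.$/.test("\n"), /^.$/.test("")];

// A space in the pattern is significant.
const spaceChecks = [/\d-\d/.test("1-5"), /\d-\d/.test("1 - 5"), /\d - \d/.test("1 - 5")];

console.log({ digit, space, word, cyrillic, nonDigit, nonSpace, nonWord });
console.log(dotChecks);   // [ true, false, false ]
console.log(spaceChecks); // [ true, false, true ]
```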

Character Escapes

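To match a special character literally, escape it with a backslash \. For example, the open curly brace { is used to begin a quantifier, but adding a backslash before it, \{, means that the regex should look for the open curly brace character instead of beginning to define a quantifier. This is common when the text you are looking for contains characters that are also regex syntax. In our example, the escaped characters are the forward slash \/, which would otherwise end the regex literal in JavaScript, and the dot \., which would otherwise match any character.

It's important to note that in JavaScript most special characters, such as . * + ? ( ) and {, lose their special meaning inside bracket expressions, so [.] matches only a literal dot. The backslash \ is an exception: it still escapes the next character and still introduces classes such as \d and \w, which is why [\da-z\.-] works. The closing bracket ], a leading ^, and a - placed between two characters also keep a special meaning inside brackets.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/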
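
The difference an escape makes can be sketched in JavaScript as follows. The short patterns and test strings are assumptions made for illustration; they describe JavaScript's behavior specifically, and other regex flavors may treat backslashes inside brackets differently.

```javascript
// An unescaped dot matches any character; an escaped dot matches only a dot.
const unescapedDot = /^www.wta$/.test("wwwXwta");   // true
const escapedDot = /^www\.wta$/.test("wwwXwta");    // false
const escapedDotOk = /^www\.wta$/.test("www.wta");  // true

// Inside brackets the dot is already literal, so escaping it is optional.
const bracketDot = [/^[.]$/.test("."), /^[.]$/.test("a"), /^[\.]$/.test(".")];

// The backslash keeps its meaning inside brackets: \d is still the digit class.
const bracketDigit = [/^[\d]$/.test("5"), /^[\d]$/.test("d")];

// In a regex literal the slash must be escaped; in a RegExp built from a
// string there is no closing delimiter, so a plain slash is fine.
const fromString = new RegExp("^https?://").test("https://www.wta.org");

console.log({ unescapedDot, escapedDot, escapedDotOk, fromString });
console.log(bracketDot);   // [ true, false, true ]
console.log(bracketDigit); // [ true, false ]
```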

Author

Hello everyone! My name is Mariia Pirogova. I am a student in the University of Washington full-stack coding bootcamp, growing my skills and looking forward to building a career in web development. To check out my work and progress on GitHub, click here. Thanks for stopping by.
