This is a Regex (Regular Expression) tutorial demonstrating how to match a URL.
A URL (Uniform Resource Locator), or link, is widely used in daily life. People use URLs to browse the web, while developers use them for routing. This tutorial explains in detail how to use a regex to match a URL.
Here is a peek at what is covered in this tutorial.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
This is a search pattern meant for URL validation. That is, it checks whether a string fulfills the requirements for a URL. The tutorial covers the following topics:
- Anchors
- Quantifiers
- Grouping Constructs
- Bracket Expressions
- Character Classes
- The OR Operator
- Flags
- Character Escapes
A regex is regarded as a literal, so it must be wrapped in slash characters (/).
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
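As a quick check, the literal above can be tried out with the test method. This is a minimal sketch that assumes a JavaScript environment (the slash-delimited literal syntax suggests one); the names urlPattern and isUrl and the sample strings are made up for illustration.

```javascript
// The URL pattern from this tutorial, written as a regex literal between slashes.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical helper: returns true when the whole string matches the pattern.
function isUrl(text) {
  return urlPattern.test(text);
}

// Sample inputs chosen for illustration only.
for (const sample of ["https://github.com/dark40", "github.com", "not a url"]) {
  console.log(sample, "->", isUrl(sample));
}
```

The first two samples satisfy the pattern, while the last one does not because it contains no dot-separated domain.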
The characters ^ and $ are both regarded as anchors.
The ^ anchor signifies that the match must begin at the start of the string. In our regex, ^(https?:\/\/) means the string begins with http:// or https://. The ? mark will be explained later.
The $ anchor signifies that the match must finish at the end of the string.
So in our "Matching URL" regex, the whole string, from start to end, must fit the pattern (https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?
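The effect of the two anchors can be seen by comparing anchored and unanchored patterns. The sketch below assumes JavaScript; the smaller patterns and the sample strings are invented for illustration and are not part of the tutorial's regex.

```javascript
// Without an anchor, the pattern may match anywhere in the string.
const anywhere = /https?:\/\//;
// With ^, the match must begin at the start of the string.
const atStart = /^https?:\/\//;
// With $, the match must finish at the end of the string.
const atEnd = /\.com$/;

console.log(anywhere.test("see http://a.com")); // true
console.log(atStart.test("see http://a.com"));  // false
console.log(atStart.test("http://a.com"));      // true
console.log(atEnd.test("http://a.com"));        // true
console.log(atEnd.test("http://a.com/page"));   // false
```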
Quantifiers set the limits of the string that your regex matches (or an individual section of the string). They frequently include the minimum and maximum number of characters that your regex is looking for.
- {2,6} matches between 2 and 6 characters. More specifically, curly brackets provide three different ways to set limits for a match:
  - {n} - matches the pattern exactly n times
  - {n,} - matches the pattern at least n times
  - {n,x} - matches the pattern at least n times and at most x times
- * - matches the pattern zero or more times
- + - matches the pattern one or more times
- ? - matches the pattern zero or one time
In our case, (https?:\/\/)? means "http://" or "https://" may appear once or not appear at all.
[\da-z\.-]+ could be something like "boot-camp.github".
[a-z\.]{2,6} matches endings such as "com", "io", or "club".
[\/\w \.-]* matches a path such as "/api/home/dashboard".
\/? means there may or may not be a / at the end.
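Each quantifier described above can be checked in isolation. This is a small sketch assuming JavaScript; the patterns and inputs are illustrative, and only [a-z\.]{2,6} is taken from the tutorial's regex.

```javascript
const exactlyTwo = /^a{2}$/;        // {n}: exactly 2 times
const atLeastTwo = /^a{2,}$/;       // {n,}: 2 or more times
const twoToSix = /^[a-z\.]{2,6}$/;  // {n,x}: between 2 and 6 times
const zeroOrMore = /^ab*$/;         // *: zero or more "b"
const oneOrMore = /^ab+$/;          // +: one or more "b"
const optionalS = /^https?$/;       // ?: the "s" may appear once or not at all

console.log(exactlyTwo.test("aa"), exactlyTwo.test("aaa"));     // true false
console.log(atLeastTwo.test("aaaa"), atLeastTwo.test("a"));     // true false
console.log(twoToSix.test("io"), twoToSix.test("museums"));     // true false
console.log(zeroOrMore.test("a"), oneOrMore.test("a"));         // true false
console.log(optionalS.test("http"), optionalS.test("https"));   // true true
```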
The primary way you group a section of a regex is by using parentheses (). Each section within parentheses is known as a subexpression.
In our case, we have four groups: (https?:\/\/), ([\da-z\.-]+), ([a-z\.]{2,6}), and ([\/\w \.-]*).
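One practical use of groups is that each parenthesized subexpression can be read back after a match. The sketch below assumes JavaScript and its String.prototype.match method; the variable names and sample URLs are chosen for illustration.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match[0] is the full match; match[1] onward hold the groups in order.
const match = "https://github.com/dark40".match(urlPattern);
const [whole, scheme, host, tld] = match;
console.log(scheme); // "https://"
console.log(host);   // "github"
console.log(tld);    // "com"

// When the optional scheme group does not take part in the match,
// its entry is undefined.
const noScheme = "github.com".match(urlPattern);
console.log(noScheme[1]); // undefined
```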
Anything inside a set of square brackets [] represents a set of characters that we want to match. The bracket expression matches any one character from that set.
In our example,
- [a-z\.] matches any lowercase letter or the . symbol, so repeating it can match "yahoo." or "abc.".
- [\da-z\.-] matches any Arabic numeral digit, any lowercase letter, and the symbols . and -, so repeating it can match "coding-boot-camp.".
- [\/\w \.-] matches any alphanumeric character from the basic Latin alphabet, the underscore _, and the symbols /, space, . and -, so repeating it can match "/regex-tutorial".
You may have noticed a lot of \. inside the expression; these are covered under Character Escapes.
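To see which single characters a bracket expression accepts, each one can be tested on its own. This sketch assumes JavaScript; the names hostChar and pathChar are hypothetical labels for two of the bracket expressions above.

```javascript
// Anchored so that exactly one character from the set must make up the string.
const hostChar = /^[\da-z\.-]$/;
const pathChar = /^[\/\w \.-]$/;

// Digits, lowercase letters, "." and "-" are in the first set.
console.log(["a", "7", "-", "."].map((c) => hostChar.test(c))); // [true, true, true, true]
// Uppercase letters and "_" are not.
console.log(["A", "_"].map((c) => hostChar.test(c)));           // [false, false]

// The second set adds "/", "_", uppercase letters, and the space.
console.log(["/", "_", "A", " "].map((c) => pathChar.test(c))); // [true, true, true, true]
console.log(pathChar.test("?"));                                // false
```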
A character class in regex defines a set of characters. In our case, the following are used:
- \d - matches any Arabic numeral digit. This class is equivalent to the bracket expression [0-9].
- \w - matches any alphanumeric character from the basic Latin alphabet, including the underscore _. This class is equivalent to the bracket expression [A-Za-z0-9_].
It is also worth mentioning the difference between . and \. here: an unescaped . matches any character except a line terminator such as the newline character \n, but \. matches the symbol . itself.
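The classes above, and the difference between . and \., can be confirmed with a few one-line checks. This is a sketch assuming JavaScript; the patterns and inputs are illustrative only.

```javascript
const digit = /^\d$/;       // same set as [0-9]
const wordChar = /^\w$/;    // same set as [A-Za-z0-9_]
const anyChar = /^a.c$/;    // unescaped dot: any character except a line terminator
const literalDot = /^a\.c$/; // escaped dot: only "."

console.log(digit.test("5"), digit.test("x"));                  // true false
console.log(wordChar.test("_"), wordChar.test("-"));            // true false
console.log(anyChar.test("abc"), anyChar.test("a\nc"));         // true false
console.log(literalDot.test("a.c"), literalDot.test("abc"));    // true false
```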
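The OR operator (|) is not used in our case, but it is quite handy for listing alternatives whose order is not important. The expression [apple] could be written as (a|p|l|e); both forms match exactly one of the characters a, p, l, or e.
Therefore, "apple", "aple", and "ape" all contain a match. Unlike a bracket expression, the alternatives can be longer than one character: (http|https) accepts the same strings as https?.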
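The equivalence between the bracket form and the OR form can be checked side by side. In this sketch (JavaScript assumed), a + quantifier and anchors are added so that whole words are tested; those additions and the sample words are choices made for illustration.

```javascript
const bracketForm = /^[apple]+$/;
const orForm = /^(a|p|l|e)+$/;

// Both forms accept words built only from the letters a, p, l, and e.
for (const word of ["apple", "aple", "ape", "apply"]) {
  console.log(word, bracketForm.test(word), orForm.test(word));
}

// Alternatives may be longer than one character.
const schemeOr = /^(https|http):\/\/$/;
const schemeOptional = /^https?:\/\/$/;
console.log(schemeOr.test("https://"), schemeOptional.test("https://")); // true true
console.log(schemeOr.test("ftp://"), schemeOptional.test("ftp://"));     // false false
```

Only "apply" is rejected, by both forms, because y is not among the listed characters.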
Flags are placed at the end of a regex, after the second slash, and they define additional functionality or limits for the regex. In our case, flags are not required, but there are three common ones:
- g - Global search: the regex should be tested against all possible matches in a string.
- i - Case-insensitive search: case should be ignored while attempting a match in a string.
- m - Multi-line search: a multi-line input string should be treated as multiple lines, so ^ and $ also match at the start and end of each line.
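The three flags can be compared against their flagless counterparts. This is a minimal sketch assuming JavaScript; the short patterns and sample strings are invented for illustration and are simpler than the tutorial's URL regex.

```javascript
// g: String.prototype.match returns every match instead of only the first.
const text = "Visit github.com or gitlab.com";
const allMatches = text.match(/[a-z]+\.com/g);
const firstMatch = text.match(/[a-z]+\.com/)[0];
console.log(allMatches); // ["github.com", "gitlab.com"]
console.log(firstMatch); // "github.com"

// i: letter case is ignored.
console.log(/^[a-z]+\.com$/.test("GitHub.com"));  // false
console.log(/^[a-z]+\.com$/i.test("GitHub.com")); // true

// m: ^ and $ also match at line boundaries inside the string.
console.log(/^github\.com$/.test("notes\ngithub.com"));  // false
console.log(/^github\.com$/m.test("notes\ngithub.com")); // true
```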
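The backslash \ in a regex escapes a character that would otherwise have a special meaning, so that the character is interpreted literally.
For example, \. matches a literal . and \/ matches a literal /. The / needs escaping because it would otherwise end the regex literal.
In a nutshell, most special characters, such as . and *, lose their special significance inside bracket expressions, so the backslash in [a-z\.] is optional. The backslash \ itself is an exception: it still works as an escape inside brackets, which is why \d in [\da-z\.-] still means a digit.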
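The behavior of escapes, both outside and inside bracket expressions, can be checked directly. The sketch below assumes JavaScript regex semantics; the patterns and inputs are made up for illustration.

```javascript
// Outside brackets, an unescaped dot matches any character.
const looseDot = /^github.com$/;
const strictDot = /^github\.com$/;
console.log(looseDot.test("github-com"));  // true
console.log(strictDot.test("github-com")); // false
console.log(strictDot.test("github.com")); // true

// Inside brackets, the dot is literal with or without the backslash.
const plain = /^[a-z.]+$/;
const escaped = /^[a-z\.]+$/;
console.log(plain.test("abc."), escaped.test("abc.")); // true true
console.log(plain.test("abc!"), escaped.test("abc!")); // false false

// The backslash keeps its meaning inside brackets: \d is a digit, not "d".
const digitInBrackets = /^[\d]$/;
console.log(digitInBrackets.test("5"), digitInBrackets.test("d")); // true false
```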
The author is Freddie. You can find my GitHub profile here: https://github.com/dark40