Regular expressions, or regex, can be a daunting challenge for new coders. Instead of words that we are familiar with outside of programming, here we see an assortment of seemingly random characters. But these characters do have a purpose and, even better, follow a specific pattern. Once this pattern is understood, regular expressions are a useful tool in any programmer's arsenal.
In order to demonstrate how to understand a regex, we'll be using one that is designed to match an email:
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
The basic function of this regex is that you are matching a string divided into three sections with two mandatory character dividers:
- The first section can contain any lowercase letter between a-z; any number between 0-9; or an underscore, hyphen, or period--matching at least one
- Between the first and second section, there must be an @
- The second section can contain any digit, any lowercase letter between a-z, a period, or a hyphen--matching at least one
- Between the second and third section, there must be a period
- The third section can contain any lowercase letter between a-z or a period, and must be between 2 and 6 characters long
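The description above can be checked with a quick sketch. The sample strings are hypothetical inputs chosen only for illustration; they are not from any real data set.

```javascript
// The email regex from this tutorial, written in literal notation.
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "jane_doe@example.com",      // all three sections and both dividers present
  "jane.doe@mail.example.org", // periods are allowed in the first two sections
  "janedoe.example.com",       // no @ divider
  "jane@example",              // no period followed by a third section
  "Jane@example.com",          // uppercase letter in the first section
];

// test() returns true when the whole string fits the pattern.
const results = samples.map((sample) => emailRegex.test(sample));

samples.forEach((sample, i) => console.log(sample, "->", results[i]));
// jane_doe@example.com -> true
// jane.doe@mail.example.org -> true
// janedoe.example.com -> false
// jane@example -> false
// Jane@example.com -> false
```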
In the following tutorial, we'll go over each component of the regex to better understand the construction.
Before we dive into the other components, a note on how this regex is created. In JavaScript, you can create a regex in two ways: the literal notation or the built-in RegExp constructor. In our example, we are using the literal notation, which encloses the pattern between slashes. Thus, the first and last / are the start and stop of our regular expression.
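As a minimal sketch of the two notations, the snippet below builds the same pattern both ways. The variable names are illustrative only. Note that the constructor takes a string, so each backslash has to be doubled.

```javascript
// Literal notation: the pattern sits between two slashes.
const literal = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Constructor notation: the pattern is passed as a string, so every
// backslash must be written twice to survive string escaping.
const constructed = new RegExp(
  "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,6})$"
);

// Both objects hold the same pattern text.
const sameSource = literal.source === constructed.source;
console.log(sameSource); // true

// And both behave the same way on a (hypothetical) sample address.
const constructedMatches = constructed.test("user@example.com");
console.log(constructedMatches); // true
```

The literal form is usually easier to read when the pattern is fixed; the constructor is mainly useful when the pattern has to be assembled from strings at runtime.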
Both ^ and $ are anchor components. The ^ indicates the beginning of a string to match, and the $ indicates the end. How the string is matched depends on whether the ^ is followed by a bracket expression (enclosed in []) or not. If there is no bracket expression, the characters after the ^ must appear exactly as written at the start of the string. For example, if our regex was ^good, then the string "good" or "good job" would match, but not "Good", because regex is case-sensitive by default. Using bracket notation, we denote a set of possible matches. This is explained further in the section below.
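A small sketch of the anchors in action, using the ^good example. The extra patterns (^good$ and the i flag) are additions for illustration, not part of the email regex.

```javascript
// ^ pins the match to the start of the string.
const startsWithGood = /^good/;
console.log(startsWithGood.test("good"));      // true
console.log(startsWithGood.test("good job"));  // true
console.log(startsWithGood.test("Good"));      // false: case-sensitive by default
console.log(startsWithGood.test("very good")); // false: "good" is not at the start

// Adding $ pins the end as well, so nothing may follow "good".
const exactlyGood = /^good$/;
console.log(exactlyGood.test("good"));     // true
console.log(exactlyGood.test("good job")); // false

// The i flag switches off case sensitivity.
const anyCaseGood = /^good/i;
console.log(anyCaseGood.test("Good")); // true
```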
As mentioned above, bracket notation ([]) will denote a set of characters to be matched. Using our previous example, if we had ^[good], our regex will match "good", but also "dog" or "ood", because each of them begins with one of the characters contained in our bracket notation. Without the ^ anchor, [good] would even match "antidisestablishmentarianism", because that word contains a "d" somewhere inside it. We have three different sets of bracket notations in our regex. What they mean exactly will be discussed in the character classes section further on.
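To see how a bracket expression behaves with and without the ^ anchor, here is a short sketch. Note that with the anchor in place only the first character of the string is checked against the set.

```javascript
// A bracket expression matches ONE character from the set.
// [good] is the same set as [god]; the repeated "o" adds nothing.
const anyOfSet = /[good]/;
console.log(anyOfSet.test("antidisestablishmentarianism")); // true: it contains a "d"
console.log(anyOfSet.test("bunny"));                        // false: no g, o, or d

// With ^, the character from the set must be the first one in the string.
const firstCharInSet = /^[good]/;
console.log(firstCharInSet.test("good")); // true
console.log(firstCharInSet.test("dog"));  // true
console.log(firstCharInSet.test("ood"));  // true
console.log(firstCharInSet.test("antidisestablishmentarianism")); // false: starts with "a"
```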
The parentheses (()) we see in the expression are working as grouping constructs. What a grouping construct does is enable us to break up the string into different parts to match. For our regex ( /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/ ), we see three sets of grouping constructs denoting subexpressions to match. In plainer English, we're saying take a string given to us by the user and make sure it matches the form: matches something in the first group, has an @ in between group one and two, matches something in the second group, has a period in between group two and three, and finally matches something in the third group.
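In JavaScript, each group also captures the text it matched, which makes the three parts easy to inspect. The sketch below uses a hypothetical address, and the names first, second, and third are illustrative only.

```javascript
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// match() returns an array: index 0 is the whole match,
// indexes 1 to 3 are the three groups in order.
const match = "jane_doe@example.com".match(emailRegex);
const [whole, first, second, third] = match;

console.log(whole);  // "jane_doe@example.com"
console.log(first);  // "jane_doe"
console.log(second); // "example"
console.log(third);  // "com"

// When the string does not fit the pattern, match() returns null,
// so check for that before reading the groups.
const noMatch = "not-an-email".match(emailRegex);
console.log(noMatch); // null
```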
The {2,6} portion at the end of our regex is what is called a quantifier, as are the + at the end of groups one and two. Quantifiers tell us the minimum and/or maximum number of times the preceding character, bracket expression, or group must repeat. In the case of the +, it is saying that we want the preceding pattern to match at least one time. Going back to the example, if we wrote [good]+, we would want our string to match at least once, so "antidisestablishmentarianism" would work, but "bunny" would not.
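A brief sketch of the + quantifier, showing which part of each string is actually matched. The extra sample word "doghouse" is an addition for illustration.

```javascript
// + means "one or more of the preceding item", here the set [good].
const oneOrMore = /[good]+/;

// match() returns the first run of characters taken from the set.
const inLongWord = "antidisestablishmentarianism".match(oneOrMore);
console.log(inLongWord[0]); // "d": a single character is enough

const inDoghouse = "doghouse".match(oneOrMore);
console.log(inDoghouse[0]); // "dog": + is greedy and takes the whole run

const inBunny = "bunny".match(oneOrMore);
console.log(inBunny); // null: no g, o, or d anywhere
```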
The curly bracket quantifier ({}) also sets limits. If we have {x}, where x is an integer, we are asking for the preceding pattern to match exactly x times in a row. So, using 2 for x and going back to [good]{2}, "zoo" would work (the "oo"), as would "dog" (the "do"), but "dig" would not, because its "d" and "g" are not next to each other. If we add a comma after our number ([good]{2,}), we now want to match at least two times, so in "good" the whole word is matched rather than only its first two letters. Adding a second number after the comma gives us an inclusive range for matches. In our email regex, {2,6} means we want at least two matches, but no more than 6 matches from our final subexpression.
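The three curly bracket forms can be compared side by side. This is a sketch; the sample words and the standalone ^[a-z\.]{2,6}$ pattern are chosen only for illustration.

```javascript
// {2}: exactly two characters from the set, next to each other.
const exactlyTwo = /[good]{2}/;
console.log(exactlyTwo.test("zoo")); // true: "oo"
console.log(exactlyTwo.test("dog")); // true: "do"
console.log(exactlyTwo.test("dig")); // false: "d" and "g" are separated by "i"

// The same word matched with each form shows how much text is consumed.
const twoOnly = "good".match(/[good]{2}/)[0];
const twoOrMore = "good".match(/[good]{2,}/)[0];
const twoToThree = "good".match(/[good]{2,3}/)[0];
console.log(twoOnly);    // "go"
console.log(twoOrMore);  // "good"
console.log(twoToThree); // "goo"

// The third section of the email regex, anchored on its own for illustration.
const thirdSection = /^[a-z\.]{2,6}$/;
console.log(thirdSection.test("com"));     // true: 3 characters
console.log(thirdSection.test("c"));       // false: fewer than 2
console.log(thirdSection.test("company")); // false: 7 characters, more than 6
```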
Character classes, or character sets, tell us exactly what we are matching in our regex. In our first subexpression ( ([a-z0-9_\.-]+) ), we already know that we will accept a set of possible matches from our bracket notation and that we need at least one match from our + quantifier. Breaking down the subexpression further: a-z denotes that we will accept any lowercase letter from a to z (inclusive), 0-9 matches any digit between 0 and 9 (inclusive), and we will accept three special characters: the underscore (_), the period (.), and the hyphen (-). Note that the period has a \ in front of it. This is because it is an escaped character, which will be discussed further down. (Inside a bracket expression the escape is not strictly required, but it is harmless.)
Our second character class is similar to our first, but we have a new character class \d. This means we can match any digit character, so it is essentially equivalent to [0-9] as seen in the first subexpression. Unlike our first subexpression, _ is not a viable match.
The third character class is again accepting a range of matches, but here it is only the range of lowercase letters a to z and a period. And, as discussed previously, here we need at least two matches, but no more than six.
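To compare the first two character classes directly, the sketch below tests each one on its own. Wrapping them in ^ and $ is an assumption made here so that the whole sample string has to fit the class; the sample strings are hypothetical.

```javascript
// The first and second character classes, anchored for illustration.
const firstClass = /^[a-z0-9_\.-]+$/;
const secondClass = /^[\da-z\.-]+$/;

console.log(firstClass.test("jane_doe"));        // true: underscore is allowed
console.log(secondClass.test("jane_doe"));       // false: underscore is not in this class
console.log(secondClass.test("mail-1.example")); // true: digits, letters, period, hyphen

// \d and [0-9] accept the same characters in JavaScript.
const shorthandDigits = /^\d+$/.test("2024");
const rangeDigits = /^[0-9]+$/.test("2024");
console.log(shorthandDigits, rangeDigits); // true true
```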
Finally, the last piece of our puzzle is the \., which is a character escape. Outside of a bracket expression, an unescaped period matches almost any single character, so, to denote that we are actually looking for the period character, we use the \ before the period.
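The difference the backslash makes is easiest to see on a tiny pattern. The a.c patterns below are illustrative and not part of the email regex.

```javascript
// Unescaped: the period stands for any single character (except a line break).
const anyMiddle = /^a.c$/;
console.log(anyMiddle.test("abc")); // true
console.log(anyMiddle.test("a.c")); // true

// Escaped: only a literal period is accepted.
const literalDot = /^a\.c$/;
console.log(literalDot.test("abc")); // false
console.log(literalDot.test("a.c")); // true
```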
Sarah Garrison is a student at Rice University's Full Stack Web Development Bootcamp.