@segarrison
Last active November 17, 2021 08:43

Decoding The Email Matching Regex

Regular expressions, or regex, can be a daunting challenge for new coders. Instead of words that we are familiar with outside of programming, we see an assortment of seemingly random characters. But each character has a purpose and, even better, follows a specific pattern. Once that pattern is understood, regular expressions become a useful tool in any programmer's arsenal.

Summary

In order to demonstrate how to understand a regex, we'll be using one that is designed to match an email:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

The basic function of this regex is that you are matching a string divided into three sections with two mandatory character dividers:

  • The first section can contain any lowercase letter between a-z; any number between 0-9; or an underscore, hyphen, or period--matching at least one
  • Between the first and second section, there must be an @
  • The second section can contain any digit, any lowercase letter between a-z, a period, or a hyphen--matching at least one
  • Between the second and third section, there must be a period
  • The third section can contain any lowercase letter between a-z or a period, and must be between 2 and 6 characters long
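The summary above can be checked with a short sketch. The sample strings and the names emailPattern, samples, and results are made up for illustration and are not part of the original regex:

```javascript
// The email pattern from this tutorial, written as a regex literal.
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical sample inputs, chosen to exercise each rule in the summary.
const samples = [
  "jane.doe@example.com",  // fits all three sections
  "jane_doe@mail-host.io", // underscore and hyphen are allowed
  "Jane@example.com",      // uppercase J is not in a-z
  "jane.doe.example.com",  // the @ divider is missing
  "jane@example.c",        // last section is shorter than 2 characters
];

// test() returns true when the whole string fits the pattern.
const results = samples.map((sample) => emailPattern.test(sample));
console.log(results); // [ true, true, false, false, false ]
```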

In the following tutorial, we'll go over each component of the regex to better understand the construction.

Regex Components

Before we dive into the other components, a note on how this regex is created. In JavaScript, you can create a regex in two ways: with literal notation or with the built-in RegExp constructor. Our example uses literal notation, which encloses the pattern between slashes. Thus, the first and last / mark the start and end of our regular expression.
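As a rough sketch of the two notations, the same pattern can be written both ways. The variable names and the sample address are only illustrative. Note that the constructor takes a string, so every backslash has to be doubled:

```javascript
// Literal notation: the pattern sits between the two slashes.
const literal = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Constructor notation: the pattern is a string, so each backslash is doubled.
const constructed = new RegExp("^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,6})$");

// Both objects hold the same pattern text and behave the same way.
const sameSource = literal.source === constructed.source;
const literalResult = literal.test("sample@example.com");
const constructedResult = constructed.test("sample@example.com");

console.log(sameSource);        // true
console.log(literalResult);     // true
console.log(constructedResult); // true
```

The literal form is usually easier to read for a fixed pattern. The constructor is mainly useful when the pattern has to be built from a string at runtime.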

Anchors

Both ^ and $ are anchors. The ^ marks the beginning of the string and the $ marks the end, so when both are used, whatever sits between them has to account for the whole string. What follows the ^ determines what must appear at the start: a literal sequence of characters must appear exactly as written, while a bracket expression (enclosed in []) allows any one character from a set. For example, if our regex were ^good, then the strings "good" and "good job" would match, but "Good" would not, because regex matching is case-sensitive by default. Bracket expressions are explained further in the section below.
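A minimal sketch of how the anchors behave, using the ^good example from above. The extra patterns (good$, ^good$, and the i flag) and the sample strings are added here for illustration only:

```javascript
// ^ pins the match to the start of the string.
const startsWithGood = /^good/;
const startChecks = ["good", "good job", "Good", "so good"].map((s) => startsWithGood.test(s));
console.log(startChecks); // [ true, true, false, false ]

// $ pins the match to the end of the string.
const endsWithGood = /good$/;
const endChecks = ["so good", "good job"].map((s) => endsWithGood.test(s));
console.log(endChecks); // [ true, false ]

// With both anchors, the pattern has to cover the entire string.
const onlyGood = /^good$/;
const bothChecks = ["good", "good job"].map((s) => onlyGood.test(s));
console.log(bothChecks); // [ true, false ]

// The i flag switches off case sensitivity.
const anyCase = /^good/i.test("Good");
console.log(anyCase); // true
```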

Bracket Expressions

As mentioned above, a bracket expression ([]) denotes a set of characters, any one of which can be matched. Using our previous example, ^[good] matches any string whose first character is g, o, or d, so it matches "good", but also "dog" and "ood". It does not match "antidisestablishmentarianism", which begins with an a; without the ^ anchor, however, [good] would match it, because the word contains a d. Repeating a character inside the brackets has no effect, so [good] is equivalent to [god]. We have three bracket expressions in our regex. What they mean exactly is discussed in the Character Classes section further on.
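The following sketch compares the bracket expression with and without the ^ anchor. The word list and variable names are assumptions chosen for illustration:

```javascript
// Anchored: the first character of the string must be g, o, or d.
const anchored = /^[good]/;

// Unanchored: a g, o, or d anywhere in the string is enough.
const unanchored = /[good]/;

const words = ["good", "dog", "ood", "antidisestablishmentarianism", "bunny"];

const anchoredResults = words.map((word) => anchored.test(word));
const unanchoredResults = words.map((word) => unanchored.test(word));

console.log(anchoredResults);   // [ true, true, true, false, false ]
console.log(unanchoredResults); // [ true, true, true, true, false ]
```

The long word only matches in the unanchored case: it contains a d, but it starts with an a.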

Grouping Constructs

The parentheses (()) we see in the expression are working as grouping constructs. A grouping construct lets us break the pattern into subexpressions, each of which matches its own part of the string. In our regex ( /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/ ), we see three sets of parentheses, each denoting a subexpression to match. In plainer English, we're saying: take a string given to us by the user and make sure it matches something in the first group, has an @ between groups one and two, matches something in the second group, has a period between groups two and three, and finally matches something in the third group.
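In JavaScript, each pair of parentheses also captures the text it matched, which makes the three groups easy to inspect. This is a small sketch using String.prototype.match; the sample address and variable names are hypothetical:

```javascript
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// On success, match() returns an array: index 0 is the whole match,
// and indexes 1 to 3 hold the text captured by each group.
const match = "jane.doe@example.com".match(emailPattern);
const [whole, first, second, third] = match;

console.log(whole);  // jane.doe@example.com
console.log(first);  // jane.doe
console.log(second); // example
console.log(third);  // com

// When the string does not fit the pattern, match() returns null.
const noMatch = "not an email".match(emailPattern);
console.log(noMatch); // null
```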

Quantifiers

The {2,6} portion at the end of our regex is what is called a quantifier, as are the + signs at the end of groups one and two. A quantifier tells us the minimum and/or maximum number of times the item before it must match. The + says that the preceding item must match at least one time. Going back to the example, [good]+ needs at least one character from the set, so "antidisestablishmentarianism" would work, but "bunny" would not.

The curly bracket quantifier ({}) also sets limits. If we have {x}, where x is an integer, we are asking for the preceding item to match exactly x times. A count is only exact when the pattern is anchored, so consider ^[good]{2}$: "go" would match, but "dog" would not, because it has three characters. (Unanchored, [good]{2} would match "dog", since "do" is two consecutive characters from the set.) If we add a comma after our number (^[good]{2,}$), we now want at least two matches, so both "go" and "dog" match. Adding a second number after the comma gives us an inclusive range. In our email regex, {2,6} means we want at least two, but no more than six, characters from our final bracket expression.
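Here is a sketch of each quantifier form. The anchors ^ and $ are added in most of these patterns so that the counts apply to the whole string; that is an assumption made for the demonstration, as are the sample words:

```javascript
// +  : one or more characters from the set, anywhere in the string.
const plusChecks = ["antidisestablishmentarianism", "bunny"].map((s) => /[good]+/.test(s));
console.log(plusChecks); // [ true, false ]

// {2} : exactly two characters (anchored, so nothing else may surround them).
const exactChecks = ["go", "dog"].map((s) => /^[good]{2}$/.test(s));
console.log(exactChecks); // [ true, false ]

// Without anchors, "dog" still matches because "do" is two characters from the set.
const unanchoredDog = /[good]{2}/.test("dog");
console.log(unanchoredDog); // true

// {2,} : two or more characters.
const atLeastChecks = ["d", "go", "dog"].map((s) => /^[good]{2,}$/.test(s));
console.log(atLeastChecks); // [ false, true, true ]

// {2,6} : between two and six characters, as in the last part of the email regex.
const rangeChecks = ["c", "com", "museum", "company"].map((s) => /^[a-z\.]{2,6}$/.test(s));
console.log(rangeChecks); // [ false, true, true, false ]
```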

Character Classes

Character classes, or character sets, tell us exactly what we are matching in our regex. In our first subexpression (([a-z0-9_\.-]+)), we already know that we will accept a range of possible matches from our bracket notation and that we need at least one match from our + quantifier. Breaking down the subexpression further: a-z denotes that we will accept any lowercase letter from a to z (inclusive), 0-9 matches any digit between 0 and 9 (inclusive), and we will accept three special characters: _, ., and -. The hyphen is placed last so that it is read as a literal hyphen rather than as part of a range like a-z. Note that the period has a \ in front of it. This is because it is an escaped character, which will be discussed further down.

Our second bracket expression is similar to our first, but it uses the shorthand character class \d. This matches any digit character, so it is equivalent to the [0-9] seen in the first subexpression. Unlike our first subexpression, _ is not a viable match.

The third character class is again accepting a range of matches, but here it is only the range of lowercase letters a to z and a period. And, as discussed previously, here we need at least two matches, but no more than six.
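To see the three bracket expressions on their own, the sketch below lifts each one out of the email regex and wraps it in ^ and $. The wrapped patterns, names, and test strings are assumptions for illustration, not part of the original regex:

```javascript
// Each bracket expression from the email regex, tested in isolation.
const firstClass = /^[a-z0-9_\.-]+$/;
const secondClass = /^[\da-z\.-]+$/;
const thirdClass = /^[a-z\.]{2,6}$/;

// The underscore is accepted by the first class only.
const underscoreChecks = [firstClass, secondClass].map((re) => re.test("jane_doe-99"));
console.log(underscoreChecks); // [ true, false ]

// Digits, letters, periods, and hyphens all pass the second class.
const secondOk = secondClass.test("mail-host.99");
console.log(secondOk); // true

// The third class takes letters and periods, but no digits.
const thirdChecks = ["co.uk", "c0m"].map((s) => thirdClass.test(s));
console.log(thirdChecks); // [ true, false ]

// \d and [0-9] accept the same input.
const digitChecks = [/^\d+$/, /^[0-9]+$/].map((re) => re.test("2021"));
console.log(digitChecks); // [ true, true ]
```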

Character Escapes

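Finally, the last piece of our puzzle is the \., which is a character escape. Outside of a bracket expression, an unescaped period matches any single character except a line break, so to denote that we are actually looking for the period character, we put a \ before it. Inside a bracket expression a period is already literal, so the backslashes in our three bracket expressions are optional but harmless.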
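The difference an escape makes can be sketched with a tiny pattern. The a.c patterns and sample strings below are invented for this demonstration:

```javascript
// Unescaped: the period stands for any single character (except a line break).
const unescaped = /^a.c$/;
const unescapedChecks = ["a.c", "abc"].map((s) => unescaped.test(s));
console.log(unescapedChecks); // [ true, true ]

// Escaped: only a literal period is accepted.
const escaped = /^a\.c$/;
const escapedChecks = ["a.c", "abc"].map((s) => escaped.test(s));
console.log(escapedChecks); // [ true, false ]

// Inside brackets, a period is literal with or without the backslash.
const bracketChecks = [/^[.]$/, /^[\.]$/].map((re) => [re.test("."), re.test("x")]);
console.log(bracketChecks); // [ [ true, false ], [ true, false ] ]
```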

Author

Sarah Garrison is a student at Rice University's Full Stack Web Development Bootcamp. You can find more of her work here.
