@maplesyrupman
Last active January 24, 2022 07:41
Email Regex Breakdown

Below, we'll be breaking down a regular expression used to match emails in order to develop our understanding of regular expressions.

Summary

The following regular expression can be used to check whether a string is an email address: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Emails are always made up of three components, which makes them fairly simple to write a search regex for.

  1. Account name
  2. Domain name
  3. Top Level Domain (TLD)

The account name and domain name are separated by the '@' symbol, and the domain name and TLD are separated by a period ('.').

In the email address megganuts@alpharomero.dit, the components are as follows:

  • Account name: megganuts
  • Domain name: alpharomero
  • TLD: dit
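Here is a minimal sketch of how the expression behaves in JavaScript. The name `emailRegex` and the sample strings are made up for illustration; each of the failing samples breaks one piece of the format described above.

```javascript
// The email regex from the summary, as a JavaScript regex literal.
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Made-up inputs, each one breaking a different part of the format.
const samples = [
  'megganuts@alpharomero.dit', // all three components present
  'megganutsalpharomero.dit',  // no '@' between account name and domain name
  'megganuts@alpharomero',     // no '.' followed by a TLD
  'Megganuts@alpharomero.dit', // uppercase 'M' is outside the a-z range
];

// test() returns true when the string fits the pattern.
const results = samples.map((sample) => emailRegex.test(sample));
console.log(results); // [ true, false, false, false ]
```

If uppercase letters should be accepted too, adding the i flag after the closing slash makes the match case-insensitive.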

We will break down the regular expression proposed at the beginning of this gist component by component and explain how each piece works in the context of the entire expression.

Regex Components

Anchors

Let's start with the beginning... and the end. Anchors do exactly that. The ^ symbol matches any string that starts with the character(s) following it, whereas the $ symbol matches any string that ends with the character(s) preceding it.

For example, ^Dic matches any strings starting with 'Dic', such as 'Dice', 'Diction', or 'Dichotomy', whereas all$ matches any strings ending with 'all', such as 'ball', 'stall', or 'crawl'. Just kidding, it wouldn't match 'crawl'... I wonder how regex on audio would work?
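To see the anchors in action, here is a small JavaScript sketch. The variable names and the extra test words ('McDice', 'allow') are just picked for illustration.

```javascript
// ^ pins the match to the start of the string, $ pins it to the end.
const startsWithDic = /^Dic/;
const endsWithAll = /all$/;

const anchorChecks = {
  dice: startsWithDic.test('Dice'),     // true: 'Dic' is at the very start
  mcDice: startsWithDic.test('McDice'), // false: 'Dic' is there, but not at the start
  stall: endsWithAll.test('stall'),     // true: 'all' is at the very end
  allow: endsWithAll.test('allow'),     // false: 'all' is there, but not at the end
  crawl: endsWithAll.test('crawl'),     // false: sounds right, spelled wrong
};
console.log(anchorChecks);
```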

In our email regex, the ^ character can be found at the beginning of the part of the expression that sets the rules for what the account name can look like: ^([a-z0-9_\.-]+)

And the $ character is found at the end of the part of the expression that sets the rules for what the TLD can look like: ([a-z\.]{2,6})$

Notice how the anchors seem to be working in tandem with the ( and ) characters... seems like a suspicious coincidence. More on that soon.

Quantifiers

Quantifiers do exactly as their name suggests... they quantify stuff. What kind of stuff? Character stuff, of course!

The two quantifiers we'll be focusing on today are the + symbol and the { and } symbols.

The + will match one or more of the character preceding it. For example, cats+ will match 'cats', 'catssss', and 'catsssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss'.

Next, we have the { character and its partner in crime, the }. These two need at least one more character to become effective: a number. The number or range found between { and } sets how many times the character before the { must be repeated to satisfy the expression. Now if you're thinking that was confusing to read, just imagine how confusing it was to write. Let's use some examples to help us better understand.

cats{3} matches 'cat' followed by exactly 3 's's. When the whole string has to match, it would match 'catsss' but not 'cats' or 'catssssss'.

Now if you use a range instead, written as two numbers separated by a comma, any number of characters within that range will satisfy the expression. So cats{1,3000} would match 'cats', but also 'catss', 'catsss', and any number of additional 's's up to and including 3000. But 3001 's's? No, that's too many. Much too many.

You can use a trailing comma to denote an open ended range. For example, cats{3,} will match any string starting with 'cat' followed by at least 3 's's.
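The quantifier examples above can be sketched in JavaScript like this. The patterns are wrapped in ^ and $ (an addition for this sketch) so that the whole string has to fit; without the anchors, a pattern like cats{3} would happily find 'catsss' inside a longer run of 's's. Note that ranges inside the braces use a comma.

```javascript
// Each pattern is anchored so the entire string must satisfy the quantifier.
const oneOrMore = /^cats+$/;         // 'cat' plus one or more 's'
const exactlyThree = /^cats{3}$/;    // 'cat' plus exactly three 's'
const oneTo3000 = /^cats{1,3000}$/;  // 'cat' plus 1 to 3000 's'
const threeOrMore = /^cats{3,}$/;    // 'cat' plus at least three 's'

const quantifierChecks = [
  oneOrMore.test('cat'),                        // false: zero 's'
  oneOrMore.test('catssss'),                    // true
  exactlyThree.test('catsss'),                  // true
  exactlyThree.test('catssssss'),               // false: six 's'
  oneTo3000.test('cat' + 's'.repeat(3000)),     // true: right at the limit
  oneTo3000.test('cat' + 's'.repeat(3001)),     // false: much too many
  threeOrMore.test('catss'),                    // false: only two 's'
  threeOrMore.test('catsssss'),                 // true
];
console.log(quantifierChecks);
```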

In the context of our email regex, we see the + used in the account name and domain name portions, and the {} used in the TLD portion.

The + in [a-z0-9_\.-]+ and [\da-z\.-]+ indicates that there must be at least one of the characters within the square brackets, but no upper limit on how many. So it would match a single character, 10 characters, or 10,000 characters. By the way, the square brackets represent a range, and \d represents a single digit. More on those later.

The {2,6} in [a-z\.]{2,6} indicates that there must be at least 2, but no more than 6 of the characters within the square brackets.
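As a rough illustration of how these quantifiers shape what the email regex accepts, the sketch below tries a few made-up addresses with TLDs of different lengths, plus one with an empty account name.

```javascript
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical addresses, chosen to sit on either side of each limit.
const lengthChecks = {
  oneCharTld: emailRegex.test('a@b.c'),          // false: TLD needs at least 2 characters
  twoCharTld: emailRegex.test('a@b.co'),         // true
  sixCharTld: emailRegex.test('a@b.museum'),     // true: 6 is still allowed
  sevenCharTld: emailRegex.test('a@b.abcdefg'),  // false: 7 is one too many
  emptyAccount: emailRegex.test('@b.co'),        // false: + needs at least one character
};
console.log(lengthChecks);
```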

Grouping and Capturing

Capturing groups have a number of applications, but in the context of our email regex they allow us to apply a rule to a group of characters. For example, (go ){3} would match 'go go go ', and (la){2} would match 'lala'.

In our email regex, we use capture groups to divide our regex into the account name, domain name, and TLD.

^([a-z0-9_\.-]+) matches a string that begins with at least one of the characters inside the square brackets which makes up the account name.

@([\da-z\.-]+)\. matches a string that has at least one of the characters found in between the square brackets sandwiched in between the '@' and '.' symbols.

([a-z\.]{2,6})$ matches a string that ends with 2 to 6 characters, each of which is a lowercase letter from a to z or a period.

Combined in order, these groups represent the format of an email!
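Groups also capture whatever they matched, so the three components can be pulled back out of an address. Here is a minimal JavaScript sketch using `String.prototype.match`; the variable names are made up for illustration.

```javascript
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// match() returns an array: index 0 is the whole match,
// and indexes 1 to 3 hold what each capture group matched.
const match = 'megganuts@alpharomero.dit'.match(emailRegex);
const [whole, accountName, domainName, tld] = match;
console.log(accountName); // 'megganuts'
console.log(domainName);  // 'alpharomero'
console.log(tld);         // 'dit'

// A quantifier after a group applies to the whole group.
const goGoGo = /^(go ){3}$/.test('go go go '); // true
const lala = /^(la){2}$/.test('lala');         // true
console.log(goGoGo, lala);
```

If the string does not fit the pattern, match() returns null, so real code would want to check for that before destructuring.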

Bracket Expressions

Square brackets are used to represent an inclusive range of characters. Any number of characters can be included to make up a range.

  • [0-9] matches a single digit between 0 and 9.
  • [a-z] matches a single letter between a and z, case sensitive.
  • [a-m4-8] matches a single character between a and m or between 4 and 8.
  • [#\$%&] matches a single '#', '$', '%', or '&' character.

Notice the backslash before the '$' character in the last example. It acts as an escape character, since '$' has a special meaning in regex. Inside square brackets the escape is optional, but it does no harm.
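The bracket expressions above can be tried out with a short JavaScript sketch. Each one is anchored with ^ and $ here (an addition for this sketch) so that it is tested against exactly one character.

```javascript
// Each bracket expression matches exactly one character from its set.
const digit = /^[0-9]$/;
const lowercase = /^[a-z]$/;
const mixedRanges = /^[a-m4-8]$/;
const symbols = /^[#\$%&]$/;

const bracketChecks = [
  digit.test('7'),        // true
  digit.test('a'),        // false: not a digit
  lowercase.test('q'),    // true
  lowercase.test('Q'),    // false: ranges are case sensitive
  mixedRanges.test('c'),  // true: within a-m
  mixedRanges.test('6'),  // true: within 4-8
  mixedRanges.test('z'),  // false: past m
  mixedRanges.test('2'),  // false: below 4
  symbols.test('$'),      // true
  symbols.test('@'),      // false: not in the set
];
console.log(bracketChecks);
```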

Greedy and Lazy Match

In the context of regular expressions, greedy means as much as possible while lazy means as little as possible. The + character is an example of a greedy rule, meaning that it will match as much as fits the rules outlined before it.

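For example, in the string 'test string (test brackets) blah blah (second bracket pair)' the regex \(.+\) will match '(test brackets) blah blah (second bracket pair)', instead of '(test brackets)' and '(second bracket pair)'. The backslashes make the parentheses literal characters rather than a capture group. Adding a ? after the + makes it lazy: \(.+?\) stops at the first ')' it can, so it matches '(test brackets)' and '(second bracket pair)' separately.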
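Here is a small JavaScript sketch of the difference, using the same test string. The parentheses are escaped with backslashes so they match literal brackets, and the lazy version adds a ? after the + along with the g flag to collect every match. The variable names are made up for illustration.

```javascript
const text = 'test string (test brackets) blah blah (second bracket pair)';

// Greedy: .+ grabs as much as it can, so the match runs from the
// first '(' all the way to the last ')'.
const greedy = text.match(/\(.+\)/)[0];
console.log(greedy); // '(test brackets) blah blah (second bracket pair)'

// Lazy: .+? grabs as little as it can, so each match stops at the
// nearest ')'. The g flag returns all matches instead of just the first.
const lazy = text.match(/\(.+?\)/g);
console.log(lazy); // [ '(test brackets)', '(second bracket pair)' ]
```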

Author

Hey there, my name's William Weiland. I'm a web developer living in Toronto, Ontario. Check me out on GitHub.
