Below, we'll be breaking down a regular expression used to match emails in order to develop our understanding of regular expressions.
The following regular expression can be used to match an email in a string:
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
Emails are always made up of three components, which makes them fairly simple to write a search regex for.
- Account name
- Domain name
- Top Level Domain (TLD)
The account name and domain name are separated by the '@' symbol, and the domain name and TLD are separated by a period ('.').
In the email address megganuts@alpharomero.dit, the components are as follows:
- Account name: megganuts
- Domain name: alpharomero
- TLD: dit
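To see the pattern in action before we pick it apart, here is a minimal sketch. It assumes Python and its built-in re module, which this gist doesn't otherwise use, and the helper name looks_like_email is made up for illustration.

```python
import re

# The email pattern from the top of this gist, written as a Python raw string.
EMAIL_RE = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

def looks_like_email(text):
    """Hypothetical helper: True if the string fits the pattern."""
    return EMAIL_RE.search(text) is not None

print(looks_like_email("megganuts@alpharomero.dit"))  # True
print(looks_like_email("megganuts.alpharomero.dit"))  # False, there is no '@'
print(looks_like_email("megganuts@alpharomero"))      # False, there is no TLD
```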
We will break down the regular expression proposed at the beginning of this gist component by component and explain how each one works in the context of the entire expression.
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Back-references
- Look-ahead and Look-behind
Let's start with the beginning... and the end. Anchors do exactly that. The ^ symbol matches any string that starts with the character(s) following it, whereas the $ symbol matches any string that ends with the character(s) preceding it.
For example, ^Dic matches any strings starting with 'Dic', such as 'Dice', 'Diction', or 'Dichotomy', whereas all$ matches any strings ending with 'all', such as 'ball', 'stall', or 'crawl'. Just kidding, it wouldn't match 'crawl'... I wonder how regex on audio would work?
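Here is a quick sketch of both anchors, assuming Python's re module. The word lists are just sample inputs picked for illustration.

```python
import re

starts_with_dic = re.compile(r"^Dic")  # must begin with 'Dic'
ends_with_all = re.compile(r"all$")    # must end with 'all'

# Check each sample word against the start anchor.
dic_results = {word: bool(starts_with_dic.search(word))
               for word in ["Dice", "Diction", "Indict"]}
print(dic_results)  # {'Dice': True, 'Diction': True, 'Indict': False}

# Check each sample word against the end anchor.
all_results = {word: bool(ends_with_all.search(word))
               for word in ["ball", "stall", "crawls"]}
print(all_results)  # {'ball': True, 'stall': True, 'crawls': False}
```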
In our email regex, the ^ character can be found at the beginning of the part of the expression that sets the rules for what the account name can look like:
^([a-z0-9_\.-]+)
And the $ character is found at the end of the part of the expression that sets the rules for what the TLD can look like:
([a-z\.]{2,6})$
Notice how the anchors seem to be working in tandem with the ( and ) characters... seems like a suspicious coincidence. More on that soon.
Quantifiers do exactly as their name suggests... they quantify stuff. What kind of stuff? Character stuff, of course!
The two quantifiers we'll be focusing on today are the + symbol and the { and } symbols.
The + will match one or more of the character preceding it. For example, cats+
will match 'cats', 'catssss', and 'catsssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss'.
Next, we have the { character, and its partner in crime the }. Now these guys actually require at least one more character to become effective: a number. The number or range found between { and } sets how many times the character before the { must be repeated to satisfy the expression. Now if you're thinking that was confusing to read, just imagine how confusing it was to write. Let's use some examples to help us better understand.
cats{3}
matches a string that starts with 'cat' followed by exactly 3 's's. Basically, it would match 'catsss' but not 'cats' or 'catssssss'.
Now if you use a range instead, any number of characters within that range will satisfy the expression. A range is written as two numbers separated by a comma, so cats{1,3000} would match 'cats', but also 'catss', 'catsss', and any number of additional 's's up to and including 3000. But 3001 's's? No, that's too many. Much too many.
You can use a trailing comma to denote an open ended range. For example, cats{3,}
will match any string starting with 'cat' followed by at least 3 's's.
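The quantifier examples above can be checked with a small sketch, again assuming Python's re module. The helper fits is a made-up name, and it uses re.fullmatch so the whole string has to satisfy the pattern. Note that a range is written with a comma between the two numbers.

```python
import re

def fits(pattern, text):
    """Hypothetical helper: True if the entire string satisfies the pattern."""
    return re.fullmatch(pattern, text) is not None

# + means one or more of the preceding character.
print(fits(r"cats+", "catssss"))    # True
print(fits(r"cats+", "cat"))        # False

# {3} means exactly three.
print(fits(r"cats{3}", "catsss"))   # True
print(fits(r"cats{3}", "cats"))     # False

# {1,3000} means anywhere from 1 to 3000.
print(fits(r"cats{1,3000}", "cat" + "s" * 3000))  # True
print(fits(r"cats{1,3000}", "cat" + "s" * 3001))  # False

# {3,} means three or more.
print(fits(r"cats{3,}", "catss"))   # False
print(fits(r"cats{3,}", "catsssss"))  # True
```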
In the context of our email regex, we see the + used in the account name and domain name portions, and the {} used in the TLD portion.
The + in [a-z0-9_\.-]+ and [\da-z\.-]+ indicates that there must be at least one of the characters within the square brackets, but no upper limit on how many. So it would match a single character, 10 characters, or 10,000 characters. By the way, the square brackets define a set of allowed characters, and \d represents a single digit. More on those later.
The {2,6} in [a-z\.]{2,6} indicates that there must be at least 2, but no more than 6 of the characters within the square brackets.
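To try the TLD rule on its own, here is a small sketch assuming Python's re module. The helper tld_ok and the sample TLDs are chosen for illustration only, and the quantifier uses the comma form of the range.

```python
import re

# Just the TLD part of the email regex: 2 to 6 characters from a-z or '.'.
TLD_RE = re.compile(r"[a-z\.]{2,6}")

def tld_ok(tld):
    """Hypothetical helper: True if the whole string is a valid TLD per the pattern."""
    return TLD_RE.fullmatch(tld) is not None

print(tld_ok("c"))        # False, only 1 character
print(tld_ok("ca"))       # True
print(tld_ok("co.uk"))    # True, the period is part of the allowed set
print(tld_ok("museum"))   # True, exactly 6 characters
print(tld_ok("toolong"))  # False, 7 characters
```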
Capturing groups have a number of applications, but in the context of our email regex they allow us to apply a rule to a group of characters. For example, (go ){3} would match 'go go go ', and (la){2} would match 'lala'.
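A short sketch of the difference the parentheses make, assuming Python's re module. The la{2} comparison is an extra example added here to show what happens without the group.

```python
import re

# With a group, the quantifier repeats the whole group.
go_match = re.fullmatch(r"(go ){3}", "go go go ") is not None
la_match = re.fullmatch(r"(la){2}", "lala") is not None
print(go_match, la_match)  # True True

# Without a group, {2} only applies to the single character before it.
no_group_lala = re.fullmatch(r"la{2}", "lala") is not None
no_group_laa = re.fullmatch(r"la{2}", "laa") is not None
print(no_group_lala, no_group_laa)  # False True
```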
In our email regex, we use capture groups to divide our regex into the account name, domain name, and TLD.
^([a-z0-9_\.-]+)
matches a string that begins with at least one of the characters inside the square brackets which makes up the account name.
@([\da-z\.-]+)\.
matches a string that has at least one of the characters found in between the square brackets, sandwiched in between the '@' and '.' symbols.
([a-z\.]{2,6})$
matches a string that ends with 2 to 6 characters from the set of a to z and the period.
Combined in order, these groups represent the format of an email!
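Because each component sits in its own capture group, the pieces can be pulled back out after a match. A minimal sketch, assuming Python's re module and its Match.groups() method:

```python
import re

# The full email pattern from the top of this gist.
EMAIL_RE = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

match = EMAIL_RE.search("megganuts@alpharomero.dit")

# groups() returns the text captured by each pair of parentheses, in order.
account, domain, tld = match.groups()
print(account)  # megganuts
print(domain)   # alpharomero
print(tld)      # dit
```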
Square brackets are used to represent a set of characters, and a hyphen between two characters stands for the inclusive range between them. Any number of characters and ranges can be combined in one set.
[0-9]
matches a single digit between 0 and 9.
[a-z]
matches a single letter between a and z, case sensitive.
[a-m4-8]
matches a single character between a and m or 4 and 8.
[#\$%&]
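matches a single '#', '$', '%', or '&' character. Notice the backslash before the '$' character, which acts as an escape character because '$' has a special meaning in regex. Inside square brackets the '$' is already treated literally, so the escape here is optional but harmless.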
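The bracket expressions above can be tried out with re.findall, which returns every match in a string. This is a sketch assuming Python's re module, and the sample strings are invented for illustration.

```python
import re

# Each bracket expression matches one character at a time.
digits = re.findall(r"[0-9]", "a1b22")
print(digits)  # ['1', '2', '2']

# Lowercase only: the capital letters are skipped.
lower = re.findall(r"[a-z]", "Hi There")
print(lower)  # ['i', 'h', 'e', 'r', 'e']

# Two ranges in one set: a to m, or 4 to 8.
mixed = re.findall(r"[a-m4-8]", "dog 359")
print(mixed)  # ['d', 'g', '5']

# A set of individual symbols, with and without the escaped '$'.
symbols = re.findall(r"[#\$%&]", "50% off & $5")
symbols_unescaped = re.findall(r"[#$%&]", "50% off & $5")
print(symbols)            # ['%', '&', '$']
print(symbols_unescaped)  # ['%', '&', '$']
```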
In the context of regular expressions, greedy means as much as possible while lazy means as little as possible. The + character is an example of a greedy rule, meaning that it will match as much as fits the rules outlined before it.
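For example, in the string 'test string (test brackets) blah blah (second bracket pair)' the regex \(.+\) will match '(test brackets) blah blah (second bracket pair)', instead of '(test brackets)' and '(second bracket pair)'. The backslashes are needed so that the parentheses match literal brackets rather than forming a capture group.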
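Here is a sketch comparing the greedy match with its lazy counterpart, assuming Python's re module. The parentheses are escaped so they match literal brackets, and the lazy form +? (a ? added after the quantifier) is standard regex syntax that this gist hasn't covered.

```python
import re

text = "test string (test brackets) blah blah (second bracket pair)"

# Greedy: .+ grabs as much as it can, so it runs to the last ')'.
greedy = re.findall(r"\(.+\)", text)
print(greedy)  # ['(test brackets) blah blah (second bracket pair)']

# Lazy: .+? grabs as little as it can, so it stops at the first ')'.
lazy = re.findall(r"\(.+?\)", text)
print(lazy)  # ['(test brackets)', '(second bracket pair)']
```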
Hey there, my name's William Weiland. I'm a web developer living in Toronto, Ontario. Check me out on github.