grayad/regexTutorial.md

## regexTutorial.md

      
Matching an Email Using Regex

In this tutorial, you will learn how a regular expression is used to match an email.
Regular expressions (regex or regexp for short) are text strings that define search patterns. Using one is like using the find box on your computer (Ctrl+F) to search for characters, words, and phrases within a document. The difference is that a regular expression lets you search for more than a literal character or word: it can describe many variations of a pattern at once.
For example, in a paragraph, you may want to find every instance of the word "the," but only when it begins a sentence. Or maybe you want to find numbers, but only when they are 3 digits. You can complete these searches using regular expressions.
Regular expressions are used for more than searching. They can also validate, replace, and otherwise manipulate text.
Summary

The regex that I will be describing is used to find/validate emails. See below.
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
I will explain each part of the regex and how these parts work together to match a string that is an email address.
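Before breaking the pattern down, it may help to see it run. The sketch below uses JavaScript's built-in `test` method; the sample addresses are hypothetical and chosen only for illustration.

```javascript
// The tutorial's email pattern, stored as a JavaScript regex literal.
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical sample inputs, not taken from the tutorial.
const summaryResults = {
  valid: emailRegex.test("jane_doe@example.com"),     // true: the whole string fits the pattern
  missingAt: emailRegex.test("jane_doe.example.com"), // false: there is no @ symbol
  upperCase: emailRegex.test("Jane@Example.com"),     // false: capitals are not in a-z
};

console.log(summaryResults);
```

The third result shows that the pattern as written accepts only lowercase letters.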
Table of Contents


Anchors
Quantifiers
OR Operator
Character Classes
Flags
Grouping and Capturing
Bracket Expressions
Greedy and Lazy Match
Boundaries
Back-references
Look-ahead and Look-behind

Regex Components

Anchors

Anchors do not match any character, but instead match a position before or after characters.
In this example, the anchors are ^ and $, marking the beginning and end of the string, respectively.
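To see what the anchors contribute, the sketch below compares the tutorial's pattern with a copy that has `^` and `$` removed. The unanchored copy and the sample sentence are assumptions made for this comparison only.

```javascript
// The tutorial's pattern: must match the entire string.
const anchored = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;
// Hypothetical variant without ^ and $: may match anywhere inside the string.
const unanchored = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/;

const sentence = "contact: jane@example.com today";

const anchoredOnSentence = anchored.test(sentence);        // false: extra text around the email
const unanchoredOnSentence = unanchored.test(sentence);    // true: an email appears inside
const anchoredOnEmail = anchored.test("jane@example.com"); // true: the string is only an email

console.log(anchoredOnSentence, unanchoredOnSentence, anchoredOnEmail);
```

With the anchors in place, the regex works as a validator for a single value rather than a search through longer text.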
Quantifiers

Quantifiers define how many times the preceding character, group, or character class must be present in the input. Some common quantifiers are:

* to match the element zero or more times
+ to match the element 1 or more times
? to match the element 0 or 1 time
{n} to match the element exactly n times
{n,m} to match the element at least n times and at most m times

In this example, the end of the regex, ([a-z\.]{2,6})$, uses the quantifier {2,6}. This means the final part of the string must contain at least 2 and no more than 6 characters from the set [a-z\.]. Because of the $ anchor, this part is the end of the email string, and it most commonly matches the 'com' in a '.com' address (the dot itself is matched by the \. just before the group).
Another quantifier, +, is used for the character set following the email @ symbol. See again below.
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
The character set before this + will most likely match the email service, like gmail or outlook. The + quantifier means the input must contain at least 1 character from that set at this position for the email to match (e.g. both email@gmail.com and email@g.com will match).
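The effect of `+` and `{2,6}` can be checked directly. In this sketch the first two addresses come from the tutorial; the last three are hypothetical inputs added to show where the quantifiers reject a string.

```javascript
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

const quantifierChecks = {
  longService: emailPattern.test("email@gmail.com"),      // true: + accepts many characters
  shortService: emailPattern.test("email@g.com"),         // true: + accepts a single character
  emptyService: emailPattern.test("email@.com"),          // false: + needs at least one character
  endingTooShort: emailPattern.test("email@gmail.c"),     // false: {2,6} needs at least 2
  endingTooLong: emailPattern.test("email@gmail.company"), // false: {2,6} allows at most 6
};

console.log(quantifierChecks);
```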
OR Operator

In regular expressions, the OR operator familiar from JavaScript is called alternation and is represented by the | symbol. Alternation matches any one of the expressions separated by |. For example, given the regex c(a|o|u)t, the three possible matches are cat, cot, or cut. Alternation is not used in our email regex.
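As a small illustration, the c(a|o|u)t pattern can be run against a list of words. The word list is a hypothetical input; "cit" is included only to show a non-match.

```javascript
// Alternation: the middle letter must be a, o, or u.
const vowelSwap = /c(a|o|u)t/;

const candidateWords = ["cat", "cot", "cut", "cit"];

// Keep only the words the pattern matches.
const alternationMatches = candidateWords.filter((word) => vowelSwap.test(word));

console.log(alternationMatches); // [ 'cat', 'cot', 'cut' ]
```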
Character Classes

A character class defines a set of characters using square brackets []. It matches any single character from that set, subject to the regex's other conditions.
For our email example, the first character set presented is [a-z0-9_\.-]. This means that for an email to match, the text before the @ symbol can contain any lowercase letter a through z, any digit 0 through 9, underscores, periods, and hyphens. The backslash in \. only escapes the period; a backslash character itself is not part of the set.
The second character set used is [\da-z\.-], which, as I previously mentioned, will most likely match the email service, like gmail. This character set uses the class \d, which is another way to match any digit 0 through 9. The set also matches any letter a-z again, as well as periods and hyphens.
The final character set in the email example is [a-z\.], following the mandatory literal dot (\.). This set will most commonly match the 'com' in a '.com' address, and allows any lowercase letter or period.
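The sets can be tested on their own to confirm which characters they accept. The sketch below wraps the first two sets in `^...+$` so each whole sample string must come from the set; that wrapping and the sample strings are assumptions for illustration. In JavaScript, `\.` inside brackets stands for a literal period, so a backslash character itself is not accepted.

```javascript
// First set from the email regex (text before the @).
const localPartSet = /^[a-z0-9_\.-]+$/;
// Second set from the email regex (text after the @).
const domainSet = /^[\da-z\.-]+$/;

const classChecks = {
  mixedLocal: localPartSet.test("jane_doe-99.x"),  // true: letters, digits, _, -, and .
  upperCaseLocal: localPartSet.test("Jane"),       // false: capitals are not in a-z
  backslashLocal: localPartSet.test("jane\\doe"),  // false: a backslash is not in the set
  digitsInDomain: domainSet.test("mail123"),       // true: \d covers 0 through 9
  underscoreInDomain: domainSet.test("my_mail"),   // false: the second set has no underscore
};

console.log(classChecks);
```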
Flags

Expression flags modify the search. Flags are located after the closing forward slash of the expression.
Some useful flags for the email regex are g (global search) and i (ignore case).

Global search finds every match in the input text instead of stopping at the first one. Because this regex is anchored with ^ and $, it can match at most once per string, so g only becomes useful once the anchors are removed (or combined with the multiline flag m).
Ignore case makes the regex case-insensitive, so everywhere a-z is defined, A-Z will also match (e.g. both abc@gmail.com and aBC@gmail.com will match).
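The sketch below tries both flags. The `i` example uses the tutorial's pattern unchanged. The `g` example uses a hypothetical copy with the `^` and `$` anchors removed, since an anchored pattern can match a string at most once; the input text is also made up for illustration.

```javascript
const caseSensitive = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;
const caseInsensitive = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/i;

const mixedCase = "aBC@gmail.com";
const withoutI = caseSensitive.test(mixedCase); // false: "B" and "C" are not in a-z
const withI = caseInsensitive.test(mixedCase);  // true: i lets a-z also match A-Z

// Hypothetical unanchored copy with the g flag, used to search longer text.
const findAll = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/g;
const contactList = "abc@gmail.com, xyz@outlook.com";

// With g, String.prototype.match returns every full match as an array.
const foundEmails = contactList.match(findAll);

console.log(withoutI, withI, foundEmails); // false true [ 'abc@gmail.com', 'xyz@outlook.com' ]
```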

Grouping and Capturing

Grouping is used to break up the expression into sections and is usually accomplished using parentheses (). We've come to see that our email example is grouped into 3 main sections: the part before the @ sign (like a username), the email service (like gmail), and the closing of the email address (com, gov, net, etc.). The character set definitions for these groups are contained within the parentheses ().
When a regex is grouped with parentheses, the text matched by each group is also captured and assigned a number, so it can be reused later through a numbered back-reference.
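In JavaScript, the captured groups can be read from the array that `match` returns. The sketch below assumes a hypothetical address; the variable names for the three groups are illustrative labels, not terms from the tutorial.

```javascript
const groupedEmail = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Without the g flag, match returns the full match followed by each captured group.
const groupMatch = "jane_doe@example.com".match(groupedEmail);

// Index 0 is the whole match; indexes 1 to 3 are the numbered groups.
const [fullMatch, localPart, serviceName, ending] = groupMatch;

console.log(fullMatch);   // jane_doe@example.com
console.log(localPart);   // jane_doe
console.log(serviceName); // example
console.log(ending);      // com
```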
Bracket Expressions

In regular expressions, different types of brackets are used for different reasons. Curly braces {} are simply used for quantifiers, but parentheses () and square brackets [] are often confused.
Parentheses are for grouping and capturing, while square brackets are for character set definitions, as seen throughout this tutorial.
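The difference between the three bracket types is easiest to see side by side. The patterns and inputs in this sketch are hypothetical and unrelated to the email regex.

```javascript
const groupedPair = /^(ab)+$/;  // parentheses: the sequence "ab", repeated
const eitherChar = /^[ab]+$/;   // square brackets: any mix of "a" and "b"
const exactlyThree = /^a{3}$/;  // curly braces: exactly three "a" characters

const bracketChecks = {
  groupAbab: groupedPair.test("abab"), // true: "ab" twice
  groupAabb: groupedPair.test("aabb"), // false: not a repetition of "ab"
  setAabb: eitherChar.test("aabb"),    // true: every character is a or b
  countAaa: exactlyThree.test("aaa"),  // true
  countAa: exactlyThree.test("aa"),    // false: only two
};

console.log(bracketChecks);
```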
Greedy and Lazy Match

By default, quantifiers are 'greedy', meaning they will match as many characters as possible. Adding a ? after a quantifier makes it 'lazy,' causing it to match as few characters as possible. Lazy matching is not used in the email example.
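A short sketch of the difference, using a hypothetical string with two quoted words:

```javascript
const quotedText = 'say "one" and "two"';

// Greedy: .+ runs as far as it can, so the match ends at the LAST quote.
const greedyMatch = quotedText.match(/".+"/)[0];

// Lazy: .+? stops as soon as it can, so the match ends at the NEXT quote.
const lazyMatch = quotedText.match(/".+?"/)[0];

console.log(greedyMatch); // "one" and "two"
console.log(lazyMatch);   // "one"
```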
Boundaries

Boundaries, like anchors, do not match any character, but instead positions. Boundaries define what can be matched to the left and right of the current position in the string.
The most popular boundary is the word boundary \b, which matches positions where one side is a word character (a letter, digit, or underscore) and the other side is not.
Here's an example: \bto\b would match 'to' in 'going to the movies,' but it would not match it in 'tonight' or 'photo.' When removing one of the boundaries, \bto would match 'to' in 'tonight,' and to\b would match 'to' in 'photo.'
Notice there are no boundaries in this email regex example.
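The \bto\b example can be verified with `test`:

```javascript
const wholeWord = /\bto\b/; // "to" with a boundary on both sides
const wordStart = /\bto/;   // "to" at the start of a word
const wordEnd = /to\b/;     // "to" at the end of a word

const boundaryChecks = {
  wholeInPhrase: wholeWord.test("going to the movies"), // true
  wholeInTonight: wholeWord.test("tonight"),            // false: a letter follows "to"
  wholeInPhoto: wholeWord.test("photo"),                // false: a letter precedes "to"
  startInTonight: wordStart.test("tonight"),            // true
  startInPhoto: wordStart.test("photo"),                // false
  endInPhoto: wordEnd.test("photo"),                    // true
};

console.log(boundaryChecks);
```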
Back-references

Back-references identify a repeated character or substring within a string. They are most commonly numbered, e.g. \3, where the number is the position of the capturing group. So \3 would match the same text that the third capturing group matched; in the email regex, that group is ([a-z\.]{2,6}). Back-references are not used in the email regex itself.
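Since the email regex has no back-reference, the sketch below uses a separate, hypothetical pattern: ([a-z])\1 captures one lowercase letter and then requires that same letter again, which finds doubled letters.

```javascript
// Group 1 captures a letter; \1 must then match that exact same letter.
const doubledLetter = /([a-z])\1/;

const inLetter = "letter".match(doubledLetter); // matches the "tt" in "letter"
const inWord = "word".match(doubledLetter);     // null: no letter is doubled

console.log(inLetter[0]); // tt
console.log(inLetter[1]); // t
console.log(inWord);      // null
```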
Look-ahead and Look-behind

Although they are not used in the email regex, look-aheads and look-behinds check whether a pattern appears immediately after or before the current position, without including it in the match. A positive look-around succeeds only if the pattern is present; a negative one succeeds only if it is absent. Positive look-aheads and look-behinds are written with =, while negative ones are written with !.

positive look-ahead (?=ABC)
negative look-ahead (?!ABC)
positive look-behind (?<=ABC)
negative look-behind (?<!ABC)
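The four forms can be sketched on a hypothetical label string. Look-behind needs a reasonably modern JavaScript engine, which is assumed here. The \b in the last pattern is an extra guard added for this example so the match cannot start in the middle of "42".

```javascript
const label = "cost: $42, weight: 17kg";

// Positive look-ahead: digits that are followed by "kg".
const beforeKg = label.match(/\d+(?=kg)/)[0];

// Negative look-ahead: the first run of digits not followed by "kg".
const notBeforeKg = label.match(/\d+(?!kg)/)[0];

// Positive look-behind: digits that are preceded by "$".
const afterDollar = label.match(/(?<=\$)\d+/)[0];

// Negative look-behind: a number that starts at a word boundary and is not preceded by "$".
const notAfterDollar = label.match(/\b(?<!\$)\d+/)[0];

console.log(beforeKg, notBeforeKg, afterDollar, notAfterDollar); // 17 42 42 17
```

In each case the look-around text ("kg" or "$") is checked but is not part of the returned match.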

Author

Alexus Gray is a full stack developer and an alumna of the University of North Carolina at Chapel Hill's Coding Bootcamp. Her GitHub is https://github.com/grayad and showcases many of her development projects built using a range of technologies like HTML, CSS, JavaScript, Node, Express, and more.