## T-regex.md

      
T-Regex 🦖

This tutorial will introduce you to regular expressions, or regex for short. A regex is a pattern that describes what a piece of text should look like, which makes it a handy tool for searching, validating, and extracting information from text. Regular expressions are available in most programming languages.
Summary

In this tutorial we will look at how a regular expression checks text input to determine whether the input is a valid email address. We will walk through this entire expression step by step, breaking down each of the characters used and what they mean.
This is the regex for validating an email address:
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
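Before breaking the pattern down, here is a minimal sketch of how it might be used. It assumes JavaScript, whose regex literals use the same /.../ delimiters shown above. The names emailPattern and isValidEmail and the sample addresses are made up for this example.

```javascript
// The regex from this tutorial, written as a JavaScript regex literal.
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical helper: returns true when the whole input fits the pattern.
function isValidEmail(input) {
  return emailPattern.test(input);
}

console.log(isValidEmail("lex@example.com")); // true
console.log(isValidEmail("lex.example.com")); // false (there is no @)
console.log(isValidEmail("Lex@Example.com")); // false (uppercase is not in [a-z])
```

Since the character classes only list lowercase letters, one option is to lowercase the input before testing it, or to add the i flag to the literal.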

Table of Contents


Anchors
Grouping and Capturing
Character Classes
Quantifiers
Greedy and Lazy Match
Putting It All Together

Regex Components

Anchors

The anchors used in this regex:
^ This symbol signifies the START of a string
$ This symbol signifies the END of a string
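Note that ^ is the very first character of our regex and $ is the very last.

This is important because it tells the engine that the match must begin at the first character of the input and end at its last. Without the anchors, the regex would also accept a longer string that merely contains an email address somewhere inside it.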
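To see what the anchors buy us, here is a small sketch in JavaScript that compares the tutorial's regex with a copy that has ^ and $ removed. The unanchored copy and the sample sentence exist only for this comparison.

```javascript
// The tutorial's regex, anchored at both ends.
const anchored = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;
// The same pattern with ^ and $ removed (for comparison only).
const unanchored = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/;

const input = "send it to lex@example.com!!!";

// With anchors, the whole input has to be an email address.
console.log(anchored.test(input)); // false
// Without anchors, finding an address anywhere in the input is enough.
console.log(unanchored.test(input)); // true
console.log(unanchored.exec(input)[0]); // "lex@example.com"
```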
Grouping and Capturing

( ) Parentheses are used to group multiple criteria together into one, creating a "capture group."
Notice how the groups in this regex capture, in order, all the characters before the "@", all the characters between the "@" and the final ".", and then all the characters after that final ".".

This is important for our email validation since we want to be sure that the input follows the correct format:
_______ @ ______ . _____
Where each blank space contains a combination of characters that we validate using character classes.
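The three blanks above correspond to the three capture groups. As a rough sketch in JavaScript, exec returns the full match followed by each group, so the pieces can be pulled apart like this. The names user, domain, and tld and the sample address are assumptions for this example.

```javascript
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// exec returns an array: index 0 is the whole match, 1 to 3 are the groups.
const match = emailPattern.exec("lex@mail.example.com");
const [whole, user, domain, tld] = match;

console.log(whole);  // "lex@mail.example.com"
console.log(user);   // "lex"
console.log(domain); // "mail.example"
console.log(tld);    // "com"

// When the input does not match, exec returns null instead of an array.
console.log(emailPattern.exec("not an email")); // null
```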
Character Classes

Character classes define the characters we want to match in our search. Since this regex is validating an email address, we want to allow users to type letters, numbers, periods, underscores, and hyphens.
We will place the @ character after our first capture group, and then a . after our second capture group since emails generally follow the format myEmail@place.com
Notice the backslash before the period between our 2nd and 3rd group. That is what's known as an escaped character, where the backslash is saying "Hey, this is the actual period character, not the dot!" A . by itself has a special meaning in a regex: it matches any single character (except a line break, by default).
Notice how, instead of typing the entire alphabet, we can use square brackets to declare a range of characters. Take a look at the range of characters in capture group #1:
[a-z0-9_\.-]

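Here, we are saying match any one character that is a through z, 0 through 9, an underscore, a period, or a hyphen. Inside square brackets a period is already treated literally, so the backslash before it is optional but harmless. The hyphen comes last so that it is read as a literal hyphen rather than as part of a range like a-z.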
So, what's going on with the range of characters in our second capture group?
[\da-z\.-]

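The \d is a shorthand character class for digits. Exactly which digits it covers depends on the regex engine: in JavaScript, \d is the same as [0-9], while in some other engines (Python 3 and .NET, for example) it also matches digits from other Unicode scripts. [0-9] always means just the 10 ASCII digits that appear on a computer keyboard.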
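The points about character classes can be checked with a few small tests. This is only a sketch, assuming JavaScript; the name localChars and the sample strings are invented for the example.

```javascript
// The character class from capture group #1, tested on its own.
const localChars = /^[a-z0-9_\.-]+$/;
console.log(localChars.test("lex_01.dev-ops")); // true
console.log(localChars.test("lex+tag"));        // false (+ is not in the class)

// Outside brackets, an unescaped . matches almost any character,
// while an escaped \. matches only a real period.
console.log(/^a.c$/.test("abc"));  // true
console.log(/^a\.c$/.test("abc")); // false
console.log(/^a\.c$/.test("a.c")); // true

// Inside brackets the period is already literal, escaped or not.
console.log(/^[.]$/.test("x")); // false

// In JavaScript, \d matches only the ASCII digits 0-9.
console.log(/^\d$/.test("7"));      // true
console.log(/^\d$/.test("\u0663")); // false (Arabic-Indic digit three)
```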
Quantifiers

The quantifiers used in this regex:
+ This means one or more of the preceding criteria must match
{2,6} This basically says there should be at least 2 but no more than 6 of the preceding characters
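Notice that each quantifier appears right after a character class, inside its capture group: + follows the classes in groups #1 and #2, and {2,6} follows the class in group #3.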
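Here is a short sketch of both quantifiers in isolation, assuming JavaScript. The patterns oneOrMore and twoToSix are simplified stand-ins made up for this example, not part of the email regex itself.

```javascript
// + requires at least one character from the class.
const oneOrMore = /^[a-z]+$/;
console.log(oneOrMore.test(""));    // false (zero characters)
console.log(oneOrMore.test("a"));   // true
console.log(oneOrMore.test("abc")); // true

// {2,6} requires between 2 and 6 characters from the class.
const twoToSix = /^[a-z\.]{2,6}$/;
console.log(twoToSix.test("c"));       // false (only 1 character)
console.log(twoToSix.test("io"));      // true  (2 characters)
console.log(twoToSix.test("museum"));  // true  (6 characters)
console.log(twoToSix.test("example")); // false (7 characters)
```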

Greedy and Lazy Match

Greedy and lazy both sound bad but, in the case of regular expressions, they are just terms used to describe the behavior of quantifiers.
In our regex, we aren't using any lazy quantifiers. A quantifier is considered lazy if it matches as few characters as possible. The default state of quantifiers is greedy, meaning they work to match as many characters as possible.
Let's look again at the quantifiers we are using, + and {2,6}.
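+ is saying "match one or more" of the preceding characters. It is greedy: it grabs as many characters as it can and only gives some back if the rest of the pattern would otherwise fail.
{2,6} is also greedy: it accepts 2 to 6 characters and takes as many of them as it can. Adding a ? after either quantifier (+? or {2,6}?) would make it lazy.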
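The difference is easiest to see side by side. The sketch below assumes JavaScript. The lazy variant of the email regex, with +? in group #2, is a hypothetical modification for comparison only, and the sample address is invented.

```javascript
// On their own: greedy takes as much as it can, lazy as little as it can.
console.log("aaaa".match(/a+/)[0]);      // "aaaa"
console.log("aaaa".match(/a+?/)[0]);     // "a"
console.log("aaaa".match(/a{2,6}/)[0]);  // "aaaa"
console.log("aaaa".match(/a{2,6}?/)[0]); // "aa"

// The tutorial's regex (greedy) next to a copy with a lazy group #2.
const greedy = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;
const lazy = /^([a-z0-9_\.-]+)@([\da-z\.-]+?)\.([a-z\.]{2,6})$/;

const input = "lex@mail.co.uk";

// Both accept the address, but they split groups #2 and #3 differently.
console.log(greedy.exec(input).slice(2)); // [ 'mail.co', 'uk' ]
console.log(lazy.exec(input).slice(2));   // [ 'mail', 'co.uk' ]
```

Greedy group #2 stretches to the last period it can use, while the lazy version stops at the first one and leaves the rest for group #3.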
Putting It All Together

Now that we have broken it up piece by piece, let's look at the bigger picture.
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/


We start with /, the delimiter that opens a regex literal in languages such as JavaScript; it basically says "this is about to be a regular expression"
We place an anchor ^ signifying where the string should start
Next is an open paren ( to start a group capture
Between square brackets are characters to match in the search [a-z0-9_\.-]
This is followed by a + quantifier, meaning one or more characters from that set must appear
Then we have a closing paren ) which means our first group has been captured
Next, the @ character appears, clearly stating that it must come right after everything matched so far
Another open paren ( signaling the second capture group
Then comes the character class [\da-z\.-], which allows digits (\d), the letters a to z, periods, and hyphens
Next, another + quantifier
Closing paren ) marks the end of capture group #2
Next, the escaped character \. is placed after capture group 2 and before capture group 3 (think "gmail.com")
Another open paren ( means we are starting capture group #3
Then, the range [a-z\.] which allows for characters a to z and periods
Followed by the quantifier {2,6}, which basically says the part after the final period should be 2 to 6 characters long
Next, closing ) concludes capture group #3
Then, we have the anchor $ which signifies the end of the string to be searched
Lastly, / tells the engine "that's it, my regular expression is finished"
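The walkthrough above can also be sketched in code by assembling the pattern from its pieces. This assumes JavaScript. The pieces array and the names assembled and literal are invented for this example, and String.raw is used only so the backslashes can be written exactly as they appear in the regex.

```javascript
// Each piece of the regex, in the order described in the walkthrough.
const pieces = [
  "^",                             // anchor: start of the string
  String.raw`([a-z0-9_\.-]+)`,     // capture group #1: before the @
  "@",                             // the literal @ character
  String.raw`([\da-z\.-]+)`,       // capture group #2: between @ and the final period
  String.raw`\.`,                  // the escaped (literal) period
  String.raw`([a-z\.]{2,6})`,      // capture group #3: 2 to 6 characters
  "$",                             // anchor: end of the string
];

// Join the pieces and hand them to the RegExp constructor.
const assembled = new RegExp(pieces.join(""));

// The same pattern written as a literal, as in the tutorial.
const literal = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

console.log(assembled.source === literal.source); // true
console.log(assembled.test("lex@gmail.com"));     // true
console.log(assembled.test("lex@gmail"));         // false (no period after the @)
```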

Author

My name is Lex
I have a B.S. degree in Music Technology and a certificate in Full Stack Web Development from UC Berkeley. I love learning new things and integrating creativity with tech anywhere possible.
Check out my GitHub