T-regex: A tutorial on regular expressions

T-Regex 🦖

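This tutorial will introduce you to regular expressions, or regex for short. A regex is a pattern that describes a set of strings, and it can be used to search, validate, and extract information from text. Regular expressions are available in most programming languages, although the exact syntax and features vary a little between engines.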

Summary

In this tutorial we will look at how a regular expression checks text input to determine whether the input is a valid email address. We will walk through this entire expression step by step, breaking down each of the characters used and what they mean.

This is the regex for validating an email address:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
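As a quick illustration before the breakdown, here is a minimal sketch of how this expression might be used in JavaScript. The helper name isValidEmail and the sample addresses are made up for this example and are not part of the original tutorial.

```javascript
// The regex from this tutorial, written as a JavaScript regex literal.
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical helper: returns true when the whole input matches the pattern.
function isValidEmail(input) {
  return emailRegex.test(input);
}

console.log(isValidEmail("lex@example.com")); // true
console.log(isValidEmail("lex@example"));     // false: nothing like ".com" at the end
console.log(isValidEmail("lex.example.com")); // false: no "@"
```

The rest of the tutorial explains why each of these inputs passes or fails.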

Table of Contents

Regex Components

Anchors

The anchors used in this regex:

^ This symbol signifies the START of a string

$ This symbol signifies the END of a string

Note the use of ^ and $:

[Image: anchors]

This is important because the anchors tell the engine that the pattern must match the entire string, from its first character to its last. Without them, the regex would also accept any longer string that merely contains something shaped like an email address.
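To see what the anchors change, the sketch below compares the tutorial's regex with a copy that has ^ and $ removed. The variable names and the sample input are invented for illustration.

```javascript
// Same pattern, with and without the ^ and $ anchors.
const anchoredRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;
const unanchoredRegex = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/;

const noisyInput = "!!! lex@example.com !!!";

// Without anchors, finding an email-shaped piece anywhere is enough.
console.log(unanchoredRegex.test(noisyInput)); // true

// With anchors, the string as a whole has to be an email address.
console.log(anchoredRegex.test(noisyInput));   // false
```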

Grouping and Capturing

( ) Parentheses group part of a pattern into a single unit and create a "capture group," which remembers the text that part of the pattern matched.

Notice how this regex uses three groups: the first captures the characters before the "@", the second captures the characters between the "@" and the escaped ".", and the third captures the characters after that "." :

[Image: groups]

This is important for our email validation since we want to be sure that the input follows the correct format:

_______ @ ______ . _____

Where each blank space contains a combination of characters that we validate using character classes.
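The three blanks correspond to the three capture groups. As a small sketch (the variable names and sample address are assumptions for this example), JavaScript's String.prototype.match returns the captured text alongside the full match:

```javascript
const groupRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Without the g flag, match() returns the full match followed by each capture group.
const groupResult = "lex@example.com".match(groupRegex);

console.log(groupResult[0]); // "lex@example.com" (the full match)
console.log(groupResult[1]); // "lex"             (group #1, before the @)
console.log(groupResult[2]); // "example"         (group #2, between @ and .)
console.log(groupResult[3]); // "com"             (group #3, after the .)
```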

Character Classes

Character classes define the characters we want to match in our search. Since this regex is validating an email address, we want to allow users to type letters, numbers, periods, underscores, and hyphens.

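We will place the @ character after our first capture group, and then a . after our second capture group, since emails generally follow the format myemail@place.com. Note that the character classes in this regex list only lowercase letters, so an address containing uppercase letters will not match unless the case-insensitive i flag is added.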

Notice the backslash before the period between our 2nd and 3rd groups. That is what's known as an escaped character, where the backslash is saying "Hey, this is the actual period character, not the wildcard!" An unescaped . has a special meaning in a regex: it matches any single character (except, by default, a line break).
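The difference between an escaped and an unescaped period is easy to check. This is a small sketch with made-up patterns, not part of the email regex itself:

```javascript
// Unescaped: "." stands for any single character (except a line break).
const anyCharacter = /^a.c$/;
// Escaped: "\." stands for a literal period only.
const literalPeriod = /^a\.c$/;

console.log(anyCharacter.test("a.c"));  // true
console.log(anyCharacter.test("abc"));  // true
console.log(literalPeriod.test("a.c")); // true
console.log(literalPeriod.test("abc")); // false
```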

Notice how, instead of typing the entire alphabet, we can use square brackets to declare a range of characters. Take a look at the range of characters in capture group #1:

[a-z0-9_\.-]

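Here, we are saying match any single character that is a through z, 0 through 9, an underscore, a period, or a hyphen. Note the escaped character for a period! Inside square brackets the backslash is optional, because a period there is already treated literally, but escaping it does no harm. The hyphen comes last inside the brackets so that it is read as a literal hyphen rather than as part of a range.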
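A character class always matches exactly one character from its set. The sketch below tests a handful of single characters against the class from capture group #1; the variable names and sample characters are chosen purely for illustration.

```javascript
// Exactly one character from the class used in capture group #1.
const localPartChar = /^[a-z0-9_\.-]$/;

const allowedSamples = ["q", "7", "_", ".", "-"];
const rejectedSamples = ["Q", "+", " "];

// Every allowed sample is a member of the class.
console.log(allowedSamples.every((ch) => localPartChar.test(ch)));  // true

// None of the rejected samples is: uppercase, plus signs, and spaces are not listed.
console.log(rejectedSamples.some((ch) => localPartChar.test(ch)));  // false
```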

So, what's going on with the range of characters in our second capture group?

[\da-z\.-]

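The \d is a shorthand character class that matches a digit. In JavaScript, \d is exactly equivalent to [0-9]. In some other engines, such as .NET or Python 3 string patterns, \d also matches decimal digits from other Unicode scripts, whereas [0-9] always covers only the 10 ASCII digits found on a computer keyboard.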
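How \d treats non-ASCII digits depends on the engine. The sketch below checks the behavior in JavaScript specifically, where \d covers only the ASCII digits 0 through 9; the Arabic-Indic digit and the \p{Nd} comparison are additions for illustration and do not appear in the email regex.

```javascript
const asciiDigit = /^\d$/;

// U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside the ASCII range.
const arabicIndicThree = "\u0663";

console.log(asciiDigit.test("7"));              // true
console.log(asciiDigit.test(arabicIndicThree)); // false in JavaScript

// With the u flag, the Unicode property escape \p{Nd} matches decimal digits from any script.
const unicodeDigit = /^\p{Nd}$/u;
console.log(unicodeDigit.test(arabicIndicThree)); // true
```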

Quantifiers

The quantifiers used in this regex:

+ This means the preceding element (here, a character class) must match one or more times

{2,6} This means the preceding element must match at least 2 but no more than 6 times

Notice how these quantifiers appear directly after each character class, inside the parentheses of each capture group:

[Image: quantifiers]
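The two quantifiers can be tried on their own. This sketch applies them to a simplified [a-z] class rather than the full classes from the email regex, and the sample words are made up:

```javascript
const oneOrMore = /^[a-z]+$/;     // at least one lowercase letter
const twoToSix = /^[a-z]{2,6}$/;  // between 2 and 6 lowercase letters

console.log(oneOrMore.test(""));       // false: zero characters is not "one or more"
console.log(oneOrMore.test("a"));      // true
console.log(twoToSix.test("a"));       // false: only 1 character
console.log(twoToSix.test("com"));     // true:  3 characters
console.log(twoToSix.test("museum"));  // true:  6 characters
console.log(twoToSix.test("network")); // false: 7 characters
```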

Greedy and Lazy Match

Greedy and lazy both sound bad but, in the case of regular expressions, they are just terms used to describe the behavior of quantifiers.

In our regex, we aren't using any lazy quantifiers. A quantifier is considered lazy if it matches as few characters as possible, and it is written by adding a ? after the quantifier (for example, +? or {2,6}?). The default state of quantifiers is greedy, meaning they work to match as many characters as possible.

Let's look again at the quantifiers we are using, + and {2,6}.

+ is saying "match one or more" of the preceding characters. Because it is greedy, it keeps matching for as long as it can and gives characters back only when the rest of the pattern would otherwise fail. {2,6} is also greedy: it needs at least 2 characters but will take up to 6 when they are available.
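The email regex only uses greedy quantifiers, but the contrast with lazy ones is easiest to see side by side. In this sketch the input string and variable names are invented, and the patterns are left unanchored so that each quantifier is free to choose how much to take:

```javascript
const repeatedText = "aaaaaaaa"; // eight characters

// Greedy quantifiers take as many characters as they are allowed to.
console.log(repeatedText.match(/a+/)[0]);      // "aaaaaaaa"
console.log(repeatedText.match(/a{2,6}/)[0]);  // "aaaaaa"

// Lazy versions (with a trailing ?) stop at the minimum that still matches.
console.log(repeatedText.match(/a+?/)[0]);     // "a"
console.log(repeatedText.match(/a{2,6}?/)[0]); // "aa"
```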

Putting it All Together

Now that we have broken it up piece by piece, let's look at the bigger picture.

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
  • We start with /, the delimiter that opens a regular expression literal in languages such as JavaScript
  • We place an anchor ^ signifying where the string should start
  • Next is an open paren ( to start a group capture
  • Between square brackets are characters to match in the search [a-z0-9_\.-]
  • This is followed by a + quantifier, meaning one or more characters from that set must match
  • Then we have a closing paren ) which marks the end of our first capture group
  • Next, the @ character appears, meaning a literal @ must come right after the text matched by the first group
  • Another open paren ( signaling the second capture group
  • Then comes the set of characters [\da-z\.-] to allow digits 0-9, letters a to z, periods, and hyphens
  • Next, another + quantifier
  • Closing paren ) marks the end of capture group #2
  • Next, the escaped character \. is placed after capture group 2 and before capture group 3 (think "gmail.com")
  • Another open paren ( means we are starting capture group #3
  • Then, the range [a-z\.] which allows for characters a to z and periods
  • Followed by the quantifier {2,6}, which says this last part after the period should be between 2 and 6 characters long
  • Next, closing ) concludes capture group #3
  • Then, we have the anchor $ which signifies the end of the string to be searched
  • Lastly, the closing / tells the engine "that's it, my regular expression is finished"
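The steps above can be sketched in code by assembling the same pattern from its pieces with JavaScript's RegExp constructor. The variable names and test addresses are hypothetical, and backslashes are doubled because the pieces are written as ordinary strings rather than as a regex literal.

```javascript
// Each capture group as a string; "\\" in a string becomes a single "\" in the pattern.
const localPartGroup = "([a-z0-9_\\.-]+)"; // group #1: before the @
const domainGroup = "([\\da-z\\.-]+)";     // group #2: between the @ and the .
const suffixGroup = "([a-z\\.]{2,6})";     // group #3: after the .

// Anchors, the literal @, and the escaped period glue the groups together.
const assembledRegex = new RegExp(
  "^" + localPartGroup + "@" + domainGroup + "\\." + suffixGroup + "$"
);

console.log(assembledRegex.test("lex@example.com"));   // true
console.log(assembledRegex.test("lex@example.co.uk")); // true
console.log(assembledRegex.test("Lex@example.com"));   // false: uppercase L is not in a-z
console.log(assembledRegex.test("lex@example.c"));     // false: last part is shorter than 2
```

Building the pattern this way is only for illustration; the single regex literal shown earlier behaves the same and is easier to read once each piece is familiar.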

Author

My name is Lex

I have a B.S. degree in Music Technology and a certificate in Full Stack Web Development from UC Berkeley. I love learning new things and integrating creativity with tech anywhere possible.

Check out my GitHub
