@Wombattree
Last active December 22, 2022 06:56
A regex tutorial

Regex Tutorial

Determining whether or not a string matches specific criteria is a very common problem in programming, particularly in web development where it's important to ensure that, for instance, user login details fit specific rules. This tutorial will cover a method for ensuring that an input string matches the format for an email.

Summary

This is a quick guide to how regular expressions (regex) work and how they can be used in JavaScript. For this tutorial I'll be explaining the regex below:

/^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/

While this might look like an incomprehensible mess of symbols, everything in that line of code has a specific purpose that I'll be explaining.

Regex Components

A regex is made up of a variety of components, each requiring that the input string has, or does not have, certain things.

Basics

In JavaScript, a regex literal needs to start and end with forward slashes, like so:

//

Inside those slashes go the things that the regex is trying to match. This regex, for instance, will check if an input string contains the letter "a".

/a/
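As a minimal sketch of how that check could be run, the snippet below uses the built-in test() method on two sample strings. The variable name containsA and the sample words are my own picks for illustration.

```javascript
// A regex literal sits between two forward slashes.
const containsA = /a/;

console.log(containsA.test('banana')); // true: there is an "a" somewhere in the string
console.log(containsA.test('hello'));  // false: no "a" anywhere
```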

Anchors

Anchors can be used to require that the string doesn't just contain certain elements, but contains those elements in certain positions.

The regex below, for instance, uses the anchor "^" to check if a string has an "a" as its first character, so the string "apple" will match, while "bad" will not, even though "bad" does contain an "a".

/^a/

This regex uses the anchor "$" to check if a string has an "a" as its last character, so the string "gamma" will match, while "apple" will not, even though "apple" does contain an "a".

/a$/
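Here is a small sketch that runs both anchored regexes against the words used above. The variable names startsWithA and endsWithA are just labels I've chosen.

```javascript
const startsWithA = /^a/; // "a" must be the first character
const endsWithA = /a$/;   // "a" must be the last character

console.log(startsWithA.test('apple')); // true
console.log(startsWithA.test('bad'));   // false: the "a" is not at the start
console.log(endsWithA.test('gamma'));   // true
console.log(endsWithA.test('apple'));   // false: the "a" is not at the end
```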

Quantifiers

Quantifiers are used to require parts of the string to be of a certain length or to show up a certain number of times.

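This regex expects that the string will contain an "h", followed by at least one "i" because of the "+". So the string "hiii" will match, the string "hiiiiiiii" will match, but the string "ha" won't. Note that regexes are case sensitive by default, so "Hiii" with a capital "H" would not match either.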

/hi+/

This regex expects that the string will contain an "h", followed by at least three "i". So the string "hiii" will match, the string "hiiiiiiii" will match, but the string "hii" won't. The curly brackets are written as {minimum required,maximum required}, and leaving the maximum empty means there is no upper limit.

/hi{3,}/
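The sketch below tries both quantifiers on a few lowercase strings. The last line uses a bounded range, {2,4}, which isn't one of the regexes above; I've added it only to show what a maximum does.

```javascript
const oneOrMore = /hi+/;      // "h" followed by one or more "i"
const threeOrMore = /hi{3,}/; // "h" followed by three or more "i"

console.log(oneOrMore.test('hiii'));   // true
console.log(oneOrMore.test('ha'));     // false: no "i" after the "h"
console.log(oneOrMore.test('Hiii'));   // false: capital "H" is not "h"
console.log(threeOrMore.test('hiii')); // true
console.log(threeOrMore.test('hii'));  // false: only two "i"

// With a maximum, the match stops growing once the limit is reached.
console.log('hiiiiii'.match(/hi{2,4}/)[0]); // 'hiiii'
```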

Capturing Groups

Capturing groups are parentheses that enclose sections of a regex, and they can be used to check if sections of the string fit certain criteria. They also allow quantifiers to affect everything inside the group instead of just the previous token.

This regex expects that the string will contain the substring "bunny". It can match the word "bunny" any number of times in a row, but it won't match the individual letters of "bunny", only the whole word. "bunny" will match, "bun" won't.

/(bunny)+/

This regex uses the \w token to match any word character, then the \. token (an escaped full stop) is used to find a full stop after that word. This regex will match "bunny.bunny.com" in its entirety as it contains two instances of a word followed by a full stop, finished with another word. If given the string "bunny.bunny-com" then only the "bunny.bunny" part would match, as the hyphen is not part of this regex. This is useful for checking that something looks like a domain name such as "google.com".

/(\w+\.)+\w+/
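To see what the group actually captures, here is a small sketch using string.match() on the two strings discussed above. The variable names are my own. One thing worth noticing is that when a group repeats, only its final repetition is kept in the result.

```javascript
const domainLike = /(\w+\.)+\w+/;

const full = 'bunny.bunny.com'.match(domainLike);
console.log(full[0]); // 'bunny.bunny.com' (the whole match)
console.log(full[1]); // 'bunny.' (the last repetition of the group)

const partial = 'bunny.bunny-com'.match(domainLike);
console.log(partial[0]); // 'bunny.bunny' (the match stops at the hyphen)
```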

Bracket Expressions

Square brackets can be used to indicate a set of characters, any of which can be considered a valid match.

This will match any part of the string that is the letter "a", "b", "c" or "d". The g at the end of the regex marks this regex as global, allowing it to match all instances instead of just the first. So with global the string "dogcat" will have matches at "d", "c", and "a", while without it there would only be a match at the first one, "d".

/[abcd]/g

A hyphen can be used to indicate a range, in this case matching all the letters from "a" to "z". This regex is case sensitive, so in the string "4Dogs" it would only match the letters "o", "g" and "s" (each one as a separate match).

/[a-z]/g

Multiple ranges can be declared next to each other, in this case the capital letters are also declared as valid so the string "4Dogs" would now match "Dogs", removing only the "4".

/[a-zA-Z]/g

By adding a "^" at the start of the bracket, we've made this bracket the reverse of what it previously was. Now instead of matching only letters, it'll match everything except for letters. So the string "4Dogs" would now only match "4".

/[^a-zA-Z]/g
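The bracket expressions above can be tried out with string.match(), as in this short sketch. With the g flag, match() returns an array of every individual match.

```javascript
console.log('dogcat'.match(/[abcd]/g));   // [ 'd', 'c', 'a' ]
console.log('dogcat'.match(/[abcd]/)[0]); // 'd' (no g flag, so first match only)

console.log('4Dogs'.match(/[a-z]/g));     // [ 'o', 'g', 's' ]
console.log('4Dogs'.match(/[a-zA-Z]/g));  // [ 'D', 'o', 'g', 's' ]
console.log('4Dogs'.match(/[^a-zA-Z]/g)); // [ '4' ]
```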

Character Classes

Character classes are shorthand ways of matching common things. These are just a couple of examples.

The "\d" character class will match any digit, while "\D" does the reverse. Note that character classes start with a backslash, not a forward slash.

/[\d]/g
Equivalent to
/[0-9]/g

/[\D]/g
Equivalent to
/[^0-9]/g

The "\w" character class will match any alphanumeric character (as well as the underscore) from the basic Latin alphabet. It will not match "Ñ", since an accented letter falls outside the basic A-Z range. "\W" does the reverse.

/[\w]/g
Equivalent to
/[A-Za-z0-9_]/g

/[\W]/g
Equivalent to
/[^A-Za-z0-9_]/g
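Here is a rough sketch of the four classes applied to one made-up sample string. The string is my own invention, and "Ñ" is written as the escape \u00D1 so that it is definitely a single character.

```javascript
// Sample string: a, 1, underscore, space, Ñ, hyphen, 9
const sample = 'a1_ \u00D1-9';

console.log(sample.match(/\d/g)); // [ '1', '9' ]
console.log(sample.match(/\D/g)); // [ 'a', '_', ' ', 'Ñ', '-' ]
console.log(sample.match(/\w/g)); // [ 'a', '1', '_', '9' ]
console.log(sample.match(/\W/g)); // [ ' ', 'Ñ', '-' ]
```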

Alternation

Alternation allows for a simple OR operator and is written as a "|".

Given the string "I like cats, dogs, and birds." this will match "cats" and "dogs".

/cats|dogs|turtles/gi

This will match either the American spelling "gray" or the everybody-else-in-the-entire-world spelling of "grey".

/gr(e|a)y/gi
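Both alternation regexes can be checked with a quick sketch like the one below. The second test string is one I made up to include a capitalised spelling and a deliberate misspelling.

```javascript
const sentence = 'I like cats, dogs, and birds.';
console.log(sentence.match(/cats|dogs|turtles/gi)); // [ 'cats', 'dogs' ]

// "groy" is neither alternative, so it is skipped.
console.log('gray grey GREY groy'.match(/gr(e|a)y/gi)); // [ 'gray', 'grey', 'GREY' ]
```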

Flags

Flags are ways of modifying your entire regex with a single letter. We've already seen one, the global flag "g", but here are a couple of others.

The "i" flag makes a regex case insensitive.

/[a-z]/ig
Equivalent to
/[a-zA-Z]/g

The "m" flag allows the "^" and "$" anchors to work on the first/last character of each line instead of the first/last character in the string as a whole.

Matches only the first "a" in the multiline string "Apples\nAre\nAwesome".

/^[a]/ig

While this matches the "a" at the start of each line.

/^[a]/igm
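A short sketch of the difference the "m" flag makes, using the same multiline string (the variable name lines is my own):

```javascript
const lines = 'Apples\nAre\nAwesome';

// Without "m", "^" only anchors to the very start of the string.
console.log(lines.match(/^[a]/ig));  // [ 'A' ]

// With "m", "^" anchors to the start of every line.
console.log(lines.match(/^[a]/igm)); // [ 'A', 'A', 'A' ]
```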

Character Escapes

An escape is a backslash "\" and it can be used to search for characters that are also used as special tokens.

This regex will match every character other than an "a". But what if we wanted to actually search for the "^" character instead of using it as a special token?

/[^a]/gi

This will now return anything that is either a "^" or an "a".

/[\^a]/gi
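The sketch below compares the escaped and unescaped versions on a tiny test string of my own. The last two lines are an extra illustration I've added: they show the same idea with the full stop, which otherwise matches any character.

```javascript
console.log('a^b'.match(/[^a]/gi));  // [ '^', 'b' ] (everything except "a")
console.log('a^b'.match(/[\^a]/gi)); // [ 'a', '^' ] (a literal "^" or an "a")

// An unescaped "." matches any character, an escaped one only a full stop.
console.log(/gmail.com/.test('gmailxcom'));  // true
console.log(/gmail\.com/.test('gmailxcom')); // false
```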

Putting It All To Use

Having now seen what many of the different regex components do, let's actually see how this regex functions using the example email.

/^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/
Email example: frank@gmail.com

The first token is an anchor, indicating that the regex should start checking from the beginning of the string.

^

Then we have the first capture group, as denoted by the parentheses. This will capture the first part of the email ("frank"). The brackets show that we want this first part of the email to include only lowercase letters and numbers, as well as underscores, periods and hyphens. The "+" requires one or more of those characters in a row.

([a-z0-9_.-]+)
Returns "frank"

Following the first capture group is an "@" symbol, simply requiring that the first part of the email be followed by an "@".

@

The second capture group is trying to get the "gmail" part of the example address. The "\d" will match any digit, while the "a-z" will match any lowercase letter, and finally hyphens are also allowed.

([\da-z-]+)

After the name of the website provider should be a period, and that's what the next part of the regex is checking for.

\.

The final capture group checks for the "com" part of the email. This part is expected to be between 2 and 6 lowercase letters long.

([a-z]{2,6})

The "$" anchor then ensures that nothing comes after the ".com" part of the email.

$

So to rewrite this regex into plain English we would get something like the following:

Starting at the beginning of the string, check and see if there is a substring consisting of only lowercase letters, numbers, underscores, hyphens and full stops. Following that should be an "@", then another substring using only numbers, lowercase letters and hyphens. Then that substring is followed by a full stop, and finally a third substring made of only lowercase letters and between 2 and 6 characters long, after which the string must end.
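To make the three capture groups concrete, here is a minimal sketch that pulls them out of a match with array destructuring. The function name splitEmail and the labels user, domain and tld are hypothetical names I've picked for illustration; they are not part of the regex itself.

```javascript
const emailPattern = /^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/;

// Hypothetical helper: returns the three captured parts, or null if there is no match.
function splitEmail(address) {
  const result = address.match(emailPattern);
  if (result === null) return null;
  // Index 0 is the whole match, so skip it and take the three groups.
  const [, user, domain, tld] = result;
  return { user, domain, tld };
}

console.log(splitEmail('frank@gmail.com')); // { user: 'frank', domain: 'gmail', tld: 'com' }
console.log(splitEmail('frank#gmail.com')); // null
```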

Here are some examples of strings that won't match the regex:

frank#gmail.com
The regex requires an @, not a hash

Frank@Gmail.Com
All the letters must be lowercase

frank@gmail.com4
The final substring must only contain letters

frank@gmail.hamsandwich
The final substring must be between two and six characters long
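The four failing strings above can be checked in one go with a loop like this sketch. The names emailRegex and rejected are my own.

```javascript
const emailRegex = /^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/;

const rejected = [
  'frank#gmail.com',         // "#" instead of "@"
  'Frank@Gmail.Com',         // uppercase letters
  'frank@gmail.com4',        // digit in the final part
  'frank@gmail.hamsandwich', // final part longer than six letters
];

for (const candidate of rejected) {
  console.log(candidate, emailRegex.test(candidate)); // false for every entry
}
```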

How To Use The Regex

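Here are a couple of examples of how to actually use this regex in JavaScript.

The string.match() method returns null on a failed match. On a successful match (without the "g" flag) it returns an array containing the full match, each capture group, the index where the match starts, and the original input string.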

const email = 'frank@gmail.com';
const regex = /^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/;
console.log(email.match(regex));

Will return:

[
  'frank@gmail.com',
  'frank',
  'gmail',
  'com',
  index: 0,
  input: 'frank@gmail.com',
  groups: undefined
]

const email2 = 'frank#gmail.Com5';
console.log(email2.match(regex));

Will return:

null

The regex.test() method can be used if you simply want a boolean.

const email = 'frank@gmail.com';
const regex = /^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/;
console.log(regex.test(email));

Will return:

true

const email2 = 'frank#gmail.Com5';
console.log(regex.test(email2));

Will return:

false
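If the check is needed in more than one place, it could be wrapped in a small helper, as in the sketch below. The name isValidEmail is hypothetical, and trimming surrounding whitespace before testing is my own assumption about what would be convenient for form input, not something the regex requires.

```javascript
const EMAIL_REGEX = /^([a-z0-9_.-]+)@([\da-z-]+)\.([a-z]{2,6})$/;

// Hypothetical helper: true only for strings that match the tutorial's regex.
function isValidEmail(input) {
  if (typeof input !== 'string') return false;
  // Assumption: spaces around the address are ignored rather than rejected.
  return EMAIL_REGEX.test(input.trim());
}

console.log(isValidEmail('frank@gmail.com'));   // true
console.log(isValidEmail(' frank@gmail.com ')); // true (whitespace trimmed first)
console.log(isValidEmail('frank#gmail.Com5'));  // false
console.log(isValidEmail(42));                  // false (not a string)
```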

Author

This guide was written by Alex Scrivener, you can find my GitHub here.
