@Mark33Mark
Last active December 19, 2021 11:27
Regular Expression Tutorial

RegEx_banner_pattern

Mark Watson's Notes on Regular Expressions

Table of Contents

Background: Summary, What is RegEx?, The History
RegEx Components: Anchors, Quantifiers, Grouping Constructs, Bracket Expressions, Character Classes, The OR Operator, Flags or Modifiers, Character Escapes, Example RegEx
References & Author: References, Author

Summary

To a first-time user, a Regular Expression does not look 'regular' at all.

In this tutorial I take a dive into the world of Regular Expressions (RegEx), covering the history and the major components of RegEx, followed by an example to help explain the application of RegEx in JavaScript.

I've chosen a useful regular expression that can be used when searching for an email address in text or validating a user's input:

Example RegEx

// find email addresses in text:

/\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/gi

// validate an email address:

/\b^([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/i

Do note that the focus of this summary is exclusively the use of Regular Expressions in JavaScript. For other programming languages, refer to tutorials specific to that language, as the syntax for the pattern you want to match may differ.

There are different 'flavours' of Regular Expression processors / engines. CMCDragonkai has a great Gist with a very detailed table showing which features each 'flavour' supports for each programming language:

CMCDragonkai's Gist

🔼 index


What is RegEx?

A Regular Expression (RegEx) is a pattern of characters created to either pattern-match or search and replace a given string of characters.

Regular Expressions provide a powerful, flexible, and efficient method for processing text. The extensive pattern-matching notation of regular expressions enables us to quickly parse large amounts of text to:

  • Find specific character patterns.
  • Validate text to ensure that it matches a predefined pattern, such as an email address.
  • Extract, edit, replace, or delete text substrings.
  • Add extracted strings to a collection in order to generate a report.

Regular Expressions are essential for working efficiently with strings in code and for parsing large blocks of text.
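The four uses listed above could look roughly like this in JavaScript. This is only a sketch: the sample text and the variable names are made up for illustration.

```javascript
// Sample text invented for this illustration.
const text = 'Contact alice@example.com or bob@example.org before 2021-12-19.';

// 1. Find specific character patterns (every run of digits).
const numbers = text.match(/\d+/g);

// 2. Validate that a string matches a predefined pattern (a YYYY-MM-DD shape).
const looksLikeDate = /^\d{4}-\d{2}-\d{2}$/.test('2021-12-19');

// 3. Replace text substrings (mask every digit with "#").
const masked = text.replace(/\d/g, '#');

// 4. Add extracted strings to a collection (the domain after each "@").
const domains = Array.from(text.matchAll(/@([\da-z.-]+)/gi), (m) => m[1]);

console.log(numbers);       // [ '2021', '12', '19' ]
console.log(looksLikeDate); // true
console.log(masked);
console.log(domains);       // [ 'example.com', 'example.org' ]
```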

🔼 index


Why? A History of RegEx

Sometimes it helps to have an appreciation of the origins of tools we frequently use to deepen our understanding of why they exist. Following is a history I've compiled from my review of Wikipedia's Regular Expression page.

1951

Mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular events.

1965 - 1966

Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files (QED, i.e. Quick EDitor, is a line-oriented computer text editor developed by Butler Lampson and L. Peter Deutsch for the Berkeley Timesharing System running on the SDS 940).

~1966

Ken Thompson implemented regular expression matching by just-in-time compilation (JIT) to IBM 7094 code on the Compatible Time-Sharing System, an important early example of JIT compilation.

Regular Expression matching was then added to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions.

"grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p meaning "Global search for Regular Expression and Print matching lines".

Around the same time that Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that is used for lexical analysis in compiler design.

1968

Regular expressions entered popular use in two areas:

  1. pattern matching in a text editor; and
  2. lexical analysis in a compiler.

1970s

Many variations of the original forms of Regular Expressions were used in Unix programs at Bell Labs, including vi, lex, sed, AWK, and expr, and in other programs such as Emacs.

1980s

More complicated Regular Expressions arose in Perl, whose regex support was originally derived from a regex library written by Henry Spencer (1986), who later wrote an implementation of Advanced Regular Expressions for Tcl. The Tcl library is a hybrid NFA/DFA implementation with improved performance characteristics.

PostgreSQL adopted Spencer's Tcl regular expression implementation. Perl later expanded on Spencer's original library to add many new features.

Part of the effort in the design of Raku (formerly named Perl 6) is to improve Perl's regex integration, and to increase the scope and capabilities to allow the definition of parsing expression grammars. The result is a mini-language called Raku rules, which are used to define Raku grammar as well as provide a tool to programmers in the language. These rules maintain existing features of Perl 5.x regexes, but also allow BNF-style definition of a recursive descent parser via sub-rules.

1992

Regular Expressions were subsequently adopted by a wide range of programs, with these early forms standardized in the POSIX.2 standard.

1997

Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regex functionality and is used by many modern tools including PHP and Apache HTTP Server.

2010s

An implementation of regex functionality is often called a regex engine, and a number of libraries are available for reuse. Several companies started to offer hardware implementations of PCRE (Perl Compatible Regular Expressions) compatible regex engines that are faster than CPU implementations. Examples include FPGA implementations (a Field Programmable Gate Array is an integrated circuit designed to be configured by a customer after manufacturing) and GPU implementations.

Today

Regular Expressions are widely supported in programming languages, text processing programs (particularly lexers), advanced text editors, and some other programs.

Regular Expression support is part of the standard library of many programming languages, including Java and Python, and is built into the syntax of others, including Perl and ECMAScript (JavaScript).

🔼 index


Regular Expressions - JavaScript

The Syntax

In the JavaScript universe, a RegExp object is a pattern with properties and methods and you call the constructor function as follows:

let re = new RegExp('ab+c');

The other way, probably the more frequently used method, is to use what is called a regular expression literal, which encloses the search pattern between forward slashes:

let re = /ab+c/;


/ REGULAR EXPRESSION PATTERN GOES HERE BETWEEN THE 2 FORWARD SLASHES / REGULAR EXPRESSION MODIFIER (FLAG) GOES HERE AFTER THE 2nd FORWARD SLASH ;
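As a small sketch of the two forms side by side (the variable names are only for illustration), the literal and the constructor should produce equivalent RegExp objects. The main difference to watch for is that a backslash has to be doubled when the pattern is written as a string:

```javascript
// Literal form: pattern between two forward slashes, flags after the second slash.
const literal = /ab+c/i;

// Constructor form: pattern and flags are passed as strings.
const constructed = new RegExp('ab+c', 'i');

console.log(literal.source === constructed.source); // true
console.log(literal.flags === constructed.flags);   // true
console.log(constructed.test('xABBBCx'));           // true

// Inside a string the backslash is itself an escape character,
// so \d has to be written as '\\d' for the constructor.
const digitsLiteral = /\d+/;
const digitsConstructed = new RegExp('\\d+');
console.log(digitsLiteral.source === digitsConstructed.source); // true
```

The constructor form is handy when the pattern is only known at run time, for example when it is built from user input.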

🔼 index


Anchors

Anchors are unique as they match a position within a string, not a character. They match a position before, after, or between characters. They can be used to “anchor” the regex match at a certain position. A regex that consists solely of an anchor can only find zero-length matches.

Examples:

The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b does not match abc at all, because the b cannot be matched right after the start of the string, matched by ^.

The $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.

Anchor Description
^ Finds the beginning of a string or the beginning of a line if the multi-line (m) flag is enabled. Matches a position, NOT a character.
$ Finds the end of the string or the end of a line if the multi-line (m) flag is enabled. Matches a position, NOT a character.
\b Matches a word boundary: a position between a word character and a non-word character, or between a word character and the start / end of the string. Use it at the beginning of a word pattern ( \bHI ) or at the end ( HI\b ).
\B Opposite of \b - matches any position that is NOT a word boundary, for example between two word characters.
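The anchors in the table can be tried out with a short sketch like the one below. The test string 'HI THIS CHI' is made up so that "HI" appears as a whole word, inside a word, and at the end of a word.

```javascript
// ^ and $ from the examples above.
console.log(/^a/.test('abc')); // true
console.log(/^b/.test('abc')); // false
console.log(/c$/.test('abc')); // true
console.log(/a$/.test('abc')); // false

// A regex made of only an anchor finds a zero-length match.
const emptyMatch = 'abc'.match(/^/);
console.log(emptyMatch[0].length); // 0

// \b and \B: mark where each pattern matches by wrapping the match in brackets.
const words = 'HI THIS CHI';
const atWordStart = words.replace(/\bHI/g, '[HI]');
const atWordEnd = words.replace(/HI\b/g, '[HI]');
const insideWord = words.replace(/\BHI/g, '[HI]');

console.log(atWordStart); // [HI] THIS CHI
console.log(atWordEnd);   // [HI] THIS C[HI]
console.log(insideWord);  // HI T[HI]S C[HI]
```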

🔼 index


Quantifiers

Quantifiers indicate to the processor that the preceding token must be matched a certain number of times. By default, quantifiers will try to match as many characters as possible (greedy).

Quantifier Description
+ Matches 1 or more of the preceding pattern.
* Matches 0 or more of the preceding pattern.
{l,h} Matches at least "l" but not more than "h" repetitions of the preceding token, for example {2,3} will match 2 to 3; {3} will match exactly 3; and {3,} will match 3 or more. Write the braces without spaces, as JavaScript does not treat braces containing spaces as a quantifier.
? Finds 0 to 1 of the character pattern preceding it, effectively making the character pattern preceding it optional.
(quantifier)? A ? placed directly after another quantifier, for example +? or *?, makes that quantifier lazy, causing it to match as few characters as possible. By default quantifiers are greedy, and match as many characters as possible.
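A quick way to see the difference between greedy and lazy matching, along with the other quantifiers in the table, is a sketch like this (the sample strings are invented for illustration):

```javascript
const html = '<b>bold</b> and <i>italic</i>';

// Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
const greedy = html.match(/<.+>/)[0];

// Lazy: .+? grabs as little as it can, so the match stops at the FIRST ">".
const lazy = html.match(/<.+?>/)[0];

console.log(greedy); // <b>bold</b> and <i>italic</i>
console.log(lazy);   // <b>

// + needs at least one "b", * accepts none.
console.log(/ab+c/.test('ac')); // false
console.log(/ab*c/.test('ac')); // true

// ? makes the preceding "u" optional.
console.log(/^colou?r$/.test('color'), /^colou?r$/.test('colour')); // true true

// {2,3} accepts two or three repetitions only.
const runs = 'a aa aaa aaaa'.match(/\ba{2,3}\b/g);
console.log(runs); // [ 'aa', 'aaa' ]
```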

🔼 index


Grouping Constructs

Groups allow you to combine a sequence of tokens to operate on them together. Capture groups can be referenced by a backreference and accessed separately in the results.

Constructs Description
(ABC) Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference.
(?<name>ABC) Creates a capturing group that can be referenced via a specified name.
\1 Matches the results of a capture group. For example \1 matches the results of the first capture group and \3 matches the third.
(?:ABC) Groups multiple tokens together without creating a capture group.
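Each construct in the table can be sketched in a line or two of JavaScript. The date and sentence below are sample data chosen for illustration, and the group names year, month and day are my own.

```javascript
// Numbered capture groups are available on the match result by index.
const parts = '2021-12-19'.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(parts[1], parts[2], parts[3]); // 2021 12 19

// Named capture groups are available under .groups.
const named = '2021-12-19'.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
console.log(named.groups.year, named.groups.day); // 2021 19

// A backreference (\1) matches the same text the first group captured,
// which makes it possible to spot a doubled word.
const repeated = 'this is is a test'.match(/\b(\w+) \1\b/);
console.log(repeated[0]); // is is

// A non-capturing group still groups tokens for the quantifier,
// but adds no extra entry to the result.
const laugh = 'hahaha!'.match(/(?:ha)+/);
console.log(laugh[0], laugh.length); // hahaha 1
```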

🔼 index


Bracket Expressions

Bracket expressions [] are used to show character classes. They are essentially grouping a character class together and including the OR operator. For example [abc] tells the search to match "a" or "b" or "c".

Expressions Description
[ ] Character class. Matches any character contained between the square brackets.
[^ ] Negated character class. Matches any character that is not contained between the square brackets.
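As a minimal sketch of the two bracket forms (the sample strings are made up), replacing every match with a dash shows which characters each expression picks out:

```javascript
// [abc] matches "a" or "b" or "c".
const onlyAbc = 'abcdef'.replace(/[abc]/g, '-');

// [^abc] matches anything that is NOT "a", "b" or "c".
const notAbc = 'abcdef'.replace(/[^abc]/g, '-');

console.log(onlyAbc); // ---def
console.log(notAbc);  // abc---

// A bracket expression stands for exactly one character in the pattern.
const spellings = 'gray grey groy'.match(/gr[ae]y/g);
console.log(spellings); // [ 'gray', 'grey' ]
```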

🔼 index


Character Classes

A character class is a special notation that matches any symbol from a certain set.

Character Class Description
[ABC] Matches any character in the set.
[^ABC] Negated set.
[A-Z] Matches a character having a character code between the two specified characters inclusive.
. Period or Full Stop - finds any single character - DOES NOT find newline or line terminator.
\w Matches word characters: letters, digits and the underscore, equivalent to [A-Za-z0-9_]. Matches low-ascii characters only, NOT accented or non-roman characters.
\W Opposite of \w - matches any character that is not a word character - including accented or non-roman characters.
\d Matches any digit character: [0-9]
\D Opposite of \d - matches ANY character that is not a digit.
\s Matches any whitespace character - spaces, tabs, line breaks.
\S Opposite of \s - matches any character that is not a whitespace character.
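The shorthand classes can be checked against a small sample string. The string below is invented for illustration and deliberately contains a tab, an underscore and an accented character (written as \u00e9 so the code stays plain ASCII).

```javascript
// "Room 42,<TAB>Level_3 café"
const sample = 'Room 42,\tLevel_3 caf\u00e9';

// \d matches digits only.
const digits = sample.match(/\d/g);

// \w+ matches runs of letters, digits and underscores.
// The accented "é" is not a word character, so "café" is cut short.
const wordRuns = sample.match(/\w+/g);

// \s matches the two spaces and the tab.
const spaces = sample.match(/\s/g);

console.log(digits);        // [ '4', '2', '3' ]
console.log(wordRuns);      // [ 'Room', '42', 'Level_3', 'caf' ]
console.log(spaces.length); // 3

// The period matches any character except a line terminator.
const dots = 'a\nb'.match(/./g);
console.log(dots); // [ 'a', 'b' ]

// A range matches by character code, so "C" falls within A-F but "G" does not.
console.log(/[A-F]/.test('C'), /[A-F]/.test('G')); // true false
```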

🔼 index


The OR Operator

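The OR operator (alternation) is similar to a bracket expression, except that it chooses between whole expressions instead of single characters. You can use alternation to match a single regular expression out of several possible regular expressions.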

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping.

Example:

To match whole words only, use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either cat or dog, and then another word boundary.

Operator Description
| Alternation - acts like the boolean OR, matches the expression before or after the |. It can operate in a group or on a whole expression, the patterns will be tested in order.
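The low precedence of the vertical bar is easy to trip over, so here is a small sketch of it. The names loose and strict and the sample words are only for illustration.

```javascript
// Without parentheses the bar splits the WHOLE pattern:
// "starts with cat" OR "ends with dog".
const loose = /^cat|dog$/;

// With parentheses the anchors apply to both alternatives:
// exactly "cat" or exactly "dog".
const strict = /^(cat|dog)$/;

console.log(loose.test('catalog'), strict.test('catalog')); // true false
console.log(loose.test('hotdog'), strict.test('hotdog'));   // true false
console.log(strict.test('dog'));                            // true

// Whole words only, as in the example above.
const pets = 'cat concat dog'.match(/\b(cat|dog)\b/g);
console.log(pets); // [ 'cat', 'dog' ]

// Alternatives are tested in order, so the first one that fits wins.
const first = 'catfish'.match(/cat|catfish/)[0];
console.log(first); // cat
```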

🔼 index


Flags or Modifiers

Modifiers tell the processor how to proceed when searching for the Regular Expression pattern.

Flags / Modifiers Description
g Global match - finds all matches - DOES NOT stop search after first match.
i Matches without case sensitivity.
m Multi-line mode - makes the ^ and $ anchors match at the start and end of every line, not only at the start and end of the whole string.
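The effect of each flag shows up clearly when the same text is searched with and without it. A rough sketch, using a made-up three-line string:

```javascript
const lines = 'First line\nsecond LINE\nthird line';

// No flags: the search stops at the first match.
const firstOnly = lines.match(/line/);

// g: every match is returned (still case sensitive, so "LINE" is skipped).
const allLower = lines.match(/line/g);

// g + i: case is ignored as well.
const anyCase = lines.match(/line/gi);

console.log(firstOnly[0], firstOnly.index); // line 6
console.log(allLower); // [ 'line', 'line' ]
console.log(anyCase);  // [ 'line', 'LINE', 'line' ]

// Without m, ^ only matches at the very start of the string.
const startOfString = lines.match(/^\w+/g);

// With m, ^ also matches at the start of every line.
const startOfLines = lines.match(/^\w+/gm);

console.log(startOfString); // [ 'First' ]
console.log(startOfLines);  // [ 'First', 'second', 'third' ]
```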

🔼 index


Character Escapes

Escape sequences can be used to insert reserved, special, and Unicode characters. All escaped characters begin with the \ character.

Character escapes are used to switch from a literal search to a special search, and vice versa.

Character Escapes Description
\+ The characters + * ? ^ $ \ . [ ] { } ( ) | / are all regular expression operators (special characters) and must be preceded by a \ (backslash) to represent the literal version of the character. In a character set, only \, -, and ] need to be escaped.
\000 Octal escaped character in the form \000, for example © would be searched with \251. The value must not exceed 255 (\377).
\xFF Hexadecimal escaped character, example \xA9 is for the character ©
\uFFFF Unicode escaped character, example \u00A9 is for the character ©
\u{FFFF} Extended Unicode escaped character. Supports the full range of Unicode code point escapes with any number of hex digits. Requires the unicode flag / modifier u to be set.
\cI Escaped control character. This can range from \cA (SOH, char code 1) to \cZ (SUB, char code 26).
\t Matches a TAB character (char code 9).
\n Matches a line feed character with a char code = 10.
\v Matches a VERTICAL TAB character (char code 11).
\f Matches a FORM FEED character (char code 12).
\r Matches a CARRIAGE RETURN character (char code 13).
\0 This is a zero, not the letter O. It matches a NULL character (char code 0).
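A short sketch of escaping in practice follows. The helper escapeRegExp is not built into the language under that name; it is a hypothetical name for a commonly used helper that backslash-escapes every special character, so that arbitrary text can be searched for literally.

```javascript
// Unescaped, "+" is a quantifier: /1+1/ looks for "11", "111", ...
console.log(/1+1/.test('1+1'));  // false
// Escaped, "\+" is a literal plus sign.
console.log(/1\+1/.test('1+1')); // true

// Hexadecimal and Unicode escapes for the copyright sign.
console.log(/\xA9/.test('\u00A9'), /\u00A9/.test('\u00A9')); // true true

// Code point escapes need the u flag (U+1F600 is an emoji).
console.log(/\u{1F600}/u.test('\u{1F600}')); // true

// \t matches a TAB character.
console.log(/a\tb/.test('a\tb')); // true

// Hypothetical helper: put a backslash in front of every special character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const userInput = '1+1=2?';
const safePattern = new RegExp(escapeRegExp(userInput));
console.log(escapeRegExp(userInput));           // 1\+1=2\?
console.log(safePattern.test('is 1+1=2? yes')); // true
```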

🔼 index


Regular Expression Example

I've selected an example that has a lot of everyday programming applications. The following Regular Expression will match or validate email addresses in a document or an input box:

/\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/gi

What does it mean?

Anchors

\b = matches a word boundary position between a word character and non-word character or position (start / end of string).

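For validation, add the "^" anchor after the first "\b" anchor so the match has to begin at the very start of the input. This rejects input with whitespace or new lines in front of the email. Note that the end of the pattern is not anchored, so text following a valid email is still accepted; add "$" at the end of the pattern if the whole input must be an email address. You also don't need the global "g" flag.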
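To see what the "^" anchor does and does not check, here is a small sketch. The first pattern is the validation pattern from the Summary; the second, with "$" at the end, is my own assumed variant for whole-input validation and is not part of the original example.

```javascript
// Validation pattern from the Summary: anchored at the start only.
const startAnchored = /\b^([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/i;

// Assumed variant: anchored at both ends so nothing may follow the address.
const fullyAnchored = /^([\da-z._%+-]+)@([\da-z.-]+)\.([a-z]{2,10})$/i;

console.log(startAnchored.test('user@example.com'));  // true
console.log(startAnchored.test(' user@example.com')); // false (leading space)

// Trailing text slips through when only the start is anchored.
console.log(startAnchored.test('user@example.com is mine')); // true
console.log(fullyAnchored.test('user@example.com is mine')); // false
console.log(fullyAnchored.test('user@example.com'));         // true
```

Neither pattern has the "g" flag. That is deliberate: a global regex remembers its position between calls to test(), which can give surprising results when the same regex validates several inputs.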


Grouping Constructs

This example has 3 groups:

Group 1

([\da-z\._%+-]+)

Group 2

([\da-z\.-]+)

Group 3

([a-z]{2,10})

Characters

@ = located between Group 1 and Group 2 character sets, instructs the Regular Expression engine to look for the "@" character in between groups 1 and 2.

\. = located between Group 2 and Group 3 character sets, instructs the Regular Expression engine to look for a "." character in between groups 2 and 3. The backslash is needed to escape the character, as an unescaped "." has a special purpose; refer to Character Classes.
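The three groups can be pulled out of each match in JavaScript. The sketch below uses the example pattern as written; the sample sentence and the property names (address, local, domain, tld) are made up for illustration.

```javascript
const emailPattern = /\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/gi;

const text = 'Write to Jane.Doe@Mail.Example.com or support@example.org today.';

// matchAll yields one result per match; indexes 1 to 3 hold the three groups.
const found = Array.from(text.matchAll(emailPattern), (m) => ({
  address: m[0], // the whole match
  local: m[1],   // Group 1: everything before the "@"
  domain: m[2],  // Group 2: between the "@" and the last "."
  tld: m[3],     // Group 3: after the last "."
}));

console.log(found.length); // 2
console.log(found[0]);
console.log(found[1]);
```

For the first address, Group 2 comes out as "Mail.Example" and Group 3 as "com", because the greedy Group 2 gives back only as much as is needed for the final "." and Group 3 to match.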


Flag or Modifier

g = global, search continues after 1st match.

i = case insensitive, which shortens the Regular Expression as you can state a-z characters instead of A-Za-z for the search criteria.


Bracket Expression, Character Class and Quantifier

[\da-z_.%+-]+

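Searches for any of the characters "a-z", "0-9", "_", ".", "%", "+", "-". Inside the square brackets the "." does not need to be escaped, so this is equivalent to Group 1 as written above.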

The Character Class "\d" instructs RegEx to search for any digits 0-9.

The quantifier "+" outside the square brackets instructs RegEx that one of these must exist at least once (but could have more than one of these components in the string).

[\da-z.-]+

Searches for any of the literal characters "0-9", "a-z", ".", "-".

The quantifier "+" instructs RegEx that one of these must exist at least once (but could have more than one of these components in the string).

[a-z]{2,10}

Searches for any of the characters "a-z".

The Quantifier "{2,10}" instructs RegEx that one of these must exist at least two times and no more than 10 times.

Note: Whilst this quantifier will work for most domains, there are now many domain alternatives that extend past 10 characters. This is a good example of choosing the right Regular Expression setting for your search or validation. A better Quantifier to use would be "{2,}". Have a look at this picture showing some of the alternatives to .com :

alternatives_to_com
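The effect of the upper limit can be checked with a long domain ending. In this sketch the first pattern is the example as written (without the "g" flag); the second, with "{2,}", is the suggested alternative. The address is made up, and "photography" is used because it has 11 letters.

```javascript
// Example pattern as written: the ending is limited to 10 letters.
const capped = /\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/i;

// Suggested alternative: 2 or more letters, no upper limit.
const uncapped = /\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,})\b/i;

const shortEnding = 'info@example.com';
const longEnding = 'info@example.photography';

console.log(capped.test(shortEnding), uncapped.test(shortEnding)); // true true
console.log(capped.test(longEnding), uncapped.test(longEnding));   // false true
```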

🔼 index


References

There is so much more to Regular Expressions than what I've covered here. Following are the various sources I referred to for this Gist.

I highly recommend RegExr. It is a great website to test your understanding of Regular Expressions and test if your Regular Expression will actually find or validate what you want using the 'sandbox' environment in the website.

🔼 index


Author

Mark Watson is a programmer currently focused on learning JavaScript. If you have questions or would like to connect with me, please use one of the following:

🔼 index

