Mark Watson's Notes on Regular Expressions
Table of Contents
Background | RegEx Components | References & Author |
---|---|---|
Summary | Anchors | References |
What is Regex? | Quantifiers | Author |
The History | Grouping Constructs | |
| Bracket Expressions | |
| Character Classes | |
| The OR Operator | |
| Flags or Modifiers | |
| Character Escapes | |
| Example RegEx | |
Summary
A Regular Expression to a first time user, on instinct, does not appear 'regular' at all.
In this tutorial I take a dive into the world of Regular Expressions (RegEx), covering the history and the major components of RegEx, followed by an example to help explain the application of RegEx in JavaScript.
I've chosen a useful regular expression that can be used when searching for an email address in text or validating a user's input:
// find email addresses in text:
/\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/gi
// validate an email address:
/^([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})$/i
Do note that the focus of this summary is exclusively the use of Regular Expressions in JavaScript. For other programming languages, refer to tutorials that cover the regular expression syntax of the language you have selected.
There are different 'flavours' of Regular Expression processors / engines. CMCDragonkai has a great Gist with a very detailed table showing what each 'flavour' supports in each programming language.
What is RegEx?
A Regular Expression (RegEx) is a pattern of characters created to either pattern-match or search and replace a given string of characters.
Regular Expressions provide a powerful, flexible, and efficient method for processing text. The extensive pattern-matching notation of regular expressions enables us to quickly parse large amounts of text to:
- Find specific character patterns.
- Validate text to ensure that it matches a predefined pattern, such as an email address.
- Extract, edit, replace, or delete text substrings.
- Add extracted strings to a collection in order to generate a report.
Regular Expressions are essential for working with strings programmatically and for parsing large blocks of text efficiently.
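The uses listed above can be sketched in a few lines of JavaScript. This is only an illustration: the sample sentence and the date pattern are my own assumptions, not part of the email example covered later.

```javascript
// Hypothetical sample text, used only for illustration.
const sampleText = 'Order 66 shipped on 2024-01-15; order 67 ships on 2024-02-01.';

// Find a specific character pattern: is there a yyyy-mm-dd style date?
const hasDate = /\d{4}-\d{2}-\d{2}/.test(sampleText);

// Extract every match into a collection (here, an array).
const allDates = sampleText.match(/\d{4}-\d{2}-\d{2}/g);

// Replace the matched substrings.
const redacted = sampleText.replace(/\d{4}-\d{2}-\d{2}/g, '[date]');

console.log(hasDate);  // true
console.log(allDates); // [ '2024-01-15', '2024-02-01' ]
console.log(redacted); // Order 66 shipped on [date]; order 67 ships on [date].
```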
Why? A History of RegEx
Sometimes it helps to have an appreciation of the origins of tools we frequently use to deepen our understanding of why they exist. Following is a history I've compiled from my review of Wikipedia's Regular Expression page.
1951
Mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular events.
1965 - 1966
Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. (QED, the Quick EDitor, is a line-oriented computer text editor developed by Butler Lampson and L. Peter Deutsch for the Berkeley Timesharing System running on the SDS 940.)
~1966
Ken Thompson implemented regular expression matching by just-in-time compilation (JIT) to IBM 7094 code on the Compatible Time-Sharing System, an important early example of JIT compilation.
Regular Expression matching was then added to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions.
"grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p meaning "Global search for Regular Expression and Print matching lines".
Around the same time when Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that is used for lexical analysis in compiler design.
1968
Regular expressions entered popular use in two areas:
- pattern matching in a text editor; and
- lexical analysis in a compiler.
1970's
Many variations of the original forms of Regular Expressions were used in Unix programs at Bell Labs, including vi, lex, sed, AWK, and expr, and in other programs such as Emacs.
1980's
More complicated Regular Expressions arose in Perl, which originally derived from a regex library written by Henry Spencer (1986), who later wrote an implementation of Advanced Regular Expressions for Tcl. The Tcl library is a hybrid NFA/DFA implementation with improved performance characteristics.
PostgreSQL adopts Spencer's Tcl regular expression implementation. Perl later expands on Spencer's original library to add many new features.
Part of the effort in the design of Raku (formerly named Perl 6) is to improve Perl's regex integration, and to increase the scope and capabilities to allow the definition of parsing expression grammars. The result is a mini-language called Raku rules, which are used to define Raku grammar as well as provide a tool to programmers in the language. These rules maintain existing features of Perl 5.x regexes, but also allow BNF-style definition of a recursive descent parser via sub-rules.
1992
Regular Expressions were subsequently adopted by a wide range of programs, with these early forms standardized in the POSIX.2 standard.
1997
Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regex functionality and is used by many modern tools including PHP and Apache HTTP Server.
2010's
An implementation of regex functionality is often called a regex engine, and a number of libraries are available for reuse. Several companies started to offer hardware compatible regex engines, faster than CPU implementations. Some examples were FPGA (a Field Programmable Gate Array is an integrated circuit designed to be configured by a customer after manufacturing) and GPU implementations of PCRE (Perl Compatible Regular Expressions).
Today
Regular Expressions are widely supported in programming languages, text processing programs (particularly lexers), advanced text editors, and some other programs.
Regular Expression support is part of the standard library of many programming languages, including Java and Python, and is built into the syntax of others, including Perl and ECMAScript (JavaScript).
Regular Expressions - JavaScript
The Syntax
In the JavaScript universe, a RegExp object is a pattern with properties and methods and you call the constructor function as follows:
let re = new RegExp('ab+c');
The other way, and probably the more frequently used one, is a regular expression literal, which encloses the search pattern between forward slashes. Spaces inside the slashes are part of the pattern, so do not pad it:
let re = /ab+c/;
/ REGULAR EXPRESSION PATTERN GOES HERE BETWEEN THE 2 FORWARD SLASHES / REGULAR EXPRESSION MODIFIER (FLAG) GOES HERE AFTER THE 2nd FORWARD SLASH ;
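As a small sketch of the two forms side by side (the variable names and test strings are my own, chosen only for illustration), both build the same pattern. One difference worth noticing is that a backslash has to be doubled inside the string passed to the constructor.

```javascript
// Two equivalent ways to build the same pattern, here with the i flag.
const fromLiteral = /ab+c/i;
const fromConstructor = new RegExp('ab+c', 'i');

console.log(fromLiteral.source === fromConstructor.source); // true
console.log(fromLiteral.test('xxABBBCxx')); // true
console.log(fromConstructor.test('ac'));    // false: "b+" needs at least one b

// In a string passed to the constructor, backslashes must be doubled.
const digitsLiteral = /\d+/;
const digitsConstructor = new RegExp('\\d+');
console.log(digitsLiteral.source === digitsConstructor.source); // true
```

The constructor form is mostly useful when the pattern is only known at run time, for example when it is built from user input.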
Anchors
Anchors are unique as they match a position within a string, not a character. They match a position before, after, or between characters. They can be used to “anchor” the regex match at a certain position. A regex that consists solely of an anchor can only find zero-length matches.
Examples:
The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b does not match abc at all, because the b cannot be matched right after the start of the string, matched by ^.
The $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.
Anchor | Description |
---|---|
^ | Finds the beginning of a string or the beginning of a line if the multi-line (m) flag is enabled. Matches a position, NOT a character. |
$ | Finds the end of the string or the end of a line if the multi-line (m) flag is enabled. Matches a position, NOT a character. |
\b | Matches a word boundary: the position between a word character (\w) and a non-word character, or the start / end of the string. For example \bHI matches HI at the beginning of a word and HI\b matches HI at the end of a word. |
\B | Opposite of \b - matches any position that is NOT a word boundary. For example \BHI\B matches HI only in the middle of a word. |
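The anchors in the table can be tried out as follows. This is a minimal sketch; the strings 'abc', 'HI HIGH SHIP' and 'one\ntwo' are my own test inputs.

```javascript
// ^ and $ anchor the match to the start and end of the string.
const startsWithA = /^a/.test('abc'); // true
const startsWithB = /^b/.test('abc'); // false: b is not at the start
const endsWithC = /c$/.test('abc');   // true

// A pattern made only of an anchor produces a zero-length match.
const inserted = 'abc'.replace(/^/, '>'); // '>abc'

// \b matches at word edges, \B everywhere else.
const words = 'HI HIGH SHIP';
const wholeWord = words.match(/\bHI\b/g);   // [ 'HI' ] (the standalone word)
const insideWord = words.match(/\BHI\B/g);  // [ 'HI' ] (the one inside SHIP)
const insideIndex = words.search(/\BHI\B/); // 9

// With the m flag, ^ also matches right after each line break.
const lineStarts = 'one\ntwo'.match(/^\w+/gm); // [ 'one', 'two' ]

console.log(startsWithA, startsWithB, endsWithC, inserted);
console.log(wholeWord, insideWord, insideIndex, lineStarts);
```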
Quantifiers
Quantifiers indicate to the processor that the preceding token must be matched a certain number of times. By default, quantifiers will try to match as many characters as possible (greedy).
Quantifier | Description |
---|---|
+ | Matches 1 or more of the preceding pattern. |
* | Matches 0 or more of the preceding pattern. |
{l,h} | Matches at least "l" but not more than "h" repetitions of the preceding token, for example {2,3} will match 2 to 3; {3} will match exactly 3; and {3,} will match 3 or more. Do not put spaces inside the braces. |
? | Matches 0 or 1 of the token preceding it, effectively making that token optional. |
+? *? ?? {l,h}? | A ? placed directly after another quantifier makes that quantifier lazy, causing it to match as few characters as possible. By default quantifiers are greedy, and match as many characters as possible. |
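To see the difference between greedy and lazy matching, and how the other quantifiers behave, here is a short sketch. The sample strings are my own assumptions, chosen only to make the differences visible.

```javascript
const htmlSnippet = '<b>one</b><b>two</b>';

// Greedy: .+ grabs as much as it can, so the match runs to the LAST </b>.
const greedy = htmlSnippet.match(/<b>.+<\/b>/)[0]; // '<b>one</b><b>two</b>'

// Lazy: .+? stops at the FIRST </b> it can reach.
const lazy = htmlSnippet.match(/<b>.+?<\/b>/)[0];  // '<b>one</b>'

// ? makes the preceding token optional.
const optionalU = ['color', 'colour', 'colouur'].map(w => /^colou?r$/.test(w)); // [ true, true, false ]

// {2,3} matches two or three repetitions.
const ranged = 'a aa aaa aaaa'.match(/\ba{2,3}\b/g); // [ 'aa', 'aaa' ]

// With spaces inside the braces, JavaScript (without the u flag) treats
// the braces as literal text, not as a quantifier.
const spacedBraces = /a{ 2,3 }/.test('aa'); // false

console.log(greedy, lazy, optionalU, ranged, spacedBraces);
```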
Grouping Constructs
Groups allow you to combine a sequence of tokens to operate on them together. Capture groups can be referenced by a backreference and accessed separately in the results.
Constructs | Description |
---|---|
(ABC) | Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference. |
(?<name>ABC) | Creates a capturing group that can be referenced via a specified name. |
\1 | Matches the results of a capture group. For example \1 matches the results of the first capture group and \3 matches the third. |
(?:ABC) | Groups multiple tokens together without creating a capture group. |
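The constructs in the table can be sketched like this. The date string and the sentence with a doubled word are my own examples; named groups need an engine that supports ES2018 or later.

```javascript
const isoDate = '2024-01-15';

// Numbered capture groups are available by index on the match result.
const numbered = isoDate.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(numbered[1], numbered[2], numbered[3]); // 2024 01 15

// Named capture groups are available on the .groups property.
const namedMatch = isoDate.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
console.log(namedMatch.groups.year); // 2024

// Capture groups can be reused in a replacement string as $1, $2, ...
const swapped = isoDate.replace(/(\d{4})-(\d{2})-(\d{2})/, '$3/$2/$1'); // '15/01/2024'

// A backreference (\1) matches the same text the first group captured.
const repeated = 'this is is a test'.match(/\b(\w+) \1\b/);
console.log(repeated[0]); // is is

// A non-capturing group only groups; it adds nothing to the result array.
const nonCapturing = 'ababab'.match(/(?:ab)+/);
console.log(nonCapturing.length); // 1
```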
Bracket Expressions
Bracket expressions [] are used to show character classes. They are essentially grouping a character class together and including the OR operator. For example [abc] tells the search to match "a" or "b" or "c".
Expressions | Description |
---|---|
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character that is not contained between the square brackets |
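A quick sketch of a bracket expression and its negated form, using test strings of my own choosing:

```javascript
// [aeiou] matches any ONE of the listed characters.
const vowels = 'regex'.match(/[aeiou]/g);     // [ 'e', 'e' ]

// [^aeiou] matches any ONE character that is NOT listed.
const nonVowels = 'regex'.match(/[^aeiou]/g); // [ 'r', 'g', 'x' ]

// Ranges and single characters can be mixed inside one set.
const hexDigits = 'ff-0G'.match(/[0-9a-f]/gi); // [ 'f', 'f', '0' ]

console.log(vowels, nonVowels, hexDigits);
```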
Character Classes
A character class is a special notation that matches any symbol from a certain set.
Character Class | Description |
---|---|
[ABC] | Matches any character in the set. |
[^ABC] | Negated set. |
[A-Z] | Matches a character having a character code between the two specified characters inclusive. |
. | Period or Full Stop - matches any single character EXCEPT a newline or other line terminator (unless the dotAll flag s is set). |
\w | Matches any word character - a letter, a digit or the underscore, the same as [A-Za-z0-9_]. Matches low-ascii characters, NOT accented or non-roman characters. |
\W | Opposite of \w - matches any character that is not a word character - including accented or non-roman characters. |
\d | Matches any digit character: [0-9] |
\D | Opposite of \d - matches ANY character that is not a digit. |
\s | Matches any whitespace character - spaces, tabs, line breaks. |
\S | Opposite of \s - matches any character that is not a whitespace character. |
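The character classes above can be checked against a small sample. The string below ("Café 42" followed by a newline) is my own test input, written with a \u escape so the accented character is unambiguous; the s flag needs ES2018 or later.

```javascript
const classSample = 'Caf\u00e9 42\n'; // "Café 42" plus a newline

// \w matches only ASCII letters, digits and underscore, so é is skipped.
const wordChars = classSample.match(/\w/g).join(''); // 'Caf42'

// \W is the opposite: here it finds é, the space and the newline.
const nonWordCount = classSample.match(/\W/g).length; // 3

const digits = classSample.match(/\d/g).join('');        // '42'
const whitespaceCount = classSample.match(/\s/g).length; // 2 (space and newline)

// The dot does not match a newline unless the s (dotAll) flag is set.
const dotMatchesNewline = /./.test('\n');     // false
const dotAllMatchesNewline = /./s.test('\n'); // true

// A range matches by character code.
const upperCase = 'aBcD'.match(/[A-Z]/g); // [ 'B', 'D' ]

console.log(wordChars, nonWordCount, digits, whitespaceCount);
console.log(dotMatchesNewline, dotAllMatchesNewline, upperCase);
```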
The OR Operator
The OR operator (alternation) is similar to a bracket expression, but it chooses between whole patterns rather than single characters. You can use alternation to match a single regular expression out of several possible regular expressions.
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping.
Example:
To match whole words only, use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either cat or dog, and then another word boundary.
Operator | Description |
---|---|
| | Alternation - acts like the boolean OR, matches the expression before or after the |. It can operate in a group or on a whole expression, the patterns will be tested in order. |
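The effect of grouping on alternation is easiest to see by running the same text through a grouped and an ungrouped pattern. This is a sketch with a sample sentence of my own:

```javascript
const petText = 'cat dog catalog hotdog';

// Grouped: the word boundaries apply to BOTH alternatives.
const wholeWords = petText.match(/\b(cat|dog)\b/g); // [ 'cat', 'dog' ]

// Ungrouped: this reads as "\bcat" OR "dog\b", so it also finds
// the cat in catalog and the dog in hotdog.
const unGrouped = petText.match(/\bcat|dog\b/g); // [ 'cat', 'dog', 'cat', 'dog' ]

// Alternatives are tried left to right; the first one that works wins.
const firstWins = 'catalog'.match(/cat|catalog/)[0]; // 'cat'

console.log(wholeWords, unGrouped, firstWins);
```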
Flags or Modifiers
Modifiers tell the processor how to proceed when searching for the Regular Expression pattern.
Flags / Modifiers | Description |
---|---|
g | Global match - finds all matches - DOES NOT stop search after first match. |
i | Matches without case sensitivity. |
m | Multi-line mode - the ^ and $ anchors match at the start and end of each line, NOT only at the start and end of the whole string. |
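Here is a minimal sketch of how each flag changes the result of the same search. The three-line sample string is my own assumption.

```javascript
const flagText = 'Apple\napple pie\nAPPLE';

// No flags: case-sensitive, stops at the first match.
const first = flagText.match(/apple/);
console.log(first[0], first.index); // apple 6

// g: return every match. i: ignore case.
const allLower = flagText.match(/apple/g);    // [ 'apple' ]
const anyCase = flagText.match(/apple/gi);    // [ 'Apple', 'apple', 'APPLE' ]

// Without m, ^ only matches at the very start of the string.
const stringStart = flagText.match(/^apple/gi); // [ 'Apple' ]

// With m, ^ and $ also match at the start and end of every line.
const lineStart = flagText.match(/^apple/gim);  // [ 'Apple', 'apple', 'APPLE' ]
const wholeLine = flagText.match(/^apple$/gim); // [ 'Apple', 'APPLE' ]

console.log(allLower, anyCase, stringStart, lineStart, wholeLine);
```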
Character Escapes
Escape sequences can be used to insert reserved, special, and unicode characters. All escaped characters begin with the \ character.
Character escapes are used to switch from a literal search to a special search, and vice versa.
Character Escapes | Description |
---|---|
\+ | + * ? ^ $ \ . [ ] { } ( ) | / are all regular expression operators (special characters) and must be preceded by a \ (backslash) to represent the literal version of the character. In a character set, only \, -, and ] need to be escaped. |
\000 | Octal escaped character in the form \000, for example © would be searched with \251. The value must be 255 (\377) or less. |
\xFF | Hexadecimal escaped character, example \xA9 is for the character © |
\uFFFF | Unicode escaped character, example \u00A9 is for the character © |
\u{FFFF} | Extended unicode escaped character. Supports a full range of unicode point escapes with any number of hex digits. Requires the unicode flag / modifier u to be set. |
\cI | Escaped control character. This can range from \cA (SOH, char code 1) to \cZ (SUB, char code 26). |
\t | Matches a TAB character (char code 9). |
\n | Matches a line feed character with a char code = 10. |
\v | Matches a VERTICAL TAB character (char code 11). |
\f | Matches a FORM FEED character (char code 12). |
\r | Matches a CARRIAGE RETURN character (char code 13). |
\0 | This is a zero, not the letter O. It matches a NULL character (char code 0). |
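A few of the escapes from the table, sketched with test strings of my own. The strings use \u escapes so the special characters are unambiguous. Note that the octal form is a legacy feature that only works without the u flag.

```javascript
// Escaping an operator makes it literal.
const literalPlus = /1\+1/.test('1+1'); // true
const operatorPlus = /1+1/.test('1+1'); // false: this pattern looks for "11", "111", ...

// Inside a character set most operators are already literal.
const insideSet = 'a.b+c'.match(/[.+]/g); // [ '.', '+' ]

// Three ways to match the copyright sign (char code 169).
const copyright = '\u00A9';
const viaHex = /\xA9/.test(copyright);       // true
const viaUnicode = /\u00A9/.test(copyright); // true
const viaOctal = /\251/.test(copyright);     // true (legacy octal, no u flag)

// Code points above FFFF need the \u{...} form together with the u flag.
const viaCodePoint = /\u{1F600}/u.test('\u{1F600}'); // true

// Control and whitespace escapes.
const tab = /\t/.test('a\tb');        // true
const controlI = /\cI/.test('\t');    // true: Ctrl+I is the TAB character
const nullChar = /\0/.test('a\0b');   // true

console.log(literalPlus, operatorPlus, insideSet);
console.log(viaHex, viaUnicode, viaOctal, viaCodePoint, tab, controlI, nullChar);
```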
Regular Expression Example
I've selected an example that has a lot of everyday programming applications. The following Regular Expression will match or validate email addresses in a document or an input box:
/\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/gi
What does it mean?
\b = matches a word boundary position between a word character and non-word character or position (start / end of string).
For validation, replace the two "\b" anchors with the "^" anchor at the start of the pattern and the "$" anchor at the end. This confirms that the whole input is an email address, with no whitespace, new lines or other characters before or after it. You also don't need the global "g" flag.
This example has 3 groups:
Group 1
([\da-z\._%+-]+)
Group 2
([\da-z\.-]+)
Group 3
([a-z]{2,10})
Characters
@ = located between Group 1 and Group 2 character sets, instructs Regular Expression engine to look for "@" character in between groups 1 and 2.
\. = located between Group 2 and Group 3 character sets, instructs Regular Expression engine to look for "." character in between groups 2 and 3. The backslash is needed before it to escape the character, as "." has a special purpose, refer Character Classes.
g = global, the search continues after the 1st match. i = case-insensitive, which shortens the Regular Expression as you can state a-z characters instead of A-Za-z for the search criteria.
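Putting the pieces together, here is a sketch of the pattern in use. The sample sentence and addresses are made up for illustration, and the validation variant assumes the pattern is anchored with ^ and $ so that the whole input has to be an email address.

```javascript
// Search variant: word boundaries, global and case-insensitive.
const findEmails = /\b([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})\b/gi;

// Validation variant (assumption: anchored to the whole input, no g flag).
const validateEmail = /^([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})$/i;

// Hypothetical text to search.
const emailDoc = 'Contact Jane.Doe@Example.com or support@mail.example.org today.';

// All matches as plain strings.
const foundEmails = emailDoc.match(findEmails);
console.log(foundEmails); // [ 'Jane.Doe@Example.com', 'support@mail.example.org' ]

// matchAll exposes the three capture groups of every match.
const emailParts = [...emailDoc.matchAll(findEmails)].map(m => ({
  user: m[1],   // Group 1
  domain: m[2], // Group 2
  tld: m[3]     // Group 3
}));
console.log(emailParts[1]); // { user: 'support', domain: 'mail.example', tld: 'org' }

// Validation: the whole input must be one address.
const validationResults = [
  validateEmail.test('jane@example.com'),      // true
  validateEmail.test(' jane@example.com'),     // false: leading space
  validateEmail.test('jane@example'),          // false: no top-level domain
  validateEmail.test('jane@example.com extra') // false: trailing text
];
console.log(validationResults);
```

Keep in mind that a pattern like this is a practical filter, not a complete check against every address the email standards allow.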
Bracket Expression, Character Class and Quantifier
[\da-z\._%+-]+
Matches any one of the characters "0-9", "a-z", ".", "_", "%", "+", "-".
The Character Class "\d" instructs RegEx to search for any digit 0-9.
The quantifier "+" outside the square brackets instructs RegEx that one of these must exist at least once (but there could be more than one of these components in the string).
[\da-z\.-]+
Matches any one of the characters "0-9", "a-z", ".", "-".
The quantifier "+" instructs RegEx that one of these must exist at least once (but there could be more than one of these components in the string).
[a-z]{2,10}
Matches any one of the characters "a-z".
The Quantifier "{2,10}" instructs RegEx that one of these must exist at least two times and no more than 10 times.
Note: Whilst this quantifier will work for most domains, there are now many domain alternatives that extend past 10 characters. This is a good example of choosing the right Regular Expression setting for your search or validation. A better Quantifier to use would be "{2,}". The smallbiztrends.com article listed in the References shows some of the alternatives to .com.
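The effect of the quantifier on long top-level domains can be sketched as follows. The address below is hypothetical and uses an 11-letter top-level domain; both patterns are the anchored validation form, which is my assumption for this comparison.

```javascript
// Group 3 limited to 2-10 letters.
const shortTld = /^([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,10})$/i;

// Group 3 with no upper limit.
const openTld = /^([\da-z\._%+-]+)@([\da-z\.-]+)\.([a-z]{2,})$/i;

// Hypothetical address with an 11-letter top-level domain.
const longAddress = 'hello@studio.photography';

const rejectedByShort = shortTld.test(longAddress); // false: 11 letters is more than 10
const acceptedByOpen = openTld.test(longAddress);   // true

// Ordinary addresses pass both patterns.
const ordinaryOk = shortTld.test('hello@example.com') && openTld.test('hello@example.com'); // true

console.log(rejectedByShort, acceptedByOpen, ordinaryOk);
```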
References
There is so much more to Regular Expressions than what I've covered here. Following are the various sources I referred to for this Gist.
I highly recommend RegExr. It is a great website to test your understanding of Regular Expressions and test if your Regular Expression will actually find or validate what you want using the 'sandbox' environment in the website.
- https://regexr.com/
- https://www.regular-expressions.info/tutorial.html
- https://en.wikipedia.org/wiki/Regular_expression
- https://www.w3schools.com/jsref/jsref_obj_regexp.asp
- https://www.programiz.com/javascript/regex
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- https://www.shellhacks.com/regex-find-email-addresses-file-grep/
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions
- https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
- https://smallbiztrends.com/2016/03/alternatives-to-com.html
Author
Mark Watson is a programmer currently focused on learning JavaScript. If you have questions or would like to connect with me, please use one of the following:
- Find me on GitHub: Mark Watson GitHub
- Email me: Mark Watson email