bonniss/js-regexp.md

    Regular Expression in Javascript

Patterns and Flags

Regexp

A regular expression consists of a pattern and option flags.
There are two syntaxes that can be used to create a regexp object.

The long syntax.

regexp = new RegExp('pattern', 'flags');

The short one.

regexp = /pattern/; // no flags
regexp = /pattern/gim; // with flags g, i and m
Slashes /.../ tell Javascript that we are creating a regular expression. They play the same role as quotes for strings.
In both cases regexp becomes an instance of the built-in RegExp class.
The main difference between these two syntaxes is that a pattern written with slashes /.../ does not allow expressions to be inserted (like string template literals with ${...}). Such patterns are fully static.
Slashes are used when we know the regular expression at the time of writing the code, which is the most common situation. new RegExp is more often used when we need to create a regexp "on the fly" from a dynamically generated string.
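As a minimal sketch of the dynamic case, the snippet below builds a regexp from a value that is only known at run time. The names tagName and tagRegexp are illustrative assumptions, not part of any API.

```javascript
// Hypothetical input: the tag name is only known at run time.
const tagName = 'h2';

// A template literal builds the pattern string; 'i' is passed as the flags argument.
const tagRegexp = new RegExp(`<${tagName}>`, 'i');

console.log(tagRegexp.source);                 // <h2>
console.log(tagRegexp.test('<H2>Title</H2>')); // true
```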
Flags

Regular expressions may have flags that affect the search.
This section covers 6 flags (newer versions of the language add a few more, such as d and v).
i

With this flag the search is case-insensitive: there is no difference between A and a.
g

With this flag the search looks for all matches; without it, only the first match is returned.
m

Multiline mode.
s

Enables "dotall" mode, which allows a dot . to match the newline character \n.
Without this flag, . matches any character except a newline \n.
u

Enables full unicode support. The flag enables correct processing of surrogate pairs.
y

"Sticky" mode: searching at the exact position in the text.
Search: str.match

The method str.match(regexp) finds all matches of regexp in the string str.
It has 3 working modes:

If the regular expression has flag g, it returns an array of all matches.

let str = 'We will, we will rock you';
alert(str.match(/we/gi)); // We,we (an array of 2 substrings that match)
Both We and we are found, because flag i makes the regular expression case-insensitive.

If there's no such flag, it returns only the first match in the form of an array, with the full match at index 0 and some additional details in properties.

let str = 'We will, we will rock you';

let result = str.match(/we/i); // without flag g

alert(result[0]); // We (1st match)
alert(result.length); // 1

// Details:
alert(result.index); // 0 (position of the match)
alert(result.input); // We will, we will rock you (source string)

And, finally, if there is no match, null is returned.

This is a very important nuance. If there is no match, we don't receive an empty array, but instead receive null.
let matches = 'JavaScript'.match(/HTML/); // = null

if (!matches.length) {
  // Error: Cannot read property 'length' of null
  alert('Error in the line above');
}
If we’d like the result to always be an array, we can write it this way:
let matches = 'JavaScript'.match(/HTML/) || [];

if (!matches.length) {
  alert('No matches'); // now it works
}
Replacing: str.replace

The method str.replace(regexp, replacement) replaces matches found using regexp in string str with replacement.
The second argument is the replacement string. We can use special character combinations in it to insert fragments of the match.
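A short sketch of str.replace, assuming the same sample sentence used earlier. It shows the effect of flag g and one of the special combinations, $&, which inserts the whole match. The variable names are illustrative.

```javascript
const song = 'We will, we will rock you';

// Without flag g only the first match is replaced.
const firstOnly = song.replace(/we/i, 'I');
// With flag g every match is replaced.
const every = song.replace(/we/gi, 'I');
// $& in the replacement string inserts the whole match.
const withMatch = 'I love HTML'.replace(/HTML/, '$& and JavaScript');

console.log(firstOnly); // I will, we will rock you
console.log(every);     // I will, I will rock you
console.log(withMatch); // I love HTML and JavaScript
```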
Testing: regexp.test

The method regexp.test(str) looks for at least one match. If found, returns true, otherwise false.
let str = 'I love Javascript';
let regexp = /LOVE/i;

alert(regexp.test(str)); // true
Character classes

Consider a practical task - we have a phone number like +7(903)-123-45-67, and we need to turn it into pure numbers: 79031234567.
To do so, we can find and remove anything that's not a number. Character classes can help with that.
A character class is a special notation that matches any symbol from a certain set.
To start, let's explore the "digit" class. It's written as \d and corresponds to "any single digit".
let str = '+7(903)-123-45-67';

let regexp = /\d/g;

alert(str.match(regexp)); // array of matches: 7,9,0,3,1,2,3,4,5,6,7

// let's make the digits-only phone number of them:
alert(str.match(regexp).join('')); // 79031234567
Most-used classes are:


\d
A digit


\s ("s" is from "space")
A space symbol: includes spaces, tabs \t, newlines \n and a few other rare characters, such as \v, \f and \r.


\w ("w" is from "word")
A "wordly" character: either a letter of Latin alphabet or a digit or an underscore _. Non-Latin letters do not belong to \w.


For instance, \d\s\w means a "digit" followed by a "space character" followed by a "wordly character", such as 1 a.
A regexp may contain both regular symbols and character classes.
For instance, CSS\d matches the string CSS with a digit after it.
let str = 'Is there CSS4?';
let regexp = /CSS\d/;

alert(str.match(regexp)); // CSS4

alert('I love HTML5!'.match(/\s\w\w\w\w\d/)); // ' HTML5'
Inverse classes

For every character class, there exists an "inverse class", denoted with the same letter, but uppercased.
The "inverse" means that it matches all other characters, for instance:


\D
Non-digit: any character except \d, for instance a letter.


\S
Non-space: any character except \s, for instance a letter.


\W
Non-wordly character: anything but \w, e.g. a non-Latin letter or a space.
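As a sketch of how an inverse class helps with the phone number task, the snippet below removes every non-digit with \D instead of collecting the digits. The variable names are illustrative.

```javascript
const phone = '+7(903)-123-45-67';

// \D matches any character that is not a digit; flag g removes all of them.
const digitsOnly = phone.replace(/\D/g, '');

console.log(digitsOnly); // 79031234567
```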


A dot is "any character"

A dot . is a special character class that matches "any character except a newline".
let regexp = /CS.4/;

alert('CSS4'.match(regexp)); // CSS4
alert('CS-4'.match(regexp)); // CS-4
alert('CS 4'.match(regexp)); // CS 4 (space is also a character)
Please note that a dot means "any character", but not the "absence of a character". There must be a character to match it:
alert('CS4'.match(/CS.4/)); // null, no match because there's no character for the dot
Dot is literally any character with "s" flag

alert('A\nB'.match(/A.B/s)); // A\nB (match!)
// Another trick to match "any character": a set that combines a class with its inverse
// [\s\S]
// [\d\D]
// [\w\W]
Pay attention to spaces

Usually we pay little attention to spaces. For us, the strings 1-5 and 1 - 5 are nearly identical.
But a space is a character, equal in importance to any other character.
In other words, in a regexp all characters matter, spaces too.
alert('1 - 5'.match(/\d-\d/)); // null, no match!
alert('1 - 5'.match(/\d - \d/)); // 1 - 5, now it works
// or we can use \s class:
alert('1 - 5'.match(/\d\s-\s\d/)); // 1 - 5, also works
Unicode encoding, used by Javascript for strings, provides many properties for characters, like: which language the letter belongs to, punctuation sign, etc.
Unicode: flag "u" and class \p{...}

JavaScript uses Unicode encoding for strings. Most characters are encoded with 2 bytes, but that allows at most 65536 characters to be represented.
That range is not big enough to encode all possible characters, which is why some rare characters are encoded with 4 bytes.
A long time ago, when the JavaScript language was created, Unicode encoding was simpler: there were no 4-byte characters. So, some language features still handle them incorrectly.
alert('😄'.length); // 2
alert('𝒳'.length); // 2
The point is that length treats a 4-byte character as two 2-byte characters (a "surrogate pair"). By default, regular expressions also treat 4-byte “long characters” as a pair of 2-byte ones. And, as happens with strings, that may lead to odd results.
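One way to observe the difference, as a small sketch: the pattern ^.$ expects exactly one character. Without flag u the emoji counts as two units, so the test fails; with flag u it is treated as a single character.

```javascript
const smile = '😄';

// Without u, the dot matches only one half of the surrogate pair.
const withoutU = /^.$/.test(smile);
// With u, the dot matches the whole 4-byte character.
const withU = /^.$/u.test(smile);

console.log(withoutU); // false
console.log(withU);    // true
```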
Unicode properties \p{...}

Every character in Unicode has a lot of properties. They describe what “category” the character belongs to and contain miscellaneous information about it.
For instance, if a character has the Letter property, it means that the character belongs to an alphabet (of any language). And the Number property means that it’s a digit: maybe Arabic or Chinese, and so on.
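A brief sketch of the Letter and Number properties, written with their short aliases \p{L} and \p{N}. Flag u is required for \p{...} to work. The sample string is an assumption chosen for illustration.

```javascript
const sample = 'Hello Привет 123';

// \p{L} matches a letter of any alphabet, \p{N} matches a numeric character.
const words = sample.match(/\p{L}+/gu);
const digits = sample.match(/\p{N}/gu);

console.log(words);  // [ 'Hello', 'Привет' ]
console.log(digits); // [ '1', '2', '3' ]
```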
Example: Chinese hieroglyphs

Let’s look for Chinese hieroglyphs.
There’s a Unicode property Script (a writing system), that may have a value: Cyrillic, Greek, Arabic, Han (Chinese) and so on.
To look for characters in a given writing system we should use Script=<value>, e.g. for Cyrillic letters: \p{sc=Cyrillic}, for Chinese hieroglyphs: \p{sc=Han}, and so on:
let regexp = /\p{sc=Han}/gu; // returns Chinese hieroglyphs

let str = `Hello Привет 你好 123_456`;

alert(str.match(regexp)); // 你,好
Anchors

The caret ^ and dollar $ characters have special meaning in a regexp. They are called "anchors".
The caret ^ matches at the beginning of the text, and the dollar $ - at the end.
For instance, let's test if the text starts with Mary:
let str1 = 'Mary had a little lamb';
alert(/^Mary/.test(str1)); // true
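A small sketch that uses both anchors. Together, ^...$ checks that the whole string fits the pattern, which is handy for validation. The helper isTime is a hypothetical name introduced only for this example.

```javascript
const endsWithSnow = /snow$/.test('its fleece was white as snow');

// Hypothetical helper: true only if the whole string looks like 12:34.
const isTime = (text) => /^\d\d:\d\d$/.test(text);

console.log(endsWithSnow);     // true
console.log(isTime('12:34'));  // true
console.log(isTime('12:345')); // false
```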
Multiline mode of anchors ^$, flag "m"

The multiline mode is enabled by the flag m.
Searching at line start ^

let str = `1st place: Winnie
2nd place: Piglet
3rd place: Eeyore`;

alert(str.match(/^\d/gm)); // 1,2,3
Searching at line end $

let str = `Winnie: 1
Piglet: 2
Eeyore: 3`;

alert(str.match(/\d$/gm)); // 1,2,3

"Start of a line" formally means "immediately after a line break": in multiline mode the test ^ matches at the very start of the text and at all positions preceded by a newline character \n.

Word boundary: \b

A word boundary \b is a test, just like ^ and $.
When the regexp engine (program module that implements searching for regexp) comes across \b, it checks that the position in the string is a word boundary.
There are three different positions that qualify as word boundaries:

At string start, if the first string character is a word character \w.
Between two characters in the string, where one is a word character \w and the other is not.
At string end, if the last string character is a word character \w.


Word boundary \b doesn't work for non-latin alphabets
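The rules above can be sketched as follows. The last line illustrates the limitation for non-Latin alphabets: Cyrillic letters are not \w, so no boundary is found around them. The sample strings are assumptions.

```javascript
// Java is a standalone word here: boundaries exist on both sides.
const found = 'Hello, Java!'.match(/\bJava\b/);
// In JavaScript the letter S follows Java, so there is no boundary after it.
const notFound = 'Hello, JavaScript!'.match(/\bJava\b/);
// Cyrillic letters do not belong to \w, so \b never matches next to them.
const cyrillic = /\bПривет\b/.test('Привет');

console.log(found[0]); // Java
console.log(notFound); // null
console.log(cyrillic); // false
```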

Escaping, special characters

A slash

alert('/'.match(/\//)); // '/'
alert('/'.match(new RegExp('/'))); // finds /, no need to escape
new RegExp

If we are creating a regular expression with new RegExp, then we don’t have to escape /, but need to do some other escaping.
let regexp = new RegExp("\d\.\d");

alert('Chapter 5.1'.match(regexp)); // null
A similar search works with /\d\.\d/, but new RegExp("\d\.\d") doesn't. Why?
The reason is that backslashes are "consumed" by a string.

\n - becomes a newline character
\u1234 - becomes the Unicode character with such code
...And when there is no special meaning, as in \d, the backslash is simply removed.
To fix it, we need to double the backslashes, because \\ in a string becomes a single \.

let regStr = '\\d\\.\\d';
alert(regStr); // \d\.\d (correct now)

let regexp = new RegExp(regStr);

alert('Chapter 5.1'.match(regexp)); // 5.1
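When the pattern string comes from user input, its special characters usually need escaping as well. The helper below is a hypothetical sketch (escapeRegExp is not a built-in function here); it prefixes each special character with a backslash before the string is passed to new RegExp.

```javascript
// Hypothetical helper: put a backslash before every regexp special character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const userInput = '5.1'; // assumed to come from a user

const unsafe = new RegExp(userInput);             // the dot means "any character"
const safe = new RegExp(escapeRegExp(userInput)); // the dot is a literal dot

console.log(unsafe.test('Chapter 5x1')); // true
console.log(safe.test('Chapter 5x1'));   // false
console.log(safe.test('Chapter 5.1'));   // true
```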
Sets and Ranges [...]

Several characters or character classes inside square brackets [...] mean "search for any character among the given ones".
Sets

For instance, [eao] means any of the 3 characters a, e or o.
That's called a set.
// find [t or m], and then "op"
alert('Mop top'.match(/[tm]op/gi)); // "Mop",  "top"
Ranges

Square brackets may also contain character ranges.
For instance, [a-z] is a character in range from a to z, and [0-5] is a digit from 0 to 5.
Excluding ranges

Besides normal ranges, there are "excluding" ranges that look like [^...].
They are denoted by a caret character ^ at the start and match any character except the given ones.
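A short sketch of both kinds of ranges. The sample strings are assumptions chosen for illustration.

```javascript
// x followed by two hex digits: [0-9A-F] is a digit or a letter from A to F.
const hex = 'Exception 0xAF'.match(/x[0-9A-F][0-9A-F]/g);

// Anything except digits, spaces and letters (flag i covers both cases).
const others = 'alice15@gmail.com'.match(/[^\d\sA-Z]/gi);

console.log(hex);    // [ 'xAF' ]
console.log(others); // [ '@', '.' ]
```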
Escaping in [...]

In square brackets, we can use the vast majority of special characters without escaping:

. + ( ) never need escaping
A hyphen - is not escaped at the beginning or the end (where it does not define a range).
A caret ^ is only escaped at the beginning (where it means exclusion).
The closing square bracket ] is always escaped.

/[-+.()]/.test('1+2.4'); // true
Ranges and flags "u"

If there are surrogate pairs in the set, flag u is required for them to work correctly.
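A sketch of what goes wrong without flag u: the set is read as four separate 2-byte halves, so the match returns only half of the character.

```javascript
// Without u the engine sees four 2-byte units inside the brackets.
const withoutU = '𝒳'.match(/[𝒳𝒴]/);
// With u the set contains two whole characters.
const withU = '𝒳'.match(/[𝒳𝒴]/u);

console.log(withoutU[0].length); // 1 (one half of the surrogate pair)
console.log(withU[0]);           // 𝒳
```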
Quantifiers +, *, ? and {n}

Quantity {n}

The simplest quantifier: {n}.

The exact count: {5}
The range: {3,5}, match 3-5 times

You can omit the upper limit: \d{3,} looks for sequences of digits of length 3 or more.


Shorthands


+ means "one or more", the same as {1,}
? means "zero or one", in other words, it makes the symbol optional
* means "zero or more", the same as {0,}
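The quantifiers above can be sketched with a few one-liners. The sample strings are assumptions.

```javascript
const exact = "I'm 12345 years old".match(/\d{5}/);
const atLeast3 = "I'm not 12, but 345678 years old".match(/\d{3,}/);
const numbers = '+7(903)-123-45-67'.match(/\d+/g);
const spellings = 'Should I write color or colour?'.match(/colou?r/g);
const zeros = '100 10 1'.match(/\d0*/g);

console.log(exact[0]);    // 12345
console.log(atLeast3[0]); // 345678
console.log(numbers);     // [ '7', '903', '123', '45', '67' ]
console.log(spellings);   // [ 'color', 'colour' ]
console.log(zeros);       // [ '100', '10', '1' ]
```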


To make a regexp more precise, we often need to make it more complex

Greedy and lazy quantifiers

Quantifiers look very simple at first sight, but in fact they can be tricky.
Greedy search

To find a match, the regexp engine uses the following algorithm:

For every position in the string

Try to match the pattern at that position
If there's no match, go to the next position


In the greedy mode (by default) a quantifier is repeated as many times as possible.
Lazy mode

The lazy mode of quantifiers is the opposite of the greedy mode. It means: "repeat a minimal number of times".
We can enable it by putting a question mark '?' after the quantifier, so that it becomes *? or +? or even '??' for '?'.
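A sketch comparing the two modes on the task of finding quoted strings. The sample text is an assumption.

```javascript
const text = 'a "witch" and her "broom" is one';

// Greedy: .+ takes as much as possible, then backtracks to the last quote.
const greedy = text.match(/".+"/g);
// Lazy: .+? takes as little as possible, stopping at the nearest quote.
const lazy = text.match(/".+?"/g);

console.log(greedy); // [ '"witch" and her "broom"' ]
console.log(lazy);   // [ '"witch"', '"broom"' ]
```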
Capturing groups

A part of a pattern can be enclosed in parentheses (...). This is called a "capturing group". This has 2 effects:

It allows to get a part of the match as a separate item in the result array.
If we put a quantifier after the parentheses, it applies to the parentheses as a whole.
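The second effect can be sketched like this: without parentheses the quantifier applies to a single character, with them it applies to the whole group. The sample string is an assumption.

```javascript
// go+ means g followed by one or more o.
const single = 'Gogogo now!'.match(/go+/i);
// (go)+ means the pair go repeated one or more times.
const grouped = 'Gogogo now!'.match(/(go)+/i);

console.log(single[0]);  // Go
console.log(grouped[0]); // Gogogo
```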

Parentheses contents in the match

Parentheses are numbered from left to right. The search engine memorizes the content matched by each of them and makes it available in the result.
let str = '<h1>Hello, world!</h1>';

let tag = str.match(/<(.*?)>/);

alert(tag[0]); // <h1>
alert(tag[1]); // h1
Nested group

Parentheses can be nested.
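With nested groups, the numbering follows the opening parentheses from left to right. A sketch, assuming a simple tag string as input:

```javascript
const tag = '<span class="my">';

// Group 1: the whole tag content, group 2: the tag name, group 3: the attributes.
const result = tag.match(/<(([a-z]+)\s*([^>]*))>/);

console.log(result[0]); // <span class="my">
console.log(result[1]); // span class="my"
console.log(result[2]); // span
console.log(result[3]); // class="my"
```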
Backreferences in pattern: \N and \k

We can use the contents of capturing groups (...) not only in the result or in the replacement string, but also in the pattern itself.
Backreference by number: \N

A group can be referenced in the pattern using \N, where N is the group number.
To make clear why that's helpful, let's consider a task.
let str = `He said: "She's the one!".`;

let regexp = /['"](.*?)['"]/g;

// The result is not what we'd like to have
alert(str.match(regexp)); // "She'
As we can see, the pattern found an opening quote ", then the text is consumed till the other quote ', that closes the match.
To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and backreference it: (['"])(.*?)\1. Further in the pattern \1 means "find the same text as in the first group", exactly the same quote in our case.

If we use ?: in the group, then we can't reference it. Groups that are excluded from capturing (?:...) are not memorized by the engine.


Don't mix them up: in the pattern use \1, in the replacement string use $1.
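A sketch showing both references side by side: \1 inside the pattern and $1, $2 inside the replacement string. The second sample string is an assumption.

```javascript
const quote = `He said: "She's the one!".`;

// \1 in the pattern: the closing quote must equal the opening one.
const quoted = quote.match(/(['"])(.*?)\1/g);

// $1 and $2 in the replacement string: swap the two captured words.
const swapped = 'John Bull'.replace(/(\w+) (\w+)/, '$2 $1');

console.log(quoted[0]); // "She's the one!"
console.log(swapped);   // Bull John
```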

Backreference by name: \k<name>

let str = `He said: "She's the one!".`;

let regexp = /(?<quote>['"])(.*?)\k<quote>/g;

alert(str.match(regexp)); // "She's the one!"
Alternation (OR) |

Alternation is the term in regexp that is actually a simple "OR".
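A minimal sketch of alternation. At the top level | splits the whole pattern; inside parentheses it applies only to that group. The sample strings are assumptions.

```javascript
const text = 'First HTML appeared, then CSS, then JavaScript';

// Three alternatives: html, css, or java optionally followed by script.
const languages = text.match(/html|css|java(script)?/gi);

// Alternation limited to one group: gr, then a or e, then y.
const colors = 'gray grey groy'.match(/gr(a|e)y/g);

console.log(languages); // [ 'HTML', 'CSS', 'JavaScript' ]
console.log(colors);    // [ 'gray', 'grey' ]
```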
Lookahead and lookbehind

Sometimes we need to find only those matches for a pattern that are followed or preceded by another pattern.
There's a special syntax for that, called "lookahead" and "lookbehind", together referred to as "lookaround".
To start, let's find the price in a string like 1 turkey costs 30€. That is: a number, followed by the € sign.
Lookahead

'2 turkeys cost 60€'.match(/\d+(?=€)/); // 60
Negative lookahead

'2 turkeys cost 60€'.match(/\d+(?!€)/); // 2
Lookbehind

Lookahead allows to add a condition for "what follows".
Lookbehind is similar, but it looks behind. That is, it allows to match a pattern only if there's something before it.
/(?<=Y)X/   // matches X, but only if there's Y before it
/(?<!Y)X/   // matches X, but only if there’s no Y before it.
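A sketch of both lookbehind forms, assuming prices written with a leading $ sign. The dollar sign is escaped because it is a special character in a pattern.

```javascript
// Positive lookbehind: digits that come right after a $ sign.
const price = '1 turkey costs $30'.match(/(?<=\$)\d+/);

// Negative lookbehind: whole numbers that do not come right after a $ sign.
const quantity = '2 turkeys cost $60'.match(/(?<!\$)\b\d+/g);

console.log(price[0]); // 30
console.log(quantity); // [ '2' ]
```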
Capturing groups

Generally, the contents inside lookaround parentheses do not become a part of the result.
But in some situations we might want to capture the lookaround expression as well, or a part of it.
let str = '1 turkey costs 30€';
let regexp = /\d+(?=(€|kr))/; // extra parentheses around €|kr

alert(str.match(regexp)); // 30, €
Catastrophic backtracking

Some regular expressions look simple, but can take a very long time to execute, and even "hang" the JavaScript engine.
The typical symptom: a regexp works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
How to fix

There are two main approaches to fixing the problem.
Lower the number of possible combinations

Suppose the slow pattern is, for example, ^(\w+\s?)*$: any number of words, each optionally followed by a space. We can rewrite it as ^(\w+\s)*\w*$, which looks for any number of words followed by a space (\w+\s)*, and then (optionally) a final word \w*.
This regexp is equivalent to the original one and works well.
Preventing backtracking

It's not always convenient to rewrite a regexp. And it's not always obvious how to do it.
The alternative approach is to forbid backtracking for the quantifier.
The regexp engine tries many combinations that are obviously wrong for a human.
Modern regexp engines support possessive quantifiers for that. They are like greedy ones, but don't backtrack (so they are actually simpler than regular quantifiers). JavaScript, however, does not support possessive quantifiers.
Lookahead to the rescue

We can prevent backtracking using lookahead.
The pattern to take as many repetitions of \w as possible without backtracking is (?=(\w+))\1.
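A sketch of the lookahead trick applied to a whole pattern. It assumes the slow pattern was something like ^(\w+\s?)*$; the backreference is \2 here because the outer parentheses are group 1 and the word inside the lookahead is group 2. The sample strings are assumptions.

```javascript
// (?=(\w+))\2 grabs a whole word and never gives part of it back.
const regexp = /^((?=(\w+))\2\s?)*$/;

const good = 'A correct string';
const bad = 'An input string that takes a long time or even makes this regex hang!';

console.log(regexp.test(good)); // true
console.log(regexp.test(bad));  // false, and the answer comes back quickly
```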
Sticky flag "y", searching at position

The flag y allows to perform the search at the given position in the source string.
To grasp the use case of the y flag, and see how useful it is, let's explore a practical example.
One of the common tasks for regexps is "lexical analysis": we get a text, e.g. in a programming language, and analyze it for structural elements.
For instance, HTML has tags and attributes, Javascript code has functions, variables and so on.
Writing lexical analyzers is a special area, with its own tools and algorithms, so we don't go deep in there, but there's a common task: to read something at the given position.
We'll look for variable name using regexp \w+. Actually, Javascript variable names need a bit more complex regexp for accurate matching, but here it doesn't matter.
let str = 'let varName = "value"';

let regexp = /\w+/y;

regexp.lastIndex = 3;
alert(regexp.exec(str)); // null (there's a space at position 3, not a word)

regexp.lastIndex = 4;
alert(regexp.exec(str)); // varName (word at position 4)
As we can see, regexp /\w+/y doesn't match at position 3 (unlike flag g), but matches at position 4.
Imagine, we have a long text, and there are no matches in it, at all. Then searching with flag g will go till the end of the text, and this will take significantly more time than the search with flag y.
In such tasks like lexical analysis, there are usually many searches at an exact position. Using flag y is the key for a good performance.
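To make the lexical analysis idea concrete, here is a minimal tokenizer sketch built on flag y. The token rules and the tokenize function are hypothetical and cover only a toy language; a real lexer needs far more rules.

```javascript
// Hypothetical token rules for a toy language; every regexp is sticky.
const tokenRules = [
  ['space', /\s+/y],
  ['number', /\d+/y],
  ['name', /[A-Za-z_]\w*/y],
  ['punct', /[=+\-*\/();]/y],
];

function tokenize(source) {
  const tokens = [];
  let position = 0;
  while (position < source.length) {
    let matched = false;
    for (const [type, rule] of tokenRules) {
      rule.lastIndex = position; // try this rule exactly at the current position
      const result = rule.exec(source);
      if (result) {
        if (type !== 'space') tokens.push({ type, text: result[0] });
        position = rule.lastIndex; // exec moved lastIndex past the match
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character at ${position}`);
  }
  return tokens;
}

const tokens = tokenize('let x = 42;');
console.log(tokens.map((token) => token.text)); // [ 'let', 'x', '=', '42', ';' ]
```

Because each rule is tried only at the current position, a failed rule returns immediately instead of scanning the rest of the text.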
Methods of RegExp and String

str.match(regexp)

The method str.match(regexp) finds matches for regexp in the string str.
It has 3 modes:
M1. If the regexp doesn't have flag g, then it returns the first match as an array with capturing groups and properties index (position of the match), input (input string, equals str).
let str = "I love JavaScript";

let result = str.match(/Java(Script)/);

alert( result[0] );     // JavaScript (full match)
alert( result[1] );     // Script (first capturing group)
alert( result.length ); // 2

// Additional information:
alert( result.index );  // 7 (match position)
alert( result.input );  // I love JavaScript (source string)
M2. If the regexp has flag g, then it returns an array of all matches as strings, without capturing groups and other details.
let str = "I love JavaScript";

let result = str.match(/Java(Script)/g);

alert( result[0] ); // JavaScript
alert( result.length ); // 1
M3. If there are no matches, no matter if there's flag g or not, null is returned.
str.split(regexp|substr, limit)

alert('12, 34, 56'.split(/,\s*/)) // array of [12, 34, 56]
str.search(regexp)

let str = "A drop of ink may make a million think";

alert( str.search( /ink/i ) ); // 10 (first match position)

search only finds the first match

If we need positions of further matches, we should use other means, such as finding them all with str.matchAll(regexp).
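A sketch of collecting every match position with str.matchAll, using the same sentence. matchAll requires flag g and returns an iterator of match objects, each with an index property.

```javascript
const text = 'A drop of ink may make a million think';

// Spread the iterator into an array and keep only the positions.
const positions = [...text.matchAll(/ink/gi)].map((match) => match.index);

console.log(positions); // [ 10, 35 ]
```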
str.replace(str|regexp, str|func)

When the first argument of replace is a string, it only replaces the first match.
// replace a dash by a colon
alert('12-34-56'.replace("-", ":")) // 12:34-56

For situations that require "smart" replacements, the second argument can be a function.

It will be called for each match, and the returned value will be inserted as a replacement.
The function is called with arguments func(match, p1, p2, ..., pn, offset, input, groups).
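A few sketches of a function used as the replacement. The sample strings and variable names are assumptions.

```javascript
// Uppercase every match.
const upper = 'html and css'.replace(/html|css/gi, (match) => match.toUpperCase());

// Replace every match with its position; offset is the second argument
// here because the pattern has no capturing groups.
const offsets = 'Ho-Ho-ho'.replace(/ho/gi, (match, offset) => offset);

// With two groups, p1 and p2 come right after the match.
const swapped = 'John Smith'.replace(
  /(\w+) (\w+)/,
  (match, name, surname) => `${surname}, ${name}`
);

console.log(upper);   // HTML and CSS
console.log(offsets); // 0-3-6
console.log(swapped); // Smith, John
```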
regexp.exec(str)
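
As a hedged sketch of typical usage: with flag g, each call to regexp.exec(str) returns the next match and stores the position right after it in regexp.lastIndex, so a loop can walk through all matches. The sample text is an assumption.

```javascript
const text = 'More about JavaScript at https://javascript.info';
const regexp = /javascript/gi;

const found = [];
let result;
// exec returns null when there are no more matches, which ends the loop.
while ((result = regexp.exec(text)) !== null) {
  found.push(`${result[0]} at ${result.index}`);
}

console.log(found); // [ 'JavaScript at 11', 'javascript at 33' ]
```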

regexp.test(str)

If the regexp has flag g, then regexp.test looks from regexp.lastIndex property and updates this property, just like regexp.exec.
So we can use it to search from a given position.
let regexp = /love/gi;

let str = "I love JS";

regexp.lastIndex = 10;

alert(regexp.test(str)); // false (no match)
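One consequence of this behavior, sketched below: calling test repeatedly on the same global regexp can give alternating results, because each successful call leaves lastIndex after the match.

```javascript
const regexp = /javascript/g;

const first = regexp.test('javascript');  // match found, lastIndex becomes 10
const second = regexp.test('javascript'); // search starts at 10, nothing found

console.log(first);            // true
console.log(second);           // false
console.log(regexp.lastIndex); // 0 (reset after the failed search)
```

Setting regexp.lastIndex = 0 before each call, or using a regexp without flag g, avoids the surprise.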