Breaking down and understanding how to match an HTML tag using Regex.
Before we begin, what is a regex? A regex, aka "REGular EXpression," is a sequence of characters used to define a specific search pattern.
In this tutorial, we will break down the following regex used to find and match an HTML tag:
Regex
/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
Matching String
<p id="regexTag">Hello World</p>
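To see the pattern in action before breaking it down, here is a minimal sketch that runs the regex above against the matching string. The variable names (tagRegex, sample, result) are only illustrative and are not part of the original tutorial.

```javascript
// The regex and the matching string from this tutorial.
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
const sample = '<p id="regexTag">Hello World</p>';

// match() returns an array: index 0 is the full match,
// indexes 1 to 3 hold the three capturing groups.
const result = sample.match(tagRegex);

console.log(result[0]); // the whole tag
console.log(result[1]); // p
console.log(result[2]); // the attribute text, with its leading space
console.log(result[3]); // Hello World
```

Each numbered entry lines up with one of the capturing groups explained later in this tutorial.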
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
In our regex, we have two anchors: ^ and $. The ^ anchor matches the start of a string and the $ anchor matches the end of a string, so the whole string has to be one tag for our regex to match.
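As a rough illustration of what the anchors do, the sketch below compares the tutorial regex with a copy that has ^ and $ removed. The unanchored copy and the sample text are assumptions made up for this example.

```javascript
const anchored = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
// Same pattern with the ^ and $ anchors removed (illustration only).
const unanchored = /<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)/;

const text = "Say <b>hi</b> now";

// With anchors, the tag must be the entire string.
const anchoredHit = anchored.test(text); // false
// Without anchors, the tag can be found anywhere in the string.
const unanchoredHit = unanchored.exec(text);

console.log(anchoredHit);
console.log(unanchoredHit[0]); // <b>hi</b>
```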
In our regex, we have two "Greedy" quantifiers, * and +, being used.
The OR Operator is also known as 'Alternation' and uses the OR syntax |.
Example:
a|b matches either a or b
Matching is case sensitive: a|b will not match A or B.
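A small sketch of the OR operator on its own, and then inside a group. The patterns here (a|b and the gray/grey example) are illustrative and do not come from the tutorial regex.

```javascript
// Matches a string containing either "a" or "b".
const either = /a|b/;

console.log(either.test("a")); // true
console.log(either.test("b")); // true
console.log(either.test("c")); // false
console.log(either.test("A")); // false, matching is case sensitive

// Inside a group, the alternation only applies within that group.
const grouped = /^gr(?:a|e)y$/;

console.log(grouped.test("gray")); // true
console.log(grouped.test("grey")); // true
console.log(grouped.test("gry"));  // false
```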
In our regex, we have two Bracket Expressions (See Bracket Expressions) AKA Character Classes, or simply Classes. Classes are declared with [ brackets ].
Class 1
[a-z]
Class 2
[^<]
In our regex we have no flags declared. However, a flag, also known as a modifier, can be put on the end of a regex to make its search pattern even more specific. Some of the most basic flags are:
> Global: g
> Multiline: m
> Case insensitive: i
> Sticky - matches only from the index stored in the regex's lastIndex property: y
> Enable unicode support: u
An example in our regex would be as follows:
/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/g
Notice the flag g is placed AFTER the last /.
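The following sketch shows how a couple of these flags could change the behavior of our regex. The sample strings and variable names are assumptions chosen for this illustration.

```javascript
const strict = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
// Same pattern with the case-insensitive flag added.
const relaxed = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/i;

const upper = "<P>Hello</P>";
console.log(strict.test(upper));  // false, [a-z] only lists lowercase letters
console.log(relaxed.test(upper)); // true

// With g and m together, ^ and $ apply to every line,
// and match() returns every tag that sits alone on its line.
const perLine = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/gm;
const lines = "<p>a</p>\n<br />";
const found = lines.match(perLine);

console.log(found); // [ '<p>a</p>', '<br />' ]
```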
In our regex, we have 4 groupings:
CapGroup 1
([a-z]+)
CapGroup 2
([^<]+)*
Non-CapGroup 1
(?:>(.*)<\/\1>|\s+\/>)
CapGroup 3
(.*) - this group is inside of Non-CapGroup 1
Nested inside CapGroup 1 and 2 are "Bracket Expressions" (See Bracket Expressions). Nested inside Non-CapGroup 1, we have two(2) alternatives separated by the OR (Alternate) syntax |. See OR Operator
In our CapGroup 1 and 2, we are capturing everything matched by the pattern enclosed inside the ( parentheses ).
Understanding CapGroup 1: ([a-z]+)
... We are capturing everything inside the ( ). If we did not have the + quantifier, we would only capture one(1) character. Since we did declare the (Greedy) + quantifier, we capture the whole chain of lowercase letters, stopping at the first whitespace, digit, symbol or uppercase letter. In our matching string, this group captures the following:
p (the tag name; the < in front of it is matched by the literal < that comes before the group)
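To make the effect of the + quantifier on the first group concrete, here is a short sketch using two cut-down patterns. Both patterns and the sample tags are assumptions for illustration; they are only the opening part of the tutorial regex.

```javascript
// Only the start of the tutorial regex, without and with the + quantifier.
const oneLetter = /^<([a-z])/;
const wholeName = /^<([a-z]+)/;

const tag = "<div>";
const single = oneLetter.exec(tag)[1];
const full = wholeName.exec(tag)[1];

console.log(single); // d
console.log(full);   // div

// The chain stops at the first character that is not a lowercase letter.
const heading = wholeName.exec("<h1>")[1];
console.log(heading); // h
```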
Understanding CapGroup 2: ([^<]+)*
... We are capturing everything inside ( ). The bracket expression [^<] is a non-matching list: the ^ means it matches any single character that is NOT <, including whitespace and special characters. Because we declared a + quantifier, we capture a whole chain of such characters. This group finds and matches the following:
 id="regexTag", including whitespace. (Notice: there is whitespace in front of the "i". It is found and matched by this group.)
⚡ We also declared the * quantifier. This will match the previous token, the group ([^<]+), from zero(0) to unlimited times, as many times as needed. This is why a tag with no attributes can still match. See Quantifiers
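The sketch below checks what the second group holds for a tag with attributes and for a tag without any. The sample tags are made up for this example.

```javascript
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const withAttrs = '<a href="#top" class="nav">Top</a>'.match(tagRegex);
const noAttrs = "<a>Top</a>".match(tagRegex);

// Group 2 holds the attribute text, leading whitespace included.
console.log(withAttrs[2]);

// Because of the * quantifier the group may match zero times.
// In JavaScript a group that did not take part in the match is undefined.
console.log(noAttrs[2]); // undefined
```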
In our Non-CapGroup - (?:>(.*)<\/\1>|\s+\/>)
we have two(2) alternatives separated by the OR | syntax. The ?: at the beginning means we are grouping and matching everything enclosed inside the ( ) without capturing it. Inside the first alternative sits our 3rd group expression (See Below). We also escape the / character with \ so that it is treated as a literal. See Text and Oddies.
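First Alternative: >(.*)<\/\1>
\1 is a backreference. It matches the same text that was most recently matched by the 1st group.
This forces our closing HTML tag to match the opening HTML tag...
OR |
Second Alternative: \s+\/>
\s matches any whitespace character, so this alternative matches a self-closing tag that ends in whitespace followed by />.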
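A quick way to check both alternatives is to test a few tags against the regex. The sample tags below are assumptions for illustration.

```javascript
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// First alternative: the closing tag must repeat the name from group 1.
const matchingPair = tagRegex.test("<p>Hi</p>");      // true
const mismatchedPair = tagRegex.test("<p>Hi</div>");  // false

// Second alternative: whitespace followed by />.
const selfClosing = tagRegex.test("<br />");          // true
// \s+ needs at least one whitespace character, so this one fails.
const noSpace = tagRegex.test("<br/>");               // false

console.log(matchingPair, mismatchedPair, selfClosing, noSpace);
```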
In our CapGroup 3, we are matching any character, including whitespace, except for line terminators.
⚡ Line Terminators are: \n, \r, \u2028, \u2029
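The sketch below shows the line terminator rule with a tag whose content spans two lines. It also uses the s (dotAll) flag, which this tutorial does not cover; it is included here only as a contrast, and the sample strings are assumptions.

```javascript
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const oneLine = "<p>Hello World</p>";
const twoLines = "<p>Hello\nWorld</p>";

// The . in group 3 cannot cross the \n, so the second string fails.
const oneLineHit = tagRegex.test(oneLine);   // true
const twoLinesHit = tagRegex.test(twoLines); // false

// With the s (dotAll) flag, . also matches line terminators.
const dotAllRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/s;
const dotAllResult = dotAllRegex.exec(twoLines);

console.log(oneLineHit, twoLinesHit);
console.log(JSON.stringify(dotAllResult[3])); // "Hello\nWorld"
```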
Bracket Expressions refer to a matching or non-matching list of characters. These lists are case sensitive - a is not the same as A, etc... For example, our regex has 2 bracket expressions:
Expression 1
[a-z]+
Expression 1 will match a chain of characters from a through z until it hits whitespace, a symbol, or any other character outside that range.
For example, in regex is strange!
it will match regex
Expression 2
[^<]+
Using the ^ syntax inside a bracket expression will declare a non-matching list. This bracket expression lists the < symbol, so it matches any character that is NOT <. Expression 2 will therefore match all characters, including whitespace and symbols, until it hits a <.
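Both bracket expressions can be tried on their own, outside the full regex. The second and third sample strings are assumptions for this sketch.

```javascript
// Expression 1: a chain of lowercase letters.
const letters = "regex is strange!".match(/[a-z]+/)[0];
console.log(letters); // regex

// Expression 2: a chain of anything that is not <.
const notAngle = "abc 123!<def".match(/[^<]+/)[0];
console.log(notAngle); // abc 123!

// Bracket expressions are case sensitive.
const upperOnly = "ABC".match(/[a-z]+/);
console.log(upperOnly); // null
```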
The + quantifier will match a token from one(1) to unlimited times, as many times as possible, giving back as needed. In our regex, the tokens are the bracket expressions [a-z] and [^<]. This is known as a "Greedy" match.
Meanwhile, the * quantifier will match a token from zero(0) to unlimited times, again as many times as possible, giving back as needed. This is also a "Greedy" match.
Not shown in our regex, but a "Lazy" match will match as few characters as possible. You can do this by adding a ? directly after a quantifier, as in *? or +?.
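Here is a small sketch of the difference between a greedy and a lazy quantifier. The patterns and the sample HTML are assumptions for illustration and are simpler than the tutorial regex.

```javascript
const html = "<b>one</b><b>two</b>";

// Greedy: .* grabs as much as possible, then gives back
// only what is needed to find the last </b>.
const greedy = html.match(/<b>(.*)<\/b>/)[1];
console.log(greedy); // one</b><b>two

// Lazy: .*? grabs as little as possible, stopping at the first </b>.
const lazy = html.match(/<b>(.*?)<\/b>/)[1];
console.log(lazy); // one
```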
<, >, / are just text characters to match.
When a \ is placed in front of a special character such as /, we are making that character a literal.
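The effect of the \ escape can be sketched with the . character, which is special in a regex, and with the / used in our closing tag. These small patterns are assumptions for illustration.

```javascript
// Unescaped, . matches any character except a line terminator.
const anyChar = /^a.c$/;
// Escaped, \. only matches a literal dot.
const literalDot = /^a\.c$/;

console.log(anyChar.test("abc"));    // true
console.log(literalDot.test("abc")); // false
console.log(literalDot.test("a.c")); // true

// Inside a /.../ regex literal, \/ stands for a literal / character.
const closing = /<\/p>/;
console.log(closing.test("</p>")); // true
```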
I am a student, and passionate about development! I love sports, hunting, fishing and programming!
You can check out my GitHub account here!