@aidev13
Last active October 1, 2023 20:51
Regex Tutorial

Breaking down and understanding how to match an HTML tag using Regex.

Summary

Before we begin: what is a regex? A regex, short for "REGular EXpression," is a sequence of characters that defines a specific search pattern.

In this tutorial, we will break down the following regex used to find and match an HTML tag:

Regex
/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Matching String
<p id="regexTag">Hello World</p>
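
As a quick sanity check, the regex can be run against the matching string in JavaScript. This is only a minimal sketch; the names `tagRegex`, `sample` and `result` are made up for illustration and are not part of the original tutorial.

```javascript
// The regex from this tutorial, written as a JavaScript regex literal.
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// The matching string from above.
const sample = '<p id="regexTag">Hello World</p>';

// match() returns null on failure, or an array holding the full match
// followed by one entry per capturing group.
const result = sample.match(tagRegex);

console.log(result[0]); // the whole tag
console.log(result[1]); // tag name: p
console.log(result[2]); // attributes, including the leading space
console.log(result[3]); // text between the tags: Hello World
```

Each numbered entry lines up with one of the capturing groups explained in Grouping and Capturing.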

Regex Components

Anchors

In our regex, we have two anchors: ^ and $. The ^ anchor marks the start of the string and the $ anchor marks the end of the string, so the pattern has to match the entire string rather than a tag somewhere inside it.
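
The effect of the anchors can be seen by adding text around an otherwise valid tag. A small sketch in JavaScript, with test strings invented for this example:

```javascript
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// The whole string is a single tag, so both anchors are satisfied.
const exact = tagRegex.test("<p>Hello</p>");

// Extra text before the tag breaks ^, extra text after it breaks $.
const leading = tagRegex.test("text <p>Hello</p>");
const trailing = tagRegex.test("<p>Hello</p> text");

console.log(exact, leading, trailing); // true false false
```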

Quantifiers

In our regex, we have two "Greedy" quantifiers in use: * and +.

(See Greedy and Lazy Match)

OR Operator

The OR Operator, also known as alternation, uses the | syntax.

Example:

a|b matches either a or b

Note: matching is case sensitive.
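
The a|b example can be tried directly. This is a small illustrative sketch; `eitherRegex` and the other names are made up for it.

```javascript
// Matches the single character "a" or the single character "b".
const eitherRegex = /a|b/;

const matchesA = eitherRegex.test("a");
const matchesB = eitherRegex.test("b");
const matchesC = eitherRegex.test("c");
// Case sensitive: an uppercase "A" is neither "a" nor "b".
const matchesUpperA = eitherRegex.test("A");

console.log(matchesA, matchesB, matchesC, matchesUpperA); // true true false false
```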

Character Classes

In our regex, we have two Bracket Expressions (See Bracket Expressions) AKA Character Classes, or simply Classes. Classes are declared with [ brackets ].

Class 1

[a-z]

Class 2

[^<]

Flags

In our regex we have no flags declared. However, a flag, also known as a modifier, can be placed at the end of a regex to change how the search is performed. Some of the most basic flags are:

> Global: g
> Multiline: m
> Case insensitive: i
> Sticky - matches only at the position stored in the regex's lastIndex property: y
> Enable unicode support: u

An example in our regex would be as follows:

/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/g

Notice the flag g is placed AFTER the last /.
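
To see what a flag changes, here is a short JavaScript sketch comparing the tutorial regex with and without the i flag, plus a tiny g example. The variable names and the uppercase test string are assumptions made for illustration.

```javascript
// The tutorial regex has no flags, so [a-z] rejects uppercase tag names.
const strict = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
// Same pattern with the i flag placed after the closing slash.
const relaxed = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/i;

const strictResult = strict.test("<P>Hello</P>");
const relaxedResult = relaxed.test("<P>Hello</P>");
console.log(strictResult, relaxedResult); // false true

// Without g, match() returns only the first match (plus group details).
const firstOnly = "a-b-c".match(/[a-z]/);
// With g, match() returns every match.
const all = "a-b-c".match(/[a-z]/g);
console.log(firstOnly[0]); // a
console.log(all);          // [ 'a', 'b', 'c' ]
```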

Grouping and Capturing

In our regex, we have 4 groupings:

CapGroup 1

([a-z]+)

CapGroup 2

([^<]+)*

Non-CapGroup 1

(?:>(.*)<\/\1>|\s+\/>)

CapGroup 3

(.*) - this group is inside of Non-CapGroup 1

Nested inside CapGroup 1 and 2 are "Bracket Expressions" (See Bracket Expressions). Nested inside Non-CapGroup 1, we have two(2) alternatives separated by the OR (Alternate) syntax |. See OR Operator

In our CapGroup 1 and 2, we are Capturing everything matched inside the ( parentheses ).

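Understanding CapGroup 1: ([a-z]+)... We are capturing everything matched inside the ( ). Without the + quantifier, we would capture only one(1) character. Because we declared the (Greedy) + quantifier, we capture a whole run of lowercase letters, stopping at the first character that is not a through z, such as whitespace, a digit, an uppercase letter or a symbol. The opening < is matched by the literal < in front of the group and is not captured. In our matching string, this group captures the tag name:

p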

Understanding CapGroup 2: ([^<]+)*... We are capturing everything matched inside the ( ). The bracket expression [^<] uses ^ to match one character that is not in the list, so it matches any single character except <, including whitespace and special characters. Because we declared a + quantifier, it matches a whole run of such characters. In our matching string, this group captures the following:

 id="regexTag" , including whitespace. (Notice: there is whitespace in front of the "i". It is found and matched by this group.)

⚡ We also declared the * quantifier after the group. This allows the previous token (the group ([^<]+)) to match from zero(0) to unlimited times, as many times as needed, so a tag with no attributes still matches. See Quantifiers

In our Non-CapGroup - (?:>(.*)<\/\1>|\s+\/>) we have two(2) alternatives separated by the OR | syntax. The ?: at the beginning groups everything enclosed inside the ( ) without capturing it. Inside the first alternative sits our 3rd group expression (See Below). We also escape the / character with \ so that it is treated as a literal. See Texts and Oddies.

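First Alternative: >(.*)<\/\1>

\1 is a backreference. It matches the same text most recently captured by the 1st group. This is what forces our closing HTML tag to match the opening HTML tag.

OR |

Second Alternative: \s+\/>

\s matches any whitespace character, and the + requires at least one. This alternative matches a self-closing tag such as <br />.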

In our CapGroup 3, (.*), we are matching any character, including whitespace but not line terminators, zero(0) to unlimited times.

⚡ Line Terminators are: \n, \r, \u2028, \u2029
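
The two alternatives and the \1 backreference can be checked with a few made-up tags. This sketch assumes JavaScript's match(), where a capturing group that took no part in the match is reported as undefined.

```javascript
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// First alternative: \1 forces the closing tag name to equal group 1.
const paired = "<div>content</div>".match(tagRegex);
console.log(paired[1], paired[3]); // div content

// Mismatched closing tag: group 1 is "p", but the string closes with "div".
const mismatched = "<p>content</div>".match(tagRegex);
console.log(mismatched); // null

// Second alternative: a self-closing tag with whitespace before "/>".
const selfClosing = "<br />".match(tagRegex);
console.log(selfClosing[1]); // br
// Groups 2 and 3 did not take part in this match.
console.log(selfClosing[2], selfClosing[3]); // undefined undefined
```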

Bracket Expressions

Bracket Expressions define a list of characters to match (a matching list) or to exclude (a non-matching list). These lists are case sensitive - a is not the same as A, etc... For example, our regex has 2 bracket expressions:

Expression 1

[a-z]+

Expression 1 will match a run of characters from a through z until it hits whitespace, a symbol or any other character outside that range. For example, in regex is strange! it will match regex

Expression 2

[^<]+

Using the ^ syntax at the start of a bracket expression declares a non-matching list. This bracket expression lists the < symbol, which is just a text character, so Expression 2 will match all characters, including whitespace and symbols, except <.
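
Both bracket expressions can be tried on their own, outside the full regex. A minimal sketch; the input strings other than "regex is strange!" are invented for this example.

```javascript
// [a-z]+ matches a run of lowercase letters and stops at the first
// character outside a-z, here the space after "regex".
const word = "regex is strange!".match(/[a-z]+/)[0];
console.log(word); // regex

// The list is case sensitive, so the uppercase "R" is skipped.
const lower = "Regex".match(/[a-z]+/)[0];
console.log(lower); // egex

// [^<]+ matches anything except "<", including spaces and symbols.
const beforeTag = 'id="x" class="y"<p>'.match(/[^<]+/)[0];
console.log(beforeTag); // id="x" class="y"
```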

Greedy and Lazy Match

The + quantifier will match a token from one(1) to unlimited times, as many times as possible, giving back as needed. In our regex, the tokens it applies to are the bracket expressions [a-z] and [^<]. This is known as a "Greedy" match.

Meanwhile, the * quantifier will match a token from zero(0) to unlimited times, as many times as possible, giving back as needed. This is also a "Greedy" match.

Not shown in our regex: a "Lazy" match will match as few characters as possible. You can do this by adding a ? directly after a quantifier, as in +? or *?.
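
The difference is easiest to see when a string contains more than one possible ending point. In this sketch the patterns and the `html` string are simplified examples, not the tutorial regex.

```javascript
const html = "<b>one</b><b>two</b>";

// Greedy: .* grabs as much as it can, then gives back only what is needed,
// so the match runs through to the last "</b>".
const greedy = html.match(/<b>(.*)<\/b>/)[1];
console.log(greedy); // one</b><b>two

// Lazy: .*? grabs as little as it can, so the match stops at the first "</b>".
const lazy = html.match(/<b>(.*?)<\/b>/)[1];
console.log(lazy); // one
```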

Texts and Oddies

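<, > and / are just text characters to match.

When a \ is placed in front of a character with special meaning, that character is matched as a literal. In our regex, \/ matches a literal /, which would otherwise end the regex. A \ in front of certain letters or digits does the opposite and creates a special token, such as \s (whitespace) or \1 (backreference).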
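
A brief JavaScript sketch of escaping, using small patterns made up for this example:

```javascript
// Inside a regex literal an unescaped "/" would end the pattern,
// so the closing-tag slash is written as \/.
const closingTag = /<\/p>/;
const closes = closingTag.test("</p>");
console.log(closes); // true

// A backslash turns a special character into a literal one:
// "." matches any character, while "\." matches only a dot.
const anyChar = /a.c/.test("abc");
const literalDotMiss = /a\.c/.test("abc");
const literalDotHit = /a\.c/.test("a.c");
console.log(anyChar, literalDotMiss, literalDotHit); // true false true
```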

Author

I am a student, and passionate about development! I love sports, hunting, fishing and programming!

You can check out my GitHub account here!

https://github.com/aidev13
