@huirayj
Last active July 3, 2021 23:08
A regex tutorial about searching for an HTML tag

Regex Tutorial: Searching for an HTML tag

What is a Regex?

Regex or regexp, short for regular expression, is a search pattern consisting of a specific sequence of characters. Practical applications of regex include checking whether newly created usernames, passwords, or emails meet certain criteria; validating phone numbers and URLs; and searching and replacing text in text editors or word processors.

Table of Contents

Overview

In this document, the following regex:

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

will be examined. This expression matches text that consists of a single HTML element: either an opening tag followed by content and a closing tag with the same tag name, such as <p>text</p>, or a self-closing tag that ends in whitespace followed by />, such as <br />. The sections below cover the main regex components; not every component appears in the example regex.
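
To make the overview concrete, here is a minimal sketch that runs the example regex against a few inputs in JavaScript. The variable names and sample strings are assumptions chosen for illustration; only the regex itself comes from this tutorial.

```javascript
// The example regex from this tutorial, written as a JavaScript regex literal.
const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  '<p>hello</p>',          // opening and closing tag with the same name
  '<img src="cat.png" />', // self-closing tag
  '<div>hello</span>',     // closing tag name differs from the opening one
  '<br/>',                 // no whitespace before "/>"
];

for (const sample of samples) {
  // Prints true, true, false, false (one result per sample, in order).
  console.log(sample, tagPattern.test(sample));
}
```

Note that the self-closing branch requires at least one whitespace character before />, so <br/> is rejected while <br /> is accepted.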

Regex Components

Anchors

Rather than matching character(s), an anchor matches a position before or after character(s).

Anchor   Description
^        Matches the position at the beginning of the text
$        Matches the position at the end of the text
\b       Matches the position at a word boundary

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

  • The caret anchor ^ requires the match to start at the beginning of the text, so the < that opens the tag must be the first character.
  • The dollar anchor $ requires the match to end at the end of the text, so the text must finish with a closing tag or with the /> of a self-closing tag.
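
The effect of the two anchors can be sketched by comparing the example regex with a copy that has ^ and $ removed. The unanchored variant and the sample text are assumptions made for this comparison and are not part of the tutorial regex.

```javascript
// The tutorial regex, anchored at both ends.
const anchored = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// The same pattern without ^ and $ (a variant for comparison only).
const unanchored = /<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)/;

// Hypothetical text where the tag is surrounded by other characters.
const text = 'Intro <b>bold</b> outro';

console.log(anchored.test(text));       // false: the text does not start with "<"
console.log(unanchored.test(text));     // true: the tag may appear anywhere
console.log(unanchored.exec(text)[0]);  // "<b>bold</b>"
```

With the anchors in place, the regex validates that the whole text is one element; without them, it searches for an element inside a longer text.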

Quantifiers

Quantifiers specify how many times the preceding character, bracket expression, or group may repeat. Quantifiers are greedy by default, which means they match as many characters as possible. Appending ? to a quantifier makes it lazy, so it matches as few characters as possible.

Greedy Quantifier   Lazy Quantifier   Description
*                   *?                Matches 0 or more instances
+                   +?                Matches 1 or more instances
?                   ??                Matches 0 or 1 instance
{num}               {num}?            Matches exactly num instances
{num,}              {num,}?           Matches at least num instances
{num,num1}          {num,num1}?       Matches from num to num1 instances

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

  • [a-z]+ matches 1 or more lowercase letters
  • [^<]+ matches 1 or more characters that are not <
  • ([^<]+)* matches 0 or more repetitions of that capture group
  • \s+ matches 1 or more whitespace characters
  • .* matches 0 or more characters other than a newline
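
The difference between greedy and lazy quantifiers is easiest to see side by side. The sketch below uses two small patterns and a sample string that are assumptions for illustration; they are simpler than the tutorial regex so that only the quantifier changes.

```javascript
// Hypothetical text containing two elements back to back.
const text = '<b>one</b><b>two</b>';

// Greedy: .* takes as much as it can, then backs off to the LAST "</b>".
const greedy = /<b>(.*)<\/b>/;

// Lazy: .*? takes as little as it can, stopping at the FIRST "</b>".
const lazy = /<b>(.*?)<\/b>/;

console.log(greedy.exec(text)[1]); // "one</b><b>two"
console.log(lazy.exec(text)[1]);   // "one"

// A bounded quantifier: between 2 and 3 repetitions of "a".
const twoToThree = /^a{2,3}$/;
console.log(twoToThree.test('a'));    // false
console.log(twoToThree.test('aaa'));  // true
console.log(twoToThree.test('aaaa')); // false
```

The tutorial regex uses the greedy form (.*), so with text such as the sample above its third capture group would span everything between the first opening tag and the last closing tag.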

Grouping Constructs

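A group treats part of a pattern as a single unit, so that a quantifier or an alternative can apply to the whole group. A capturing group also stores the substring it matched, which can be read after the match or reused later in the pattern through a back reference. Capture groups can be given a name.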

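Group         Description
(expr)        Captures the text matched by the pattern within the parentheses
(?:expr)      Non-capturing group: groups the pattern without storing what it matched
(?=expr)      Positive lookahead: succeeds only if expr matches at this position, without consuming or capturing text
(?<name>expr) Named capture group
\num          Numbered back reference: matches the same text as capture group num
\k<name>      Named back reference: matches the same text as the named capture group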

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

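  • ([a-z]+) is the first capture group, which captures the tag name so it can be compared with the closing tag
  • ([^<]+) is the second capture group, which captures the text between the tag name and the end of the opening tag, such as class or src attributes
  • \1 is a back reference to the first capture group and matches exactly the same text, so the closing tag must repeat the opening tag name
  • (?:>(.*)<\/\1>|\s+\/>) is a non-capturing group that wraps the two possible endings; the capture group (.*) inside it is the third capture group and captures the content of the HTML element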
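
The capture groups described above can be inspected with exec, which returns the full match at index 0 and each capture group at its number. The sample input and the group names tag, attrs, and content in the second pattern are assumptions added for illustration; the named version is a rewrite of the tutorial regex, not part of the original.

```javascript
const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical sample input.
const match = tagPattern.exec('<a href="x.html">link</a>');

console.log(match[1]); // "a"               (tag name)
console.log(match[2]); // ' href="x.html"'  (includes the leading space)
console.log(match[3]); // "link"            (element content)

// The same pattern with named groups and a named back reference.
const namedPattern =
  /^<(?<tag>[a-z]+)(?<attrs>[^<]+)*(?:>(?<content>.*)<\/\k<tag>>|\s+\/>)$/;

const named = namedPattern.exec('<a href="x.html">link</a>');
console.log(named.groups.tag);     // "a"
console.log(named.groups.attrs);   // ' href="x.html"'
console.log(named.groups.content); // "link"
```

Named groups make the result easier to read, at the cost of a longer pattern. Both versions accept and reject the same inputs.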

Bracket Expressions

A bracket expression matches any single character from the set enclosed by square brackets, [expr].

Expression   Description
^expr        (At the start of the list) Negates the list: matches any character that is not in it
expr-expr    Shorthand for an inclusive range of characters

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

  • [^<] matches any single character that is not <
  • [a-z] matches any single lowercase letter from a to z
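
A short sketch of the two bracket expressions in isolation, followed by one consequence for the full regex. The helper pattern names and sample strings are assumptions for illustration.

```javascript
// [a-z]: only lowercase letters a through z.
const lowercaseOnly = /^[a-z]+$/;
console.log(lowercaseOnly.test('div')); // true
console.log(lowercaseOnly.test('DIV')); // false: uppercase is outside a-z
console.log(lowercaseOnly.test('h1'));  // false: digits are outside a-z

// [^<]: any character except "<", so the match stops just before one.
const untilAngle = /[^<]+/;
console.log(untilAngle.exec('class="a" id="b"<')[0]); // 'class="a" id="b"'

// Consequence for the tutorial regex: the tag name group captures only "h"
// from "<h1>", so the back reference then expects "</h>" rather than "</h1>".
const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
console.log(tagPattern.test('<h1>Title</h1>')); // false
console.log(tagPattern.test('<p>Title</p>'));   // true
```

If heading tags need to match, one possible adjustment (not part of the original regex) is to widen the first bracket expression to [a-z0-9].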

Character Classes

Character classes are shorthand notations for common sets of characters.

Class   Description
\d      A digit from 0 to 9
\D      Any non-digit (inverse of \d)
\s      Any whitespace character, such as a space, tab, or newline
\S      Any non-whitespace character (inverse of \s)
\w      Any Latin letter, digit, or underscore
\W      Any character that is not a Latin letter, digit, or underscore (inverse of \w)
.       Any character except a newline

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

  • \s+ matches the whitespace that precedes /> in a self-closing tag
  • . in (.*) matches any character of the element's content except a newline
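
The following sketch checks a few of the character classes from the table, then shows how \s and . behave inside the tutorial regex. All sample strings are assumptions for illustration.

```javascript
console.log(/\d/.test('h1'));            // true: "1" is a digit
console.log(/^\w+$/.test('my_class1'));  // true: letters, digits, underscore
console.log(/^\w+$/.test('my-class'));   // false: "-" is not a word character
console.log(/\s/.test('a\tb'));          // true: a tab counts as whitespace
console.log(/^.$/.test('\n'));           // false: "." does not match a newline

const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// \s+ accepts any whitespace before "/>", including a tab.
console.log(tagPattern.test('<br\t/>')); // true

// (.*) cannot cross a newline, so content spanning two lines is rejected.
console.log(tagPattern.test('<p>line1\nline2</p>')); // false
```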

The OR Operator

The OR operator, denoted as |, separates alternative expressions in the search pattern: the regex matches if either the expression on its left or the expression on its right matches.

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

  • (?:>(.*)<\/\1>|\s+\/>)$ represents the two possible endings: content followed by a closing tag, or the /> of a self-closing tag
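
One way to see which side of the | matched is to check the third capture group, since (.*) exists only in the left alternative. The function name describeTag and its return strings are hypothetical, introduced only for this sketch.

```javascript
const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical helper: reports which alternative of the regex matched.
function describeTag(text) {
  const match = tagPattern.exec(text);
  if (match === null) {
    return 'no match';
  }
  // Group 3 is undefined when the self-closing alternative matched.
  // It is an empty string (not undefined) for an empty element like <p></p>.
  return match[3] !== undefined ? 'paired' : 'self-closing';
}

console.log(describeTag('<p>text</p>')); // "paired"
console.log(describeTag('<p></p>'));     // "paired"
console.log(describeTag('<hr />'));      // "self-closing"
console.log(describeTag('<hr>'));        // "no match"
```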

Flags

Flags are optional parameters that modify how a regex is applied.

Flag   Description
i      Case-insensitive matching
g      Global: matches all occurrences rather than stopping at the first
m      Multiline: ^ and $ also match at the beginning and end of every line
u      Unicode: treats the pattern and the text as Unicode code points rather than UTF-16 code units
y      Sticky: matches only at the position stored in the regex's lastIndex property
s      Lets . match every character, including newlines

Flags are not used in this example.
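
Although the example regex uses no flags, a brief sketch shows what some of them would change. The flagged variants below are assumptions built from the tutorial regex for comparison; they are not part of the original pattern.

```javascript
const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// i: uppercase tag names are rejected without it and accepted with it.
const ignoreCase = new RegExp(tagPattern.source, 'i');
console.log(tagPattern.test('<DIV>hi</DIV>')); // false
console.log(ignoreCase.test('<DIV>hi</DIV>')); // true

// s: lets "." in (.*) cross newlines, so multi-line content is accepted.
const dotAll = new RegExp(tagPattern.source, 's');
console.log(tagPattern.test('<p>line1\nline2</p>')); // false
console.log(dotAll.test('<p>line1\nline2</p>'));     // true

// g + m: find every line that is a complete element on its own.
const perLine = new RegExp(tagPattern.source, 'gm');
const lines = '<p>a</p>\n<br />';
console.log(tagPattern.test(lines)); // false: the text as a whole is not one element
console.log(lines.match(perLine));   // [ '<p>a</p>', '<br />' ]
```

Here new RegExp(tagPattern.source, flags) reuses the same pattern text with different flags, which avoids repeating the literal.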

Character Escapes

A character escape, written by prepending a backslash, \, removes a character's special meaning so that the character itself can be searched for. The following special characters need a character escape to be matched literally: [ \ ^ $ . | ? * + ( ).

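In this example, \/ escapes the forward slash, which would otherwise end the pattern when the regex is written as a JavaScript regex literal. The backslashes in \s and \1 are not escapes of a literal character; they form a character class and a back reference.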
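
A small sketch of escaping in practice. The first part shows an escaped and an unescaped dot; the second builds the tutorial regex from a string, where each backslash must itself be escaped. The variable names and sample strings are assumptions for illustration.

```javascript
// Unescaped "." matches any character; escaped "\." matches only a dot.
console.log(/a.c/.test('abc'));  // true
console.log(/a\.c/.test('abc')); // false
console.log(/a\.c/.test('a.c')); // true

// In a regex literal, "/" must be escaped because it delimits the pattern.
const fromLiteral = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// In a string passed to RegExp, "/" needs no escape, but every backslash
// is doubled so that the string itself contains a single backslash.
const fromString = new RegExp('^<([a-z]+)([^<]+)*(?:>(.*)</\\1>|\\s+/>)$');

console.log(fromLiteral.test('<p>hi</p>')); // true
console.log(fromString.test('<p>hi</p>'));  // true
console.log(fromString.test('<br />'));     // true
```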

Author

Justin Hui is an aspiring front-end web developer who is partially self-taught and currently enrolled in a coding bootcamp.
