Regex Tutorial: Searching for an HTML tag

What is a Regex?

A regex (or regexp), short for regular expression, is a search pattern made up of a specific sequence of characters. Practical applications of regex include checking whether newly created usernames, passwords, or emails meet certain criteria; validating phone numbers and URLs; and searching and replacing text in text editors or word processors.
Table of Contents


Regex Tutorial: Searching for an HTML tag

What is a Regex?
Table of Contents
Overview
Regex Components

Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
The OR Operator
Flags
Character Escapes


Author


Overview

In this document, the following regex:
^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

will be examined. This expression matches an HTML element, including a self-closing one. Specifically, it looks for an opening tag of the form <name ...>, followed either by content and a closing tag </name> whose name matches the opening tag, or by whitespace and />. The table of contents outlines the regex components covered below; not every component appears in the example regex.
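
To make the behaviour concrete, here is a minimal sketch that runs the regex against a few inputs. It assumes a JavaScript engine (the flags listed later in this tutorial are JavaScript's), and the sample strings and variable names are hypothetical, chosen only for illustration.

```javascript
// The regex from the overview, written as a JavaScript regex literal.
const htmlTag = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical sample inputs, chosen only for illustration.
const paired = "<div>hello</div>".match(htmlTag);
const withAttr = '<a href="x">link</a>'.match(htmlTag);
const selfClosing = "<br />".match(htmlTag);
const mismatched = "<div>hello</span>".match(htmlTag);

console.log(paired[1], paired[3]); // div hello
console.log(withAttr[2]);          // the attribute text, including its leading space
console.log(selfClosing[1]);       // br
console.log(mismatched);           // null, because the closing tag name differs
```

Note that match returns null when the pattern does not match, so real code should check the result before indexing into it.
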
Regex Components

Anchors

Rather than matching character(s), an anchor matches a position before or after character(s).


| Anchor | Description |
| --- | --- |
| `^` | Matches the pattern at the beginning of the text |
| `$` | Matches the pattern at the end of the text |
| `\b` | Matches on a word boundary |

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

The caret anchor ^ requires the match to begin at the start of the text, so the string must open with the < of an HTML tag.
The dollar anchor $ requires the match to finish at the end of the text, so the string must end with a closing tag or a self-closing tag.
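
The effect of each anchor can be sketched with a few small patterns. This assumes JavaScript; the patterns and sample strings are hypothetical and are not part of the example regex.

```javascript
// Hypothetical patterns that isolate each anchor.
const startsWithTag = /^<p>/;   // <p> must be at the very start
const endsWithTag = /<\/p>$/;   // </p> must be at the very end
const wholeWord = /\bdiv\b/;    // "div" must stand alone as a word

const startOk = startsWithTag.test("<p>text</p>");
const startBad = startsWithTag.test("intro <p>text</p>");
const endOk = endsWithTag.test("<p>text</p>");
const endBad = endsWithTag.test("<p>text</p> outro");
const wordOk = wholeWord.test("a div element");
const wordBad = wholeWord.test("a divider");

console.log(startOk, startBad); // true false
console.log(endOk, endBad);     // true false
console.log(wordOk, wordBad);   // true false
```
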

Quantifiers

Quantifiers specify how many times the preceding part of the pattern may repeat. By default, a quantifier is greedy, meaning it matches as many characters as possible. Appending ? makes it lazy, so it matches as few characters as possible.


| Greedy Quantifier | Lazy Quantifier | Description |
| --- | --- | --- |
| `*` | `*?` | Matches 0 or more instances |
| `+` | `+?` | Matches 1 or more instances |
| `?` | `??` | Matches 0 or 1 instance |
| `{num}` | `{num}?` | Matches exactly num instances |
| `{num,}` | `{num,}?` | Matches at least num instances |
| `{num,num1}` | `{num,num1}?` | Matches from num to num1 instances (no space after the comma) |


^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

[a-z]+ checks for 1 or more lowercase letters
[^<]+ checks for 1 or more characters that are not <
([^<]+)* checks for 0 or more instances of that capture group
\s+ checks for 1 or more whitespace characters
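
The difference between greedy and lazy matching is easiest to see side by side. The following is a small sketch, assuming JavaScript; the sample text and patterns are hypothetical.

```javascript
// Hypothetical text containing two adjacent elements.
const text = "<b>one</b><b>two</b>";

// Greedy: .* runs to the last </b> it can reach.
const greedy = text.match(/<b>.*<\/b>/)[0];
// Lazy: .*? stops at the first </b> it can reach.
const lazy = text.match(/<b>.*?<\/b>/)[0];

console.log(greedy); // <b>one</b><b>two</b>
console.log(lazy);   // <b>one</b>

// A counted quantifier: between 2 and 4 digits, nothing else.
const twoToFourDigits = /^\d{2,4}$/;
const inRange = twoToFourDigits.test("123");
const tooLong = twoToFourDigits.test("12345");
console.log(inRange, tooLong); // true false
```
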

Grouping Constructs

Grouping treats part of a pattern as a single unit. A capturing group also saves the substring it matched so that it can be referenced later, and capture groups can be given a name.


| Group | Description |
| --- | --- |
| `(expr)` | Captures the pattern within the parentheses |
| `(?:expr)` | Groups the pattern within the parentheses without capturing it |
| `(?=expr)` | Positive lookahead: succeeds only if expr matches next, without including it in the match |
| `(?<name>expr)` | Named capture group |
| `\k<name>` | Named backreference: matches the same text as the named capture group |


^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

([a-z]+) is the first capture group, which holds the tag name and is later compared against the closing tag
([^<]+) is the second capture group, which captures HTML attributes, such as class or src
\1 is a backreference to the first capture group and requires the closing tag name to be identical to the opening one
(?:>(.*)<\/\1>|\s+\/>) is a non-capturing group that wraps the two alternatives; inside it, (.*) is the third capture group, which captures the content of the HTML element
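
The example regex uses only numbered groups, so here is a brief sketch of the other constructs from the table. It assumes JavaScript; the patterns, group names (tag, content), and sample strings are hypothetical.

```javascript
// Named groups and a named backreference: a simplified tag matcher
// without attribute support, used only for illustration.
const namedTag = /^<(?<tag>[a-z]+)>(?<content>.*)<\/\k<tag>>$/;
const named = "<em>hi</em>".match(namedTag);
const namedMismatch = namedTag.test("<em>hi</b>");

console.log(named.groups.tag);     // em
console.log(named.groups.content); // hi
console.log(namedMismatch);        // false

// Non-capturing group: the year is grouped but not saved,
// so the month is capture group 1.
const month = "2024-05".match(/(?:\d{4})-(\d{2})/)[1];
console.log(month); // 05

// Positive lookahead: digits are matched only if "px" follows,
// and "px" is not part of the result.
const pixels = "width: 12px".match(/\d+(?=px)/)[0];
console.log(pixels); // 12
```
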

Bracket Expressions

A bracket expression matches any single character from the list enclosed by square brackets, [expr].


| Expression | Description |
| --- | --- |
| `^expr` | (At the start of the list) Negates the list, matching any character that is not in it |
| `expr-expr` | Shorthand for an inclusive range |


^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

[^<] matches any single character that is not <
[a-z] matches any single lowercase letter from a to z
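
Both bracket expressions can be tried in isolation. A minimal sketch, assuming JavaScript, with hypothetical sample strings:

```javascript
// A range: only lowercase letters from a to z are accepted.
const lowercaseOnly = /^[a-z]+$/;
const lowerOk = lowercaseOnly.test("div");
const lowerBad = lowercaseOnly.test("Div");
console.log(lowerOk, lowerBad); // true false

// A negated list: consume everything up to, but not including, the first <.
const upToBracket = 'class="box"<rest'.match(/^[^<]+/)[0];
console.log(upToBracket); // class="box"
```
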

Character Classes

Character classes are shorthand for common categories of characters.


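| Class | Description |
| --- | --- |
| `\d` | A digit from 0 to 9 |
| `\D` | Any non-digit (inverse of \d) |
| `\s` | Any whitespace character |
| `\S` | Any non-whitespace character (inverse of \s) |
| `\w` | Any Latin letter, digit, or underscore |
| `\W` | Any character that is not a Latin letter, digit, or underscore (inverse of \w) |
| `.` | Any character except a newline |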


^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

\s+ matches the whitespace that precedes /> in a self-closing tag
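
A few of the classes in the table can be sketched as follows, assuming JavaScript; the sample strings are hypothetical.

```javascript
// \d: replace every digit with a placeholder.
const masked = "a1b22".replace(/\d/g, "#");
console.log(masked); // a#b##

// \s: split on any single whitespace character (space or tab here).
const parts = "a b\tc".split(/\s/);
console.log(parts); // [ 'a', 'b', 'c' ]

// \w: letters, digits, and underscores only.
const wordOk = /^\w+$/.test("user_name1");
const wordBad = /^\w+$/.test("user-name");
console.log(wordOk, wordBad); // true false

// .: matches any character except a newline.
const dotNewline = /a.c/.test("a\nc");
console.log(dotNewline); // false
```
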

The OR Operator

The OR operator, denoted as |, is a logical operator that provides an alternative expression in the search pattern.
^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

(?:>(.*)<\/\1>|\s+\/>)$ offers two alternatives for how the element may end: >, content, and a matching closing tag, or whitespace followed by the self-closing />
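
One detail worth illustrating is why the example wraps its alternatives in a group: | has very low precedence, so without a group it splits the entire pattern, anchors included. A minimal sketch, assuming JavaScript, with hypothetical patterns:

```javascript
// Ungrouped: this means "starts with cat" OR "ends with dog".
const ungrouped = /^cat|dog$/;
// Grouped: this means "the whole text is cat or dog".
const grouped = /^(?:cat|dog)$/;

const ungroupedHit = ungrouped.test("catalog");
const groupedHit = grouped.test("catalog");
const groupedDog = grouped.test("dog");

console.log(ungroupedHit); // true, because "catalog" starts with "cat"
console.log(groupedHit);   // false
console.log(groupedDog);   // true
```
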

Flags

Flags are optional parameters that modify how a regex searches.


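| Flag | Description |
| --- | --- |
| `i` | Case-insensitive: ignores the difference between uppercase and lowercase |
| `g` | Global: matches all occurrences instead of stopping at the first |
| `m` | Multiline: ^ and $ match at the beginning and end of every line |
| `u` | Unicode: treats the pattern and text as Unicode code points rather than UTF-16 code units |
| `y` | Sticky: matches only at the regex's current starting position instead of searching ahead |
| `s` | DotAll: lets . match everything, including newlines |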


Flags are not used in this example.
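
Although the example regex uses no flags, their effects can be sketched briefly. This assumes JavaScript, where flags follow the closing slash of a regex literal; the patterns and sample strings are hypothetical.

```javascript
// i: case-insensitive matching.
const caseInsensitive = /<div>/i.test("<DIV>");

// g: collect every match rather than only the first.
const allNumbers = "a1b22c333".match(/\d+/g);

// m: ^ and $ apply to each line, not just the whole text.
const lines = "one\ntwo".match(/^\w+$/gm);

// s: the dot also matches a newline.
const dotAll = /a.b/s.test("a\nb");

// u: the dot matches a whole code point outside the Basic Multilingual Plane.
const emoji = "\u{1F600}";
const withoutU = /^.$/.test(emoji);
const withU = /^.$/u.test(emoji);

// y: the match must start exactly at lastIndex.
const sticky = /foo/y;
sticky.lastIndex = 3;
const stickyHit = sticky.test("barfoo");

console.log(caseInsensitive);  // true
console.log(allNumbers);       // [ '1', '22', '333' ]
console.log(lines);            // [ 'one', 'two' ]
console.log(dotAll);           // true
console.log(withoutU, withU);  // false true
console.log(stickyHit);        // true
```
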
Character Escapes

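A character escape, written by prepending a backslash, \, removes a character's special meaning so that the character itself can be searched for. The special characters that need an escape to be matched literally are [ ] \ ^ $ . | ? * + ( ) { }.
In this example, the forward slash is escaped as \/ in both <\/\1> and \s+\/>. The forward slash is not special to the regex itself, but it must be escaped in languages that delimit a regex literal with slashes, such as JavaScript.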
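
The following sketch shows what an escape changes, assuming JavaScript. The escapeRegExp helper is a hypothetical illustration of a common pattern, not a built-in function.

```javascript
// Escaped dot: matches only a literal ".".
const literalDot = /^\d+\.\d+$/;
// Unescaped dot: matches any character except a newline.
const anyChar = /^\d+.\d+$/;

const dotOk = literalDot.test("3.14");
const dotBad = literalDot.test("3x14");
const anyOk = anyChar.test("3x14");
console.log(dotOk, dotBad, anyOk); // true false true

// Hypothetical helper: escape every special character in a string
// so the string can be embedded in a pattern and matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literalPlus = new RegExp("^" + escapeRegExp("a+b") + "$");
const plusOk = literalPlus.test("a+b");
const plusBad = literalPlus.test("aab");
console.log(plusOk, plusBad); // true false
```
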
Author

Justin Hui is an aspiring Front-end web developer, who is partially self-taught and currently enrolled in a coding bootcamp.