Regex or regexp, short for regular expression, is a search pattern made up of a specific sequence of characters. Practical applications of regex include checking whether newly created usernames, passwords, or emails meet certain criteria, validating phone numbers and URLs, and searching and replacing text in text editors or word processors.
In this document, the following regex will be examined:
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
This expression searches for an HTML tag, including self-closing ones. Specifically, it looks for an opening tag delimited by `<` and `>`, followed by a closing tag delimited by `</` and `>` with a matching tag name, or for a self-closing tag that ends in `/>`. The table of contents outlines the regex topics, some of which do not apply to the example regex.
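As a quick illustration, the example regex can be tried against a few sample strings. This is a minimal sketch that assumes a JavaScript runtime such as Node.js (the document does not name a language); the sample tags and the name `tagPattern` are made up for illustration.

```javascript
// The example regex from this document, written as a JavaScript regex literal.
const tagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical sample strings, chosen only for illustration.
const tagSamples = [
  '<p>Hello</p>',          // opening and closing tag with the same name
  '<img src="cat.png" />', // self-closing tag
  '<div>Hello</span>',     // closing tag name does not match the opening tag
];

// test() returns true when the whole string fits the pattern.
const tagResults = tagSamples.map((text) => tagPattern.test(text));
console.log(tagResults); // [ true, true, false ]
```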
Rather than matching character(s), an anchor matches a position before or after character(s).
Anchor | Description |
---|---|
`^` | Matches the pattern at the beginning of the text |
`$` | Matches the pattern at the end of the text |
`\b` | Matches on a word boundary |
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
- The caret anchor `^` requires the match to start at the beginning of the text, because an HTML element begins with its opening tag.
- The dollar anchor `$` requires the match to finish at the end of the text, so the text must end with a closing tag or with the `/>` of a self-closing tag.
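The effect of the anchors can be seen by comparing the example regex with a copy that has `^` and `$` removed. This is a sketch assuming a JavaScript runtime; the unanchored variant and the sample strings are hypothetical and are not part of the example regex.

```javascript
// The example regex, anchored at both ends.
const anchoredTag = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
// Hypothetical variant with ^ and $ removed, for comparison only.
const unanchoredTag = /<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)/;

const surroundedText = 'Intro: <b>bold</b> done';

// false: the text does not start with "<" and does not end with a tag.
const anchoredResult = anchoredTag.test(surroundedText);
// true: without anchors, a tag found anywhere in the text is enough.
const unanchoredResult = unanchoredTag.test(surroundedText);

// \b matches between a word character and a non-word character.
const wholeWord = /\bcat\b/.test('a cat sat');    // true
const insideWord = /\bcat\b/.test('concatenate'); // false

console.log(anchoredResult, unanchoredResult, wholeWord, insideWord);
```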
Quantifiers specify how many times a part of the pattern may repeat. Quantifiers are greedy by default, which means each one matches as much text as possible. Appending `?` makes a quantifier lazy, so it matches as little text as possible.
Greedy Quantifier | Lazy Quantifier | Description |
---|---|---|
`*` | `*?` | Matches 0 or more instances |
`+` | `+?` | Matches 1 or more instances |
`?` | `??` | Matches 0 or 1 instance |
`{num}` | `{num}?` | Matches exactly num instances |
`{num,}` | `{num,}?` | Matches at least num instances |
`{num,num1}` | `{num,num1}?` | Matches from num to num1 instances |
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
- `[a-z]+` checks for 1 or more lowercase letters
- `[^<]+` checks for 1 or more characters that are not `<`
- `([^<]+)*` checks for 0 or more instances of that capture group
- `\s+` checks for 1 or more whitespace characters
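The difference between greedy and lazy quantifiers is easiest to see on text that contains two possible end points. This is a small sketch assuming a JavaScript runtime; the sample string and variable names are made up for illustration.

```javascript
// Hypothetical sample text with two <b> elements in a row.
const twoTags = '<b>one</b><b>two</b>';

// Greedy: .* runs to the last "</b>" it can still reach.
const greedyCapture = twoTags.match(/<b>(.*)<\/b>/)[1];
// Lazy: .*? stops at the first "</b>".
const lazyCapture = twoTags.match(/<b>(.*?)<\/b>/)[1];

// Counted quantifiers: exactly two, and between two and three.
const exactlyTwo = /^a{2}$/.test('aa');       // true
const twoToThree = /^a{2,3}$/.test('aaaa');   // false: four is too many

console.log(greedyCapture); // one</b><b>two
console.log(lazyCapture);   // one
```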
Grouping treats part of a pattern as a single unit, and a capture group also saves the substring that it matched. Capture groups can be given a name.
Group | Description |
---|---|
`(expr)` | Captures the text matched by the pattern within the parentheses |
`(?:expr)` | Groups the pattern within the parentheses without capturing it |
`(?=expr)` | Lookahead: succeeds only if the pattern within the parentheses matches next, without consuming or capturing it |
`(?<name>expr)` | Named capture group |
`\k<name>` | Named back reference, matches the same text as the named capture group |
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
- `([a-z]+)` is the first capture group, which captures the tag name and becomes useful in comparing closing tags
- `([^<]+)` is the second capture group, which captures HTML attributes, such as class or src
- `\1` references the first capture group and matches the same text that it captured
- `(?:>(.*)<\/\1>|\s+\/>)` is a non-capture group that holds the two possible endings; the capture group `(.*)` inside it captures the content of the HTML element
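The groups described above can be inspected on a match result. This sketch assumes a JavaScript runtime; the sample strings are hypothetical, and the named-group and lookahead patterns are simplified illustrations rather than part of the example regex.

```javascript
// Match the example regex against a hypothetical element with one attribute.
const groupMatch = '<div class="box">Hello</div>'
  .match(/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/);

console.log(groupMatch[1]); // div          (tag name)
console.log(groupMatch[2]); //  class="box" (attributes, with the leading space)
console.log(groupMatch[3]); // Hello        (element content)

// Simplified pattern using a named group and a named back reference.
const namedTag = /^<(?<tag>[a-z]+)>(?<content>.*)<\/\k<tag>>$/;
const namedMatch = '<em>hi</em>'.match(namedTag);
console.log(namedMatch.groups.tag, namedMatch.groups.content); // em hi

// Lookahead: "px" must follow the digits but is not part of the match.
const pixelCount = '12px'.match(/\d+(?=px)/)[0];
console.log(pixelCount); // 12
```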
A bracket expression matches any single character from the list enclosed by square brackets, `[expr]`.
Expression | Description |
---|---|
`^expr` | (At the start of the list) Negates the list, so it matches any character that is not in the list |
`expr-expr` | Shorthand for inclusive range |
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
- `[^<]` creates a list that matches any character that is not `<`
- `[a-z]` creates a list that matches any lowercase letter from a to z
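Both bracket expressions can be tried on their own. This is a short sketch assuming a JavaScript runtime, with made-up sample strings and variable names.

```javascript
// [a-z] accepts lowercase letters only, so an uppercase letter fails.
const lowercaseOnly = /^[a-z]+$/;
const lowerResult = lowercaseOnly.test('div'); // true
const mixedResult = lowercaseOnly.test('Div'); // false

// [^<] accepts anything except "<", so the match stops just before it.
const beforeAngle = 'class="a"<p'.match(/[^<]+/)[0];

console.log(lowerResult, mixedResult); // true false
console.log(beforeAngle);              // class="a"
```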
Character classes are shorthand for common categories of characters.
Class | Description |
---|---|
`\d` | A digit from 0 to 9 |
`\D` | Non-digit (inverse of `\d`) |
`\s` | Any white space |
`\S` | Non white space (inverse of `\s`) |
`\w` | Any Latin letter, digit, or underscore |
`\W` | Any character that is not a Latin letter, digit, or underscore (inverse of `\w`) |
`.` | Any character except a newline |
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
- `\s+` matches the white space that comes before `/>` in a self-closing tag
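A few of the classes from the table can be checked directly. This sketch assumes a JavaScript runtime; the sample strings and variable names are hypothetical.

```javascript
// \d+ collects runs of digits; the g flag returns every run.
const digitRuns = 'Room 42, floor 7'.match(/\d+/g);

// \w+ collects letters, digits, and underscores, so "-" splits a word.
const wordRuns = 'snake_case and kebab-case'.match(/\w+/g);

// \s+ accepts any amount of white space before "/>".
const spacedSelfClosing = /^<br\s+\/>$/.test('<br   />'); // true

// By default, "." does not match a newline.
const dotAcrossLines = /a.b/.test('a\nb'); // false

console.log(digitRuns); // [ '42', '7' ]
console.log(wordRuns);  // [ 'snake_case', 'and', 'kebab', 'case' ]
```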
The OR operator, denoted as `|`, is a logical operator that provides an alternative expression in the search pattern.
`^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$`
- `(?:>(.*)<\/\1>|\s+\/>)$` represents the possibilities of having a closing or self-closing tag
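The two alternatives can be isolated in a smaller pattern. This is a sketch assuming a JavaScript runtime; the simplified `<p>` pattern and the `cat|dog` comparison are hypothetical examples and are not taken from the example regex.

```javascript
// Simplified pattern: a <p> tag that ends either with "</p>" or with " />".
const closingOrSelfClosing = /^<p(?:>.*<\/p>|\s+\/>)$/;
const pairedResult = closingOrSelfClosing.test('<p>text</p>'); // true
const selfClosedResult = closingOrSelfClosing.test('<p />');   // true
const unclosedResult = closingOrSelfClosing.test('<p>');       // false

// Without a group, | splits the whole pattern, anchors included:
// /^cat|dog$/ means "starts with cat" OR "ends with dog".
const ungrouped = /^cat|dog$/.test('catalog');   // true
const grouped = /^(?:cat|dog)$/.test('catalog'); // false

console.log(pairedResult, selfClosedResult, unclosedResult, ungrouped, grouped);
```

Wrapping the alternatives in a group, as the example regex does, keeps `|` from splitting the entire expression.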
Flags are optional parameters that modify how a regex is applied.
Flag | Description |
---|---|
`i` | Ignores case, so uppercase and lowercase letters match each other |
`g` | Matches all occurrences |
`m` | Makes `^` and `$` match at the beginning and end of every line |
`u` | Enables full Unicode matching, including characters outside the basic UTF-16 range |
`y` | Sticky: matches only from the position where the previous match ended |
`s` | Allows `.` to match everything, including new lines |
Flags are not used in this example.
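Although the example regex has no flags, the sketch below shows what a few of them would change. It assumes a JavaScript runtime, where flags are written after the closing `/`; the variant with the `i` flag and the sample strings are hypothetical.

```javascript
// The example regex as written, and a hypothetical copy with the i flag.
const caseSensitiveTag = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
const caseInsensitiveTag = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/i;

const upperWithoutFlag = caseSensitiveTag.test('<DIV>hi</DIV>'); // false
const upperWithFlag = caseInsensitiveTag.test('<DIV>hi</DIV>');  // true

// g: return every match instead of only the first one.
const everyDigit = 'a1b2c3'.match(/\d/g);

// m: ^ and $ also match at the start and end of each line.
const eachLine = 'one\ntwo'.match(/^\w+$/gm);

// s: "." also matches a newline.
const dotAllResult = /a.b/s.test('a\nb'); // true

console.log(everyDigit); // [ '1', '2', '3' ]
console.log(eachLine);   // [ 'one', 'two' ]
```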
A character escape, denoted by prepending a backslash, `\`, removes a character's special meaning so that the character itself can be searched for. The special characters `[ \ ^ $ . | ? * + ( )` need a character escape to be searched for.
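Character escapes are used in this example: `\/` escapes the forward slash in the closing tag `<\/\1>` and in the self-closing ending `\/>`. The forward slash is not in the list of special characters above, but it is commonly escaped because languages such as JavaScript use `/` to mark the start and end of a regex.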
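The sketch below shows what escaping a special character changes. It assumes a JavaScript runtime, and the file name and price strings are made up for illustration.

```javascript
// Escaped dot: matches only a literal "." character.
const escapedDot = /^file\.txt$/;
const literalMatch = escapedDot.test('file.txt');  // true
const literalMiss = escapedDot.test('fileAtxt');   // false

// Unescaped dot: matches any character, so "fileAtxt" also passes.
const unescapedMatch = /^file.txt$/.test('fileAtxt'); // true

// Escaped dollar sign: matches a literal "$" instead of acting as an anchor.
const priceText = 'Total: $5 (approx.)'.match(/\$\d+/)[0];

console.log(literalMatch, literalMiss, unescapedMatch); // true false true
console.log(priceText);                                 // $5
```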
Justin Hui is an aspiring front-end web developer who is partially self-taught and currently enrolled in a coding bootcamp.