This tutorial explains how a specific regular expression, or regex, works by breaking it into parts and describing what each part does. As a web development student, I find this kind of walkthrough important for understanding the search pattern a regex defines.
A regex, or regular expression, is a sequence of characters that defines a search pattern. Regular expressions are used to replace text within a string, validate forms, extract a substring from a string based on a pattern match, and much more. The technique was developed in theoretical computer science and formal language theory.
We will look at a regex that matches an HTML tag.
Example: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
The content below explains what each part of this regex does, along with related regex components:
- Anchors
- Quantifiers
- Grouping Constructs
- Bracket Expressions
- Character Classes
- The OR Operator
- Flags
- Character Escapes
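Before going through the components one by one, it can help to see the example regex run against a few inputs. The following is a minimal sketch in JavaScript (assumed here because the regex is written as a JavaScript-style literal); `parseTag` and its property names are hypothetical, chosen only for illustration.

```javascript
// The example regex from this tutorial, stored in a variable for reuse.
const htmlTag = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical helper: returns the captured pieces, or null when nothing matches.
function parseTag(input) {
  const m = htmlTag.exec(input);
  if (m === null) return null;
  // m[1] = tag name, m[2] = text between the name and ">", m[3] = tag content
  return { name: m[1], attributes: m[2], content: m[3] };
}

console.log(parseTag('<a href="x">link</a>')); // paired tag with an attribute
console.log(parseTag('<br />'));               // self-closing tag
console.log(parseTag('<div>hello</span>'));    // null: closing tag does not repeat "div"
```

The last call returns null because `\1` requires the closing tag to repeat whatever the first group captured.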
Anchors do not match any character. Instead, they match a position before, after, or between characters, so they can be used to "anchor" the regex match at a certain position.
The anchors are `^` and `$`, and their usage is explained below.
- `^Hello` matches any string that starts with "Hello"
- `world$` matches a string that ends with "world"
- `^Hello world$` is an exact string match (starts and ends with "Hello world")
- `goodbye` (no anchors) matches any string that has the text "goodbye" in it
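The anchor patterns listed above can be tried out with `RegExp.prototype.test`. This is a small sketch; the variable names and sample strings are made up for illustration.

```javascript
const startsWithHello = /^Hello/;
const endsWithWorld = /world$/;
const exactHelloWorld = /^Hello world$/;
const containsGoodbye = /goodbye/;

console.log(startsWithHello.test("Hello there"));     // true
console.log(startsWithHello.test("Say Hello"));       // false: "Hello" is not at the start
console.log(endsWithWorld.test("Hello world"));       // true
console.log(exactHelloWorld.test("Hello world!"));    // false: extra "!" before the end
console.log(containsGoodbye.test("say goodbye now")); // true: no anchors, any position
```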
Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.
The quantifiers are `*`, `+`, `?`, and `{}`, and their usage is explained below.
- `DEF*` matches a string that has DE followed by zero or more F
- `DEF+` matches a string that has DE followed by one or more F
- `DEF?` matches a string that has DE followed by zero or one F
- `DEF{2}` matches a string that has DE followed by 2 F
- `DEF{2,}` matches a string that has DE followed by 2 or more F
- `DEF{2,5}` matches a string that has DE followed by 2 up to 5 F
- `D(EF)*` matches a string that has a D followed by zero or more copies of the sequence EF
- `D(EF){2,5}` matches a string that has D followed by 2 up to 5 copies of the sequence EF
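A short sketch of the quantifiers above in JavaScript. The anchors `^` and `$` are an addition here, so that the whole input has to fit the pattern; without them, `DEF{2}` would also be found inside a longer run such as "DEFFF". Variable names are illustrative.

```javascript
const zeroOrMore = /^DEF*$/;
const oneOrMore = /^DEF+$/;
const zeroOrOne = /^DEF?$/;
const exactlyTwo = /^DEF{2}$/;
const twoToFive = /^DEF{2,5}$/;
const groupRepeat = /^D(EF){2,5}$/;

console.log(zeroOrMore.test("DE"));       // true: zero F is allowed
console.log(oneOrMore.test("DE"));        // false: at least one F is required
console.log(zeroOrOne.test("DEFF"));      // false: more than one F
console.log(exactlyTwo.test("DEFF"));     // true
console.log(twoToFive.test("DEFFFFFF"));  // false: six F characters
console.log(groupRepeat.test("DEFEF"));   // true: "EF" appears twice
console.log(groupRepeat.test("DEF"));     // false: "EF" appears only once
```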
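Grouping constructs let you treat part of a pattern as a single unit and extract information from strings or data.
The grouping constructs are written with `( )`, and their usage is explained below.
- `(hello)` captures the text that matches the expression in the parentheses, here the value "hello"
- `(?:hello)` groups the contained expression together without capturing what it matches (a non-capturing group)
- `(?=hello)` is a lookahead: it matches only at a position that is followed by "hello", without making "hello" part of the match
- `(?<hello>)` defines a named capture group called hello (the pattern to capture goes after the closing `>`)
- `\k<hello>` is a named back reference to the text captured by the group hello
In the example regex, `([a-z]+)` captures the tag name, and the numbered back reference `\1` requires the closing tag to repeat it.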
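The differences between these constructs show up in what `exec` returns. The sketch below uses made-up sample strings and a hypothetical group name `word`.

```javascript
// Capturing group: the matched text is stored at index 1 of the result.
const captured = /(hello)/.exec("say hello");
console.log(captured[1]); // "hello"

// Non-capturing group: groups "ha" for the + quantifier but stores nothing,
// so the result array holds only the full match.
const nonCapturing = /(?:ha)+/.exec("hahaha");
console.log(nonCapturing[0], nonCapturing.length); // "hahaha" 1

// Lookahead: "hello" must follow, but it is not part of the match itself.
const lookahead = /say (?=hello)/.exec("say hello");
console.log(JSON.stringify(lookahead[0])); // "say "

// Named group plus named back reference: the same word twice in a row.
const repeated = /(?<word>[a-z]+) \k<word>/.exec("it is is fine");
console.log(repeated.groups.word); // "is"
```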
A bracket expression is a regular expression that matches a single character, or collating element, out of the set listed between the brackets.
Bracket expressions are written with `[ ]`, and their usage is explained below.
- `[abc]` matches a string that has either an a, a b, or a c; it is the same as `a|b|c`
- `[a-c]` same as above
- `[a-fA-F0-9]` matches a single hexadecimal digit, case insensitively
- `[0-9]%` matches a string that has a digit from 0 to 9 before a % sign
- `[^a-zA-Z]` matches a character that is not a letter from a to z or from A to Z. In this case, the `^` is used as a negation of the expression.
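The bracket expressions above can be checked the same way. In this sketch the hexadecimal pattern is wrapped in `^` and `$` (an addition for the example) so that it tests exactly one character.

```javascript
const anyOfAbc = /[abc]/;
const hexDigit = /^[a-fA-F0-9]$/;
const digitPercent = /[0-9]%/;
const nonLetter = /[^a-zA-Z]/;

console.log(anyOfAbc.test("cat"));         // true: contains "c" and "a"
console.log(hexDigit.test("F"));           // true
console.log(hexDigit.test("g"));           // false: "g" is outside a-f
console.log(digitPercent.test("50% off")); // true: "0%" is a digit before %
console.log(nonLetter.test("abc"));        // false: every character is a letter
console.log(nonLetter.test("abc1"));       // true: "1" is not a letter
```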
Character classes match a character from a specific set. There are a number of predefined character classes and you can also define your own sets.
The character classes are `.`, `\d`, `\D`, `\w`, `\W`, `\s`, `\S`, `\t`, `\r`, `\n`, `\v`, `\f`, `[\b]`, `\0`, `\cX`, `\xhh`, `\uhhhh`, `\u{hhhh}` or `\u{hhhhh}`, and `\p{UnicodeProperty}` or `\P{UnicodeProperty}`, and their usage is explained below.
- `.` Matches any single character except line terminators
- `\d` Matches any digit, equivalent to `[0-9]`
- `\D` Matches any character that is not a digit, equivalent to `[^0-9]`
- `\w` Matches any alphanumeric character, including the underscore. Equivalent to `[A-Za-z0-9_]`
- `\W` Matches any character that is not a word character. Equivalent to `[^A-Za-z0-9_]`
- `\s` Matches a single white space character, including space, tab, form feed, line feed, and other Unicode spaces. Equivalent to `[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]`
- `\S` Matches a single character other than white space. Equivalent to `[^ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]`
- `\t` Matches a horizontal tab
- `\r` Matches a carriage return
- `\n` Matches a linefeed
- `\v` Matches a vertical tab
- `\f` Matches a form-feed
- `[\b]` Matches a backspace
- `\0` Matches a NUL character. Do not follow this with another digit
- `\cX` Matches a control character using caret notation, where "X" is a letter from A-Z
- `\xhh` Matches the character with the code hh (two hexadecimal digits)
- `\uhhhh` Matches a UTF-16 code unit with the value hhhh (four hexadecimal digits)
- `\u{hhhh}` or `\u{hhhhh}` (only when the u flag is set) Matches the character with the Unicode value U+hhhh or U+hhhhh (hexadecimal digits)
- `\p{UnicodeProperty}` or `\P{UnicodeProperty}` (only when the u flag is set) Matches a character based on its Unicode character properties
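A few of these classes in use, as a sketch with made-up inputs. The `g` flag is used so that `match` and `replace` act on every occurrence, and the `u` flag is set where the Unicode escapes require it.

```javascript
// \d with the g flag collects every digit in the input.
const digits = "a1b22".match(/\d/g);
console.log(digits); // [ '1', '2', '2' ]

// \W replaces anything that is not a letter, digit, or underscore.
const slug = "a1 b_2".replace(/\W/g, "-");
console.log(slug); // "a1-b_2"

// \s+ splits on any run of white space (tab and spaces here).
const words = "one\ttwo  three".split(/\s+/);
console.log(words); // [ 'one', 'two', 'three' ]

// The dot does not match a line terminator.
const dotMatchesNewline = /./.test("\n");
console.log(dotMatchesNewline); // false

// \xhh and \u{...} name a character by its code; \p{Lu} is "uppercase letter".
const hexEscape = /\x41/.test("A");
const codePointEscape = /\u{1F600}/u.test("\u{1F600}");
const startsUppercase = /^\p{Lu}/u.test("\u00C9mile");
console.log(hexEscape, codePointEscape, startsUppercase); // true true true
```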
The OR operator, `|`, matches either the expression on its left or the expression on its right (alternation). When the alternatives are single characters, a bracket expression gives the same either-or behavior.
OR can be written with `|` or `[]`, and their usage is explained below.
- `x(y|z)` matches a string that has x followed by y or z (and captures y or z)
- `x[yz]` same as above, but it does not capture y or z
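The capture difference between the two forms is visible in the result of `exec`. This is a small sketch with an illustrative input; the last pattern is an added example showing that `|` also works with alternatives longer than one character.

```javascript
// Alternation inside a capturing group: the chosen branch is captured.
const withGroup = /x(y|z)/.exec("xz");
console.log(withGroup[0], withGroup[1]); // "xz" "z"

// Bracket expression: same match, but nothing is captured.
const withBrackets = /x[yz]/.exec("xz");
console.log(withBrackets[0], withBrackets.length); // "xz" 1

// Alternatives can be whole words; (?:...) keeps the group non-capturing.
const catOrDog = /^(?:cat|dog)$/;
console.log(catOrDog.test("dog"));  // true
console.log(catOrDog.test("bird")); // false
```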
Flags are optional parameters that we can add to a plain expression to make it search in a different way.
The flags are `i`, `g`, `m`, `s`, `u`, and `y`, and their usage is explained below.
- `i` With this flag, the search is case-insensitive: no difference between `A` and `a`
- `g` With this flag, the search looks for all matches; without it, only the first match is returned
- `m` Multiline mode: `^` and `$` match at the start and end of each line, not only of the whole string
- `s` Enables "dotall" mode, which allows a dot `.` to match the newline character `\n`
- `u` Enables full Unicode support. The flag enables correct processing of surrogate pairs
- `y` "Sticky" mode: searching at the exact position in the text given by the regex's `lastIndex` property
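Each flag can be seen by comparing a pattern with and without it. The sketch below uses made-up inputs; `"\u{1F600}"` is an emoji written as an escape, chosen because it is stored as a surrogate pair.

```javascript
// i: case-insensitive
console.log(/a/.test("A"), /a/i.test("A")); // false true

// g: all matches instead of only the first
const allMatches = "a-a-a".match(/a/g);
const firstOnly = "a-a-a".match(/a/);
console.log(allMatches.length, firstOnly.length); // 3 1

// m: ^ also matches right after a line break
console.log(/^b/.test("a\nb"), /^b/m.test("a\nb")); // false true

// s: the dot also matches a newline
console.log(/a.b/.test("a\nb"), /a.b/s.test("a\nb")); // false true

// u: the dot consumes a whole surrogate pair as one character
console.log(/^.$/.test("\u{1F600}"), /^.$/u.test("\u{1F600}")); // false true

// y: the match must start exactly at lastIndex
const sticky = /foo/y;
sticky.lastIndex = 3;
const stickyAtThree = sticky.test("barfoo");
const stickyAtZero = /foo/y.test("barfoo");
console.log(stickyAtThree, stickyAtZero); // true false
```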
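Most regular expression operators are unescaped single characters. The escape character, `\` (a single backslash), signals to the regular expression parser that the character following the backslash should not be read in its usual way: an operator such as `/` or `.` becomes a literal character, while a letter such as `d` or `s` becomes a regular expression symbol (`\d`, `\s`). In the example regex, `\/` matches a literal forward slash.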
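The effect of the backslash in both directions can be sketched as follows. The sample strings are illustrative, and the `RegExp` constructor line is an added aside: inside a string literal the backslash itself has to be doubled.

```javascript
// Unescaped dot is an operator (any character); escaped dot is a literal ".".
const anyMiddle = /a.c/;
const literalDot = /a\.c/;
console.log(anyMiddle.test("abc"));  // true
console.log(literalDot.test("abc")); // false
console.log(literalDot.test("a.c")); // true

// Escaping a letter goes the other way: \d is the digit class, not the letter d.
const digitClass = /\d/;
console.log(digitClass.test("d")); // false
console.log(digitClass.test("7")); // true

// \/ is a literal slash, needed because / also ends a regex literal.
const closingP = /<\/p>/;
console.log(closingP.test("</p>")); // true

// When the pattern is built from a string, write two backslashes.
const fromString = new RegExp("\\d+");
console.log(fromString.test("42")); // true
```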
Name | Email | GitHub |
---|---|---|
Brendan Moore | brendandjmoore@gmail.com | Click Here |