Regex Tutorial

Below is a tutorial that explains how a specific regular expression, or regex, works by breaking the expression into parts and describing what each part does. As a web development student, I find a tutorial like this important because it helps me understand the search pattern that a regex defines.

Summary

A regex, or regular expression, is a sequence of characters that defines a search pattern. Regular expressions are used for replacing text within a string, validating forms, extracting a substring based on a pattern match, and much more. The technique was developed in theoretical computer science.
We will look at a regex that matches an HTML tag.
Example: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
The sections below explain what each part of this expression does.
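
As a rough illustration of the example pattern in use, the TypeScript sketch below runs it against a few inputs. The helper name `parseTag` and the sample strings are assumptions made for this sketch and are not part of the original tutorial.

```typescript
// The tutorial's example pattern, written as a regex literal.
const htmlTag = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical helper: returns the tag name and inner text,
// or null when the input does not match the pattern.
function parseTag(input: string) {
  const m = htmlTag.exec(input);
  if (m === null) {
    return null;
  }
  return { tag: m[1], inner: m[3] };
}

const paired = parseTag("<div>hello</div>");      // tag "div", inner "hello"
const selfClosing = parseTag("<br />");           // tag "br", inner undefined
const mismatched = parseTag("<div>hello</span>"); // null: \1 requires the same closing name

console.log(paired, selfClosing, mismatched);
```

Because the pattern has no i flag and only allows [a-z] in the tag name, an uppercase tag such as `<DIV>` would not match.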
Table of Contents


Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
The OR Operator
Flags
Character Escapes

Regex Components

Anchors

Anchors do not match any character. Instead, they match a position before, after, or between characters, so they can be used to "anchor" the regex match at a certain position.
The anchors are ^ and $, and their usage is explained below. In the example regex, ^ and $ ensure that the entire string is a single HTML tag.

^Hello matches any string that starts with "Hello"
world$ matches any string that ends with "world"
^Hello world$ matches only the exact string "Hello world" (the string both starts and ends with it)
goodbye has no anchor, so it matches any string that has the text "goodbye" in it
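
The anchor patterns listed above can be checked with a short TypeScript sketch. The test strings and the `anchorResults` name are assumptions chosen for illustration.

```typescript
// Each entry pairs one of the anchor patterns with a sample input.
const anchorResults = {
  startHit: /^Hello/.test("Hello there"),         // true: starts with "Hello"
  startMiss: /^Hello/.test("Say Hello"),          // false: "Hello" is not at the start
  endHit: /world$/.test("Hello world"),           // true: ends with "world"
  endMiss: /world$/.test("world peace"),          // false: "world" is not at the end
  exactHit: /^Hello world$/.test("Hello world"),  // true: the whole string matches
  exactMiss: /^Hello world$/.test("Hello world!"),// false: extra character at the end
  anywhere: /goodbye/.test("say goodbye now"),    // true: no anchor, so any position works
};

console.log(anchorResults);
```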

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.
Quantifiers are * + ? and {} and their usage is explained below.

DEF* matches a string that has DE followed by zero or more F
DEF+ matches a string that has DE followed by one or more F
DEF? matches a string that has DE followed by zero or one F
DEF{2} matches a string that has DE followed by 2 F
DEF{2,} matches a string that has DE followed by 2 or more F
DEF{2,5} matches a string that has DE followed by 2 up to 5 F
D(EF)* matches a string that has a D followed by zero or more copies of the sequence EF
D(EF){2,5} matches a string that has D followed by 2 up to 5 copies of the sequence EF
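
To make the counts visible, the sketch below wraps each quantifier pattern in ^ and $ so that the whole input has to match. That wrapping, the variable names, and the sample inputs are assumptions for this example. Without the anchors, a pattern such as DEF{2} would also match inside a longer string like "DEFFF".

```typescript
const zeroOrMore = /^DEF*$/;
const oneOrMore = /^DEF+$/;
const zeroOrOne = /^DEF?$/;
const exactlyTwo = /^DEF{2}$/;
const twoToFive = /^DEF{2,5}$/;
const groupRepeated = /^D(EF){2,5}$/;

const quantifierResults = [
  zeroOrMore.test("DE"),        // true: zero F is allowed
  zeroOrMore.test("DEFFF"),     // true: three F
  oneOrMore.test("DE"),         // false: at least one F is required
  zeroOrOne.test("DEFF"),       // false: at most one F is allowed
  exactlyTwo.test("DEFF"),      // true: exactly two F
  exactlyTwo.test("DEFFF"),     // false: three F is one too many
  twoToFive.test("DEFFFFFF"),   // false: six F is above the upper bound
  groupRepeated.test("DEFEF"),  // true: the sequence EF appears twice
  groupRepeated.test("DEF"),    // false: the sequence EF appears only once
];

console.log(quantifierResults);
```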

Grouping Constructs

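Grouping constructs treat part of a pattern as a single unit and let you extract the text that part matched.
Grouping constructs are written with ( ), and their usage is explained below. In the example regex, ([a-z]+) captures the tag name, \1 refers back to that capture in the closing tag, and (?:...) groups the two possible endings without capturing them.

(hello) capturing group: captures the text matched by the expression in the parentheses, here the value "hello"
(?:hello) non-capturing group: groups the contained expression together but does not capture what it matched
(?=hello) lookahead: matches only if the expression in the parentheses comes next in the input, without consuming or capturing it
(?<hello>...) named capture group: captures whatever ... matches under the name hello
\k<hello> named backreference: matches the same text that the group named hello captured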
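
The following TypeScript sketch shows one way to observe the difference between these constructs. The sample strings, the variable names, and the group name `word` are assumptions for illustration.

```typescript
// (hello): a capturing group is stored at index 1 of the match array.
const captured = /say (hello)/.exec("say hello");
const groupOne = captured ? captured[1] : undefined;            // "hello"

// (?:hello): groups for the + quantifier but adds no entry to the match array.
const nonCapturing = /(?:hello)+!/.exec("hellohello!");
const entryCount = nonCapturing ? nonCapturing.length : 0;      // 1: only the full match

// (?=hello): a lookahead checks what follows without making it part of the match.
const lookahead = /say (?=hello)/.exec("say hello");
const lookaheadText = lookahead ? lookahead[0] : undefined;     // "say "

// (?<word>...) names a group; \k<word> must match the same text again.
const repeated = /(?<word>\w+) \k<word>/.exec("it was was fine");
const repeatedWord =
  repeated && repeated.groups ? repeated.groups["word"] : undefined; // "was"

console.log(groupOne, entryCount, lookaheadText, repeatedWord);
```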

Bracket Expressions

A bracket expression is a regular expression that matches a single character, or collating element.
Bracket expressions are [ ] and their usage is explained below.

[abc] matches a string that has an a, a b, or a c; it is the same as a|b|c
[a-c] same as above, written as a range
[a-fA-F0-9] matches a string that has a single hexadecimal digit, in either upper or lower case
[0-9]% matches a string that has a digit from 0 to 9 immediately before a % sign
[^a-zA-Z] matches a string that has a character that is not a letter from a to z or from A to Z. In this case, the ^ negates the bracket expression.
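
A brief TypeScript sketch of the bracket expressions above follows. The inputs and variable names are assumptions, and the hexadecimal pattern is anchored with ^ and $ here so that it tests a single character.

```typescript
const anyOfAbc = /[abc]/.test("xyzb");   // true: the input contains a b
const rangeAc = /[a-c]/.test("xyz");     // false: no a, b, or c in the input

// A single hexadecimal digit, upper or lower case.
const hexDigit = /^[a-fA-F0-9]$/;
const hexResults = [hexDigit.test("F"), hexDigit.test("7"), hexDigit.test("g")]; // true, true, false

// A digit immediately before a percent sign.
const percent = /[0-9]%/.exec("50% off");
const percentText = percent ? percent[0] : undefined; // "0%"

// A leading ^ inside the brackets negates the set.
const nonLetter = /[^a-zA-Z]/;
const nonLetterResults = [nonLetter.test("abc"), nonLetter.test("abc1")]; // false, true

console.log(anyOfAbc, rangeAc, hexResults, percentText, nonLetterResults);
```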

Character Classes

Character classes match a character from a specific set. There are a number of predefined character classes and you can also define your own sets.
Character classes are . \d \D \w \W \s \S \t \r \n \v \f [\b] \0 \cX \xhh \uhhhh \u{hhhh} or \u{hhhhh} and \p{UnicodeProperty} or \P{UnicodeProperty}, and their usage is explained below.

. Matches any single character except line terminators
\d Matches any digit, equivalent to [0-9]
\D Matches any character that is not a digit, equivalent to [^0-9]
\w Matches any alphanumeric character, including the underscore. Equivalent to [A-Za-z0-9_]
\W Matches any character that is not a word character. Equivalent to [^A-Za-z0-9_]
\s Matches a single white space character, including space, tab, form feed, line feed, and other Unicode spaces. Equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
\S Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
\t Matches a horizontal tab
\r Matches a carriage return
\n Matches a linefeed
\v Matches a vertical tab
\f Matches a form-feed
[\b] Matches a backspace
\0 Matches a NUL character. Do not follow this with another digit
\cX Matches a control character using caret notation, where "X" is a letter from A-Z
\xhh Matches the character with the code hh (two hexadecimal digits)
\uhhhh Matches a UTF-16 code unit with the value hhhh (four hexadecimal digits)
\u{hhhh} or \u{hhhhh} (only when the u flag is set) Matches the character with the Unicode value U+hhhh or U+hhhhh (hexadecimal digits)
\p{UnicodeProperty} or \P{UnicodeProperty} Matches a character based on its Unicode character properties
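
A few of these classes are sketched in TypeScript below. The sample strings and variable names are assumptions for illustration, and the \p{...} example relies on the u flag being supported by the runtime.

```typescript
// \d: pull out runs of digits.
const digitRuns = "Order 66 shipped 2 items".match(/\d+/g);     // ["66", "2"]

// \w: letters, digits, and underscore, but not the hyphen.
const wordRuns = "snake_case and kebab-case".match(/\w+/g);     // ["snake_case", "and", "kebab", "case"]

// \s: collapse any run of white space into one space.
const collapsed = "a \t b\n c".replace(/\s+/g, " ");            // "a b c"

// .: any character except a line terminator.
const dotResults = [/a.c/.test("abc"), /a.c/.test("a\nc")];     // true, false

// \xhh and \uhhhh: match a character by its code.
const byCode = [/\x41/.test("A"), /\u00e9/.test("caf\u00e9")];  // true, true

// \p{Lu}: an uppercase letter by Unicode property (needs the u flag).
const upperFirst = [/^\p{Lu}/u.test("\u00c4rger"), /^\p{Lu}/u.test("\u00e4rger")]; // true, false

console.log(digitRuns, wordRuns, collapsed, dotResults, byCode, upperFirst);
```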

The OR Operator

In a regex, the OR operator (alternation) matches either the expression on its left or the expression on its right.
The OR operator is |, and a bracket expression [] gives the same effect for single characters. Their usage is explained below.

x(y|z) matches a string that has x followed by y or z (and captures y or z)
x[yz] same as above, but it does not capture y or z
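
The difference between the two forms can be seen in a small TypeScript sketch. The inputs, the variable names, and the cat/dog pattern are assumptions added for illustration.

```typescript
// x(y|z): the alternation sits in a capturing group, so the chosen letter is captured.
const alternation = /x(y|z)/.exec("xz");
const capturedLetter = alternation ? alternation[1] : undefined; // "z"

// x[yz]: matches the same inputs but the match array holds only the full match.
const bracketForm = /x[yz]/.exec("xz");
const bracketEntries = bracketForm ? bracketForm.length : 0;     // 1

// | can also choose between whole sequences, which a bracket expression cannot do.
const pets = /^(?:cat|dog)$/;
const petResults = [pets.test("cat"), pets.test("dog"), pets.test("cag")]; // true, true, false

console.log(capturedLetter, bracketEntries, petResults);
```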

Flags

Flags are optional parameters that we can add to a plain expression to make it search in a different way.
Flags are i g m s u and y and their usage is explained below.

i With this flag, the search is case-insensitive: there is no difference between A and a
g With this flag, the search looks for all matches; without it, only the first match is returned
m Multiline mode: ^ and $ match at the start and end of each line, not only at the start and end of the whole string
s Enables "dotall" mode, which allows a dot . to match the newline character \n
u Enables full Unicode support. The flag enables correct processing of surrogate pairs
y "Sticky" mode: the search matches only at the exact position given by the regex's lastIndex property
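
Each flag's effect can be compared side by side in a TypeScript sketch like the one below. The sample strings and variable names are assumptions, and the s, u, and y flags assume a reasonably modern JavaScript runtime.

```typescript
// i: ignore case.
const ignoreCase = /hello/i.test("HeLLo");              // true

// g: find every match instead of only the first one.
const firstOnly = "a1b2c3".match(/\d/);                 // first match only
const firstDigit = firstOnly ? firstOnly[0] : undefined; // "1"
const allDigits = "a1b2c3".match(/\d/g);                // ["1", "2", "3"]

// m: ^ and $ also match at line breaks.
const wholeString = /^two$/.test("one\ntwo");           // false
const perLine = /^two$/m.test("one\ntwo");              // true

// s: the dot also matches a newline.
const dotDefault = /a.c/.test("a\nc");                  // false
const dotAll = /a.c/s.test("a\nc");                     // true

// u: treat a surrogate pair as one character.
const emoji = "\uD83D\uDE00";                           // one emoji stored as two UTF-16 code units
const withoutU = /^.$/.test(emoji);                     // false
const withU = /^.$/u.test(emoji);                       // true

// y: match only at lastIndex.
const sticky = /foo/y;
const atStart = sticky.test("barfoo");                  // false: "foo" is not at index 0
sticky.lastIndex = 3;
const atThree = sticky.test("barfoo");                  // true: "foo" starts at index 3

console.log(ignoreCase, firstDigit, allDigits, wholeString, perLine,
  dotDefault, dotAll, withoutU, withU, atStart, atThree);
```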

Character Escapes

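Most regular expression operators are unescaped single characters. The escape character, \ (a single backslash), signals to the regular expression parser that the character following the backslash should not be read in the usual way. Placed before an ordinary character, it forms a special sequence such as \d or \s. Placed before a metacharacter, it makes that character literal, so \. matches a period. In the example regex, \/ matches a literal forward slash, which would otherwise end the regex literal, and \1 is a backreference to the first capture group.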
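
The TypeScript sketch below shows the backslash changing how a character is read. The inputs and variable names are assumptions, and `escapeRegExp` is a hypothetical helper written for this example, not a built-in function.

```typescript
// Escaping a metacharacter makes it literal.
const plusLiteral = /1\+1=2/.test("1+1=2");   // true: \+ is a literal plus sign
const plusOperator = /1+1=2/.test("1+1=2");   // false: + means "one or more of the preceding 1"
const dotLiteral = /^a\.c$/.test("abc");      // false: \. only matches a period
const dotAny = /^a.c$/.test("abc");           // true: . matches any character

// Escaping an ordinary letter can give it a special meaning.
const letterD = /^d$/.test("7");              // false: d is the letter d
const digitClass = /^\d$/.test("7");          // true: \d is any digit

// Hypothetical helper: escape every metacharacter so that text
// supplied at run time is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literalSearch = new RegExp(escapeRegExp("a.b"));
const matchesLiteral = literalSearch.test("a.b"); // true
const matchesOther = literalSearch.test("axb");   // false

console.log(plusLiteral, plusOperator, dotLiteral, dotAny,
  letterD, digitClass, matchesLiteral, matchesOther);
```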
Author


Name: Brendan Moore
Email: brendandjmoore@gmail.com
GitHub: bdjm94


Sources


MDN Documents
Laserfiche User Guide
Regex Tutorial
Regex Builder