@bdjm94
Last active June 29, 2021 11:02

Regex Tutorial

Below is a tutorial that explains how a specific regular expression, or regex, functions by breaking down each part of the expression and describing what it does. As a web development student, I find this kind of breakdown important because it helps me understand the search pattern a regex defines.

Summary

A regex, or regular expression, is a sequence of characters that defines a search pattern. Regular expressions are used to replace text within a string, validate forms, extract a substring from a string based on a pattern match, and much more. The concept originated in theoretical computer science.

We will look at a regex that matches an HTML tag.

Example: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

The sections below explain what each part of this expression does, along with related regex features.
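
Before breaking the expression down, it may help to see it run. The following is a minimal sketch that tests the example regex against a few sample strings; the sample inputs and variable names are assumptions chosen for illustration, not part of the original tutorial.

```javascript
// The tag-matching regex from the example above.
const tagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  '<div>hello</div>',     // paired tag
  '<a href="x">link</a>', // paired tag with an attribute
  '<img />',              // self-closing tag
  '<div>hello</span>',    // closing tag does not match the opening tag
  'plain text',           // not a tag at all
];

const results = samples.map((s) => tagRegex.test(s));
console.log(results); // [ true, true, true, false, false ]

// Capture groups: [1] tag name, [2] attributes, [3] inner content.
const match = tagRegex.exec('<a href="x">link</a>');
const parts = { tag: match[1], attrs: match[2], content: match[3] };
console.log(parts); // tag is "a", attrs is ' href="x"', content is "link"
```

The fourth sample fails because the backreference \1 requires the closing tag name to repeat whatever the first group captured.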

Regex Components

Anchors

Anchors do not match any character. Instead, they match a position before, after, or between characters. They can be used to "anchor" the regex match at a certain position.

Anchors are ^ and $ and their usage is explained below.

  • ^Hello matches any string that starts with "Hello"
  • world$ matches any string that ends with "world"
  • ^Hello world$ exact string match (starts and ends with "Hello world")
  • goodbye matches any string that has the text "goodbye" anywhere in it
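
The anchor patterns above can be tried out with a short sketch like the one below. The sample strings and the anchorResults name are made up for illustration.

```javascript
// Each entry pairs one of the anchor patterns above with a made-up input.
const anchorResults = {
  startsWithHello: /^Hello/.test("Hello there"),   // true
  helloNotAtStart: /^Hello/.test("Oh, Hello"),     // false
  endsWithWorld: /world$/.test("hello world"),     // true
  worldNotAtEnd: /world$/.test("world peace"),     // false
  exactMatch: /^Hello world$/.test("Hello world"), // true
  extraText: /^Hello world$/.test("Hello world!"), // false
  contains: /goodbye/.test("say goodbye now"),     // true, no anchors needed
};

console.log(anchorResults);
```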

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

Quantifiers are * + ? and {} and their usage is explained below.

  • DEF* matches a string that has DE followed by zero or more F characters
  • DEF+ matches a string that has DE followed by one or more F characters
  • DEF? matches a string that has DE followed by zero or one F character
  • DEF{2} matches a string that has DE followed by exactly 2 F characters
  • DEF{2,} matches a string that has DE followed by 2 or more F characters
  • DEF{2,5} matches a string that has DE followed by 2 to 5 F characters
  • D(EF)* matches a string that has a D followed by zero or more copies of the sequence EF
  • D(EF){2,5} matches a string that has a D followed by 2 to 5 copies of the sequence EF
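
To see the difference between the quantifiers, the sketch below filters a made-up list of inputs through each pattern. The patterns are wrapped in ^ and $ here (an addition for this illustration) so that the counts are exact; without anchors, a pattern such as DEF{2} also matches inside a longer run of F characters.

```javascript
// Made-up inputs with 0, 1, 2, 3 and 6 trailing F characters.
const inputs = ["DE", "DEF", "DEFF", "DEFFF", "DEFFFFFF"];

// Returns the inputs that the given regex matches.
const matching = (regex) => inputs.filter((s) => regex.test(s));

const zeroOrMore = matching(/^DEF*$/);    // all five inputs
const oneOrMore = matching(/^DEF+$/);     // everything except "DE"
const zeroOrOne = matching(/^DEF?$/);     // ["DE", "DEF"]
const exactlyTwo = matching(/^DEF{2}$/);  // ["DEFF"]
const twoOrMore = matching(/^DEF{2,}$/);  // ["DEFF", "DEFFF", "DEFFFFFF"]
const twoToFive = matching(/^DEF{2,5}$/); // ["DEFF", "DEFFF"]

// Quantifiers applied to a group repeat the whole sequence EF.
const groupZeroOrMore = /^D(EF)*$/.test("DEFEF");  // true
const groupTooFew = /^D(EF){2,5}$/.test("DEF");    // false, only one EF

// Without anchors, the pattern only has to match somewhere in the input.
const unanchored = /DEF{2}/.test("DEFFF");         // true, "DEFF" is found inside

console.log({ zeroOrMore, oneOrMore, zeroOrOne, exactlyTwo, twoOrMore, twoToFive });
```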

Grouping Constructs

Grouping constructs let you extract information from strings or data.

Grouping constructs are ( ) and their usage is explained below.

  • (hello) capturing group: captures the text that matches the expression in the parentheses, here the value "hello"
  • (?:hello) non-capturing group: groups the contained expression together, but does not capture the text it matches
  • (?=hello) positive lookahead: asserts that the text at this position matches hello, without capturing it or including it in the match
  • (?<greeting>hello) named capture group: captures the text that matches hello under the name "greeting"
  • \k<greeting> named backreference: matches the same text that the group named "greeting" captured
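
The sketch below illustrates how these constructs behave in JavaScript. The input strings and group names (year, month, quote) are assumptions made up for this example.

```javascript
const text = "hello world";

// Capturing groups: each group's text is stored at index 1, 2, ...
const captured = text.match(/(hello) (world)/);
// captured[1] is "hello", captured[2] is "world"

// Non-capturing group: (?:hello) still groups, but stores nothing,
// so "world" becomes group 1.
const nonCaptured = text.match(/(?:hello) (world)/);
// nonCaptured[1] is "world"

// Positive lookahead: " world" must follow, but is not part of the match.
const lookahead = text.match(/hello(?= world)/);
// lookahead[0] is "hello"

// Named capture groups are available on the `groups` property.
const date = /(?<year>\d{4})-(?<month>\d{2})/.exec("2021-06");
// date.groups.year is "2021", date.groups.month is "06"

// Named backreference: the closing quote must equal the opening quote.
const quoted = /(?<quote>['"]).*?\k<quote>/;
const sameQuotes = quoted.test('"hi"');   // true
const mixedQuotes = quoted.test('"hi\''); // false

console.log(captured[1], nonCaptured[1], lookahead[0], date.groups.year);
```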

Bracket Expressions

A bracket expression is a regular expression that matches a single character, or collating element.

Bracket expressions are [ ] and their usage is explained below.

  • [abc] matches a string that has either an a or a b or a c - it is the same as a|b|c
  • [a-c] same as above
  • [a-fA-F0-9] matches a single hexadecimal digit, case insensitively
  • [0-9]% matches a string that has a character from 0 to 9 before a % sign
  • [^a-zA-Z] matches a single character that is not a letter from a to z or from A to Z. In this case, the ^ is used as a negation of the expression.
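
A short sketch of bracket expressions in use follows. The inputs are made up, and the last lines show a pitfall worth knowing: the uppercase hexadecimal range has to be written A-F, because a range such as A-f spans every character code between "A" and "f".

```javascript
const hasAbc = /[abc]/.test("xbx");      // true, contains a "b"
const hasRange = /[a-c]/.test("xyz");    // false, no a, b or c

// A single hexadecimal digit, either case.
const hexDigit = /^[a-fA-F0-9]$/;
const upperF = hexDigit.test("F");       // true
const upperG = hexDigit.test("G");       // false

const percent = /[0-9]%/.test("50%");        // true
const noPercent = /[0-9]%/.test("fifty %");  // false

// Negation: at least one character that is not a letter.
const onlyLetters = /[^a-zA-Z]/.test("abc");   // false
const hasNonLetter = /[^a-zA-Z]/.test("abc1"); // true

// Pitfall: A-f also covers G to Z and some punctuation.
const tooWide = /^[a-fA-f0-9]$/.test("G");     // true

console.log({ hasAbc, hasRange, upperF, upperG, percent, hasNonLetter, tooWide });
```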

Character Classes

Character classes match a character from a specific set. There are a number of predefined character classes and you can also define your own sets.

Character classes are . \d \D \w \W \s \S \t \r \n \v \f [\b] \0 \cX \xhh \uhhhh \u{hhhh} or \u{hhhhh} and \p{UnicodeProperty} or \P{UnicodeProperty}, and their usage is explained below.

  • . Matches any single character except line terminators
  • \d Matches any digit, equivalent to [0-9]
  • \D Matches any character that is not a digit, equivalent to [^0-9]
  • \w Matches any alphanumeric character, including the underscore. Equivalent to [A-Za-z0-9_]
  • \W Matches any character that is not a word character. Equivalent to [^A-Za-z0-9_]
  • \s Matches a single white space character, including space, tab, form feed, line feed, and other Unicode spaces. Equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
  • \S Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
  • \t Matches a horizontal tab
  • \r Matches a carriage return
  • \n Matches a linefeed
  • \v Matches a vertical tab
  • \f Matches a form-feed
  • [\b] Matches a backspace
  • \0 Matches a NUL character. Do not follow this with another digit
  • \cX Matches a control character using caret notation, where "X" is a letter from A-Z
  • \xhh Matches the character with the code hh (two hexadecimal digits)
  • \uhhhh Matches a UTF-16 code unit with the value hhhh (four hexadecimal digits)
  • \u{hhhh} or \u{hhhhh} (only when the u flag is set) Matches the character with the Unicode value U+hhhh or U+hhhhh (hexadecimal digits)
  • \p{UnicodeProperty} or \P{UnicodeProperty} Matches a character based on its Unicode character properties
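
The following sketch exercises a handful of the character classes listed above. The input strings are assumptions for illustration, and Lu (uppercase letter) is just one example of a Unicode property name.

```javascript
// \d: collect every digit (the g flag returns all matches).
const digits = "a1 b2".match(/\d/g);               // ["1", "2"]

// \W: replace anything that is not a letter, digit or underscore.
const nonWord = "a_b c!".replace(/\W/g, "-");      // "a_b-c-"

// \s: split on any single whitespace character (tab and space here).
const fields = "one\ttwo three".split(/\s/);       // ["one", "two", "three"]

// . does not match a line terminator.
const dotOnNewline = /./.test("\n");               // false

// [\b] matches a backspace character.
const backspace = /[\b]/.test("\b");               // true

// \xhh: 41 is the hexadecimal code of "A".
const hexEscape = /\x41/.test("A");                // true

// \u{...} and \p{...} both require the u flag.
const astral = /\u{1F600}/u.test("\u{1F600}");     // true
const upperLetter = /\p{Lu}/u.test("\u00C9");      // true, U+00C9 is an uppercase letter
const notUpper = /\P{Lu}/u.test("\u00C9");         // false

console.log({ digits, nonWord, fields, dotOnNewline, hexEscape, astral, upperLetter });
```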

The OR Operator

In a regex, the OR operator (alternation) matches either the expression to its left or the expression to its right. A bracket expression offers a similar choice, but only between single characters.

OR Operators are | or [] and their usage is explained below.

  • x(y|z) matches a string that has x followed by y or z (and captures y or z)
  • x[yz] same as above, but it does not capture y or z
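
The difference between the two forms shows up in the match result, as this small sketch suggests. The cat/dog pattern is a made-up example showing that | can also choose between longer alternatives, which a bracket expression cannot.

```javascript
// Alternation inside a group captures whichever branch matched.
const withGroup = "xz".match(/x(y|z)/);
// withGroup[0] is "xz", withGroup[1] is "z"

// A bracket expression matches the same input but captures nothing.
const withClass = "xz".match(/x[yz]/);
// withClass[0] is "xz", and there is no index 1

// | can choose between multi-character alternatives.
const pet = /^(cat|dog)$/;
const isCat = pet.test("cat");   // true
const isCow = pet.test("cow");   // false

console.log(withGroup[1], withClass.length, isCat, isCow);
```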

Flags

Flags are optional parameters that we can add to a plain expression to make it search in a different way.

Flags are i g m s u and y and their usage is explained below.

  • i With this flag, the search is case-insensitive: no difference between A and a
  • g With this flag, the search looks for all matches; without it, only the first match is returned
  • m Multiline mode: ^ and $ match at the start and end of each line, not only at the start and end of the whole string
  • s Enables "dotall" mode, which allows a dot . to match the newline character \n
  • u Enables full Unicode support. The flag enables correct processing of surrogate pairs
  • y "Sticky" mode: searches only at the exact position in the text given by lastIndex
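
Each flag can be seen in isolation with a sketch along these lines. All inputs are made up for illustration.

```javascript
// i: case-insensitive.
const ignoreCase = /hello/i.test("HELLO");            // true

// g: all matches instead of only the first.
const firstOnly = "a1b2c3".match(/\d/);               // firstOnly[0] is "1"
const allDigits = "a1b2c3".match(/\d/g);              // ["1", "2", "3"]

// m: ^ and $ also match at line boundaries.
const singleLine = /^two$/.test("one\ntwo");          // false
const multiLine = /^two$/m.test("one\ntwo");          // true

// s: the dot also matches a newline.
const dotDefault = /a.b/.test("a\nb");                // false
const dotAll = /a.b/s.test("a\nb");                   // true

// u: an emoji is two UTF-16 code units but a single code point.
const withoutU = /^.$/.test("\u{1F600}");             // false
const withU = /^.$/u.test("\u{1F600}");               // true

// y: the match must start exactly at lastIndex.
const sticky = /foo/y;
sticky.lastIndex = 0;
const stickyAtZero = sticky.test("barfoo");           // false, "bar" is at index 0
sticky.lastIndex = 3;
const stickyAtThree = sticky.test("barfoo");          // true, "foo" starts at index 3

console.log({ ignoreCase, allDigits, multiLine, dotAll, withU, stickyAtThree });
```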

Character Escapes

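Most regular expression operators are unescaped single characters, such as . * + ? ( ) [ ] { } ^ $ and |. The escape character, \ (a single backslash), changes how the regular expression parser reads the character that follows it. Placed before an operator, it makes the regex match that character literally, so \. matches a period. Placed before certain ordinary letters, it gives them a special meaning, so \d matches a digit. In the example regex, \/ escapes the forward slash so that it is not read as the end of the regex literal.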
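
A brief sketch of escaping in practice is shown below. The escapeRegExp helper is a hypothetical function written for this example, not a built-in; it backslash-escapes every operator character so that arbitrary text can be matched literally.

```javascript
// Unescaped, the dot matches any character; escaped, only a period.
const anyChar = /^a.c$/.test("abc");          // true
const literalDot = /^a\.c$/.test("abc");      // false
const realDot = /^a\.c$/.test("a.c");         // true

// The forward slash is escaped so it does not end the regex literal.
const closingTag = /<\/div>/.test("</div>");  // true

// In a string passed to the RegExp constructor, the backslash itself
// has to be doubled.
const fromString = new RegExp("^\\d+$").test("123"); // true

// Hypothetical helper (an assumption for this sketch): escape every
// regex operator in `s` so the result matches `s` literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const escaped = escapeRegExp("1+1=2");                          // "1\\+1=2"
const literalSum = new RegExp("^" + escaped + "$").test("1+1=2"); // true
const unescapedSum = /^1+1=2$/.test("1+1=2");                   // false, + is a quantifier

console.log({ anyChar, literalDot, realDot, closingTag, fromString, escaped });
```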

Author

  • Name: Brendan Moore
  • Email: brendandjmoore@gmail.com
  • GitHub: @bdjm94
