@shelleymcq
Last active August 24, 2021 19:35
A Regular Expression Tutorial

Regex Tutorial

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -Jamie Zawinski

Summary

A regex, or regular expression, defines a sequence or pattern of characters that can be used to search a text.

There are typically two ways to dissect and understand a new function. You can reverse engineer a complete function and break down each part, or you can build it up from scratch, analyzing each step as you go. I took the latter approach, working through the function in its simplest steps, to inform this tutorial.

I have created a regular expression that checks whether a string is a valid postal code for the Netherlands. I chose the Dutch code because it follows an interesting pattern: four digits followed by two uppercase letters, excluding a few letter combinations of negative historical significance.
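The full expression appears in this tutorial only as images, so the sketch below is an assumed reconstruction built from the components described in the sections that follow, not necessarily the author's exact pattern. The names DUTCH_POSTAL_CODE and is_dutch_postal_code are hypothetical, and requiring a space or hyphen between the digits and letters is an assumption taken from the OR operator section.

```python
import re

# Assumed reconstruction, not the author's exact pattern:
# four digits, a space or hyphen, then two uppercase letters
# that are not SA, SD, or SS.
DUTCH_POSTAL_CODE = re.compile(r"\b\d{4}( |-)(?!SA|SD|SS)[A-Z]{2}\b")


def is_dutch_postal_code(text: str) -> bool:
    """Return True when the whole string fits the assumed pattern."""
    return DUTCH_POSTAL_CODE.fullmatch(text) is not None


print(is_dutch_postal_code("1234 AB"))  # True
print(is_dutch_postal_code("1234-AB"))  # True
print(is_dutch_postal_code("1234 SS"))  # False: SS is excluded
print(is_dutch_postal_code("123 AB"))   # False: only three digits
print(is_dutch_postal_code("1234 ab"))  # False: lowercase letters
```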

Regex Components

Quantifiers

Quantifiers, as the name implies, represent a quantity. More specifically, a quantifier sets how many times the character, class, or group immediately before it must appear for the pattern to match.

Character    Pattern Match
*            appears 0 or more times
+            appears 1 or more times
?            appears 0 or 1 time = optional character
{n}          appears exactly n times
{min,max}    appears at least min and at most max times, indicates a range (written without a space after the comma)

The numbers 4 and 2 placed in curly braces below specify how many times the token immediately before each brace must appear. In the example, the pattern to match is exactly 4 digits, \d{4}, followed by exactly 2 uppercase letters, [A-Z]{2}.

[Image: regex-quantifier-charclass]
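As a rough illustration of the quantifier table, the sketch below tests a few short patterns against whole strings. The helper name matches and the sample patterns are made up for this example.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# Each quantifier applies to the token immediately before it.
print(matches(r"ab*", "a"))              # True: zero b's is allowed
print(matches(r"ab+", "a"))              # False: at least one b is required
print(matches(r"colou?r", "color"))      # True: the u is optional
print(matches(r"\d{4}", "1234"))         # True: exactly four digits
print(matches(r"\d{4}", "123"))          # False: one digit short
print(matches(r"\d{2,4}", "123"))        # True: between two and four digits
print(matches(r"\d{4}[A-Z]{2}", "1234AB"))  # True: four digits, two letters
```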

Grouping Constructs

Grouping constructs mark off part of a pattern so that it can be treated as a single unit. The part is enclosed in parentheses and is known as a subexpression.

The parentheses surrounding the letter combinations below confine the pattern requirement to just those letters. In this case the group is a negative lookahead, (?!SA|SD|SS), which rejects a match when any of the listed combinations comes next.

[Image: regex-lookahead]
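Beyond lookaheads, plain parentheses also capture the text their subexpression matched and let a quantifier apply to the whole group. The sketch below shows both uses. The pattern PARTS and the helper split_code are hypothetical and are not part of the author's regex.

```python
import re

# Hypothetical pattern: group 1 captures the digits, group 2 the letters.
PARTS = re.compile(r"(\d{4}) ([A-Z]{2})")


def split_code(text: str):
    """Return (digits, letters) when the text matches, otherwise None."""
    m = PARTS.fullmatch(text)
    return (m.group(1), m.group(2)) if m else None


print(split_code("1234 AB"))  # ('1234', 'AB')
print(split_code("1234AB"))   # None: the space is required here

# A quantifier after a group repeats the whole subexpression.
print(re.fullmatch(r"(ab){2}", "abab") is not None)  # True
# Without the group, {2} applies only to the b.
print(re.fullmatch(r"ab{2}", "abab") is not None)    # False
```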

Bracket Expressions

Bracket Expressions indicate which characters to match. This is also known as a positive character group. A string that contains any character inside the brackets will return a positive match.

For example, the strings 'a', 'b', 'c', 'ac', 'cat', 'big', 'bridges', and '00c00' will all match the pattern [abc] because they contain at least one of the characters 'a', 'b', and/or 'c'. The string 'dog' will not match because it does not contain any of the three characters. Note that regular expressions are also case-sensitive by default, so the string 'ABC' will not match this bracket expression. Ranges and individual characters can be combined inside the brackets to include any desired character. For example, [a-zA-C4-6+] will match any string that contains a lowercase letter OR an uppercase A, B, or C, OR a 4, 5, or 6, OR a +.

In an early version of the Dutch Postal Code Regex, brackets were used to match all numbers and letters. [0-9] will match any string containing a digit from 0 through 9 and [A-Z] will match any string containing an uppercase letter A through Z.

[Image: regex-brackets]
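The bracket examples above can be checked with a short sketch. The helper name contains is made up for this example, and re.search is used because a bracket expression only needs to find one matching character somewhere in the string.

```python
import re


def contains(pattern: str, text: str) -> bool:
    """Return True when the pattern matches anywhere in the string."""
    return re.search(pattern, text) is not None


words = ["a", "b", "c", "ac", "cat", "big", "bridges", "00c00", "dog", "ABC"]
matched = [w for w in words if contains(r"[abc]", w)]
print(matched)  # every word except 'dog' and 'ABC'

# Ranges and single characters combined in one bracket expression.
print(contains(r"[a-zA-C4-6+]", "5"))   # True: 5 is in the range 4-6
print(contains(r"[a-zA-C4-6+]", "+"))   # True: + is listed literally
print(contains(r"[a-zA-C4-6+]", "D7"))  # False: neither character is listed
```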

Character Classes

Bracket expressions are members of a broader category of regex components called Character Classes. A character class matches any one character from a defined set. It can be written with brackets [ ], with a backslash shorthand such as \d (any digit), or with the wildcard . (any character except a line break).
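The following sketch compares a bracket expression with its shorthand on an ASCII sample string. In Python, \d on text strings also accepts digits from other scripts unless the re.ASCII flag is set, so the two forms agree here only because the sample contains ASCII digits. The variable name sample is arbitrary.

```python
import re

sample = "1234 AB"

# A bracket range and its shorthand pick out the same characters here.
print(re.findall(r"[0-9]", sample))   # ['1', '2', '3', '4']
print(re.findall(r"\d", sample))      # ['1', '2', '3', '4']

# A caret right after the opening bracket negates the class.
print(re.findall(r"[^0-9]", sample))  # [' ', 'A', 'B']

# The dot matches every character in the sample, including the space.
print(len(re.findall(r".", sample)))  # 7
```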

Assertions

A lookaround is an example of an assertion: it checks a condition at a position in the string without consuming any characters. A positive lookaround succeeds when its pattern is found, and a negative lookaround succeeds when its pattern is not found. A lookahead tests the text that follows the current position and a lookbehind tests the text that precedes it.

Regex        Lookaround
x(?=SA)      positive lookahead: matches x only when it is followed by SA
x(?!SA)      negative lookahead: matches x only when it is not followed by SA
(?<=SA)x     positive lookbehind: matches x only when it is preceded by SA
(?<!SA)x     negative lookbehind: matches x only when it is not preceded by SA

In the Dutch Postal Code Regex, a negative lookahead is used to exclude the three disallowed 2-letter combinations from the otherwise inclusive [A-Z]. The regex (?!SA|SD|SS) rejects SA, SD, and SS.

[Image: regex-lookahead]
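A small sketch of the four lookaround forms, reporting where each one matches in a sample string. The helper names starts and letters_allowed are hypothetical, and the letters-only pattern is an isolated fragment rather than the author's full regex.

```python
import re


def starts(pattern: str, text: str) -> list:
    """Return the start index of every match of the pattern in the text."""
    return [m.start() for m in re.finditer(pattern, text)]


# Lookaheads inspect what comes after the x.
print(starts(r"x(?=SA)", "xSA xSD"))   # [0]: only the first x is followed by SA
print(starts(r"x(?!SA)", "xSA xSD"))   # [4]: only the second x is not

# Lookbehinds inspect what comes before the x.
print(starts(r"(?<=SA)x", "SAx SDx"))  # [2]
print(starts(r"(?<!SA)x", "SAx SDx"))  # [6]


def letters_allowed(letters: str) -> bool:
    """Accept two uppercase letters unless they are SA, SD, or SS."""
    return re.fullmatch(r"(?!SA|SD|SS)[A-Z]{2}", letters) is not None


print(letters_allowed("AB"))  # True
print(letters_allowed("SS"))  # False
print(letters_allowed("AS"))  # True: only the exact pairs are rejected
```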

The OR Operator

The OR operator allows the matching of any one of several alternatives without using a bracket expression. Unlike a bracket expression, each alternative can be longer than a single character. The OR is indicated with a pipe character, |. Using the example for bracket expressions above, the strings 'a', 'b', 'c', 'ac', 'cat', 'big', 'bridges', and '00c00' will all match the pattern (a|b|c).

The lookahead example above used the OR operator to exclude any of the three disallowed letter combinations, (SA|SD|SS). The OR operator is also used to include either a space OR a hyphen between the numbers and letters of the postal code.

[Image: regex-OR]
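The sketch below tries the alternation examples from this section. The helper names found and whole are made up, and the digits-separator-letters pattern is an assumed fragment used only to show the space-or-hyphen choice.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True when the pattern matches anywhere in the string."""
    return re.search(pattern, text) is not None


def whole(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# Single-character alternatives behave like the bracket expression [abc].
print(found(r"(a|b|c)", "bridges"))  # True
print(found(r"(a|b|c)", "dog"))      # False

# Alternatives can be longer than one character.
print(whole(r"(SA|SD|SS)", "SD"))    # True
print(whole(r"(SA|SD|SS)", "SB"))    # False

# Either a space or a hyphen between the digits and the letters.
print(whole(r"\d{4}( |-)[A-Z]{2}", "1234 AB"))  # True
print(whole(r"\d{4}( |-)[A-Z]{2}", "1234-AB"))  # True
print(whole(r"\d{4}( |-)[A-Z]{2}", "1234_AB"))  # False
```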

Character Escapes

Character escapes are used to indicate a literal character versus a key regex character. The combination of an asterisk preceded by a backslash will be interpreted as the actual * character and not as the quantifier.

A boundary escape, \b, is used in the code below to ensure that the 4-digit number starts at a word boundary rather than in the middle of a longer run of digits or letters. If it were not used, the last four digits of a longer number, such as the 2345 in 12345, could still be matched.

[Image: regex-boundary-escape]
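To make both kinds of escape concrete, the sketch below first matches a literal asterisk and then compares a search with and without \b. The helper name first_match and the simplified digits-space-letters pattern are assumptions for this example, not the author's full regex.

```python
import re


def first_match(pattern: str, text: str):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# A backslash turns the asterisk into a literal character.
print(re.findall(r"\*", "2*3*4"))  # ['*', '*']

# Without \b the search can start inside a longer run of digits.
print(first_match(r"\d{4} [A-Z]{2}", "12345 AB"))    # 2345 AB

# With \b the four digits must begin at a word boundary.
print(first_match(r"\b\d{4} [A-Z]{2}", "12345 AB"))  # None
print(first_match(r"\b\d{4} [A-Z]{2}", "1234 AB"))   # 1234 AB
```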

Sources

Author

Shelley McHardy is a student in the Georgia Tech Coding Bootcamp looking forward to her Full Stack Web Developer Certificate in October.

https://github.com/shelleymcq
