"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -Jamie Zawinski
A regex, or regular expression, defines a sequence or pattern of characters that can be used to search a text.
There are typically two ways to approach dissecting and understanding a new function. You can reverse engineer a complete function and break down each part, or you can build it up from scratch, analyzing each step as you go. I took the latter approach in order to see the function in its simplest steps and to inform this tutorial.
I have created a regular expression that checks whether a string is a valid postal code for the Netherlands. I chose the Dutch code because it includes an interesting pattern: four digits followed by two uppercase letters, excluding a few combinations of negative historical significance.
- Quantifiers
- Grouping Constructs
- Bracket Expressions
- Character Classes
- Assertions
- The OR Operator
- Character Escapes
- Sources
- Author
Quantifiers, as the name implies, represent a quantity. More specifically, a quantifier specifies how many times the character, class, or group just before it must appear.
Quantifier | Pattern Match |
---|---|
* | appears 0 or more times |
+ | appears 1 or more times |
? | appears 0 or 1 time (optional) |
{n} | appears exactly n times |
{min,max} | appears at least min and at most max times (no space after the comma) |
In the postal code regex, the numbers 4 and 2 in curly braces specify how many of the preceding character type are required. The pattern to match is exactly 4 digits, \d{4}, followed by exactly 2 uppercase letters, [A-Z]{2}.
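As a rough illustration of the counted quantifiers, here is a minimal Python sketch. The helper name `matches_core` is hypothetical, and the pattern covers only the digits-then-letters core described above, not the full postal code rule.

```python
import re

# \d{4} = exactly four digits, [A-Z]{2} = exactly two uppercase letters.
CORE = re.compile(r"\d{4}[A-Z]{2}")


def matches_core(text: str) -> bool:
    """Return True only if the whole string is four digits then two capitals."""
    return CORE.fullmatch(text) is not None


for candidate in ["1234AB", "123AB", "1234ABC", "1234ab"]:
    print(candidate, matches_core(candidate))

# The other quantifiers from the table, each tested against a whole string.
print(re.fullmatch(r"ab*", "a") is not None)          # * allows zero b's
print(re.fullmatch(r"ab+", "a") is not None)          # + needs at least one b
print(re.fullmatch(r"colou?r", "color") is not None)  # ? makes the u optional
```

`fullmatch` is used so that the count applies to the entire string; `search` would also accept a longer string that merely contains four digits and two letters.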
Grouping constructs allow the matching of a specific section of a string. This section is indicated with parentheses and known as a subexpression.
In the postal code regex, the parentheses in (?!SA|SD|SS) group the three letter combinations so that one requirement applies to all of them as a unit. In this case the requirement is an exclusion: the match fails if any of the three combinations appears at that position.
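Parentheses can also capture the part of the string that a subexpression matched. The sketch below is an illustration only: the capturing groups and the helper name `split_code` are assumptions for the example and are not taken from the tutorial's regex.

```python
import re

# Two capturing groups: (\d{4}) for the digits, ([A-Z]{2}) for the letters.
GROUPED = re.compile(r"(\d{4})([A-Z]{2})")


def split_code(text: str):
    """Return (digits, letters) if text has the expected shape, else None."""
    match = GROUPED.fullmatch(text)
    return match.groups() if match else None


print(split_code("1012AB"))
print(split_code("10AB"))
```

Each pair of parentheses becomes one entry in the tuple returned by `groups()`, in the order the opening parentheses appear.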
Bracket Expressions indicate which characters to match. This is also known as a positive character group. A string that contains any character inside the brackets will return a positive match.
For example, the strings 'a', 'b', 'c', 'ac', 'cat', 'big', 'bridges', and '00c00' will all match the pattern [abc] because they contain at least one of the characters 'a', 'b', and/or 'c'. The string 'dog' will not match because it does not contain any of the three characters. Note that regular expressions are case-sensitive by default, so the string 'ABC' will not match this bracket expression.

Ranges and single characters can be combined inside the brackets to include any desired character. For example, [a-zA-C4-6+] will match any string that contains a lowercase letter OR an uppercase A, B, or C, OR a 4, 5, or 6, OR a +.
In an early version of the Dutch Postal Code Regex, brackets were used to match all numbers and letters. [0-9] will match any string containing a digit from 0 through 9 and [A-Z] will match any string containing an uppercase letter A through Z.
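The bracket examples above can be checked with a short Python sketch. The helper name `contains` is hypothetical; it uses `re.search`, which succeeds if the pattern matches anywhere in the string.

```python
import re


def contains(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# [abc] needs at least one lowercase a, b, or c somewhere in the string.
for word in ["a", "cat", "bridges", "00c00", "dog", "ABC"]:
    print(word, contains(r"[abc]", word))

# Ranges and single characters combined in one bracket expression.
for word in ["q", "B", "5", "+", "D", "7"]:
    print(word, contains(r"[a-zA-C4-6+]", word))
```

Inside brackets the + is an ordinary character, so it does not need a backslash.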
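Bracket expressions are members of a broader category of regex components called character classes. A character class matches any one character from a set. It can be written with brackets, [ ], with a backslash shorthand such as \d (digit), \w (word character), or \s (whitespace), or with the dot, ., which matches any character except a line break. The * is not a character class; it is a quantifier that repeats whatever precedes it.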
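To compare the shorthand classes side by side, here is a small Python sketch. The helper name `classify` is hypothetical. One Python-specific caveat: for text strings, \d also matches decimal digits from other scripts unless the `re.ASCII` flag is set, so [0-9] is the stricter choice there.

```python
import re


def classify(ch: str) -> dict:
    """Report which single-character classes match the one-character string ch."""
    return {
        "digit": re.fullmatch(r"\d", ch) is not None,
        "word": re.fullmatch(r"\w", ch) is not None,
        "space": re.fullmatch(r"\s", ch) is not None,
        "dot": re.fullmatch(r".", ch) is not None,
    }


for ch in ["7", "x", "_", " ", "\n"]:
    print(repr(ch), classify(ch))
```

The line break is the one sample the dot rejects, while \s accepts it as whitespace.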
A lookaround is an example of an assertion. An assertion checks a condition at a position in the string without consuming any characters. A lookahead checks what follows the current position and a lookbehind checks what precedes it. A positive lookaround succeeds when the pattern is present, and a negative lookaround succeeds when it is absent.
Regex | Lookaround |
---|---|
x(?=SA) | positive lookahead: matches x only if it is followed by SA |
x(?!SA) | negative lookahead: matches x only if it is not followed by SA |
(?<=SA)x | positive lookbehind: matches x only if it is preceded by SA |
(?<!SA)x | negative lookbehind: matches x only if it is not preceded by SA |
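The four lookarounds can be tried on one sample string. This is a sketch only: the helper name `positions` and the sample text are made up for the example. Note that a lookbehind is written before the character it guards, (?<=SA)x, because it inspects the text that comes before the current position.

```python
import re


def positions(pattern: str, text: str) -> list:
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


# Index:  0123456789...
sample = "xSA xSB SAx SBx"

print(positions(r"x(?=SA)", sample))   # x followed by SA
print(positions(r"x(?!SA)", sample))   # x not followed by SA
print(positions(r"(?<=SA)x", sample))  # x preceded by SA
print(positions(r"(?<!SA)x", sample))  # x not preceded by SA
```

In every case only the x is part of the match; the SA is inspected but not consumed.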
In the Dutch Postal Code Regex, a negative lookahead is used to exclude the three disallowed 2-letter combinations from the inclusive [A-Z]. The regex (?!SA|SD|SS) excludes SA, SD, or SS.
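A minimal Python sketch of this exclusion, applied to the two-letter suffix on its own. The helper name `allowed_letters` is hypothetical.

```python
import re

# The lookahead runs first and rejects SA, SD, and SS;
# [A-Z]{2} then consumes any remaining pair of uppercase letters.
LETTERS = re.compile(r"(?!SA|SD|SS)[A-Z]{2}")


def allowed_letters(pair: str) -> bool:
    """Return True if pair is two uppercase letters other than SA, SD, or SS."""
    return LETTERS.fullmatch(pair) is not None


for pair in ["AB", "SB", "AS", "SA", "SD", "SS"]:
    print(pair, allowed_letters(pair))
```

Only the exact combinations are excluded: SB and AS still pass, because the lookahead tests both letters together rather than the letter S alone.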
The OR operator matches any one of several alternatives, and unlike a bracket expression, each alternative can be a whole sequence rather than a single character. The OR is indicated with a pipe character, |. Using the example for bracket expressions above, the strings 'a', 'b', 'c', 'ac', 'cat', 'big', 'bridges', and '00c00' will all match the pattern (a|b|c).
The lookahead example above used the OR operator to exclude any of the three disallowed letter combinations, (SA|SD|SS). The OR operator is also used to include either a space OR a hyphen between the numbers and letters of the postal code.
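The separator choice can be sketched in Python as follows. Whether the separator is required or optional is not stated above, so this sketch assumes it is required; the helper name `has_separator` is hypothetical.

```python
import re

# ( |-) accepts exactly one separator: a space OR a hyphen.
WITH_SEPARATOR = re.compile(r"\d{4}( |-)[A-Z]{2}")


def has_separator(text: str) -> bool:
    """Return True for four digits, one space or hyphen, then two capitals."""
    return WITH_SEPARATOR.fullmatch(text) is not None


for candidate in ["1234 AB", "1234-AB", "1234_AB", "1234AB"]:
    print(candidate, has_separator(candidate))

# Alternatives can be longer than one character, which brackets cannot express.
print(re.fullmatch(r"SA|SD|SS", "SD") is not None)
print(re.fullmatch(r"[SAD]", "SD") is not None)
```

Adding a ? after the group, ( |-)?, would make the separator optional.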
Character escapes indicate that a character should be read literally rather than as a regex metacharacter. An asterisk preceded by a backslash will be interpreted as the actual * character and not as the "0 or more" quantifier.
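The difference between an escaped and an unescaped asterisk can be seen in a short Python sketch. The helper name `whole_match` is hypothetical; `re.escape` is the standard library function that adds the backslashes for you.

```python
import re


def whole_match(pattern: str, text: str) -> bool:
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Unescaped: * repeats the preceding a.
print(whole_match(r"a*", "aaa"))
# Escaped: \* is a literal asterisk, so the pattern means "a followed by *".
print(whole_match(r"a\*", "a*"))
print(whole_match(r"a\*", "aaa"))

# re.escape builds the escaped form from arbitrary literal text.
print(re.escape("*"))
```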
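A boundary escape, \b, matches the position between a word character and a non-word character, or the start or end of the string. In the postal code regex it ensures that the four digits begin at a word boundary. If it were not used, the last four digits of a longer run of digits would also be accepted.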
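Putting the pieces together, here is a hedged Python sketch of a complete pattern. It is assembled from the components described in this tutorial and is not necessarily the author's exact regex: the optional separator, the trailing \b, and the helper name `find_codes` are all assumptions.

```python
import re

# \b          word boundary, so the digits cannot be the tail of a longer number
# \d{4}       exactly four digits
# ( |-)?      optional space or hyphen (assumed optional)
# (?!SA|SD|SS) negative lookahead excluding the three disallowed pairs
# [A-Z]{2}    exactly two uppercase letters
# \b          word boundary, so a third letter cannot follow (assumption)
POSTAL = re.compile(r"\b\d{4}( |-)?(?!SA|SD|SS)[A-Z]{2}\b")


def find_codes(text: str) -> list:
    """Return every substring of text that matches the sketch pattern."""
    return [m.group(0) for m in POSTAL.finditer(text)]


print(find_codes("Send to 1012 AB or 9999-SS, not 51234AB."))

# Without the boundary, the tail of a five-digit number slips through.
loose = re.search(r"\d{4}[A-Z]{2}", "51234AB")
print(loose.group(0))
```

To validate a single input rather than scan free text, `POSTAL.fullmatch(value)` would be the stricter check.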
- https://eloquentjavascript.net/09_regexp.html
- https://www.regular-expressions.info/tutorial.html
- https://www.youtube.com/watch?v=7DG3kCDx53c
- https://coding-boot-camp.github.io/full-stack/computer-science/regex-tutorial
Shelley McHardy is a student in the Georgia Tech Coding Bootcamp looking forward to her Full Stack Web Developer Certificate in October.