brobro10000/PhoneNumberRegex.md

## PhoneNumberRegex.md

      
    MDN's Phone Number Regex

Regular expressions (regex) are patterns, written in a compact special-purpose syntax, that are used to match text in strings. They are useful for constraining user input to the expected characters (letters and numbers), format (such as an email address, which includes an @ and a .), and length (such as a zip code, a social security number, or a region-specific phone number). Validating input against an expected pattern helps reduce the risk of malformed input reaching a database query, of corrupted records, and of mismatched data that may cause a program to stop working as expected, although it is not a complete defense against SQL injection on its own. This gist looks at a single regex in detail.
Summary

The regex we'll be looking at checks whether its input contains a phone number. It comes from the MDN Web Docs page describing regular expressions. At the bottom of that page, a code snippet includes the phone number regex along with the following brief explanation:

The regular expression looks for:

[T]hree numeric characters \d{3} OR | a left parenthesis \(, followed by three digits \d{3}, followed by a close parenthesis \), in a non-capturing group (?:)
[F]ollowed by one dash, forward slash, or decimal point in a capturing group ()
[F]ollowed by three digits \d{3}
[F]ollowed by the match remembered in the (first) captured group \1
[F]ollowed by four digits \d{4}
...
var re = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;
...


The explanation from MDN gives a quick overview of how the components of the regex work, but it does not go into the specifics of each component or give examples. That is what this overview hopes to clarify.
The phone number regex, /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/, accepts several phone number formats. Focusing on the North American Numbering Plan (NANP) format for the expected phone number length and lineation (the separator characters between the digit groups), and excluding the country code, the accepted formats are as follows, where each pound/number/hash sign (#) represents a single digit:


###-###-####
(###)-###-####
###/###/####
###.###.####


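These four formats represent the input a user can enter for a phone number to be accepted. The parenthesized area code can also be combined with the slash or the period, as in (###)/###/#### and (###).###.####.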
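As a quick check of the formats listed above, the following minimal sketch runs the regex against a handful of strings. The names phoneRe, samples, and results are illustrative only and are not part of the MDN snippet.

```javascript
// The regex exactly as published in the MDN example.
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;

// Hypothetical sample inputs: the first four follow the listed formats,
// the last two do not.
const samples = [
  "123-456-7890",
  "(123)-456-7890",
  "123/456/7890",
  "123.456.7890",
  "123 456 7890", // a space is not one of the allowed separators
  "1234567890",   // no separators at all
];

// Pair each sample with the result of testing it against the regex.
const results = samples.map((s) => [s, phoneRe.test(s)]);

for (const [s, ok] of results) {
  console.log(`${s} -> ${ok}`);
}
```

The first four samples are accepted and the last two are rejected, because the pattern requires one of the three separator characters between the digit groups.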
Table of Contents


Quantifiers
Piping
Character Classes
Grouping and Capturing
Bracket Expressions
Boundaries
Back-references
Author

Regex Components

Quantifiers

The length of each segment of the phone number in our regex is determined solely by the {} quantifier, also called the fixed quantifier.
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

A fixed quantifier denotes exactly how many consecutive instances of the preceding token will be matched. In the phone number example, that token is a digit. Because of the NANP standard for phone numbers, we expect 3 digits \d{3}, followed by a dash -, a forward slash /, or a period ., then another 3 digits \d{3}, followed by the same lineation as previously used, and finally 4 digits \d{4}. Some examples are below:

\d{3} \d{3} \d{4}

(123)-456-7890 or 123-456-7890
(123).456.7890 or 123.456.7890
(123)/456/7890 or 123/456/7890


We can see that the same number of digits is used throughout; this is strictly enforced by the regex and follows the NANP standard. Wrapping the area code in parentheses does not affect the digit count.
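To see the fixed quantifiers at work, here is a small sketch that feeds the regex segments that are too short. The variable names are made up for illustration. Note that the regex has no anchors, so a longer string can still contain a match.

```javascript
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;

// Only 2 digits before the first separator: \d{3} cannot be satisfied.
const shortAreaCode = phoneRe.test("12-456-7890");

// Only 3 digits in the final segment: \d{4} cannot be satisfied.
const shortLastSegment = phoneRe.test("123-456-789");

// Exactly 3, 3, and 4 digits.
const exactLengths = phoneRe.test("123-456-7890");

// One extra trailing digit: the regex is not anchored, so it still
// finds a 3-3-4 match inside the longer string.
const insideLonger = "123-456-78901".match(phoneRe);

console.log(shortAreaCode, shortLastSegment, exactLengths);
console.log(insideLonger && insideLonger[0]);
```

The quantifiers fix the length of what is matched, not the length of the whole input, which is why the last example still finds a phone number inside a longer string.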
Piping

Piping, or the OR operator |, offers a choice between alternative patterns when matching a string.
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

The two alternatives are \d{3} | \(\d{3}\): the left-hand side matches exactly 3 consecutive digits \d{3}, OR | the right-hand side matches an open parenthesis \(, exactly 3 consecutive digits \d{3}, and a closing parenthesis \). This accounts for the two NANP formats, one of which surrounds the area code with parentheses. Below is an example:

Left-hand alternative (no parentheses)

123-456-7890
123.456.7890
123/456/7890


Right-hand alternative (parentheses)

(123)-456-7890
(123).456.7890
(123)/456/7890


We can see that the left-hand examples contain no parentheses yet still conform to the phone number format. The right-hand examples all wrap the area code in parentheses, followed by the dash, period, or forward slash. The OR only selects how the area code is written; whichever alternative matches, the rest of the pattern (lineation, three digits, the same lineation, four digits) is still required, so both ###-###-#### and (###)-###-#### are matched in full.
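The alternation can be looked at in isolation. The sketch below pulls the area code group out into its own regex, areaCode, and adds ^ and $ anchors purely for this illustration; neither the name nor the anchors appear in the MDN regex.

```javascript
// Hypothetical stand-alone version of the non-capturing group,
// anchored so that the whole string must be an area code.
const areaCode = /^(?:\d{3}|\(\d{3}\))$/;

const bare = areaCode.test("123");        // left-hand alternative
const wrapped = areaCode.test("(123)");   // right-hand alternative
const openOnly = areaCode.test("(123");   // missing closing parenthesis
const closeOnly = areaCode.test("123)");  // missing opening parenthesis

// In the full, unanchored regex an unbalanced parenthesis is simply
// skipped over: the match starts at the first digit instead.
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;
const unbalanced = "(123-456-7890".match(phoneRe);

console.log(bare, wrapped, openOnly, closeOnly);
console.log(unbalanced && unbalanced[0]);
```

Each alternative is all-or-nothing: the right-hand side needs both parentheses, and the left-hand side needs none.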
Character Classes

Character classes do most of the work of matching a correct phone number in our regex. Two kinds of character class are used.
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

As we can see, character classes appear many times in our regex. The first and simplest is the digit character class \d, which specifies that only digits are matched. Digits can also be selected with the bracket expression [0-9]. In JavaScript the two are equivalent: \d matches only the ASCII digits 0 through 9. In some other regex engines, \d also recognizes every character in the Unicode 'Number, Decimal Digit' category, which covers several hundred digit characters from other scripts. Is that useful in our particular case? In all honesty, no, because the NANP standard uses the digits 0 through 9. An example of \d is below:

\d{3} \d{3} \d{4}

(123)-456-7890 or 123-456-7890
(123).456.7890 or 123.456.7890
(123)/456/7890 or 123/456/7890


As in the quantifiers example, the quantifier sets how many digits we need: exactly 3 digits for the area code, 3 digits for the middle segment, and 4 digits for the final segment.
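The reach of the \d class described above can be checked directly in JavaScript. This is a minimal sketch; the Arabic-Indic digit is just one arbitrary example of a non-ASCII decimal digit, and the variable names are illustrative.

```javascript
const asciiThree = "3";
const arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE

// \d in JavaScript covers the ASCII digits 0-9 only.
const dMatchesAscii = /\d/.test(asciiThree);
const dMatchesArabicIndic = /\d/.test(arabicIndicThree);

// The u (unicode) flag does not widen \d.
const dWithUnicodeFlag = /\d/u.test(arabicIndicThree);

// The Unicode 'Number, Decimal Digit' category has to be requested
// explicitly with a property escape, which requires the u flag.
const ndMatchesArabicIndic = /\p{Nd}/u.test(arabicIndicThree);

console.log(dMatchesAscii, dMatchesArabicIndic, dWithUnicodeFlag, ndMatchesArabicIndic);
```

For NANP numbers \d is enough; the property escape \p{Nd} would only matter if digits from other scripts had to be accepted.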
The second character class is the one between the brackets [], which contains -\/\.. The bracket acts as an OR over the characters inside it: as long as one of them is present (a dash -, a period ., or a forward slash /), that part of the regex matches. Our regex also encloses the bracket in parentheses, making it a capturing group. Whatever character is used for the first lineation, between the area code and the middle digits, must be used again between the middle digits and the last four digits, because the captured group is reused for the second lineation (denoted by \1 in the regex). Let's look at some examples:

[-\/\.]
Will Pass

(123)-456-7890 or 123-456-7890
(123).456.7890 or 123.456.7890
(123)/456/7890 or 123/456/7890


[-\/\.]
Will Fail

(123)-456/7890 or 123-456/7890
(123).456-7890 or 123.456-7890
(123)/456.7890 or 123/456.7890


In the failing group, the only difference is the mismatched lineations between the numbers, even though the character class allows each of those characters. Because the bracket is captured as a group and then back-referenced, we must continue with the same lineation we started with.
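To isolate what the bracketed class accepts on its own, the sketch below tests it against single characters. The regex separator and the candidates list are hypothetical, and the ^ and $ anchors are added only so that exactly one character is tested at a time.

```javascript
// The bracket from the phone regex, anchored to a single character.
const separator = /^[-\/\.]$/;

// Hypothetical candidates: three allowed characters, then a space,
// a backslash, and a two-character string.
const candidates = ["-", "/", ".", " ", "\\", "-/"];

// Keep only the candidates the class accepts.
const accepted = candidates.filter((c) => separator.test(c));

console.log(accepted);
```

Only the dash, the forward slash, and the period survive the filter. The backslash is rejected: it appears in the pattern only as an escape character, not as an allowed separator.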
Grouping and Capturing

Grouping is a major part of the phone number regex and saves us from writing repeated logic. Below you can see where the grouping occurs, although not every group is captured.
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

The initial grouping (?:\d{3}|\(\d{3}\)) has a pipe between 2 possible choices. The inner parentheses are not a grouping because each is preceded by the escape character \, which specifies a literal ( or ) rather than the regex grouping operation.
This initial grouping is grouped but not captured. The ?: is written at the start of a group when its contents should not be captured. A captured group can be referenced later in the regex with a back-reference such as \1, where the number identifies which capturing group is meant, counted by opening parenthesis from left to right. The entire match is not a numbered group inside the pattern; in JavaScript it is available as index 0 of the match result, and \0 in a pattern matches a NUL character rather than the whole match.
An example of the grouping is the 3-digit area code at the start of a phone number. The user can enter either three digits with no whitespace, or three digits surrounded by parentheses with no whitespace:


(?:\d{3} = 123
|\(\d{3}\)) = (123)


In the middle of the regex there is another set of parentheses, ([-\/\.]), and this one is captured. It is captured because of the expected structure of a phone number. Due to regional or formatting differences, there are three possibilities for the lineation of the phone number. A typical 10-digit NANP phone number has two points of lineation: ### ### ####. What goes in each lineation is grouped in a bracket expression as either a dash -, a forward slash /, or a period .. Possible matches with this grouping are as follows:


123.456.7890 or (123).456.7890
123/456/7890 or (123)/456/7890
123-456-7890 or (123)-456-7890


Two important points to highlight. First, whatever character you start with for lineation (-, /, or .) must be used in the second instance of lineation; you cannot mix them. Second, although the bracket is not written out a second time, the same character is expected in the second lineation between \d{3} and \d{4}: the \1 between them refers back to the captured group that matched the first lineation. This is known as back-referencing.
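The difference between a captured and a non-captured group shows up in the match result. A minimal sketch, with m as an illustrative variable name:

```javascript
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;

// String.prototype.match without the g flag returns the whole match
// at index 0, followed by one entry per capturing group.
const m = "(123).456.7890".match(phoneRe);

const wholeMatch = m[0];     // everything the regex matched
const firstCapture = m[1];   // what ([-\/\.]) matched
const entryCount = m.length; // whole match + capturing groups

console.log(wholeMatch, firstCapture, entryCount);
```

The result has only two entries: the whole match and the separator. The area code group starts with ?:, so it gets no slot of its own, which is also why the separator is group 1 and not group 2.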
Bracket Expressions

Continuing from character classes, the bracket lets us list specific characters to be matched literally. The selection works like the | (OR) operator: the character in the input is compared against each character in the bracket, and if it equals any of them, it is matched and the regex continues.
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

We can see the bracket encloses 3 possible values: a dash -, a forward slash \/, and a period \.. Each character after the dash is preceded by the escape character, a backslash \, so that it is read literally rather than as a regex operator. Inside a bracket the period is already literal, so its escape is optional but harmless. The dash is listed first so that it is read as a literal dash rather than as a range between two characters. The bracket is also enclosed in parentheses, which captures it as a group. Together with the back-reference \1 further along the regex, this checks that the character chosen for the first lineation, between the area code and the middle digits, is the same as the second lineation, between the middle digits and the last four digits. Below are examples of the bracket used in our regex, identical to the character classes example above:

[-\/\.]
Will Pass

(123)-456-7890 or 123-456-7890
(123).456.7890 or 123.456.7890
(123)/456/7890 or 123/456/7890


[-\/\.]
Will Fail

(123)-456/7890 or 123-456/7890
(123).456-7890 or 123.456-7890
(123)/456.7890 or 123/456.7890


We can see that even though each character is allowed by the bracket, the Will Fail examples always fail because the second lineation differs from the first. This is due to the back-reference to the captured group: once the first lineation is matched, that exact character is carried forward.
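Two details of the bracket itself, the escapes and the position of the dash, can be sketched as follows. The regexes escaped, unescaped, and dashInMiddle are hypothetical variations written for this comparison; the RegExp constructor is used for the unescaped forms so that the forward slash does not need to be escaped as a literal delimiter.

```javascript
// The bracket as written in the phone regex, anchored to one character.
const escaped = /^[-\/\.]$/;

// The same bracket without the backslashes.
const unescaped = new RegExp("^[-/.]$");

// Both versions should agree on every character tried here.
const chars = ["-", "/", ".", ",", "a"];
const sameBehaviour = chars.every((c) => escaped.test(c) === unescaped.test(c));

// With the dash placed between two characters it becomes a range:
// [.-/] means "from . to /", which no longer includes the dash.
const dashInMiddle = new RegExp("^[.-/]$");
const dashStillMatches = dashInMiddle.test("-");
const periodMatches = dashInMiddle.test(".");

console.log(sameBehaviour, dashStillMatches, periodMatches);
```

The escapes inside the bracket are a matter of style, while the position of the dash changes what the bracket means.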
Boundaries

The first component of any regex literal is the forward slash /, which marks the beginning and end of the expression. This is known as a delimiter, and it is the only delimiter JavaScript allows for a regex literal (a regex can also be built from a string with the RegExp constructor).
Our regex contains this beginning and ending forward slash:
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

Other regular expressions illustrate the same thing. The regexes below all use the forward slash to mark their beginning and end:


Matching a Hex Value – /^#?([a-f0-9]{6}|[a-f0-9]{3})$/
Matching an Email – /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
Matching a URL – /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Matching an HTML Tag – /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/


The / is a boundary for any regex literal created in JavaScript.
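The delimiters mark where the pattern starts and ends in the source code, not where a match must start and end in the input. Unlike the four example regexes above, the phone regex has no ^ or $ anchors. The sketch below compares the literal with an equivalent RegExp constructor call and with an anchored variant; built and anchored are assumptions added for illustration and are not part of the MDN snippet.

```javascript
// Regex literal, delimited by forward slashes.
const literal = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;

// The same pattern built from a string: no delimiters, but every
// backslash has to be doubled inside the string literal.
const built = new RegExp("(?:\\d{3}|\\(\\d{3}\\))([-/.])\\d{3}\\1\\d{4}");

// Hypothetical anchored variant: the whole input must be the number.
const anchored = /^(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}$/;

const text = "call 123-456-7890 today";

const literalFinds = literal.test(text);
const builtFinds = built.test(text);
const anchoredFinds = anchored.test(text);
const anchoredExact = anchored.test("123-456-7890");

console.log(literalFinds, builtFinds, anchoredFinds, anchoredExact);
```

Whether to anchor depends on the use case: searching for a number inside free text works with the original regex, while validating a form field that should contain nothing but the number would call for something like the anchored variant.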
Back-references

Back-referencing is a feature that lets you take the text matched by an earlier captured group and require it again later in the same regex. Our regex has 2 groups, which are pointed out below, but only a single back-reference.
/(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/

Besides the overall match, which JavaScript reports as index 0 of the match result, there are two groups in our regex. The first occurs here: (?:\d{3}|\(\d{3}\)), but because of the ?:, the contents of these parentheses are deliberately not captured; the group exists, but it is not numbered. The second is ([-\/\.]), which specifies the lineation between the number groups in the phone number. It is the first captured group, so it is the one referenced by \1, which lets us avoid repeated logic and stay consistent in what to expect from the user's input. Let's look at some examples, which are identical to the bracket and character class examples:

([-\/\.])...\1
Will Pass

(123)-456-7890 or 123-456-7890
(123).456.7890 or 123.456.7890
(123)/456/7890 or 123/456/7890


([-\/\.])...\1
Will Fail

(123)-456/7890 or 123-456/7890
(123).456-7890 or 123.456-7890
(123)/456.7890 or 123/456.7890


We can see that back-referencing takes into account what the user entered for the first lineation, between the area code and the middle number, and requires the second lineation to match it. Back-referencing also helps reduce repeated logic in the code.
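The effect of \1 is easiest to see next to a version that does without it. In the sketch below, noBackref is a hypothetical variant that repeats the bracket instead of back-referencing it; the sample lists reuse the Will Pass and Will Fail examples above.

```javascript
const phoneRe = /(?:\d{3}|\(\d{3}\))([-\/\.])\d{3}\1\d{4}/;

// Hypothetical variant: the bracket is written twice and nothing
// ties the second separator to the first.
const noBackref = /(?:\d{3}|\(\d{3}\))[-\/\.]\d{3}[-\/\.]\d{4}/;

const consistent = ["123-456-7890", "(123).456.7890", "123/456/7890"];
const mixed = ["123-456/7890", "(123).456-7890", "123/456.7890"];

// With the back-reference, consistent separators pass and mixed fail.
const consistentPass = consistent.every((s) => phoneRe.test(s));
const mixedFail = mixed.every((s) => !phoneRe.test(s));

// Without it, the mixed separators are accepted as well.
const mixedPassWithoutBackref = mixed.every((s) => noBackref.test(s));

console.log(consistentPass, mixedFail, mixedPassWithoutBackref);
```

Repeating the bracket only says "any allowed separator again", whereas \1 says "the same separator again", so the back-reference is both shorter and stricter.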
Author

As of August 2021, I am attending the University of Central Florida, completing my Bachelor's in Computer Engineering and the Web Development Bootcamp through the university. To learn more about my projects, visit My Github Profile.