isayani/how-to-regex.md

## how-to-regex.md

      
Reg-What? A Guide to Regular Expressions

What is a Regex?

A regular expression, or regex, is a sequence of characters that specifies a search pattern in text. In other words, it is a compact notation for describing the strings you want to find, and it is supported in most programming languages.
Regex can be useful when:


looking for a patterned-string in your code (like a phone or card number)
replacing an existing pattern with another
working with string-based algorithms
input validation

So what does it look like?

/^[a-z0-9_-]{3,16}$/

At first, a regex may look like a different language, but each piece is a compact instruction for matching text.
Each part of the regex means something specific, and the notation is rooted in theoretical computer science and formal language theory.
Below, we will walk through an example regex and the key concepts needed to understand how it works.
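
As a rough illustration of input validation, the pattern shown above could be used to check a username. This is only a sketch: the source does not say what the pattern is for, and the helper name isValidUsername is hypothetical.

```javascript
// The pattern shown above: 3 to 16 characters, each a lowercase letter,
// a digit, an underscore or a hyphen.
const usernamePattern = /^[a-z0-9_-]{3,16}$/;

// Hypothetical helper, assuming the pattern is meant for usernames.
function isValidUsername(candidate) {
  return usernamePattern.test(candidate);
}

console.log(isValidUsername("regex_fan-42")); // true
console.log(isValidUsername("ab"));           // false: shorter than 3 characters
console.log(isValidUsername("Regex"));        // false: uppercase R is not allowed
```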
Summary

The pattern we will be looking at today matches HTML tags:
/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

We will break down the above expression while covering its basic regex components.
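
Before breaking the pattern down, it can help to see what it accepts and rejects. The snippet below is a minimal sketch; the sample strings are made up for illustration.

```javascript
const htmlTagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Illustrative inputs, not taken from the original guide.
const samples = ["<div>hello</div>", "<br />", "<div>hello</span>", "plain text"];

// test() returns true when the whole string matches the pattern.
const results = samples.map((sample) => htmlTagPattern.test(sample));

samples.forEach((sample, i) => console.log(sample, "->", results[i]));
// <div>hello</div> -> true
// <br /> -> true
// <div>hello</span> -> false (closing tag name differs)
// plain text -> false
```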
Table of Contents


Example Summary
Regex Components

Anchors
Grouping Constructs
Bracket Expressions
Character Escapes
Quantifiers
Character Classes
Backreferences
The OR Operator


Conclusion
Author

Regex Components

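In JavaScript, regular expressions are usually written in literal notation, so we wrap our desired pattern in slash characters like so: /regex/ 
 By default, regular expressions are case sensitive, so /regex/ does not match "REGEX" unless a flag such as i is added after the closing slash.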

Tip: Certain programming languages also have built-in regex constructors. For instance, JavaScript provides the RegExp constructor, whose notation differs slightly from what we are covering here. To learn more, see the documentation for RegExp.
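
To make the two notations concrete, here is a small sketch comparing a regex literal with the RegExp constructor in JavaScript. The variable names and sample strings are illustrative only.

```javascript
// Literal notation: the pattern sits between two slashes.
const literal = /regex/;

// Constructor notation: the pattern is passed as a string, without slashes.
const constructed = new RegExp("regex");

// The i flag makes the match case insensitive.
const caseInsensitive = /regex/i;

console.log(literal.test("I love regex"));         // true
console.log(constructed.test("I love regex"));     // true
console.log(literal.test("I love REGEX"));         // false
console.log(caseInsensitive.test("I love REGEX")); // true

// In a string passed to RegExp, each backslash has to be doubled.
const digits = new RegExp("\\d+");
console.log(digits.source); // \d+
```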


Anchors

Anchors have a special meaning in regular expressions because they do not match any character. Instead, they match a position before or after characters:

^ – The caret anchor matches the beginning of the text.
$ – The dollar anchor matches the end of the text.

So in our example regex, the entire string, from start to end, must match the pattern 
 <([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>). Now we can start breaking the pattern down further to see what it means.
After the initial anchor, the next character we see in our HTML tag regex is <. Although this may look like regex notation, it simply matches the literal character <. As we know, all HTML tags are enclosed in <>, e.g. <div></div>, and all of them start with the same < character.
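
The effect of anchors is easiest to see by comparing the same pattern with and without them. The words below are arbitrary examples chosen for this sketch.

```javascript
const unanchored = /cat/;   // "cat" may appear anywhere in the text
const anchored = /^cat$/;   // the whole text must be exactly "cat"

console.log(unanchored.test("concatenate")); // true
console.log(anchored.test("concatenate"));   // false
console.log(anchored.test("cat"));           // true

// Each anchor can also be used on its own.
console.log(/^cat/.test("catalog")); // true: text starts with "cat"
console.log(/cat$/.test("bobcat"));  // true: text ends with "cat"
```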

Grouping Constructs

As regular expressions grow in complexity, you may want to inspect multiple parts of a string rather than the whole. In order to do so, we use parentheses notation () better known as grouping constructs. 

In the context of our pattern, we have 3 different groups or subexpressions, of which one is nested (one group inside another).

([a-z]+)
([^<]+)
(?:>(.*)<\/\1>|\s+\/>)

Grouping constructs have two primary categories: capturing and non-capturing. A capturing group remembers the character sequence it matched so it can be used later, whereas a non-capturing group only groups part of the pattern and remembers nothing. A grouping construct is non-capturing if it starts with (?: right after the opening parenthesis. Our third subexpression is a non-capturing group, although the (.*) nested inside it is a capturing group.
You will also notice that inside these groups, we have brackets or bracket expressions. Taking a look at these will help us get one step closer to decoding our pattern.
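
The difference between the two kinds of groups shows up in what a match returns. The following sketch uses a made-up sample string; in JavaScript, match() returns the full match first, followed by one entry per capturing group.

```javascript
const capturing = /([a-z]+)-([a-z]+)/;
const nonCapturing = /(?:[a-z]+)-([a-z]+)/;

// Array.from keeps only the indexed entries of the match result.
const withBoth = Array.from("front-end".match(capturing));
const withOne = Array.from("front-end".match(nonCapturing));

console.log(withBoth); // [ 'front-end', 'front', 'end' ]
console.log(withOne);  // [ 'front-end', 'end' ]
```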

Bracket Expressions

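Bracket expressions, also called character groups, list the characters we want to allow at one position in our match. We write them with []. A positive character group matches any one of the listed characters, while a negative character group, which starts with ^ right after the opening bracket, matches any one character that is not listed. In our case, two of the three subexpressions we outlined previously use bracket expressions.

[a-z]  This bracket matches any single lowercase letter from a to z.
[^<]  This bracket looks similar to the ^< we covered initially, but inside brackets a leading ^ means "not", so it matches any single character except <.


Tip: Most special characters, such as . * + ? and |, lose their special meaning inside bracket expressions and are matched literally. The exceptions are the backslash (so escapes like \d still work), a ^ at the start, a - between two characters, and the closing ].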
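
Here is a short sketch of how the two bracket expressions from our pattern behave in JavaScript, along with how special characters are treated inside brackets. The sample strings are invented for illustration.

```javascript
// [a-z] matches one lowercase letter.
console.log(/[a-z]/.test("a")); // true
console.log(/[a-z]/.test("A")); // false

// [^<] matches one character that is NOT "<"; with + it grabs a run of them.
const run = "<div class='x'>".match(/[^<]+/)[0];
console.log(run); // div class='x'>

// Inside brackets a dot is a literal dot, not "any character".
console.log(/[.]/.test("a")); // false
console.log(/[.]/.test(".")); // true

// Escapes such as \d still work inside brackets.
console.log(/[\d]/.test("7")); // true
```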


Character Escapes

The backslash \ in a regex turns a character that would otherwise be interpreted as regex notation into a character literal. For example, {} in regex encloses a quantifier. But adding a backslash, as in \{, tells the regex to look for the open curly brace character rather than starting a quantifier.
We can see character escapes being used in the last subexpression of our example (?:>(.*)<\/\1>|\s+\/>) 

Here we see <\/ and \/>, where \/ is an escaped forward slash. The slash has to be escaped because an unescaped / would end the regex literal. These match the start of a closing tag and the end of a self-closing tag.
For instance the </ in <div></div> or the /> in <br />.
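
The sketch below shows both escapes discussed above: the escaped forward slash and the escaped curly brace. The sample strings are illustrative.

```javascript
// \/ matches a literal forward slash inside a regex literal.
const closingDiv = /<\/div>/;
console.log(closingDiv.test("<div></div>")); // true

// Unescaped braces form a quantifier: "a" exactly twice.
console.log(/a{2}/.test("aa"));     // true

// Escaped braces are matched as literal characters.
console.log(/a\{2\}/.test("aa"));   // false
console.log(/a\{2\}/.test("a{2}")); // true
```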

Quantifiers

Quantifiers set how many times an individual part of your regex may repeat. They usually specify the minimum and maximum number of repetitions that your regex is looking for. We can set our own limits with curly brackets or use the shorthand symbols.

*   Matches the pattern zero or more times.
+   Matches the pattern one or more times.
?   Matches the pattern zero or one time.
{} Curly brackets can provide three different ways to set limits for a match:

{ n } Matches the pattern exactly n number of times
{ n, } Matches the pattern at least n number of times
{ n, x } Matches the pattern from a minimum of n number of times to a maximum of x number of times


In our example, we see quantifiers a few times, mostly + and *. We can identify the quantifiers in our HTML tag pattern:

([a-z]+) We are using the + quantifier to match a letter from a-z one or more times
([^<]+) We are using the + quantifier to match any character other than "<" one or more times
* We are using the * quantifier to match the second group ([^<]+) zero or more times
(?:>(.*)<\/\1>|\s+\/>) We are using the * quantifier to match any character zero or more times, and the + quantifier to match whitespace one or more times
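
Each quantifier listed above can be tried on a tiny pattern. In this sketch the anchors ^ and $ are added so that the whole string has to fit the limits; the test strings are arbitrary.

```javascript
const checks = [
  /^ab*$/.test("a"),        // true:  * allows zero b's
  /^ab*$/.test("abbb"),     // true:  * allows many b's
  /^ab+$/.test("a"),        // false: + needs at least one b
  /^ab?$/.test("abb"),      // false: ? allows at most one b
  /^a{2}$/.test("aa"),      // true:  exactly two a's
  /^a{2,}$/.test("aaaa"),   // true:  at least two a's
  /^a{2,3}$/.test("aaaa"),  // false: more than three a's
];

console.log(checks);
```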


Character Classes

Now we get to the good part. A character class in regex does exactly what it sounds like: it stands for a predefined set of characters to match. Normally with bracket expressions, we spell out which characters to find, like [a-z].
But with character classes, we can shorthand some of the more common bracket expressions:

 .   Matches any character except line breaks such as the newline character (\n).
\d Matches any numeric digit; equivalent to the bracket expression [0-9].
\w Matches any alphanumeric character or the underscore (_); equivalent to the bracket expression [A-Za-z0-9_].
\s Matches a single whitespace character, including spaces, tabs (\t) and line breaks.


Tip: Uppercase versions of these character classes perform inverse matches. \D, \W, and \S match non-digits, non-word characters and non-whitespace characters respectively.

You will notice in our third subexpression (?:>(.*)<\/\1>|\s+\/>) we have a character class \s+ which is matching one or more white spaces.
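
The character classes above can be tried out with a few JavaScript string methods. The strings are made up for this sketch.

```javascript
// \d+ finds the first run of digits.
const firstNumber = "Order 66 shipped".match(/\d+/)[0];
console.log(firstNumber); // 66

// \w+ finds the first run of word characters (letters, digits, underscore).
const firstWord = "snake_case 1".match(/\w+/)[0];
console.log(firstWord); // snake_case

// \s+ matches runs of whitespace, here used as a separator.
const parts = "a \t b".split(/\s+/);
console.log(parts); // [ 'a', 'b' ]

// The dot does not match a newline.
console.log(/./.test("\n")); // false

// \D is the inverse of \d: remove everything that is not a digit.
const digitsOnly = "abc123".replace(/\D/g, "");
console.log(digitsOnly); // 123
```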

Backreferences

In grouping constructs, we covered capturing and non-capturing groups. Well, what do we do with the information in captured groups? We reference it later. A backreference is written as a backslash followed by the number of a capture group, such as \1, and matches the same text that group matched. The one thing to remember is that backreferences can only refer to groups captured earlier in the pattern, so we usually see them towards the end of our regex.
Following suit, our example has a backreference in the last subexpression (?:>(.*)<\/\1>|\s+\/>).
Now we can also see why this last subexpression is non-capturing. The outer parentheses only group the two alternatives together, so there is nothing we need to remember from the group as a whole. Inside it, (.*) is a capturing sub-group that matches any character, repeated as many times as needed.

\1 A backslash followed by a number retrieves the text captured by the group with that number. Groups are numbered by their opening parentheses, from left to right, so \1 refers to ([a-z]+).

Knowing this, we can tell that this last subexpression handles our HTML closing tags. It uses the backreference to take the tag name from the first captured group and requires the closing tag to use the same name.
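
A backreference is easier to follow on a smaller pattern first. The sketch below uses two simplified patterns that are not part of the original example: one finds a doubled letter, the other checks that an opening and closing tag use the same name.

```javascript
// ([a-z]) captures one letter; \1 then requires the same letter again.
const doubledLetter = /([a-z])\1/;
const doubled = "letter".match(doubledLetter)[0];
console.log(doubled);                     // tt
console.log(doubledLetter.test("later")); // false: no letter repeats back to back

// Simplified tag check: \1 must repeat whatever name group 1 captured.
const sameTagName = /^<([a-z]+)><\/\1>$/;
console.log(sameTagName.test("<div></div>")); // true
console.log(sameTagName.test("<div></p>"));   // false
```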

The OR Operator

Similar to programming logic, sometimes we need conditional statements to achieve our goal. The OR operator | makes it possible to include multiple alternatives in one expression. So, the expression cat matches these letters in this sequence, but c|a|t matches the character c or a or t. Note that | is not needed inside a bracket expression: [cat] already matches c or a or t, and [c|a|t] would also match a literal | character.
Using our knowledge of OR operations, we can now fully decode the subexpression from earlier (?:>(.*)<\/\1>|\s+\/>):

This is a non-capturing group
Looking for > followed by any characters, then a closing tag made of </, the text captured by subexpression 1, and >

OR
Looking for one or more whitespace characters followed by />
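
To see the OR operator on its own, here is a small sketch contrasting alternation with a bracket expression in JavaScript. The words are arbitrary examples.

```javascript
// Alternation: the whole text must be "cat" or "dog".
const catOrDog = /^(?:cat|dog)$/;
console.log(catOrDog.test("cat")); // true
console.log(catOrDog.test("dog")); // true
console.log(catOrDog.test("cow")); // false

// A bracket expression matches ONE character out of the set.
console.log(/^[cat]$/.test("t"));   // true
console.log(/^[cat]$/.test("cat")); // false: three characters, not one

// Inside brackets the pipe is just another literal character.
console.log(/^[c|a|t]$/.test("|")); // true
```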


Conclusion

After covering all the core components of our example regex, we can now break it down in its entirety.
/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

From left to right: 


We are looking for text beginning with "<"  /^<
It must be followed by one or more a-z letters, the tag name, which we capture ([a-z]+)
We then allow one or more characters that are not "<", such as attributes ([^<]+)
We let that previous group match 0 or more times  *
Now, we open a non-capturing group that first matches the ">" ending the opening tag (?:>
Inside this group, we write a pattern for text that may live between HTML tags (.*)
Lastly, we denote our closing tag with a "</" and a reference to our first group followed by ">" <\/\1>

 OR | 

We denote a self-closing tag by matching one or more whitespace characters followed by "/>", and then the end of the text \s+\/>)$


Here is a step-by-step example of how the regex finds the <footer> and <img> tags:

classic: < self-closing: <
classic: <fo self-closing: <img
classic: <footer self-closing: <img
 matching preceding pattern 0 or more times 
classic: <footer> self-closing: <img
classic: <footer> text self-closing: <img src='image.png'
classic: <footer> text </footer> self-closing: <img src='image.png' />
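
The same two tags can be run through the pattern in JavaScript to inspect what each capture group ends up holding. This is a sketch that assumes a standard JavaScript regex engine; index 0 of the result is the full match and the following indexes are the capture groups.

```javascript
const htmlTagPattern = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const classic = "<footer> text </footer>".match(htmlTagPattern);
console.log(classic[1]); // footer   (tag name, group 1)
console.log(classic[3]); //  text    (content between the tags, group 3)

const selfClosing = "<img src='image.png' />".match(htmlTagPattern);
console.log(selfClosing[1]); // img               (tag name, group 1)
console.log(selfClosing[2]); //  src='image.png'  (attributes, group 2)
console.log(selfClosing[3]); // undefined: the (.*) branch was not used
```

Note that the self-closing branch requires at least one whitespace character before "/>", so a tag written as <br/> without a space would not match this particular pattern.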


Author

This gist was created by Insha Sayani. 
To see more gists by Insha click here!

© 2022 Insha Sayani of ISayani Creative Services, Confidential and Proprietary. All Rights Reserved.