HTML Tag Regular Expression Explained

The regular expression (regex) we will be looking at in this tutorial is designed to match an HTML tag and return the match. It also includes subgroups for the type of tag (h1, footer, a, etc.), any attributes the tag may have, and the content contained between the opening and closing tags.
This regex makes a number of useful actions possible, such as manipulating an HTML file to strip all attributes from tags, or removing all content within the tags. These actions may also be used to extract bare HTML to create templates, or to remove hard-coded style tags within text content.
Summary

The original expression suggested to me for this tutorial is expressed here:
/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
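However, I soon found an important case that this expression does not cover: heading tags! Because a heading tag name contains both a letter and a numeral (h1 to h6), the first group captures only the letter, the back-reference in the closing tag no longer lines up, and the expression above does not return a match.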
With a simple alteration the new expression solves this omission:
/^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
To break the expression down a little, let's compare it with an HTML Tag and its various parts:
<h1 class="title">My Website</h1>
^<([a-z]+\d?) corresponds to the first part of our tag: <h1.
([^<]+)*(?:> corresponds to the second part of the opening tag: class="title">
(.*) corresponds to the inner text/content of the tag: My Website
<\/\1>|\s+\/>)$ corresponds to the closing tag: </h1>
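To see the breakdown in action, here is a minimal sketch that runs both expressions in JavaScript. The variable names are only illustrative and are not part of the original expression.

```javascript
// The original expression and the revised one from the summary.
const originalTagRegex = /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
const revisedTagRegex = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// The original captures only "h", so the back-reference looks for </h>.
const originalMatchesHeading = originalTagRegex.test("<h1>My Website</h1>");
console.log(originalMatchesHeading); // false

// The revised expression returns the full match plus three capture groups.
const sampleTag = '<h1 class="title">My Website</h1>';
const [fullMatch, tagName, attributes, content] = revisedTagRegex.exec(sampleTag);
console.log(tagName);    // h1
console.log(attributes); //  class="title" (with a leading space)
console.log(content);    // My Website
```

Note that the attributes group keeps the space that separates it from the tag name, so you may want to trim it before using it.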
Table of Contents


Anchors
Quantifiers
OR Operators
Flags
Grouping and Capturing
Bracket Expressions
Greedy Match
Back-references

Regex Components

Anchors

Our expression contains two anchors, ^ and $. Anchors are used in regular expressions to denote the start of the string being matched (the caret symbol ^) and the end of the string (the dollar sign $). Together they tell the regex engine that the match must begin at the first character and end at the last, so the whole string has to be a single tag.
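A short sketch of what the anchors do in practice. The unanchored variant below is not part of the tutorial's expression; it is included only as a comparison.

```javascript
const anchoredTagRegex = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
// Same pattern with ^ and $ removed, for comparison only.
const unanchoredTagRegex = /<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)/;

const tagOnly = "<p>Hello</p>";
const tagInSentence = "Intro <p>Hello</p> outro";

const anchoredOnTag = anchoredTagRegex.test(tagOnly);
const anchoredOnSentence = anchoredTagRegex.test(tagInSentence);
const unanchoredOnSentence = unanchoredTagRegex.test(tagInSentence);

console.log(anchoredOnTag);        // true
console.log(anchoredOnSentence);   // false: the string does not start with "<"
console.log(unanchoredOnSentence); // true: the tag may sit anywhere in the string
```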
Quantifiers

We also make use of a few different quantifiers in this expression. The first one tells us that following the first < symbol, inside the first capture-group (more on this later), we should have one or more alphabetic characters which represent the tag name. To declare that we should have one or more letters, we use the plus sign:  [a-z]+.
As originally written, our expression would not expect a numeric character immediately following the letters of the tag name. This meant that heading tags such as <h1>My Website</h1> would not be returned as a match. To fix this, I've added another quantifier, the question mark ?. The question mark tells us that the item just before it is optional: it may or may not exist in the phrase, just as a heading tag will have a number in it but an anchor tag will not. The optional character in this case is a digit, so in the expression we add \d? and suddenly all our heading tags work!
We also use the * quantifier, both on its own and as part of .*. The * tells us to expect zero or more of whatever precedes it. Here it follows the second capture group, ([^<]+)*, where it is used to match any attributes that may be contained in a tag. Because a tag may have no attributes at all, and an attribute may contain letters, numbers or symbols, the * is used here.
Finally, the .* component tells us to expect anything: any character (other than a line break, unless the s flag is set), zero or more times. Here it is used to match the inner content or text content of the HTML tag, which may be empty (as in a script tag) or may hold a lot of content (as in a paragraph tag). .* allows us to accept either case.
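The quantifiers can be tried in isolation. The small patterns below are simplified pieces pulled out of the full expression for illustration; they are assumptions for this sketch, not patterns from the original.

```javascript
// "+" : one or more lowercase letters (the tag name).
const plusOnWord = /^[a-z]+$/.test("footer");
const plusOnEmpty = /^[a-z]+$/.test("");
console.log(plusOnWord);  // true
console.log(plusOnEmpty); // false: "+" needs at least one letter

// "?" : an optional single digit after the letters.
const tagNamePattern = /^[a-z]+\d?$/;
const anchorName = tagNamePattern.test("a");
const headingName = tagNamePattern.test("h1");
const twoDigits = tagNamePattern.test("h12");
console.log(anchorName);  // true
console.log(headingName); // true
console.log(twoDigits);   // false: "?" allows at most one digit

// ".*" : zero or more of any character, so empty content is accepted too.
const contentPattern = /^<p>(.*)<\/p>$/;
const emptyContent = contentPattern.exec("<p></p>")[1];
const someContent = contentPattern.exec("<p>Hello</p>")[1];
console.log(emptyContent === ""); // true
console.log(someContent);         // Hello
```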
OR Operators

Our expression uses one OR operator, the symbol |, towards the end of the expression to match the closing part of our 'phrase'. It allows us to match either a closing tag such as </h2>, or /> to match non-container tags such as <br />. The part of the expression containing the OR operator looks like this: <\/\1>|\s+\/>. The first part uses the group we initially captured for the tag name, and the second requires at least one whitespace character and a forward slash before closing the tag. Strictly speaking, the | splits everything inside the enclosing non-capturing group, so the two alternatives are >(.*)<\/\1> and \s+\/>.
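As a quick check of both branches, the sketch below tests one container tag and one self-closing tag. The variable names are illustrative only.

```javascript
const tagRegexWithOr = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const containerTag = tagRegexWithOr.test("<h2>Title</h2>");
const selfClosingTag = tagRegexWithOr.test("<br />");
const selfClosingNoSpace = tagRegexWithOr.test("<br/>");

console.log(containerTag);       // true: matched by the first alternative
console.log(selfClosingTag);     // true: matched by the second alternative
console.log(selfClosingNoSpace); // false: \s+ requires whitespace before "/>"
```

The last case shows a limitation worth knowing: a self-closing tag written without a space before the slash is not matched by this expression.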
Flags

Though our regex does not include any flags, we could add one or more to make it more useful. If, for instance, we added i to the end of the expression (which makes it case-insensitive), we would match HTML tags that use capital letters, such as <H1>, as was more common in the past. Another useful flag is g, which matches all instances instead of just the first. Note that because of the ^ and $ anchors, g only finds more than one match when it is combined with the m flag, which lets the anchors match at the start and end of each line.
If we include both, our expression will look like this: /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/gi
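The following sketch shows the effect of the flags. The three-line sample document is made up for this example, and the m (multiline) flag is an addition that the original expression does not use; it is included because ^ and $ otherwise refer to the whole string.

```javascript
const caseSensitive = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
const caseInsensitive = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/i;

const upperWithoutI = caseSensitive.test("<H1>Title</H1>");
const upperWithI = caseInsensitive.test("<H1>Title</H1>");
console.log(upperWithoutI); // false
console.log(upperWithI);    // true

// Hypothetical document with one tag per line.
const sampleDocument = "<h1>Title</h1>\n<p>Text</p>\n<br />";

// With "gi" only, ^ and $ still refer to the whole string, so nothing matches.
const withGi = [...sampleDocument.matchAll(/^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/gi)];
console.log(withGi.length); // 0

// Adding "m" lets ^ and $ match at each line, so "g" can find every tag.
const withGim = [...sampleDocument.matchAll(/^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/gim)];
const foundNames = withGim.map((m) => m[1]);
console.log(foundNames); // [ 'h1', 'p', 'br' ]
```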
Grouping and Capturing

Using parentheses to indicate capture groups, our expression is able to capture and reuse or return three distinct parts of an HTML tag. The first, ([a-z]+\d?), captures the tag name, which is referenced later in the closing tag. The second, ([^<]+)*, captures any attributes that the tag may have, such as class, id or inline styles. And finally, (.*) returns a third group containing the inner content/text of the HTML tag.
We can also see a 'non-capturing group' here: (?:>(.*)<\/\1>|\s+\/>). The ?: after the opening parenthesis means the group is used only to hold the two alternatives together and is not returned as a match of its own. The capturing group inside it, (.*), is still captured as usual.
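One way the captured groups might be reused is to rebuild a tag without its attributes, as mentioned at the start of this tutorial. The helper below, stripAttributes, is a hypothetical function written for this sketch and assumes it is given a single tag as a string.

```javascript
const captureRegex = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical helper: rebuild a tag from groups 1 and 3, dropping group 2.
function stripAttributes(tag) {
  const match = captureRegex.exec(tag);
  if (match === null) return tag; // not a tag this pattern recognises
  const [, name, , content] = match;
  // Self-closing tags never reach the (.*) group, so content is undefined.
  return content === undefined ? `<${name} />` : `<${name}>${content}</${name}>`;
}

const strippedParagraph = stripAttributes('<p class="intro" id="x">Hello</p>');
const strippedImage = stripAttributes('<img src="a.png" />');
console.log(strippedParagraph); // <p>Hello</p>
console.log(strippedImage);     // <img />
```

Because the non-capturing group adds no entry of its own, the groups stay numbered 1 to 3 in the match result.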
Bracket Expressions

Two bracket expressions are used in our expression. Enclosed within square brackets, a bracket expression is a list of characters that will match any single character in the list. The first is used to match the tag name: [a-z] matches any one lowercase letter, and with the + quantifier it matches a sequence of one or more, written [a-z]+.
Interestingly, the second bracket expression has the ^ symbol right after the opening square bracket. This indicates that it should match anything NOT listed between the brackets. Written as [^<], it matches any character that is NOT a < symbol, which in this case would indicate the start of the closing tag.
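Both bracket expressions can be tested on single characters. The anchored one-character patterns below are simplified for illustration and are not taken from the original expression.

```javascript
// [a-z] matches exactly one lowercase letter.
const lowerLetter = /^[a-z]$/.test("q");
const upperLetter = /^[a-z]$/.test("Q");
const digitChar = /^[a-z]$/.test("7");
console.log(lowerLetter); // true
console.log(upperLetter); // false
console.log(digitChar);   // false

// [^<] matches any single character except "<".
const equalsSign = /^[^<]$/.test("=");
const lessThan = /^[^<]$/.test("<");
console.log(equalsSign); // true
console.log(lessThan);   // false

// [^<]+ therefore runs up to, but not including, the next "<".
const upToNextTag = 'class="title">My Website</h1>'.match(/[^<]+/)[0];
console.log(upToNextTag); // class="title">My Website
```

The last line shows that [^<]+ also accepts the > character. In the full expression, the engine backtracks so that the attributes group ends just before the > that closes the opening tag.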
Greedy Match

As mentioned in the section on Quantifiers, we do have one component which qualifies as what we call a greedy match: (.*). It is referred to as "greedy" because it matches as many characters as it can, including whitespace, and only gives characters back when the rest of the expression would otherwise fail. If not well constructed, an expression with a greedy match can continue to match items beyond what was intended.
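Here is a small sketch of a greedy match running further than intended. The input with two sibling tags is invented for the example, and the lazy form (.*?) is shown only for contrast; it is not used in the tutorial's expression.

```javascript
const twoTags = "<b>one</b><b>two</b>";

// The tutorial's expression treats everything up to the LAST </b> as content.
const tutorialRegex = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
const greedyContent = tutorialRegex.exec(twoTags)[3];
console.log(greedyContent); // one</b><b>two

// Simplified, unanchored patterns to compare greedy and lazy matching.
const greedyInner = /<b>(.*)<\/b>/.exec(twoTags)[1];
const lazyInner = /<b>(.*?)<\/b>/.exec(twoTags)[1];
console.log(greedyInner); // one</b><b>two
console.log(lazyInner);   // one
```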
Back-references

In the earlier section on Grouping and Capturing, inside the non-capturing group at the end of the expression, we can see something called a back-reference. A back-reference is a reference to a group captured earlier in the expression: \1. Here we reference the first captured group, which means the closing tag must contain the same tag name that we found in the opening tag.
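The effect of the back-reference is easiest to see with mismatched tags. A minimal sketch, with made-up inputs:

```javascript
const backrefRegex = /^<([a-z]+\d?)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const matchingPair = backrefRegex.test("<h1>Title</h1>");
const wrongLevel = backrefRegex.test("<h1>Title</h2>");
const wrongName = backrefRegex.test("<em>Hi</strong>");

console.log(matchingPair); // true:  \1 is "h1" and the closing tag is </h1>
console.log(wrongLevel);   // false: \1 is "h1" but the closing tag is </h2>
console.log(wrongName);    // false: \1 is "em" but the closing tag is </strong>
```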
Author

Shawn Evans is a full-stack web developer based in the Greater Toronto Area. In addition to developing for the web, he is a musician, dementia specialist, world traveller, event planner and above all, a life-long learner.
You can view his web development portfolio at https://bluesatyr.github.io/