gavin-asay/regex_html_tag.md

## regex_html_tag.md

      
Regex and You: Matching an HTML Tag

Regular expressions, ever versatile, will help us locate HTML tags in a string today.
Summary

Pattern matching HTML strings serves at least one crucial function in web development: sanitizing user input. Accepting user-submitted strings exposes an application to significant vulnerabilities. Suppose, for example, that some ne'er-do-well on the internet submits a comment that includes <script src="[path]/stealYourData.js"></script>. Regular expressions let us match HTML tags in a string because HTML tags conform to a certain pattern. A tag must:

begin and end with angle brackets (< and >)
contain a tag name consisting of one or more lowercase letters, like p, a, div, strong, script (the pattern below does not account for names that contain digits, such as h1)
contain zero or more attributes, such as class="btn", src="/steal_your_data.js", or href="https://github.com/gavin-asay"
either be accompanied by a closing tag in brackets with a slash and its tag name, e.g., </p>, </div>, or
be a self-closing tag, which has one or more whitespace characters, then a slash, before the closing bracket (>).

So, to pick out an HTML tag, we write a regex that can account for these various possibilities. Consider this regex:
/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/
If that looks like gibberish, that's because a regex often does at first glance. It takes some time to break down a lengthy regex and make sense of its pattern. Let's break this regex down piece by piece. Look in the table of contents for an explanation for each part of this lengthy regex.
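Before the breakdown, it may help to see the pattern in action. The following is a minimal JavaScript sketch; the variable names tagPattern and samples, and the sample strings themselves, are illustrative assumptions rather than part of the original pattern.

```javascript
// The pattern from above, stored as a regex literal.
// `tagPattern`, `samples`, and the strings below are illustrative, not from the article.
const tagPattern = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const samples = [
  '<p>Hello</p>',               // paired tag
  '<div class="box">Hi</div>',  // paired tag with an attribute
  '<br />',                     // self-closing tag
  '<div>Hi</span>',             // closing tag does not pair with the opening tag
  'Hello <b>world</b>',         // tag is not at the start of the string
];

for (const sample of samples) {
  // test() returns true when the whole string fits the pattern
  console.log(sample, '->', tagPattern.test(sample));
}
```

The first three strings fit the pattern; the last two do not, for the reasons noted in the comments.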
Table of Contents


/
^<([a-z]+)
^
<
[a-z]
+
( ... )
([^>]+)*
[^>]
+
( ... )*
(?: ... )
>(.*)
<\/\1>
|
\s+\/>
$

/ {#slash}

In JavaScript, a regex literal is enclosed in forward slashes. JavaScript (along with languages such as Perl and Ruby) recognizes this syntax as denoting a regular expression.
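As a small illustration of the slash syntax, here is a sketch that builds the same pattern twice: once as a literal and once with JavaScript's built-in RegExp constructor. The variable names are illustrative assumptions.

```javascript
// Two ways to build the same regular expression in JavaScript.
// `literalForm` and `constructorForm` are illustrative names.
const literalForm = /<\/p>/;                 // literal: enclosed in forward slashes
const constructorForm = new RegExp('</p>');  // constructor: pattern passed as a string

console.log(literalForm.source);                 // the pattern text, without the enclosing slashes
console.log(literalForm.test('<p>Hi</p>'));      // true
console.log(constructorForm.test('<p>Hi</p>'));  // true
```

Note that the slash inside the literal has to be escaped, while the string passed to the constructor needs no such escape.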
^<([a-z]+) {#capture1}

^ {#carat}

When you see a caret ^ at the beginning of a regex, it matches the beginning of the string we're testing. Thus, only an HTML tag found at the very start of our string will fit the pattern. (Note that we also have a character that matches the end of the string, which we'll discuss later.)
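A short sketch of the anchor on its own, using a deliberately tiny pattern (the name startsWithBracket is an illustrative assumption):

```javascript
// `startsWithBracket` is an illustrative pattern that uses only the ^ anchor and "<".
const startsWithBracket = /^</;

console.log(startsWithBracket.test('<p>Hi</p>'));      // true: "<" is the first character
console.log(startsWithBracket.test('Say <p>Hi</p>'));  // false: "<" appears later in the string
```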
< {#openbracket}

This single character < stands alone, not enclosed in any parentheses or brackets. This means that the pattern will match exactly one opening angle bracket, as we would expect from an HTML tag.
[a-z] {#class}

Square brackets [] mark a character class. Any one character listed within the brackets will match. In this case, we match any lowercase letter from a to z. Note that regex is case sensitive by default. If we wanted to match capital letters as well, our character class would be [A-Za-z]. If we only wanted to match a handful of characters, we could use [abc123] to match only lowercase a, b, c, or the digits 1, 2, and 3.
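The three character classes mentioned above can be compared side by side. This is a minimal sketch with illustrative variable names and test strings:

```javascript
// Illustrative names; each pattern is a single character class.
const lowercaseOnly = /[a-z]/;
const anyLetter = /[A-Za-z]/;
const handful = /[abc123]/;

console.log(lowercaseOnly.test('DIV'));  // false: no lowercase letter anywhere
console.log(anyLetter.test('DIV'));      // true: capitals are included in this class
console.log(handful.test('x2'));         // true: "2" is in the class
console.log(handful.test('xyz'));        // false: none of x, y, z is in the class
```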
+ {#plus}

The plus sign + is a quantifier. It describes how many times the preceding character class can be repeated. Plus means one or more times. That means we must have at least one character that matches [a-z], but two or any greater quantity will also match. Other quantifiers include the asterisk *, meaning zero or more times (essentially making the character class optional), and the question mark ?, meaning zero or one time.
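To compare the three quantifiers, here is a sketch that applies each one to the same character class. The patterns are anchored with ^ and $ (the end-of-string anchor) so that the whole string has to fit; the variable names are illustrative assumptions.

```javascript
// Anchored with ^ and $ so the whole string has to fit the quantified class.
const oneOrMore = /^[a-z]+$/;   // + : one or more
const zeroOrMore = /^[a-z]*$/;  // * : zero or more
const zeroOrOne = /^[a-z]?$/;   // ? : zero or one

for (const text of ['', 'a', 'abc']) {
  console.log(JSON.stringify(text), oneOrMore.test(text), zeroOrMore.test(text), zeroOrOne.test(text));
}
// ''    -> false, true, true   (only + demands at least one letter)
// 'a'   -> true,  true, true
// 'abc' -> true,  true, false  (? allows at most one letter)
```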
( ... ) {#capturing}

Finally, you'll notice that this segment is enclosed in parentheses ( ). Parentheses mark a capturing group. This means that the regex will remember the part of the string that matches everything inside those parentheses. We can refer back to this capturing group later. JavaScript will also keep track of the contents of this capturing group.
Still with me? Have you figured out what this first part matches? An opening HTML bracket <, followed by one or more lowercase letters. That's the start of an HTML tag: segments like <a, <div, or <p all match the pattern so far.
And what about the first capturing group? That's all of the letters, so a, div, or p would be captured. That's our tag name, which we're keeping track of now.
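Here is a sketch of how JavaScript exposes that first capturing group, using only the first segment of the pattern. The names openingPart and found, and the sample string, are illustrative assumptions.

```javascript
// Only the first segment of the full pattern.
const openingPart = /^<([a-z]+)/;

// String.prototype.match returns an array: index 0 is the whole match,
// index 1 is the text remembered by capturing group 1.
const found = '<div class="box">'.match(openingPart);

console.log(found[0]);  // "<div" (the whole match)
console.log(found[1]);  // "div"  (capturing group 1, the tag name)
```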

([^>]+)* {#capture2}

You'll notice that we're isolating a second capturing group.
[^>] {#class2}

Last time we saw a caret ^, it denoted the start of the string. Within a character class, however, a leading ^ has a different meaning: it excludes the listed characters from the class. We're excluding > here, and that's the only thing this class defines. If a character class only describes exclusions, then any character EXCEPT the excluded characters will match. Any character that isn't >, including letters, digits, symbols, and whitespace, matches this character class.
+ {#plus2}

As before, + matches one or more non-> characters.
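The negated class and its quantifier can be tried out together. This is a minimal sketch; the variable names and the sample string are illustrative assumptions.

```javascript
// A single character that is anything except ">".
const nonBracket = /[^>]/;
console.log(nonBracket.test('a'));  // true
console.log(nonBracket.test(' '));  // true: whitespace is not ">"
console.log(nonBracket.test('>'));  // false: the only excluded character

// With +, the class matches a run of one or more such characters.
const nonBracketRun = /[^>]+/;
console.log(' class="btn">Click'.match(nonBracketRun)[0]);  // ' class="btn"' (stops before ">")
```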
( ... )* {#asterisk}

As mentioned above, the asterisk * matches zero or more times. Thus, our second capturing group ([^>]+)* is optional and will include any run of one or more non-> characters. What is this very flexible pattern looking for? Anything that comes after the tag name and before the closing bracket >. That includes the tag's attributes: classes or ids, href, src, or flags like selected or disabled.
Let's look at an example:
<option value="United States" id="US" selected>
The first capturing group ([a-z]+) grabs the tag name (option) and remembers it for later. The second capturing group ([^>]+)* matches all of the attributes and flags (value="United States" id="US" selected, along with the space that precedes them). That's stored as well.
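The option example can be checked with a sketch that uses just the opening-tag portion of the pattern (the full pattern would also require a closing tag). The names openingTag, optionTag, and parts are illustrative assumptions.

```javascript
// Opening-tag portion only: "<", tag name, optional attributes, ">".
const openingTag = /^<([a-z]+)([^>]+)*>/;
const optionTag = '<option value="United States" id="US" selected>';

const parts = optionTag.match(openingTag);

console.log(parts[1]);         // "option"
console.log(parts[2]);         // ' value="United States" id="US" selected' (with its leading space)
console.log(parts[2].trim());  // 'value="United States" id="US" selected'
```

Because [^>] also matches whitespace, the second group keeps the space that separates the tag name from the first attribute; trim() is one way to drop it.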

(?: ... ) {#noncapture}

Here we have another group, one that begins with ?:. The characters ?: denote a non-capturing group. A string must match everything inside a non-capturing group, but the group's match will not be remembered later. You'll notice that there is a capturing group within this non-capturing group. It's that sub-group that we'll be more concerned with.
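The difference between the two kinds of group shows up in what match() returns. This is a small sketch with made-up patterns (capturing and nonCapturing are illustrative names) rather than the tag pattern itself:

```javascript
// Same matching behavior, different bookkeeping.
const capturing = /^(ab)+$/;       // remembers the last "ab" it matched
const nonCapturing = /^(?:ab)+$/;  // groups "ab" for the + but remembers nothing

const withGroup = 'abab'.match(capturing);
const withoutGroup = 'abab'.match(nonCapturing);

console.log(withGroup.length, withGroup[1]);  // 2 "ab": whole match plus one captured group
console.log(withoutGroup.length);             // 1: only the whole match
```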
>(.*) {#period}

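The first character matched in this segment is >, signifying the end of the opening tag. Why does the end of the tag appear in the middle of the regex? Because the pattern describes a whole element: the opening tag, its contents, and the closing tag.
Next is the third capturing group (.*). The period . matches any character (in JavaScript, any character except a line break by default), and the asterisk lets it repeat zero or more times. So, following the complete opening tag, the third capturing group matches any string, or no string at all.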
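A sketch of this segment in isolation, anchored at both ends so that (.*) has to account for everything after the bracket. The name afterBracket and the sample strings are illustrative assumptions.

```javascript
// ">" followed by any run of characters, captured as a group.
const afterBracket = /^>(.*)$/;

console.log('>Hello, world'.match(afterBracket)[1]);  // "Hello, world"
console.log('>'.match(afterBracket)[1]);              // "" (zero characters is still a match)

// Without extra flags, the dot in JavaScript does not match a line break,
// so content that spans two lines does not fit this pattern.
console.log(afterBracket.test('>line one\nline two'));  // false
```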
<\/\1> {#escape}

What is \/ supposed to be? Programmers will recognize the backslash \ as escaping the character that follows. To match a forward slash /, we need to escape it, because / is a functional character in a regex literal, marking the beginning and end of the pattern.
What about \1? We don't need to escape digits, do we? An escaped digit is a backreference to the contents of a capturing group. Capturing group 1 matched the tag name. A backreference doesn't simply repeat the pattern of capturing group 1; it matches the exact same text that capturing group 1 found. Thus, if the tag name was div, \1 must also match div; it can't match span or any other tag name.
Putting this segment together, we match <, then /, then the contents of capturing group 1, then >. You've likely caught on that this segment finds the closing tag that pairs with the opening tag we found previously. (.*) allows for any text that comes in between them. That means it can match any text or enclosed tags!
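The backreference can be seen at work in a simplified sketch that leaves out attributes and self-closing tags. The name pairedTag and the sample strings are illustrative assumptions.

```javascript
// Simplified: tag name, ">", any content, then a closing tag with the SAME name.
const pairedTag = /^<([a-z]+)>(.*)<\/\1>$/;

console.log(pairedTag.test('<div>Hi</div>'));   // true: \1 must be "div", and it is
console.log(pairedTag.test('<div>Hi</span>'));  // false: "span" is not the text group 1 captured

// Enclosed tags are just more characters as far as (.*) is concerned.
const nested = '<div><p>Hi</p></div>'.match(pairedTag);
console.log(nested[1]);  // "div"
console.log(nested[2]);  // "<p>Hi</p>"
```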

| {#pipe}

The pipe | separates alternative patterns. Inside the non-capturing group, either >(.*)<\/\1> or the pattern that follows | can match.
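As a small illustration of alternation inside a group, here is a sketch with a made-up pattern (eitherEnding is an illustrative name, and hr is used only as a convenient example) that accepts two different endings:

```javascript
// After "<hr", accept either ">" or "/>".
const eitherEnding = /^<hr(?:>|\/>)$/;

console.log(eitherEnding.test('<hr>'));   // true: the first alternative matches
console.log(eitherEnding.test('<hr/>'));  // true: the second alternative matches
console.log(eitherEnding.test('<hr!'));   // false: neither alternative matches
```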
\s+\/> {#short}

An escaped letter can be shorthand for a commonly used character class. Here, \s matches any whitespace character, such as a space, tab, or newline. Other useful classes include \w (any word character, [a-zA-Z0-9_]) and \d (any digit, [0-9]).
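The three shorthand classes can be checked directly. This is a minimal sketch with illustrative variable names:

```javascript
const whitespace = /\s/;
const wordChar = /\w/;
const digit = /\d/;

console.log(whitespace.test(' '), whitespace.test('\t'), whitespace.test('\n'));  // true true true
console.log(whitespace.test('a'));                  // false
console.log(wordChar.test('_'), wordChar.test('-'));  // true false: underscore counts, hyphen does not
console.log(digit.test('7'), digit.test('x'));        // true false
```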
Altogether, this alternate pattern matches one or more whitespace characters, then /, then >. The alternative to a separate closing tag is, naturally, the /> found in self-closing tags.
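One consequence of \s+ is that the whitespace before the slash is required, not optional. The sketch below uses a simplified pattern without attributes; the name selfClosing and the br samples are illustrative assumptions.

```javascript
// Simplified: tag name, one or more whitespace characters, then "/>".
const selfClosing = /^<([a-z]+)\s+\/>$/;

console.log(selfClosing.test('<br />'));    // true: one space before the slash
console.log(selfClosing.test('<br   />'));  // true: several spaces are fine
console.log(selfClosing.test('<br/>'));     // false: \s+ needs at least one whitespace character
```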
$/ {#dollar}

Finally, the dollar sign $ matches the end of the string. Then / closes out the regex pattern.
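To see what the end anchor contributes, this sketch compares the full pattern with a copy that drops the $. The names anchoredPattern and unanchoredPattern, and the sample string, are illustrative assumptions.

```javascript
// The full pattern, and the same pattern without the trailing $.
const anchoredPattern = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/;
const unanchoredPattern = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)/;

const trailingText = '<p>Hi</p> and more';

console.log(anchoredPattern.test(trailingText));    // false: text follows the closing tag
console.log(unanchoredPattern.test(trailingText));  // true: the tag at the start is enough
```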
Author

Gavin is a full-stack web developer. See his work at https://github.com/gavin-asay.