@gavin-asay
Last active April 14, 2024 16:38
Regex and You: Matching an HTML Tag


Regular expressions, ever versatile, will help us locate HTML tags in a string today.

Summary

Pattern matching HTML strings serves at least one crucial function in web dev: sanitizing user input. Accepting user-submitted strings exposes an application to significant vulnerabilities. Suppose, for example, that some ne'er-do-well on the internet submits a comment that includes <script src="[path]/stealYourData.js"></script>. Regular expressions allow us to match HTML tags in a string, because HTML tags conform to a certain pattern. A tag will:

  • begin and end with angle brackets (< and >)
  • contain a tag name consisting of one or more lowercase letters, like p, a, div, strong, script
  • contain zero or more attributes, such as class="btn", src="/steal_your_data.js", or href="https://github.com/gavin-asay"
  • be accompanied by a closing tag in brackets with a slash and its tag name, e.g., </p>, </div> or
  • be a self-closing tag, which has one or more whitespace characters, then a slash before the closing bracket (>).

So, to pick out an HTML tag, we write a regex that can account for these various possibilities. Consider this regex:

/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/

If that looks like gibberish, that's because a regex often does at first glance. It takes some time to break down a lengthy regex and make sense of its pattern. Let's break this regex down piece by piece. Each section below explains one part of it.


/ {#slash}

In JavaScript, a regex literal is enclosed in forward slashes. Several other languages and tools use the same delimiter convention to denote a regular expression.
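As a small illustration of the slash delimiters, here is a sketch comparing a JavaScript regex literal with the equivalent RegExp constructor call. The sample pattern and strings are made up for this example.

```javascript
// Two equivalent ways to build the same pattern in JavaScript.
const literal = /<p>/;                  // regex literal, delimited by slashes
const constructed = new RegExp("<p>");  // constructor form, no slashes needed

console.log(literal.test("<p>Hello</p>"));          // true
console.log(constructed.test("<p>Hello</p>"));      // true
console.log(literal.source === constructed.source); // true
```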

^<([a-z]+) {#capture1}

^ {#carat}

When you see a caret ^ at the beginning of a regex, it matches the beginning of the string we're testing. Thus, only an HTML tag found at the very start of our string will fit the pattern. (Note that there is also a character that matches the end of the string, which we'll discuss later.)
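A quick sketch of the difference the anchor makes, using two throwaway patterns that are not part of the article's regex:

```javascript
// With ^, the tag must sit at the very start of the string.
const atStart = /^<p>/;
// Without ^, the tag may appear anywhere.
const anywhere = /<p>/;

console.log(atStart.test("<p>Hello</p>"));      // true
console.log(atStart.test("Say <p>Hello</p>"));  // false
console.log(anywhere.test("Say <p>Hello</p>")); // true
```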

< {#openbracket}

This single character < stands alone, not enclosed in any parentheses or brackets and not followed by a quantifier. This means that the pattern will match exactly one opening angle bracket, as we would expect from an HTML tag.

[a-z] {#class}

Square brackets [] mark a character class. Any character within the brackets will match the pattern. In this case, we match any lowercase letter from a to z. Note that for letters, regex is case sensitive. If we wanted to match capital letters as well, our character class would be [A-Za-z]. If we only wanted to match a handful of characters, we could use [abc123] to match only lowercase a, b, c, or the digits, 1, 2, and 3.
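The three classes mentioned above can be tried out directly. This is only a sketch, and the test strings are arbitrary:

```javascript
const lowercase = /[a-z]/;   // any lowercase letter
const anyCase = /[A-Za-z]/;  // upper- or lowercase letters
const handful = /[abc123]/;  // only a, b, c, 1, 2, or 3

console.log(lowercase.test("DIV")); // false, the class is case sensitive
console.log(anyCase.test("DIV"));   // true
console.log(handful.test("2"));     // true
console.log(handful.test("z"));     // false
```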

+ {#plus}

The plus sign + is a quantifier. It describes how many times the preceding element, here the character class, can be repeated. Plus means one or more times. That means we must have at least one character that matches [a-z], but two or any quantity beyond that will also match. Other quantifiers include the asterisk *, meaning zero or more times (essentially making the character class optional), and the question mark ?, meaning zero or one time.
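To compare the three quantifiers side by side, here is a minimal sketch. The patterns are anchored with ^ and $ (my own addition for the example) so that the whole string has to fit.

```javascript
const oneOrMore = /^[a-z]+$/;  // at least one lowercase letter
const zeroOrMore = /^[a-z]*$/; // any number of lowercase letters, including none
const zeroOrOne = /^[a-z]?$/;  // at most one lowercase letter

console.log(oneOrMore.test("div")); // true
console.log(oneOrMore.test(""));    // false
console.log(zeroOrMore.test(""));   // true
console.log(zeroOrOne.test("a"));   // true
console.log(zeroOrOne.test("ab"));  // false
```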

( ... ) {#capturing}

Finally, you'll notice that this segment is enclosed in parentheses ( ). Parentheses mark a capturing group. This means that the regex will remember the segment of the pattern matching everything inside those parentheses. We can refer back to this capturing group later. JavaScript will also keep track of the contents of this capturing group.

Still with me? Have you figured out what this first part matches? An opening HTML bracket <, followed by one or more lowercase letters. That's the start of an HTML tag: segments like <a, <div, or <p all match the pattern so far.

And what about the first capturing group? That's all of the letters, so a, div, or p would be the capturing group. That's our tag name, which we're keeping track of now.
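Here is a short sketch of how JavaScript exposes that first capturing group, using only the opening fragment of the regex. The sample string is invented for the example.

```javascript
// Just the first segment of the full pattern.
const opening = /^<([a-z]+)/;

const result = '<div class="btn">'.match(opening);

console.log(result[0]); // "<div", the whole matched text
console.log(result[1]); // "div", the contents of capturing group 1
```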


([^>]+)* {#capture2}

You'll notice that we're isolating a second capturing group.

[^>] {#class2}

Last time we saw a caret ^, it denoted the start of the string. At the start of a character class, however, ^ has a different meaning: it negates the class. We're excluding > here, and that's the only thing this class defines. If a character class only describes exclusions, then any character EXCEPT the excluded characters will match. Any character that isn't >, including letters, digits, symbols, and whitespace, matches this character class.
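A negated class is easy to check in isolation. The sketch below uses arbitrary sample strings and a second made-up class, [^abc], for contrast.

```javascript
const notClosing = /[^>]+/; // one or more characters that are not >

// The match stops right before the first >.
console.log('class="btn">Click'.match(notClosing)[0]); // class="btn"
console.log(notClosing.test(">>>"));                   // false

const notAbc = /[^abc]/;
console.log(notAbc.test("abc"));  // false, every character is excluded
console.log(notAbc.test("abcd")); // true, d is not excluded
```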

+ {#plus2}

As before, + matches one or more non-> characters.

( ... )* {#asterisk}

As we mentioned above, the asterisk * matches zero or more times. Thus, our second capturing group ([^>]+)* is optional and will match any run of one or more non-> characters. What is this very flexible pattern looking for? Anything that comes after the tag name and before the closing bracket >. That covers the tag's attributes: classes or ids, href, src, and flags like selected or disabled.

Let's look at an example:

<option value="United States" id="US" selected>

The first capturing group ([a-z]+) grabs the tag name (option) and remembers it for later. The second capturing group ([^>]+)* matches all of the attributes and flags (value="United States" id="US" selected), including the space that separates them from the tag name. That's stored as well.
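The option example can be run against just the opening part of the regex. Note that this shortened pattern, with a literal > appended, is my own cut of the full regex for illustration; the full pattern would also require a closing tag or a self-closing slash.

```javascript
// Opening-tag fragment only: tag name, optional attributes, closing bracket.
const openingTag = /^<([a-z]+)([^>]+)*>/;

const sample = '<option value="United States" id="US" selected>';
const [, tagName, attributes] = sample.match(openingTag);

console.log(tagName);    // option
// Group 2 keeps the leading space that follows the tag name.
console.log(attributes); //  value="United States" id="US" selected
```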


(?: ... ) {#noncapture}

Here we have another group, one that begins with ?:. The characters ?: denote a non-capturing group. A string must match everything inside a non-capturing group, but the group's contents will not be remembered later. You'll notice that there is a capturing group within this non-capturing group. It's that sub-group that we'll be more concerned with.
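To see what "not remembered" means in practice, here is a sketch with two toy patterns (not taken from the article) that differ only in whether the first group captures:

```javascript
const nonCapturing = /^(?:ab)+(c)$/;
const capturing = /^(ab)+(c)$/;

const a = "ababc".match(nonCapturing);
const b = "ababc".match(capturing);

// The non-capturing version stores only one group.
console.log(a.length, a[1]);       // 2 "c"
// The capturing version stores two; group 1 holds its last repetition.
console.log(b.length, b[1], b[2]); // 3 "ab" "c"
```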

>(.*) {#period}

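The first character matched in this segment is >, signifying the end of the opening tag. But why does the end of the tag appear in the middle of the regex? Because the pattern goes on to match the tag's contents and its closing tag, as the next parts show.

Next is the third capturing group (.*). The period . matches any character except a line break (in JavaScript, the s flag lifts that restriction). So, following the opening tag, the third capturing group matches any string on a single line, or no string at all.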
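A brief sketch of how (.*) behaves, including the detail that in JavaScript the period does not match a line break unless the s flag is set. The anchors and sample strings are assumptions added for the example.

```javascript
const afterBracket = /^>(.*)$/;

console.log(">Hello <b>world</b>".match(afterBracket)[1]); // Hello <b>world</b>
console.log(">".match(afterBracket)[1] === "");            // true, zero characters is fine

// Without the s flag, . stops at a line break, so the match fails.
console.log(afterBracket.test(">line one\nline two")); // false
// With the s flag, . also matches line breaks.
console.log(/^>(.*)$/s.test(">line one\nline two"));   // true
```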

<\/\1> {#escape}

What is \/ supposed to be? Programmers will recognize the backslash \ as escaping the following character. To match a forward slash /, we need to escape it. This is because / is a functional character in a regex literal, marking the beginning and end of the pattern.

What about \1? We don't need to escape digits, do we? A backslash followed by a digit is a backreference to the contents of a capturing group. Capturing group 1 matched the tag name. \1 doesn't simply repeat the pattern of capturing group 1; it matches the exact same text that capturing group 1 found. Thus, if the tag name was div, \1 must also match div; it can't match span or any other tag name.

Putting this segment together, we match <, then /, then capturing group 1, then >. You've likely caught on that this segment finds the closing tag that pairs with the opening tag we found previously. (.*) allows for any text that comes in between them. That means it can match any text or enclosed tags!
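The backreference can be tested with a trimmed-down pattern that leaves out the attribute group. This simplification is mine, made to keep the example small; the sample strings are invented.

```javascript
// Tag name, contents, then a closing tag that must repeat the same name.
const paired = /^<([a-z]+)>(.*)<\/\1>$/;

console.log(paired.test("<div>hi</div>"));  // true
console.log(paired.test("<div>hi</span>")); // false, \1 must be "div"

// (.*) is greedy, so it swallows enclosed tags up to the last closing tag.
const nested = "<b>one</b><b>two</b>".match(paired);
console.log(nested[2]); // one</b><b>two
```

The last line shows a consequence worth keeping in mind: with two sibling tags in one string, the pattern treats everything between the first opening tag and the final closing tag as content.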


| {#pipe}

The pipe | separates alternative patterns. >(.*)<\/\1> is a valid pattern, but what follows | can match instead of it.

\s+\/> {#short}

An escaped letter is a shorthand for a commonly used character class. Here, \s matches any whitespace character: space, tab, or a newline character. Other useful classes include \w (any word character [a-zA-Z0-9_]) and \d (any digit [0-9]).
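The three shorthand classes can be checked with a few one-liners. The sample strings here are arbitrary.

```javascript
console.log(/\s/.test("\t"));              // true, a tab is whitespace
console.log(/\s/.test("div"));             // false
console.log("btn_1 big".match(/\w+/)[0]);  // btn_1
console.log("width=200".match(/\d+/)[0]);  // 200
```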

Altogether, this alternate pattern matches one or more whitespace characters, then /, then >. The alternative to a separate closing tag is, naturally, the /> found in self-closing tags.
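To see the self-closing branch at work, here is a sketch that runs the full regex against a few self-closing tags. The tags <br /> and <img src="a.png" /> are samples chosen for this example, not necessarily the ones the author had in mind.

```javascript
const tagPattern = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

const br = "<br />".match(tagPattern);
console.log(br[1]); // br
console.log(br[3]); // undefined, the (.*) branch was not used

const img = '<img src="a.png" />'.match(tagPattern);
console.log(img[1]); // img
console.log(img[2]); //  src="a.png" (with its leading space)

// \s+ requires at least one whitespace character before the slash.
console.log(tagPattern.test("<br/>")); // false
```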

$/ {#dollar}

Finally, the dollar sign $ matches the end of the string. Then / closes out the regex pattern.
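With every piece covered, the whole regex can be wrapped in a small function. This is only a sketch: parseTag and the shape of the object it returns are hypothetical names invented for this example, not something from the article.

```javascript
const TAG_PATTERN = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Hypothetical helper: returns the captured pieces, or null when the
// string is not a single tag from start to end.
function parseTag(input) {
  const match = TAG_PATTERN.exec(input);
  if (match === null) return null;
  // Groups that did not participate in the match are undefined,
  // so the defaults below fill them in.
  const [, name, attributes = "", inner = null] = match;
  return { name, attributes: attributes.trim(), inner };
}

console.log(parseTag('<p class="x">Hi</p>')); // name "p", attributes 'class="x"', inner "Hi"
console.log(parseTag("<br />"));              // name "br", attributes "", inner null
console.log(parseTag("text <p>Hi</p>"));      // null, the tag is not at the start
console.log(parseTag("<P>Hi</P>"));           // null, only lowercase names match
```

Because of ^ and $, the function only recognizes a string that consists of exactly one tag; it will not find tags embedded in surrounding text.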

Author

Gavin is a full-stack web developer. See his work at https://github.com/gavin-asay.

@rickychongjl

very nice explanation, good for beginners to understand basic regex syntaxes

@shoaiyb

shoaiyb commented Jun 2, 2023

Also, here's another way I do it:

Tag Regex

/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/

The above RegExp can be broken down as:

  1. /: The regex opening delimiter.
  2. \<: Matches the character < of an opening tag.
  3. (?<tag>[a-z][a-z0-9\-]*): Matches a valid HTML tag name, which should start with a character between a and z, and can then contain further characters between a and z, digits between 0 and 9, and the character -.
  4. (\s+([\s\S]*?))?: Matches all of the tag's attributes, including the spaces between them, but only if any are present.
  5. \/?: Matches the optional / of self-closing tags.
  6. \>: Matches the character >, which closes the opening tag.
  7. (([\s\S]*?)\<\/(?P=tag)\>)?: Matches the content or HTML inside the tag and the closing tag, but only if the tag is not a self-closing tag.
  8. /: The regex closing delimiter.
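The pattern above uses PCRE syntax; in particular, the (?P=tag) backreference is not available in JavaScript. As a rough sketch for readers following the JavaScript examples in the main article, here is one possible adaptation that swaps it for \k<tag>. It is my own port, not something from the comment's author, and may differ from the original in edge cases.

```javascript
// JavaScript adaptation: (?P=tag) becomes \k<tag>; needless escapes dropped.
const tagPattern = /<(?<tag>[a-z][a-z0-9-]*)(\s+([\s\S]*?))?\/?>(([\s\S]*?)<\/\k<tag>>)?/;

const img = '<img src="x.png" width="200">'.match(tagPattern);
console.log(img.groups.tag); // img
console.log(img[3]);         // src="x.png" width="200"

const p = "<p>Text paragraph.</p>".match(tagPattern);
console.log(p.groups.tag); // p
console.log(p[5]);         // Text paragraph.
```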

Tag Attributes Regex

/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/

The above RegExp can be broken down as:

  1. /: The regex opening delimiter.
  2. ([\w\-]+): Matches the attribute's key/name.
  3. (\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?: Matches the value of the attribute, which can be anything wrapped in single quotes (') or double quotes ("). The value can also be bare (not wrapped in quotes), in which case it must contain only characters in the range a to z or A to Z, digits 0 to 9, _ (underscore), and - (hyphen). The whole group can also match nothing, for boolean attributes.
  4. /: The regex closing delimiter.
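JavaScript has neither the branch-reset group (?|...) nor (?P=quot), so the attribute pattern cannot be copied over as-is. The sketch below is a loose JavaScript approximation under that constraint: it uses a numbered backreference for the quote and separate groups for quoted and bare values. The function name parseAttributes is hypothetical.

```javascript
// Group 1: name. Group 2: the quote character. Group 3: quoted value.
// Group 4: bare (unquoted) value.
const attrPattern = /([\w-]+)(?:\s*=\s*(?:(["'])([\s\S]*?)\2|([\w-]+)))?/g;

// Hypothetical helper: collects attributes into a plain object.
function parseAttributes(source) {
  const result = {};
  for (const m of source.matchAll(attrPattern)) {
    // Boolean attributes have no value, so fall back to an empty string.
    result[m[1]] = m[3] ?? m[4] ?? "";
  }
  return result;
}

const attrs = parseAttributes('src="a.png" width=200 disabled');
console.log(attrs.src);      // a.png
console.log(attrs.width);    // 200
console.log(attrs.disabled); // (empty string)
```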

Example Usage in PHP

<?php

// HTML elements
$content = <<<EOL

<p>Text paragraph.</p>
<img src="http://example.com/image-200x320.png" width="200" height="320">

EOL;

// Tags matching RegExp
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Attributes matching RegExp
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Match all the valid elements in the HTML
preg_match_all( $tags_regexp, $content, $matches, PREG_SET_ORDER );

// Loop through and make the necessary changes
foreach ( $matches as $match ) {

  // We are going to modify only image tags
  if ( 'img' !== $match[ 'tag' ] ) continue;

  // Match all the attributes
  preg_match_all( $atts_regexp, $match[2], $atts_match );

  // Combine the keys and the values
  $atts_match = array_combine( $atts_match[1], $atts_match[4] );

  // Build back a HTML valid attributes
  $atts = '';
  foreach ( $atts_match as $name => $value ) {
    $atts .= sprintf( ' %s="%s"', $name, $value );
  }

  // Replacement for the tag
  $amp = sprintf( '<amp-img%s></amp-img>', $atts );

  // Replace the complete tag match with the new replacement
  $content = str_replace( $match[0], $amp, $content );

}


// The AMPifyed HTML
/**
 * <p>Text paragraph.</p>
 * <amp-img src="http://example.com/image-200x320.png" width="200" height="320"></amp-img>
 */
echo $content;

From: https://dev.to/shoaiyb/regexp-based-html-modification-in-php-5e85

@b-jsshapiro

The tags regex given by @shoaiyb is essential. The original regex skips over anything that is not a closing >, but if the > appears inside a quoted attribute value it is perfectly valid and does not terminate the attribute.

It had been a while since I had built this particular regex, and the reminder was really helpful. Thanks, all!
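One way to see the problem with > inside a quoted attribute value is to run the original regex from the article against such a tag. The sample string below is made up for the demonstration.

```javascript
const original = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// The title attribute legitimately contains a > inside its quotes.
const m = '<a title="a>b">x</a>'.match(original);

// The string still matches, but the groups are split at the wrong >.
console.log(m[2]); //  title="a
console.log(m[3]); // b">x
```

The attribute group ends at the first >, so the rest of the attribute value leaks into what the pattern treats as the tag's contents.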
