@AMiller42
Last active July 28, 2021 03:57

(Markdown is my own, for formatting and stuff)

How do you parse HTML with regex?

Let's say we want to pull out the contents of all the <title> tags from a piece of HTML. The HTML might look like this:

<html> <head> <title>My Web Page</title> </head> <body bgcolor="white"> <h1>Hello There</h1> </body> </html>

To parse the HTML with a regular expression, we'll first have to write the regular expression that matches <title> tags. Here's one way to do that:

var regex = /<title>([^<]+)<\/title>/g;

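Next, we use the RegExp.prototype.exec() method to match the regular expression against our sample HTML string. Because the regex has the g flag, each call returns the next match (or null when there are no more), so we call it in a loop and collect the results:

var matches = [];
var match;
while ((match = regex.exec(sample)) !== null) {
  matches.push(match);
}

Finally, we iterate over the matches array. Each entry holds the full match at index 0 and the captured title text at index 1, so we pull out the value at index 1:

for (var i = 0; i < matches.length; i += 1) {
  var title = matches[i][1];
  // do something with title
}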
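To see why exec() is called repeatedly on a global regex, it can help to watch the regex's lastIndex property, which advances after every successful match. The following is a minimal sketch, assuming a made-up input string with two <title> elements; the variable names are invented for illustration.

```javascript
// Hypothetical input with two titles, used only for illustration.
var twoTitles = '<title>First</title><title>Second</title>';
var titleRegex = /<title>([^<]+)<\/title>/g;

var first = titleRegex.exec(twoTitles);   // first[1] is 'First'
var afterFirst = titleRegex.lastIndex;    // 20: the index just past the first match
var second = titleRegex.exec(twoTitles);  // second[1] is 'Second'
var third = titleRegex.exec(twoTitles);   // null: nothing left to match
var afterLast = titleRegex.lastIndex;     // 0: a failed match resets lastIndex
```

Because this state lives on the regex object, reusing the same global regex on a different string without resetting lastIndex can skip matches.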

Here's the full source code:

var sample = '<html> <head> <title>My Web Page</title> </head> <body bgcolor="white"> <h1>Hello There</h1> </body> </html>';
var regex = /<title>([^<]+)<\/title>/g;
// With the g flag, each exec() call returns the next match, or null when done.
var matches = [];
var match;
while ((match = regex.exec(sample)) !== null) {
  matches.push(match);
}
for (var i = 0; i < matches.length; i += 1) {
  var title = matches[i][1]; // capture group 1: the text inside <title>
  // do something with title
}
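As an alternative, engines that support ES2020 provide String.prototype.matchAll(), which returns an iterator over every match of a global regex. The following is a rough sketch, assuming such an engine is available; the function name extractTitles is made up for this example.

```javascript
// Hypothetical helper: collect the text of every <title> element.
function extractTitles(html) {
  var titleRegex = /<title>([^<]+)<\/title>/g;
  // matchAll() requires the g flag and yields one match array per hit.
  return Array.from(html.matchAll(titleRegex), function (match) {
    return match[1]; // capture group 1 is the text between the tags
  });
}

var page = '<html> <head> <title>My Web Page</title> </head> <body bgcolor="white"> <h1>Hello There</h1> </body> </html>';
var titles = extractTitles(page);
// titles is ['My Web Page']
```

Since the regex is created inside the function, each call starts with fresh match state.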

Discussion

The RegExp.prototype.exec() method is a powerful way to extract values from an HTML string, as long as the markup is simple and predictable. For example, you can use it to extract the values of all the <td> tags from an HTML table row, or all the <img> tags from an HTML page.
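For instance, pulling the cell values out of a single table row might look like the sketch below. It assumes simple cells with no nested tags; the row string and the names cellRegex and cellValues are invented for illustration.

```javascript
// Hypothetical table row; cells with nested tags would not be captured by this pattern.
var row = '<tr> <td>A</td> <td class="wide">B</td> <td>C</td> </tr>';
// "<td", optional attributes, ">", then the cell text captured up to the next tag.
var cellRegex = /<td\b[^>]*>([^<]*)<\/td>/g;

var cellValues = [];
var cell;
while ((cell = cellRegex.exec(row)) !== null) {
  cellValues.push(cell[1]);
}
// cellValues is now ['A', 'B', 'C']
```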

You can also use regular expressions to modify HTML. For example, you can remove all the <table> tags from an HTML string by passing the following regex to String.prototype.replace():

var regex = /<\/?table\b[^>]*>/g;
var sample = '<table> <tr> <td>A</td> <td>B</td> </tr> <tr> <td>C</td> <td>D</td> </tr> <tr> <td>E</td> <td>F</td> </tr> </table>';
var output = sample.replace(regex, '').trim();

The output variable will now contain:

<tr> <td>A</td> <td>B</td> </tr> <tr> <td>C</td> <td>D</td> </tr> <tr> <td>E</td> <td>F</td> </tr>

This regex matches every opening and closing <table> tag, and replace() swaps each one for an empty string, so the rows and cells between them are left in place.
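Removing a tag can also be wrapped in a small function that strips the opening and closing tags for a given element name while keeping the content in between. This is only a sketch: removeTag is a hypothetical helper, it assumes tagName is a plain alphanumeric element name, and it makes no attempt to handle comments or attribute values that contain a > character.

```javascript
// Hypothetical helper: remove <tagName ...> and </tagName>, keep what is between them.
function removeTag(html, tagName) {
  // Build the pattern at runtime; the i flag makes the tag name case-insensitive.
  var tagRegex = new RegExp('<\\/?' + tagName + '\\b[^>]*>', 'gi');
  return html.replace(tagRegex, '');
}

var table = '<table border="1"> <tr> <td>A</td> </tr> </table>';
var rowsOnly = removeTag(table, 'table').trim();
// rowsOnly is '<tr> <td>A</td> </tr>'
```

The \b after the tag name keeps the pattern from also matching longer names that merely start with the same letters.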

See Also

Recipe 1.6 for information about writing regular expressions; http://www.regular-expressions.info/ for more information on regular expressions.
