How do you parse HTML with regex?
Let's say we want to pull out the text of every <title> element from a piece of HTML. The HTML might look like this:
<html> <head> <title>My Web Page</title> </head> <body bgcolor="white"> <h1>Hello There</h1> </body> </html>
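To parse the HTML with a regular expression, we first have to write a regular expression that matches <title> tags. Here's one way to do that (note that the forward slash in </title> must be escaped inside a regex literal, and that the parentheses capture the text between the tags):
var regex = /<title>([^<]+)<\/title>/g;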
Next, we use the RegExp.prototype.exec() method to match the regular expression against our sample HTML string. Each call returns a single match as an array, with the whole match at index 0 and the captured title text at index 1, or null when no match is left. Because the regex has the g flag, each call resumes where the previous one stopped.
Finally, we call exec() in a loop until it returns null, pulling out the captured value each time:
var match;
while ((match = regex.exec(sample)) !== null) {
    var title = match[1]; // text between <title> and </title>
    // do something with title
}
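To make the shape of an exec() result concrete, the following sketch inspects successive matches and the regex's lastIndex property. The twoTitles string and the variable names are made up for illustration; they are not part of the recipe.

```javascript
// Hypothetical input with two <title> elements, for illustration only.
var twoTitles = '<title>First</title><title>Second</title>';
var titleRegex = /<title>([^<]+)<\/title>/g;

var first = titleRegex.exec(twoTitles);
// first[0] is the whole match, first[1] is the captured group.
console.log(first[0]); // <title>First</title>
console.log(first[1]); // First
// With the g flag, lastIndex records where the next search starts.
console.log(titleRegex.lastIndex); // 20

var second = titleRegex.exec(twoTitles);
console.log(second[1]); // Second

// No matches remain, so exec() returns null and lastIndex resets to 0.
var third = titleRegex.exec(twoTitles);
console.log(third); // null
```

Indexing into the result with [1] is what yields the title text directly, so there is no need to strip the tags afterwards.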
Here's the full source code:
var sample = '<html> <head> <title>My Web Page</title> </head> <body bgcolor="white"> <h1>Hello There</h1> </body> </html>';
var regex = /<title>([^<]+)<\/title>/g;
var match;
while ((match = regex.exec(sample)) !== null) {
    var title = match[1]; // text between <title> and </title>
    // do something with title
}
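If the same extraction is needed in more than one place, it could be wrapped in a small function that returns every captured title as an array. This is only a sketch: extractTitles is a hypothetical helper name, and the pattern assumes the title contains no nested tags.

```javascript
// extractTitles is a hypothetical helper name, not part of any library.
function extractTitles(html) {
    // Create the regex inside the function so lastIndex always starts at 0.
    var regex = /<title>([^<]+)<\/title>/g;
    var titles = [];
    var match;
    while ((match = regex.exec(html)) !== null) {
        titles.push(match[1]); // keep only the captured text
    }
    return titles;
}

var page = '<html> <head> <title>My Web Page</title> </head> <body bgcolor="white"> <h1>Hello There</h1> </body> </html>';
console.log(extractTitles(page)); // [ 'My Web Page' ]
```

Building the regex inside the function avoids a common pitfall: a shared regex with the g flag remembers its lastIndex between calls, which can cause later searches to start partway through a new string.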
Discussion
The RegExp.prototype.exec() method is a convenient way to extract values from an HTML string whose markup is simple and predictable. For example, you can use it to extract the contents of all the <td> tags in an HTML table row, or to find all the <img> tags in an HTML page.
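As an illustration of the <td> case, the sketch below collects the text of each cell in a row. The row string is a made-up sample, and the pattern assumes that cells contain plain text with no nested tags.

```javascript
// Hypothetical sample row; the pattern assumes cells contain no nested tags.
var row = '<tr> <td>A</td> <td class="wide">B</td> </tr>';
// \b stops the pattern from matching longer tag names; [^>]* skips attributes.
var cellRegex = /<td\b[^>]*>([^<]*)<\/td>/g;
var cells = [];
var cellMatch;
while ((cellMatch = cellRegex.exec(row)) !== null) {
    cells.push(cellMatch[1]); // captured cell text
}
console.log(cells); // [ 'A', 'B' ]
```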
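You can also use regular expressions to modify HTML. For example, you can remove all the <table> tags from an HTML string by passing a regex to String.prototype.replace():
var regex = /<\/?table\b[^>]*>/g;
var sample = '<table> <tr> <td>A</td> <td>B</td> </tr> <tr> <td>C</td> <td>D</td> </tr> <tr> <td>E</td> <td>F</td> </tr> </table>';
var output = sample.replace(regex, '');
The output variable will now contain the following (plus a leading and trailing space left over from the original string):
<tr> <td>A</td> <td>B</td> </tr> <tr> <td>C</td> <td>D</td> </tr> <tr> <td>E</td> <td>F</td> </tr>
This regex matches both the opening <table> tag, with any attributes, and the closing </table> tag. replace() swaps each match for an empty string and leaves everything between the tags intact.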
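If the goal is only the cell text, A B C D E F, one option is to strip every tag rather than just the <table> ones and then tidy up the whitespace. The sketch below is one way to do that under the assumption that the markup is simple, with no > characters inside attribute values; the variable names are illustrative.

```javascript
var table = '<table> <tr> <td>A</td> <td>B</td> </tr> <tr> <td>C</td> <td>D</td> </tr> <tr> <td>E</td> <td>F</td> </tr> </table>';

// Remove every tag, then collapse the leftover runs of whitespace.
var text = table
    .replace(/<[^>]+>/g, '')
    .replace(/\s+/g, ' ')
    .trim();

console.log(text); // A B C D E F
```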
See Also
Recipe 1.6 for information about writing regular expressions; http://www.regular-expressions.info/ for more information on regular expressions.