@adinan-cenci
Last active October 12, 2021 23:24
How to parse HTML with JavaScript

Parsing HTML with JavaScript

In the browser, JavaScript provides us with the DOMParser class to parse XML/HTML into a DOM.

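But sometimes we end up dealing with invalid HTML. When asked to parse it as XML ('text/xml'), the DOMParser offers no way to recover from syntax errors: it stops at the first one and, rather than throwing, returns a document that describes the error in a parsererror element, leaving us without the content we wanted.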
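Because the failure is reported inside the returned document rather than thrown, it is easy to miss. A minimal sketch of how it could be detected, assuming a browser environment; `findParserError` and `parseXmlStrict` are hypothetical helper names, not part of any API, and the optional `parser` argument is only there so the helper can be exercised without a real DOMParser:

```javascript
// Hypothetical helper: returns the parser's error text, or null when the
// document contains no <parsererror> element.
function findParserError(doc)
{
    var errorNode = doc.querySelector('parsererror');
    return errorNode ? errorNode.textContent : null;
}

// Hypothetical helper: parses strictly as XML and throws on failure
// instead of quietly returning an error document.
function parseXmlStrict(markup, parser)
{
    parser = parser || new DOMParser();

    var doc   = parser.parseFromString(markup, 'text/xml');
    var error = findParserError(doc);

    if (error !== null) {
        throw new Error('XML parsing failed: ' + error);
    }

    return doc;
}
```

Note that this check is a heuristic: a well-formed XML document that happens to contain an element named parsererror would be reported as a failure too.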

Another way is to use document.implementation.createHTMLDocument and assign the markup to innerHTML, which runs it through the browser's forgiving HTML parser.

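function getDocumentWithDomParser(html) 
{
    var parser  = new DOMParser();
    // 'text/xml' selects the strict XML parser, which rejects invalid markup.
    var doc     = parser.parseFromString(html, 'text/xml');
    
    return doc;
}

//  VS

function getDocumentWithCreateDocument(html) 
{
    // Create an empty HTML document and let the forgiving HTML parser fill it.
    var doc = document.implementation.createHTMLDocument('');
    doc.documentElement.innerHTML = html;

    return doc;
}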
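
For comparison, DOMParser also accepts 'text/html' as its second argument, in which case it uses the browser's HTML parser and tolerates invalid markup instead of producing a parsererror. A minimal sketch, assuming a browser environment; `getDocumentWithHtmlMode` is a hypothetical name chosen to mirror the two functions above, and the optional `parser` argument exists only to make the function easy to test:

```javascript
// Hypothetical third variant: same DOMParser, but in HTML mode.
function getDocumentWithHtmlMode(html, parser)
{
    parser = parser || new DOMParser();

    // 'text/html' selects the lenient HTML parser rather than the XML one.
    return parser.parseFromString(html, 'text/html');
}

// Usage in a browser (HTML entities and unclosed tags are accepted):
// var doc = getDocumentWithHtmlMode('<title>a&ccedil;&atilde;o</title><p>unclosed');
// console.log(doc.title);
```

Unlike the createHTMLDocument approach, this one parses the string as a complete document rather than as the inner markup of an existing html element.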
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Example</title>
</head>
<body>
<script>
function getDocumentWithDomParser(html)
{
    var parser = new DOMParser();
    // 'text/xml' selects the strict XML parser.
    var doc = parser.parseFromString(html, 'text/xml');
    return doc;
}

function getDocumentWithCreateDocument(html)
{
    var doc = document.implementation.createHTMLDocument('');
    doc.documentElement.innerHTML = html;
    return doc;
}

// Fetch a file and resolve with its contents as text.
function fetchFile(url)
{
    return fetch(url).then((res) => res.text());
}

fetchFile('2-flawed-document.html').then((html) =>
{
    var doc1 = getDocumentWithDomParser(html);
    var doc2 = getDocumentWithCreateDocument(html);

    //------------------

    console.log('failure', doc1);
    console.log('success', doc2);

    //------------------

    // The XML parser gives up before reaching <title>, so title1 is
    // expected to be null and '???' is logged for it.
    var title1 = doc1.querySelector('title');
    var title2 = doc2.querySelector('title');

    console.log(title1 ? title1.innerHTML : '???');
    console.log(title2 ? title2.innerHTML : '???');
});
</script>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<base href="http://mywebsite.com/" />
<!-- ↓ all it takes for the XML parser to fail: the named entities below (ccedil, atilde) are defined in HTML but not in XML -->
<meta name="description" content="Lorem ipsum dolor sit amet ao a&ccedil;&atilde;o" />
<title>My website's title</title>
</head>
<body>
<div id="website">
<nav id="nav">
<ul>
<li><a href="index.php">Home</a></li>
<li><a href="about/">About</a></li>
<li>
<a href="products/" target="_blank">Products</a>
<ul>
<li><a href="product/?category=1&page=1" target="_blank">CDs</a></li>
<li><a href="product/?category=2&page=1" target="_blank">Books</a></li>
</ul>
</li>
<li><a href="contact/">Contact</a></li>
</ul>
</nav>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
</div>
</body>
</html>