@mdawaffe
Last active September 1, 2023 06:14
Strip HTML markup from a string.

Strip Tags

Strip HTML tags from a string. Does not use (much) regex.

Never loads any resources (<img>, <script>, etc.) referenced in the input.

Whitespace handling is probably consistent across browsers, but that is not guaranteed.

Technique

Rather than relying on fragile regexes, the input is parsed into a DOM and the text nodes are extracted from it.
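
As a rough illustration of why regexes are fragile here, the sketch below uses a hypothetical naive_strip helper (not part of this gist) built on a common one-line pattern. It only shows two failure modes and does not represent every regex approach.

```javascript
// Hypothetical helper for illustration only: a naive tag-stripping regex.
function naive_strip( html ) {
	return html.replace( /<[^>]*>/g, '' );
}

// A ">" inside a quoted attribute value ends the match too early,
// so part of the attribute leaks into the output.
var leaked = naive_strip( '<img alt="a > b">text' );
// leaked: ' b">text'

// Only the tags are removed, so the contents of <script> are kept.
var script = naive_strip( '<script>alert(1)</script>hi' );
// script: 'alert(1)hi'
```

Parsing the input into a DOM avoids both problems, because the browser's HTML parser decides where tags begin and end.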

This is safe because the input is parsed by assigning it to node.innerHTML on a detached element, and <script> elements inserted via innerHTML are not executed.

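Assigning to node.innerHTML does still load images and other resources, though. To prevent that, every occurrence of src and href in the input is first munged to srco and hrefo, so the parser no longer sees attributes it recognizes. The extra o is removed again from each text node when the text is collected.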
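
The round trip can be sketched as follows. mung and demung are hypothetical helper names (the gist inlines both steps), and the case-insensitive i flag is an assumption of this sketch, made because HTML attribute names are matched case-insensitively.

```javascript
// Hypothetical helpers for illustration; the gist inlines both steps.

// Append "o" to every "src"/"href" so the parser sees unknown attributes
// (srco, hrefo) and never requests the URLs they carry.
function mung( html ) {
	return html.replace( /(src|href)/gi, '$1o' );
}

// Undo the rename in text that survived tag stripping.
function demung( text ) {
	return text.replace( /(src|href)o/gi, '$1' );
}

var munged = mung( '<img src="a.png"> see the href docs' );
// munged: '<img srco="a.png"> see the hrefo docs'

var restored = demung( ' see the hrefo docs' );
// restored: ' see the href docs'
```

Munging touches ordinary text as well as attributes, which is why the demung step is needed at all.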

To exclude the contents of <script> and <style>, the DOM tree is walked recursively, skipping those elements and collecting the value of every text node found.

Todo

  • Look into DOM Mutation events.
  • Any other elements/attributes to worry about?

Compatibility

Safari, Chrome, Firefox, IE6-11.

Other Techniques

  • Regex: ugly
  • <noscript> tag's .innerHTML: inconsistent across browsers. Firefox treats .innerHTML as .innerText.
  • <div> tag's .innerHTML inside an <iframe security="restricted" sandbox="allow-same-origin">: clever, but .innerHTML already doesn't run scripts.
  • <noscript> tag's .innerHTML inside an <iframe security="restricted" sandbox="allow-same-origin">: inconsistent across browsers. Firefox still treats .innerHTML as .innerText.
function strip_tags( string ) {
	var div = document.createElement( 'div' );

	// Mung src and href (in any letter case, since HTML attribute names are
	// case-insensitive) to stop <img>, <link>, etc. from loading anything.
	div.innerHTML = string.replace( /(src|href)/gi, '$1o' );

	return get_text_from_node( div, true );
}

function get_text_from_node( node, demung ) {
	var tag = ( node.tagName || '' ).toLowerCase();
	var out;
	var i;

	// Skip <script> and <style> along with everything inside them.
	if ( 'script' === tag || 'style' === tag ) {
		return '';
	}

	// The numeric fallbacks cover old browsers that lack the node type constants.
	switch ( node.nodeType ) {
	case node.ELEMENT_NODE || 1 :
	case node.DOCUMENT_NODE || 9 :
	case node.DOCUMENT_FRAGMENT_NODE || 11 :
		out = '';
		for ( i = 0; i < node.childNodes.length; i++ ) {
			// Pass demung down, otherwise nested text is never demunged.
			out += get_text_from_node( node.childNodes[i], demung );
		}
		return out;
	case node.TEXT_NODE || 3 :
	case node.CDATA_SECTION_NODE || 4 :
		if ( demung ) {
			// Demung here instead of above in strip_tags() to preserve text from inputs like <span>sr</span>co.
			return node.nodeValue.replace( /(src|href)o/gi, '$1' );
		}
		return node.nodeValue;
	}

	// Comments and any other node types contribute no text.
	return '';
}
mdawaffe (author) commented Sep 1, 2023

This is old and ugly :) Look at DOMParser instead.
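
A minimal sketch of what the DOMParser route might look like, assuming a browser environment. collect_text and strip_tags_with_domparser are hypothetical names, not code from the gist. Documents created by DOMParser have scripting disabled and should not fetch images or other resources, so no src/href munging is attempted here. Treat that as an assumption to verify in the browsers you target.

```javascript
// Hypothetical helper: collect text from a node tree, skipping <script>
// and <style> subtrees. Works on any object that exposes nodeType,
// nodeName, childNodes and nodeValue.
function collect_text( node ) {
	var name = ( node.nodeName || '' ).toLowerCase();
	var children = node.childNodes || [];
	var out = '';
	var i;

	if ( 'script' === name || 'style' === name ) {
		return '';
	}

	// 3 = TEXT_NODE, 4 = CDATA_SECTION_NODE
	if ( 3 === node.nodeType || 4 === node.nodeType ) {
		return node.nodeValue;
	}

	// Elements recurse into their children; comments have none and yield ''.
	for ( i = 0; i < children.length; i++ ) {
		out += collect_text( children[i] );
	}
	return out;
}

// Hypothetical DOMParser-based variant (browser only): parse the input
// into a separate, inert document and read the text from its body.
function strip_tags_with_domparser( string ) {
	var doc = new DOMParser().parseFromString( string, 'text/html' );
	return collect_text( doc.body );
}
```

In a browser, strip_tags_with_domparser( '<p>Hi <b>there</b></p>' ) should return 'Hi there'.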
