mdawaffe/Readme.md

## Readme.md

      
Strip Tags

Strip HTML tags from a string. Does not use (much) regex.
Never loads any resources (<img>, <script>, etc.) referenced in the input.
Whitespace handling is probably consistent across browsers, but this is not guaranteed.
Technique

Rather than using fragile regexes, the DOM is used, and the resulting text nodes are pulled out.
This is safe when the markup is added to the DOM through node.innerHTML, which does not run <script> elements.
node.innerHTML does load images and other resources, though.  To prevent that, the input is first munged to replace src and href with srco and hrefo.
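
The munging step can be sketched on its own as a pair of string transforms. This is a minimal illustration rather than the gist's exact code: the helper names `mung` and `demung` are hypothetical, and the case-insensitive `i` flag is an assumption added here so that upper-case attributes such as `SRC` are covered too.

```javascript
// Hypothetical helpers illustrating the src/href munging round trip.

// Append "o" to every "src"/"href" so the parser sees unknown attributes
// (srco, hrefo) and does not fetch the referenced resource.
function mung( html ) {
	return html.replace( /(src|href)/gi, '$1o' );
}

// Undo the munging on extracted text, so literal words such as "href"
// in the visible text come back unchanged.
function demung( text ) {
	return text.replace( /(src|href)o/gi, '$1' );
}

var munged = mung( '<img src="a.png"> see the href' );
// munged is '<img srco="a.png"> see the hrefo'
var restored = demung( 'see the hrefo' );
// restored is 'see the href'
```

Because neither "src" nor "href" contains the letter "o", inserting it cannot create a new match, so the round trip restores the original text.
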
To exclude the contents of <script> and <style>, the DOM tree is walked recursively, skipping those elements and collecting the text nodes.
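
The recursive walk can be illustrated without a browser by using plain objects in place of DOM nodes. This is a sketch under that assumption: `collectText` and the object shapes below are hypothetical stand-ins that only mimic the `tagName`, `nodeType`, `nodeValue`, and `childNodes` properties of real nodes.

```javascript
// Hypothetical stand-in for the DOM walk: nodes are plain objects that
// mimic the tagName, nodeType, nodeValue and childNodes properties.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

function collectText( node ) {
	var tag = ( node.tagName || '' ).toLowerCase();
	var out = '';
	var i;

	// Skip <script> and <style> entirely, including their text children.
	if ( 'script' === tag || 'style' === tag ) {
		return '';
	}

	if ( node.nodeType === TEXT_NODE ) {
		return node.nodeValue;
	}

	if ( node.nodeType === ELEMENT_NODE ) {
		for ( i = 0; i < node.childNodes.length; i++ ) {
			out += collectText( node.childNodes[i] );
		}
	}

	return out;
}

// Mock tree for: <div>Hello <b>world</b><script>alert(1)</script></div>
var tree = {
	tagName: 'DIV', nodeType: ELEMENT_NODE, childNodes: [
		{ nodeType: TEXT_NODE, nodeValue: 'Hello ' },
		{ tagName: 'B', nodeType: ELEMENT_NODE, childNodes: [
			{ nodeType: TEXT_NODE, nodeValue: 'world' }
		] },
		{ tagName: 'SCRIPT', nodeType: ELEMENT_NODE, childNodes: [
			{ nodeType: TEXT_NODE, nodeValue: 'alert(1)' }
		] }
	]
};

var text = collectText( tree );
// text is 'Hello world'
```

Any node type other than element or text, such as a comment, contributes nothing to the result in this sketch.
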
Todo


Look into DOM Mutation events.
Any other elements/attributes to worry about?

Compatibility

Safari, Chrome, Firefox, IE6-11.
Other Techniques


Regex: ugly
<noscript> tag's .innerHTML: inconsistent across browsers.  Firefox treats .innerHTML as .innerText.
<div> tag's .innerHTML inside an <iframe security="restricted" sandbox="allow-same-origin">: clever, but .innerHTML already doesn't run scripts.
<noscript> tag's .innerHTML inside an <iframe security="restricted" sandbox="allow-same-origin">: inconsistent across browsers.  Firefox still treats .innerHTML as .innerText.


## strip-tags.js
function strip_tags( string ) {
	var div = document.createElement( 'div' );
	// mung src and href attributes (in any letter case) to stop <img> and <link> elements from loading anything
	div.innerHTML = string.replace( /(src|href)/gi, '$1o' );
	return get_text_from_node( div, true );
}

function get_text_from_node( node, demung ) {
	var tag = ( node.tagName || '' ).toLowerCase();
	var out;
	var i;

	// ignore the contents of <script> and <style> entirely
	if ( 'script' === tag || 'style' === tag ) {
		return '';
	}

	switch ( node.nodeType ) {
	case node.ELEMENT_NODE || 1 :
	case node.DOCUMENT_NODE || 9 :
	case node.DOCUMENT_FRAGMENT_NODE || 11 :
		out = '';
		for ( i = 0; i < node.childNodes.length; i++ ) {
			// pass demung down so that text nodes below the root are demunged too
			out += get_text_from_node( node.childNodes[i], demung );
		}
		return out;
	case node.TEXT_NODE || 3 :
	case node.CDATA_SECTION_NODE || 4 :
		if ( demung ) {
			// demung here instead of above in strip_tags() to preserve text from inputs like <span>sr</span>co.
			return node.nodeValue.replace( /(src|href)o/gi, '$1' );
		}

		return node.nodeValue;
	}

	// comments and any other node types contribute no text
	return '';
}