Strip HTML tags from a string. Does not use (much) regex.
Never loads any resources (`<img>`, `<script>`, etc.) referenced in the input.
The treatment of whitespace is probably consistent across browsers but is not guaranteed.
Rather than relying on fragile regexes, the input is parsed into the DOM and the resulting text nodes are pulled out.
This can be made safe by adding to the DOM with `node.innerHTML`, which does not run scripts.
`node.innerHTML` does load images and other resources, though. To prevent that, the input is first munged to replace `src` and `href` with `srco` and `hrefo`.
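
A minimal sketch of that munging step, for illustration only. The function name `mungeResourceAttributes` and the exact regex are assumptions, not necessarily what this library does; the idea is simply to rename the attributes before the string reaches `innerHTML`.

```javascript
// Hypothetical helper (name and regex are assumptions, not the library's code).
// Renames src= and href= to srco= and hrefo= so that assigning the result to
// innerHTML does not make the browser fetch the referenced resources.
function mungeResourceAttributes(html) {
  // \b      : only match at the start of a word
  // \s*=    : require the "=" so words like "srcset" or "source" are left alone
  // "gi"    : every occurrence, any letter case
  return html.replace(/\b(src|href)(\s*=)/gi, "$1o$2");
}

console.log(mungeResourceAttributes('<img src="a.png">'));
// prints: <img srco="a.png">
```

Note that this sketch is deliberately crude: it also rewrites `src=` or `href=` appearing in ordinary text, and it leaves other resource-loading attributes such as `srcset` untouched.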
To exclude the contents of `<script>` and `<style>`, the DOM is walked recursively, skipping those tags and collecting the text nodes.
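
The recursive walk could look roughly like the sketch below. The names `collectText` and `SKIPPED_TAGS` are hypothetical, and it uses modern syntax that would not run in old IE; it only relies on the standard `nodeType`, `nodeName`, `nodeValue`, and `childNodes` properties.

```javascript
// Tags whose text content should be dropped entirely.
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE"]);

// Standard DOM nodeType values, spelled out so the sketch does not need a global Node.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: depth-first walk that gathers text node values in
// document order and skips the subtrees of <script> and <style> elements.
function collectText(node, out = []) {
  if (node.nodeType === TEXT_NODE) {
    out.push(node.nodeValue);
    return out;
  }
  if (node.nodeType === ELEMENT_NODE && SKIPPED_TAGS.has(node.nodeName.toUpperCase())) {
    return out; // ignore this element and everything inside it
  }
  // Other node types (elements, fragments, comments) just recurse into children.
  for (const child of Array.from(node.childNodes || [])) {
    collectText(child, out);
  }
  return out;
}
```

In a browser, one would presumably create a detached `div`, assign the munged input to its `innerHTML`, and call `collectText(div).join("")` to get the stripped string.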
- Look into DOM Mutation events.
- Any other elements/attributes to worry about?
Safari, Chrome, Firefox, IE6-11.
- Regex: ugly
- `<noscript>` tag's `.innerHTML`: inconsistent across browsers. Firefox treats `.innerHTML` as `.innerText`.
- `<div>` tag's `.innerHTML` inside an `<iframe security="restricted" sandbox="allow-same-origin">`: clever, but `.innerHTML` already doesn't run scripts.
- `<noscript>` tag's `.innerHTML` inside an `<iframe security="restricted" sandbox="allow-same-origin">`: inconsistent across browsers. Firefox still treats `.innerHTML` as `.innerText`.
This is old and ugly :) Look at `DOMParser` instead.
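
For comparison, a `DOMParser`-based version might look like the following sketch. It is not part of this library; the function name `stripHTMLWithParser` is made up, and it assumes a modern browser where `DOMParser`, `NodeList.prototype.forEach`, and `Element.prototype.remove` are available. The parsed document is separate from the page, so scripts in the input should not run.

```javascript
// Hypothetical alternative (not this library's code): let DOMParser build a
// separate document, drop <script>/<style>, and read the remaining text.
function stripHTMLWithParser(html) {
  // Parsing as "text/html" returns a new Document that is not attached to the page.
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Remove elements whose contents should not show up in the output.
  doc.querySelectorAll("script, style").forEach((el) => el.remove());
  return doc.body.textContent;
}
```

Whitespace handling here is whatever `textContent` yields, so it may differ from the text-node walk described above.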