Overview of the Benchmark
The following benchmark output was generated by the code at http://github.com/flavorjones/loofah/tree/master/benchmark
These results show the performance of Loofah scrubbing methods against comparable methods from other common open-source libraries:
- ActionView sanitize() and strip_tags()
- Sanitize sanitize()
- HTML5lib sanitize()
- HtmlFilter filter()
HTML of various sizes is tested:
- a large document (~98 KB)
- a sizable fragment (~3 KB)
- a small snippet (58 bytes)
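A comparison across these three input sizes could be set up roughly as follows, using Ruby's standard Benchmark module. This is a sketch, not the code from the Loofah repository: NAIVE_STRIP, build_input, and time_candidates are hypothetical names, the inputs are synthetic, and the regex stripper is only a placeholder for whichever library call is being measured.

```ruby
require "benchmark"

# Hypothetical stand-in for a real sanitizer; swap in the library call under test.
NAIVE_STRIP = ->(html) { html.gsub(/<[^>]*>/, "") }

# Build a synthetic input of at least target_bytes by repeating a small snippet.
def build_input(target_bytes, snippet = "<p>Hello <b>world</b></p>\n")
  repeats = (target_bytes.to_f / snippet.bytesize).ceil
  snippet * [repeats, 1].max
end

# Time `iterations` calls of each candidate on each input.
# Returns { input_label => { candidate_label => elapsed_seconds } }.
def time_candidates(inputs, candidates, iterations: 100)
  inputs.to_h do |input_label, html|
    timings = candidates.to_h do |name, fn|
      [name, Benchmark.realtime { iterations.times { fn.call(html) } }]
    end
    [input_label, timings]
  end
end

# Sizes mirror the list above; the content itself is made up.
INPUTS = {
  "large document" => build_input(98 * 1024),
  "fragment" => build_input(3 * 1024),
  "snippet" => build_input(58),
}

results = time_candidates(INPUTS, { "naive regex" => NAIVE_STRIP }, iterations: 20)
results.each do |label, timings|
  timings.each { |name, secs| puts format("%-16s %-12s %.4fs", label, name, secs) }
end
```

To compare real libraries, add one lambda per library to the candidates hash and run each against the same inputs.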
Head to Head against ActionView sanitize()
Loofah wins by about 20% on large documents and fragments, but loses on small snippets.
Loofah is comparatively slow on small snippets because Nokogiri uses libxml2, which incurs a constant "startup overhead" before parsing HTML of any size. ActionPack's regular expressions have no such startup overhead.
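The effect of a constant startup cost can be pictured with a simple linear cost model: time = fixed overhead + per-byte cost × size. The sketch below is illustrative only. CostModel and crossover_bytes are hypothetical names, and the nanosecond figures are invented to show the shape of the trade-off; they are not measurements of libxml2 or ActionPack.

```ruby
# Linear cost model: time(bytes) = fixed_overhead + per_byte * bytes.
CostModel = Struct.new(:fixed_overhead, :per_byte) do
  def time_for(bytes)
    fixed_overhead + per_byte * bytes
  end
end

# Input size (in bytes) above which `a` becomes faster than `b`,
# or nil if `a` never has the lower per-byte cost.
def crossover_bytes(a, b)
  rate_gap = b.per_byte - a.per_byte
  return nil unless rate_gap.positive?
  overhead_gap = a.fixed_overhead - b.fixed_overhead
  [overhead_gap.fdiv(rate_gap), 0.0].max
end

# Invented numbers, in nanoseconds (NOT benchmark results).
parser_based = CostModel.new(2_000, 10) # pays a fixed startup cost, cheaper per byte
regex_based  = CostModel.new(0, 12)     # no startup cost, dearer per byte

puts crossover_bytes(parser_based, regex_based) # size where the parser starts to win
puts parser_based.time_for(58)                  # small snippet: overhead dominates
puts regex_based.time_for(58)
```

Under this model the parser-based approach loses below the crossover size and wins above it, which matches the pattern of losing on snippets and winning on fragments and large documents.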
The win for ActionView on small snippets comes at a cost, though. From the ActionView comments:
Please note that sanitizing user-provided text [with ActionView]
does not guarantee that the resulting markup is valid (conforming
to a document type) or even well-formed. The output may still
contain e.g. unescaped '<', '>', '&' characters and confuse
browsers.
Loofah will always generate well-formed and valid HTML with proper encoding and escaping. Something to keep in mind when choosing a sanitizing library. Just sayin'.
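To make the caveat in the quoted comment concrete, here is a small sketch of how removing tags with a regular expression can leave unescaped characters behind. The naive_strip_tags helper is a hypothetical stand-in written for this illustration; it is not ActionView's actual implementation, and the final escaping step only shows what escaped text looks like, not how Loofah works internally.

```ruby
require "cgi"

# Naive regex-based tag stripper (hypothetical; for illustration only).
def naive_strip_tags(html)
  html.gsub(/<[^>]*>/, "")
end

input    = "5 > 3 & <b>ok</b>"
stripped = naive_strip_tags(input)

puts stripped                 # tags are gone, but '>' and '&' are still raw
puts CGI.escapeHTML(stripped) # the same text with HTML special characters escaped
```

The tags are removed, yet the remaining text still contains a bare '>' and '&'. Dropping that text into a page as-is yields exactly the kind of markup the comment warns about.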
Head to Head against ActionView strip_tags()
Loofah wins by between 60% and 100% on large documents and fragments, but loses again on small snippets.
See the previous section for explanation and commentary.
Head to Head against Sanitize sanitize()
Loofah wins on HTML of all sizes, between 13% and 280%.
Head to Head against HTML5lib sanitize()
Loofah wins on HTML of all sizes, between 300% and 1450%.
Yes. Not a typo. REXML is that slow.
Head to Head against HtmlFilter filter()
Loofah wins by a factor of two on large documents and fragments, but loses on small snippets.
HtmlFilter also uses regular expressions and hence cannot guarantee that the output markup is well-formed or valid.
Here's a more up-to-date comparison between Loofah, Sanitize, and HtmlFilter: https://github.com/rgrove/sanitize/blob/master/COMPARISON.md#performance-comparison