@ender672
Created December 8, 2011 23:42
require 'nokogiri'
require 'stringio' # StringIO is used in the short reproduction further down

html = '<html><body><br/></body></html>'
# Nokogiri's new HTML encoding detection uses a custom SAX document handler to
# "peek" at an IO before parsing it.
#
# It interrupts the SAX parser by throwing from the context of a SAX document
# handler callback:
# https://github.com/tenderlove/nokogiri/blob/master/lib/nokogiri/html/document.rb#L144
#
# This leaks memory because libxml2 does not expect its callbacks to longjmp
# out of the parser, so the cleanup it would normally run is skipped. Nokogiri
# leaks a little memory every time we open an HTML document from an IO.
loop do
  doc = Nokogiri::HTML::Document::EncodingReader::SAXHandler.new(:foo)
  prs = Nokogiri::HTML::SAX::Parser.new(doc)
  ctx = Nokogiri::HTML::SAX::ParserContext.memory(html, 'UTF-8')
  catch(:foo) do
    ctx.parse_with(prs)
  end
end
# The above shows what is going on behind the scenes. Here is a much easier way
# to trigger the same leak. (The loop above never exits, so comment it out to
# try this one.)
loop { Nokogiri::HTML(StringIO.new(html)) }
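To put a rough number on the leak, one option is to compare the process's resident set size before and after a fixed number of iterations. The sketch below is an assumption, not something from Nokogiri: `rss_kb` and `rss_growth` are hypothetical helper names, and the `ps -o rss=` call assumes a Linux or macOS style `ps`. RSS is a noisy signal, so treat the result as an indication only.

```ruby
# Hypothetical helpers (not part of Nokogiri) for watching memory growth.

# Resident set size of the given process in kilobytes.
# Assumes a `ps` that supports `-o rss=` (Linux, macOS).
def rss_kb(pid = Process.pid)
  `ps -o rss= -p #{pid}`.to_i
end

# Runs the block `iterations` times and returns how much the sampled value
# changed. The sampler is injectable so the helper can be exercised without
# shelling out to `ps`.
def rss_growth(iterations, sampler: method(:rss_kb))
  GC.start # reduce noise from garbage that Ruby could still collect
  before = sampler.call
  iterations.times { yield }
  GC.start
  sampler.call - before
end

# Usage (requires nokogiri; the iteration count is arbitrary):
#   require 'nokogiri'
#   require 'stringio'
#   html = '<html><body><br/></body></html>'
#   puts rss_growth(100_000) { Nokogiri::HTML(StringIO.new(html)) }
```

A bounded run like this also exits on its own, unlike the two `loop` reproductions.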
# The proper fix for this issue is intrusive, and I am unsure whether we want to
# incorporate it into a stable release. It involves wrapping every rb_funcall in
# xml_sax_parser.c so that it:
#   * intercepts exceptions and throws,
#   * stashes the exception in a new C struct associated with the handler or
#     the parser,
#   * tells libxml2 to stop parsing, and
#   * re-throws the exception after libxml2 finishes its cleanup.
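The four steps above can be modelled in plain Ruby. This is only a sketch of the pattern, not the actual C change: `FakeParser` and `parse_safely` are hypothetical names, and `FakeParser` merely stands in for libxml2. A plain Ruby `rescue` also cannot intercept a `throw` that has a matching `catch`, so the sketch interrupts parsing with an exception. In C the wrapper would presumably rely on something like `rb_protect` and `rb_jump_tag`, which handle both cases.

```ruby
# Stand-in for the libxml2 parser: it must run cleanup once parsing ends and
# does not expect a callback to unwind through it.
class FakeParser
  attr_reader :released

  def initialize(events)
    @events = events
    @released = false
    @stopped = false
  end

  # Ask the parser to stop before delivering the next event.
  def stop
    @stopped = true
  end

  def parse
    @events.each do |event|
      break if @stopped
      yield event
    end
    @released = true # cleanup that a non-local exit from the block would skip
  end
end

# Wraps each callback: intercept the exception, stash it, tell the parser to
# stop, and re-raise only after the parser has finished its cleanup.
def parse_safely(parser)
  stashed = nil
  parser.parse do |event|
    begin
      yield event
    rescue Exception => e # broad on purpose: nothing may escape the callback
      stashed = e
      parser.stop
    end
  end
  raise stashed if stashed
end
```

With this wrapper, a handler that raises on the second event still sees the exception propagate to the caller, but `parser.released` is `true` by then. Calling `parser.parse` directly with the same raising block would leave it `false`.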