@valo
Created August 27, 2009 14:43
require 'rubygems'
require 'logger'
require 'mechanize'
require 'tidied_html_page.rb'

# Build an agent that routes every HTML and XHTML response through TidiedHTMLPage.
a = WWW::Mechanize.new do |agent|
  agent.user_agent_alias = 'Mac Safari'
  agent.log = Logger.new(File.open('parser.log', 'w+'))
  agent.pluggable_parser.html = TidiedHTMLPage
  agent.pluggable_parser.xhtml = TidiedHTMLPage
end

# tidied_html_page.rb
require 'rubygems'
require 'mechanize'
require 'tidy'

# An HTML parser that extends the regular Mechanize page parser with the Tidy
# library, repairing invalid HTML so that the Nokogiri parser can handle it.
class TidiedHTMLPage < WWW::Mechanize::Page
  def initialize(uri = nil, response = nil, body = nil, code = nil)
    super(uri, response, body, code)
    # Clean the raw body once, before any parser is built from it.
    # The tidy gem expects Tidy.path to point at the libtidy shared library
    # before Tidy.open is called.
    Tidy.open do |tidy|
      tidy.options.force_output = true
      tidy.options.char_encoding = "utf8"
      tidy.options.indent = true
      tidy.options.xhtml_output = true
      tidy.options.wrap = 0
      @body = tidy.clean(@body)
    end
  end

  # Lazily parse the tidied body and cache the result.
  def parser
    return @parser if @parser
    if body && response
      # Fall back to an empty document so the parser never sees an empty string.
      html_body = body.length > 0 ? body : '<html></html>'
      @parser = WWW::Mechanize.html_parser.parse(html_body)
    end
    @parser
  end
end
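
The shape of TidiedHTMLPage (clean the body once in the constructor, then build the parser lazily, cache it, and fall back to an empty document) can be tried without Mechanize or Tidy installed. The sketch below is only an illustration under stated assumptions: `FakePage`, `CleanedPage`, `clean_markup`, and `parse_count` are hypothetical stand-ins that do not exist in either library, the "cleaning" step merely strips whitespace instead of calling Tidy, and the "parser" is a plain hash.

```ruby
# Hypothetical stand-in for WWW::Mechanize::Page; not the real class.
class FakePage
  attr_reader :body, :response

  def initialize(uri = nil, response = nil, body = nil, code = nil)
    @uri = uri
    @response = response
    @body = body
    @code = code
  end
end

# Mirrors the structure of TidiedHTMLPage: clean in the constructor, parse lazily.
class CleanedPage < FakePage
  # Counts how often the (fake) parse step actually runs.
  attr_reader :parse_count

  def initialize(uri = nil, response = nil, body = nil, code = nil)
    super
    @parse_count = 0
    # Stand-in for tidy.clean(@body); it only trims whitespace.
    @body = clean_markup(@body)
  end

  def parser
    return @parser if @parser
    if body && response
      # Same fallback as the original: never hand the parser an empty string.
      html_body = body.empty? ? '<html></html>' : body
      @parse_count += 1
      # Stand-in for the real HTML parser: just records the source string.
      @parser = { :source => html_body }
    end
    @parser
  end

  private

  def clean_markup(markup)
    markup.to_s.strip
  end
end

# A body that is empty after cleaning triggers the fallback document.
page = CleanedPage.new('http://example.com/', {}, "  \n ")
empty_doc = page.parser
page.parser # the second call returns the cached result without re-parsing
```

Counting parses in `parse_count` is just a way to observe the caching; the original class relies on the `@parser` instance variable alone.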