@labocho
Created February 28, 2012 13:14
Cleaning HTML by using Chrome via Selenium
#!/usr/bin/env ruby
require "selenium-webdriver"
require "open-uri"
require "uri"
require "tempfile"

url = nil
if STDIN.tty?
  # Interactive use: expect a URL or a local file path as the first argument.
  url = ARGV.shift
  unless url
    usage = <<-EOS
      Usage:
        ./clean_html http://example.com/
        ./clean_html index.html
        cat index.html | ./clean_html
    EOS
    STDERR.puts usage.gsub(/^ +/, "")
    exit 1
  end
  # File.exists? was removed in Ruby 3.2; File.exist? works on all versions.
  if File.exist?(url)
    url = "file://" + File.expand_path(url)
  end
else
  # HTML piped on STDIN: save it to a temp file so Chrome can load it.
  file = Tempfile.new("clean_html")
  file.write STDIN.read
  file.close
  url = "file://#{file.path}"
end
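HTML_REBUILDER_URL = "https://raw.github.com/labocho/html_rebuilder/master/js/html_rebuilder.min.js"
# Kernel#open no longer opens URLs as of Ruby 3.0; open-uri provides URI.open.
html_rebuilder_src = URI.open(HTML_REBUILDER_URL) { |f| f.read }

driver = Selenium::WebDriver.for :chrome
begin
  driver.navigate.to url
  # Inject the html_rebuilder library into the page, then print the rebuilt HTML.
  driver.execute_script(html_rebuilder_src)
  puts driver.execute_script("return (new HtmlRebuilder(document)).html();")
ensure
  # quit ends the whole browser session; close only closes the current window.
  driver.quit
end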
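
The input handling above (URL argument, local file path, or HTML piped on STDIN) can be tried without launching a browser if it is pulled into a small function. The following is only a sketch under assumptions: `resolve_url` is a hypothetical helper name that does not appear in the original script, and the `.html` suffix on the temp file is an addition made here on the assumption that it helps the browser treat the file as HTML.

```ruby
require "tempfile"

# Hypothetical helper (not part of the original script).
# Returns [url, tempfile]. The Tempfile is returned so the caller can keep a
# reference to it; Ruby deletes the file once the object is garbage-collected.
def resolve_url(arg, stdin_html = nil)
  if stdin_html
    # Piped HTML: write it to a temp file and point the browser at that file.
    file = Tempfile.new(["clean_html", ".html"]) # ".html" suffix is an assumption
    file.write(stdin_html)
    file.close
    return ["file://#{file.path}", file]
  end

  # An existing local path becomes a file:// URL; anything else is passed through.
  return ["file://#{File.expand_path(arg)}", nil] if arg && File.exist?(arg)

  [arg, nil]
end
```

With this shape, the script body would reduce to something like `url, tmp = resolve_url(ARGV.shift, STDIN.tty? ? nil : STDIN.read)`, followed by the usage message when `url` is `nil`.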