Created February 28, 2012 13:14
Cleaning HTML by using Chrome via Selenium
#!/usr/bin/env ruby
# Print a cleaned-up copy of an HTML document, rebuilt from the DOM that Chrome parses.
require "selenium-webdriver"
require "open-uri"
require "uri"
require "tempfile"

url = nil
if STDIN.tty?
  # Interactive use: take a URL or a local file path from the command line.
  url = ARGV.shift
  unless url
    usage = <<-EOS
      Usage:
        ./clean_html http://example.com/
        ./clean_html index.html
        cat index.html | ./clean_html
    EOS
    STDERR.puts usage.gsub(/^ +/, "")
    exit 1
  end
  # File.exist? replaces File.exists?, which was removed in Ruby 3.2.
  if File.exist?(url)
    url = "file://" + File.expand_path(url)
  end
else
  # Piped use: save STDIN to a temporary file so Chrome can open it.
  file = Tempfile.new("clean_html")
  file.write STDIN.read
  file.close
  url = "file://#{file.path}"
end

HTML_REBUILDER_URL = "https://raw.github.com/labocho/html_rebuilder/master/js/html_rebuilder.min.js"
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
html_rebuilder_src = URI.open(HTML_REBUILDER_URL) {|f| f.read }

driver = Selenium::WebDriver.for :chrome
begin
  driver.navigate.to url
  # Inject the rebuilder library, then serialize the parsed document.
  driver.execute_script(html_rebuilder_src)
  puts driver.execute_script("return (new HtmlRebuilder(document)).html();")
ensure
  # quit ends the whole browser session; close only closes the current window.
  driver.quit
end
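
The input handling in the script (a URL, a local path, or piped HTML) could be pulled out into a small helper so it can be checked without launching Chrome. This is only a minimal sketch: the method name `resolve_target` and its return shape are assumptions for illustration and are not part of the original script.

```ruby
require "tempfile"

# Hypothetical helper (not in the original script): turn a command-line
# argument or piped HTML into a URL that the browser can open.
# Returns [url, tempfile_or_nil]. The caller should keep a reference to the
# Tempfile, because Ruby deletes the file once the object is garbage-collected.
def resolve_target(arg, piped_html = nil)
  if piped_html
    # Piped HTML: write it to disk so it can be loaded through file://.
    file = Tempfile.new("clean_html")
    file.write(piped_html)
    file.close
    return ["file://#{file.path}", file]
  end

  # An existing local path becomes a file:// URL; anything else is passed through.
  url = File.exist?(arg) ? "file://#{File.expand_path(arg)}" : arg
  [url, nil]
end
```

With such a helper, the top of the script would reduce to something like `url, tmp = STDIN.tty? ? resolve_target(ARGV.shift) : resolve_target(nil, STDIN.read)`, after the usage check for a missing argument.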