@kryzhovnik
Created February 2, 2011 01:42
Requires the loofah gem to be installed (gem install loofah)
desc <<END
Find all HTML files under the specified directory and clean them: remove comments, surrounding whitespace, and carriage returns.
Files are overwritten in place.
Before using, install the loofah gem:
gem install loofah
Usage:
rake clean_html DIR=my_dir
Warning: by default, DIR points to the current directory - #{Dir.pwd}
END
task 'clean_html' do
  require 'loofah'
  require 'active_support/core_ext/string' # provides String#blank?

  # Scrubber that drops comments and trims text nodes, leaving <pre> untouched.
  clean = Loofah::Scrubber.new do |node|
    if node.type == Nokogiri::XML::Node::COMMENT_NODE
      node.remove
    elsif node.name == 'pre'
      Loofah::Scrubber::STOP # don't bother with the rest of the subtree
    elsif node.type == Nokogiri::XML::Node::TEXT_NODE
      if node.content.blank?
        node.remove
      else
        node.content = node.content.strip
      end
    end
  end

  ENV['DIR'] ||= Dir.pwd
  files = Dir["#{ENV['DIR']}/**/*.html"]

  files.each do |file_path|
    puts "parse file: #{file_path}"
    html = File.read(file_path)
    clear_tree = Loofah.document(html).scrub!(clean)
    clear_html = clear_tree.to_html(:indent_text => '', :indent => 0, :save_with => 0)
    # File.write closes the handle, so the output is flushed to disk.
    File.write(file_path, clear_html)
  end
end
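
Because the task overwrites every matched file in place, it can help to preview what the glob picks up before running it. The following is a minimal sketch using only the standard library; `html_files_under` is a hypothetical helper, not part of the gist, and the directory layout is made up for illustration. It uses the same `**/*.html` pattern as the task, which matches files in the given directory itself as well as in nested subdirectories.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not in the gist): list the files the task would rewrite,
# using the same recursive pattern as the task. Sorted for stable output.
def html_files_under(dir)
  Dir[File.join(dir, '**', '*.html')].sort
end

# Build a throwaway directory tree so the example touches no real files.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'docs', 'api'))
  File.write(File.join(root, 'index.html'), '<p>top</p>')
  File.write(File.join(root, 'docs', 'api', 'page.html'), '<p>nested</p>')
  File.write(File.join(root, 'docs', 'notes.txt'), 'not HTML, so not matched')

  # Print paths relative to the root directory.
  html_files_under(root).each do |path|
    puts path.sub("#{root}/", '')
  end
end
```

Pointing the helper at the directory you intend to pass as DIR gives a dry run of the file list without modifying anything.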