@mperham
Created June 2, 2010 19:50
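require 'sanitize'

# Elements that should be replaced by surrounding whitespace instead of being dropped outright.
STRIPPERS = %w(br p).freeze

# Builds a Sanitize transformer that swaps <br> and <p> elements for a text node
# padded with spaces, so adjacent words do not run together after stripping.
def cleaner
  lambda do |env|
    # Leave every other element to Sanitize's default handling.
    return nil unless STRIPPERS.include?(env[:node_name])

    n = env[:node]
    txt = Nokogiri::XML::Text.new(' ', n.document)
    # Move the element's children under the replacement node.
    # NOTE: newer Nokogiri versions may reject adding children to a Text node.
    n.children.each do |c|
      txt.add_child(c)
    end
    txt.add_child(Nokogiri::XML::Text.new(' ', n.document))
    { :node => txt }
  end
end

f = '<img src="http://jobs.43folders.com/files/company_logos/823290/dd98f0e1.png" align="right" />AT&#38;T/Milpitas, CA<br/><br/>Sr Technical Architect<br />'
f.each_line do |line|
  result = Sanitize.clean(line, :remove_contents => %w(script style), :transformers => cleaner)
  # The \p{Space} character class is needed because \s is no longer Unicode-aware as of Ruby 1.9.1.
  puts result.gsub(/\p{Space}+/u, ' ')
end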
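
The comment about `\s` can be checked in isolation. The following is a minimal sketch, not part of the original gist: `collapse_whitespace`, `NBSP`, and the sample string are hypothetical names introduced for illustration, and it assumes Ruby 1.9 or later with UTF-8 strings. It shows that `\s` leaves a non-breaking space (U+00A0) untouched, while `\p{Space}` collapses it along with ordinary whitespace.

```ruby
# encoding: utf-8

# Hypothetical constant for illustration: a non-breaking space (U+00A0),
# which commonly shows up in scraped HTML as &nbsp;.
NBSP = "\u00A0"

# Hypothetical helper: collapse any run of Unicode whitespace into one ASCII space.
def collapse_whitespace(text)
  text.gsub(/\p{Space}+/, ' ')
end

sample = "AT&T/Milpitas,#{NBSP}CA \n Sr Technical Architect"

# \s matches only ASCII whitespace, so the non-breaking space survives.
ascii_only = sample.gsub(/\s+/, ' ')
puts ascii_only.include?(NBSP)                   # true

# \p{Space} also matches U+00A0, so it is replaced.
puts collapse_whitespace(sample).include?(NBSP)  # false
```

Wrapping the substitution in a small helper keeps the Unicode-aware pattern in one place if more than one string needs the same cleanup.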