@zbage
Created September 30, 2009 15:20
require 'nokogiri'
require 'open-uri'
require 'rexml/document'

# Fetch a web page and reduce its body to a lowercase string of plain words.
def plain_text(url)
  doc = Nokogiri::HTML(URI.open(url))
  # serialize the <body> subtree to an XML string
  buffer = doc.xpath("//body").to_xml.to_s
  b1 = buffer.gsub(/\s+/, ' ') # collapse newlines and runs of whitespace
  # parse the string as a REXML document, dropping whitespace between tags
  doc4 = REXML::Document.new(b1.gsub(/>\s+</, '><'))
  # select all text nodes whose parent element is not <script>
  c = REXML::XPath.match(doc4, "//*[name() != 'script']/text()").join(' ')
  c2 = c.gsub(/&#\w+;/, '') # strip out numeric character references
  # strip out tokens with embedded symbols (URLs, e-mail addresses and the like)
  d = c2.gsub(/(\w+[\w&)\.\\][\/\.&#;@][\w;\.\=:]\w+)/, '')
  # strip out numbers and tokens that start with a digit
  d2 = d.gsub(/(\b\d[\d\w]*\b)/, '')
  # split hyphenated words (the lookarounds also handle chains such as a-b-c)
  d2b = d2.gsub(/(?<=\w)-(?=\w)/, ' ')
  # strip out isolated symbols and punctuation
  d3 = d2b.gsub(/([-\=:()#'\/\?\|"\\,\.;&\!\[\]])/, '')
  d4 = d3.gsub(/\w{15,}/, '') # strip out implausibly long words
  d4b = d4.gsub(/\b\w\b/, '') # strip out single characters
  d5 = d4b.gsub(/\s+/, ' ') # collapse extra spaces
  d5.downcase # return the result in lowercase
end
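
The regex cleanup steps in plain_text can be tried without fetching a page by applying them to an in-memory string. The following is a minimal sketch under stated assumptions: the helper name clean_words and the sample sentence are hypothetical and not part of the gist, the step that strips tokens with embedded symbols is left out to keep the result easy to follow, hyphens are split with lookarounds, and a final strip is added.

```ruby
# Hypothetical helper (not part of the original gist): runs a simplified
# version of the cleanup chain on a string that is already plain text.
def clean_words(text)
  s = text.gsub(/&#\w+;/, '')                  # drop numeric character references
  s = s.gsub(/\b\d[\d\w]*\b/, '')              # drop tokens that start with a digit
  s = s.gsub(/(?<=\w)-(?=\w)/, ' ')            # split hyphenated words
  s = s.gsub(/[-=:()#'\/?|"\\,.;&!\[\]]/, '')  # drop isolated symbols
  s = s.gsub(/\w{15,}/, '')                    # drop implausibly long tokens
  s = s.gsub(/\b\w\b/, '')                     # drop single characters
  s.gsub(/\s+/, ' ').strip.downcase            # collapse spaces, trim, lowercase
end

puts clean_words("Well-known FACTS: 42 cats, a dog!")
# prints "well known facts cats dog"
```

Keeping the string-level steps in a separate helper like this would also make it possible to check each regex on its own before running it against real pages.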