@maxjacobson
Last active December 19, 2015 07:19
hast thou seen the whale?
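require 'nokogiri'
require 'open-uri'

# Project Gutenberg's HTML edition of Moby-Dick
url = "http://www.gutenberg.org/files/2701/2701-h/2701-h.htm"
# Kernel#open no longer opens URLs as of Ruby 3.0; use open-uri's URI.open
doc = Nokogiri::HTML(URI.open(url))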
words = doc.text.downcase
  .gsub(/[^\w\s]/, '') # remove non-word, non-space characters
  .gsub(/\s+/, ' ') # collapse all whitespace, newlines included, into single spaces
  .split("chapter one", 2).last # remove preamble: keep what follows the first marker
  .split("end of project gutenberg").first # remove postamble: keep what precedes the marker
  # note: if a marker never appears in the text, split returns the whole text and nothing is trimmed
  .split(" ") # to array
# count occurrences of each word; missing keys default to 0
histogram = Hash.new(0)
words.each do |word|
  histogram[word] += 1
end
# the 100 most frequent words, most frequent first
top_100 = histogram.sort_by { |_word, count| -count }.first(100)
top_100.each do |word, count|
  puts "#{word} #{count}"
end
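
The "(does this work?)" question on the marker splits can be checked offline, without downloading the book. Below is a minimal sketch that runs the same cleaning and counting steps on a short made-up string. The method names `clean_words` and `top_words`, the keyword arguments, and the sample text are all hypothetical and not part of the original script. It assumes the intent is to keep the text after the first start marker and before the end marker.

```ruby
# Hypothetical helpers for trying the pipeline on a small string (no network needed).
def clean_words(text, start_marker:, end_marker:)
  text.downcase
      .gsub(/[^\w\s]/, '')          # strip punctuation
      .gsub(/\s+/, ' ')             # collapse whitespace, including newlines
      .split(start_marker, 2).last  # drop everything up to the first start marker
      .split(end_marker).first      # drop everything from the end marker onward
      .split(' ')                   # to array
end

def top_words(words, limit)
  counts = Hash.new(0)
  words.each { |word| counts[word] += 1 }
  # sort by descending count, breaking ties alphabetically so the order is deterministic
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

# Made-up sample standing in for the downloaded text.
sample = "Preface. CHAPTER ONE. The whale, the whale! The sea.\nEnd of Project Gutenberg's sample."
words = clean_words(sample, start_marker: "chapter one", end_marker: "end of project gutenberg")
top_words(words, 2).each { |word, count| puts "#{word} #{count}" }
```

Because punctuation is stripped before the splits, the markers must be written without punctuation and in lowercase. Whether either marker actually occurs in the Gutenberg file is worth confirming against the downloaded text.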