@doublemarket
Last active October 31, 2015 13:53
Counting words in a web page and sorting them in order of frequency
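require 'nokogiri'
require 'open-uri'

url = ARGV[0] or abort "usage: #{$0} URL"

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
doc = Nokogiri::HTML(URI.open(url))

# Split on runs of non-word characters, then drop empty strings, pure numbers
# and single-character tokens, and normalize case.
all_words = doc.inner_text.split(/\W+/)
               .reject { |w| /^(\d*|\w)$/ =~ w }
               .map { |w| w.downcase }

# Count occurrences of each word.
words = Hash.new(0)
all_words.each { |w| words[w] += 1 }

# Most frequent first; ties are listed alphabetically.
words = words.sort_by { |word, count| [-count, word] }

words.each do |word, count|
  puts "#{count} #{word}"
end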
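
To see what the splitting and filtering steps do without fetching a page, here is a minimal sketch that applies the same idea to a literal string. The helper name `word_frequencies` and the sample text are hypothetical and not part of the gist. The plain `/\W+/` delimiter and the alphabetical tie-break are assumptions made for a deterministic result, not necessarily the script's exact behavior.

```ruby
# Hypothetical helper (not in the original gist): counts words in a plain
# string instead of a fetched web page.
def word_frequencies(text)
  counts = Hash.new(0)
  text.split(/\W+/)                        # split on runs of non-word characters
      .reject { |w| w =~ /\A(\d*|\w)\z/ }  # drop empty, numeric and 1-char tokens
      .map(&:downcase)
      .each { |w| counts[w] += 1 }
  # Highest count first; equal counts fall back to alphabetical order (assumption).
  counts.sort_by { |word, count| [-count, word] }
end

sample = "The cat and the dog. The end, 42 a"
word_frequencies(sample).each { |word, count| puts "#{count} #{word}" }
```

Here "42" and "a" are filtered out, and the three spellings of "the" are merged by the downcase step.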