Created
March 2, 2015 02:05
top_twenty_frequent_words
# Download this file - The Adventures of Sherlock Holmes
# http://www.gutenberg.org/cache/epub/1661/pg1661.txt
#
# Write a program to print the 20 most frequent words in the document,
# in descending order, with counts. Output format looks like:
#
# 9213 the
# 3223 i
def top_twenty_frequent_words(text)
  # Count case-insensitive occurrences of each word. scan(/\w+/) avoids the
  # empty token that split(/\W+/) yields when the text starts with a
  # non-word character.
  words_count = Hash.new(0)
  text.downcase.scan(/\w+/) { |word| words_count[word] += 1 }

  # Sort by descending count and keep the first 20 entries.
  top_twenty = words_count.sort_by { |_word, count| -count }.first(20)
  top_twenty.each do |word, count|
    puts "#{count}\t#{word}"
  end
end

sherlock_holmes = File.read("./pg1661.txt")
top_twenty_frequent_words(sherlock_holmes)
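
The same counting task can probably be written more compactly on newer Rubies. The sketch below is a possible variant, not the gist author's method: it assumes Ruby 2.7 or later (for `Enumerable#tally`), and the helper name `top_words` and its `limit` parameter are hypothetical. It returns the pairs instead of printing them, so the result can be checked without the Gutenberg file.

```ruby
# Hypothetical helper (not part of the original gist): returns the `limit`
# most frequent words in `text` as [word, count] pairs, most frequent first.
# Assumes Ruby 2.7+ for Enumerable#tally.
def top_words(text, limit = 20)
  text.downcase
      .scan(/\w+/)                              # extract runs of word characters
      .tally                                    # build a { word => count } hash
      .max_by(limit) { |_word, count| count }   # top `limit` pairs, descending by count
end

# Usage on a small in-memory sample; pass File.read("./pg1661.txt") instead
# to run it against the downloaded book.
top_words("The cat and the hat and the bat", 2).each do |word, count|
  puts "#{count}\t#{word}"
end
```

Words with equal counts come back in no guaranteed order, which is also true of the sort-based version above.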