
A simple example of finding possible tags or topics for a text by counting word frequencies.

Gemfile
source "https://rubygems.org"
gem 'uea-stemmer'
gem 'stopwords'
simple_usage_frequency.rb
require "rubygems"
require "bundler/setup"
 
# UEA stemmer algorithm lib.
# A very conservative stemmer; its output often looks closer to lemmatization.
require "uea-stemmer"
 
# Lib for stopwords.
require "stopwords"
 
# Input string
string = <<END
MARLEY was dead: to begin with. There is no doubt
whatever about that. The register of his burial was
signed by the clergyman, the clerk, the undertaker,
and the chief mourner. Scrooge signed it: and
Scrooge's name was good upon 'Change, for anything he
chose to put his hand to. Old Marley was as dead as a
door-nail.
 
Mind! I don't mean to say that I know, of my
own knowledge, what there is particularly dead about
a door-nail. I might have been inclined, myself, to
regard a coffin-nail as the deadest piece of ironmongery
in the trade. But the wisdom of our ancestors
is in the simile; and my unhallowed hands
shall not disturb it, or the Country's done for. You
will therefore permit me to repeat, emphatically, that
Marley was as dead as a door-nail.
 
Scrooge knew he was dead? Of course he did.
How could it be otherwise? Scrooge and he were
partners for I don't know how many years. Scrooge
was his sole executor, his sole administrator, his sole
assign, his sole residuary legatee, his sole friend, and
sole mourner. And even Scrooge was not so dreadfully
cut up by the sad event, but that he was an excellent
man of business on the very day of the funeral,
and solemnised it with an undoubted bargain
END
 
# Tokenize string. Convert all words to lowercase. Remove punctuation.
tokens = string.split(/[\s,]+/).map{|x| x.downcase.gsub(/[(,?!\";:.)]/, '')}
 
# Initialize stemmer.
stemmer = UEAStemmer.new
 
# Stem each word.
tokens = tokens.map{|w| stemmer.stem(w)}
 
# Remove all stopwords.
tokens = tokens.reject{|w| Stopwords.is?(w)}
 
# Create hash with word count.
word_count = Hash[tokens.group_by{|w| w }.map{|w, words| [w, words.length] }]
 
 
# Print the words that occur at least 3 times, sorted by ascending count.
puts word_count.sort_by{|k, v| v}.find_all{|w| w[1] >= 3}

Example output (each word is followed by its count on the next line):

marley
3
door-nail
3
dead
5
scrooge
6
sole
6
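
The same count-and-filter idea can be sketched with only the Ruby standard library, which may be handy for trying it out without installing gems. This is a rough sketch, not the gist's method: the `frequent_terms` helper and the tiny `STOPWORDS` list are hypothetical stand-ins for the stopwords gem, no stemming is done, and `Enumerable#tally` assumes Ruby 2.7 or newer.

```ruby
require "set"

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# the script above relies on the stopwords gem instead.
STOPWORDS = Set.new(%w[a an and as the was to of it he his is])

# Returns [word, count] pairs for words seen at least min_count times,
# sorted by ascending count, then alphabetically. No stemming is performed.
def frequent_terms(text, min_count: 3, stopwords: STOPWORDS)
  # Lowercase, then keep runs of letters, allowing inner apostrophes and hyphens.
  tokens = text.downcase.scan(/[a-z]+(?:['-][a-z]+)*/)
  counts = tokens.reject { |w| stopwords.include?(w) }.tally
  counts.select { |_, n| n >= min_count }.sort_by { |w, n| [n, w] }
end

sample = "Marley was dead. Old Marley was as dead as a door-nail. " \
         "Marley was dead as a door-nail."
frequent_terms(sample).each { |word, n| puts "#{word}: #{n}" }
```

Sorting on `[n, w]` makes ties deterministic, since `sort_by` alone does not guarantee a stable order for equal counts.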
