@seabre
Created December 18, 2012 04:03
Simple example of finding possible tags/topics for a text.
source "http://rubygems.org"
gem 'uea-stemmer'
gem 'stopwords'
GEM
  remote: http://rubygems.org/
  specs:
    stopwords (0.2)
    uea-stemmer (0.10.1)

PLATFORMS
  ruby

DEPENDENCIES
  stopwords
  uea-stemmer
require "rubygems"
require "bundler/setup"
# UEA stemmer algorithm lib.
# A really conservative stemmer; its output seems closer to lemmatization.
require "uea-stemmer"
# Lib for stopwords.
require "stopwords"
# Input string
string = <<END
MARLEY was dead: to begin with. There is no doubt
whatever about that. The register of his burial was
signed by the clergyman, the clerk, the undertaker,
and the chief mourner. Scrooge signed it: and
Scrooge's name was good upon 'Change, for anything he
chose to put his hand to. Old Marley was as dead as a
door-nail.
Mind! I don't mean to say that I know, of my
own knowledge, what there is particularly dead about
a door-nail. I might have been inclined, myself, to
regard a coffin-nail as the deadest piece of ironmongery
in the trade. But the wisdom of our ancestors
is in the simile; and my unhallowed hands
shall not disturb it, or the Country's done for. You
will therefore permit me to repeat, emphatically, that
Marley was as dead as a door-nail.
Scrooge knew he was dead? Of course he did.
How could it be otherwise? Scrooge and he were
partners for I don't know how many years. Scrooge
was his sole executor, his sole administrator, his sole
assign, his sole residuary legatee, his sole friend, and
sole mourner. And even Scrooge was not so dreadfully
cut up by the sad event, but that he was an excellent
man of business on the very day of the funeral,
and solemnised it with an undoubted bargain
END
# Tokenize string. Convert all words to lowercase. Remove punctuation.
tokens = string.split(/[\s,]+/).map{|x| x.downcase.gsub(/[(,?!\";:.)]/, '')}
# Initialize stemmer.
stemmer = UEAStemmer.new
# Stem each word.
tokens = tokens.map{|w| stemmer.stem(w)}
# Remove all stopwords.
tokens = tokens.find_all{|w| !Stopwords.is?(w)}
# Create hash with word count.
word_count = Hash[tokens.group_by{|w| w }.map{|w, words| [w, words.length] }]
# Print only the words that occur at least 3 times, sorted by ascending count.
puts word_count.sort_by{|k, v| v}.find_all{|w| w[1] >= 3}
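For readers who want to try the counting and filtering steps without installing the two gems, the following is a minimal sketch using only the Ruby standard library. It is not the gist's method: it skips stemming entirely, `SAMPLE_STOPWORDS` is a hypothetical and deliberately tiny stopword list standing in for the stopwords gem, and `frequent_terms` is an illustrative name. Ties are broken alphabetically so the order is deterministic.

```ruby
require "set"

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch;
# the gist itself relies on the stopwords gem).
SAMPLE_STOPWORDS = Set.new(%w[a an and as the was to of he his it is in])

# Returns [word, count] pairs for words seen at least min_count times,
# sorted by ascending count, then alphabetically. No stemming is applied.
def frequent_terms(text, min_count: 3, stopwords: SAMPLE_STOPWORDS)
  # Lowercase, then pull out words, keeping inner hyphens and apostrophes
  # so that tokens such as "door-nail" stay intact.
  tokens = text.downcase.scan(/[a-z]+(?:[-'][a-z]+)*/)

  counts = Hash.new(0)
  tokens.each { |t| counts[t] += 1 unless stopwords.include?(t) }

  counts.select { |_, n| n >= min_count }.sort_by { |word, n| [n, word] }
end
```

Calling `frequent_terms(string)` on the heredoc above would give pairs in the same shape as the script's result, although the exact words and counts may differ because no stemmer is involved and the stopword list is much smaller.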

seabre commented Dec 18, 2012

Example output would be:

marley
3
door-nail
3
dead
5
scrooge
6
sole
6
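Each word and its count land on separate lines because `puts` flattens nested arrays and prints every element on its own line. If one line per word is preferred, a small formatting step can be sketched as below; `format_counts` is a hypothetical helper, and the pairs are copied from the output above.

```ruby
# Pairs in the same shape as the script's sorted, filtered result
# (values copied from the example output above).
pairs = [["marley", 3], ["door-nail", 3], ["dead", 5], ["scrooge", 6], ["sole", 6]]

# Hypothetical helper: turn each [word, count] pair into a single line.
def format_counts(pairs)
  pairs.map { |word, count| "#{word}: #{count}" }
end

puts format_counts(pairs)
```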
