Created December 18, 2012 04:03
seabre/4324904
A simple example that finds possible tags/topics for a text.
source "http://rubygems.org"
gem 'uea-stemmer'
gem 'stopwords'
GEM
  remote: http://rubygems.org/
  specs:
    stopwords (0.2)
    uea-stemmer (0.10.1)

PLATFORMS
  ruby

DEPENDENCIES
  stopwords
  uea-stemmer
require "rubygems"
require "bundler/setup"
# UEA stemmer algorithm lib.
# A very conservative stemmer; it seems to behave almost like a lemmatizer.
require "uea-stemmer"
# Lib for stopwords.
require "stopwords"
# Input string
string = <<END
MARLEY was dead: to begin with. There is no doubt
whatever about that. The register of his burial was
signed by the clergyman, the clerk, the undertaker,
and the chief mourner. Scrooge signed it: and
Scrooge's name was good upon 'Change, for anything he
chose to put his hand to. Old Marley was as dead as a
door-nail.
Mind! I don't mean to say that I know, of my
own knowledge, what there is particularly dead about
a door-nail. I might have been inclined, myself, to
regard a coffin-nail as the deadest piece of ironmongery
in the trade. But the wisdom of our ancestors
is in the simile; and my unhallowed hands
shall not disturb it, or the Country's done for. You
will therefore permit me to repeat, emphatically, that
Marley was as dead as a door-nail.
Scrooge knew he was dead? Of course he did.
How could it be otherwise? Scrooge and he were
partners for I don't know how many years. Scrooge
was his sole executor, his sole administrator, his sole
assign, his sole residuary legatee, his sole friend, and
sole mourner. And even Scrooge was not so dreadfully
cut up by the sad event, but that he was an excellent
man of business on the very day of the funeral,
and solemnised it with an undoubted bargain
END
# Tokenize string: split on whitespace and commas, lowercase each word, strip punctuation.
tokens = string.split(/[\s,]+/).map { |x| x.downcase.gsub(/[(,?!\";:.)]/, '') }
# Initialize stemmer.
stemmer = UEAStemmer.new
# Stem each word.
tokens = tokens.map { |w| stemmer.stem(w) }
# Remove all stopwords.
tokens = tokens.reject { |w| Stopwords.is?(w) }
# Create hash with word count.
word_count = Hash[tokens.group_by { |w| w }.map { |w, words| [w, words.length] }]
# Print the words that occur at least 3 times, sorted by ascending count.
# Note: puts flattens the nested pairs, so each word and its count land on separate lines.
puts word_count.sort_by { |k, v| v }.find_all { |w| w[1] >= 3 }
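
The counting and thresholding steps can also be tried without installing either gem. The following is only a rough sketch under stated assumptions: `top_terms` and `STOPWORDS_SAMPLE` are hypothetical names that are not part of the gist, the stopword list is a tiny hand-picked sample rather than the one shipped with the `stopwords` gem, and no stemming is done at all, so the results will differ from the script above.

```ruby
require "set"

# Hypothetical, hand-picked stopword sample (NOT the stopwords gem's list).
STOPWORDS_SAMPLE = Set.new(%w[a an and as the of to was is it he his in that i my])

# Hypothetical helper: returns [word, count] pairs for words that occur at
# least min_count times, most frequent first (ties broken alphabetically).
def top_terms(text, min_count = 3, stopwords = STOPWORDS_SAMPLE)
  # Lowercase, then treat runs of letters, apostrophes and hyphens as tokens.
  tokens = text.downcase.scan(/[a-z][a-z'-]*/)
  # Drop stopwords before counting.
  tokens = tokens.reject { |w| stopwords.include?(w) }
  # Count occurrences of each remaining token.
  counts = Hash.new(0)
  tokens.each { |w| counts[w] += 1 }
  # Keep frequent words and sort by descending count, then by word.
  counts.select { |_, n| n >= min_count }.sort_by { |w, n| [-n, w] }
end

sample = "Marley was dead. Old Marley was as dead as a door-nail. " \
         "Scrooge knew Marley was dead."
top_terms(sample).each { |word, count| puts "#{word}: #{count}" }
```

For the short sample string this prints `dead: 3` and `marley: 3`. Printing each pair explicitly keeps a word and its count on the same line, which is a formatting choice of this sketch rather than the behaviour of the script above.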
Example output would be: