@michaelminter
Last active December 22, 2015 06:18
Create Keywords from content
# gem install sanitize
require 'sanitize'
def generate_keywords(content)
  # strip HTML tags (newer Sanitize releases expose this as Sanitize.fragment)
  content = Sanitize.clean(content)
  # collect runs of 3 or more ASCII letters/digits, which drops short words
  words = content.scan(/[A-Za-z0-9]{3,}/)
  # count occurrences of each word, ignoring case
  words_by_count = Hash.new(0)
  words.each do |word|
    words_by_count[word.downcase] += 1
  end
  # remove common words
common_words = ["about", "accessibility", "add", "ads", "after", "again", "all", "along", "also", "although", "amazon", "and", "another", "any", "application", "are", "area", "around", "association", "aswell", "available", "back", "based", "basically", "because", "become", "becoming", "been", "before", "began", "begin", "begun", "being", "belong", "both", "broader", "business", "but", "came", "can", "com", "come", "coming", "company", "contact", "contents", "copyright", "copyrighted", "copyrights", "could", "day", "does", "down", "during", "each", "else", "elsewhere", "email", "enough", "etc", "even", "ever", "every", "everyday", "except", "far", "farther", "fascinating", "features", "field", "find", "first", "for", "form", "format", "freeimages", "from", "further", "get", "getting", "gone", "got", "had", "happen", "happened", "happening", "happens", "has", "have", "her", "here", "hers", "high", "him", "his", "home", "how", "however", "href", "http", "including", "information", "into", "its", "just", "last", "left", "let", "like", "likely", "likes", "long", "made", "mail", "make", "many", "mass", "may", "million", "mine", "more", "most", "must", "neither", "net", "new", "news", "next", "non", "none", "not", "now", "nowhere", "off", "one", "online", "only", "other", "otherwise", "our", "ours", "out", "over", "own", "owner", "people", "policy", "post", "president", "press", "privacy", "put", "report", "reserved", "right", "rights", "said", "since", "some", "something", "soon", "states", "still", "such", "technology", "than", "that", "thats", "the", "their", "theirs", "them", "there", "therefore", "these", "they", "this", "those", "though", "three", "through", "throughout", "time", "too", "tried", "try", "trying", "two", "uncommon", "under", "unsubscribe", "updates", "use", "used", "user", "users", "using", "varied", "various", "want", "was", "web", "webdesigns", "well", "went", "were", "what", "whatever", "when", "where", "whether", "which", "while", "who", "whom", 
"whose", "why", "will", "with", "within", "without", "work", "world", "would", "www", "yeah", "year", "years", "yep", "yes", "you", "your", "yours", "very", "much", "inc", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "she"]
  common_words.each { |word| words_by_count.delete(word) }
  # only return keywords that occur at least 3 and fewer than 15 times
  words_by_count.select { |_word, count| count >= 3 && count < 15 }.keys
end
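
The counting and filtering steps above can be tried without the sanitize gem. The following is a minimal sketch, not the gist's own code: `keyword_counts`, its keyword arguments, and the short stop-word list are hypothetical names and values chosen for illustration, and the input is assumed to be plain text with any HTML already stripped.

```ruby
# Hypothetical stdlib-only variant of the counting/filtering steps.
# Assumes `text` is already free of HTML; the default stop-word list is a
# tiny illustrative stand-in, not the gist's full common_words list.
def keyword_counts(text, stop_words: %w[the and for with that this],
                   min_count: 3, max_count: 15)
  counts = Hash.new(0)
  # Same tokenization as the gist: runs of 3+ ASCII letters/digits, lowercased.
  text.scan(/[A-Za-z0-9]{3,}/) { |word| counts[word.downcase] += 1 }
  # Drop stop words, then keep words seen at least min_count times
  # and fewer than max_count times.
  counts.reject { |word, _count| stop_words.include?(word) }
        .select { |_word, count| count >= min_count && count < max_count }
end

sample = "Ruby is fun. Ruby is fast. The ruby gem and the gem server and the gem cache."
p keyword_counts(sample) # keeps "ruby" and "gem", each counted 3 times
```

Returning the counts rather than only the words makes it easier to see why a word passed the thresholds; calling `.keys` on the result gives the same shape of output as `generate_keywords`.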