Last active
December 22, 2015 06:39
Save chrisvfritz/6432816 to your computer and use it in GitHub Desktop.
This is an example implementation of a really simple algorithm in Ruby to identify trending words and phrases in a collection of posts. This was written in a single sitting, late at night, without research, so it's definitely not the most efficient way to tackle this problem. I haven't even tested it with real data to make sure that it works as …
# constant array with the 100 most common words in English
COMMON_WORDS = ["the","be","to","of","and","a","in","that","have","i","it","for","not","on","with","he","as","you","do","at","this","but","his","by","from","they","we","say","her","she","or","an","will","my","one","all","would","there","their","what","so","up","out","if","about","who","get","which","go","me","when","make","can","like","time","no","just","him","know","take","people","into","year","your","good","some","could","them","see","other","than","then","now","look","only","come","its","over","think","also","back","after","use","two","how","our","work","first","well","way","even","new","want","because","any","these","give","day","most","us"]

# these numbers are totally made up, so you'll probably want to tweak them
MINIMUM_FREQUENCY_FOR_THREE_WORD_PHRASES = 10
MINIMUM_FREQUENCY_FOR_TWO_WORD_PHRASES = 20
MINIMUM_FREQUENCY_FOR_SINGLE_WORDS = 30

three_word_phrases = Array.new
two_word_phrases = Array.new
single_words = Array.new
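
# gets the 100 most popular posts using a popular scope, presumably sorting by views/reposts/etc. in the last 24 hours or something
# note: the code below calls scan directly on each post, so it assumes each post behaves like a String
# (with a model object, call scan on whichever attribute holds the post's text instead)
most_popular_posts = Post.popular(100)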

# for each of the 100 most popular posts...
most_popular_posts.each do |post|
  # for all of the three word phrases in each post...
  # (scan returns non-overlapping matches, so a phrase that overlaps an earlier match is not counted)
  post.scan(/\b\w+\s\w+\s\w+\b/).each do |phrase|
    # normalize the phrase into lower case
    phrase = phrase.downcase
    # put each word of the phrase into an array
    words_in_phrase = phrase.scan(/\b\w+\b/)
    # unless at least two of the words in the phrase are common...
    # this allows phrases like "the big apple" but not "in the big"
    first_word_is_common = COMMON_WORDS.include?(words_in_phrase[0])
    second_word_is_common = COMMON_WORDS.include?(words_in_phrase[1])
    third_word_is_common = COMMON_WORDS.include?(words_in_phrase[2])
    # && and || with parentheses are used because `and` and `or` share the same
    # precedence in Ruby, so an unparenthesized chain would not group into pairs
    unless (first_word_is_common && second_word_is_common) ||
           (first_word_is_common && third_word_is_common) ||
           (second_word_is_common && third_word_is_common)
      # if the phrase already exists in the list...
      index_of_phrase_if_found = three_word_phrases.index { |hash| hash[:string] == phrase }
      if index_of_phrase_if_found
        # increment the frequency of the phrase by 1
        three_word_phrases[index_of_phrase_if_found][:frequency] += 1
      # otherwise the phrase isn't already in the list...
      else
        # add the phrase to the list with a frequency of 1
        three_word_phrases.push({ string: phrase, frequency: 1 })
      end
    end
  end
end

# create a new array excluding the three word phrases that aren't frequent enough to be trending
trending_three_word_phrases = three_word_phrases.select { |hash| hash[:frequency] >= MINIMUM_FREQUENCY_FOR_THREE_WORD_PHRASES }

# then the same thing for two word phrases, but with an expensive twist
most_popular_posts.each do |post|
  post.scan(/\b\w+\s\w+\b/).each do |phrase|
    # normalize the phrase into lower case
    phrase = phrase.downcase
    # put each word of the phrase into an array
    words_in_phrase = phrase.scan(/\b\w+\b/)
    # unless at least one of the words is common
    unless COMMON_WORDS.include?(words_in_phrase[0]) || COMMON_WORDS.include?(words_in_phrase[1])
      # if the two word phrase is part of a trending three word phrase, skip to the next phrase
      # (any? is used so that next applies to the scan loop; a next inside a nested each
      # would only move on to the next three word phrase and never skip the count below)
      next if trending_three_word_phrases.any? { |larger_phrase| larger_phrase[:string].include?(phrase) }
      # if we've made it this far, the phrase has met our criteria, so let's count it
      index_of_phrase_if_found = two_word_phrases.index { |hash| hash[:string] == phrase }
      if index_of_phrase_if_found
        # increment the frequency of the phrase by 1
        two_word_phrases[index_of_phrase_if_found][:frequency] += 1
      # otherwise the phrase isn't already in the list...
      else
        # add the phrase to the list with a frequency of 1
        two_word_phrases.push({ string: phrase, frequency: 1 })
      end
    end
  end
end

# create a new array excluding the two word phrases that aren't frequent enough to be trending
trending_two_word_phrases = two_word_phrases.select { |hash| hash[:frequency] >= MINIMUM_FREQUENCY_FOR_TWO_WORD_PHRASES }

# and again for single words
most_popular_posts.each do |post|
  post.scan(/\b\w+\b/).each do |word|
    # normalize the word into lower case
    word = word.downcase
    # unless the word is common
    unless COMMON_WORDS.include?(word)
      # this part is tricky. i *think* it's a good idea, but i'm not sure.
      # the idea is that if "michael jackson" is trending, "michael" shouldn't
      # also show as trending, as it's out of its meaningful context
      # (any? is used so that next applies to the scan loop rather than to a nested each;
      # splitting the phrase compares whole words, so "art" is not matched inside "smart")
      in_trending_phrase = (trending_three_word_phrases + trending_two_word_phrases).any? do |larger_phrase|
        larger_phrase[:string].split.include?(word)
      end
      # if the word is part of a trending two or three word phrase, skip to the next word
      next if in_trending_phrase
      # if we've made it this far, the word has met our criteria, so let's count it
      index_of_word_if_found = single_words.index { |hash| hash[:string] == word }
      if index_of_word_if_found
        # increment the frequency of the word by 1
        single_words[index_of_word_if_found][:frequency] += 1
      # otherwise the word isn't already in the list...
      else
        # add the word to the list with a frequency of 1
        single_words.push({ string: word, frequency: 1 })
      end
    end
  end
end

# create a new array excluding the words that aren't frequent enough to be trending
trending_single_words = single_words.select { |hash| hash[:frequency] >= MINIMUM_FREQUENCY_FOR_SINGLE_WORDS }

# now you have three collections of trending words and phrases. from here you
# can play with them however you like - keeping them separate or combining
# them. you can choose to sort alphabetically for a word cloud or by frequency
# for other visualizations. have fun. :-)
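
The counting, filtering, and sorting steps above can also be sketched more compactly. The following is a minimal sketch, not the gist's method: it assumes posts are plain strings, tallies with a `Hash.new(0)` instead of an array of hashes, and builds overlapping phrases with `each_cons` rather than the regex scans. It skips the common-word filter, and because it splits on `\w+` it will also form phrases across punctuation. The sample data and the helper names `ngram_counts`, `trending`, and `without_words_in_phrases` are made up for illustration.

```ruby
# Hypothetical sample data standing in for the text of popular posts.
posts = [
  "The Big Apple is busy",
  "I love the big apple",
  "Big apple pizza",
  "Pizza night"
]

# Count every overlapping run of n consecutive words across all posts.
def ngram_counts(posts, n)
  counts = Hash.new(0) # missing keys start at 0, so no existence check is needed
  posts.each do |post|
    words = post.downcase.scan(/\w+/)
    words.each_cons(n) { |group| counts[group.join(" ")] += 1 }
  end
  counts
end

# Keep entries at or above a minimum frequency, most frequent first,
# with ties broken alphabetically. Returns [string, frequency] pairs.
def trending(counts, minimum)
  counts.select { |_, frequency| frequency >= minimum }
        .sort_by { |string, frequency| [-frequency, string] }
end

# Drop single words that already appear as a whole word in a trending phrase.
def without_words_in_phrases(word_entries, phrase_entries)
  phrase_words = phrase_entries.flat_map { |string, _| string.split }
  word_entries.reject { |string, _| phrase_words.include?(string) }
end

# The thresholds here are tiny only because the sample data is tiny.
top_two_word_phrases = trending(ngram_counts(posts, 2), 2)
top_single_words = without_words_in_phrases(
  trending(ngram_counts(posts, 1), 2),
  top_two_word_phrases
)
```

Sorting by `[-frequency, string]` gives a frequency-ordered list; sorting by the string alone would give the alphabetical order suggested for a word cloud.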