@hrickards
Created November 25, 2013 13:20
Code for a TF-IDF-based algorithm for obtaining relevant research topics, written for the Rewired State GTR hackathon
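require "stopwords" # from the stopwords-filter gem
# Drop English stopwords, then any word that appears in the query itself
filter = Stopwords::Snowball::Filter.new "en"
unique_words = filter.filter(unique_words).uniq
query_words = query.split.map(&:sanitize)
unique_words.reject! { |word, tfidf| query_words.include? word }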
# Drop any word listed in words.txt (one word per line, compared in lowercase)
word_corpus = File.read("words.txt").split("\n").map(&:downcase)
unique_words.reject! { |word, tfidf| word_corpus.include? word }
String.class_eval do
  # Remove all non alphanumeric characters
  def alphanumeric_only
    return self.gsub(/[^0-9a-z ]/i, '')
  end

  # Sanitize a string by removing all non alphanumeric characters,
  # downcasing it and removing any excess spaces
  def sanitize
    return self.alphanumeric_only.downcase.squeeze " "
  end
end
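# Split every title into sanitized words
words = titles.map { |title| title.split(" ") }.flatten(1).map(&:sanitize)
# Count occurrences of each word, keeping only words seen more than 10 times.
# Counting must use the full word list: after `uniq` every count would be 1.
unique_words = words
  .group_by { |w| w }
  .map { |w, ws| [w, ws.count] }
  .select { |w, count| count > 10 }
max_freq = unique_words.map { |word, count| count }.max
# Replace each count with its TF-IDF score, then sort highest first
unique_words.map! { |word, count| [word, tf(count, max_freq) * idf(word, documents)] }
unique_words.sort_by! { |word, tfidf| tfidf }
unique_words.reverse!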
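# Augmented term frequency: scales a raw count into the range 0.5..1.0
def tf(freq, max_freq)
  0.5 + 0.5*freq/max_freq
end

# Inverse document frequency; the 1 + num_docs avoids division by zero
def idf(word, documents)
  cardinality = documents.count
  num_docs = documents.select { |doc| doc.include? word }.count
  Math.log10(cardinality * 1.0 / (1 + num_docs))
end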
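
To see how the two functions combine, here is a minimal self-contained sketch on toy data. The four titles are made up for illustration, and each document is assumed to be an array of already-sanitized words (the gist does not show how `documents` is built). The more-than-10 frequency cutoff is left out so that the tiny sample produces scores.

```ruby
# Hypothetical toy data: each document is one title, pre-split into lowercase words.
documents = [
  %w[quantum computing with trapped ions],
  %w[quantum error correction],
  %w[climate models and ocean circulation],
  %w[ocean acidification]
]

# Augmented term frequency, as in the gist.
def tf(freq, max_freq)
  0.5 + 0.5 * freq / max_freq
end

# Smoothed inverse document frequency, as in the gist.
def idf(word, documents)
  num_docs = documents.count { |doc| doc.include?(word) }
  Math.log10(documents.count * 1.0 / (1 + num_docs))
end

# Count every word across all titles (no frequency cutoff in this sketch).
counts = documents.flatten.group_by { |w| w }.map { |w, ws| [w, ws.count] }
max_freq = counts.map { |_, count| count }.max

# Score each word and rank with the highest TF-IDF first.
scores = counts.map { |word, count| [word, tf(count, max_freq) * idf(word, documents)] }
ranked = scores.sort_by { |_, score| -score }

ranked.each { |word, score| puts format("%-15s %.4f", word, score) }
```

In this sample, words that occur in a single title outrank "quantum" and "ocean", which occur in two: the IDF penalty for appearing in more documents outweighs the higher term frequency. Note also that because of the `1 + num_docs` smoothing, a word present in every document gets a slightly negative IDF.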