@codyrioux
Created September 10, 2013 21:05
An idiomatic implementation of term-frequency inverse-document-frequency in 6 lines of Clojure.
(ns eacl2014.tools.tfidf
"A term-frequency inverse-document-frequency implementation in idiomatic Clojure.
A term is a single token, doc is a seq of terms and docs is a seq of docs.")
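(defn- log [x] (Math/log x))
;; Truthy when term t occurs in doc d.
(defn- t-in-d? [t d] (some #{t} d))
;; Raw count of term t in doc.
(defn- f [t doc] (count (filter #(= t %) doc)))
;; Augmented term frequency: 0.5 + 0.5 * f(t, doc) / (highest raw count of any term in doc). Throws on an empty doc.
(defn tf [t doc] (+ 0.5 (/ (* 0.5 (f t doc)) (apply max (map #(f % doc) doc)))))
;; Inverse document frequency: log(N / number of docs containing t). Throws (divide by zero) when no doc contains t.
(defn idf [t docs] (log (/ (count docs) (count (filter (partial t-in-d? t) docs)))))
;; tf-idf of term t in doc d, relative to the corpus docs.
(defn tfidf [t d docs] (* (tf t d) (idf t docs)))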
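The functions in the gist score one term at a time, and `tf` recounts every term of the doc on each call. As a rough sketch of how the same formulas could score every term of a doc in one pass, the snippet below uses `frequencies`. The names `doc-frequency` and `doc-tfidf` and the sample `docs` are hypothetical and not part of the gist, and the sketch assumes `doc` is itself a member of `docs`, so no term has a document frequency of zero.

```clojure
;; Sketch only: doc-frequency and doc-tfidf are hypothetical helpers,
;; not part of the original gist.

(defn doc-frequency
  "Number of docs in which term t occurs at least once."
  [t docs]
  (count (filter #(some (fn [x] (= t x)) %) docs)))

(defn doc-tfidf
  "Map of term -> tf-idf score for every distinct term in doc.
  Same formulas as the gist: tf = 0.5 + 0.5 * f / max-f, idf = log(N / df).
  Assumes doc is one of docs, so every term has df >= 1."
  [doc docs]
  (let [freqs (frequencies doc)           ; count all terms in a single pass
        max-f (apply max 1 (vals freqs))  ; the extra 1 keeps an empty doc from throwing
        n     (count docs)]
    (into {}
          (for [[t c] freqs]
            [t (* (+ 0.5 (/ (* 0.5 c) max-f))
                  (Math/log (/ n (doc-frequency t docs))))]))))

;; Made-up sample corpus for illustration.
(def docs [["the" "cat" "sat"]
           ["the" "dog" "sat" "sat"]
           ["a" "bird"]])

(doc-tfidf (first docs) docs)
;; "cat" occurs in only one doc, so it scores higher than "the" and "sat",
;; which each occur in two of the three docs.
```

Returning a map keeps the per-term scores together and makes it easy to sort or filter them afterwards. How to treat a term that occurs in no doc (for example by smoothing the denominator) is left open here, since the gist does not specify it.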