@jmorton
Last active August 29, 2015 14:18
My Clojure Katas

Counting Words

Get:

  • Ten most frequently occurring non-stop words in a text file (including how often they occur).
  • Alphabetized list of words occurring exactly three times.
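The two goals above can be sketched against a small in-memory frequency map. The map contents and the names `sample-freqs`, `top-n` and `words-occurring` are hypothetical, chosen only for illustration; stop-word filtering is assumed to have happened before counting.

```clojure
;; Hypothetical frequency map, standing in for the output of a word counter.
(def sample-freqs
  {"rights" 3, "people" 5, "laws" 3, "government" 4, "powers" 3, "states" 2})

(defn top-n
  "Return the n most frequent [word count] pairs, most frequent first."
  [n freqs]
  (take n (sort-by val > freqs)))

(defn words-occurring
  "Return an alphabetized seq of the words that occur exactly k times."
  [k freqs]
  (sort (for [[word cnt] freqs :when (= cnt k)] word)))

(top-n 2 sample-freqs)           ; => (["people" 5] ["government" 4])
(words-occurring 3 sample-freqs) ; => ("laws" "powers" "rights")
```

For the kata itself the calls would be `(top-n 10 ...)` and `(words-occurring 3 ...)` on the real frequency map.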

Considerations:

  • Stop words
  • Stemming
  • Laziness

Units Conversion

Needs elaboration. Convert between units and display measures in human-readable form.
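One possible starting point, since the kata is not yet specified: a minimal sketch that converts lengths through a base unit and picks a readable unit for display. The unit table and the names `meters-per`, `convert` and `humanize` are assumptions made for this example, not part of the original notes.

```clojure
;; Hypothetical table: the size of each unit expressed in meters.
(def meters-per
  {:mm 1/1000, :cm 1/100, :m 1, :km 1000})

(defn convert
  "Convert amount from one unit to another by way of meters."
  [amount from to]
  (/ (* amount (meters-per from)) (meters-per to)))

(defn humanize
  "Render a length given in meters using the largest unit that keeps the
  displayed value at or above 1, falling back to millimeters."
  [meters]
  (let [fits?       (fn [[_ size]] (>= meters size))
        [unit size] (or (first (filter fits? (sort-by val > meters-per)))
                        [:mm 1/1000])]
    (str (double (/ meters size)) " " (name unit))))

(convert 5 :km :m)   ; => 5000
(convert 250 :cm :m) ; => 5/2
(humanize 1500)      ; => "1.5 km"
```

Exact ratios keep the conversions free of floating-point drift; only `humanize` turns the value into a double for display.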

(ns shokunin.words
  (:require clojure.java.io
            clojure.string))

(defn term-frequency
  "Count occurrences of terms in a file, skipping terms found in stop-set."
  [file stop-set]
  (with-open [rdr (clojure.java.io/reader file)]
    (let [lines (line-seq rdr)
          token (partial re-seq #"[\w\-\']+")
          -stop (partial remove stop-set)
          parse (comp frequencies -stop token clojure.string/lower-case)
          freqs (map parse lines)]
      ;; reduce is eager, so every line is consumed before the reader closes
      (reduce (partial merge-with +) freqs))))

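;; working files
(def file-url "resources/01/declaration.txt")
;; stop words are split with the same pattern term-frequency uses (an assumption)
(def stop-set
  (->> (slurp "resources/01/stop.txt") clojure.string/lower-case (re-seq #"[\w\-\']+") set))
(def freq-map (term-frequency file-url stop-set))
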
;; statistics
;; ten most frequent terms, as [term count] pairs
(take 10 (sort-by last > freq-map))
;; first ten terms in alphabetical order
(take 10 (sort-by first compare freq-map))

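(defn sentence-tokens
  "Split s into sentences, each ending in ., ? or ! plus any trailing whitespace."
  [s]
  (re-seq #"(?:[^\.\?\!]+[\.\?\!]\s*)" s))
(defn term-tokens
  "Split s into word and punctuation tokens, padded with nil at both ends."
  [s]
  (concat [nil] (re-seq #"(?:\w[\w\'\-]*|[\,\.\?\!\:\;])" s) [nil]))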
(def sentences (sentence-tokens (slurp "resources/01/declaration.txt")))
(term-tokens "Now, is the time.")
(term-tokens (nth sentences 10))
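;; define a stats db
(defn naive-bayes
  "Combine probabilities ps into one value, P / (P + Q), where P is the
  product of ps and Q is the product of their complements."
  [& ps]
  (let [prob  (apply * ps)
        not-p (apply * (map #(- 1 %) ps))]
    (/ prob (+ prob not-p))))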
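
As a worked check of the combination rule in `naive-bayes`, here is a small sketch using exact ratios. The function is repeated so the snippet stands alone, and the input probabilities are made up for illustration.

```clojure
(defn naive-bayes [& ps]
  (let [prob  (apply * ps)
        not-p (apply * (map #(- 1 %) ps))]
    (/ prob (+ prob not-p))))

;; Two made-up probabilities, 9/10 and 8/10:
;;   prob   = 9/10 * 8/10 = 18/25
;;   not-p  = 1/10 * 2/10 = 1/50
;;   result = 18/25 / (18/25 + 1/50) = 36/37
(naive-bayes 9/10 8/10)     ; => 36/37

;; A neutral input of 1/2 scales prob and not-p equally, so nothing changes.
(naive-bayes 9/10 8/10 1/2) ; => 36/37
```

Note that the division fails when both products are zero, for example when the inputs include both 0 and 1.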