Skip to content

Instantly share code, notes, and snippets.

@aamar
Created October 8, 2012 09:00
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save aamar/3851520 to your computer and use it in GitHub Desktop.
Save aamar/3851520 to your computer and use it in GitHub Desktop.
Clojure & ML Talk

Clojure + Machine Learning

Why we use Clojure at Unfold
My experience with Machine Learning
And why it's the ideal language for web-apps

Anshul Amar | aamar@unfold.com

To view as a slideshow, please use the cool GistDeck.

Summary

  • Improving CRUD
  • A case study: Document Classification
  • Why Clojure helps

CRUD

CRUD is a (somewhat pejorative) description for a "standard" or vanilla web-app.

Create
Read
Update
Delete

I spent years doing a lot of standard CRUD functionality on the web.

How can we do more with less?

Specifically, how can we allow humans to get the same benefits, but perform less CRUD?

  • Create/Update:

    • Guess data for you
    • Validate data -> Probabilistic Validation
  • Read:

    • Complex query interface -> Context- aware search
    • Automated Suggestions & filtering
  • Delete:

    • Reject counterproductive data (e.g. Spam, Gaming)

Example: Document Classification

Classification Problem

  • Get each one into the system (upload it, or it arrives via fax)
  • Categorize it
  • Add a lot of metadata
  • System will route it to correct recipient and provide them with correct options

The envisioned solution

  1. Reroute client's fax machines to us
  2. User dumps faxes to us everything arrives via mail/handoff
  3. Classify as many of them as possible automatically
  4. Route the "tricky ones" to a human
  • N.B: This only happened after:
    • A minimal side project
    • A lot of conversation wrapping our heads around it
    • It became clear that if it worked, it would save the company $$

A few facts about these documents

We want to classify them into a ~40 different categories: e.g. Lab Result, Rx Renewal, Insurance Form.

Docs are form-based, lots of shorthand, very little prose => "bag of words" approach fits.

Our Attempt 1. Stick with Perl

  • Native Perl libraries
  • These turned out to be slow/incomplete
  • Reimplement some components by hand
  • Still very slow
  • Lots more numerical optimizations possible, but this would take a lot of coding

Attempt 2. Dedicated ML System

  • Perl layer queries DB, business logic, dumps to file-system
  • Feed into Rainbow/Mallet (open-source tools, optimized version of TF-IDF (discussed later))
  • Back to DB

But... there were all sorts of edge cases and complex needs

  • Bad scans (upside-down, etc.)
  • Page-splits
  • Pages out of order
  • Special rule-based classifiers
  • Conditional sub-categorization (refill vs. renewal)

E.g. Page-Splits

  • Single fax = multiple "documents"
  • As before, train based on what humans have done before
  • What is page-to-page similarity? Similarity over a split?
  • Classifier assigns to one of these cases:
    • 1/1: singleton page
    • 1/n: first page (e.g. "cover sheet")
    • n/n: last page
    • m/n: interim page
  • Combining above yielded decent results
  • Go through all of this, then do the classification

Instead of a system that looked like this...

We ended up with something like this...

The desire: a single system for "exploratory programming"

  • Represent the core business logic
  • Deal with web and database
  • Be able to run basic, tweakable algorithms
  • Be able to implement your own

Helpful principles

  • Avoid disk, avoid streams
  • Run as much as possible in memory

Caveats:

  • Some ML problems don't require a lot of exploration, but require scale
  • Ultimately a mixed-stack may be useful
  • Separate technologies can sometimes help isolate concerns

But: I've persistently found great value in being able to conveniently access ML constructs from within my primary "business-logic" language.

Clojure's strengths

  • Can handle the web/db stuff perfectly well
  • Can implement domain logic perfectly well
  • Be able to run basic, tweakable algorithms

Clojure's subtle strengths

  • Philosophy of "small composable bits"
  • Focus on processing lists
  • Almost everything's a seq
    • (entities & protocols can interfere a bit)
    • e.g. Incanter uses a different internal format
    • but it's trivial to port data back and forth
  • Easy to implement algorithms

TF-IDF in Clojure

TF-IDF has come in handy both in doc classification and Unfold document clustering.

  • TF-IDF is often associated with search, but it is a generic weighting scheme that can be useful in classification, clustering, etc.
  • Basic formulas:

The Clojure Code

  • Thanks to Ignacio Thayer

  • Implement these formulas

  • Add stopwords

  • My changes:

    • Disentangle file-access from algorithm
    • Allow domain-specific tokenization
    • Add Classification API

The Clojure Code -- starting up

(ns tf-idf
  (:use clojure.test)
  (:require [clojure.string :as string]))

(def default-stopwords (set (string/split #"\n" (slurp "./stopwords.txt"))))

(defn default-tokenize
  "Default tokenizer. Can be overridden by"
  [text]
  (remove default-stopwords (string/split (.toLowerCase text) #"[^a-zA-Z0-9]+")))

(def tokenize default-tokenize)

(defn sq [x] (Math/pow x 2))

The Clojure Code -- tf-idf core

(defn idf [total-doc-count match]
  (sq (Math/log (/ total-doc-count (+ 0 (count (keys match)))))))

(defn tf-map [doc-id raw & {:keys [tokenize]}]
  (let [word-counts (frequencies ((or tokenize default-tokenize) raw))]
    (zipmap (keys word-counts) (map (fn [c] {doc-id c}) (vals word-counts)))))

(defn accum-tfidf [total-doc-count match]
  (map (fn [doc w-count] {doc (* w-count (sq (idf total-doc-count match)))})
    (keys match) (vals match)))

(defn score-all [index total-doc-count search-terms]
  (let [results (keep index search-terms)]
    (apply merge-with +
           (mapcat (partial accum-tfidf total-doc-count) results))))

The Clojure Code -- learning

(defn take-top-by-val [n m]
  (take n (reverse (sort-by second m))))

(defn map-hash [f m]
  (reduce merge (map (fn [[k v]] {k (f [k v])}) m)))

;; pass klass for classification rather than search
(defn learn [db doc-id raw-text & [tokenize klass]]
  (let [db (or db {:count 0 :index {} :norms {}})
        tf (tf-map (or klass doc-id) raw-text :tokenize tokenize)
        new-count (inc (:count db))
        new-index (merge-with merge (:index db) tf)
        ;; NOTE: this part is recomputed in full each time; this can be optimized further
        norms-raw (apply merge-with + (mapcat (partial accum-tfidf new-count) (vals new-index)))]
    {:count new-count
     :index new-index
     :norms (zipmap (keys norms-raw) (map #(Math/sqrt %) (vals norms-raw)))}))

(defn learn-all [docs]
  (reduce #(apply learn % %2) nil docs))

The Clojure Code -- searching

(defn search [db raw-text & {:keys [tokenize]}]
  (let [results (score-all (:index db) (:count db) ((or tokenize default-tokenize) raw-text))]
    (take-top-by-val 3 (map-hash (fn [[k v]] (/ (results k) ((:norms db) k))) results))))

(def classify search)

(deftest search-test 
  (let [db (learn-all [[:a "all's well that ends well"]
                       [:b "some other thing that ends well"]
                       [:c "another thing"]])]
    (is (= (map first (search db "well")) [:a :b]))
    (is (= (map first (search db "thing")) [:c :b])))

  (let [db-klass (learn-all [[:a "all's well that ends well" nil "play"]
                             [:b "all quiet on the western front" nil "movie"]
                             [:c "all about eve" nil "movie"]
                             [:d "measure for measure" nil "play"]])]
    (is (= (ffirst (search db-klass "all")) "movie"))
    (is (= (ffirst (search db-klass "measure")) "play"))))

Closing thoughts

  • ML is often taught using great tools like Matlab and R
  • But it's often very valuable in the early days to call ML directly from your main language
  • It's no fun to do this if your main language (due to power or idioms) make ML algorithms slow or awkward
  • Clojure offers a great mix of practicality for business logic and practicality for ML

Thanks

Anshul Amar aamar@unfold.com

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment