Hadoop map-reduce explained with clojure map, reduce and mapcat using the word count example.
; Interested in a short introduction to Hadoop MapReduce?
(declare mapreduce)
; Let's look at the "hello world" job, i.e. word count.
(def input
  [[1 "hadoop map-reduce explained"]
   [2 "with clojure map, reduce and mapcat"]
   [3 "using the word count example"]])
(def output
  [["and" 1] ["clojure" 1] ["count" 1]
   ["example" 1] ["explained" 1] ["hadoop" 1]
   ["map" 2] ["mapcat" 1] ["reduce" 2] ["the" 1] ["using" 1]
   ["with" 1] ["word" 1]])
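; The two main customisation points are the mapper and the reducer:
(declare my-mapper my-reducer)
; Once everything below is defined, this should evaluate to true
; (wrapped in `comment` because the functions are only declared so far):
(comment
  (= (mapreduce input my-mapper my-reducer) output))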
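; First, the mapper behavior for word count: each input line becomes
; a sequence of [word 1] pairs. Expected, once my-mapper is defined:
(comment
  (= (my-mapper [1 "hadoop map-reduce explained"])
     [["hadoop" 1] ["map" 1] ["reduce" 1] ["explained" 1]]))
(defn my-mapper [[_ line]] ; the key (the line number) is ignored
  (map #(vector % 1) (re-seq #"\w+" line)))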
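; A minimal sketch of the tokenising step the mapper relies on, using only
; clojure.core. The names `tokens-demo` and `token-pairs-demo` are hypothetical
; and exist only for this illustration.
; #"\w+" matches runs of letters, digits and underscores, so punctuation and
; hyphens act as separators, and a repeated word is emitted once per occurrence.
(def tokens-demo (re-seq #"\w+" "map-reduce, map and reduce"))
; tokens-demo => ("map" "reduce" "map" "and" "reduce")
(def token-pairs-demo (map #(vector % 1) tokens-demo))
; token-pairs-demo => (["map" 1] ["reduce" 1] ["map" 1] ["and" 1] ["reduce" 1])
; No counting happens at this stage: summing is left to the reducer.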
; Second, the reducer behavior for word count: it sums the occurrences
; collected for one word. Expected, once my-reducer is defined:
(comment
  (= (my-reducer ["map" [1 1]])
     [["map" 2]]))
(defn my-reducer [[word occurrences]]
  [[word (reduce + occurrences)]])
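; A small sketch of why the reducer returns a vector of pairs ([[word n]])
; rather than a single pair: the mapreduce function further down applies it
; with mapcat, which concatenates whatever each call returns. The `-demo`
; names are hypothetical, for illustration only.
(def wrapped-demo (mapcat identity [[["and" 1]] [["map" 2]]]))
; wrapped-demo => (["and" 1] ["map" 2])   ; the pairs survive intact
(def unwrapped-demo (mapcat identity [["and" 1] ["map" 2]]))
; unwrapped-demo => ("and" 1 "map" 2)     ; the pairs are flattened away
; The same convention lets a mapper or reducer emit zero, one or many pairs.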
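; And finally, the mapreduce implementation for a single node.
; shuffle-sort groups the mapper output by key:
; [k v] pairs become [k (v1 v2 ...)] entries, ordered by k.
(defn shuffle-sort [kvs]
  (->> kvs
       (sort-by first)
       (partition-by first)
       (map #(vector (first (first %)) (map second %)))))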
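; The three steps of shuffle-sort, traced on a tiny input (a sketch; the
; `-demo` names are hypothetical). partition-by only groups adjacent items,
; which is why the pairs are sorted by key first.
(def shuffle-in-demo [["map" 1] ["and" 1] ["map" 1]])
(def sorted-demo (sort-by first shuffle-in-demo))
; sorted-demo => (["and" 1] ["map" 1] ["map" 1])
(def grouped-demo (partition-by first sorted-demo))
; grouped-demo => ((["and" 1]) (["map" 1] ["map" 1]))
(def shuffled-demo
  (map #(vector (first (first %)) (map second %)) grouped-demo))
; shuffled-demo => (["and" (1)] ["map" (1 1)])
; Without the sort, "map" would be split into two separate groups:
(def unsorted-groups-demo (partition-by first shuffle-in-demo))
; unsorted-groups-demo => ((["map" 1]) (["and" 1]) (["map" 1]))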
(defn mapreduce [kvs mapper reducer]
  (->> kvs
       (mapcat mapper)
       shuffle-sort
       (mapcat reducer)))
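; A hedged cross-check, for illustration only: for word count, the pipeline
; should agree with clojure.core's `frequencies` once the result is sorted by
; word. The `wc-` names are hypothetical.
(def wc-line-demo "map reduce map")
(def wc-reference-demo
  (sort-by first (frequencies (re-seq #"\w+" wc-line-demo))))
; wc-reference-demo => (["map" 2] ["reduce" 1])
(comment
  ; should evaluate to true once the definitions above are loaded
  (= (mapreduce [[1 wc-line-demo]] my-mapper my-reducer) wc-reference-demo))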
; Indeed, quite trivial.
; Now, if you want to use Clojure with Hadoop on a real workload,
; you should use higher-level abstractions and look at Cascalog.
; https://github.com/nathanmarz/cascalog