@zerokarmaleft
Created November 28, 2012 18:25
generating minhash signatures
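;; Emit one tuple per k-shingle of a line. `shingles` is not defined in this
;; snippet and is assumed to be defined elsewhere.
(defmapcatop [extract-shingles [k]] [line] (shingles k line))

;; Hash a shingle with n seeded murmur3 functions (Guava's Hashing), giving
;; one n-element hash vector per shingle.
(defmapop [multihash [n]] [shingle]
  [(map (fn [seed]
          (.asInt (.hashString (Hashing/murmur3_32 seed) shingle)))
        (range n))])

;; Element-wise minimum across corresponding hash vectors in v1 and v2.
(defn merge-vectors
  [v1 v2]
  (map #(map min %1 %2) v1 v2))

;; Reduce all hash vectors of a document to its minhash signature.
(defbufferop minhash-sig
  [hash-sigs]
  [(reduce merge-vectors hash-sigs)])

;; Query producing [doc-id signature] for shingle size k and n hash functions.
(defn minhash-sigs
  [docs k n]
  (<- [?doc-id ?minhash-sig]
      (docs :> ?doc-id ?line)
      (extract-shingles k ?line :> ?shingle)
      (multihash n ?shingle :> ?hash-sig)
      (minhash-sig ?hash-sig :> ?minhash-sig)))

;; `D` (the document source) is assumed to be defined elsewhere.
(pprint (??- (minhash-sigs D 1 2)
             (minhash-sigs D 1 8)))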
;; ((["S1" (1867108634 -1900302480)]
;;   ["S2" (-107865855 -959211595)]
;;   ["S3" (-284916816 -1900302480)]
;;   ["S4" (-107865855 -1900302480)])
;;  (["S1"
;;    (1867108634
;;     -1900302480
;;     488346356
;;     655955059
;;     -304251353
;;     -1271286501
;;     -2025252291
;;     -1911403394)]
;;   ["S2"
;;    (-107865855
;;     -959211595
;;     1664025675
;;     -516762017
;;     -1735866048
;;     187387202
;;     -2080907966
;;     -1843992735)]
;;   ["S3"
;;    (-284916816
;;     -1900302480
;;     -1367910071
;;     -1780580861
;;     -1900867701
;;     -1665535712
;;     -2025252291
;;     -252536730)]
;;   ["S4"
;;    (-107865855
;;     -1900302480
;;     488346356
;;     -516762017
;;     -1735866048
;;     -1271286501
;;     -2080907966
;;     -1911403394)]))
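
The query above depends on Cascalog, Guava, and a `shingles` function that is not shown. As a rough illustration of the same idea in plain Clojure, the sketch below computes a minhash signature for a single string and compares two signatures. Everything in it is an assumption made for illustration: the `minhash-sketch` namespace, this character-level `shingles`, `seeded-hash` (Clojure's built-in `hash` over a `[seed x]` pair, standing in for seeded murmur3), `signature`, `signature-similarity`, and `jaccard` are hypothetical names, not part of the gist.

```clojure
(ns minhash-sketch
  (:require [clojure.set :as set]))

(defn shingles
  "Set of all k-character substrings of s (hypothetical helper)."
  [k s]
  (set (map #(apply str %) (partition k 1 s))))

(defn seeded-hash
  "Stand-in for a seeded hash function: Clojure's `hash` over [seed x].
  This is an assumption of the sketch, not the murmur3 hash used above."
  [seed x]
  (hash [seed x]))

(defn signature
  "Minhash signature of a non-empty shingle set: for each of n seeds,
  the minimum hash value over all shingles."
  [n shingle-set]
  (mapv (fn [seed]
          (apply min (map #(seeded-hash seed %) shingle-set)))
        (range n)))

(defn signature-similarity
  "Fraction of positions at which two equal-length, non-empty
  signatures agree."
  [sig-a sig-b]
  (/ (count (filter true? (map = sig-a sig-b)))
     (count sig-a)))

(defn jaccard
  "Exact Jaccard similarity of two sets (at least one non-empty)."
  [a b]
  (/ (count (set/intersection a b))
     (count (set/union a b))))

;; Usage: compare the exact Jaccard similarity with the signature estimate.
(let [a (shingles 2 "minhash")
      b (shingles 2 "minhashes")]
  {:jaccard  (double (jaccard a b))
   :estimate (double (signature-similarity (signature 64 a)
                                           (signature 64 b)))})
```

The fraction of agreeing signature positions is the usual minhash estimate of Jaccard similarity, so with more hash functions the `:estimate` value should tend to land closer to `:jaccard`. How well that holds here depends on how well `seeded-hash` behaves as a family of independent hash functions, which this sketch does not verify.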