Last active
November 20, 2016 22:16
-
-
Save erdos/4fbe8fe9a8ad17abf3f2a9ccde498cb0 to your computer and use it in GitHub Desktop.
split-by-key returning groups may not cover whole input
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(do | |
;; number of groups | |
(def n (.availableProcessors (Runtime/getRuntime))) | |
;; memoized round-robin style group number generator for items | |
(let [a (atom (cycle (range n)))] | |
(def get-id-for | |
(memoize (fn [_] (first (swap! a next)))))) | |
(defn split-by-key | |
"Distributes all items in n sequences according to f(x)" | |
[f items] | |
(for [id (range n)] | |
(filter (comp #{id} get-id-for f) items))) | |
;; example: (split-by-key int (range 16)) => ((7 15) (0 8) (1 9) (2 10) (3 11) (4 12) (5 13) (6 14)) | |
;; i am using (future) to realize the lazy groups returned by #'split-by-key | |
(let [total 10000 | |
xs (range total) | |
gs (split-by-key int xs)] | |
(doseq [g gs] | |
;; do work here. | |
(future (reduce + g))) | |
;; i expect the sum to be = total, but it is not constant. | |
;; that is because an item in xs may go to multiple groups. | |
(reduce + (map count gs))) | |
) |
Replacing split-by-key
with the following seems to fix the problem:
(defn split-by-key [f items]
(vals (group-by (comp get-id-for f) items)))
Of course it does not create n
groups.
My question is: why is the previous behaviour possible?
Solution: The bug is in line 8 of the original code. The point is that memoize
should never have side effects. Memoization internally is implemented using atom
and swap
(a compare-and-swap operation).
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Task
I would like to partition the input into
n
lists with thesplit-by-key
function. That function is likeclojure.core/partition-by
but only returnsn
groups.I use the memoized
get-id-for
function that returns a group id (between 0 and n) for every input in a round-robin manner.Problem
The last expression shows that the total number of items in the groups do not always add up to 10000. That is strange because it shows that an item may be found in either more than one group or in none. I found this strange behaviour on a multicore machine.