Skip to content

Instantly share code, notes, and snippets.

@erdos
Last active November 20, 2016 22:16
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save erdos/4fbe8fe9a8ad17abf3f2a9ccde498cb0 to your computer and use it in GitHub Desktop.
Save erdos/4fbe8fe9a8ad17abf3f2a9ccde498cb0 to your computer and use it in GitHub Desktop.
split-by-key returning groups may not cover whole input
(do
;; number of groups
(def n (.availableProcessors (Runtime/getRuntime)))
;; memoized round-robin style group number generator for items
(let [a (atom (cycle (range n)))]
(def get-id-for
(memoize (fn [_] (first (swap! a next))))))
(defn split-by-key
"Distributes all items in n sequences according to f(x)"
[f items]
(for [id (range n)]
(filter (comp #{id} get-id-for f) items)))
;; example: (split-by-key int (range 16)) => ((7 15) (0 8) (1 9) (2 10) (3 11) (4 12) (5 13) (6 14))
;; i am using (future) to realize the lazy groups returned by #'split-by-key
(let [total 10000
xs (range total)
gs (split-by-key int xs)]
(doseq [g gs]
;; do work here.
(future (reduce + g)))
;; i expect the sum to be = total, but it is not constant.
;; that is because an item in xs may go to multiple groups.
(reduce + (map count gs)))
)
@erdos
Copy link
Author

erdos commented Apr 21, 2016

Task

I would like to partition the input into n lists with the split-by-key function. That function is like clojure.core/partition-by but only returns n groups.

I use the memoized get-id-for function that returns a group id (between 0 and n) for every input in a round-robin manner.

Problem

The last expression shows that the total number of items in the groups do not always add up to 10000. That is strange because it shows that an item may be found in either more than one group or in none. I found this strange behaviour on a multicore machine.

@erdos
Copy link
Author

erdos commented Apr 21, 2016

Replacing split-by-key with the following seems to fix the problem:

(defn split-by-key [f items]
    (vals (group-by (comp get-id-for f) items)))

Of course it does not create n groups.

My question is: why is the previous behaviour possible?

@erdos
Copy link
Author

erdos commented Nov 20, 2016

Solution: The bug is in line 8 of the original code. The point is that memoize should never have side effects. Memoization internally is implemented using atom and swap (a compare-and-swap operation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment