Skip to content

Instantly share code, notes, and snippets.

@Abhiroop
Last active August 1, 2016 20:07
Show Gist options
  • Save Abhiroop/2a452a230f2a78ac91fa2dbd8d4046ed to your computer and use it in GitHub Desktop.
Save Abhiroop/2a452a230f2a78ac91fa2dbd8d4046ed to your computer and use it in GitHub Desktop.
Top k elements from a vector
Timing the performance of various ways of extracting top k elements from a vector:
Here sequence size is about 10000
and k=300
Here is a randomly generated vector containing 10K maps. Each of the maps look like this
{
:id 123
:data (0 1 2 3 ....)
:more-data (0 1 2 3 ....)
}
Each object weighs about 4.3 KB
Function for generating the vector of maps:
(def
big-objects-vec
(shuffle
(vec
(map
(fn [i] {:id i :data (range 100) :more-data (range 1000)})
(range 10000)))))
Extracting top k elemnts from the vector can be done in 3 ways. Lets profile them:
1. (take 300 (sort-by :id big-objects-vec))
Timing 1 iteration
user=> (time (take 300 (sort-by :id big-objects-vec)))
"Elapsed time: 11.052344 msecs"
Timing 1000 iterations
user=> (time (dotimes [i 1000] (do (take 300 (sort-by :id big-objects-vec)))))
"Elapsed time: 9939.033409 msecs"
2.
(defn f [coll]
(apply sorted-set-by (fn [x y] (< (:id x) (:id y))) coll))
(take 300 (f big-objects-vec))
Timing 1 iteration
user=> (time (take 300 (f big-objects-vec)))
"Elapsed time: 30.529017 msecs"
Timing 1000 iterations
user=> (time (dotimes [i 1000] (do (take 300 (f big-objects-vec)))))
"Elapsed time: 27963.321466 msecs"
3.Using heap of size k
(defn top-by [k f coll]
(reduce
(fn [top x]
(let [top (conj top x)]
(if (> (count top) k)
(disj top (first top))
top)))
(sorted-set-by #(< (f %1) (f %2)))
coll))
Timing 1 iteration
user=> (time (top-by 300 :id big-objects-vec))
"Elapsed time: 27.388446 msecs"
Timing 1000 iterations
(time (dotimes [i 1000] (do (top-by 300 :id big-objects-vec))))
"Elapsed time: 26540.44524 msecs"
4. Using lazy quick sort. Creates a variant of quick select.
(defn qsort [xs k]
(lazy-seq
(if (seq xs)
(let [pivot (rand-nth xs)
smaller (filterv #(> (pivot k) (% k)) xs)
larger (filterv #(< (pivot k) (% k)) xs)
same (filterv #(= (pivot k) (% k)) xs)]
(concat (qsort smaller k)
same
(qsort larger k))))))
Timing 1 iteration
user=> (time (take 300 (qsort big-objects-vec :id)))
"Elapsed time: 0.029915 msecs"
Timing 1000 iterations
(time (dotimes [i 1000] (do (take 300 (qsort big-objects-vec :id)))))
"Elapsed time: 0.294812 msecs"
Even more...
Timing 100000 iterations
(time (dotimes [i 100000] (do (take 300 (qsort big-objects-vec :id)))))
"Elapsed time: 9.96925 msecs"
Even if you increase k to 300000
(time (dotimes [i 100000] (do (take 300000 (qsort big-objects-vec :id)))))
"Elapsed time: 11.255201 msecs"
nil
Moral of the story: Quick select kills the other algos in my machine. Should profile this in other machines.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment