github.com/twotwotwo/sorts is a Go package with parallel radix- and quicksorts. It can run up to 5x faster than stdlib sort on the right sort of large sort task, so it could be useful for analysis and indexing/database-y work in which you have to sort millions of items. (To be clear, I don't recommend most folks drop stdlib sort, which is great, and which sorts depends on.)

While the process of writing it is fresh in my mind, here are some technical details, some things that didn't make the cut, and some thoughts about the process:

Concretely, what this looks like inside:

  • Both the number and string versions are in-place MSD radix sorts that look at one byte at a time and, once the range being sorted gets down to 128 items, call (essentially) the stdlib's quicksort.

  • The parallelization code's basic principle is "first, try to keep all the cores busy, then worry about everything else."

    Concretely: when a sort routine has a >128-element subarray to sort, it tries to do a non-blocking send of a "task" to a channel to potentially hand it to another goroutine. If the send fails, it does the sort immediately, in the current goroutine. The channel is buffered so that when a worker finishes its task, there's a good chance there'll be another job ready--I thought buffering would hurt, but it helped. Timings on 1 to 16 cores are online. (A rough sketch of this handoff, wrapped around a simplified radix pass, follows this list.)

    There are nicer approaches you could take: samplesort partitions the array into more-or-less even-sized buckets, one per core. There might also be designs broadly similar to the current one that are better at minimizing how often parts of the array get tossed back and forth between cores.

  • The integer and string sorts both do pragmatic tricks to save work in the common case where some bytes of input are the same across the whole range being sorted. If you're sorting, say, a bunch of ints between 0 and 1000000 (so only using 20 of 64 bits), or a bunch of strings that all have the same long common prefix (think lines in a log file), this makes a noticeable difference.

  • In early revisions, the string sort could skip multiple bytes ahead in a single pass, but I removed that because it was tricky to write, tricky to test, and wasn't hugely faster than what it does now, even when I artificially built a file with a really long common prefix.
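
To make that concrete, here's a rough sketch of the overall shape--a byte-at-a-time MSD radix pass over uint64 keys with the 128-element comparison-sort cutoff, plus the non-blocking task handoff. It is not the package's code: the names, the worker count, the channel buffer size, and the copy-through-scratch distribution are simplifications of my own (the real sorts are in-place and go through sort.Interface).

```go
// Hypothetical sketch, not the package's code: a byte-at-a-time MSD radix
// sort on uint64 keys that falls back to a comparison sort at 128 elements
// and tries a non-blocking handoff of sub-ranges to other goroutines.
package main

import (
	"fmt"
	"sort"
	"sync"
)

type task struct {
	keys  []uint64
	shift uint // bit shift selecting the byte to bucket on next
}

var (
	// Buffered so a worker that finishes one task usually finds another
	// already waiting.
	tasks = make(chan task, 64)
	wg    sync.WaitGroup
)

func radixSort(keys []uint64, shift uint) {
	defer wg.Done()
	if len(keys) <= 128 {
		// Small range: hand off to a comparison sort (the real package
		// calls essentially the stdlib's quicksort here).
		sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
		return
	}
	// Count how many keys land in each of the 256 buckets for this byte.
	var counts [256]int
	for _, v := range keys {
		counts[(v>>shift)&0xFF]++
	}
	// Prefix-sum the counts into bucket start offsets.
	var starts [256]int
	sum := 0
	for i, c := range counts {
		starts[i] = sum
		sum += c
	}
	// Distribute keys into their buckets. (The real sort permutes in place;
	// copying through scratch keeps the sketch short.)
	scratch := make([]uint64, len(keys))
	next := starts
	for _, v := range keys {
		b := (v >> shift) & 0xFF
		scratch[next[b]] = v
		next[b]++
	}
	copy(keys, scratch)
	if shift == 0 {
		return // all eight bytes examined
	}
	// Sort each bucket on the next byte: offer it to another goroutine with
	// a non-blocking send, and if nobody takes it, do it right here.
	for i := 0; i < 256; i++ {
		sub := keys[starts[i] : starts[i]+counts[i]]
		if len(sub) < 2 {
			continue
		}
		wg.Add(1)
		select {
		case tasks <- task{sub, shift - 8}: // another goroutine picks it up
		default:
			radixSort(sub, shift-8)
		}
	}
}

func main() {
	for i := 0; i < 4; i++ { // a few workers to keep the cores busy
		go func() {
			for t := range tasks {
				radixSort(t.keys, t.shift)
			}
		}()
	}
	keys := []uint64{9, 300, 7, 1 << 40, 42, 42, 0, 123456789}
	wg.Add(1)
	radixSort(keys, 56) // start at the most significant byte
	wg.Wait()
	fmt.Println(keys)
}
```

The select with a default branch is what makes the handoff non-blocking: if no worker is free to take the sub-range, the current goroutine just sorts it itself.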

Some more things I tried that didn't make it:

  • Keeping my own stack. Radix sorts, especially the string sort, can use a lot of stack--each level of recursion needs a couple of 256-entry tables. American flag sort reduces stack use by keeping its own stack (or "to-do list", if you want to think about it informally) to avoid recursion. I tried that, but it didn't usually affect speed, it made the code trickier, and we can afford to blow a certain amount of memory on stack in 2015 (the American flag sort folks were writing in '93), so it's not in the released version. If you added it back, you could lift the hard limit of radix sorting strings only by their first 32 bytes; it might help the same sort of data the long-common-prefix trick did. (There's a rough sketch of the to-do-list idea after this list.)

  • Caching key bytes for string sorts between two passes over smallish ranges. Only helped a little and the code wasn't pretty.

  • Using more than 8 bits at a time to sort ints. Sometimes helped but it seemed very sensitive to the exact array size.
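
For what it's worth, here's a rough sketch of the to-do-list idea from the first bullet above, applied to strings: a slice of pending (range, byte position) frames replaces recursion, so there's no per-level table cost on the goroutine stack and no built-in limit on how deep into the strings the radix sort can go. Again, this is my own simplification (copy-through-scratch, made-up names), not code from the package.

```go
// Hypothetical sketch, not the package's code: MSD radix sort of strings by
// byte position, driven by an explicit "to-do list" of frames instead of
// recursion, so there's no per-level stack cost and no fixed depth limit.
package main

import (
	"fmt"
	"sort"
)

type frame struct {
	lo, hi int // half-open range within the slice being sorted
	pos    int // byte position to bucket on next
}

func stringRadixSort(ss []string) {
	todo := []frame{{0, len(ss), 0}}
	for len(todo) > 0 {
		f := todo[len(todo)-1] // pop the most recent frame
		todo = todo[:len(todo)-1]
		seg := ss[f.lo:f.hi]
		if len(seg) <= 128 {
			sort.Strings(seg) // small range: comparison sort
			continue
		}
		// 257 buckets: bucket 0 for strings that end before this position
		// (within a frame they're all identical, so they need no more work),
		// buckets 1..256 for byte values 0..255.
		bucketOf := func(s string) int {
			if len(s) <= f.pos {
				return 0
			}
			return int(s[f.pos]) + 1
		}
		var counts [257]int
		for _, s := range seg {
			counts[bucketOf(s)]++
		}
		var starts [257]int
		sum := 0
		for i, c := range counts {
			starts[i] = sum
			sum += c
		}
		// Distribute through scratch space (the real sort works in place).
		scratch := make([]string, len(seg))
		next := starts
		for _, s := range seg {
			b := bucketOf(s)
			scratch[next[b]] = s
			next[b]++
		}
		copy(seg, scratch)
		// Instead of recursing, push a frame for each bucket that still
		// needs sorting by the next byte.
		for i := 1; i < 257; i++ {
			if counts[i] > 1 {
				todo = append(todo, frame{f.lo + starts[i], f.lo + starts[i] + counts[i], f.pos + 1})
			}
		}
	}
}

func main() {
	ss := []string{"log: restart", "log: ready", "log: ready", "apple", ""}
	stringRadixSort(ss)
	fmt.Println(ss)
}
```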

I still think there might be some wins in sorting string data by collecting the first bytes of the strings in uint64s and sorting them as numbers, then using the actual strings as a tiebreaker. (Keeping string bytes in one little contiguous region has to beat jumping all over RAM for them.) And there might be something to concrete (interfaceless) sort code, but then the trick is coming up with a sane API.
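
As a rough illustration of the packed-prefix idea (my own sketch, with made-up names--nothing like this is in the package): pull each string's first eight bytes into a big-endian uint64 so that numeric order matches byte order, sort on that, and only touch the full strings when two packed prefixes tie. The sketch uses a comparison sort just to show the key layout; the point is that most comparisons never have to leave the contiguous array of uint64 keys.

```go
// Hypothetical sketch of the packed-prefix idea: compare cache-friendly
// uint64 prefixes first and fall back to the full strings only on ties.
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

type rec struct {
	key uint64 // first 8 bytes of s, big-endian so numeric order == byte order
	s   string
}

func packPrefix(s string) uint64 {
	var buf [8]byte
	copy(buf[:], s) // shorter strings are zero-padded; ties fall through to the string compare
	return binary.BigEndian.Uint64(buf[:])
}

func sortByPackedPrefix(ss []string) {
	recs := make([]rec, len(ss))
	for i, s := range ss {
		recs[i] = rec{packPrefix(s), s}
	}
	sort.Slice(recs, func(i, j int) bool {
		if recs[i].key != recs[j].key {
			return recs[i].key < recs[j].key // the common, cache-friendly case
		}
		return recs[i].s < recs[j].s // tiebreak on the actual strings
	})
	for i, r := range recs {
		ss[i] = r.s
	}
}

func main() {
	ss := []string{"pear", "apple", "apples", "banana", "app", ""}
	sortByPackedPrefix(ss)
	fmt.Println(ss) // "", "app", "apple", "apples", "banana", "pear"
}
```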

Some meta thoughts:

  • Don't be too quick to write things off as not worth trying. Kind of the computing equivalent of "it never hurts to ask." When I first started futzing around, I figured single-type versions of the sort code (e.g., one that only compares uint64s) were the way to go, and only tried the current sort.Interface thing out of curiosity, to see (I thought) how badly it'd do. Turned out, not that bad, so here we are.

  • You have to remember what code runs frequently and what doesn't. That long-common-prefix code felt like it should be making everything slower, but wasn't. Intuitively, when something is a third of the code you see in front of you, and worse yet sits in an inner loop, it's hard not to imagine it's hurting somehow. Benchmarks didn't bear that out; it probably boiled down to a never-taken branch that the processor could zip right past.

    Of course, it turned out that particular code was doomed, haha, but by its trickiness, not its slowness.

  • Sometimes the hard part of a change is testing it. Optimizations have weird special cases to tickle. There are multiple algorithms (radix sort and the stdlib's qSort, which itself breaks down into insertionSort/heapSort), multiple data types for each algorithm, and each can run serially or in parallel. Plus, you don't just want to be able to say you have coverage; you ideally want each type of sort to see as many kinds of weird data as you can. Brute force (here, sorting each dataset twelve ways) can help--something like the sketch below.
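
What I mean by brute force is roughly this shape of test: cross every kind of test data with every sort under test and check the result against the stdlib as a reference. The sort functions here are stand-ins, not the package's real serial/parallel entry points, and the data generators are just examples.

```go
// Hypothetical sketch of a brute-force cross-product test. The entries in
// sortsUnderTest stand in for the real serial/parallel radix and quicksort
// entry points; the generators stand in for the real zoo of weird data.
package sorts

import (
	"math/rand"
	"reflect"
	"sort"
	"testing"
)

var datasets = map[string]func() []int{
	"random": func() []int {
		r := rand.New(rand.NewSource(1))
		a := make([]int, 1000)
		for i := range a {
			a[i] = r.Intn(1 << 20)
		}
		return a
	},
	"already sorted": func() []int {
		a := make([]int, 1000)
		for i := range a {
			a[i] = i
		}
		return a
	},
	"all equal": func() []int {
		return make([]int, 1000)
	},
}

var sortsUnderTest = map[string]func([]int){
	"stand-in A": sort.Ints,
	"stand-in B": func(a []int) {
		sort.Slice(a, func(i, j int) bool { return a[i] < a[j] })
	},
}

func TestEverySortSeesEveryDataset(t *testing.T) {
	for dname, gen := range datasets {
		for sname, sortFn := range sortsUnderTest {
			got := gen()
			want := append([]int(nil), got...)
			sort.Ints(want) // reference answer
			sortFn(got)
			if !reflect.DeepEqual(got, want) {
				t.Errorf("%s sorted %q data incorrectly", sname, dname)
			}
		}
	}
}
```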

Anyway, perspective! I did this specifically because it's fun sometimes to not have to think too hard about sorting out folks' requirements and instead take an already-well-understood problem and just fiddle with computers for a bit. If I were to futz with this more, I'd much rather use it for any of the fun analysis, indexing, etc. one can build with sorts (something with actual benefit for some end user), rather than iterate more on a low-level part of the picture like this.

Questions, thoughts, non-thoughts: reply below or holler at me on the Twitters (@rf). Thanks for reading!
