@hupili
Last active January 2, 2016 16:59
Parallel sorting using Linux utils

Test setup:

  • 1.1 GB file, 100 million lines of integers.
  • 24 CPUs, 48 GB RAM
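The test file itself is not in the gist; a file of the same shape can be generated with GNU shuf (a sketch, assuming the input was simply random integers, one per line; the filename all-ids is taken from the commands below):

```shell
# Generate a sample input of random integers, one per line.
# 1 million lines here; raise -n to 100000000 to match the ~1.1 GB test file.
shuf -r -i 0-999999999 -n 1000000 > all-ids
```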

Use time ./parasort.sh all-ids 500000 20:

real	2m32.483s
user	10m52.996s
sys	0m9.066s

Compare with single core sort:

$time sort -u -n all-ids -o all-ids.sortu

real    8m44.057s
user    8m39.567s
sys     0m4.038s

Parameters

Use -S 10000000000 (allow a 10 GB sort buffer):

real	2m39.086s
user	10m54.576s
sys	0m14.016s

Use a larger part size: time ./parasort.sh all-ids 5000000 20.

real	2m16.916s
user	11m33.363s
sys	0m28.230s
#!/bin/bash
# parasort.sh -- split the input, sort the parts in parallel, then merge
if [[ $# -ne 3 ]]; then
    echo "usage: $0 {fn_input} {file_part_lines} {concurrent_workers}"
    exit 255
else
    fn_input=$1
    file_part_lines=$2
    concurrent_workers=$3
fi

echo "Prepare"
rm -rf tmp/
mkdir -p tmp/parts
mkdir -p tmp/parts.sorted

echo "break input into pieces"
split -l "$file_part_lines" "$fn_input" tmp/parts/part.

echo "sort parts in parallel"
cd tmp; ls -1 parts | xargs -P "$concurrent_workers" -I {} sh -c 'echo {}; sort -u -n parts/{} -o parts.sorted/{}'; cd -

echo "merge sort"
sort -n -u -m tmp/parts.sorted/* -o "$fn_input.sorted"
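The pipeline's correctness is easy to sanity-check on a small input: the split / sort-parts / merge steps above should produce exactly the same output as a single-process sort -u -n (a sketch; the filenames are illustrative):

```shell
# Sanity check: split + per-part sort + merge == single-process sort -u -n.
seq 1 1000 | shuf -r -n 5000 > small-input   # 5000 lines with duplicates
mkdir -p tmp/parts tmp/parts.sorted
split -l 500 small-input tmp/parts/part.
for p in tmp/parts/part.*; do
    sort -u -n "$p" -o tmp/parts.sorted/"$(basename "$p")"
done
sort -n -u -m tmp/parts.sorted/* -o small-input.sorted
sort -u -n small-input -o small-input.ref
diff small-input.sorted small-input.ref && echo OK
```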
@hupili
Copy link
Author

hupili commented Jan 9, 2014

Further optimisations:

  • You may want to set the sort buffer size with -S.
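For the parallel step, -S would go inside each worker's sort invocation, so every worker gets a fixed buffer and total memory use is roughly buffer_size times concurrent_workers (a sketch; the demo/ setup lines exist only to make the fragment runnable, and 64M is an arbitrary example value):

```shell
# Setup: a tiny split input to demonstrate on.
mkdir -p demo/parts demo/parts.sorted
seq 20 -1 1 | split -l 5 - demo/parts/part.

# Per-worker buffer via -S inside the xargs-spawned sort.
( cd demo && ls -1 parts | xargs -P 4 -I {} \
      sh -c 'sort -S 64M -u -n parts/{} -o parts.sorted/{}' )
sort -n -u -m demo/parts.sorted/* -o demo/merged
```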
