Skip to content

Instantly share code, notes, and snippets.

@errzey
Created April 20, 2011 20:10
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save errzey/932592 to your computer and use it in GitHub Desktop.
Save errzey/932592 to your computer and use it in GitHub Desktop.
Example script to show how to do an intelligent merge-sort in parallel
#!/bin/bash
# determines number of proccessors, splits a large file into sizes that
# can be consumed by n-1 sort processes (where n is the number of processors)
#
# After the file has been split up properly, it will run a sort on each split
# file in parallel. Once all processes have completed, a merge sort is executed.
#
# mthomas@n2o:~/words [100%] $ du -h big
# 1.7G big
#
# Using normal sort:
# mthomas@n2o:~/words [100%] $ time sort -T . -o blop big
# real 7m4.264s
#
# Using this shell script:
# mthomas@n2o:~/words [100%] $ time ./msort.sh big
# Splitting big into 3 pieces.
# big.001 582483968 bytes
# big.002 582483968 bytes
# big.003 582478144 bytes
# Done!
# Attempting cache on big.001
# Running sort on big.001 on cpu 0x00000001
# Attempting cache on big.002
# Running sort on big.002 on cpu 0x00000002
# Attempting cache on big.003
# Running sort on big.003 on cpu 0x00000004
# Finalizing sort into big.sorted
# Removing temporary files
#
# real 4m54.142s
#
# That's a 2 minute speedup, of course the more procs/mem you have will affect
# the time.
if [ "$#" -ne 1 ]
then
echo "You need to specify an input file"
exit 1
fi
num_procs=`cat /proc/cpuinfo | egrep ^processor | wc -l`
big_file=$1
if [ "$num_procs" -gt 1 ]
then
let proc_mask_count="$num_procs - 2"
let proc_count="$num_procs - 1"
affinity_masks=()
# create an array of processor masks for affinity tasking
for cpu_n in `seq 0 $proc_mask_count`
do
let mask="1 << $cpu_n"
affinity_masks[$cpu_n]=`printf "%#.8x" $mask`
done
# determine size of the split files based on number of procs
file_size=`du -b $1 | awk '{print $1}'`
split_size=`echo "scale=10; $file_size / $proc_count" | bc | python -c "print round(float(raw_input()))"`
echo $file_size
echo $split_size
lxsplit -s $1 $split_size\b
for i in `seq 1 $proc_count`
do
infile="$1.00$i"
let offset="$i - 1"
echo "Running sort on $infile on cpu ${affinity_masks[$offset]}"
taskset ${affinity_masks[$offset]} sort -S 30% -T . -o $infile.s $infile &
done
wait
echo "Finalizing sort into $1.sorted"
sort -o $1.sorted -m $1.0*.s
echo "Removing temporary files"
rm $1.0*
else
echo "No need for parallel sorting op"
exit 1
fi
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment