celoyd/parallel.md

## parallel.md

      
Using GNU Parallel for more efficient shell

Introduction: the problem

We want to run many processes with different arguments, and we want to have exactly as many processes running as we have cores. We could do this with a for loop that looks something like this (assuming the task is simply to echo a number):
# Get the number of cores with sysctl for OS X or nproc for Linux:
n_cores=$(sysctl -n hw.ncpu 2> /dev/null || nproc)

# The total number of processes:
n_tasks=100

# Split the processes up into groups of n_cores:
for task_group in $(seq 0 $((n_tasks / n_cores))); do
	# Enumerate the task numbers in a given group:
	first_task=$((task_group * n_cores))
	last_task=$((first_task + n_cores - 1))
	for task_number in $(seq "$first_task" "$last_task"); do
		# Ignore any overflowing task numbers (in the last group):
		if [[ $task_number -lt $n_tasks ]]; then
			echo "$task_number" & # Background the task
		fi
	done
	wait # Hold until all backgrounded tasks in the group have returned
done
There are other and arguably neater ways to do this in shell, but they all share two big disadvantages:


Complexity: There’s way too much variable-juggling relative to expressive code. No one wants to track down an off-by-one in this kind of hairball. And one of the most common cases for me is parallelizing some task over all the X and Y tiles in a bounding box, which would mean nesting this ugly construct (or one like it). No thanks.


Wastefulness: When the tasks take different amounts of time, cores lie idle. The wait means the loop is always held until the slowest of the $n_cores tasks finishes – rarely a huge waste, but it can add up for large jobs.


In many language environments, this is a solved problem. For example, Python has the multiprocessing module (among others), which provides a pool interface: we create a pool object with n workers, then call it with some function and a list of the arguments that we want for each instance of the function. However long the list is, only n workers run at the same time, and we get a list of results corresponding to the arguments. Of course we can do things like set n at runtime, so there might be 3 workers on a 4-core lappy but 15 on an EC2 instance with 16. Problem solved. But what about shell scripts?
The solution

Glory be, GNU parallel is here. If you’re familiar with xargs, it does a lot of the same things, but with the emphasis on parallelism rather than rewriting tricky pipes. At its simplest, it takes two chunks of input: a command, on the left, and arguments, on the right. Try this:
$ parallel echo ::: hi
hi

It’s combined the command with the argument, so it was exactly like running echo hi. Parallel uses the slightly bizarre separator ::: precisely because it is so odd that it’s unlikely to appear in the input. Let’s get a little more sophisticated:
$ parallel echo ::: cat dog seahorse
dog
seahorse
cat

Two things to notice: first, the arguments are split by spaces, like a shell list. You can group them with quotes:
$ parallel echo ::: "cat dog" seahorse
seahorse
cat dog

Second, they execute out of order. Look what happens if we print the numbers 1..10:
$ parallel echo ::: {1..10}
3
4
5
6
7
8
9
2
1
10

Parallel starts its tasks in the order given, but several run at once and each task’s output is printed as soon as that task finishes, so the results come back in an unpredictable order. As a starting point, just assume that the output order is completely random every time.
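If something downstream cares about order, the simplest fix is to impose it after the fact. A minimal sketch, assuming GNU parallel is on the PATH; the variable name sorted is only for illustration:

```shell
# Run ten echo jobs; the raw output order is not guaranteed.
# Sorting numerically and joining the lines gives a stable result.
sorted=$(parallel echo ::: $(seq 1 10) | sort -n | paste -sd ' ' -)
echo "$sorted"
```

Whatever order the jobs finish in, the sorted line is the same on every run.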
What if we want the arguments to appear somewhere other than at the end of the command? We can use the {} placeholder, which is substituted like a variable:
$ parallel echo {} is tasty ::: ketchup mustard "ice cream"
mustard is tasty
ketchup is tasty
ice cream is tasty

If we’re doing something like X and Y tiles, where we want every pair from two sets, we can add another group of arguments and parallel will splice them together:
$ parallel echo {} ::: A B C ::: 1 2 3
A 3
B 1
B 2
B 3
C 1
C 2
A 2
A 1
C 3
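
For the tile case from the introduction, the two groups would be the X and Y ranges. The sketch below assumes GNU parallel's positional placeholders {1} and {2}, which refer to the first and second argument groups; the 3-by-2 bounding box and the tile_X_Y naming are made up for illustration:

```shell
# Every (x, y) pair in a 3-by-2 box, one job per tile.
# {1} is the value from the first ::: group, {2} from the second.
# sort is only there to make the listing stable.
tiles=$(parallel echo "tile_{1}_{2}" ::: 0 1 2 ::: 0 1 | sort)
echo "$tiles"
```

In a real script the echo would be replaced by whatever renders or fetches one tile.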

For a more sophisticated format than a single separating space, --header : lets us assign group names and use them in the command side (which it now makes more sense to think of as the template side):
$ parallel --header : echo mmm, delicious {food} with {drink} ::: food burger hashbrowns squid ::: drink coffee "Crystal Pepsi"
mmm, delicious hashbrowns with coffee
mmm, delicious hashbrowns with Crystal Pepsi
mmm, delicious squid with coffee
mmm, delicious burger with Crystal Pepsi
mmm, delicious burger with coffee
mmm, delicious squid with Crystal Pepsi

Job count

By default, parallel runs as many simultaneous jobs as it finds cores. Pass -j and an integer to use that many jobs instead, optionally with +, -, or % to set a number relative to the core count. For example, it sometimes makes sense to use the number of cores minus 1, so that the kernel, the daemons, parallel itself, etc., aren’t competing with the workers, and we can do that with -j-1. -j200% might be useful for something that uses about 50% CPU per task because its bottleneck is network latency, for example.
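A quick way to get a feel for -j is to try the forms on a trivial task. This is a sketch assuming GNU parallel is installed; the variable name serial is only for illustration:

```shell
# One job slot: tasks run one after another, so output follows input order.
serial=$(parallel -j1 echo ::: a b c | paste -sd ' ' -)
echo "$serial"   # a b c

# Relative forms, shown on the same trivial task:
parallel -j-1 echo ::: a b c > /dev/null     # one fewer job slot than cores
parallel -j200% echo ::: a b c > /dev/null   # two job slots per core
```

With -j1 parallel behaves like a plain serial loop, which is handy for checking whether a bug comes from the task itself or from running tasks concurrently.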
Filenames

parallel often comes in handy for file lists. For example, to make a bunch of thumbnails:
$ parallel convert {} {.}_thumb.png ::: *.png

This also shows {.}, which is the argument stripped of its dot suffix. Thus, if there’s an a.png in that directory, it will become a_thumb.png instead of a.png_thumb.png.
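To see how the placeholders expand before touching any files, the commands can be printed instead of run. A sketch assuming GNU parallel; photos/a.png, b.png, and the thumbs directory are hypothetical paths, and {/.} (the basename without its extension) is a second replacement string not used above:

```shell
# --dry-run prints the commands instead of running them; -k keeps the
# printed lines in the same order as the arguments.
# Nothing is read or written, so the files do not need to exist.
cmds=$(parallel --dry-run -k convert {} {.}_thumb.png ::: photos/a.png b.png)
echo "$cmds"

# {/.} drops both the directory and the extension:
flat=$(parallel --dry-run -k convert {} thumbs/{/.}.png ::: photos/a.png)
echo "$flat"
```

Note that {.} keeps the directory part, so the thumbnail lands next to the original unless the template says otherwise.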
Other useful options

--dry-run prints the commands that parallel would run. This is great for debugging, but obviously can be overwhelming if you’re doing something that expands into thousands of commands.
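-k (--keep-order) prints the output in the same order as the input arguments. The jobs still run in parallel; only their output is held back and reordered. Think before using this – it’s relatively rare that parallelism and inter-task order are important at the same time.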
Using parallel in a pipe is often convenient. It reads arguments from stdin:
$ cat list_of_domain_names.txt | parallel -j1000% "ping -c 1 {} > {}.latency"

This also illustrates the importance of quoting a template that contains characters like > that the shell will try to parse. (Messing up quoting is my most common bug.) Parallel also has dedicated options for reading and writing files if you’re doing something particularly complex.
--colsep is useful when you’re working with CSVs – if you give it ',', '\t', or whatever, it will give you columns as numbered arguments: {1}, {2}, etc.
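Several of these options combine naturally. The sketch below assumes GNU parallel and feeds it two comma-separated rows on stdin; the menu data and the variable name pairs are invented for illustration:

```shell
# Each stdin line is one job; --colsep ',' splits it into {1} and {2}.
# -k prints the results in input order even though jobs run in parallel.
pairs=$(printf '%s\n' 'burger,coffee' 'squid,tea' |
	parallel -k --colsep ',' echo '{1} with {2}')
echo "$pairs"
```

Swapping echo for --dry-run plus the real command is a cheap way to check the column numbering against an actual CSV before running anything.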
Important caveat

If parallel gives you weird errors, you may be running a variant version, “tollef”, that sucks: check --version. You can force the proper mode with --gnu, or wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2 and build it.
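A script that depends on GNU behavior can check for it up front. This is a sketch that relies only on GNU parallel's --version output starting with the words "GNU parallel"; the variable name flavor is hypothetical, and what other variants print for --version is not assumed:

```shell
# Anything that does not identify itself as GNU parallel, including a
# missing parallel binary, is reported as "other".
if parallel --version 2> /dev/null | head -n 1 | grep -q '^GNU parallel'; then
	flavor="gnu"
else
	flavor="other"
fi
echo "parallel flavor: $flavor"
```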
Further reading

This overwhelming manpage is where I learned the most. Parallel is the kind of tool that attracts gee-whiz one-liners, but in general they seem unmaintainable and I stay away from them.