Skip to content

Instantly share code, notes, and snippets.

@Brainiarc7
Last active April 21, 2024 05:16
Show Gist options
  • Star 31 You must be signed in to star a gist
  • Fork 4 You must be signed in to fork a gist
  • Save Brainiarc7/24c966c8a001061ee86cc4bc05826bf4 to your computer and use it in GitHub Desktop.
Save Brainiarc7/24c966c8a001061ee86cc4bc05826bf4 to your computer and use it in GitHub Desktop.
How to set up a transient cluster using GNU parallel and SSHFS for distributed jobs (such as FFmpeg media encodes)

Transient compute clustering with GNU Parallel and sshfs:

GNU Parallel is a multipurpose program for running shell commands in parallel, which can often be used to replace shell script loops,find -exec, and find | xargs. It provides the --sshlogin and --sshloginfile options to farm out jobs to multiple hosts, as well as options for sending and retrieving static resources and and per-job input and output files.

For any particular task, however, keeping track of which files need to pushed to and retrieved from the remote hosts is somewhat of a hassle. Furthermore, cancelled or failed runs can leave garbage on the remote hosts, and if input and output files are large, sending them to local disk on the remote hosts is somewhat inefficient.

In a traditional cluster, this problem would be solved by giving all nodes access to a shared filesystem, usually with NFS or something more exotic. However, NFS doesn't work particularly well for machines that aren't always connected to the same LAN, and is very difficult to implement securely unless you can prevent people from connecting their personal machines to your network.

By using SSHfs instead, we can give all nodes access to a shared filesystem with no more root-level configuration than installing sshfs and adding the relevant user to the fuse group. This makes it easy to build temporary computing clusters out of a motley assortment of desktops, laptops, and friends' machines booted from USB.

GNU Parallel Configuration:

Parallel is configured with files in ~/.parallel. Files under this directory can be used to add groups of options to Parallel's command line with the -J option. For example, if you have a file ~/.parallel/local:

--nice 10
--progress

then parallel -Jlocal will behave as if you had called parallel --nice 10 --progress. The files ~/.parallel/config and ~/.parallel/sshloginfile are special. The first is included by default in all invocations of parallel, and the second can be used by calling parallel --sshloginfile .. rather than providing the path.

We use this sshloginfile to list the hosts in our cluster, like so:

lily
ghasthawk

When Parallel is used to run jobs on remote machines, it invokes that machine's copy of itself, so ~/.parallel/config must not cause parallel to log into any remote machines. If it does, an infinite loop will result. Personally, I leave this file empty and use named profiles. For clustering, we use ~/.parallel/cluster:

--nice 10
--sshloginfile ..
--progress
--workdir .

The --workdir . option causes the working directory on the local machine to be used when running commands remotely. With a shared filesystem mounted at the same path on all nodes, this allows us to distribute jobs without explicitly pushing and retrieving files.

Cluster Filesystem Setup/Teardown:

The script below activates and deactivates the shared filesystem on all nodes. The cluster is started by running clustermode.bash up on the master node, and stopped by running clustermode.bash down. The script must be executable and located somewhere on $PATH on all nodes. On my systems, this is achieved by putting it in my personal bin directory, which is opportunistically synced by git along with the rest of my config files.

We mount the master's home directory on all nodes. This allows you to use the cluster without copying your data files into a particular place. Also, the arcfour (RC4) cipher is used to reduce overhead.

#!/bin/bash

# clustermode.bash:
# Set up a shared sshfs file system in the same path (relative to $HOME)

set -ue

usage_message="Usage: clustermode.bash <up|down>"

# If script is called without specifying a master node, assume it's us.
master=${2:-$(hostname)}
sharedir="${HOME}/sshfs/${master}"

up ()
{
    # Ensure mountpoint exists.
    [ -d "$sharedir" ] || mkdir -p "$sharedir"
    # Mount shared fs; nohup required to avoid immediate unmount.
    nohup sshfs \
       -o Cipher=arcfour \
       -o follow_symlinks \
       "${master}:" "$sharedir" \
       >/dev/null
    # If we are the master node:
    if [[ "$(hostname)" == "$master" ]]; then
        for host in $(grep -v $master "${HOME}/.parallel/sshloginfile"); do
            # Recurse to connect to master node; -t required for passwords.
            # source ~/.profile required to get this script in $PATH.
            ssh -t $host "source ~/.profile; clustermode.bash up ${master}" \
                || echo "$host unreachable"
        done
        # Switch to current path in the shared filesystem.
        wd=$(pwd)
        cd "${sharedir}/${wd#${HOME}}"
        exec "$SHELL"
    fi
}

down ()
{
    fusermount -u "$sharedir"
    if [[ "$(hostname)" == "$master" ]]; then
        for host in $(grep -v $master "${HOME}/.parallel/sshloginfile"); do
            ssh -t $host "source ~/.profile; clustermode.bash down ${master}" \
                || echo "Could not deactivate ${host}."
        done
    fi
}


if [[ $# -eq 0 ]];then
    echo "$usage_message"
    exit 1
fi

if [[ "$1" == "up" ]]; then
    up
elif [[ "$1" == "down" ]]; then
    down
else
    echo "$usage_message"
fi

Running Something:

Once you have the shared filesystem set up, you can run distributed jobs in any directory beneath the mount point. Personally, I like to write a short shell script to specify exactly what each chunk of work is supposed to do, and then use parallel to apply it across the files, as shown in the ffmpeg usage script below:

#!/bin/bash

file="$1"

ffmpeg -i "$file" \
    -loglevel error \
    -c:a copy \
    -c:v libx264 \
    -profile:v high \
    -preset veryslow \
    -tune animation \
    -crf 20 \
    "./encoded/${file%.*}.mkv"

Because ffmpeg already uses all the cores on a node, we use the -j1 option so only one job runs on each node at once.

ls *.mp4 | parallel -Jcluster -j1 ./encode.bash "{}"

If you have everything set up right, you should get something like this:

Computers / CPU cores / Max jobs to run
1:ghasthawk / 2 / 1
2:lily / 2 / 1

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ghasthawk:1/14/50%/3594.8s  lily:1/14/50%/3594.8s 

Note: If you need to tune this for use with NVIDIA's NVENC based encoding on a cluster with multiple GPUs as illustrated here, change the ffmpeg command above as follows:

#!/bin/bash

file="$1"

ffmpeg -i "$file" \
    -loglevel error \
    -c:a copy \
    -filter:v hwupload_cuda,format=nv12:interp_algo=lanczos,hwdownload,format=nv12 \
    -profile:v high \
    -c:v h264_nvenc \
    -preset llhq -rc:v vbr_minqp -qmin:v 19 \
    "./encoded/${file%.*}.mkv"
@Foxtrod89
Copy link

would you provide any results? How fast it can transcode 10 bit h265 using NPP? Thx

@Brainiarc7
Copy link
Author

Sure, @Foxtrod89,

Performance is relatively linear, based on resolution and encoding source.

Taking advantage of parallelism with GNU parallel allows one to target more than one ffmpeg instance PER GPU with ease (see the -gpu option for h264_nvenc). Throughput is also dependent on some factors, namely any filters in use and the board generation you're on.

Maxwell is about 2x slower than Pascal for HEVC encoding, so adjust your options accordingly.

@marshalleq
Copy link

This is fascinating. If I'm reading correctly you're basically saying the binary you run doesn't need to be aware of any clustering configuration and no files need to be copied about or split? If so, this I something people have been asking for for years. In fact it's better. Am I understanding this correctly?

My favourite piece of software (which uses ffmpeg) is h265ize. Do you reckon I could call that from your script?

Thanks,

Marshalleq.

@Brainiarc7
Copy link
Author

@marshalleq yes, I believe you could call up such a binary.

@akhilleusuggo
Copy link

I'll give it a try, but maybe it worth changing it to something more useful. Like cutting video into segments (chunks), and send those segments to the nodes and encode those segments faster, using all node's cores, then merge (concatenate) the video back on the master.
I do use this locally (one single computer), but never achieved it across nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment