@reijovosu
Forked from rcoup/rsync_parallel.sh
Last active August 28, 2021 15:03
Parallel-ise an rsync transfer when you want multiple concurrent transfers happening,
#!/bin/bash
set -e
# Usage:
# rsync_parallel.sh [--parallel=N] [rsync args...]
#
# Options:
# --parallel=N Use N parallel processes for transfer. Defaults to 10.
#
# Notes:
# * Requires GNU Parallel
# * Use with ssh-keys. Lots of password prompts will get very annoying.
# * Does an itemize-changes first, then chunks the resulting file list and launches N parallel
# rsyncs to transfer a chunk each.
# * Be a little careful with the options you pass through to rsync. Normal ones will work, but
# you might want to test unusual options upfront.
#
if [[ "$1" == --parallel=* ]]; then
    PARALLEL="${1##*=}"
    shift
else
    PARALLEL=10
fi
echo "Using up to $PARALLEL processes for transfer..."
TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
echo "Figuring out file list..."
# sorted by size (descending)
rsync "$@" --out-format="%l %n" --no-v --dry-run | sort -n -r > "$TMPDIR/files.all"
# check for nothing-to-do
TOTAL_FILES=$(wc -l < "$TMPDIR/files.all")
if [ "$TOTAL_FILES" -eq 0 ]; then
    echo "Nothing to transfer :)"
    exit 0
fi
function array_min {
    # Print the index of the smallest value among the arguments
    local ARR=("$@")
    local i v
    # Default index for min value
    local min_i=0
    # Default min value
    local min_v=${ARR[$min_i]}
    for i in "${!ARR[@]}"; do
        v="${ARR[$i]}"
        if (( v < min_v )); then
            min_v=$v
            min_i=$i
        fi
    done
    echo "${min_i}"
}
echo "Calculating chunks..."
# declare chunk-size array
for ((I = 0 ; I < PARALLEL ; I++ )); do
    CHUNKS["$I"]=0
done
# add each file to the emptiest chunk, so they're as balanced by size as possible
PROGRESS=0
SECONDS=0
while read -r FSIZE FPATH; do
    PROGRESS=$((PROGRESS+1))
    # Original Implementation
    #MIN=($(array_min_old ${CHUNKS[@]})); MIN_I=${MIN[0]}
    # Nathan's implementation
    MIN_I=$(array_min "${CHUNKS[@]}")
    CHUNKS[${MIN_I}]=$((${CHUNKS[${MIN_I}]} + ${FSIZE}))
    echo "${FPATH}" >> "${TMPDIR}/chunk.${MIN_I}"
    if ! ((PROGRESS % 5000)); then
        >&2 echo "${SECONDS}s: ${PROGRESS} of ${TOTAL_FILES}"
    fi
done < "${TMPDIR}/files.all"
echo "${SECONDS}s"
find "$TMPDIR" -type f -name "chunk.*" -exec cat {} \;
echo "Starting transfers..."
find "$TMPDIR" -type f -name "chunk.*" | parallel -j "$PARALLEL" -t --verbose --progress rsync --files-from={} "$@"
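
The chunking step in the script (add each file to the emptiest chunk, largest files first) can be tried in isolation. The following is only a minimal sketch of that idea, not the gist's own code: the function name `balance_chunks` and the sample sizes and file names are made up for illustration, and it prints chunk assignments instead of writing chunk files.

```shell
#!/bin/bash
# Sketch of greedy size balancing (hypothetical helper, not part of the gist).
# Reads "SIZE PATH" lines from stdin, expected to be sorted by size descending,
# and prints "CHUNK_INDEX PATH" for each line.
balance_chunks() {
    local n="$1" i min_i size path
    local -a totals
    # Start every chunk with a running total of 0 bytes
    for ((i = 0; i < n; i++)); do
        totals[i]=0
    done
    while read -r size path; do
        # Find the chunk with the smallest running total (ties go to the lowest index)
        min_i=0
        for ((i = 1; i < n; i++)); do
            if (( totals[i] < totals[min_i] )); then
                min_i=$i
            fi
        done
        totals[min_i]=$(( totals[min_i] + size ))
        printf '%s %s\n' "$min_i" "$path"
    done
}

# Made-up input: four files split across two chunks
printf '%s\n' '90 big.iso' '50 mid.tar' '40 small.zip' '10 notes.txt' | balance_chunks 2
# prints:
# 0 big.iso
# 1 mid.tar
# 1 small.zip
# 0 notes.txt
```

With this input the two chunks end up at 100 and 90 bytes. Because the minimum is searched inline rather than through a command substitution per file, a variant like this may also run faster on very long file lists, though that has not been measured here.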
@harryqt

harryqt commented Apr 20, 2020

cp -R $TMPDIR/* /Users/reijo/Sites/warc/files

why is it hardcoded?


ghost commented Apr 22, 2020

Hi,

Hope you are all well!

How do I rsync a local directory to a remote directory on an AWS EC2 instance? What should the command line be?
Can you provide an example or two?

Thanks in advance for any insights or inputs on these questions.

Cheers,
X

@ylluminate

Yes, this hardcoded path is odd...

@reijovosu
Author

I removed the hardcoded path.
