@fengyuentau
Last active November 8, 2023 12:58
Multiprocessing downloading with `wget` and `xargs` in bash
#!/bin/bash
# global vars
IND_START=100
IND_END=499
TAR_PATH="../tars"
CHECK_PASS=0
CHECK_FAIL=0
CHECK_FAIL_LIST=""
# perform checksum checking on the downloaded tar files
for ((i=IND_START; i<=IND_END; i++))
do
    # zero-pad the index to three digits
    if [ $i -ge 0 ] && [ $i -le 9 ] ; then
        IND=00$i
    elif [ $i -ge 10 ] && [ $i -le 99 ] ; then
        IND=0$i
    else
        IND=$i
    fi
    FILENAME=md5.images_$IND.txt
    # each checksum file holds one line: "md5sum  tar_filename"
    MD5SUM=$(cut -d' ' -f1 "$FILENAME")
    TAR_FILENAME=$(cut -d' ' -f3 "$FILENAME") # -f2 returns the whitespace
    # compute the md5sum of the downloaded tar
    TAR_FILEPATH=$TAR_PATH/$TAR_FILENAME
    MD5SUM_OUTPUT=$(md5sum "$TAR_FILEPATH" | cut -d' ' -f1)
    if [ "$MD5SUM" = "$MD5SUM_OUTPUT" ] ; then
        CHECK_PASS=$((CHECK_PASS+1))
        echo "$TAR_FILENAME checksum passed! CHECK_PASS=$CHECK_PASS, CHECK_FAIL=$CHECK_FAIL"
    else
        CHECK_FAIL=$((CHECK_FAIL+1))
        CHECK_FAIL_LIST="$CHECK_FAIL_LIST$TAR_FILENAME\n"
        echo "$TAR_FILENAME checksum failed! CHECK_PASS=$CHECK_PASS, CHECK_FAIL=$CHECK_FAIL"
    fi
done
# write the fail list, if any, into a file
if [ -n "$CHECK_FAIL_LIST" ] ; then
    echo -e "$CHECK_FAIL_LIST" > check_fail_list.log
    cat check_fail_list.log
fi

Multiprocessing downloading shell script using wget and xargs

I was trying to download the Google Landmark dataset from here. To speed up downloading, I wrote a shell script that downloads several files from a list simultaneously:

# firstly, generate a URL_LIST which contains all urls to files that you want to download
# NOTE: use whitespace to separate each url; see https://stackoverflow.com/a/28806991/6769366 for more details.
for i in {100..200}
do
    URL_LIST=$URL_LIST\ "https://s3.amazonaws.com/google-landmark/train/images_$i.tar"
done
# NOTE: git-bash/mingw on Windows (10) does not come with `wget`.
#       To install the latest `wget`, check https://gist.github.com/evanwill/0207876c3243bbb6863e65ec5dc3f058.
#       To learn how `xargs` works, see https://stackoverflow.com/a/11850469/6769366
#       `-e` for `wget` sets the proxy. See https://superuser.com/a/526779 for details. Bear in mind that `set proxy=127.0.0.1:1080` does not work in git-bash on Windows.
#       `-q` mutes the output of `wget`.
echo $URL_LIST | xargs -n 1 -P 6 wget -e https_proxy=127.0.0.1:1080 -q
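As a quick illustration of what `xargs -n 1` does here (independent of `wget`): it splits its stdin on whitespace and invokes the command once per token, and `-P` then runs those invocations in parallel:

```shell
# xargs splits stdin on whitespace and runs `echo` once per token (-n 1);
# with -P 2 up to two of these invocations run at the same time,
# so the output order is not guaranteed.
echo "one two three" | xargs -n 1 -P 2 echo
```

In the download script, each token is a URL and each invocation is one `wget` process.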

You can run this script on Linux or Windows. In particular, to run it on Windows you need either WSL (Windows Subsystem for Linux) or git-bash with wget installed. Furthermore, on git-bash you need to run it with sh, like:

sh download.sh

See https://stackoverflow.com/a/44884649/6769366 for details.

To-Do

[x] Add a leading 0 when $i is less than 100.
[ ] Since the verbose output of wget is muted with -q, a decent progress bar is needed.
[ ] Perform the checksum in a multiprocessing manner.
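For the last to-do item, the same xargs pattern could drive the checksum step. A minimal sketch, assuming the md5.images_*.txt files use the standard `md5sum  filename` format and are run from the directory containing the tars:

```shell
# Run `md5sum -c` on each checksum file, up to 6 checks in parallel.
# Each md5.images_*.txt line is "md5sum  tar_filename", which is exactly
# the input format that `md5sum -c` expects; it prints "filename: OK"
# on success and a warning on mismatch.
ls md5.images_*.txt | xargs -n 1 -P 6 md5sum -c
```

Unlike the sequential loop above, this does not keep pass/fail counters; failures could instead be collected by grepping the output for lines not ending in `OK`.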

#!/bin/bash
# global vars
IND_START=0
IND_END=499
URL_LIST=""
# generate URL_LIST
for ((i=IND_START; i<=IND_END; i++))
do
    # zero-pad the index to three digits
    if [ $i -ge 0 ] && [ $i -le 9 ] ; then
        IND=00$i
    elif [ $i -ge 10 ] && [ $i -le 99 ] ; then
        IND=0$i
    else
        IND=$i
    fi
    URL_LIST=$URL_LIST\ "https://s3.amazonaws.com/google-landmark/train/images_$IND.tar"
done
# run `man xargs` for more on the options used here.
# `-n 1` passes 1 argument to each process.
# `-P 6` runs at most 6 processes simultaneously.
# options for `wget`:
# `-e https_proxy=127.0.0.1:1080` tells `wget` to use a proxy.
# `-q` keeps `wget` quiet, producing no output.
echo $URL_LIST | xargs -n 1 -P 6 wget -e https_proxy=127.0.0.1:1080 -q
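The three-branch zero-padding above can also be written more compactly with printf's `%03d` format. An equivalent sketch of the URL_LIST generation:

```shell
# Equivalent URL_LIST generation using printf for zero-padding:
# %03d left-pads the index with zeros to three digits (7 -> 007, 42 -> 042).
URL_LIST=""
for ((i=0; i<=499; i++))
do
    IND=$(printf '%03d' "$i")
    URL_LIST=$URL_LIST\ "https://s3.amazonaws.com/google-landmark/train/images_$IND.tar"
done
```

This produces the same list as the if/elif version for any index range.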