@noisychannel
Last active February 26, 2016 00:01
Crawl images
#!/usr/bin/env bash
# Recursively crawl a directory-listing page and download the files it links to.
set -e

ROOT=...            # base URL to crawl
MAX_RECURSION_DEPTH=10

crawl_folder() {
    local page=$1
    local folder_name=$2
    local recursion_index=$3

    # Fetch the index page for this folder.
    wget "$page" -O "${folder_name}.html" || echo "Could not wget $page"

    # Extract href targets: grep each href="..." attribute, strip the
    # leading 'href="' with cut, then drop the trailing quote.
    for i in $(grep -oh 'href="[^"]*"' "${folder_name}.html" | cut -c 7- | rev | cut -c 2- | rev)
    do
        if [[ $i == *.* ]]; then
            # Looks like a file (has an extension): download it.
            wget "$page/$i"
        else
            # Skip absolute paths, which point outside this folder.
            if [[ $i =~ ^/.* ]]; then
                continue
            fi
            # Stop descending once the depth limit is reached.
            if (( recursion_index == MAX_RECURSION_DEPTH )); then
                continue
            fi
            # Drop the trailing slash to get the folder name.
            new_folder=${i%/}
            echo "Found folder: ${new_folder}"
            mkdir "${new_folder}"
            cd "${new_folder}"
            crawl_folder "${page}/${new_folder}" "${new_folder}" $((recursion_index + 1))
            cd ..
        fi
    done
}

crawl_folder "$ROOT" root 1
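
A quick sketch of the link-extraction pipeline the loop relies on; the sample HTML line here is made up for illustration:

```shell
# grep -oh prints each href="..." attribute on its own line,
# cut -c 7- strips the leading 'href="', and rev | cut -c 2- | rev
# strips the trailing quote, leaving the bare link targets.
sample='<a href="photo.jpg"> <a href="subdir/"> some text'
echo "$sample" | grep -oh 'href="[^"]*"' | cut -c 7- | rev | cut -c 2- | rev
# prints:
# photo.jpg
# subdir/
```

Note that `[^"]*` (rather than the greedy `.*`) keeps two hrefs on the same line from being merged into a single match.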