Download images and metadata from a Wikimedia Commons category or results page
#!/bin/bash
# Download images from a Wikimedia Commons category
# The highest-resolution preview of each image, plus its metadata as XML, is saved in subfolders.
# Usage: run this script on a Linux command line, passing the full URL of the category page as the first argument.
# Required shell tools: wget, sed, grep
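#
# Example invocation (hypothetical script name and category, for illustration only):
#   ./download_wiki_images.sh "https://commons.wikimedia.org/wiki/Category:Example"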
if [ "$WIKI_URL" == '' ]; then
echo "Please provide a link to the category page you wish to download"
exit 1
# Get the main (category) page provided
echo "Downloading Image Pages"
wget --adjust-extension -nv -r -l 1 -A '*File:*' -e robots=off -w 1 -nc "$WIKI_URL"
# We could skip the next step by rewriting the thumbnail URLs directly, e.g.
# from https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Example.jpg/640px-Example.jpg
# to   https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg
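# A sketch of that shortcut (an assumption based on the standard
# upload.wikimedia.org thumbnail layout shown above, not part of this script):
#   sed 's|/thumb/\(.\+\)/[0-9]\+px-[^/]*$|/\1|'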
echo "Extracting Image Links"
# The recursive fetch above mirrors the File: pages under commons.wikimedia.org/wiki/
WIKI_LINKS=$(grep -h fullImageLink commons.wikimedia.org/wiki/File:* | sed 's/^.*a href="//' | sed 's/".*$//')
echo "Downloading Images"
wget -nv -nc -w 1 -e robots=off -P downloaded_wiki_images $WIKI_LINKS
# Use the original service
# ...or use Oleg's patched version
# ...or host your own by cloning
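# NOTE: the endpoint below is a placeholder, not a real service URL; point it at
# whichever metadata API you picked above. It must return XML when the image
# filename is appended.
API_URL='https://example.org/commonsapi.php?image='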
echo "Downloading Metadata"
mkdir -p downloaded_meta_data
cd downloaded_meta_data
for f in ../downloaded_wiki_images/*; do
    IMAGE_FILE=$(basename "$f")
    # Clean out the pixel size and convert spaces to underscores
    IMAGE_FILE=$(echo "$IMAGE_FILE" | sed 's/ /_/g' | sed 's/.*px-//')
    # Remove the thumbnail extension from the filename (e.g. .tif.jpg -> .tif)
    IMAGE_FILE=$(echo "$IMAGE_FILE" | sed 's/\.\([a-z]\+\)\.jpg/\.\1/')
    # Fetch using the remote API service configured above
    wget -nv -nc -w 1 -e robots=off -O "$IMAGE_FILE.xml" "$API_URL$IMAGE_FILE"
done
cd ..
echo "Cleaning up image filenames"
cd downloaded_wiki_images
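# Strip the "NNNpx-" thumbnail prefix, e.g. "800px-Foo.jpg" -> "Foo.jpg".
# (If the category contains non-JPEG files, repeat for .png etc.)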
for i in *.jpg; do j=$(echo "$i" | sed 's/.*[0-9]px-//'); [ "$i" != "$j" ] && mv -n "$i" "$j"; done
cd ..
echo "Cleaning up temp files"
rm -rf
echo "Done"