rjw57/micropasts-download.ipynb

## micropasts-download.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "BUCKET_URL = 'http://micropasts-palstaves2.s3.amazonaws.com/'\n",
    "TO_WHERE = 'palstaves2' # directory we'll end up downloading things to"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the URL above is an XML document which points to all the rest of the data. We firstly need to download the data. Python comes with a web grabber built in. In Python 3 this is part of the urllib.request library. In Python 2 it's part of the urllib2 library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from urllib.request import urlopen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use ``urlopen`` to fetch a document from the Intertrons."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with urlopen(BUCKET_URL) as f:\n",
    "    document_text = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's test this by printing the first 200 characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Name>micropasts-palstaves2</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsT'\n"
     ]
    }
   ],
   "source": [
    "print(document_text[:200])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need to parse this as XML. Python comes with its own XML parser. The parser takes a file-like object. Fortunately the ``urlopen`` function returns one. We can thus download and parse all in one go!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "with urlopen(BUCKET_URL) as f:\n",
    "    document = ET.parse(f)\n",
    "document_root = document.getroot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The document root is a ``<ListBucketResult>`` element which has each file listed in a separate ``<Contents>`` element. Each element is namespaced which would be a pain to type out each time. Hence we'll create a custom prefix ``s3`` for Amazon S3 elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ns = { 's3': 'http://s3.amazonaws.com/doc/2006-03-01/' }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the namespace prefix to match elements without having to specify the full URL which is tiresome. Test this by counting the number of ``Contents`` tags:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of files: 609\n"
     ]
    }
   ],
   "source": [
    "print('Number of files: {}'.format(len(document_root.findall('s3:Contents', ns))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Seems to work. Let's create an array of elements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "contents = document_root.findall('s3:Contents', ns)\n",
    "assert len(contents) > 0 # Throws an exception if there are no files to download!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each ``Contents`` element has a ``Key`` element which stores the actual path to the file. Use a list comprehension to extract a list of keys:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of keys: 609\n"
     ]
    }
   ],
   "source": [
    "keys = [c.find('s3:Key', ns).text for c in contents]\n",
    "print('Number of keys: {}'.format(len(keys)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to form a URL for each key. Again Python has us covered. ``urllib.parse`` has a handy ``urljoin`` function which does the hard work for us. (In Python 2 this is in the ``urlparse`` module I *think*.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from urllib.parse import urljoin\n",
    "\n",
    "urls = [urljoin(BUCKET_URL, key) for key in keys]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the first few:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/\n",
      "http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/\n",
      "http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG\n",
      "http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG\n",
      "http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG\n"
     ]
    }
   ],
   "source": [
    "for u in urls[:5]:\n",
    "    print(u)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, getting there. We now need to download all of the files. If the URL ends in a ``/``, it's a directory. If it doesn't we'll assume it's a file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3520.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3521.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3522.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3523.JPG...\n",
      "Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3524.JPG...\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import shutil\n",
    "\n",
    "for key in keys[:10]:\n",
    "    # Get the URL for this key\n",
    "    url = urljoin(BUCKET_URL, key)\n",
    "    \n",
    "    # Compute the filename to download this to\n",
    "    path = os.path.join(TO_WHERE, key)\n",
    "    dirname = os.path.dirname(path)\n",
    "    filename = os.path.basename(path)\n",
    "    \n",
    "    # Skip URLs corresponding only to directories\n",
    "    if url.endswith('/'):\n",
    "        continue\n",
    "    \n",
    "    # Does the destination directory exist? If not, make it so (#1).\n",
    "    if not os.path.isdir(dirname):\n",
    "        os.makedirs(dirname)\n",
    "        \n",
    "    # Now we need to download the file...\n",
    "    print('Downloading {}...'.format(url))\n",
    "    with urlopen(url) as src, open(path, 'wb') as dst:\n",
    "        shutil.copyfileobj(src, dst)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
	{
	"cells": [
	{
	"cell_type": "code",
	"execution_count": 41,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"BUCKET_URL = 'http://micropasts-palstaves2.s3.amazonaws.com/'\n",
	"TO_WHERE = 'palstaves2' # directory we'll end up downloading things to"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"At the URL above is an XML document which points to all the rest of the data. We firstly need to download the data. Python comes with a web grabber built in. In Python 3 this is part of the urllib.request library. In Python 2 it's part of the urllib2 library."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"from urllib.request import urlopen"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can use ``urlopen`` to fetch a document from the Intertrons."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"with urlopen(BUCKET_URL) as f:\n",
	" document_text = f.read()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's test this by printing the first 200 characters:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"b'<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Name>micropasts-palstaves2</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsT'\n"
	]
	}
	],
	"source": [
	"print(document_text[:200])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We now need to parse this as XML. Python comes with its own XML parser. The parser takes a file-like object. Fortunately the ``urlopen`` function returns one. We can thus download and parse all in one go!"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 15,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import xml.etree.ElementTree as ET\n",
	"with urlopen(BUCKET_URL) as f:\n",
	" document = ET.parse(f)\n",
	"document_root = document.getroot()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The document root is a ``<ListBucketResult>`` element which has each file listed in a separate ``<Contents>`` element. Each element is namespaced which would be a pain to type out each time. Hence we'll create a custom prefix ``s3`` for Amazon S3 elements."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 23,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"ns = { 's3': 'http://s3.amazonaws.com/doc/2006-03-01/' }"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can now use the namespace prefix to match elements without having to specify the full URL which is tiresome. Test this by counting the number of ``Contents`` tags:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 26,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Number of files: 609\n"
	]
	}
	],
	"source": [
	"print('Number of files: {}'.format(len(document_root.findall('s3:Contents', ns))))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Seems to work. Let's create an array of elements:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 29,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"contents = document_root.findall('s3:Contents', ns)\n",
	"assert len(contents) > 0 # Throws an exception if there are no files to download!"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Each ``Contents`` element has a ``Key`` element which stores the actual path to the file. Use a list comprehension to extract a list of keys:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 35,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Number of keys: 609\n"
	]
	}
	],
	"source": [
	"keys = [c.find('s3:Key', ns).text for c in contents]\n",
	"print('Number of keys: {}'.format(len(keys)))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now we need to form a URL for each key. Again Python has us covered. ``urllib.parse`` has a handy ``urljoin`` function which does the hard work for us. (In Python 2 this is in the ``urlparse`` module I think.)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 42,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"from urllib.parse import urljoin\n",
	"\n",
	"urls = [urljoin(BUCKET_URL, key) for key in keys]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's take a look at the first few:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 43,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/\n",
	"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/\n",
	"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG\n",
	"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG\n",
	"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG\n"
	]
	}
	],
	"source": [
	"for u in urls[:5]:\n",
	" print(u)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"OK, getting there. We now need to download all of the files. If the URL ends in a ``/``, it's a directory. If it doesn't we'll assume it's a file."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 47,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3520.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3521.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3522.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3523.JPG...\n",
	"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3524.JPG...\n"
	]
	}
	],
	"source": [
	"import os\n",
	"import shutil\n",
	"\n",
	"for key in keys[:10]:\n",
	" # Get the URL for this key\n",
	" url = urljoin(BUCKET_URL, key)\n",
	" \n",
	" # Compute the filename to download this to\n",
	" path = os.path.join(TO_WHERE, key)\n",
	" dirname = os.path.dirname(path)\n",
	" filename = os.path.basename(path)\n",
	" \n",
	" # Skip URLs corresponding only to directories\n",
	" if url.endswith('/'):\n",
	" continue\n",
	" \n",
	" # Does the destination directory exist? If not, make it so (#1).\n",
	" if not os.path.isdir(dirname):\n",
	" os.makedirs(dirname)\n",
	" \n",
	" # Now we need to download the file...\n",
	" print('Downloading {}...'.format(url))\n",
	" with urlopen(url) as src, open(path, 'wb') as dst:\n",
	" shutil.copyfileobj(src, dst)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.4.3"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}