{
"cells": [
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"BUCKET_URL = 'http://micropasts-palstaves2.s3.amazonaws.com/'\n",
"TO_WHERE = 'palstaves2' # directory we'll end up downloading things to"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the URL above is an XML document which points to all the rest of the data. We firstly need to download the data. Python comes with a web grabber built in. In Python 3 this is part of the urllib.request library. In Python 2 it's part of the urllib2 library."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from urllib.request import urlopen"
]
},
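{
"cell_type": "markdown",
"metadata": {},
"source": [
"The import above is Python 3 only. If you also want the notebook to run under Python 2, a small compatibility shim (just a sketch, not needed here) does the trick. Note, though, that the ``with urlopen(...)`` blocks later on rely on the response object being a context manager, which Python 2's ``urllib2`` does not provide."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A sketch of a version-agnostic import: on Python 3 the first\n",
"# import succeeds; on Python 2 we fall back to urllib2, whose\n",
"# urlopen behaves the same way for our purposes.\n",
"try:\n",
"    from urllib.request import urlopen  # Python 3\n",
"except ImportError:\n",
"    from urllib2 import urlopen  # Python 2"
]
},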
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use ``urlopen`` to fetch a document from the Intertrons."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"with urlopen(BUCKET_URL) as f:\n",
" document_text = f.read()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's test this by printing the first 200 characters:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Name>micropasts-palstaves2</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsT'\n"
]
}
],
"source": [
"print(document_text[:200])"
]
},
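{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the leading ``b``: ``f.read()`` hands back raw bytes, not text. The XML parser below is happy to consume bytes, but if we want to inspect the content as a string we can decode it using the encoding the XML prolog declares (UTF-8). A quick sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Decode the raw bytes into text using the encoding declared in\n",
"# the XML prolog. The first 200 bytes here are plain ASCII, so\n",
"# slicing before decoding is safe in this case.\n",
"print(document_text[:200].decode('utf-8'))"
]
},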
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now need to parse this as XML. Python comes with its own XML parser. The parser takes a file-like object. Fortunately the ``urlopen`` function returns one. We can thus download and parse all in one go!"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET\n",
"with urlopen(BUCKET_URL) as f:\n",
" document = ET.parse(f)\n",
"document_root = document.getroot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The document root is a ``<ListBucketResult>`` element which has each file listed in a separate ``<Contents>`` element. Each element is namespaced which would be a pain to type out each time. Hence we'll create a custom prefix ``s3`` for Amazon S3 elements."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ns = { 's3': 'http://s3.amazonaws.com/doc/2006-03-01/' }"
]
},
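{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the prefix buys us: without the mapping we would have to spell the full namespace URL into every tag using ElementTree's ``{uri}tag`` notation. The two queries below should find exactly the same elements."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The explicit '{uri}tag' form and the 's3:' prefix resolved via\n",
"# the ns mapping are equivalent ways to name the same element.\n",
"full = document_root.findall('{http://s3.amazonaws.com/doc/2006-03-01/}Contents')\n",
"short = document_root.findall('s3:Contents', ns)\n",
"assert len(full) == len(short)"
]
},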
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use the namespace prefix to match elements without having to specify the full URL which is tiresome. Test this by counting the number of ``Contents`` tags:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of files: 609\n"
]
}
],
"source": [
"print('Number of files: {}'.format(len(document_root.findall('s3:Contents', ns))))"
]
},
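{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat worth checking: an S3 bucket listing returns at most ``MaxKeys`` (1000) entries per request and sets ``IsTruncated`` to ``true`` when more remain, as hinted at by the ``<MaxKeys>1000</MaxKeys>`` and the cut-off ``<IsT...`` in the raw output earlier. With 609 files we are under the limit, but a quick assertion guards against silently missing data. A minimal sketch, assuming the standard S3 listing schema:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# If the listing were truncated we would have to issue follow-up\n",
"# requests passing a 'marker' query parameter; here we simply\n",
"# assert that everything arrived in this single response.\n",
"truncated = document_root.find('s3:IsTruncated', ns)\n",
"assert truncated is not None and truncated.text == 'false'"
]
},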
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Seems to work. Let's create an array of elements:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"contents = document_root.findall('s3:Contents', ns)\n",
"assert len(contents) > 0 # Throws an exception if there are no files to download!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each ``Contents`` element has a ``Key`` element which stores the actual path to the file. Use a list comprehension to extract a list of keys:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of keys: 609\n"
]
}
],
"source": [
"keys = [c.find('s3:Key', ns).text for c in contents]\n",
"print('Number of keys: {}'.format(len(keys)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to form a URL for each key. Again Python has us covered. ``urllib.parse`` has a handy ``urljoin`` function which does the hard work for us. (In Python 2 this is in the ``urlparse`` module I *think*.)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from urllib.parse import urljoin\n",
"\n",
"urls = [urljoin(BUCKET_URL, key) for key in keys]"
]
},
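{
"cell_type": "markdown",
"metadata": {},
"source": [
"One wrinkle to be aware of: ``urljoin`` does not percent-encode the key, so keys containing spaces or other URL-unsafe characters would produce broken URLs. This bucket's keys are plain enough that it doesn't matter here, but a defensive sketch would quote each key first (``quote`` leaves ``/`` unescaped by default, so the path structure survives):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Defensive variant: percent-encode each key before joining. The\n",
"# default safe='/' keeps the path separators intact.\n",
"from urllib.parse import quote\n",
"\n",
"urls = [urljoin(BUCKET_URL, quote(key)) for key in keys]"
]
},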
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the first few:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/\n",
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/\n",
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG\n",
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG\n",
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG\n"
]
}
],
"source": [
"for u in urls[:5]:\n",
" print(u)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, getting there. We now need to download all of the files. If the URL ends in a ``/``, it's a directory. If it doesn't we'll assume it's a file."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3520.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3521.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3522.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3523.JPG...\n",
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3524.JPG...\n"
]
}
],
"source": [
"import os\n",
"import shutil\n",
"\n",
"for key in keys[:10]:\n",
" # Get the URL for this key\n",
" url = urljoin(BUCKET_URL, key)\n",
" \n",
" # Compute the filename to download this to\n",
" path = os.path.join(TO_WHERE, key)\n",
" dirname = os.path.dirname(path)\n",
" filename = os.path.basename(path)\n",
" \n",
" # Skip URLs corresponding only to directories\n",
" if url.endswith('/'):\n",
" continue\n",
" \n",
" # Does the destination directory exist? If not, make it so (#1).\n",
" if not os.path.isdir(dirname):\n",
" os.makedirs(dirname)\n",
" \n",
" # Now we need to download the file...\n",
" print('Downloading {}...'.format(url))\n",
" with urlopen(url) as src, open(path, 'wb') as dst:\n",
" shutil.copyfileobj(src, dst)"
]
},
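{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the loop is interrupted, re-running it downloads everything from scratch. A resumable variant can skip files that already exist locally with the expected size, taken from the ``<Size>`` element that S3 includes in each ``<Contents>`` entry. A sketch, assuming the standard listing schema:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Map each key to its expected size from the bucket listing.\n",
"sizes = {c.find('s3:Key', ns).text: int(c.find('s3:Size', ns).text)\n",
"         for c in contents}\n",
"\n",
"for key in keys:\n",
"    if key.endswith('/'):\n",
"        continue  # directory placeholder, nothing to fetch\n",
"    path = os.path.join(TO_WHERE, key)\n",
"    # Skip files already downloaded in full\n",
"    if os.path.isfile(path) and os.path.getsize(path) == sizes[key]:\n",
"        continue\n",
"    os.makedirs(os.path.dirname(path), exist_ok=True)\n",
"    url = urljoin(BUCKET_URL, key)\n",
"    print('Downloading {}...'.format(url))\n",
"    with urlopen(url) as src, open(path, 'wb') as dst:\n",
"        shutil.copyfileobj(src, dst)"
]
},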
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}