Created
October 12, 2015 11:35
-
-
Save rjw57/d6d0a707b13447a6842c to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"BUCKET_URL = 'http://micropasts-palstaves2.s3.amazonaws.com/'\n", | |
"TO_WHERE = 'palstaves2' # directory we'll end up downloading things to" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"At the URL above is an XML document which points to all the rest of the data. We firstly need to download the data. Python comes with a web grabber built in. In Python 3 this is part of the urllib.request library. In Python 2 it's part of the urllib2 library." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from urllib.request import urlopen" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can use ``urlopen`` to fetch a document from the Intertrons." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"with urlopen(BUCKET_URL) as f:\n", | |
" document_text = f.read()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's test this by printing the first 200 characters:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"b'<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Name>micropasts-palstaves2</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsT'\n" | |
] | |
} | |
], | |
"source": [ | |
"print(document_text[:200])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We now need to parse this as XML. Python comes with its own XML parser. The parser takes a file-like object. Fortunately the ``urlopen`` function returns one. We can thus download and parse all in one go!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import xml.etree.ElementTree as ET\n", | |
"with urlopen(BUCKET_URL) as f:\n", | |
" document = ET.parse(f)\n", | |
"document_root = document.getroot()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The document root is a ``<ListBucketResult>`` element which has each file listed in a separate ``<Contents>`` element. Each element is namespaced which would be a pain to type out each time. Hence we'll create a custom prefix ``s3`` for Amazon S3 elements." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"ns = { 's3': 'http://s3.amazonaws.com/doc/2006-03-01/' }" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can now use the namespace prefix to match elements without having to specify the full URL which is tiresome. Test this by counting the number of ``Contents`` tags:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Number of files: 609\n" | |
] | |
} | |
], | |
"source": [ | |
"print('Number of files: {}'.format(len(document_root.findall('s3:Contents', ns))))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Seems to work. Let's create an array of elements:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"contents = document_root.findall('s3:Contents', ns)\n", | |
"assert len(contents) > 0 # Throws an exception if there are no files to download!" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Each ``Contents`` element has a ``Key`` element which stores the actual path to the file. Use a list comprehension to extract a list of keys:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Number of keys: 609\n" | |
] | |
} | |
], | |
"source": [ | |
"keys = [c.find('s3:Key', ns).text for c in contents]\n", | |
"print('Number of keys: {}'.format(len(keys)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we need to form a URL for each key. Again Python has us covered. ``urllib.parse`` has a handy ``urljoin`` function which does the hard work for us. (In Python 2 this is in the ``urlparse`` module I *think*.)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from urllib.parse import urljoin\n", | |
"\n", | |
"urls = [urljoin(BUCKET_URL, key) for key in keys]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's take a look at the first few:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/\n", | |
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/\n", | |
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG\n", | |
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG\n", | |
"http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG\n" | |
] | |
} | |
], | |
"source": [ | |
"for u in urls[:5]:\n", | |
" print(u)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"OK, getting there. We now need to download all of the files. If the URL ends in a ``/``, it's a directory. If it doesn't we'll assume it's a file." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3520.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3521.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3522.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3523.JPG...\n", | |
"Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3524.JPG...\n" | |
] | |
} | |
], | |
"source": [ | |
"import os\n", | |
"import shutil\n", | |
"\n", | |
"for key in keys[:10]:\n", | |
" # Get the URL for this key\n", | |
" url = urljoin(BUCKET_URL, key)\n", | |
" \n", | |
" # Compute the filename to download this to\n", | |
" path = os.path.join(TO_WHERE, key)\n", | |
" dirname = os.path.dirname(path)\n", | |
" filename = os.path.basename(path)\n", | |
" \n", | |
" # Skip URLs corresponding only to directories\n", | |
" if url.endswith('/'):\n", | |
" continue\n", | |
" \n", | |
" # Does the destination directory exist? If not, make it so (#1).\n", | |
" if not os.path.isdir(dirname):\n", | |
" os.makedirs(dirname)\n", | |
" \n", | |
" # Now we need to download the file...\n", | |
" print('Downloading {}...'.format(url))\n", | |
" with urlopen(url) as src, open(path, 'wb') as dst:\n", | |
" shutil.copyfileobj(src, dst)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.4.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment