Skip to content

Instantly share code, notes, and snippets.

@astrofrog
Last active November 5, 2015 09:19
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save astrofrog/0fcb7714be5cb7b04f78 to your computer and use it in GitHub Desktop.
Save astrofrog/0fcb7714be5cb7b04f78 to your computer and use it in GitHub Desktop.
Remote data access from Python - #dotastro - Day Zero
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Accessing remote resources from Python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from __future__ import print_function, unicode_literals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dependencies\n",
"\n",
"To install dependencies for this tutorial, if you have conda:\n",
"\n",
" conda install requests beautifulsoup4\n",
" pip install twitter ads pygithub\n",
" \n",
"and if not using conda:\n",
"\n",
" pip install requests beautifulsoup4 twitter ads pygithub"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to GET, POST, and PUT"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we access a web address (URL) through a browser, we are sending requests and receiving back responses from the server. There are different kinds of requests, and we will look at two of these here:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GET requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the default type of request when you open a URL in a browser. For example, if you access http://www.google.com, you are implicitly doing a GET request. Here is what Google returns when you send a\n",
"\n",
" GET http://www.google.com\n",
" \n",
"request:\n",
"\n",
" 200 OK\n",
" Date: Mon, 02 Nov 2015 05:58:03 GMT\n",
" Expires: -1\n",
" Cache-Control: private, max-age=0\n",
" Content-Type: text/html; charset=UTF-8\n",
" p3p: CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\"\n",
" Content-Encoding: gzip\n",
" Server: gws\n",
" X-XSS-Protection: 1; mode=block\n",
" X-Frame-Options: SAMEORIGIN\n",
" Set-Cookie: PREF=ID=1111111111111111:FF=0:TM=1446443883:LM=1446443883:V=1:S=MArUR2er4w4bbp8V; expires=Thu, 31-Dec-2015 16:02:17 GMT; path=/; domain=.google.com.au NID=73=Ge4XbDeJ8ahg7gLQOb3tlZPb-54GTW8SQmEifTRC9RpYnKywKCJh0zg-yiW3kL5MVgn6iMS9zmdIK-FBLjkGaC_yt4zIPlDFoiT5NTUZ-k_yeH28_1jHwgYWXugdMJtV1OCp3VUNlxE7A67vQdihcnsZuzcixAe5; expires=Tue, 03-May-2016 05:58:03 GMT; path=/; domain=.google.com.au; HttpOnly\n",
" alternate-protocol: 443:quic,p=1\n",
" Alt-Svc: quic=\"www.google.com:443\"; p=\"1\"; ma=600,quic=\":443\"; p=\"1\"; ma=600\n",
" X-Firefox-Spdy: h2\n",
"\n",
" <!doctype html><html itemscope=\"\" itemtype=\"http://schema.org/WebPage\" lang=\"en-AU\"><head><meta content=\"/logos/doodles/2015/george-booles-200th-birthday-5636122663190528.2-hp.gif\" itemprop=\"image\"><link href=\"/images/branding/product\n",
" etc.\n",
" \n",
"This is the full response, which we never see in practice. Note the content type:\n",
"\n",
" Content-Type: text/html; charset=UTF-8\n",
"\n",
"which we'll look at again later. A GET request can also include data in the URL: https://www.google.com.au/?q=dotastronomy\n",
"\n",
"A GET request should not have any side effects, i.e. it should not change anything on the server."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### POST and PUT requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike a GET request, a POST and PUT requests are used to send data that may for example be stored on the server, so it will explicly write data to the server (the distinction between the two is beyond the scope of this tutorial - you can read people discussing it [here](http://stackoverflow.com/questions/630453/put-vs-post-in-rest))\n",
"\n",
"However, data is not encoded in the URL, but instead is encoded and sent during the request. There is no easy way to sent a post request from the browser, but we can do this from Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The requests library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python includes a library to get and post data to the web, [urllib](https://docs.python.org/3.5/library/urllib.html), but it is not straightforward to use, so a group of developers made [requests](http://docs.python-requests.org), a Python library that does *HTTP for Humans*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r = requests.get('http://www.google.com')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.status_code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.headers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.headers['content-type']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.content[:1000]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.text[:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Including data in the GET request is easy:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r = requests.get('http://www.google.com', params={'q': 'dotastronomy'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.status_code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.request.url"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To send a post request instead, we can simply replace ``requests.get`` by ``requests.post``."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scraping HTML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we want to extract ('scrape') data off an ADS webpage:\n",
"\n",
"http://adsabs.harvard.edu/abs/2013A%26A...558A..33A\n",
"\n",
"We can start off by getting the source for the page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"r = requests.get('http://adsabs.harvard.edu/abs/2013A%26A...558A..33A')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(r.text[:1000].strip())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) to parse this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import bs4\n",
"soup = bs4.BeautifulSoup(r.content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"soup.title"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"soup.find_all('a')[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"table = soup.find_all('table')[0]\n",
"table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the above example is for demonstration only. In practice, there is a proper way to access the ADS data that does not involve scraping (see further down)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## APIs (Application programming interfaces)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, scraping HTML code is hard, and should be avoided whenever possible. A number of websites now offer APIs, which are documented ways of accessing machine-readable code.\n",
"\n",
"For example, GitHub offers a way to access data about users and repositories through an API that is described [here](https://developer.github.com/v3/)\n",
"\n",
"Many APIs require authentication, which I won't cover here, but there are a few that do not.\n",
"\n",
"Let's take a look at one of the examples for GitHub which does not require authentication: https://developer.github.com/v3/users/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"r = requests.get('https://api.github.com/users/astrofrog')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"r.headers['content-type']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like the data was returned in JSON. Let's take a look in more detail!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## JSON (JavaScript Object Notation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"JSON is a very common data format used for many APIs. A JSON object is a string that basically looks to Python users like a set of strings, lists, and dictionaries. We can easily transform a JSON object into an actual Python object with the requests library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = r.json()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is now a Python dictionary! We can access keys with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data['name']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data['location']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, APIs such as this can also accept POST requests for some of the actions that would have an effect on the repository or the user (see e.g. [this](https://developer.github.com/v3/repos/releases/#create-a-release) for an example of a possible POST request). The data can be passed to ``requests.post`` using a normal Python dictionary.\n",
"\n",
"Some APIs also use PUT instead of POST - in that case, use ``requests.put``. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need to parse JSON manually, you can use the built-in [json](https://docs.python.org/3.5/library/json.html) library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import json\n",
"d = json.loads('{\"a\":1}')\n",
"d['a']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Specialized libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In some cases, you don't actually need to use the APIs directly, but you can use exiting packages that will provide a 'Pythonic' interface to various websites.\n",
"\n",
"### Github\n",
"\n",
"For example, we can use [PyGithub](https://github.com/PyGithub/PyGithub):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from github import Github"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"g = Github()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"user = g.get_user('astrofrog')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"user.name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"user.location"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Twitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are also libraries for e.g. [Twitter](http://www.twitter.com):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from twitter import Twitter\n",
"from twitter import oauth_dance, OAuth"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Normally these should be secret but I am using a dummy account, so this is fine\n",
"CONSUMER_KEY = 'igqeyII44KCFOFLGqrYYjU6qo'\n",
"CONSUMER_SECRET = '88vrcN4xjrDPyMNr8jgpB82ar3nddFtRfjqOjBjTwG7iSXkZr2'\n",
"\n",
"oauth_token, oauth_token_secret = oauth_dance(\"Test Bot\", CONSUMER_KEY, CONSUMER_SECRET)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"oauth_token, oauth_token_secret"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"t = Twitter(auth=OAuth(oauth_token, oauth_token_secret, CONSUMER_KEY, CONSUMER_SECRET))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"t.statuses.home_timeline()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"result = t.statuses.update(status=\"Testing, 1, 2, 3\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tweets = t.search.tweets(q='#dotastro')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"len(tweets['statuses'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tweets['statuses'][0]['text']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"response = t.statuses.update(status='FRA ✈ SYD')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ADS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A final example is the [ads](https://github.com/andycasey/ads) library (``pip install ads``). To use this, we need an API key which we get by creating an account [here](https://ui.adsabs.harvard.edu), which we then put in ``.ads/dev_key``."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import ads"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"papers = ads.SearchQuery(q=\"author:hogg,d.\", sort=\"citation_count\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for paper in papers:\n",
" print(paper.title[0], paper.citation_count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Python on the server"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are interested in developing a web server than *runs* Python, you can look into [Django](https://www.djangoproject.com/) and [Flask](http://flask.pocoo.org/). You can then host these types of apps on services like [Heroku](https://www.heroku.com/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment