@jonathanmorgan
Last active January 12, 2017 11:31
APIs - getting, storing and using data
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# APIs - getting, storing, and using data from 3rd parties\n",
"\n",
"## Table of Contents\n",
"\n",
"- [What is an API?](#What-is-an-API?)\n",
"- [HTTP](#HTTP)\n",
"\n",
" - [HTTP request](#HTTP-request)\n",
" - [HTTP response](#HTTP-response)\n",
" \n",
"- [Making HTTP API requests](#Making-HTTP-API-requests)\n",
"\n",
" - [HTTP with Python and `requests`](#HTTP-with-Python-and-requests)\n",
" - [Authentication](#Authentication)\n",
" \n",
" - [Basic HTTP authentication](#Basic-HTTP-authentication)\n",
" - [API key authentication](#API-key-authorization)\n",
" - [Oauth and Oauth2](#Oauth-and-Oauth2)\n",
" \n",
"- [Working with API responses](#Working-with-API-responses)\n",
"\n",
" - [Data formats](#Data-formats)\n",
" \n",
" - [JSON](#JSON)\n",
" - [XML](#XML)\n",
" \n",
"- [A REST-based API example - OpenCalais](#A-REST-based-API-example---OpenCalais)\n",
"- [API client libraries](#API-client-libraries)\n",
"\n",
" - [The `twitter` library - Collecting from public Twitter stream](#The-twitter-library---Collecting-from-public-Twitter-stream)\n",
" \n",
" - [Setting up twitter API authentication](#Setting-up-twitter-API-authentication)\n",
" - [twitter streaming API example](#twitter-streaming-API-example)\n",
" - [Sample JSON of tweet](#Sample-JSON-of-tweet)\n",
" \n",
" - [`tweepy` - an object-oriented alternative](#tweepy---an-object-oriented-alternative)\n",
" \n",
"- [Capturing data](#Capturing-data)\n",
"- [Practical considerations](#Practical-considerations)\n",
"\n",
" - [Know and follow the rules of APIs](#Know-and-follow-the-rules-of-APIs)\n",
" - [Performance](#Performance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is an API?\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"**API** stands for \"Application Programming Interface\". An API is an agreed upon way for one computer program to interact with another computer program. There are many different kinds of APIs. Some facilitate interaction between computers over the Internet, some do not. For this class, however, we are going to focus on Internet APIs, and more specifically, Internet APIs that interact through HTTP."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTTP\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"References:\n",
"\n",
"- _based in part on information from Wikipedia: [https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol)_\n",
"\n",
"HTTP stands for \"HyperText Transfer Protocol\". HTTP, including the secure version of HTTP, HTTPS, is a request/response protocol that is used for a substantial amount of the traffic carried by the Internet. It is the underlying network framework that web browsers use to interact with web servers.\n",
"\n",
"HTTP is known as a request/response protocol because an HTTP transaction always includes a request and a corresponding response. You ask for a web page, the server on which the page lives sends you the HTML of the page, so your browser can render it. An API client asks for the most recent tweets for a given Twitter user, the Twitter server sends back the requested information (as long as you are authenticated and the user is either public or someone you follow)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTTP request\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"An HTTP request contains the following information:\n",
"\n",
"- a text request line that includes the following, in this order, with each separated by a single space:\n",
"\n",
" - the request method (GET, HEAD, POST, PUT, DELETE, TRACE, OPTIONS, CONNECT, PATCH - most common are GET and POST).\n",
" - the URL of the resource you are trying to access.\n",
" - the specific version of HTTP you are using.\n",
" \n",
"- a header block that contains one or more header variables, name-value pairs with name separate from value by a colon and a space.\n",
"\n",
" - examples:\n",
"\n",
" - `Host: api.opencalais.com`\n",
" - `Accept-Language: en`\n",
" \n",
"- a blank line\n",
"- the body of the request, if there is one. POST requests generated by submitting a web form, for example, have a body, place all the form inputs in the body of a request.\n",
"\n",
"Example:\n",
"\n",
" GET /index.html HTTP/1.1\n",
" Host: www.example.com\n",
" Accept-Language: en"
]
},
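{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the layout described above, the sketch below assembles the text of a request by hand. The function name `build_http_request` is made up for this example and is not part of any library; in practice, an HTTP package builds this text for you."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A minimal sketch of how the parts of an HTTP request fit together.\n",
"# build_http_request() is a hypothetical helper, not part of any library.\n",
"def build_http_request( method, path, header_dict, body = \"\" ):\n",
"\n",
"    # request line: method, resource and HTTP version, separated by single spaces.\n",
"    request_line = method + \" \" + path + \" HTTP/1.1\"\n",
"\n",
"    # header block: one \"Name: value\" pair per line.\n",
"    header_line_list = []\n",
"    for header_name, header_value in header_dict.items():\n",
"\n",
"        header_line_list.append( header_name + \": \" + header_value )\n",
"\n",
"    #-- END loop over headers --#\n",
"\n",
"    # each line ends with carriage return + line feed, and a blank line\n",
"    # separates the header block from the (optional) body.\n",
"    line_list = [ request_line ] + header_line_list + [ \"\", body ]\n",
"    return \"\\r\\n\".join( line_list )\n",
"\n",
"#-- END function build_http_request() --#\n",
"\n",
"request_text = build_http_request( \"GET\", \"/index.html\", { \"Host\": \"www.example.com\", \"Accept-Language\": \"en\" } )\n",
"print( request_text )"
]
},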
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTTP response\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"An HTTP response is very similar. It contains:\n",
"\n",
"- a text status line that includes the following, in this order, with each separated by a single space:\n",
"\n",
" - the specific version of HTTP you are using.\n",
" - the status code for the request.\n",
" - a status message for the request.\n",
" \n",
" Common status codes:\n",
" \n",
" - 200 - OK\n",
" - 404 - File not found\n",
" - 500 - server error\n",
" - 503 - server down\n",
"\n",
"- a header block that contains one or more header variables, name-value pairs with name separate from value by a colon and a space.\n",
"\n",
" - example: `Content-Type: text/html`\n",
" \n",
"- a blank line\n",
"- the body of the response. For a request from a web browser to a web server, the response will contain the HTML for the page, which the browser will render. For an API request, the response body could contain data in any number of formats (more on this later).\n",
"\n",
"Example:\n",
"\n",
" HTTP/1.1 200 OK\n",
" Date: Mon, 23 May 2005 22:38:34 GMT\n",
" Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)\n",
" Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT\n",
" ETag: \"3f80f-1b6-3e1cb03b\"\n",
" Content-Type: text/html; charset=UTF-8\n",
" Content-Length: 131\n",
" Accept-Ranges: bytes\n",
" Connection: close\n",
"\n",
" <html>\n",
" <head>\n",
" <title>An Example Page</title>\n",
" </head>\n",
" <body>\n",
" Hello World, this is a very simple HTML document.\n",
" </body>\n",
" </html>"
]
},
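{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the structure of a response concrete, here is a small sketch that pulls apart a shortened, made-up response string by hand. It is only meant to show where the status line, header block, and body sit; HTTP packages do this parsing for you."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A minimal sketch of splitting a raw HTTP response into its parts.\n",
"# raw_response is a shortened, made-up example response.\n",
"raw_response = \"HTTP/1.1 200 OK\\r\\nContent-Type: text/html; charset=UTF-8\\r\\nConnection: close\\r\\n\\r\\n<html><body>Hello World</body></html>\"\n",
"\n",
"# the first blank line separates the status line and headers from the body.\n",
"head, body = raw_response.split( \"\\r\\n\\r\\n\", 1 )\n",
"head_line_list = head.split( \"\\r\\n\" )\n",
"\n",
"# status line: HTTP version, status code, status message.\n",
"http_version, status_code, status_message = head_line_list[ 0 ].split( \" \", 2 )\n",
"\n",
"# header block: split each line on the first colon.\n",
"header_dict = {}\n",
"for header_line in head_line_list[ 1 : ]:\n",
"\n",
"    header_name, header_value = header_line.split( \":\", 1 )\n",
"    header_dict[ header_name ] = header_value.strip()\n",
"\n",
"#-- END loop over header lines --#\n",
"\n",
"print( \"Status code = \" + status_code )\n",
"print( \"Content type = \" + header_dict[ \"Content-Type\" ] )\n",
"print( \"Body = \" + body )"
]
},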
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making HTTP API requests\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"There are many ways to use HTTP to request information from APIs on the Internet. A few examples you may have heard of:\n",
"\n",
"- SOAP - Simple Object Access Protocol - XML RPC\n",
"- REST\n",
"- Scraping HTML\n",
"\n",
"In practice, while it is good to know of high-level categories of API implementation, every API is different, and there is considerable variation even within APIs that share a common philosophy (like SOAP or REST). Knowing a given API claims to be REST-based, for example, tells you it is an HTTP API, but doesn't reliably tell you much else. The devil in most APIs in the details:\n",
"\n",
"- how you authenticate.\n",
"- what data you have to pass to the API.\n",
"- what data you can get back from the API.\n",
"- whether there are libraries that simplify interactions with the API, or whether you have to interact at a low level using HTTP.\n",
"\n",
"While you will learn to recognize patterns from API to API, each HTTP API you encounter has the potential to be substantially different, and so each will require you to read the API's documentation carefully to understand exactly how it works.\n",
"\n",
"There will also usually be a number of different options for interacting with an API:\n",
"\n",
"- low-level HTTP requests implemented using one of a number of HTTP librarys.\n",
"- client libraries that hide the details of interacting with the API behind simplified functions and objects.\n",
"\n",
"In general, you should try to find client libraries that simplify your interactions with APIs, as long as they are easy and good (more on this in a bit). If there are no client libraries, or if the client libraries are not easy and good, however, you might have to resort to interacting manually with an API using an HTTP package like `requests`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTTP with Python and `requests`\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"In practice, while it is good to know how HTTP works, you won't need to know the precise details because you'll interact with HTTP resources through a package or library that abstracts those details out. There are many HTTP libraries for Python that have different balances of simplicity versus control and advanced features. Examples:\n",
"\n",
"- urllib2 - built-in Python package for making HTTP requests.\n",
"- mechanize - full-featured HTTP client that is closer to a web browser than a simple HTTP client - supports cookies, for example.\n",
"- requests - easy-to-use Python package for making HTTP requests.\n",
"\n",
"Each of these packages has different strengths and weaknesses, and it is good to be aware that there are options should one not work for a given API or site you want to scrape.\n",
"\n",
"That said, in general, I'd recommend using `requests` if you have to do straight low-level HTTP requests.\n",
"\n",
"Requests exposes methods for each of the HTTP request methods:\n",
"\n",
"- GET = `requests.get()`\n",
"- POST = `requests.post()`\n",
"- etc.\n",
"\n",
"And at its most basic, requests lets you submit a request by passing a string URL to the appropriate method and returns a parsed response object that makes it easy to get at response code, header variables, and the body of the request.\n",
"\n",
"Before you run the example below, make sure you have installed the requests packge by running `conda install requests` at a command prompt.\n",
"\n",
"An example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests\n",
"import requests\n",
"\n",
"# declare variables\n",
"response = None\n",
"request_url = \"\"\n",
"github_username = \"\"\n",
"github_password = \"\"\n",
"auth_tuple = None\n",
"parameter_dict = None\n",
"header_dict = None\n",
"\n",
"# make a simple GET request\n",
"request_url = 'https://api.github.com/user'\n",
"github_username = \"jonathanmorgan\"\n",
"github_password = \"\"\n",
"auth_tuple = ( github_username, github_password )\n",
"response = requests.get( request_url, auth = auth_tuple )\n",
"\n",
"# you could also pass parameters\n",
"parameter_dict = { 'key1': 'value1', 'key2': 'value2' }\n",
"\n",
"# request with parameters as well, passed appropriately for which\n",
"# ever method you call (on URL for GET, in body for POST, etc.).\n",
"response = requests.get( request_url, auth = auth_tuple, params = parameter_dict )\n",
"\n",
"# and header variables\n",
"header_dict = { 'content-type': 'text/plain' }\n",
"\n",
"# request with HTTP header variables as well.\n",
"response = requests.get( request_url, auth = auth_tuple, params = parameter_dict, headers = header_dict )\n",
"\n",
"# check the status code\n",
"print( \"Status code = \" + str( response.status_code ) )\n",
"\n",
"# Header - Content Type?\n",
"print( \"Content type = \" + response.headers['content-type'] )\n",
"\n",
"# Header - Encoding?\n",
"print( \"Encoding = \" + response.encoding )\n",
"\n",
"# Text contained in the body of the response\n",
"print( \"Contents of response body = \" + response.text )\n",
"\n",
"# that is JSON - convert to a JSON object.\n",
"print( \"As JSON object: \" + str( response.json() ) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, each API will be different, so if you have to use low-level HTTP libraries, make sure to read the documentation on the API carefully to figure out exactly what you need to pass to the API and how, and then what you should expect back and how you should process it (see \"Working with API responses\" below).\n",
"\n",
"More information on requests:\n",
"\n",
"- [http://docs.python-requests.org/en/latest/](http://docs.python-requests.org/en/latest/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Authentication\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"When you deal with APIs, you will generally be required to set up an account of some sort, and then each time you make a request, you'll need to authenticate yourself to the API.\n",
"\n",
"Authentication lets sites that provide APIs track users and usage and hold users accountable if they violate the API's rules and rate limits. It also lets them block malicious or incorrect access that could potentially harm the site as a whole.\n",
"\n",
"There are a number of different authentication schemes used by APIs. While you'll need to read each API's documentation carefully to figure out how they work, many will fall into one of the following three general categories:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Basic HTTP authentication\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"References:\n",
"\n",
"- _information from [https://en.wikipedia.org/wiki/Basic_access_authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)_\n",
"\n",
"Basic HTTP authentication is the type of authentication used in the requests example above. It is a simple username and password authentication scheme that is built into HTTP. When you provide a username and password for basic authentication, the username, then password are placed together in a string, separated by a colon ( \":\" ). Then the entire string is Base-64 encoded and placed in a header variable named \"Authorization\" along with the authorization type (in this case, \"Basic\").\n",
"\n",
"An example authorization header for username \"Aladdin\" and password \"open sesame\":\n",
"\n",
" Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
]
},
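{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding steps described above can be sketched with Python's built-in `base64` package. This reproduces the \"Aladdin\" header by hand purely for illustration; when you pass `auth` to `requests`, it builds the header for you. Note that Base-64 is an encoding, not encryption, so basic authentication should only be sent over HTTPS."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import base64\n",
"import base64\n",
"\n",
"# example credentials from the text above\n",
"username = \"Aladdin\"\n",
"password = \"open sesame\"\n",
"\n",
"# join username and password with a colon, then Base-64 encode the bytes.\n",
"credential_string = username + \":\" + password\n",
"encoded_credentials = base64.b64encode( credential_string.encode( \"utf-8\" ) ).decode( \"ascii\" )\n",
"\n",
"# place the result in the Authorization header, after the authorization type.\n",
"header_dict = { \"Authorization\": \"Basic \" + encoded_credentials }\n",
"print( \"Authorization: \" + header_dict[ \"Authorization\" ] )"
]
},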
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### API key authorization\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Another authentication method is to require an API key. An API key is a separate per-user secret that some APIs use for authentication instead of a user's username and password. API hosts can have different criteria for giving an API key to a user. Some sites let you request and receive an API key automatically through their sight once you sign in. Some sites require you to submit a request and be approved for an API key. Some sites make you pay for an API key.\n",
"\n",
"Regardless of how you get one, once you have an API key, you'll need to pass that key to the API with each request, so it can verify that you are an approved client. The method for passing an API key to an API is not standardized like the Basic HTTP Auth method, though, so there are lots of ways you could pass an API key to an API: as an HTTP header variable, as a query parameter appended to the URL you are accessing, or even as part of the body of a request. You'll either have to read the doc to figure out how to submit your API key, or find a library that takes care of it for you."
]
},
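{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three ways of passing an API key mentioned above might look something like the sketch below, which uses Python 3's built-in `urllib` so that nothing is actually sent over the network. The URL, the `X-Api-Key` header name, and the `api_key` parameter name are all placeholders invented for this example; a real API's documentation will tell you which names it expects. With `requests`, the same dictionaries could be passed as `headers`, `params`, or `data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# imports\n",
"from urllib.parse import urlencode\n",
"from urllib.request import Request\n",
"\n",
"# hypothetical values - substitute the ones from your API's documentation.\n",
"api_url = \"https://api.example.com/v1/items\"\n",
"api_key = \"YOUR-API-KEY\"\n",
"\n",
"# option 1: API key as an HTTP header variable (header name is an assumption).\n",
"header_request = Request( api_url, headers = { \"X-Api-Key\": api_key } )\n",
"\n",
"# option 2: API key as a query parameter appended to the URL.\n",
"query_request = Request( api_url + \"?\" + urlencode( { \"api_key\": api_key } ) )\n",
"\n",
"# option 3: API key in the body of a POST request.\n",
"body_request = Request( api_url, data = urlencode( { \"api_key\": api_key } ).encode( \"utf-8\" ), method = \"POST\" )\n",
"\n",
"# these objects only describe the requests - nothing is sent until one\n",
"# is passed to urllib.request.urlopen().\n",
"print( header_request.header_items() )\n",
"print( query_request.full_url )\n",
"print( body_request.data )"
]
},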
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Oauth and Oauth2\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Oauth and Oauth2 are a more complicated means of authentication implemented by larger APIs like twitter. The Oauth family of authentication schemes require you to set up encryption keys with the API provider, then, when you connect, there is a relatively complicated handshake process that takes place before the API server gives you a token that you can use like a password to gain access for the duration of your current session.\n",
"\n",
"There are Python packages that implement oauth and oauth2, so if you have to authenticate manually with Oauth or Oauth 2, use these libraries (the one Twitter recommends is [https://github.com/simplegeo/python-oauth2](https://github.com/simplegeo/python-oauth2)). Most APIs that have a client library also integrate authentication handling into the library, so you can just pass whatever information the API requires in to the client library's connect method, and then it will handle the rest for you. For an example of this, see the `twitter` client library example below.\n",
"\n",
"However you do it, you'll need to get set up with all the information needed for the API before you can access it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with API responses\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"### Data formats\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"References:\n",
"\n",
"- _includes information from: [https://zapier.com/learn/apis/chapter-3-data-formats/](https://zapier.com/learn/apis/chapter-3-data-formats/)_\n",
"\n",
"Once you submit your request and receive your response, you then need to parse the data into a form such that you can interact with it and make use of it.\n",
"\n",
"API response data can come in a wide variety of formats, including CSV files, PDF files, Excel files, and binary data like images or executables. Most responses contain text, however, and two old text standbys that are still widely used today are JSON and XML.\n",
"\n",
"Below, we'll outline what JSON and XML look like and how you parse a document in either format such that you can then subsequently and easily interact with the information contained inside it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### JSON\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"JSON is a string format made up of objects (sets of name-value pairs, similar to a Python dictionary) and lists (sets of values, similar to a list in Python). Lists can contain objects, and vice versa.\n",
"\n",
"The format for an object is very similar to Python's dictionary:\n",
"\n",
"- curly-braces denote the beginning and end of the object.\n",
"- name-value pairs are listed out inside the curly-braces, separated by commas, with name and value quoted if strings, and name and value separate by a colon ( \":\" ).\n",
"\n",
"The format for a list is also very similar to Python:\n",
"\n",
"- square brackets denote the beginning and end of a list.\n",
"- values inside list are separated by commas, and strings must be quoted.\n",
"\n",
"Example JSON document:\n",
"\n",
"- from [https://zapier.com/learn/apis/chapter-3-data-formats/](https://zapier.com/learn/apis/chapter-3-data-formats/)\n",
" \n",
" {\n",
" \"crust\": \"original\",\n",
" \"toppings\": [\"cheese\", \"pepperoni\", \"garlic\"],\n",
" \"status\": \"cooking\",\n",
" \"customer\": {\n",
" \"name\": \"Brian\",\n",
" \"phone\": \"573-111-1111\"\n",
" }\n",
" }\n",
"\n",
"To interact with a JSON document in Python, you use the built in json decoder to convert the document into a JSON object ( [https://docs.python.org/2/library/json.html](https://docs.python.org/2/library/json.html) ). Once you convert a JSON document to a JSON object, JSON \"objects\" are converted to Python dictionaries, and JSON \"lists\" are converted to Python lists.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import json package\n",
"import json\n",
"\n",
"# declare variables\n",
"json_string = \"\"\n",
"json_object = None\n",
"customer_name = \"\"\n",
"my_crust = \"\"\n",
"my_toppings_list = None\n",
"current_topping = \"\"\n",
"\n",
"# place JSON document in variable\n",
"json_string = '''\n",
"{\n",
" \"crust\": \"original\",\n",
" \"toppings\": [\"cheese\", \"pepperoni\", \"garlic\"],\n",
" \"status\": \"cooking\",\n",
" \"customer\": {\n",
" \"name\": \"Brian\",\n",
" \"phone\": \"573-111-1111\"\n",
" }\n",
"}\n",
"'''\n",
"\n",
"# convert to json object\n",
"json_object = json.loads( json_string )\n",
"\n",
"# JSON objects converted to dictionaries, lists to Python lists.\n",
"\n",
"# get customer name\n",
"customer = json_object[ \"customer\" ]\n",
"\n",
"customer_name = customer[ \"name\" ]\n",
"\n",
"print( customer_name + \"'s order:\" )\n",
"\n",
"# get crust\n",
"my_crust = json_object[ \"crust\" ]\n",
"print( \"====> Crust: \" + my_crust )\n",
"\n",
"# get and loop over toppings.\n",
"my_toppings_list = json_object[ \"toppings\" ]\n",
"for current_topping in my_toppings_list:\n",
" \n",
" print ( \"----> Topping: \" + current_topping )\n",
" \n",
"#-- END loop over toppings --#\n",
"\n",
"# and, convert back to pretty-printed string\n",
"print( json.dumps( json_object, sort_keys = True, indent = 4, separators=(',', ': ') ) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### XML\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"XML is another string data format that is designed to hold hierarchically structured data.\n",
"\n",
"XML documents are made up of elements. An element consists of a start tag, and end tag, and content that is between the start and end tags. Example for a pizza order element named \"status\":\n",
"\n",
" <status>cooking</status>\n",
" \n",
"Where:\n",
"\n",
"- `<status>` is the start tag of the status element.\n",
"- \"cooking\" is the content of the status element.\n",
"- `</status>` is the end tag of the status element.\n",
"\n",
"Elements can contain any content, including one or more child element.\n",
"\n",
"A given element's start tag can also contain attributes - name-value pairs of additional information about a given element and the elements inside it. With our status element, an example where we've added an \"updated_by\" attribute, set to \"Benny\":\n",
"\n",
" <status updated_by=\"Benny\">cooking</status>\n",
" \n",
"The attribute name is never in quotation marks, the attribute value always is. Name and value are separated by an equal sign ( \"=\" ).\n",
"\n",
"An example document modeled after our pizza JSON above:\n",
"\n",
" <order>\n",
" <crust>original</crust>\n",
" <toppings>\n",
" <topping>cheese</topping>\n",
" <topping>pepperoni</topping>\n",
" <topping>garlic</topping>\n",
" </toppings>\n",
" <status>cooking</status>\n",
" <customer>\n",
" <name>Brian</name>\n",
" <phone>573-111-1111</phone>\n",
" </customer>\n",
" </order>\n",
"\n",
"To interact with an XML document in Python, you use the lxml ( `conda install lxml` OR `pip install lxml` ) or xmltodict ( `pip install xmltodict` ) python packages to convert the document into a dictionary object. Once you convert a JSON document to a JSON object, JSON \"objects\" are converted to Python dictionaries, and JSON \"lists\" are converted to Python lists.\n",
"\n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import lxml package\n",
"import lxml\n",
"import lxml.objectify\n",
"\n",
"# declare variables\n",
"xml_string = \"\"\n",
"order_root_element = None\n",
"customer_name = \"\"\n",
"my_crust = \"\"\n",
"my_toppings_list = None\n",
"current_topping = \"\"\n",
"\n",
"# place XML document in variable\n",
"xml_string = '''<order>\n",
" <crust>original</crust>\n",
" <toppings>\n",
" <topping>cheese</topping>\n",
" <topping>pepperoni</topping>\n",
" <topping>garlic</topping>\n",
" </toppings>\n",
" <status>cooking</status>\n",
" <customer>\n",
" <name>Brian</name>\n",
" <phone>573-111-1111</phone>\n",
" </customer>\n",
"</order>\n",
"'''\n",
"\n",
"# convert to XML object\n",
"order_root_element = lxml.objectify.fromstring( xml_string )\n",
"\n",
"print( xml_object )\n",
"\n",
"# XML object converted such that return reference is to root element, \n",
"# elements inside are added as object instance variables to element that\n",
"# contains them. If more than one element of same name, that name\n",
"# becomes a reference to a list of the values.\n",
"# So, for this example, xml_object is the <order> element.\n",
"\n",
"# get customer name\n",
"customer_name = order_root_element.customer.name\n",
"print( customer_name + \"'s order:\" )\n",
"\n",
"# get crust\n",
"my_crust = order_root_element.crust\n",
"print( \"====> Crust: \" + my_crust )\n",
"\n",
"# get and loop over toppings.\n",
"my_toppings_list = order_root_element.toppings.topping\n",
"for current_topping in my_toppings_list:\n",
" \n",
" print ( \"----> Topping: \" + current_topping )\n",
" \n",
"#-- END loop over toppings --#\n",
"\n",
"# and, convert back to pretty-printed string\n",
"print( \"XML: \" + lxml.etree.tostring( order_root_element ) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A REST-based API example - OpenCalais\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Open Calais ( [http://www.opencalais.com/](http://www.opencalais.com/) ) is a Thomson Reuters service that automatically extracts meta-data from text. It accepts the string body of a text document, then returns a JSON response that contains all of the meta-data it can derive, including topics; entities like people, companies, and places; and quotations and the person who was quoted.\n",
"\n",
"It has a relatively simple single-call API - you pass it text, it does everything it can to that text, returns you the results. There are a few options - you can choose how you call it (SOAP or REST), specify input format, and specify output format (either JSON or XML) - but other than that, it is really straightforward. Pass it text, get a giant mountain of stuff.\n",
"\n",
"The API documentation: [http://www.opencalais.com/calaisAPI](http://www.opencalais.com/calaisAPI)\n",
"\n",
"This API documentation is a good example of what you'll have to sort through to figure out how to get at something you are interested in. From clicking through it briefly, it can be hard to tell how it works. All the information you need is there, but it can be hard to tell which pieces relate to REST versus SOAP (which you do not want to use - trust me), and the information is broken up in such a way that it can be hard to find details on a certain part of the process when you need them.\n",
"\n",
"Once you figure it out, though, the API is pretty straightforward. Register for an API key ( [http://www.opencalais.com/APIkey](http://www.opencalais.com/APIkey) ), then the most basic usage works as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import JSON\n",
"import json\n",
"\n",
"# declare variables\n",
"article_body_text = \"\"\n",
"calais_api_key = \"\"\n",
"calais_rest_api_url = \"\"\n",
"calais_submitter = \"\"\n",
"header_dict = {}\n",
"requests_response = None\n",
"requests_raw_text = \"\"\n",
"requests_response_json = None\n",
"\n",
"# sample article text\n",
"# - from: http://www.nytimes.com/2015/02/25/technology/path-clears-for-net-neutrality-ahead-of-fcc-vote.html\n",
"article_body_text = '''\n",
"Last April, a dozen New York-based Internet companies gathered in the Flatiron District boardroom of the social media website Tumblr to hear dire warnings that broadband providers were about to get the right to charge for the fastest speeds on the web.\n",
"\n",
"The implication: If they didn’t pay up, they would be stuck in the slow lane.\n",
"\n",
"What followed has been the longest, most sustained campaign of Internet activism in history, one that the little guys appear to have won. On Thursday, the Federal Communications Commission is expected to vote to regulate the Internet as a public good. On Tuesday, Senator John Thune, Republican of South Dakota and chairman of the Senate Commerce Committee, all but surrendered on efforts to overturn the coming ruling, conceding Democrats are lining up with President Obama in favor of the F.C.C.\n",
"\n",
"“We’re not going to get a signed bill that doesn’t have Democrats’ support,” he said, explaining that Democrats have insisted on waiting until after Thursday’s F.C.C. vote before even beginning to talk.\n",
"\n",
"“I told Democrats, Yes, you can wait until the 26th, but you’re going to lose the critical mass I think that’s necessary to come up with a legislative alternative once the F.C.C. acts,” he said.\n",
"\n",
"In the battle over so-called net neutrality, a swarm of small players, from Tumblr to Etsy, BoingBoing to Reddit, has overwhelmed the giants of the tech world, Comcast, Verizon and TimeWarner Cable, with a new brand of corporate activism — New World versus Old. The biggest players on the Internet, Amazon and Google, have stayed in the background, while smaller players — some household names like Twitter and Netflix, others far more obscure, like Chess.com and Urban Dictionary — have mobilized a grass-roots crusade.\n",
"\n",
"“We don’t have an army of lobbyists to deploy. We don’t have financial resources to throw around,” said Liba Rubenstein, Tumblr’s director of social impact and public policy. “What we do have is access to an incredibly engaged, incredibly passionate user base, and we can give folks the tools to respond.”\n",
"'''\n",
"# set up details of REST API and API Key\n",
"calais_rest_api_url = \"http://api.opencalais.com/tag/rs/enrich\"\n",
"\n",
"# insert your own API key here.\n",
"calais_api_key = \"\"\n",
"\n",
"# pick your own submitter string, too.\n",
"calais_submitter = \"UofM-big_data_class_example\"\n",
"\n",
"# set up header variables for call to REST API.\n",
"# this is just required parameters. All params: http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-parameters\n",
"header_dict[ \"x-calais-licenseID\" ] = calais_api_key\n",
"\n",
"# NOTE - does not deal well with HTML - send it raw text!\n",
"header_dict[ \"Content-Type\" ] = \"TEXT/RAW\"\n",
"\n",
"# ask for JSON\n",
"header_dict[ \"outputformat\" ] = \"Application/JSON\"\n",
"\n",
"# and tell it who we are\n",
"header_dict[ \"submitter\" ] = \"sourcenet testing\"\n",
"\n",
"# make the request\n",
"requests_response = requests.post( calais_rest_api_url, data = article_body_text, headers = header_dict )\n",
"\n",
"# raw text:\n",
"requests_raw_text = requests_response.text\n",
"\n",
"# convert to a json object:\n",
"requests_response_json = requests_response.json()\n",
"\n",
"# loop over the stuff in the response:\n",
"print( \"=============================================\" )\n",
"print( \"Open Calais response json\" )\n",
"print( \"=============================================\" )\n",
"print( json.dumps( requests_response_json, sort_keys = True, indent = 4, separators=(',', ': ') ) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API client libraries\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Most popular APIs will not only have an API, they'll also have separate API client libraries that make interacting as easy as possible (it is still HTTP under the hood, and it is still going to be dealing with machine-readable data formats like XML or JSON, so you can only do so much).\n",
"\n",
"Some of these libraries will be provided by the API provider, but many are developed separate from the API and made availble as open source software. This brings us back to a previous comment: in general, you should use API client libraries whenever possible, as long as they are easy and good.\n",
"\n",
"So what are \"easy and good\"? In order for an API client library to be worth using:\n",
"\n",
"- it should be genuinely easier to use than the actual API. API client libraries that are just different, or that do weird things with the data that comes back, or that are more complicated than the base API without the library, should be avoided. Just lump it and learn the API rather than learning someone's complex API into the API.\n",
"- a good API client library is still being updated. Look for a release date or commit (if hosted in a version control site like github) within the last 3 or 4 months. Some libraries that aren't being updated, especially for APIs that are stable and so don't change often, can still be good. But, unmaintained libraries tend to be trouble.\n",
"- a good API client library also will take data returned by the API and make it as easy as possible for you to interact with it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The `twitter` library - Collecting from public Twitter stream\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"An example of collecting data from public Twitter streams follows.\n",
"\n",
"#### Setting up twitter API authentication\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Before you can use the Twitter API, you'll need to:\n",
"\n",
"- set up a Twitter developer account, then set up an application and authentication through that application.\n",
"\n",
" - first, go to [https://apps.twitter.com/](https://apps.twitter.com/) and log in with your twitter user.\n",
" - click on the \"Create new app\" button in the upper right hand corner.\n",
" - fill in the application name, description, and your web site (a placeholder is OK). Leave callback URL empty. Click \"Yes, I agree\" to the terms of service. Then, click the \"Create your Twitter application\" button.\n",
" - Once your application has ben created, you'll see it on the left when you go to apps.twitter.com. Click on the name to view details, then go to the \"Keys and access tokens\" tab to get the information you'll need to connect to the twitter streaming API:\n",
"\n",
" - consumer key\n",
" - consumer secret\n",
" - access key/token\n",
" - access token secret\n",
"\n",
"- make sure the `six` python package is installed by running the following in your computer's command line (not in IPython or IPython Notebook): `conda install six`\n",
"- install the `twitter` package using pip by running the following in your computer's command line (not in IPython or IPython Notebook): `pip install twitter`\n",
"\n",
" - if you get a message \"Requirement already satisfied\", that means the `twitter` package is already installed.\n",
"\n",
"- copy the code example below into a separate file.\n",
"- in your copy of the example, update the OAuth variables you see here so they contain the values you received above when you set up your twitter development account:\n",
"\n",
" # set up OAuth stuff.\n",
" CONSUMER_KEY = ''\n",
" CONSUMER_SECRET = ''\n",
" ACCESS_TOKEN_KEY = ''\n",
" ACCESS_TOKEN_SECRET = ''\n",
"\n",
"- Run the code file in a command-line session of ipython ( **_NOT IPYTHON NOTEBOOK_** ), else the volume of tweets will cause you serious problems.\n",
"\n",
"#### twitter streaming API example\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"The `twitter` API client library is a good example of an easy and good API client library ( [https://github.com/sixohsix/twitter/tree/master](https://github.com/sixohsix/twitter/tree/master) ). As of time of writing, it had last been updated 10 days ago. It supports the latest version of Twitter's API. It is genuinely easy to use. It integrates OAuth authentication into the client, so you don't have to know how it works. It also exposes the entire API, including streaming collection points (one for a \"random\" sample, and one for a filtered \"random\" sample), and makes interacting with a stream straightforward and easy.\n",
"\n",
"Before you start:\n",
"\n",
"- _make sure you install \"twitter\" package with pip before you run the following exercise!_\n",
"- _if you are going to run this a long time, you should run it in IPython on a server, not in a Jupyter notebook in a browser._\n",
"- _the sample code below doesn't do anything with the data. If you want to store it, you'll want to either write it to a file or a database._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import unicode_literals\n",
"\n",
"# import six package\n",
"import six\n",
"\n",
"# import twitter\n",
"import twitter\n",
"\n",
"# variables to hold tweet info\n",
"tweet_timestamp = \"\"\n",
"twitter_tweet_id = \"\"\n",
"tweet_text = \"\"\n",
"tweet_language = \"\"\n",
"twitter_user_twitter_id = \"\"\n",
"twitter_user_screenname = \"\"\n",
"user_followers_count = \"\"\n",
"user_favorites_count = \"\"\n",
"user_created = \"\"\n",
"user_location = \"\"\n",
"tweet_retweet_count = \"\"\n",
"tweet_place_JSON = None\n",
"tweet_place = \"\"\n",
"tweet_user_mention_count = \"\"\n",
"tweet_users_mentioned_screennames = \"\"\n",
"tweet_users_mentioned_ids = \"\"\n",
"tweet_hashtag_mention_count = \"\"\n",
"tweet_hashtags_mentioned = \"\"\n",
"tweet_url_count = \"\"\n",
"tweet_shortened_urls_mentioned = \"\"\n",
"tweet_full_urls_mentioned = \"\"\n",
"user_description = \"\"\n",
"user_friends_count = \"\"\n",
"user_statuses_count = \"\"\n",
"tweet_display_urls_mentioned = \"\"\n",
"\n",
"# variables for processing tweets\n",
"my_oauth = None\n",
"twitter_stream = None\n",
"tweet_iterator = None\n",
"tweet_counter = -1\n",
"current_tweet = None\n",
"\n",
"# hashtag processing\n",
"tweet_hashtag_json_list = None\n",
"hashtag_count = -1\n",
"tweet_hashtag_json = None\n",
"current_hashtag_text = \"\"\n",
"tweet_hashtag_list = []\n",
"\n",
"# url processing\n",
"tweet_url_json_list = None\n",
"url_count = -1\n",
"tweet_url_json = None\n",
"current_url_text = \"\"\n",
"current_dislpay_url_text = \"\"\n",
"current_short_url_text = \"\"\n",
"tweet_url_list = []\n",
"tweet_display_url_list = []\n",
"tweet_short_url_list = []\n",
"\n",
"# user mention processing\n",
"tweet_user_mentions_json_list = None\n",
"user_mention_count = -1\n",
"tweet_user_mention_json = None\n",
"current_user_id = \"\"\n",
"current_user_screenname = \"\"\n",
"tweet_user_id_list = []\n",
"tweet_user_screenname_list = []\n",
"\n",
"# set up OAuth stuff.\n",
"CONSUMER_KEY = ''\n",
"CONSUMER_SECRET = ''\n",
"ACCESS_TOKEN_KEY = ''\n",
"ACCESS_TOKEN_SECRET = ''\n",
"\n",
"# Make an OAuth object.\n",
"my_oauth = twitter.OAuth( ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET )\n",
"\n",
"# Create a tweetstream\n",
"twitter_stream = twitter.TwitterStream( auth = my_oauth )\n",
"\n",
"# get an iterator over tweets\n",
"tweet_iterator = twitter_stream.statuses.sample()\n",
"\n",
"# get an iterator over tweets - basic sample\n",
"tweet_iterator = twitter_stream.statuses.sample()\n",
"\n",
"# or, filtered sample\n",
"# from https://dev.twitter.com/streaming/reference/post/statuses/filter\n",
"# \"track\" = list of string keywords\n",
"# \"locations\" = list of string lat. long. locations ( \"<lat>,<long>\" )\n",
"# \"follow\" = list of users whose statuses we want returned.\n",
"#tweet_iterator = twitter_stream.statuses.filter( track = [ \"nytimes\", ] )\n",
"\n",
"# loop over tweets\n",
"tweet_counter = 0\n",
"for current_tweet in tweet_iterator:\n",
"\n",
" tweet_counter += 1\n",
" \n",
" # check for delete request.\n",
" try:\n",
" \n",
" # if delete request, will have a delete element at the root.\n",
" # If not, this will throw an exception, and you'll process the tweet.\n",
" delete_info = current_tweet[ 'delete' ]\n",
" print( \"--> Deletion request - moving on.\" )\n",
" \n",
" except:\n",
"\n",
" # print out the tweet.\n",
" #print( \"====> Tweet JSON:\" )\n",
" #print( current_tweet_JSON_string )\n",
"\n",
" #------------------------------------------------------------------------\n",
" # tweet data\n",
" #------------------------------------------------------------------------\n",
"\n",
" # get tweet data\n",
" twitter_tweet_id = current_tweet[ 'id' ]\n",
" tweet_text = current_tweet[ 'text' ]\n",
" tweet_timestamp = current_tweet[ 'created_at' ]\n",
" tweet_language = current_tweet[ 'lang' ]\n",
"\n",
" # init tweet_place to empty\n",
" tweet_place = \"\"\n",
" \n",
" # check if tweet place in JSON\n",
" tweet_place_JSON = current_tweet[ 'place' ]\n",
" if ( ( tweet_place_JSON is not None) and ( tweet_place_JSON != \"\" ) ):\n",
"\n",
" # got JSON - get full_name from inside.\n",
" tweet_place = tweet_place_JSON[ \"full_name\" ]\n",
"\n",
" else:\n",
" \n",
" # no place.\n",
" tweet_place = \"\"\n",
" \n",
" #-- END check to see if tweet_place present --#\n",
"\n",
" tweet_retweet_count = current_tweet[ 'retweet_count' ]\n",
" \n",
" # !tweet hashtags?\n",
" \n",
" # initialize hashtag variables\n",
" tweet_hashtag_mention_count = 0\n",
" tweet_hashtags_mentioned = \"\"\n",
" \n",
" # see if we have any hash tags\n",
" tweet_hashtag_json_list = current_tweet[ 'entities' ][ 'hashtags' ]\n",
" hashtag_count = len( tweet_hashtag_json_list )\n",
" if hashtag_count > 0:\n",
" \n",
" # got at least one hashtag. loop and build list.\n",
" tweet_hashtag_list = []\n",
" for tweet_hashtag_json in tweet_hashtag_json_list:\n",
" \n",
" # get hash tag value\n",
" current_hashtag_text = tweet_hashtag_json[ 'text' ]\n",
" \n",
" # append to list\n",
" tweet_hashtag_list.append( current_hashtag_text )\n",
" \n",
" #-- END loop over hash tags --#\n",
" \n",
" # store count\n",
" tweet_hashtag_mention_count = len( tweet_hashtag_list )\n",
"\n",
" # convert to comma-delimited list for storage.\n",
" tweet_hashtags_mentioned = \",\".join( tweet_hashtag_list )\n",
"\n",
" else:\n",
" \n",
" # set all variables to 0, empty string.\n",
" tweet_hashtag_mention_count = 0\n",
" tweet_hashtags_mentioned = \"\"\n",
" \n",
" #-- END check to see if one or more hash tags --#\n",
" \n",
" # !tweet urls?\n",
" \n",
" # initialize URL variables.\n",
" tweet_url_count = 0\n",
" tweet_shortened_urls_mentioned = \"\"\n",
" tweet_display_urls_mentioned = \"\"\n",
" tweet_full_urls_mentioned = \"\"\n",
"\n",
" # do we have URLs in tweet?\n",
" tweet_url_json_list = current_tweet[ 'entities' ][ 'urls' ]\n",
" url_count = len( tweet_url_json_list )\n",
" if url_count > 0:\n",
" \n",
" # got at least one url. loop and build lists.\n",
" tweet_url_list = []\n",
" tweet_display_url_list = []\n",
" tweet_short_url_list = []\n",
" for tweet_url_json in tweet_url_json_list:\n",
" \n",
" # get URL, display URL, and short URL\n",
" current_url_text = tweet_url_json[ 'expanded_url' ]\n",
" current_display_url_text = tweet_url_json[ 'display_url' ]\n",
" current_short_url_text = tweet_url_json[ 'url' ]\n",
"\n",
" # append to lists\n",
" encoded_value = current_url_text.encode( 'utf-8' )\n",
" tweet_url_list.append( six.moves.urllib.parse.quote_plus( encoded_value ) )\n",
" encoded_value = current_display_url_text.encode( 'utf-8' )\n",
" tweet_display_url_list.append( six.moves.urllib.parse.quote_plus( encoded_value ) )\n",
" encoded_value = current_short_url_text.encode( 'utf-8' )\n",
" tweet_short_url_list.append( six.moves.urllib.parse.quote_plus( encoded_value ) )\n",
" \n",
" #-- END loop over URLs --#\n",
" \n",
" # store count\n",
" tweet_url_count = len( tweet_url_list )\n",
"\n",
" # convert to comma-delimited lists for storage.\n",
" tweet_shortened_urls_mentioned = \",\".join( tweet_short_url_list )\n",
" tweet_display_urls_mentioned = \",\".join( tweet_display_url_list )\n",
" tweet_full_urls_mentioned = \",\".join( tweet_url_list )\n",
"\n",
" else:\n",
" \n",
" # set count to 0, everything else to \"\".\n",
" tweet_url_count = 0\n",
" tweet_shortened_urls_mentioned = \"\"\n",
" tweet_display_urls_mentioned = \"\"\n",
" tweet_full_urls_mentioned = \"\"\n",
" \n",
" #-- END check to see if one or more urls --#\n",
" \n",
" # !tweet user mentions?\n",
" \n",
" # initialize user mention variables.\n",
" tweet_user_mention_count = 0\n",
" tweet_users_mentioned_ids = \"\"\n",
" tweet_users_mentioned_screennames = \"\"\n",
" \n",
" # do we have user mentions in this tweet?\n",
" tweet_user_mentions_json_list = current_tweet[ 'entities' ][ 'user_mentions' ]\n",
" user_mention_count = len( tweet_user_mentions_json_list )\n",
" if user_mention_count > 0:\n",
" \n",
" # got at least one user mention. loop and build lists.\n",
" tweet_user_id_list = []\n",
" tweet_user_screenname_list = []\n",
" for tweet_user_mention_json in tweet_user_mentions_json_list:\n",
" \n",
" # get user mention values\n",
" current_user_id = tweet_user_mention_json[ 'id_str' ]\n",
" current_user_screenname = tweet_user_mention_json[ 'screen_name' ]\n",
" \n",
" # append to lists\n",
" tweet_user_id_list.append( current_user_id )\n",
" tweet_user_screenname_list.append( current_user_screenname )\n",
" \n",
" #-- END loop over hash tags --#\n",
" \n",
" # store count\n",
" tweet_user_mention_count = len( tweet_user_id_list )\n",
"\n",
" # convert to comma-delimited lists for storage.\n",
" tweet_users_mentioned_ids = \",\".join( tweet_user_id_list )\n",
" tweet_users_mentioned_screennames = \",\".join( tweet_user_screenname_list )\n",
"\n",
" else:\n",
" \n",
" # no user mentions - set count to 0, everything else to \"\".\n",
" tweet_user_mention_count = 0\n",
" tweet_users_mentioned_ids = \"\"\n",
" tweet_users_mentioned_screennames = \"\"\n",
" \n",
" #-- END check to see if one or more user mentions --#\n",
" \n",
" #------------------------------------------------------------------------\n",
" # user data\n",
" #------------------------------------------------------------------------\n",
"\n",
" twitter_user_twitter_id = current_tweet[ 'user' ][ 'id' ]\n",
" twitter_user_screenname = current_tweet[ 'user' ][ 'screen_name' ]\n",
" user_followers_count = current_tweet[ 'user' ][ 'followers_count' ]\n",
" user_favorites_count = current_tweet[ 'user' ][ 'favourites_count' ]\n",
" user_friends_count = current_tweet[ 'user' ][ 'friends_count' ]\n",
" user_created = current_tweet[ 'user' ][ 'created_at' ]\n",
" user_location = current_tweet[ 'user' ][ 'location' ]\n",
" user_description = current_tweet[ 'user' ][ 'description' ]\n",
" user_statuses_count = current_tweet[ 'user' ][ 'statuses_count' ]\n",
"\n",
" # DO SOMETHING WITH THE DATA!\n",
" #print( \"--> \" + str( twitter_tweet_id ) + \" - \" + tweet_text )\n",
" \n",
" #-- END try-except to see if deleted tweet. --#\n",
" \n",
" if ( tweet_counter % 100 ) == 0:\n",
" \n",
" # yes - print a brief message\n",
" print( \"====> tweet count = \" + str( tweet_counter ) )\n",
" \n",
" #-- END check to see if we've done another hundred --#\n",
" \n",
"#-- END loop over tweet stream --#"
]
},
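{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above stops at `DO SOMETHING WITH THE DATA!`. As one possible way to handle the \"write it to a file\" option, here is a minimal sketch that appends each tweet to a text file as one JSON document per line. The helper names `append_tweet()` and `read_tweets()` and the file name `tweets.jsonl` are made up for this illustration, and the sketch assumes each item from the stream behaves like a plain Python dictionary that `json.dumps()` can serialize.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"def append_tweet( tweet, out_file ):\n",
"\n",
"    # hypothetical helper: serialize one tweet dictionary to a single line\n",
"    #    of JSON and write it to an already-open text file.\n",
"    out_file.write( json.dumps( tweet, sort_keys = True ) + '\\n' )\n",
"\n",
"#-- END function append_tweet() --#\n",
"\n",
"def read_tweets( path ):\n",
"\n",
"    # hypothetical helper: read the file back into a list of dictionaries,\n",
"    #    skipping any blank lines.\n",
"    with open( path ) as in_file:\n",
"\n",
"        return [ json.loads( line ) for line in in_file if line.strip() ]\n",
"\n",
"    #-- END with open() --#\n",
"\n",
"#-- END function read_tweets() --#\n",
"```\n",
"\n",
"Inside the loop, you would open the file once before the `for` statement (for example `tweet_file = open( 'tweets.jsonl', 'a' )`) and call `append_tweet( current_tweet, tweet_file )` in place of the `DO SOMETHING` comment. Keeping the whole tweet, rather than only the fields you parsed, means you can go back for other fields later."
]
},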
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sample JSON of tweet\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Sample JSON text:\n",
"\n",
" {\n",
" \"contributors\": null,\n",
" \"coordinates\": null,\n",
" \"created_at\": \"Sun Feb 08 17:19:39 +0000 2015\",\n",
" \"entities\": {\n",
" \"hashtags\": [],\n",
" \"symbols\": [],\n",
" \"trends\": [],\n",
" \"urls\": [\n",
" {\n",
" \"display_url\": \"knzmuslim.com\",\n",
" \"expanded_url\": \"http://knzmuslim.com\",\n",
" \"indices\": [\n",
" 72,\n",
" 94\n",
" ],\n",
" \"url\": \"http://t.co/sfymYu8CVj\"\n",
" }\n",
" ],\n",
" \"user_mentions\": []\n",
" },\n",
" \"favorite_count\": 0,\n",
" \"favorited\": false,\n",
" \"filter_level\": \"low\",\n",
" \"geo\": null,\n",
" \"id\": 564473647612841985,\n",
" \"id_str\": \"564473647612841985\",\n",
" \"in_reply_to_screen_name\": null,\n",
" \"in_reply_to_status_id\": null,\n",
" \"in_reply_to_status_id_str\": null,\n",
" \"in_reply_to_user_id\": null,\n",
" \"in_reply_to_user_id_str\": null,\n",
" \"lang\": \"ar\",\n",
" \"place\": null,\n",
" \"possibly_sensitive\": false,\n",
" \"retweet_count\": 0,\n",
" \"retweeted\": false,\n",
" \"source\": \"<a href=\\\"http://knzmuslim.com\\\" rel=\\\"nofollow\\\">knzmuslim \\u0643\\u0646\\u0632 \\u0627\\u0644\\u0645\\u0633\\u0644\\u0645</a>\",\n",
" \"text\": \"\\u0627\\u0644\\u0644\\u0647\\u0645 \\u0635\\u0628\\u062d\\u0646\\u0627 \\u0628\\u0645\\u0627 \\u064a\\u0633\\u0631\\u0646\\u0627 \\u0648\\u0643\\u0641 \\u0639\\u0646\\u0627 \\u0645\\u0627 \\u064a\\u0636\\u0631\\u0646\\u0627 \\u0648\\u064a\\u0633\\u0631 \\u0644\\u0646\\u0627 \\u062f\\u0631\\u0648\\u0628\\u0646\\u0627 \\u0648\\u0646\\u0648\\u0631 \\u0628\\u0646\\u0648\\u0631\\u0643 \\u064a\\u0648\\u0645\\u0646\\u0627 http://t.co/sfymYu8CVj\",\n",
" \"timestamp_ms\": \"1423415979661\",\n",
" \"truncated\": false,\n",
" \"user\": {\n",
" \"contributors_enabled\": false,\n",
" \"created_at\": \"Tue Aug 05 08:20:21 +0000 2014\",\n",
" \"default_profile\": true,\n",
" \"default_profile_image\": false,\n",
" \"description\": \"\\u0635\\u0646\\u0627\\u0639 \\u0627\\u0644\\u062d\\u064a\\u0627\\u0647\",\n",
" \"favourites_count\": 0,\n",
" \"follow_request_sent\": null,\n",
" \"followers_count\": 24,\n",
" \"following\": null,\n",
" \"friends_count\": 37,\n",
" \"geo_enabled\": false,\n",
" \"id\": 2708736350,\n",
" \"id_str\": \"2708736350\",\n",
" \"is_translator\": false,\n",
" \"lang\": \"ar\",\n",
" \"listed_count\": 0,\n",
" \"location\": \"\",\n",
" \"name\": \" \\u0644\\u064a\\u0646\\u0627 \\u0627\\u062d\\u0645\\u062f\",\n",
" \"notifications\": null,\n",
" \"profile_background_color\": \"C0DEED\",\n",
" \"profile_background_image_url\": \"http://abs.twimg.com/images/themes/theme1/bg.png\",\n",
" \"profile_background_image_url_https\": \"https://abs.twimg.com/images/themes/theme1/bg.png\",\n",
" \"profile_background_tile\": false,\n",
" \"profile_image_url\": \"http://pbs.twimg.com/profile_images/496571979177017345/bSFdCKPp_normal.jpeg\",\n",
" \"profile_image_url_https\": \"https://pbs.twimg.com/profile_images/496571979177017345/bSFdCKPp_normal.jpeg\",\n",
" \"profile_link_color\": \"0084B4\",\n",
" \"profile_sidebar_border_color\": \"C0DEED\",\n",
" \"profile_sidebar_fill_color\": \"DDEEF6\",\n",
" \"profile_text_color\": \"333333\",\n",
" \"profile_use_background_image\": true,\n",
" \"protected\": false,\n",
" \"screen_name\": \"linaa_hmad\",\n",
" \"statuses_count\": 7045,\n",
" \"time_zone\": null,\n",
" \"url\": null,\n",
" \"utc_offset\": null,\n",
" \"verified\": false\n",
" }\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### tweepy - an object-oriented alternative\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"_`tweepy`_ is another popular Python twitter streaming API. Rather than interact with JSON responses directly, as you do with _`twitter`_, _`tweepy`_ wraps all responses in Python objects that you interact with using dot-notation. It isn't as simple as _`twitter`_, but it is solid and well-supported, and it abstracts out the need to understand or work with JSON. Some links if you are interested:\n",
"\n",
"- tweepy home page: [http://www.tweepy.org/](http://www.tweepy.org/)\n",
"- tweepy doc: [http://tweepy.readthedocs.org/en/latest/](http://tweepy.readthedocs.org/en/latest/)\n",
"- Using tweepy for streaming data: [http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python](http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Capturing data\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Considerations when capturing data:\n",
"\n",
"- in general, fine to capture to flat files or to database, but I prefer database because you can use SELECT to explore data as you are collecting it.\n",
"- As long as it is easy, capture everything from your data source.\n",
"- Even if there are challenges, consider capturing everything from your data source.\n",
"- Once you get to derivation and analysis, you can ignore source data you don’t use, but if you later have a new question that relates to other bits of data in the data stream, it helps to have that data so you can use your existing collection to assess the usefulness, even if you subsequently have to do a new collection to get fresh data.\n",
"- if performance matters, start flat.\n",
"- normalization note\n",
"\n",
" - don’t always have to be hyper-vigilant with normalization\n",
" - it is OK to only turn the screws on basic things first (tweet, user) and then refine more later as you see what you are interested in. You don't NEED to be perfectly normalized right away. And, it can also be OK to denormalize things sometimes if it aids in deriving data or analysis. But, if you do that, you should do it for a reason. And, don’t throw away data you might need later to normalize.\n",
" \n",
"- Once you get your twitter stream consumer working, you should be able to then just use the INSERT from the example from Week 6 that read from a CSV file and inserted a record per row to insert each tweet into the database as it comes over the stream ( [http://nbviewer.ipython.org/gist/jonathanmorgan/7b66cf2cc1c63f92ac1b#Using-\"csv\"-package-to-read-and-process-CSV-files](http://nbviewer.ipython.org/gist/jonathanmorgan/7b66cf2cc1c63f92ac1b#Using-\"csv\"-package-to-read-and-process-CSV-files) ).\n",
"\n",
" - This is another reason I wanted you to first make the flat table in the database, then work from there. That flat table is also a good landing spot for tweets you collect by other means."
]
},
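{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flat-table approach described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module rather than whatever database you set up for class, and the table name `tweet_flat`, its columns, and the two helper functions are hypothetical. It stores a few parsed fields plus the full tweet as JSON text, so nothing from the source is thrown away.\n",
"\n",
"```python\n",
"import json\n",
"import sqlite3\n",
"\n",
"def create_tweet_table( connection ):\n",
"\n",
"    # one flat table: a few parsed columns, plus the raw JSON for later.\n",
"    connection.execute(\n",
"        'CREATE TABLE IF NOT EXISTS tweet_flat ('\n",
"        ' tweet_id TEXT PRIMARY KEY,'\n",
"        ' tweet_text TEXT,'\n",
"        ' user_screenname TEXT,'\n",
"        ' raw_json TEXT )'\n",
"    )\n",
"\n",
"#-- END function create_tweet_table() --#\n",
"\n",
"def insert_tweet( connection, tweet ):\n",
"\n",
"    # parameterized INSERT - one row per tweet.  OR IGNORE skips a tweet\n",
"    #    whose ID is already in the table.\n",
"    connection.execute(\n",
"        'INSERT OR IGNORE INTO tweet_flat VALUES ( ?, ?, ?, ? )',\n",
"        ( tweet[ 'id_str' ], tweet[ 'text' ], tweet[ 'user' ][ 'screen_name' ], json.dumps( tweet ) )\n",
"    )\n",
"    connection.commit()\n",
"\n",
"#-- END function insert_tweet() --#\n",
"```\n",
"\n",
"While a collection is running, you could then open a second connection to the same database file and run something like `SELECT COUNT( * ) FROM tweet_flat` to check on progress."
]
},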
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practical considerations\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"### Know and follow the rules of APIs\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Common rules of APIs:\n",
"\n",
"- rate-limiting\n",
"\n",
" - number of transactions per second\n",
" - number of transactions per day\n",
" - number of parallel connections\n",
"\n",
"- also, follow terms of service\n",
"\n",
" - know limits on collection, storage.\n",
" - look to see if it is prohibited to scrape a given site.\n",
" \n",
"### Performance\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"- In general, make code that is well-designed and -structured, and only worry about performance when you run into a problem.\n",
"- sometimes, with APIs, you will run into problems - especially streaming APIs.\n",
"\n",
" - one reason to write to flat database tables when streaming from an API is that you can get cut off if you don't keep up, and writing to multiple tables per tweet slows you down."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}