{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial: Using the New York Times API\n",
"Code by [Nick Diakopoulos](http://www.nickdiakopoulos.com/).\n",
"\n",
"APIs, or \"Application Programming Interfaces\" are useful because they can allow you to access data or services on other servers on the web. The goal of this tutorial is to give you some experience collecting data from APIs - one way to think of an API is as a URL that returns data when loaded. We'll look specifically at the New York Times APIs. Whenever working with APIs you'll need to get cozy with the API documentation as that will dictate what you can and can't do with the API. \n",
"\n",
"Here is [the documentation](http://developer.nytimes.com/) on the NYT APIs.\n",
"\n",
"And here's [the documentation](http://developer.nytimes.com/community_api_v3.json#/README) on the NYT Community API which we'll use to collect a day's worth of comments. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Something to be aware of with APIs is that they are usually rate limited, and you may need to sign up for an authorization key to use them. Before continuing, sign up for an NYT API key and copy the string key into the variable below `api_key`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests, json \n",
"import math\n",
"\n",
"# Copy your api_key here as a string\n",
"api_key = 'API_KEY'\n",
"url = 'http://api.nytimes.com/svc/community/v3/user-content/by-date.json'\n",
"\n",
"api_response = requests.get(url, params={\"api-key\": api_key, \"date\": \"2018-02-01\"})\n",
"api_response.url\n"
]
},
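{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before parsing anything, it can help to check the HTTP status code of the response (an optional step). Anything other than 200 usually points to a problem such as a missing or invalid API key, a malformed date parameter, or rate limiting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check on the response status.\n",
"# 200 means OK; other codes usually indicate an invalid key, a bad date, or rate limiting.\n",
"api_response.status_code"
]
},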
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you paste that URL into a browser you will see all of the sweet JSON data that the API has sent back to fulfill your request. \n",
"\n",
"The next step is to figure out how to parse the response. There is a variable `results` and beneath that a variable `comments` which has a list of JSON objects, one for each comment. \n",
"\n",
"Let's the get the API response as a JSON object. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"date_of_interest = \"2018-01-01\"\n",
"api_response = requests.get(url, params={\"api-key\": api_key, \"date\": date_of_interest}).json()"
]
},
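{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to confirm that structure yourself, one optional way is to inspect the keys of the parsed response. The exact set of fields may vary with the API version, so treat this as exploration rather than a fixed contract."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explore the structure of the parsed response: top-level keys, the keys under\n",
"# \"results\", and how many comments came back in this batch.\n",
"print(api_response.keys())\n",
"print(api_response[\"results\"].keys())\n",
"print(len(api_response[\"results\"][\"comments\"]), \"comments in this batch\")"
]
},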
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can isolate the comments list and parse it into a pandas dataframe. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"comments = pd.read_json(json.dumps(api_response[\"results\"][\"comments\"]))"
]
},
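{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what a comment record contains, it can help to peek at the first few rows before deciding which columns to keep (also optional)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the first few comments to see what fields each record contains.\n",
"comments.head(3)"
]
},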
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We don't need all those columns so let's drop some of them. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"comments.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"comments.drop(labels = [\"commentSequence\", \"commentTitle\", \"lft\", \"rgt\", \"status\", \"statusID\", \"userTitle\", \"userURL\"], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"comments.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neat, we've collected 25 comments from the date we specified in the URL. BUT there were many more comments made that day and we we want to loop through and collect them all. \n",
"\n",
"Just how many comments were there? We can look at the `totalCommentsFound` field in the response object to find out. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"api_response[\"results\"][\"totalCommentsFound\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"nIterationsNeeded = int(math.ceil((api_response[\"results\"][\"totalCommentsFound\"]) / 25.0))\n",
"print (\"We need to collect\", nIterationsNeeded ,\"times, since we only get 25 comments at a time.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the Community API [readme](http://developer.nytimes.com/community_api_v3.json#/README) there is another URL parameter called `offset` which allows us to grab blocks of 25 comments from a different starting offset. We can increment this parameter and repeatedly call the API with different offset values in order to collect all the comments. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We'll need this library to add sleeping functionality and slow our script down\n",
"from time import sleep\n",
"\n",
"# Create an empty dataframe with the columns we want (as gathered above)\n",
"all_comments = pd.DataFrame(columns = comments.columns)\n",
"\n",
"# Iterate from zero up to the number of iterations needed\n",
"for i in range(0, nIterationsNeeded):\n",
" print (i)\n",
" # set the offset by multiplying by 25\n",
" offset = i * 25\n",
" # call the api with the offset parameter\n",
" api_response = requests.get(url, params={\"api-key\": api_key, \"date\": date_of_interest, \"offset\": offset})\n",
" #print requests.get(url, params={\"api-key\": api_key, \"date\": \"2016-12-15\", \"offset\": offset}).url\n",
" if api_response.status_code != 200:\n",
" sleep(1)\n",
" api_response = requests.get(url, params={\"api-key\": api_key, \"date\": date_of_interest, \"offset\": offset})\n",
" \n",
" api_response = api_response.json()\n",
" comments_batch = pd.read_json(json.dumps(api_response[\"results\"][\"comments\"]))\n",
" comments_batch.drop(labels = [\"commentSequence\", \"commentTitle\", \"lft\", \"rgt\", \"status\", \"statusID\", \"userTitle\", \"userURL\"], axis=1, inplace=True)\n",
" \n",
" # Append these comments\n",
" all_comments = all_comments.append(comments_batch)\n",
" # Because we just appended a bunch of rows we need to reset the index\n",
" all_comments.reset_index()\n",
" print (\" Collected\", all_comments.shape[0], \"comments.\")\n",
" # Sleep for a bit in between each call (it's courteous not to request data to an api too frequently and some APIs dictate this through rate limiting)\n",
" sleep(.1) # half second"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_comments.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Let's save it\n",
"all_comments.to_csv(\"NYT-Comments\" + date_of_interest + \".csv\", index=False, encoding='utf-8')"
]
},
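{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can read the file back and confirm the row count matches what we collected. The `reloaded` variable below is just a throwaway name for this check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the saved file back and confirm the row count matches all_comments.\n",
"reloaded = pd.read_csv(\"NYT-Comments\" + date_of_interest + \".csv\", encoding='utf-8')\n",
"reloaded.shape"
]
},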
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which users were most active commenting on that day?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"grouped = all_comments.groupby(\"userID\")\n",
"grouped.size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oddly it has incorrectly parsed the userID as a floating point number, but we know better that it's an integer and we can set it explicitely. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"all_comments.userID = all_comments.userID.astype(int)\n",
"grouped = all_comments.groupby(\"userID\")\n",
"group_sizes_df = grouped.size().reset_index()\n",
"group_sizes_df.columns = ['userID', \"group_size\"]\n",
"group_sizes_df[group_sizes_df.group_size > 5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then we can use the top userID to see what kind of comments that person wrote. "
]
},
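{
"cell_type": "markdown",
"metadata": {},
"source": [
"The specific user ID filtered on below was simply read off the table above. If you'd rather compute it, one optional approach is to sort `group_sizes_df` (built in the previous code cell) and take the first row; `top_user` is just an illustrative variable name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Find the userID with the most comments programmatically instead of reading it off the table.\n",
"top_user = group_sizes_df.sort_values(\"group_size\", ascending=False).iloc[0][\"userID\"]\n",
"top_user"
]
},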
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_colwidth', -1)\n",
"all_comments[all_comments.userID == 57123959].commentBody"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could also aggregate and then rank people by another column, like the average or total `recommendation_count` of their comments. This will give a sense of the users that were overall most recommended. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"all_comments.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see who got the most recommednation for their comments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"subset = all_comments[['userID', 'recommendationCount']]\n",
"grouped = subset.groupby(\"userID\")\n",
"print (grouped.agg(np.sum).sort_values(by=\"recommendationCount\", ascending=False))\n",
"\n"
]
},
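{
"cell_type": "markdown",
"metadata": {},
"source": [
"The text above also mentioned the average. A parallel, optional version using the mean (recommendations per comment for each user) looks like this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Average recommendations per comment for each user, sorted descending.\n",
"grouped.agg(np.mean).sort_values(by=\"recommendationCount\", ascending=False).head(10)"
]
},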
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_comments[all_comments.userID == 57123959].recommendationCount"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}