{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"This is the first in a series of tutorials that illustrate how to download the 990 e-file data the IRS started to make public in 2017. The IRS 990 e-file data are housed on Amazon Web Services (AWS) at https://aws.amazon.com/public-data-sets/irs-990/\n",
"\n",
"In this first notebook we will set up access to AWS and create a combined dictionary of the annual index files, each of which lists the 990 filings available for that year. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up AWS CLI\n",
"\n",
"I am assuming you are using a set-up similar to mine -- namely, Anaconda Python 2.7 -- with MongoDB also installed. For an overview see this blog post: http://social-metrics.org/python-code-prerequisites/\n",
"\n",
"What you will first need to do is install AWS CLI. Open a UNIX terminal (e.g., the *Terminal* app on a Mac or Linux machine) and type the following code:\n",
"\n",
"&nbsp; &nbsp; &nbsp; &nbsp; <code>pip install awscli</code>\n",
"\n",
"\n",
"You then need to configure *AWS CLI*. For this you will need an AWS account, from which you can generate an <code>access key ID</code> and a <code>secret access key</code>. For a step-by-step guide see here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html\n",
"\n",
"Once you have that done, you can configure <code>awscli</code>. Go back to your *Terminal* and type in the following\n",
"\n",
"&nbsp; &nbsp; &nbsp; &nbsp; <code>aws configure</code>\n",
"\n",
"You will be asked to enter your access key ID and your secret access key. You can also set a default region if desired (or leave it blank and hit *enter*). You will only have to do this set-up once.\n",
"\n",
"Now you're all set to run <code>awscli</code> commands."
]
},
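{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the install worked, you can ask the CLI to report its version (the exact version string will differ on your machine):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!aws --version"
]
},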
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### Use *awscli* to grab annual index files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"We first want to grab a list of the annual index files. Each of these files contains a list of all the e-filings available that year.\n",
"\n",
"To get the listing, you could type the following into a Terminal window:\n",
"\n",
"- <code>aws s3 ls s3://irs-form-990/index --human-readable --summarize</code>\n",
"\n",
"We can replicate that command here in Jupyter by prepending it with an exclamation mark:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2016-07-14 12:21:03 24.6 MiB index_2011.csv\r\n",
"2016-09-27 12:16:08 59.7 MiB index_2011.json\r\n",
"2016-08-05 02:46:28 29.5 MiB index_2012.csv\r\n",
"2016-09-27 12:01:16 76.9 MiB index_2012.json\r\n",
"2016-08-05 02:45:51 29.1 MiB index_2013.csv\r\n",
"2016-09-27 11:36:14 76.9 MiB index_2013.json\r\n",
"2016-07-06 12:25:45 45.9 MiB index_2014.csv\r\n",
"2016-09-27 11:27:37 113.9 MiB index_2014.json\r\n",
"2016-08-05 02:45:12 28.8 MiB index_2015.csv\r\n",
"2016-09-26 18:42:12 76.6 MiB index_2015.json\r\n",
"2017-02-09 10:47:45 42.8 MiB index_2016.csv\r\n",
"2017-01-11 17:40:21 111.4 MiB index_2016.json\r\n",
"2018-01-18 19:43:55 59.5 MiB index_2017.csv\r\n",
"2018-01-19 19:55:48 143.6 MiB index_2017.json\r\n",
"2018-04-12 12:42:42 15.9 MiB index_2018.csv\r\n",
"2018-04-10 17:21:28 38.7 MiB index_2018.json\r\n",
"\r\n",
"Total Objects: 16\r\n",
" Total Size: 973.7 MiB\r\n"
]
}
],
"source": [
"!aws s3 ls s3://irs-form-990/index --human-readable --summarize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"We see there are both CSV and JSON format versions of each annual file. Let's test out the first year of data by grabbing *index_2011.json*. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import json\n",
"url = 'https://s3.amazonaws.com/irs-form-990/index_2011.json'\n",
"f = requests.get(url)\n",
"data = f.json()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"The data returned are in JSON format and, specifically, are nested under a <code>Filings2011</code> key. The following three lines of code show, respectively, the top-level 'keys' for the data, the number of objects (filings) listed under that major key, and the first index filing. We see there are 203,075 filings in the 2011 index file."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top-level key(s) in the 2011 index file: [u'Filings2011'] \n",
"\n",
"# of filings in the 2011 index file: 203075 \n",
"\n"
]
},
{
"data": {
"text/plain": [
"[{u'DLN': u'93493316003251',\n",
" u'EIN': u'591971002',\n",
" u'FormType': u'990',\n",
" u'LastUpdated': u'2016-03-21T17:23:53',\n",
" u'ObjectId': u'201103169349300325',\n",
" u'OrganizationName': u'ANGELUS INC',\n",
" u'SubmittedOn': u'2011-11-30',\n",
" u'TaxPeriod': u'201009',\n",
" u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml'}]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print 'Top-level key(s) in the 2011 index file:', data.keys(), '\\n'\n",
"print '# of filings in the 2011 index file:', len(data['Filings2011']), '\\n'\n",
"data['Filings2011'][:1]"
]
},
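{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since each filing record is a flat dictionary, the index parses naturally into tabular form. As an illustrative sketch (assuming *pandas* is installed -- it ships with Anaconda), here is the sample record shown above loaded into a DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# sample record copied from the 2011 index output above\n",
"sample = [{'EIN': '591971002', 'FormType': '990',\n",
"           'OrganizationName': 'ANGELUS INC', 'TaxPeriod': '201009'}]\n",
"df = pd.DataFrame(sample)\n",
"print(df[['EIN', 'OrganizationName', 'FormType']])"
]
},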
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"Python is tailor-made for automating many coding tasks. What we ultimately want to do is download each of the index files we found earlier. We *could* thus create a Python *list* of the JSON versions of those file names like so:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['index_2011.json', 'index_2012.json', 'index_2013.json', 'index_2014.json', 'index_2015.json', 'index_2016.json', 'index_2017.json', 'index_2018.json']\n"
]
}
],
"source": [
"index_list = []\n",
"for x in range(11, 19, 1):\n",
" index_list.append('index_20%s.json' % x)\n",
"print index_list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>However, we can use a different shortcut. The only part that varies in each file name is the year. So, let's create a Python *list* containing all index file years. We will use it in a loop below to download each yearly index."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]\n"
]
}
],
"source": [
"year_list = []\n",
"for x in range(2011, 2019, 1):\n",
" year_list.append(x)\n",
"print year_list"
]
},
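{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same list can be built in a single line with a list comprehension -- just a compact restatement of the loop above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one-line equivalent of the loop above\n",
"year_list = [year for year in range(2011, 2019)]\n",
"print(year_list)"
]
},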
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We now create an empty Python *dictionary* that will hold each year's worth of filings. In the second line of code we begin a *for loop*, looping over each year in the above list. We use that *year* value to create a variable *url* string for each year. In the fourth and fifth lines of code we visit the URL and assign the returned JSON data to our dictionary *data*. Each year will have its own *key* in the dictionary; for instance, the 2011 data will be nested under the <code>data['Filings2011']</code> key. \n",
"\n",
"Note that each year of data returned by the index file nests the data under a different key. 2011 data is nested under a *Filings2011* key, 2012 data under a *Filings2012* key, etc. We will retain the same structure in our *data* dictionary.\n",
"\n",
"In the penultimate line of code we access the key in each loop and print out the number of filings that year. The final line of code prints the length of the entire dictionary; given that there is one top-level key per year, this command returns the number of years' worth of data in our dictionary. "
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of filings in 2011 : 203075\n",
"# of filings in 2012 : 261622\n",
"# of filings in 2013 : 261449\n",
"# of filings in 2014 : 387529\n",
"# of filings in 2015 : 261034\n",
"# of filings in 2016 : 378420\n",
"# of filings in 2017 : 489013\n",
"# of filings in 2018 : 131168\n",
"# of years of data: 8\n"
]
}
],
"source": [
"data = {}\n",
"for year in year_list:\n",
" url = 'https://s3.amazonaws.com/irs-form-990/index_%s.json' % year\n",
" f = requests.get(url)\n",
" data['Filings%s' % year] = f.json()['Filings%s' % year]\n",
" print '# of filings in', year, ':', len(data['Filings%s' % year]) \n",
"print '# of years of data:', len(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We can now see that there are 8 keys in our dictionary -- one for each year."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Filings2014', 'Filings2015', 'Filings2016', 'Filings2017', 'Filings2011', 'Filings2012', 'Filings2013', 'Filings2018']\n"
]
}
],
"source": [
"print data.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We can also use some Python to get a count of the total number of filings listed under the 8 keys."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2373310"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(len(v) for v in data.itervalues())"
]
},
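{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that <code>itervalues()</code> exists only in Python 2; the same total can be computed with <code>values()</code>, which works in both Python 2 and 3. A quick sketch with a toy stand-in dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# toy stand-in for the filings dictionary built above\n",
"toy = {'Filings2011': [1, 2, 3], 'Filings2012': [4, 5]}\n",
"total = sum(len(v) for v in toy.values())  # values() works in Python 2 and 3\n",
"print(total)"
]
},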
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"We'll save a copy of our *data* dictionary. In our next notebook we'll open this file and throw the data into a MongoDB database."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, set our working directory:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/gsaxton/Dropbox/990 e-file data\n"
]
}
],
"source": [
"cd '/Users/gsaxton/Dropbox/990 e-file data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then save the dictionary into the above directory:"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open('data.pickle', 'wb') as handle:\n",
" pickle.dump(data, handle)"
]
}
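,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next notebook we will reload this file with <code>pickle.load</code>. As a quick sanity check that the round trip works, here is a sketch using a small sample dictionary (the file name <code>sample.pickle</code> is just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"# round-trip a small sample dictionary to verify the save/load cycle\n",
"sample = {'Filings2011': [{'EIN': '591971002'}]}\n",
"with open('sample.pickle', 'wb') as handle:\n",
"    pickle.dump(sample, handle)\n",
"with open('sample.pickle', 'rb') as handle:\n",
"    reloaded = pickle.load(handle)\n",
"print(reloaded == sample)"
]
}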
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"navigate_num": "#000000",
"navigate_text": "#333333",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700",
"sidebar_border": "#EEEEEE",
"wrapper_background": "#FFFFFF"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "279px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 4,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}