{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This is the fourth in a series of tutorials that illustrate how to download the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/\n",
"\n",
"In this fourth notebook we will download all available 990 e-filings for a sample of 5 nonprofit organizations and insert them into a MongoDB collection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import sys\n",
"import time\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/gsaxton/Dropbox/990 e-file data\n"
]
}
],
"source": [
"cd '/Users/gsaxton/Dropbox/990 e-file data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MongoDB\n",
"Depending on the project, I will store the data in SQLite or MongoDB. This time I'll use MongoDB -- it's great for storing JSON data where each observation can have different variables. Before we get to the interesting part, the following code blocks set up the MongoDB environment and the new database we'll be using. \n",
"\n",
"**_Note:_** In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pymongo\n",
"from pymongo import MongoClient\n",
"client = MongoClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Connect to our database that contains the filing index data we downloaded earlier."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# DEFINE MY mongoDB DATABASE\n",
"db = client['irs_990_db']\n",
"\n",
"# DEFINE MY COLLECTION WHERE I'LL INSERT MY SEARCH \n",
"file_list = db['990_files_2011_2018']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'_id_', u'ObjectId_1']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(file_list.index_information())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Check how many observations are in the database table."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2373310"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_list.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loop over List of Filings, grab e-file data, and insert into second database\n",
"First we'll write a function to turn an ordered dictionary (which is what is returned by *xmltodict*) into a normal Python dictionary so that we can combine it with the filing data gathered above."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from json import loads, dumps\n",
"from collections import OrderedDict\n",
"\n",
"def to_dict(input_ordered_dict):\n",
"    return loads(dumps(input_ordered_dict))"
]
},
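The JSON round-trip used by `to_dict` can be checked on a small example. This sketch uses Python 3 syntax (the notebook itself runs Python 2) and a made-up nested record:

```python
import json
from collections import OrderedDict

def to_dict(input_ordered_dict):
    # Serialize to JSON and back: json.loads always returns plain dicts/lists,
    # so any OrderedDicts produced by xmltodict are converted at every level.
    return json.loads(json.dumps(input_ordered_dict))

nested = OrderedDict([('Filer', OrderedDict([('EIN', '010211530')])),
                      ('ReturnType', '990')])
plain = to_dict(nested)
print(type(plain))            # <class 'dict'>
print(plain['Filer']['EIN'])  # 010211530
```

Note the conversion applies recursively, which a simple `dict(nested)` call would not.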
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Let's create a list of five EINs. Each one uniquely identifies a nonprofit organization."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"eins = ['010202467', '010211478', '010211513', '010211530', '010211543']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"Check how many filings there are for this sample of 5 EINs. Here we are accessing the <code>file_list</code> collection and counting the number of rows in the database that match the EINs in the above list. There is more than one filing per organization given the multiple years in our data. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"33"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_list.find({'EIN': { '$in': eins}}).count()"
]
},
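The `$in` operator selects any document whose EIN appears in the list. With no mongod running here, the same filter can be sketched in plain Python over a few hypothetical stand-in rows:

```python
eins = ['010202467', '010211478', '010211513', '010211530', '010211543']

# Hypothetical stand-ins for documents in the 990_files_2011_2018 collection.
rows = [
    {'EIN': '010211530', 'TaxPeriod': '201009'},
    {'EIN': '999999999', 'TaxPeriod': '201112'},
    {'EIN': '010202467', 'TaxPeriod': '201212'},
]

# Plain-Python equivalent of file_list.find({'EIN': {'$in': eins}}).count()
matches = [r for r in rows if r['EIN'] in eins]
print(len(matches))  # 2
```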
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We can loop over each of these rows in our MongoDB collection. Let's print the full document for the first filing."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{u'OrganizationName': u'MAINE HISTORICAL SOCIETY', u'ObjectId': u'201100709349300510', u'URL': u'https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml', u'SubmittedOn': u'2011-03-22', u'DLN': u'93493070005101', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201009', u'FormType': u'990', u'_id': ObjectId('5ad8e83435fd3fae98d6464a'), u'EIN': u'010211530'}\n"
]
}
],
"source": [
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>To recall, our current dataset contains basic details on all 2,373,310 IRS 990 filings. We still don't have the actual 990 data, however. To get that, we will have to pick which filings we want and then access the *URL* field for each filing. Later on we will loop over all 33 rows in our database and access each filing by visiting the link shown in the *URL* field. The code block below prints out the URL for the first filing by our list of 5 EINs. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml\n"
]
}
],
"source": [
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file['URL']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>To access the filing we are going to use a Python library called <code>urllib2</code>. In the following code block we access the first matching row in our MongoDB collection, visit the *URL* and assign the page data to a variable called *url_data*, read in the data and assign it to a variable called *f_string*, and then print out the first 1000 characters of data. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml\n",
"<?xml version=\"1.0\" encoding=\"utf-8\"?>\n",
"<Return xmlns=\"http://www.irs.gov/efile\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.irs.gov/efile\" returnVersion=\"2009v1.7\">\n",
" <ReturnHeader binaryAttachmentCount=\"0\">\n",
" <Timestamp>2011-03-11T11:03:33-06:00</Timestamp>\n",
" <TaxPeriodEndDate>2010-09-30</TaxPeriodEndDate>\n",
" <PreparerFirm>\n",
" <PreparerFirmBusinessName>\n",
" <BusinessNameLine1>MARSHALL &amp; LIBBY LLC</BusinessNameLine1>\n",
" </PreparerFirmBusinessName>\n",
" <PreparerFirmUSAddress>\n",
" <AddressLine1>2367 CONGRESS ST</AddressLine1>\n",
" <City>PORTLAND</City>\n",
" <State>ME</State>\n",
" <ZIPCode>041021932</ZIPCode>\n",
" </PreparerFirmUSAddress>\n",
" </PreparerFirm>\n",
" <ReturnType>990</ReturnType>\n",
" <TaxPeriodBeginDate>2009-10-01</TaxPeriodBeginDate>\n",
" <Filer>\n",
" <EIN>010211530</EIN>\n",
" <Name>\n",
" <BusinessNameLine1>MAINE HISTORICAL SOCIETY</BusinessNameLine1>\n",
" </Name>\n",
" <\n"
]
}
],
"source": [
"import xmltodict\n",
"import urllib2\n",
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file['URL']\n",
"    url_data = urllib2.urlopen(file['URL'])\n",
"    f_string = url_data.read()\n",
"    print f_string[:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The above data are in XML format. We will need to convert them to JSON format in order to insert into MongoDB. For that we will leverage the Python module <code>xmltodict</code>."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://s3.amazonaws.com/irs-form-990/201100709349300510_public.xml\n"
]
}
],
"source": [
"import xmltodict\n",
"import urllib2\n",
"for file in file_list.find({'EIN': { '$in': eins}})[:1]:\n",
"    print file['URL']\n",
"    url_data = urllib2.urlopen(file['URL'])\n",
"    f_string = url_data.read()\n",
"    data = xmltodict.parse(f_string)"
]
},
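`xmltodict` may not be installed everywhere. As a rough stand-in, a few lines of standard-library code illustrate the same idea of turning the 990 XML into nested dictionaries. This is only a sketch: it ignores attributes and repeated tags, both of which `xmltodict` handles properly, and the sample XML below is a trimmed, hand-written fragment modeled on the filing shown above:

```python
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    # Leaf node: return its text; otherwise recurse into the children.
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: xml_to_dict(child) for child in children}

sample = """<Return>
  <ReturnHeader>
    <ReturnType>990</ReturnType>
    <Filer><EIN>010211530</EIN></Filer>
  </ReturnHeader>
</Return>"""

data = xml_to_dict(ET.fromstring(sample))
print(data['ReturnHeader']['Filer']['EIN'])  # 010211530
```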
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Now the data are in a dictionary format. Let's check which keys are present."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'Return']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>There is only a single top-level key, <code>Return</code>, so let's drop down a level."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'@xmlns',\n",
" u'@xmlns:xsi',\n",
" u'@xsi:schemaLocation',\n",
" u'@returnVersion',\n",
" u'ReturnHeader',\n",
" u'ReturnData']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The first four keys do not contain much useful information."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'http://www.irs.gov/efile'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@xmlns']"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'http://www.w3.org/2001/XMLSchema-instance'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@xmlns:xsi']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'http://www.irs.gov/efile'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@xsi:schemaLocation']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'2009v1.7'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['@returnVersion']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The <code>ReturnHeader</code> key contains 11 sub-keys. Here we have information that could be useful for certain research projects. For most purposes, however, we can skip this."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'@binaryAttachmentCount',\n",
" u'Timestamp',\n",
" u'TaxPeriodEndDate',\n",
" u'PreparerFirm',\n",
" u'ReturnType',\n",
" u'TaxPeriodBeginDate',\n",
" u'Filer',\n",
" u'Officer',\n",
" u'Preparer',\n",
" u'TaxYear',\n",
" u'BuildTS']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['ReturnHeader'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Lastly, we inspect the <code>ReturnData</code> key. The first sub-key simply provides a count of the attached documents (in this case, 7); the remaining keys hold the 990 return data and any associated \"schedules.\""
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'@documentCount',\n",
" u'IRS990',\n",
" u'IRS990ScheduleA',\n",
" u'IRS990ScheduleB',\n",
" u'IRS990ScheduleD',\n",
" u'IRS990ScheduleM',\n",
" u'IRS990ScheduleO',\n",
" u'IRS990ScheduleR']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['ReturnData'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>In almost all cases, what we want is the 990 data. As we can see here, there are 207 keys nested under the <code>IRS990</code> key. "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of keys in 990 data: 207 \n",
"\n",
"First 5 keys: [u'@documentId', u'@referenceDocumentId', u'NameOfPrincipalOfficerPerson', u'AddressPrincipalOfficerUS', u'GrossReceipts']\n"
]
}
],
"source": [
"print \"# of keys in 990 data:\", len(data['Return']['ReturnData']['IRS990'].keys()), '\\n'\n",
"print \"First 5 keys:\", data['Return']['ReturnData']['IRS990'].keys()[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>In many cases, such as the fourth key above, <code>AddressPrincipalOfficerUS</code>, there are multiple keys nested under it. Depending on your data needs you can \"flatten\" these data as desired. We'll skip this for now and instead do such manipulations in PANDAS in a later tutorial."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OrderedDict([(u'AddressLine1', u'485 CONGRESS STREET'),\n",
" (u'City', u'PORTLAND'),\n",
" (u'State', u'ME'),\n",
" (u'ZIPCode', u'04101')])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Return']['ReturnData']['IRS990']['AddressPrincipalOfficerUS']"
]
},
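When the time comes, nested fields like `AddressPrincipalOfficerUS` can be flattened with a small recursive helper. A minimal sketch follows; the underscore key-joining convention is my own choice, not something from the tutorial:

```python
def flatten(d, parent_key='', sep='_'):
    # Recursively hoist nested dict keys to the top level, joining with sep.
    items = {}
    for key, value in d.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

officer = {'AddressPrincipalOfficerUS': {'AddressLine1': '485 CONGRESS STREET',
                                         'City': 'PORTLAND',
                                         'State': 'ME',
                                         'ZIPCode': '04101'}}
flat = flatten(officer)
print(flat['AddressPrincipalOfficerUS_City'])  # PORTLAND
```

pandas offers `json_normalize` for the same job at scale, which is the route taken in a later tutorial.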
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>In brief, what we want to do is write code that will loop over all relevant rows in our MongoDB collection, visit the respective URL where the 990 data are located, grab those data, convert them to a dictionary, and then insert into a new MongoDB collection. \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up new MongoDB collection\n",
"In this second collection we will be inserting the actual 990 data we will be downloading."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# DEFINE MY COLLECTION (DATABASE TABLE) WHERE I'LL INSERT THE 990 Data \n",
"filings = db['filings_test']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Set a unique constraint on the collection based on *URL*. This will prevent duplicates from being inserted."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'URL_1'"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filings.create_index([('URL', pymongo.ASCENDING)], unique=True)"
]
},
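With the unique index in place, a second insert carrying the same *URL* raises `DuplicateKeyError` instead of creating a duplicate document. Without a live mongod the behavior can be sketched with a toy stand-in keyed on URL (the class and its `KeyError` are my own simplification, not pymongo's API):

```python
class TinyCollection:
    """Toy stand-in for a MongoDB collection with a unique index on 'URL'."""
    def __init__(self):
        self.docs = {}

    def insert_one(self, doc):
        url = doc['URL']
        if url in self.docs:
            raise KeyError('duplicate key: URL %r' % url)
        self.docs[url] = doc

coll = TinyCollection()
coll.insert_one({'URL': 'https://s3.amazonaws.com/irs-form-990/a.xml'})
try:
    coll.insert_one({'URL': 'https://s3.amazonaws.com/irs-form-990/a.xml'})
except KeyError:
    print('duplicate rejected')
print(len(coll.docs))  # 1
```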
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'_id_', u'URL_1']"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(filings.index_information())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>There are no filings yet."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filings.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the 990 Data\n",
"Now let's run a loop over all five EINs, grab the 33 IRS 990 filings, and insert them into our new MongoDB collection. This block has some additional code that I won't discuss in detail (see the comments for further details). In short, we loop over each row in our database, visit the URL that contains the 990 data, and grab all of the data returned by the <code>IRS990</code> key. For convenience, we then combine these new data with the associated filing index data from our first collection, and insert the combined data into our new <code>filings</code> MongoDB collection.\n",
"\n",
"Note that this code block will only work for organizations that have the <code>IRS990</code> key. The \"check\" is found in the following line of code:\n",
"\n",
"&nbsp; &nbsp; <code>if 'IRS990' in data['Return']['ReturnData']:</code>\n",
"\n",
"This means that organizations filing *990EZ* or *990PF* will be skipped. However, the code block could easily be modified to grab 990EZ or 990PF filings. Similarly, we could modify it to grab the data for any of the 'schedules' included under <code>data['Return']['ReturnData']</code>."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('counter: ', 33, u'BOYS AND GIRLS CLUBS OF SOUTHERN MAINE') \n",
"\n",
"# of minutes: 0.252277318637 \n",
"\n",
"# of filings added to database: 33 \n",
"\n"
]
}
],
"source": [
"import xmltodict\n",
"import urllib2\n",
"import timeit\n",
"\n",
"start_time = timeit.default_timer()\n",
"from IPython.display import display, clear_output ##### FOR USE WITH STDOUT (DYNAMIC, SINGLE-LINE PRINTING)\n",
"\n",
"start_count = filings.count()\n",
"counter = 0\n",
"for f in file_list.find({'EIN': { '$in': eins}}):\n",
" counter += 1\n",
" \n",
" print f\n",
" if 'URL' in f:\n",
" url = f['URL']\n",
" print url, '\\n'\n",
" url_data = urllib2.urlopen(url)\n",
" f_string = url_data.read()\n",
" data = xmltodict.parse(f_string)\n",
" \n",
" try:\n",
" if 'IRS990' in data['Return']['ReturnData']: ##### CHECK WHETHER THE 'IRS990' KEY IS PRESENT IN THE DATA\n",
" data = data['Return']['ReturnData']['IRS990']\n",
" data = to_dict(data)\n",
" \n",
" ##### COMBINE THE DICT OF FILING INFO FROM FIRST STEP WITH FILING DATA GATHERED HERE\n",
" c = {key: value for (key, value) in (f.items() + data.items())}\n",
" c.pop('_id', None) #DROP 'id' (OR IT WILL NOT INSERT)\n",
" t = json.dumps(c)\n",
" #print t\n",
" loaded_entry = json.loads(t)\n",
" #print type(loaded_entry) , loaded_entry #<type 'dict'>\n",
" try:\n",
" filings.insert_one(loaded_entry)\n",
" except pymongo.errors.DuplicateKeyError, e:\n",
" print e, '\\n'\n",
" \n",
" except KeyError:\n",
" print 'IRS9990 key not in data (likely a 990EZ or 990PF filer)'\n",
" print data['Return']['ReturnData'].keys()\n",
" \n",
" else:\n",
" print f['IsAvailable']\n",
" \n",
" clear_output()\n",
" print ('counter: ', counter, f['OrganizationName']), '\\n'\n",
" sys.stdout.flush() \n",
" \n",
"elapsed = timeit.default_timer() - start_time\n",
"print '# of minutes: ', elapsed/60, '\\n'\n",
"print \"# of filings added to database:\", filings.count() - start_count, '\\n' "
]
},
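The dict comprehension over `f.items() + data.items()` is Python 2 specific (`items()` returns a view, not a list, in Python 3). In Python 3 the same merge, with keys from `data` winning on conflict, can be written with unpacking. The field values below are made-up stand-ins:

```python
# Hypothetical index record and parsed 990 data (values are stand-ins).
f = {'EIN': '010211530', 'FormType': '990', '_id': 'abc123'}
data = {'GrossReceipts': '2988525', 'FormType': '990'}

# Merge the index record with the parsed 990 data; later keys win on conflict.
c = {**f, **data}
c.pop('_id', None)  # drop Mongo's _id before re-inserting

print(sorted(c))  # ['EIN', 'FormType', 'GrossReceipts']
```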
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>To wrap things up, what we now have is a new MongoDB table, called *filings*, that contains the 33 IRS 990 filings for our sample of 5 nonprofit organizations. Each row in our database table (or, in MongoDB language, each *document* in our database *collection*) includes both the filing information from our first MongoDB table along with the 990 data we have just downloaded. \n",
"\n",
"One thing to note is that the columns (aka *fields*) in our MongoDB collection often contain one or more variables nested under sub-keys. All 9 of the top-level keys from our first MongoDB table are included, as are all 200+ top-level keys from <code>['Return']['ReturnData']['IRS990']</code>. Each column can in turn hold nested data, which can be \"flattened\" to extract the individual variables -- a task for which PANDAS is much better suited.\n",
"\n",
"\n",
"In addition, we will ultimately need a \"map\" or \"dictionary\" or \"codebook\" to make sense of what the individual columns mean. Solving both of these tasks is what we will take up in future tutorials. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"navigate_num": "#000000",
"navigate_text": "#333333",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700",
"sidebar_border": "#EEEEEE",
"wrapper_background": "#FFFFFF"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "279px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 4,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}