gdsaxton/IRS 990 e-File Data (6) -- Download IRS 990, 990EZ, and 990PF Filings and Associated Schedules into MongoDB.ipynb

## IRS 990 e-File Data (6) -- Download IRS 990, 990EZ, and 990PF Filings and Associated Schedules into MongoDB.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overview\n",
    "\n",
    "This is sixth in a series tutorials that illustrate how to download the <a href=\"https://aws.amazon.com/public-data-sets/irs-990/\">IRS 990 e-file data</a>. <a href=\"http://social-metrics.org/irs-990-e-file-data-part-5/\">Previous tutorials</a> have shown how to download data from IRS 990 filers. This tutorial adds downloads for 990PF and 990EZ filers. I also show you how to modify the code to download *all* filings as well as take a random sample of 990, 990PF, or 990EZ filings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Packages and Set Working Directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import time\n",
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/gsaxton/Dropbox/990 e-file data\n"
     ]
    }
   ],
   "source": [
    "cd '/Users/gsaxton/Dropbox/990 e-file data'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connect to MongoDB database\n",
    "*Note:* In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pymongo\n",
    "from pymongo import MongoClient\n",
    "client = MongoClient()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Connect to our database that contains the filing index data we downloaded earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# DEFINE MY mongoDB DATABASE\n",
    "db = client['irs_990_db']\n",
    "\n",
    "# DEFINE MY COLLECTION WHERE I'LL INSERT MY SEARCH \n",
    "file_list = db['990_files_2011_2018']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Check how many observations in the database table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2373310"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_list.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Three Types of 990 Filers\n",
    "We know there are three types of filers in the database: 990 filers, 990PF filers, and 990EZ filers. Here are the frequencies for each type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{u'_id': u'990PF', u'count': 318823},\n",
       " {u'_id': u'990EZ', u'count': 715541},\n",
       " {u'_id': u'990', u'count': 1338946}]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from bson.son import SON\n",
    "pipeline = [ {\"$group\": {\"_id\": \"$FormType\", \"count\": {\"$sum\": 1}}} ]\n",
    "list(file_list.aggregate(pipeline))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Each of these three filings has different columns and different associated \"schedules.\" Therefore, I would recommend separating them into different tables (collections) in our database. Below we'll thus create three loops -- one for each type of filer. Not only will this facilitate the mapping of variables to database columns, but it is more in line with the typical research design decision of focusing on one of these three types of filers. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Download 990 Filers\n",
    "We begin by downloading the 990 data -- in a slightly different way than in <a href=\"http://social-metrics.org/irs-990-e-file-data-part-5/\">the previous tutorial</a>. Specifically, we are not creating a random sample of 5 EINs and then matching those to filings in our MongoDB collection. Instead, we are filtering our collection and taking the first 5 that have a &nbsp; <code>'990'</code> &nbsp; *FormType*, like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{u'OrganizationName': u'ANGELUS INC', u'ObjectId': u'201103169349300325', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493316003251', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201009', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53b'), u'EIN': u'591971002'} \n",
      "\n",
      "{u'OrganizationName': u'TOUCH-STONE SOLUTIONS INC', u'ObjectId': u'201113139349301231', u'URL': u'https://s3.amazonaws.com/irs-form-990/201113139349301231_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493313012311', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53c'), u'EIN': u'251713602'} \n",
      "\n",
      "{u'OrganizationName': u'RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC', u'ObjectId': u'201113139349301301', u'URL': u'https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493313013011', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201012', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53d'), u'EIN': u'232705170'} \n",
      "\n",
      "{u'OrganizationName': u'TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER', u'ObjectId': u'201113139349301311', u'URL': u'https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493313013111', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53e'), u'EIN': u'581805618'} \n",
      "\n",
      "{u'OrganizationName': u'HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR', u'ObjectId': u'201113139349301316', u'URL': u'https://s3.amazonaws.com/irs-form-990/201113139349301316_public.xml', u'SubmittedOn': u'2011-11-30', u'DLN': u'93493313013161', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990', u'_id': ObjectId('5ad8e83235fd3fae98d5e53f'), u'EIN': u'581876019'} \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for f in file_list.find({'FormType': { '$in': ['990']}})[:5]:\n",
    "    print f, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Let's set up a new collection in our database for the 990 filings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# DEFINE MY COLLECTION (DATABASE TABLE) WHERE I'LL INSERT THE DATA \n",
    "filings_990 = db['filings_990']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Set a unique constraint on the collection based on *URL*. This will avert duplicates from being inserted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'URL_1'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db.filings_990.create_index([('URL', pymongo.ASCENDING)], unique=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'_id_', u'URL_1']"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(db.filings_990.index_information())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filings_990.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Before downloading the data, write a function to turn an ordered dictionary (which is what is returned by *xmltodict*) into a normal Python dictionary so that we can combine it with the filing data gathered above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from json import loads, dumps\n",
    "from collections import OrderedDict\n",
    "\n",
    "def to_dict(input_ordered_dict):\n",
    "    return loads(dumps(input_ordered_dict))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Now we implement the main loop for 990 filers. Note two important changes to the loop. One, in the eighth line of code, the *for loop* is now looping over the first five 990 filings:\n",
    "\n",
    "&nbsp; &nbsp; <code>for f in file_list.find({'FormType': { '$in': ['990']}})[:5]:</code>\n",
    "\n",
    "This helps solve an inefficiency we had in the previous version. Namely, before we were looping over all filings by EINs that matched our list, and then visiting the filing URL and -- if it was a 990 return -- downloading the data. While that worked perfectly well for our \"EIN matching\" approach, if we instead want to download *all* 990 data then this would be inefficient, for we would be visiting all 2,373,310 filing URLs but only want the 1,338,946 that are 990 filings. Better instead to select those 1,338,946 IRS 990 filings first and then only access and download those data. A similar strategy would work for selecting a random sample of 990 filings. \n",
    "\n",
    "Two, we are now inserting the data into our &nbsp; <code>filings_990</code> &nbsp; collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('counter: ', 5, u'HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR') \n",
      "\n",
      "# of minutes:  0.0360030651093 \n",
      "\n",
      "# of filings added to database: 5 \n",
      "\n"
     ]
    }
   ],
   "source": [
    "import xmltodict\n",
    "import urllib2\n",
    "import timeit\n",
    "\n",
    "start_time = timeit.default_timer()\n",
    "from IPython.display import display, clear_output   ##### FOR USE WITH STDOUT (DYNAMIC, SINGLE-LINE PRINTING)\n",
    "\n",
    "start_count = filings_990.count()\n",
    "counter = 0\n",
    "#for f in file_list.find({'EIN': { '$in': eins}}):\n",
    "for f in file_list.find({'FormType': { '$in': ['990']}})[:5]:\n",
    "    counter += 1\n",
    "    \n",
    "    print f\n",
    "    if 'URL' in f:\n",
    "        url = f['URL']\n",
    "        print url, '\\n'\n",
    "        url_data = urllib2.urlopen(url)\n",
    "        f_string = url_data.read()\n",
    "        datax = xmltodict.parse(f_string)\n",
    "        \n",
    "        try:\n",
    "            if 'IRS990' in datax['Return']['ReturnData']:        ##### CHECK WHETHER THE 'IRS990' KEY IS PRESENT IN THE DATA\n",
    "                ##### CREATE DICTIONARY CONTAINING ALL KEYS NESTED UNDER 'Return' EXCEPT FOR THE 'ReturnData' KEY\n",
    "                ##### THERE WILL BE FIVE KEYS IN return_info: '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', '@returnVersion', 'ReturnHeader'\n",
    "                return_info = {k:v for k,v in datax['Return'].iteritems() if k not in ('ReturnData')}\n",
    "                ##### CREATE DICTIONARY CONTAINING ALL KEYS IN ['ReturnData'] EXCEPT FOR THE 'IRS990' KEY\n",
    "                ##### THERE WILL BE A KEY CALLED '@documentCount' (OR '@documentCnt') AS WELL AS ONE PER INCLUDED SCHEDULE\n",
    "                schedules = {k:v for k,v in datax['Return']['ReturnData'].iteritems() if k not in ('IRS990')}\n",
    "                ##### CREATE DICTIONARY FOR 990 DATA\n",
    "                data = datax['Return']['ReturnData']['IRS990']\n",
    "                data = to_dict(data)\n",
    "                        \n",
    "                ##### COMBINE THE DICT OF FILING INFO FROM FIRST STEP WITH FILING DATA GATHERED HERE\n",
    "                c = {key: value for (key, value) in (f.items() + return_info.items() + schedules.items() + data.items())}\n",
    "                c.pop('_id', None)        #DROP 'id' (OR IT WILL NOT INSERT)\n",
    "                t = json.dumps(c)\n",
    "                #print t\n",
    "                loaded_entry = json.loads(t)\n",
    "                #print type(loaded_entry) , loaded_entry    #<type 'dict'>\n",
    "            try:\n",
    "                filings_990.insert_one(loaded_entry)\n",
    "            except pymongo.errors.DuplicateKeyError, e:\n",
    "                print e, '\\n'\n",
    "        \n",
    "        except KeyError:\n",
    "            print 'IRS990 key not in data (likely a 990EZ or 990PF filer)'\n",
    "            print data['Return']['ReturnData'].keys()\n",
    "            \n",
    "    else:\n",
    "        print f['IsAvailable']\n",
    "        \n",
    "    clear_output()\n",
    "    print ('counter: ', counter, f['OrganizationName']), '\\n'\n",
    "    sys.stdout.flush()        \n",
    "        \n",
    "elapsed = timeit.default_timer() - start_time\n",
    "print '# of minutes: ', elapsed/60, '\\n'\n",
    "print \"# of filings added to database:\", filings_990.count() - start_count, '\\n'  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>What we now have is a new MongoDB table, called *filings_990*, that contains the first 5 IRS 990 filings. Each row in our database table (or, in MongoDB language, each *document* in our database *collection*) includes both the filing information from our first MongoDB table along with the 990 data we have just downloaded. In addition, we have added in 5 keys from the <code>['Return']</code> key as well as all available keys from the <code>['ReturnData']</code> key -- that is, those keys containing the \"schedules.\"\n",
    "\n",
    "In the last download above, for instance, there are 9 columns from our first 'index' MongoDB collection, 226 columns of <code>IRS 990</code> filing data, 5 new columns from the <code>['Return']</code> key, and 6 new columns containing <code>['@documentCount']</code> as well as 5 schedules under the following keys: <code>['IRS990ScheduleA', 'IRS990ScheduleD', 'IRS990ScheduleJ', 'IRS990ScheduleO', 'IRS990ScheduleR']</code>. A total of 266 columns (or *fields*) in our DB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of columns from first MongoDB collection: 9\n",
      "# of columns from ['Return'] key: 5\n",
      "# of columns from ['ReturnData'] : 6\n",
      "# of columns from ['IRS990'] key: 246\n",
      "Total # of columns inserted into *filings* MongoDB collection: 266\n"
     ]
    }
   ],
   "source": [
    "print \"# of columns from first MongoDB collection:\", len(f.keys()) - 1\n",
    "print  \"# of columns from ['Return'] key:\", len(return_info.keys())\n",
    "print  \"# of columns from ['ReturnData'] :\", len(schedules.keys())\n",
    "print  \"# of columns from ['IRS990'] key:\", len(data.keys())\n",
    "print  \"Total # of columns inserted into *filings* MongoDB collection:\", len(c.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted above, our code will only work for organizations that have the <code>IRS990</code> key. We won't get data for 990EZ or 990PF filers. Focusing on one of the three groups of filers would be the typical research design decision. In the next code blocks we will modify our code to download the 990EZ and 990PF data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Download 990EZ Filers\n",
    "Let's turn to the 990EZ filers. Here we filter our collection and taking the first 5 that have a &nbsp; <code>'990EZ'</code> &nbsp; *FormType:*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{u'OrganizationName': u'OHIO HIGH SCHOOL BOYS VOLLEYBALL ASSOC', u'ObjectId': u'201103149349200730', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103149349200730_public.xml', u'SubmittedOn': u'2011-11-28', u'DLN': u'93492314007301', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990EZ', u'_id': ObjectId('5ad8e83235fd3fae98d5e5b1'), u'EIN': u'311793149'} \n",
      "\n",
      "{u'OrganizationName': u'KINSHIP OF THE PERHAM AREA', u'ObjectId': u'201103149349200735', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103149349200735_public.xml', u'SubmittedOn': u'2011-11-28', u'DLN': u'93492314007351', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990EZ', u'_id': ObjectId('5ad8e83235fd3fae98d5e5b2'), u'EIN': u'273092148'} \n",
      "\n",
      "{u'OrganizationName': u'CASTLEWOOD INC', u'ObjectId': u'201103149349200800', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103149349200800_public.xml', u'SubmittedOn': u'2011-11-28', u'DLN': u'93492314008001', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990EZ', u'_id': ObjectId('5ad8e83235fd3fae98d5e5b3'), u'EIN': u'611384219'} \n",
      "\n",
      "{u'OrganizationName': u'HOME BUILDING FOUNDATION OF THE DELTA', u'ObjectId': u'201103139349201135', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103139349201135_public.xml', u'SubmittedOn': u'2011-11-28', u'DLN': u'93492313011351', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201012', u'FormType': u'990EZ', u'_id': ObjectId('5ad8e83235fd3fae98d5e5b4'), u'EIN': u'870752964'} \n",
      "\n",
      "{u'OrganizationName': u'PTAI REBA O STECK', u'ObjectId': u'201103139349201140', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103139349201140_public.xml', u'SubmittedOn': u'2011-11-28', u'DLN': u'93492313011401', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990EZ', u'_id': ObjectId('5ad8e83235fd3fae98d5e5b5'), u'EIN': u'363830578'} \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for f in file_list.find({'FormType': { '$in': ['990EZ']}})[:5]:\n",
    "    print f, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Let's set up a new collection in our database for the 990EZ filings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# DEFINE MY COLLECTION (DATABASE TABLE) WHERE I'LL INSERT THE DATA \n",
    "filings_990EZ = db['filings_990EZ']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Set a unique constraint on the collection based on *URL*. This will avert duplicates from being inserted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'URL_1'"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db.filings_990EZ.create_index([('URL', pymongo.ASCENDING)], unique=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'_id_', u'URL_1']"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(db.filings_990EZ.index_information())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filings_990EZ.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Now we implement the main loop for 990EZ filers. Note three important changes to the loop. One, in the eighth line of code, the *for loop* is now looping over the first five 990EZ filings:\n",
    "\n",
    "&nbsp; &nbsp; <code>for f in file_list.find({'FormType': { '$in': ['990EZ']}})[:5]:</code>\n",
    "\n",
    "Two, we are now inserting the data into our &nbsp; <code>filings_990EZ</code> &nbsp; collection. And three, our 'check' is changed from searching for an 'IRS990' key to a 'IRS990EZ' key:\n",
    "\n",
    "&nbsp; &nbsp; <code>if 'IRS990EZ' in datax['Return']['ReturnData']: </code>\n",
    "\n",
    "Two other places in the code specifying the 'IRS990' key have also been changed to 'IRS990EZ'. Note that I have done some test downloads before this to know that this is the relevant key. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('counter: ', 5, u'PTAI REBA O STECK') \n",
      "\n",
      "# of minutes:  0.0371069987615 \n",
      "\n",
      "# of filings added to database: 5 \n",
      "\n"
     ]
    }
   ],
   "source": [
    "import xmltodict\n",
    "import urllib2\n",
    "import timeit\n",
    "\n",
    "start_time = timeit.default_timer()\n",
    "from IPython.display import display, clear_output   ##### FOR USE WITH STDOUT (DYNAMIC, SINGLE-LINE PRINTING)\n",
    "\n",
    "start_count = filings_990EZ.count()\n",
    "counter = 0\n",
    "#for f in file_list.find({'EIN': { '$in': eins}}):\n",
    "for f in file_list.find({'FormType': { '$in': ['990EZ']}})[:5]:\n",
    "    counter += 1\n",
    "    \n",
    "    print f\n",
    "    if 'URL' in f:\n",
    "        url = f['URL']\n",
    "        print url, '\\n'\n",
    "        url_data = urllib2.urlopen(url)\n",
    "        f_string = url_data.read()\n",
    "        datax = xmltodict.parse(f_string)\n",
    "        \n",
    "        #print datax['Return']['ReturnData'].keys()\n",
    "        \n",
    "        try:\n",
    "            if 'IRS990EZ' in datax['Return']['ReturnData']:        ##### CHECK WHETHER THE 'IRS990' KEY IS PRESENT IN THE DATA\n",
    "                ##### CREATE DICTIONARY CONTAINING ALL KEYS NESTED UNDER 'Return' EXCEPT FOR THE 'ReturnData' KEY\n",
    "                ##### THERE WILL BE FIVE KEYS IN return_info: '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', '@returnVersion', 'ReturnHeader'\n",
    "                return_info = {k:v for k,v in datax['Return'].iteritems() if k not in ('ReturnData')}\n",
    "                ##### CREATE DICTIONARY CONTAINING ALL KEYS IN ['ReturnData'] EXCEPT FOR THE 'IRS990' KEY\n",
    "                ##### THERE WILL BE A KEY CALLED '@documentCount' (OR '@documentCnt') AS WELL AS ONE PER INCLUDED SCHEDULE\n",
    "                schedules = {k:v for k,v in datax['Return']['ReturnData'].iteritems() if k not in ('IRS990EZ')}\n",
    "                ##### CREATE DICTIONARY FOR 990 DATA\n",
    "                data = datax['Return']['ReturnData']['IRS990EZ']\n",
    "                data = to_dict(data)\n",
    "                        \n",
    "                ##### COMBINE THE DICT OF FILING INFO FROM FIRST STEP WITH FILING DATA GATHERED HERE\n",
    "                c = {key: value for (key, value) in (f.items() + return_info.items() + schedules.items() + data.items())}\n",
    "                c.pop('_id', None)        #DROP 'id' (OR IT WILL NOT INSERT)\n",
    "                t = json.dumps(c)\n",
    "                #print t\n",
    "                loaded_entry = json.loads(t)\n",
    "                #print type(loaded_entry) , loaded_entry    #<type 'dict'>\n",
    "            try:\n",
    "                filings_990EZ.insert_one(loaded_entry)\n",
    "            except pymongo.errors.DuplicateKeyError, e:\n",
    "                print e, '\\n'\n",
    "        \n",
    "        except KeyError:\n",
    "            print '990EZ key not in data (likely a 990 or 990PF filer)'\n",
    "            print data['Return']['ReturnData'].keys()\n",
    "            \n",
    "    else:\n",
    "        print f['IsAvailable']\n",
    "        \n",
    "    clear_output()\n",
    "    print ('counter: ', counter, f['OrganizationName']), '\\n'\n",
    "    sys.stdout.flush()        \n",
    "        \n",
    "elapsed = timeit.default_timer() - start_time\n",
    "print '# of minutes: ', elapsed/60, '\\n'\n",
    "print \"# of filings added to database:\", filings_990EZ.count() - start_count, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Download 990PF Filers\n",
    "Let's turn to the 990PF filers. Here we filter our collection and taking the first 5 that have a &nbsp; <code>'990PF'</code> &nbsp; *FormType:*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{u'OrganizationName': u'SUSAN DANIELS FAMILY CHARITABLE FOUNDATION', u'ObjectId': u'201103169349100000', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349100000_public.xml', u'SubmittedOn': u'2011-12-01', u'DLN': u'93491316000001', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990PF', u'_id': ObjectId('5ad8e83235fd3fae98d5e764'), u'EIN': u'043338422'} \n",
      "\n",
      "{u'OrganizationName': u'PETER NAGER CHARITABLE TRUST', u'ObjectId': u'201103169349100100', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349100100_public.xml', u'SubmittedOn': u'2011-12-01', u'DLN': u'93491316001001', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201106', u'FormType': u'990PF', u'_id': ObjectId('5ad8e83235fd3fae98d5e765'), u'EIN': u'137196662'} \n",
      "\n",
      "{u'OrganizationName': u'CCJ FOUNDATION', u'ObjectId': u'201103169349100200', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349100200_public.xml', u'SubmittedOn': u'2011-12-01', u'DLN': u'93491316002001', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201012', u'FormType': u'990PF', u'_id': ObjectId('5ad8e83235fd3fae98d5e766'), u'EIN': u'256731926'} \n",
      "\n",
      "{u'OrganizationName': u'PATRICIA GALLOWAY AND KRIS NIELSEN FOUNDATION', u'ObjectId': u'201103169349100300', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349100300_public.xml', u'SubmittedOn': u'2011-12-01', u'DLN': u'93491316003001', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201012', u'FormType': u'990PF', u'_id': ObjectId('5ad8e83235fd3fae98d5e767'), u'EIN': u'271196652'} \n",
      "\n",
      "{u'OrganizationName': u'JEANNE AND JIM RILEY FOUNDATION', u'ObjectId': u'201103169349100400', u'URL': u'https://s3.amazonaws.com/irs-form-990/201103169349100400_public.xml', u'SubmittedOn': u'2011-12-01', u'DLN': u'93491316004001', u'LastUpdated': u'2016-03-21T17:23:53', u'TaxPeriod': u'201012', u'FormType': u'990PF', u'_id': ObjectId('5ad8e83235fd3fae98d5e768'), u'EIN': u'731690017'} \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for f in file_list.find({'FormType': { '$in': ['990PF']}})[:5]:\n",
    "    print f, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Let's set up a new collection in our database for the 990 filings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# DEFINE MY COLLECTION (DATABASE TABLE) WHERE I'LL INSERT THE DATA \n",
    "filings_990PF = db['filings_990PF']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Set a unique constraint on the collection based on *URL*. This will avert duplicates from being inserted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'URL_1'"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db.filings_990PF.create_index([('URL', pymongo.ASCENDING)], unique=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'_id_', u'URL_1']"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(db.filings_990PF.index_information())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filings_990PF.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Now we implement the main loop for 990PF filers. Note three important changes to the loop. One, in the eighth line of code, the *for loop* is now looping over the first five 990PF filings:\n",
    "\n",
    "&nbsp; &nbsp; <code>for f in file_list.find({'FormType': { '$in': ['990PF']}})[:5]:</code>\n",
    "\n",
    "Two, we are now inserting the data into our &nbsp; <code>filings_990PF</code> &nbsp; collection. And three, our 'check' is changed from searching for an 'IRS990' key to a 'IRS990PF' key:\n",
    "\n",
    "&nbsp; &nbsp; <code>if 'IRS990PF' in datax['Return']['ReturnData']: </code>\n",
    "\n",
    "Two other places in the code specifying the 'IRS990' key have also been changed to 'IRS990PF'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('counter: ', 5, 'JEANNE AND JIM RILEY FOUNDATION') \n",
      "\n",
      "# of minutes:  0.0301501512527 \n",
      "\n",
      "# of filings added to database: 5 \n",
      "\n"
     ]
    }
   ],
   "source": [
    "import xmltodict\n",
    "import urllib2\n",
    "import timeit\n",
    "\n",
    "start_time = timeit.default_timer()\n",
    "from IPython.display import display, clear_output   ##### FOR USE WITH STDOUT (DYNAMIC, SINGLE-LINE PRINTING)\n",
    "\n",
    "start_count = filings_990PF.count()\n",
    "counter = 0\n",
    "#for f in file_list.find({'EIN': { '$in': eins}}):\n",
    "for f in file_list.find({'FormType': { '$in': ['990PF']}})[:5]:\n",
    "    counter += 1\n",
    "    \n",
    "    print f\n",
    "    if 'URL' in f:\n",
    "        url = f['URL']\n",
    "        print url, '\\n'\n",
    "        url_data = urllib2.urlopen(url)\n",
    "        f_string = url_data.read()\n",
    "        datax = xmltodict.parse(f_string)\n",
    "        \n",
    "        print datax['Return']['ReturnData'].keys()\n",
    "        \n",
    "        try:\n",
    "            if 'IRS990PF' in datax['Return']['ReturnData']:        ##### CHECK WHETHER THE 'IRS990' KEY IS PRESENT IN THE DATA\n",
    "                ##### CREATE DICTIONARY CONTAINING ALL KEYS NESTED UNDER 'Return' EXCEPT FOR THE 'ReturnData' KEY\n",
    "                ##### THERE WILL BE FIVE KEYS IN return_info: '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', '@returnVersion', 'ReturnHeader'\n",
    "                return_info = {k:v for k,v in datax['Return'].iteritems() if k not in ('ReturnData')}\n",
    "                ##### CREATE DICTIONARY CONTAINING ALL KEYS IN ['ReturnData'] EXCEPT FOR THE 'IRS990' KEY\n",
    "                ##### THERE WILL BE A KEY CALLED '@documentCount' (OR '@documentCnt') AS WELL AS ONE PER INCLUDED SCHEDULE\n",
    "                schedules = {k:v for k,v in datax['Return']['ReturnData'].iteritems() if k not in ('IRS990PF')}\n",
    "                ##### CREATE DICTIONARY FOR 990 DATA\n",
    "                data = datax['Return']['ReturnData']['IRS990PF']\n",
    "                data = to_dict(data)\n",
    "                        \n",
    "                ##### COMBINE THE DICT OF FILING INFO FROM FIRST STEP WITH FILING DATA GATHERED HERE\n",
    "                c = {key: value for (key, value) in (f.items() + return_info.items() + schedules.items() + data.items())}\n",
    "                c.pop('_id', None)        #DROP 'id' (OR IT WILL NOT INSERT)\n",
    "                t = json.dumps(c)\n",
    "                #print t\n",
    "                loaded_entry = json.loads(t)\n",
    "                #print type(loaded_entry) , loaded_entry    #<type 'dict'>\n",
    "            try:\n",
    "                filings_990PF.insert_one(loaded_entry)\n",
    "            except pymongo.errors.DuplicateKeyError, e:\n",
    "                print e, '\\n'\n",
    "        \n",
    "        except KeyError:\n",
    "            print '990PF key not in data (likely a 990 or 990EZ filer)'\n",
    "            print data['Return']['ReturnData'].keys()\n",
    "            \n",
    "    else:\n",
    "        print f['IsAvailable']\n",
    "        \n",
    "    clear_output()\n",
    "    print ('counter: ', counter, str(f['OrganizationName'])), '\\n'\n",
    "    sys.stdout.flush()        \n",
    "        \n",
    "elapsed = timeit.default_timer() - start_time\n",
    "print '# of minutes: ', elapsed/60, '\\n'\n",
    "print \"# of filings added to database:\", filings_990PF.count() - start_count, '\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### To download *all* 990 filings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have MongoDB collections set up for the three main types of 990 filers -- 990, 990EZ, and 990PF -- and have done trial downloads of 5 filings for each type. To download *all* available filings for a given type, just modify the following line of code:\n",
    "\n",
    "&nbsp; &nbsp; <code>for f in file_list.find({'FormType': { '$in': ['990PF']}})[:5]:</code>\n",
    "\n",
    "What you need to do is replace the &nbsp; <code>[:5]:</code> &nbsp; at the end with &nbsp; <code>:</code> &nbsp; \n",
    "\n",
    "This will then loop over all available filings. It will take some time and take up lots of 100s of GB of hard drive space, but you can have your own copy of the entire IRS e-file database."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### To download a *sample* of filings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also modify the above line and take a random sample of 990, 990EZ, or 990PF filings. The following line will match all filings of type *990PF* and then take a random sample of 10 990PF filings: \n",
    "\n",
    "&nbsp; &nbsp; <code>for f in file_list.aggregate([{'$match':  {'FormType': {'$in': ['990PF']}}}, {'$sample': {'size': 10}}]):</code>\n",
    "\n",
    "Modify the form type and the sample size and you're good to go!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "279px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}