kenorb/IRS 990 e-File Data (7) -- Generate Data Dictionary.ipynb

## IRS 990 e-File Data (7) -- Generate Data Dictionary.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overview\n",
    "\n",
    "This is seventh in a series tutorials that illustrate how to download the <a href=\"https://aws.amazon.com/public-data-sets/irs-990/\">IRS 990 e-file data</a>. In a series of <a href=\"http://social-metrics.org/sox/\">previous tutorials</a> I have shown how to download data from IRS 990, 990PF, and 990EZ filers. In this tutorial we are going to generate a codebook or 'data dictionary' that can help us make sense of the variables we have downloaded."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import time\n",
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from pandas import DataFrame\n",
    "from pandas import Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#http://pandas.pydata.org/pandas-docs/stable/options.html\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('max_colwidth', 250)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The need for a codebook\n",
    "\n",
    "As you can tell from looking at the following sample filing, the data by themselves are difficult to interpret and difficult to match to the various sections of the paper 990 documents: https://s3.amazonaws.com/irs-form-990/201543109349200219_public.xml\n",
    "\n",
    "To get the variable descriptions you need to dig through the 'schemas', such as that here for 2015: https://www.irs.gov/e-file-providers/990-990-ez-990-pf-ty2015-v30-schema-business-rules-and-release-memo \n",
    "  \n",
    "The IRS makes it difficult to find the <code>*.xsd</code> stylesheets that can be used to make sense of the data. You can visit here -- https://www.irs.gov/downloads/irs-schema/ -- sort the files by Name, and then look for the <code>efile990...</code> zip files. \n",
    "\n",
    "Alternatively, you can visit https://www.irs.gov/e-file-providers/current-valid-xml-schemas-and-business-rules-for-exempt-organizations-modernized-e-file -- where you can currently find links to .xsd files for 2014 and 2015. \n",
    "\n",
    "\n",
    "For 2015 you would iterate through the links this way:\n",
    "- https://www.irs.gov/e-file-providers/ty2015-current-valid-xml-schemas-and-business-rules-for-exempt-organizations-modernized-efile-mef then\n",
    "- https://www.irs.gov/e-file-providers/990-990-ez-990-pf-ty2015-v30-schema-business-rules-and-release-memo then\n",
    "- https://www.irs.gov/pub/irs-schema/efile990x_2015v3.0.zip\n",
    "\n",
    "\n",
    "For 2014, the latter will take you here:\n",
    "- https://www.irs.gov/e-file-providers/ty2014-current-valid-xml-schemas-and-business-rules-for-exempt-organizations-modernized-efile-mef\n",
    "Then here:\n",
    "- https://www.irs.gov/e-file-providers/990-990-ez-990-pf-ty2014-v60-schema-business-rules-and-release-memo\n",
    "Then you will download the \"Schemas\" zip file:\n",
    "- https://www.irs.gov/pub/irs-schema/efile990x_2014v6.0_09082015.zip\n",
    "\n",
    "\n",
    "As a shorthand, here are the direct links to .xsd files for all available years:\n",
    "\n",
    "- 2015:\n",
    "    - https://www.irs.gov/pub/irs-schema/efile990x_2015v3.0.zip\n",
    "    - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>/efile990x_2015v3.0/2015v3.0/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp; \n",
    "    - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).\n",
    "\n",
    "\n",
    "- 2014:\n",
    "    - https://www.irs.gov/pub/irs-schema/efile990x_2014v6.0_09082015.zip\n",
    "    - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>efile990x_2014v6.0_09082015/2014v6.0/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp;\n",
    "    - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).\n",
    "    \n",
    "\n",
    "\n",
    "- 2013:\n",
    "    - https://www.irs.gov/pub/irs-schema/efile990x_2013v4.0.zip\n",
    "    - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>efile990x_2013v4.0/2013v4.0/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp;\n",
    "    - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).    \n",
    "\n",
    "\n",
    "- 2010:\n",
    "    - https://www.irs.gov/pub/irs-schema/efile990x_2010v3.7.zip  \n",
    "    - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>/efile990x_2010v3.7/2010v3.7/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp; \n",
    "    - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).\n",
    "    \n",
    "    \n",
    "    \n",
    "    \n",
    "You should also be aware of the *Nonprofit Open Data Collective*. That project is a more comprehensive effort, particularly their <a href=\"https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/blob/master/efiler_master_concordance.csv\">master concordance file</a>. My tutorial is mostly intended to highlight the nature of the data and the schemas and serve as an alternative Python version."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loop over XSD files, generate relevant dictionary with variable details, and append to PANDAS dataframe\n",
    "Now we will process the .xsd files. I am assuming you have downloaded the four above *IRS900.xsd* files, put them in a folder called */xsd files/*, and renamed them *IRS990 (2010).xsd*, *IRS990 (2013).xsd*, *IRS990 (2014).xsd*, and *IRS990 (2015).xsd*.\n",
    "\n",
    "In the following code blocks we will set our working directory and then loop over each of the four xsd files. For each one, we are going to move down to where are the 'elements' are contained. Each element is variable. What we are interested in is the &nbsp; <code>name</code> &nbsp; attribute, the &nbsp; <code>Description</code>, &nbsp; which tells us what the variable means, and the &nbsp; <code>LineNumber</code>, &nbsp; which provides a location for the variable within the 990 form. Together, we have enough information to match the columns in our 990 data to the specific variables we need. Here is an an example of one 'element' in the 2014 xsd file:\n",
    "\n",
    "\t\t\t<!-- Total Number Volunteers -->\n",
    "\t\t\t<xsd:element name=\"TotalVolunteersCnt\" type=\"IntegerNNType\" minOccurs=\"0\">\n",
    "\t\t\t\t<xsd:annotation>\n",
    "\t\t\t\t\t<xsd:documentation>\n",
    "\t\t\t\t\t\t<Description>Total number volunteers</Description>\n",
    "\t\t\t\t\t\t<LineNumber>Part I Line 6</LineNumber>\n",
    "\t\t\t\t\t</xsd:documentation>\n",
    "\t\t\t\t</xsd:annotation>\n",
    "\t\t\t</xsd:element>\n",
    "\n",
    "Our code below searches over each element and assigns the Description to a variable *description*, the LineNumber to a variable *line_number*, and the year to a variable *year*, and assigns all three variables to a dictionary called *variables*, with the major keys being the variable names. After processing each xsd file, the data are added to a PANDAS dataframe called *variables_df.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cd '/Users/gsaxton/Dropbox/990 e-file data/xsd files/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# of rows in dataframe: 1314 \n",
      "\n",
      "# of rows per year: 2013    331\n",
      "2014    331\n",
      "2015    331\n",
      "2010    321\n",
      "Name: year, dtype: int64 \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>description</th>\n",
       "      <th>line_number</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>variable</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AccountantCompileOrReview</th>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AccountsPayableAccruedExpenses</th>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AccountsReceivable</th>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ActivitiesConductedPartnership</th>\n",
       "      <td>Activities conducted thru partnership?</td>\n",
       "      <td>Part IV Line 37</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Activity2</th>\n",
       "      <td>Second activity group</td>\n",
       "      <td>Part III Line 4b</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                              description  \\\n",
       "variable                                                                    \n",
       "AccountantCompileOrReview       Accountant provide compilation or review?   \n",
       "AccountsPayableAccruedExpenses      Accounts payable and accrued expenses   \n",
       "AccountsReceivable                               Accounts receivable, net   \n",
       "ActivitiesConductedPartnership     Activities conducted thru partnership?   \n",
       "Activity2                                           Second activity group   \n",
       "\n",
       "                                     line_number  year  \n",
       "variable                                                \n",
       "AccountantCompileOrReview       Part XII Line 2a  2010  \n",
       "AccountsPayableAccruedExpenses    Part X Line 17  2010  \n",
       "AccountsReceivable                 Part X Line 4  2010  \n",
       "ActivitiesConductedPartnership   Part IV Line 37  2010  \n",
       "Activity2                       Part III Line 4b  2010  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variables_df = pd.DataFrame()\n",
    "import xml.etree.ElementTree as ET\n",
    "for year in ['2010', '2013', '2014', '2015']:\n",
    "    variables = {}\n",
    "    tree = ET.parse('IRS990 (%s).xsd' % year)\n",
    "    root = tree.getroot()\n",
    "    #print root[3].tag, root[3].attrib\n",
    "    \n",
    "    for r in root[3][1].findall(\"{http://www.w3.org/2001/XMLSchema}element\"):\n",
    "        name = r.attrib['name']\n",
    "        attrs = r.find('{http://www.w3.org/2001/XMLSchema}annotation/{http://www.w3.org/2001/XMLSchema}documentation')\n",
    "        description = attrs[0]\n",
    "        if len(attrs)>1:\n",
    "            line_number = attrs[1].text\n",
    "        else:\n",
    "            line_number = ''\n",
    "        #print name, description.text, line_number\n",
    "        \n",
    "        #if name not in variables:\n",
    "        #    variables[name] == name\n",
    "        variables.update({name: {'description': description.text, 'line_number': line_number, 'year': year}})\n",
    "    \n",
    "    variables_df = variables_df.append(DataFrame(variables).T) #, ignore_index = True)\n",
    "    variables_df.index.name = 'variable'\n",
    "\n",
    "print \"# of rows in dataframe:\", len(variables_df), '\\n'\n",
    "print \"# of rows per year:\", variables_df['year'].value_counts(), '\\n'\n",
    "variables_df[:5]  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>We see there are 1,314 rows in the database. Above you can see the first five rows and below the last five rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>description</th>\n",
       "      <th>line_number</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>variable</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>VotingMembersGoverningBodyCnt</th>\n",
       "      <td>Number voting members governing body</td>\n",
       "      <td>Part I Line 3</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>VotingMembersIndependentCnt</th>\n",
       "      <td>Number independent voting members</td>\n",
       "      <td>Part I Line 4</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WebsiteAddressTxt</th>\n",
       "      <td>Web site</td>\n",
       "      <td>J</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WhistleblowerPolicyInd</th>\n",
       "      <td>Whistleblower policy?</td>\n",
       "      <td>Part VI Section B Line 13</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>WrittenPolicyOrProcedureInd</th>\n",
       "      <td>Written policy or procedure?</td>\n",
       "      <td>Part VI Section B Line 16b</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                        description  \\\n",
       "variable                                                              \n",
       "VotingMembersGoverningBodyCnt  Number voting members governing body   \n",
       "VotingMembersIndependentCnt       Number independent voting members   \n",
       "WebsiteAddressTxt                                          Web site   \n",
       "WhistleblowerPolicyInd                        Whistleblower policy?   \n",
       "WrittenPolicyOrProcedureInd            Written policy or procedure?   \n",
       "\n",
       "                                              line_number  year  \n",
       "variable                                                         \n",
       "VotingMembersGoverningBodyCnt               Part I Line 3  2015  \n",
       "VotingMembersIndependentCnt                 Part I Line 4  2015  \n",
       "WebsiteAddressTxt                                       J  2015  \n",
       "WhistleblowerPolicyInd          Part VI Section B Line 13  2015  \n",
       "WrittenPolicyOrProcedureInd    Part VI Section B Line 16b  2015  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variables_df[-5:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>We're not quite done, however. To see why, let's reset the index and then sort the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>variable</th>\n",
       "      <th>description</th>\n",
       "      <th>line_number</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AccountantCompileOrReview</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AccountsPayableAccruedExpenses</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AccountsReceivable</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ActivitiesConductedPartnership</td>\n",
       "      <td>Activities conducted thru partnership?</td>\n",
       "      <td>Part IV Line 37</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Activity2</td>\n",
       "      <td>Second activity group</td>\n",
       "      <td>Part III Line 4b</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         variable                                description  \\\n",
       "0       AccountantCompileOrReview  Accountant provide compilation or review?   \n",
       "1  AccountsPayableAccruedExpenses      Accounts payable and accrued expenses   \n",
       "2              AccountsReceivable                   Accounts receivable, net   \n",
       "3  ActivitiesConductedPartnership     Activities conducted thru partnership?   \n",
       "4                       Activity2                      Second activity group   \n",
       "\n",
       "        line_number  year  \n",
       "0  Part XII Line 2a  2010  \n",
       "1    Part X Line 17  2010  \n",
       "2     Part X Line 4  2010  \n",
       "3   Part IV Line 37  2010  \n",
       "4  Part III Line 4b  2010  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variables_df = variables_df.reset_index()\n",
    "variables_df[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>variable</th>\n",
       "      <th>description</th>\n",
       "      <th>line_number</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AccountantCompileOrReview</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321</th>\n",
       "      <td>AccountantCompileOrReviewInd</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>652</th>\n",
       "      <td>AccountantCompileOrReviewInd</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>983</th>\n",
       "      <td>AccountantCompileOrReviewInd</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>322</th>\n",
       "      <td>AccountsPayableAccrExpnssGrp</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>653</th>\n",
       "      <td>AccountsPayableAccrExpnssGrp</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>984</th>\n",
       "      <td>AccountsPayableAccrExpnssGrp</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AccountsPayableAccruedExpenses</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AccountsReceivable</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>323</th>\n",
       "      <td>AccountsReceivableGrp</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           variable  \\\n",
       "0         AccountantCompileOrReview   \n",
       "321    AccountantCompileOrReviewInd   \n",
       "652    AccountantCompileOrReviewInd   \n",
       "983    AccountantCompileOrReviewInd   \n",
       "322    AccountsPayableAccrExpnssGrp   \n",
       "653    AccountsPayableAccrExpnssGrp   \n",
       "984    AccountsPayableAccrExpnssGrp   \n",
       "1    AccountsPayableAccruedExpenses   \n",
       "2                AccountsReceivable   \n",
       "323           AccountsReceivableGrp   \n",
       "\n",
       "                                   description       line_number  year  \n",
       "0    Accountant provide compilation or review?  Part XII Line 2a  2010  \n",
       "321  Accountant provide compilation or review?  Part XII Line 2a  2013  \n",
       "652  Accountant provide compilation or review?  Part XII Line 2a  2014  \n",
       "983  Accountant provide compilation or review?  Part XII Line 2a  2015  \n",
       "322      Accounts payable and accrued expenses    Part X Line 17  2013  \n",
       "653      Accounts payable and accrued expenses    Part X Line 17  2014  \n",
       "984      Accounts payable and accrued expenses    Part X Line 17  2015  \n",
       "1        Accounts payable and accrued expenses    Part X Line 17  2010  \n",
       "2                     Accounts receivable, net     Part X Line 4  2010  \n",
       "323                   Accounts receivable, net     Part X Line 4  2013  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variables_df.sort_values(by=['variable', 'year'], ascending=[1,1])[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>As you can see above, each variable name occurs in more than one row. This is good, because we don't want the IRS to change the variable name every year. A dataframe in this format is also helpful if you are searching by year and then variable name. Yet it becomes problematic if you are searching by variable name. \n",
    "\n",
    "So, what we will do is 'collapse' the above dataframe. Specifically, we will search for each row that has an indentical array of values on the first three columns -- *variable*, *description*, and *line_number*. We will then create new columns for the first year the variable-description-line_number array appears, the last year it appears, and all years it appears. \n",
    "\n",
    "This is a task PANDAS excels at. It is called a *groupby* operation. I will show you two ways of doing it. The first is with an aggregation *dictionary*. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "652\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>variable</th>\n",
       "      <th>description</th>\n",
       "      <th>line_number</th>\n",
       "      <th>first_year</th>\n",
       "      <th>last_year</th>\n",
       "      <th>all_years</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AccountantCompileOrReview</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AccountantCompileOrReviewInd</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>2013, 2014, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AccountsPayableAccrExpnssGrp</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>2013, 2014, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AccountsPayableAccruedExpenses</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AccountsReceivable</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>AccountsReceivableGrp</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>2013, 2014, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AcctCompileOrReviewBasisGrp</td>\n",
       "      <td>Basis in which the financial statements were compiled or reviewed by independent accountant</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>2013, 2014, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ActivitiesConductedPartnership</td>\n",
       "      <td>Activities conducted thru partnership?</td>\n",
       "      <td>Part IV Line 37</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ActivitiesConductedPrtshpInd</td>\n",
       "      <td>Activities conducted thru partnership?</td>\n",
       "      <td>Part IV Line 37</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>2013, 2014, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Activity2</td>\n",
       "      <td>Second activity group</td>\n",
       "      <td>Part III Line 4b</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         variable  \\\n",
       "0       AccountantCompileOrReview   \n",
       "1    AccountantCompileOrReviewInd   \n",
       "2    AccountsPayableAccrExpnssGrp   \n",
       "3  AccountsPayableAccruedExpenses   \n",
       "4              AccountsReceivable   \n",
       "5           AccountsReceivableGrp   \n",
       "6     AcctCompileOrReviewBasisGrp   \n",
       "7  ActivitiesConductedPartnership   \n",
       "8    ActivitiesConductedPrtshpInd   \n",
       "9                       Activity2   \n",
       "\n",
       "                                                                                   description  \\\n",
       "0                                                    Accountant provide compilation or review?   \n",
       "1                                                    Accountant provide compilation or review?   \n",
       "2                                                        Accounts payable and accrued expenses   \n",
       "3                                                        Accounts payable and accrued expenses   \n",
       "4                                                                     Accounts receivable, net   \n",
       "5                                                                     Accounts receivable, net   \n",
       "6  Basis in which the financial statements were compiled or reviewed by independent accountant   \n",
       "7                                                       Activities conducted thru partnership?   \n",
       "8                                                       Activities conducted thru partnership?   \n",
       "9                                                                        Second activity group   \n",
       "\n",
       "        line_number first_year last_year         all_years  \n",
       "0  Part XII Line 2a       2010      2010              2010  \n",
       "1  Part XII Line 2a       2013      2015  2013, 2014, 2015  \n",
       "2    Part X Line 17       2013      2015  2013, 2014, 2015  \n",
       "3    Part X Line 17       2010      2010              2010  \n",
       "4     Part X Line 4       2010      2010              2010  \n",
       "5     Part X Line 4       2013      2015  2013, 2014, 2015  \n",
       "6  Part XII Line 2a       2013      2015  2013, 2014, 2015  \n",
       "7   Part IV Line 37       2010      2010              2010  \n",
       "8   Part IV Line 37       2013      2015  2013, 2014, 2015  \n",
       "9  Part III Line 4b       2010      2010              2010  "
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# CREATE DICTIONARY OF AGGREGATION OPERATIONS\n",
    "aggregations = {'year': {'first_year': 'first', 'last_year': 'last', 'all_years': lambda col: ', '.join(col)}}\n",
    "collapsed_df = variables_df.groupby(['variable', 'description', 'line_number']).agg(aggregations)\n",
    "#Drop the outermost level from the hierarchical column index:\n",
    "#http://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function\n",
    "collapsed_df.columns = collapsed_df.columns.droplevel(0)\n",
    "collapsed_df = collapsed_df.reset_index()\n",
    "print len(collapsed_df)\n",
    "collapsed_df[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PANDAS will be <a href=\"https://stackoverflow.com/questions/44635626/pandas-aggregation-warning-futurewarning-using-a-dict-with-renaming-is-depreca\">deprecating the above dictionary approach</a>, so to future-proof the code I will also show you the method that will work in future versions of PANDAS. We need to do this <a href=\"https://github.com/pandas-dev/pandas/issues/18366#issuecomment-349090402\">workaround for 'first' and 'last' aggregation</a> but the end result is the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "652\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>variable</th>\n",
       "      <th>description</th>\n",
       "      <th>line_number</th>\n",
       "      <th>first_year</th>\n",
       "      <th>last_year</th>\n",
       "      <th>all_years</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AccountantCompileOrReview</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>[2010]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AccountantCompileOrReviewInd</td>\n",
       "      <td>Accountant provide compilation or review?</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>[2013, 2014, 2015]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AccountsPayableAccrExpnssGrp</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>[2013, 2014, 2015]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AccountsPayableAccruedExpenses</td>\n",
       "      <td>Accounts payable and accrued expenses</td>\n",
       "      <td>Part X Line 17</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>[2010]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AccountsReceivable</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>[2010]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>AccountsReceivableGrp</td>\n",
       "      <td>Accounts receivable, net</td>\n",
       "      <td>Part X Line 4</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>[2013, 2014, 2015]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AcctCompileOrReviewBasisGrp</td>\n",
       "      <td>Basis in which the financial statements were compiled or reviewed by independent accountant</td>\n",
       "      <td>Part XII Line 2a</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>[2013, 2014, 2015]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ActivitiesConductedPartnership</td>\n",
       "      <td>Activities conducted thru partnership?</td>\n",
       "      <td>Part IV Line 37</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>[2010]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ActivitiesConductedPrtshpInd</td>\n",
       "      <td>Activities conducted thru partnership?</td>\n",
       "      <td>Part IV Line 37</td>\n",
       "      <td>2013</td>\n",
       "      <td>2015</td>\n",
       "      <td>[2013, 2014, 2015]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Activity2</td>\n",
       "      <td>Second activity group</td>\n",
       "      <td>Part III Line 4b</td>\n",
       "      <td>2010</td>\n",
       "      <td>2010</td>\n",
       "      <td>[2010]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         variable  \\\n",
       "0       AccountantCompileOrReview   \n",
       "1    AccountantCompileOrReviewInd   \n",
       "2    AccountsPayableAccrExpnssGrp   \n",
       "3  AccountsPayableAccruedExpenses   \n",
       "4              AccountsReceivable   \n",
       "5           AccountsReceivableGrp   \n",
       "6     AcctCompileOrReviewBasisGrp   \n",
       "7  ActivitiesConductedPartnership   \n",
       "8    ActivitiesConductedPrtshpInd   \n",
       "9                       Activity2   \n",
       "\n",
       "                                                                                   description  \\\n",
       "0                                                    Accountant provide compilation or review?   \n",
       "1                                                    Accountant provide compilation or review?   \n",
       "2                                                        Accounts payable and accrued expenses   \n",
       "3                                                        Accounts payable and accrued expenses   \n",
       "4                                                                     Accounts receivable, net   \n",
       "5                                                                     Accounts receivable, net   \n",
       "6  Basis in which the financial statements were compiled or reviewed by independent accountant   \n",
       "7                                                       Activities conducted thru partnership?   \n",
       "8                                                       Activities conducted thru partnership?   \n",
       "9                                                                        Second activity group   \n",
       "\n",
       "        line_number first_year last_year           all_years  \n",
       "0  Part XII Line 2a       2010      2010              [2010]  \n",
       "1  Part XII Line 2a       2013      2015  [2013, 2014, 2015]  \n",
       "2    Part X Line 17       2013      2015  [2013, 2014, 2015]  \n",
       "3    Part X Line 17       2010      2010              [2010]  \n",
       "4     Part X Line 4       2010      2010              [2010]  \n",
       "5     Part X Line 4       2013      2015  [2013, 2014, 2015]  \n",
       "6  Part XII Line 2a       2013      2015  [2013, 2014, 2015]  \n",
       "7   Part IV Line 37       2010      2010              [2010]  \n",
       "8   Part IV Line 37       2013      2015  [2013, 2014, 2015]  \n",
       "9  Part III Line 4b       2010      2010              [2010]  "
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def agg_funcs(x):\n",
    "    names = {\n",
    "        'first_year': x['year'].head(1).values[0],\n",
    "        'last_year':  x['year'].tail(1).values[0],\n",
    "        #'all_years':  ', '.join(x['year']),\n",
    "        'all_years':  x['year'].tolist()}\n",
    "    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS\n",
    "    #return pd.Series(names, index = list(names.keys()))\n",
    "    return pd.Series(names, index=['first_year', 'last_year', 'all_years'])\n",
    "collapsed_df = variables_df.groupby(['variable', 'description', 'line_number']).apply(agg_funcs)\n",
    "collapsed_df = collapsed_df.reset_index()\n",
    "print len(collapsed_df)\n",
    "collapsed_df[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>Now we have two variations of our codebook. I would lean toward the latter, but you can take your pick and use whichever suits your particular research purposes. One thing to note above is that there appear to be two types of variables -- those that are used in 2010 and those that are found in the 2013-2015 versions. The complication here is that, if you are searching for, say, the data in *Part XII Line 2a*, you will encounter different variable names according to the year you are working. In effect, we have a sense of the enormity of the problem the *Nonprofit Open Data Collective* is confronting in developing a master variable concordance file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save Dataframes in PANDAS and Excel Formats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "variables_df.to_pickle('variable descriptions 2010-2015 990 e-file data.pkl')\n",
    "variables_df.to_excel('variable descriptions 2010-2015 990 e-file data.xls')\n",
    "collapsed_df.to_pickle('variable descriptions 2010-2015 990 e-file data (collapsed).pkl')\n",
    "collapsed_df.to_excel('variable descriptions 2010-2015 990 e-file data (collapsed).xls')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>OK, we now have our 990 data and our data codebook. In the next and final tutorial we will employ both to create a dataset with a sample of key 990 variables we are interested in."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "117px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}