Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kenorb/c914014b5ffb8115552b7e0320cd29a3 to your computer and use it in GitHub Desktop.
Save kenorb/c914014b5ffb8115552b7e0320cd29a3 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This is seventh in a series tutorials that illustrate how to download the <a href=\"https://aws.amazon.com/public-data-sets/irs-990/\">IRS 990 e-file data</a>. In a series of <a href=\"http://social-metrics.org/sox/\">previous tutorials</a> I have shown how to download data from IRS 990, 990PF, and 990EZ filers. In this tutorial we are going to generate a codebook or 'data dictionary' that can help us make sense of the variables we have downloaded."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import time\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from pandas import DataFrame\n",
"from pandas import Series"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#http://pandas.pydata.org/pandas-docs/stable/options.html\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('max_colwidth', 250)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The need for a codebook\n",
"\n",
"As you can tell from looking at the following sample filing, the data by themselves are difficult to interpret and difficult to match to the various sections of the paper 990 documents: https://s3.amazonaws.com/irs-form-990/201543109349200219_public.xml\n",
"\n",
"To get the variable descriptions you need to dig through the 'schemas', such as that here for 2015: https://www.irs.gov/e-file-providers/990-990-ez-990-pf-ty2015-v30-schema-business-rules-and-release-memo \n",
" \n",
"The IRS makes it difficult to find the <code>*.xsd</code> stylesheets that can be used to make sense of the data. You can visit here -- https://www.irs.gov/downloads/irs-schema/ -- sort the files by Name, and then look for the <code>efile990...</code> zip files. \n",
"\n",
"Alternatively, you can visit https://www.irs.gov/e-file-providers/current-valid-xml-schemas-and-business-rules-for-exempt-organizations-modernized-e-file -- where you can currently find links to .xsd files for 2014 and 2015. \n",
"\n",
"\n",
"For 2015 you would iterate through the links this way:\n",
"- https://www.irs.gov/e-file-providers/ty2015-current-valid-xml-schemas-and-business-rules-for-exempt-organizations-modernized-efile-mef then\n",
"- https://www.irs.gov/e-file-providers/990-990-ez-990-pf-ty2015-v30-schema-business-rules-and-release-memo then\n",
"- https://www.irs.gov/pub/irs-schema/efile990x_2015v3.0.zip\n",
"\n",
"\n",
"For 2014, the latter will take you here:\n",
"- https://www.irs.gov/e-file-providers/ty2014-current-valid-xml-schemas-and-business-rules-for-exempt-organizations-modernized-efile-mef\n",
"Then here:\n",
"- https://www.irs.gov/e-file-providers/990-990-ez-990-pf-ty2014-v60-schema-business-rules-and-release-memo\n",
"Then you will download the \"Schemas\" zip file:\n",
"- https://www.irs.gov/pub/irs-schema/efile990x_2014v6.0_09082015.zip\n",
"\n",
"\n",
"As a shorthand, here are the direct links to .xsd files for all available years:\n",
"\n",
"- 2015:\n",
" - https://www.irs.gov/pub/irs-schema/efile990x_2015v3.0.zip\n",
" - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>/efile990x_2015v3.0/2015v3.0/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp; \n",
" - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).\n",
"\n",
"\n",
"- 2014:\n",
" - https://www.irs.gov/pub/irs-schema/efile990x_2014v6.0_09082015.zip\n",
" - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>efile990x_2014v6.0_09082015/2014v6.0/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp;\n",
" - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).\n",
" \n",
"\n",
"\n",
"- 2013:\n",
" - https://www.irs.gov/pub/irs-schema/efile990x_2013v4.0.zip\n",
" - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>efile990x_2013v4.0/2013v4.0/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp;\n",
" - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules). \n",
"\n",
"\n",
"- 2010:\n",
" - https://www.irs.gov/pub/irs-schema/efile990x_2010v3.7.zip \n",
" - After unzipping, <code>IRS990.xsd</code> will be found here: &nbsp; <code>/efile990x_2010v3.7/2010v3.7/TEGE/TEGE990/IRS990/IRS990.xsd</code> &nbsp; \n",
" - Under &nbsp; <code>/TEGE/</code> &nbsp; there are folders for &nbsp;<code>/TEGE990PF/</code> &nbsp; and &nbsp;<code>/TEGE990EZ/</code> &nbsp; and &nbsp;<code>/Common/</code> &nbsp; (which contains &nbsp;<code>*.xsd</code> &nbsp; files for the schedules).\n",
" \n",
" \n",
" \n",
" \n",
"You should also be aware of the *Nonprofit Open Data Collective*. That project is a more comprehensive effort, particularly their <a href=\"https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file/blob/master/efiler_master_concordance.csv\">master concordance file</a>. My tutorial is mostly intended to highlight the nature of the data and the schemas and serve as an alternative Python version."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loop over XSD files, generate relevant dictionary with variable details, and append to PANDAS dataframe\n",
"Now we will process the .xsd files. I am assuming you have downloaded the four above *IRS900.xsd* files, put them in a folder called */xsd files/*, and renamed them *IRS990 (2010).xsd*, *IRS990 (2013).xsd*, *IRS990 (2014).xsd*, and *IRS990 (2015).xsd*.\n",
"\n",
"In the following code blocks we will set our working directory and then loop over each of the four xsd files. For each one, we are going to move down to where are the 'elements' are contained. Each element is variable. What we are interested in is the &nbsp; <code>name</code> &nbsp; attribute, the &nbsp; <code>Description</code>, &nbsp; which tells us what the variable means, and the &nbsp; <code>LineNumber</code>, &nbsp; which provides a location for the variable within the 990 form. Together, we have enough information to match the columns in our 990 data to the specific variables we need. Here is an an example of one 'element' in the 2014 xsd file:\n",
"\n",
"\t\t\t<!-- Total Number Volunteers -->\n",
"\t\t\t<xsd:element name=\"TotalVolunteersCnt\" type=\"IntegerNNType\" minOccurs=\"0\">\n",
"\t\t\t\t<xsd:annotation>\n",
"\t\t\t\t\t<xsd:documentation>\n",
"\t\t\t\t\t\t<Description>Total number volunteers</Description>\n",
"\t\t\t\t\t\t<LineNumber>Part I Line 6</LineNumber>\n",
"\t\t\t\t\t</xsd:documentation>\n",
"\t\t\t\t</xsd:annotation>\n",
"\t\t\t</xsd:element>\n",
"\n",
"Our code below searches over each element and assigns the Description to a variable *description*, the LineNumber to a variable *line_number*, and the year to a variable *year*, and assigns all three variables to a dictionary called *variables*, with the major keys being the variable names. After processing each xsd file, the data are added to a PANDAS dataframe called *variables_df.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cd '/Users/gsaxton/Dropbox/990 e-file data/xsd files/'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of rows in dataframe: 1314 \n",
"\n",
"# of rows per year: 2013 331\n",
"2014 331\n",
"2015 331\n",
"2010 321\n",
"Name: year, dtype: int64 \n",
"\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>description</th>\n",
" <th>line_number</th>\n",
" <th>year</th>\n",
" </tr>\n",
" <tr>\n",
" <th>variable</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AccountantCompileOrReview</th>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AccountsPayableAccruedExpenses</th>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AccountsReceivable</th>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ActivitiesConductedPartnership</th>\n",
" <td>Activities conducted thru partnership?</td>\n",
" <td>Part IV Line 37</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Activity2</th>\n",
" <td>Second activity group</td>\n",
" <td>Part III Line 4b</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" description \\\n",
"variable \n",
"AccountantCompileOrReview Accountant provide compilation or review? \n",
"AccountsPayableAccruedExpenses Accounts payable and accrued expenses \n",
"AccountsReceivable Accounts receivable, net \n",
"ActivitiesConductedPartnership Activities conducted thru partnership? \n",
"Activity2 Second activity group \n",
"\n",
" line_number year \n",
"variable \n",
"AccountantCompileOrReview Part XII Line 2a 2010 \n",
"AccountsPayableAccruedExpenses Part X Line 17 2010 \n",
"AccountsReceivable Part X Line 4 2010 \n",
"ActivitiesConductedPartnership Part IV Line 37 2010 \n",
"Activity2 Part III Line 4b 2010 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"variables_df = pd.DataFrame()\n",
"import xml.etree.ElementTree as ET\n",
"for year in ['2010', '2013', '2014', '2015']:\n",
" variables = {}\n",
" tree = ET.parse('IRS990 (%s).xsd' % year)\n",
" root = tree.getroot()\n",
" #print root[3].tag, root[3].attrib\n",
" \n",
" for r in root[3][1].findall(\"{http://www.w3.org/2001/XMLSchema}element\"):\n",
" name = r.attrib['name']\n",
" attrs = r.find('{http://www.w3.org/2001/XMLSchema}annotation/{http://www.w3.org/2001/XMLSchema}documentation')\n",
" description = attrs[0]\n",
" if len(attrs)>1:\n",
" line_number = attrs[1].text\n",
" else:\n",
" line_number = ''\n",
" #print name, description.text, line_number\n",
" \n",
" #if name not in variables:\n",
" # variables[name] == name\n",
" variables.update({name: {'description': description.text, 'line_number': line_number, 'year': year}})\n",
" \n",
" variables_df = variables_df.append(DataFrame(variables).T) #, ignore_index = True)\n",
" variables_df.index.name = 'variable'\n",
"\n",
"print \"# of rows in dataframe:\", len(variables_df), '\\n'\n",
"print \"# of rows per year:\", variables_df['year'].value_counts(), '\\n'\n",
"variables_df[:5] "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We see there are 1,314 rows in the database. Above you can see the first five rows and below the last five rows."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>description</th>\n",
" <th>line_number</th>\n",
" <th>year</th>\n",
" </tr>\n",
" <tr>\n",
" <th>variable</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>VotingMembersGoverningBodyCnt</th>\n",
" <td>Number voting members governing body</td>\n",
" <td>Part I Line 3</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>VotingMembersIndependentCnt</th>\n",
" <td>Number independent voting members</td>\n",
" <td>Part I Line 4</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WebsiteAddressTxt</th>\n",
" <td>Web site</td>\n",
" <td>J</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WhistleblowerPolicyInd</th>\n",
" <td>Whistleblower policy?</td>\n",
" <td>Part VI Section B Line 13</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>WrittenPolicyOrProcedureInd</th>\n",
" <td>Written policy or procedure?</td>\n",
" <td>Part VI Section B Line 16b</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" description \\\n",
"variable \n",
"VotingMembersGoverningBodyCnt Number voting members governing body \n",
"VotingMembersIndependentCnt Number independent voting members \n",
"WebsiteAddressTxt Web site \n",
"WhistleblowerPolicyInd Whistleblower policy? \n",
"WrittenPolicyOrProcedureInd Written policy or procedure? \n",
"\n",
" line_number year \n",
"variable \n",
"VotingMembersGoverningBodyCnt Part I Line 3 2015 \n",
"VotingMembersIndependentCnt Part I Line 4 2015 \n",
"WebsiteAddressTxt J 2015 \n",
"WhistleblowerPolicyInd Part VI Section B Line 13 2015 \n",
"WrittenPolicyOrProcedureInd Part VI Section B Line 16b 2015 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"variables_df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>We're not quite done, however. To see why, let's reset the index and then sort the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>variable</th>\n",
" <th>description</th>\n",
" <th>line_number</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AccountantCompileOrReview</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AccountsPayableAccruedExpenses</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AccountsReceivable</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ActivitiesConductedPartnership</td>\n",
" <td>Activities conducted thru partnership?</td>\n",
" <td>Part IV Line 37</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Activity2</td>\n",
" <td>Second activity group</td>\n",
" <td>Part III Line 4b</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" variable description \\\n",
"0 AccountantCompileOrReview Accountant provide compilation or review? \n",
"1 AccountsPayableAccruedExpenses Accounts payable and accrued expenses \n",
"2 AccountsReceivable Accounts receivable, net \n",
"3 ActivitiesConductedPartnership Activities conducted thru partnership? \n",
"4 Activity2 Second activity group \n",
"\n",
" line_number year \n",
"0 Part XII Line 2a 2010 \n",
"1 Part X Line 17 2010 \n",
"2 Part X Line 4 2010 \n",
"3 Part IV Line 37 2010 \n",
"4 Part III Line 4b 2010 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"variables_df = variables_df.reset_index()\n",
"variables_df[:5]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>variable</th>\n",
" <th>description</th>\n",
" <th>line_number</th>\n",
" <th>year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AccountantCompileOrReview</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>321</th>\n",
" <td>AccountantCompileOrReviewInd</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2013</td>\n",
" </tr>\n",
" <tr>\n",
" <th>652</th>\n",
" <td>AccountantCompileOrReviewInd</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>983</th>\n",
" <td>AccountantCompileOrReviewInd</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>322</th>\n",
" <td>AccountsPayableAccrExpnssGrp</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2013</td>\n",
" </tr>\n",
" <tr>\n",
" <th>653</th>\n",
" <td>AccountsPayableAccrExpnssGrp</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>984</th>\n",
" <td>AccountsPayableAccrExpnssGrp</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AccountsPayableAccruedExpenses</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AccountsReceivable</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>323</th>\n",
" <td>AccountsReceivableGrp</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2013</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" variable \\\n",
"0 AccountantCompileOrReview \n",
"321 AccountantCompileOrReviewInd \n",
"652 AccountantCompileOrReviewInd \n",
"983 AccountantCompileOrReviewInd \n",
"322 AccountsPayableAccrExpnssGrp \n",
"653 AccountsPayableAccrExpnssGrp \n",
"984 AccountsPayableAccrExpnssGrp \n",
"1 AccountsPayableAccruedExpenses \n",
"2 AccountsReceivable \n",
"323 AccountsReceivableGrp \n",
"\n",
" description line_number year \n",
"0 Accountant provide compilation or review? Part XII Line 2a 2010 \n",
"321 Accountant provide compilation or review? Part XII Line 2a 2013 \n",
"652 Accountant provide compilation or review? Part XII Line 2a 2014 \n",
"983 Accountant provide compilation or review? Part XII Line 2a 2015 \n",
"322 Accounts payable and accrued expenses Part X Line 17 2013 \n",
"653 Accounts payable and accrued expenses Part X Line 17 2014 \n",
"984 Accounts payable and accrued expenses Part X Line 17 2015 \n",
"1 Accounts payable and accrued expenses Part X Line 17 2010 \n",
"2 Accounts receivable, net Part X Line 4 2010 \n",
"323 Accounts receivable, net Part X Line 4 2013 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"variables_df.sort_values(by=['variable', 'year'], ascending=[1,1])[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>As you can see above, each variable name occurs in more than one row. This is good, because we don't want the IRS to change the variable name every year. A dataframe in this format is also helpful if you are searching by year and then variable name. Yet it becomes problematic if you are searching by variable name. \n",
"\n",
"So, what we will do is 'collapse' the above dataframe. Specifically, we will search for each row that has an indentical array of values on the first three columns -- *variable*, *description*, and *line_number*. We will then create new columns for the first year the variable-description-line_number array appears, the last year it appears, and all years it appears. \n",
"\n",
"This is a task PANDAS excels at. It is called a *groupby* operation. I will show you two ways of doing it. The first is with an aggregation *dictionary*. "
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"652\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>variable</th>\n",
" <th>description</th>\n",
" <th>line_number</th>\n",
" <th>first_year</th>\n",
" <th>last_year</th>\n",
" <th>all_years</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AccountantCompileOrReview</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AccountantCompileOrReviewInd</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>2013, 2014, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AccountsPayableAccrExpnssGrp</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>2013, 2014, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AccountsPayableAccruedExpenses</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>AccountsReceivable</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>AccountsReceivableGrp</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>2013, 2014, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>AcctCompileOrReviewBasisGrp</td>\n",
" <td>Basis in which the financial statements were compiled or reviewed by independent accountant</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>2013, 2014, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>ActivitiesConductedPartnership</td>\n",
" <td>Activities conducted thru partnership?</td>\n",
" <td>Part IV Line 37</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>ActivitiesConductedPrtshpInd</td>\n",
" <td>Activities conducted thru partnership?</td>\n",
" <td>Part IV Line 37</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>2013, 2014, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Activity2</td>\n",
" <td>Second activity group</td>\n",
" <td>Part III Line 4b</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" variable \\\n",
"0 AccountantCompileOrReview \n",
"1 AccountantCompileOrReviewInd \n",
"2 AccountsPayableAccrExpnssGrp \n",
"3 AccountsPayableAccruedExpenses \n",
"4 AccountsReceivable \n",
"5 AccountsReceivableGrp \n",
"6 AcctCompileOrReviewBasisGrp \n",
"7 ActivitiesConductedPartnership \n",
"8 ActivitiesConductedPrtshpInd \n",
"9 Activity2 \n",
"\n",
" description \\\n",
"0 Accountant provide compilation or review? \n",
"1 Accountant provide compilation or review? \n",
"2 Accounts payable and accrued expenses \n",
"3 Accounts payable and accrued expenses \n",
"4 Accounts receivable, net \n",
"5 Accounts receivable, net \n",
"6 Basis in which the financial statements were compiled or reviewed by independent accountant \n",
"7 Activities conducted thru partnership? \n",
"8 Activities conducted thru partnership? \n",
"9 Second activity group \n",
"\n",
" line_number first_year last_year all_years \n",
"0 Part XII Line 2a 2010 2010 2010 \n",
"1 Part XII Line 2a 2013 2015 2013, 2014, 2015 \n",
"2 Part X Line 17 2013 2015 2013, 2014, 2015 \n",
"3 Part X Line 17 2010 2010 2010 \n",
"4 Part X Line 4 2010 2010 2010 \n",
"5 Part X Line 4 2013 2015 2013, 2014, 2015 \n",
"6 Part XII Line 2a 2013 2015 2013, 2014, 2015 \n",
"7 Part IV Line 37 2010 2010 2010 \n",
"8 Part IV Line 37 2013 2015 2013, 2014, 2015 \n",
"9 Part III Line 4b 2010 2010 2010 "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# CREATE DICTIONARY OF AGGREGATION OPERATIONS\n",
"aggregations = {'year': {'first_year': 'first', 'last_year': 'last', 'all_years': lambda col: ', '.join(col)}}\n",
"collapsed_df = variables_df.groupby(['variable', 'description', 'line_number']).agg(aggregations)\n",
"#Drop the outermost level from the hierarchical column index:\n",
"#http://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function\n",
"collapsed_df.columns = collapsed_df.columns.droplevel(0)\n",
"collapsed_df = collapsed_df.reset_index()\n",
"print len(collapsed_df)\n",
"collapsed_df[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PANDAS will be <a href=\"https://stackoverflow.com/questions/44635626/pandas-aggregation-warning-futurewarning-using-a-dict-with-renaming-is-depreca\">deprecating the above dictionary approach</a>, so to future-proof the code I will also show you the method that will work in future versions of PANDAS. We need to do this <a href=\"https://github.com/pandas-dev/pandas/issues/18366#issuecomment-349090402\">workaround for 'first' and 'last' aggregation</a> but the end result is the same."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"652\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>variable</th>\n",
" <th>description</th>\n",
" <th>line_number</th>\n",
" <th>first_year</th>\n",
" <th>last_year</th>\n",
" <th>all_years</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AccountantCompileOrReview</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>[2010]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AccountantCompileOrReviewInd</td>\n",
" <td>Accountant provide compilation or review?</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>[2013, 2014, 2015]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AccountsPayableAccrExpnssGrp</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>[2013, 2014, 2015]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>AccountsPayableAccruedExpenses</td>\n",
" <td>Accounts payable and accrued expenses</td>\n",
" <td>Part X Line 17</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>[2010]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>AccountsReceivable</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>[2010]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>AccountsReceivableGrp</td>\n",
" <td>Accounts receivable, net</td>\n",
" <td>Part X Line 4</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>[2013, 2014, 2015]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>AcctCompileOrReviewBasisGrp</td>\n",
" <td>Basis in which the financial statements were compiled or reviewed by independent accountant</td>\n",
" <td>Part XII Line 2a</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>[2013, 2014, 2015]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>ActivitiesConductedPartnership</td>\n",
" <td>Activities conducted thru partnership?</td>\n",
" <td>Part IV Line 37</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>[2010]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>ActivitiesConductedPrtshpInd</td>\n",
" <td>Activities conducted thru partnership?</td>\n",
" <td>Part IV Line 37</td>\n",
" <td>2013</td>\n",
" <td>2015</td>\n",
" <td>[2013, 2014, 2015]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Activity2</td>\n",
" <td>Second activity group</td>\n",
" <td>Part III Line 4b</td>\n",
" <td>2010</td>\n",
" <td>2010</td>\n",
" <td>[2010]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" variable \\\n",
"0 AccountantCompileOrReview \n",
"1 AccountantCompileOrReviewInd \n",
"2 AccountsPayableAccrExpnssGrp \n",
"3 AccountsPayableAccruedExpenses \n",
"4 AccountsReceivable \n",
"5 AccountsReceivableGrp \n",
"6 AcctCompileOrReviewBasisGrp \n",
"7 ActivitiesConductedPartnership \n",
"8 ActivitiesConductedPrtshpInd \n",
"9 Activity2 \n",
"\n",
" description \\\n",
"0 Accountant provide compilation or review? \n",
"1 Accountant provide compilation or review? \n",
"2 Accounts payable and accrued expenses \n",
"3 Accounts payable and accrued expenses \n",
"4 Accounts receivable, net \n",
"5 Accounts receivable, net \n",
"6 Basis in which the financial statements were compiled or reviewed by independent accountant \n",
"7 Activities conducted thru partnership? \n",
"8 Activities conducted thru partnership? \n",
"9 Second activity group \n",
"\n",
" line_number first_year last_year all_years \n",
"0 Part XII Line 2a 2010 2010 [2010] \n",
"1 Part XII Line 2a 2013 2015 [2013, 2014, 2015] \n",
"2 Part X Line 17 2013 2015 [2013, 2014, 2015] \n",
"3 Part X Line 17 2010 2010 [2010] \n",
"4 Part X Line 4 2010 2010 [2010] \n",
"5 Part X Line 4 2013 2015 [2013, 2014, 2015] \n",
"6 Part XII Line 2a 2013 2015 [2013, 2014, 2015] \n",
"7 Part IV Line 37 2010 2010 [2010] \n",
"8 Part IV Line 37 2013 2015 [2013, 2014, 2015] \n",
"9 Part III Line 4b 2010 2010 [2010] "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def agg_funcs(x):\n",
" names = {\n",
" 'first_year': x['year'].head(1).values[0],\n",
" 'last_year': x['year'].tail(1).values[0],\n",
" #'all_years': ', '.join(x['year']),\n",
" 'all_years': x['year'].tolist()}\n",
" #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS\n",
" #return pd.Series(names, index = list(names.keys()))\n",
" return pd.Series(names, index=['first_year', 'last_year', 'all_years'])\n",
"collapsed_df = variables_df.groupby(['variable', 'description', 'line_number']).apply(agg_funcs)\n",
"collapsed_df = collapsed_df.reset_index()\n",
"print len(collapsed_df)\n",
"collapsed_df[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Now we have two variations of our codebook. I would lean toward the latter, but you can take your pick and use whichever suits your particular research purposes. One thing to note above is that there appear to be two types of variables -- those that are used in 2010 and those that are found in the 2013-2015 versions. The complication here is that, if you are searching for, say, the data in *Part XII Line 2a*, you will encounter different variable names according to the year you are working. In effect, we have a sense of the enormity of the problem the *Nonprofit Open Data Collective* is confronting in developing a master variable concordance file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save Dataframes in PANDAS and Excel Formats"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"variables_df.to_pickle('variable descriptions 2010-2015 990 e-file data.pkl')\n",
"variables_df.to_excel('variable descriptions 2010-2015 990 e-file data.xls')\n",
"collapsed_df.to_pickle('variable descriptions 2010-2015 990 e-file data (collapsed).pkl')\n",
"collapsed_df.to_excel('variable descriptions 2010-2015 990 e-file data (collapsed).xls')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>OK, we now have our 990 data and our data codebook. In the next and final tutorial we will employ both to create a dataset with a sample of key 990 variables we are interested in."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"navigate_num": "#000000",
"navigate_text": "#333333",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700",
"sidebar_border": "#EEEEEE",
"wrapper_background": "#FFFFFF"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "117px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 4,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment