Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kenorb/e65aeccf68afd9ae8729feff54c1b967 to your computer and use it in GitHub Desktop.
Save kenorb/e65aeccf68afd9ae8729feff54c1b967 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview\n",
"\n",
"This is the third in a series of tutorials that illustrate how to download the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/\n",
"\n",
"In this third notebook we will load the MongoDB database we created in <a href=\"http://social-metrics.org/irs-990-e-file-data-part-2/\">part 2</a> and bring them into Python PANDAS.\n",
"\n",
"By now you may be wondering why we are dealing with different data formats. In <a href=\"http://social-metrics.org/irs-990-e-file-data-part-1/\">part 1</a> we initially had our data in a Python *dictionary.* In part 2 we then inserted the data into a MongoDB table, and in this tutorial we insert the data into a PANDAS dataframe. There are good reasons for these steps. Namely, in part 1, having the data in a dictionary format facilitated moving the data into MongoDB. I spent some time describing the dictionary given how useful *lists* and *dictionaries* are in Python. \n",
"\n",
"Then in part 2 we inserted the data into MongoDB. We didn't need to do this, but MongoDB is tailor-made for JSON data. Whenever I'm dealing with messy JSON data -- such as this e-file data or data returned from the Twitter API -- I now like to throw the data directly into a MongoDB database. I would highly recommend you have an understanding of MongoDB in your data science toolkit. For a more thorough discussion of the trade-offs between MongoDB and relational databases such as SQL, see <a href=\"http://social-metrics.org/sqlite-vs-mongodb/\">this blog post</a>. \n",
"\n",
"There is another reason for introducing MongoDB in these tutorials: When we come to downloading the actual 990 files we will make good use of MongoDB. That is coming in future tutorial in the series.\n",
"\n",
"The final format used here will be PANDAS. PANDAS absolutely excels at data wrangling. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load Packages and Set Working Directory\n",
"Import several necessary Python packages. We will be using the <a href=\"http://pandas.pydata.org/\">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import sys\n",
"import time\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import DataFrame\n",
"from pandas import Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#http://pandas.pydata.org/pandas-docs/stable/options.html\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('max_colwidth', 250)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Set working directory"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/gsaxton/Dropbox/990 e-file data\n"
]
}
],
"source": [
"cd '/Users/gsaxton/Dropbox/990 e-file data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MongoDB\n",
"Depending on the project, I will store the data in SQLite or MongoDB. This time we're using MongoDB -- it's great for storing JSON data where each observation could have different variables. Before we get to the interesting part the following code blocks set up the MongoDB environment and the new database we'll be using. \n",
"\n",
"**_Note:_** In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pymongo\n",
"from pymongo import MongoClient\n",
"client = MongoClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"Here we're connecting to the database we created in the previous tutorial"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# DEFINE MY mongoDB DATABASE\n",
"db = client['irs_990_db']\n",
"\n",
"# DEFINE MY COLLECTION WHERE I'LL INSERT MY SEARCH \n",
"file_list = db['990_files_2011_2018']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Check how many observations in the database table."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2373310"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_list.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Read MongoDB database into PANDAS DF\n",
"Importing the filings into PANDAS is a simple one-liner. We now have a dataframe called *df* that contains all 2,373,310 filings."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of columns: 10\n",
"# of observations: 2373310\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>DLN</th>\n",
" <th>EIN</th>\n",
" <th>FormType</th>\n",
" <th>LastUpdated</th>\n",
" <th>ObjectId</th>\n",
" <th>OrganizationName</th>\n",
" <th>SubmittedOn</th>\n",
" <th>TaxPeriod</th>\n",
" <th>URL</th>\n",
" <th>_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>93493316003251</td>\n",
" <td>591971002</td>\n",
" <td>990</td>\n",
" <td>2016-03-21T17:23:53</td>\n",
" <td>201103169349300325</td>\n",
" <td>ANGELUS INC</td>\n",
" <td>2011-11-30</td>\n",
" <td>201009</td>\n",
" <td>https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml</td>\n",
" <td>5ad8e83235fd3fae98d5e53b</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" DLN EIN FormType LastUpdated \\\n",
"0 93493316003251 591971002 990 2016-03-21T17:23:53 \n",
"\n",
" ObjectId OrganizationName SubmittedOn TaxPeriod \\\n",
"0 201103169349300325 ANGELUS INC 2011-11-30 201009 \n",
"\n",
" URL \\\n",
"0 https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml \n",
"\n",
" _id \n",
"0 5ad8e83235fd3fae98d5e53b "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(list(file_list.find()))\n",
"print '# of columns:', len(df.columns)\n",
"print '# of observations:', len(df)\n",
"df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Getting the frequency counts is easier than in MongoDB"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"990 1338946\n",
"990EZ 715541\n",
"990PF 318823\n",
"Name: FormType, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['FormType'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Let's also get the top 5 frequencies for the variable *TaxPeriod*"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"201612 260809\n",
"201512 250605\n",
"201412 230967\n",
"201312 204436\n",
"201212 181772\n",
"Name: TaxPeriod, dtype: int64"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['TaxPeriod'].value_counts()[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Here's a simplistic example of how PANDAS is useful for datawrangling. Let's transform the above variable and generate a new one called *Year* so we can see the increase in e-filings over time. "
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2016 425971\n",
"2015 408156\n",
"2014 375704\n",
"2013 336044\n",
"2012 299310\n",
"2011 245447\n",
"2010 176064\n",
"2017 106614\n",
"Name: Year, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Year'] = df['TaxPeriod'].str[:4] #CREATE A VARIABLE BASED ON FIRST 4 CHARACTERS OF TaxPeriod\n",
"df['Year'] = df['Year'].astype('int') #MAKE THE VARIABLE INTEGER FORMAT INSTEAD OF STRING\n",
"df['Year'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot yearly frequencies\n",
"Besides data manipulation, PANDAS is also great for visualization. Here's a simple bar graph of the number of filings per fiscal year. "
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline "
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"plt.rcParams['figure.figsize'] = (10, 5)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x14afb6350>"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAm0AAAFCCAYAAACjGerAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHzZJREFUeJzt3X+w3fVd5/HnS1KRsUIJRBYTanBIdQG3dIkpO9XZKiuJ\n0hU6AzSdtWRns+AMVKvjjAuOs1QwDszYorhbdnDJEFhbiNQOGVtkI7Truis/AmIptJSshIVMgEhS\nsK5Qk773j/MJPbn3JPcmueTkc+/zMXPmfs/7+/187+e85yZ55fs9n3tSVUiSJOnI9l3jnoAkSZKm\nZmiTJEnqgKFNkiSpA4Y2SZKkDhjaJEmSOmBokyRJ6oChTZIkqQOGNkmSpA4Y2iRJkjpgaJMkSerA\nvHFPYKadeOKJtXjx4nFPQ5IkaUqPPvro31bVgukcO+tC2+LFi9m0adO4pyFJkjSlJM9N91hvj0qS\nJHXA0CZJktQBQ5skSVIHDG2SJEkdMLRJkiR1wNAmSZLUAUObJElSBwxtkiRJHTC0SZIkdcDQJkmS\n1IFZ9zFWkiTNBouv+vy4p/CmLdefP+4pCK+0SZIkdcHQJkmS1AFDmyRJUgcMbZIkSR0wtEmSJHXA\n0CZJktQBQ5skSVIHDG2SJEkdMLRJkiR1wNAmSZLUAT/GSpI0dn5kkzS1aV9pS3JUkr9K8ift+fwk\nG5M8074eP3Ts1Uk2J3k6yfKh+tlJnmj7bkqSVj86yV2t/lCSxUNjVrXv8UySVTPxoiVJknpzILdH\nPwZ8dej5VcD9VbUEuL89J8npwErgDGAF8KkkR7UxNwOXAUvaY0WrrwZ2VtVpwI3ADe1c84FrgPcC\ny4BrhsOhJEnSXDGt0JZkEXA+8F+HyhcA69r2OuDCofqdVfVGVT0LbAaWJTkZOLaqHqyqAm6fMGbP\nue4Gzm1X4ZYDG6tqR1XtBDbynaAnSZI0Z0z3StvvAr8GfHuodlJVbWvbLwInte2FwPNDx73Qagvb\n9sT6XmOqahfwKnDCfs4lSZI0p0wZ2pJ8AHi5qh7d1zHtylnN5MQORJLLk2xKsmn79u3jmoYkSdJb\nZjpX2t4H/FySLcCdwE8l+W/AS+2WJ+3ry+34rcApQ+MXtdrWtj2xvteYJPOA44BX9nOuvVTVLVW1\ntKqWLliwYBovSZIkqS9ThraqurqqFlXVYgYLDB6oqp8HNgB7VnOuAu5p2xuAlW1F6KkMFhw83G6l\nvpbknPZ+tUsnjNlzrova9yjgPuC8JMe3BQjntZokSdKccii/p+16YH2S1cBzwCUAVfVkkvXAU8Au\n4Mqq2t3GXAHcBhwD3NseALcCdyTZDOxgEA6pqh1JrgMeacddW1U7DmHOkiRJXTqg0FZVXwK+1LZf\nAc7dx3FrgDUj6puAM0fUXwcu3se51gJrD2SekiRJs40fYyVJktQBQ5skSVIHDG2SJEkdMLRJkiR1\nwNAmSZLUAUObJElSBwxtkiRJHTC0SZIkdcDQJkmS1AFDmyRJUgcO5bNHJUkHaPFVnx/3FN605frz\nxz0FSQfAK22SJEkdMLRJkiR1wNAmSZLUAUObJElSBwxtkiRJHTC0SZIkdcDQJkmS1AFDmyRJUgcM\nbZIkSR0wtEmSJHXA0CZJktSBKUNbku9J8nCSv07yZJLfbPWPJ9ma5PH2+NmhMVcn2Zzk6STLh+pn\nJ3mi7bspSVr96CR3tfpDSRYPjVmV5Jn2WDWTL16SJKkX0/nA+DeAn6qqbyZ5G/AXSe5t+26sqt8Z\nPjjJ6cBK4AzgB4A/S/KuqtoN3AxcBjwEfAFYAdwLrAZ2VtVpSVYCNwAfSjIfuAZYChTwaJINVbXz\n0F62JElSX6a80lYD32xP39YetZ8hFwB3VtUbVfUssBlYluRk4NiqerCqCrgduHBozLq2fTdwbrsK\ntxzYWFU7WlDbyCDoSZIkzSnTek9bkqOSPA68zCBEPdR2/WKSLydZm+T4VlsIPD80/IVWW9i2J9b3\nGlNVu4BXgRP2cy5JkqQ5ZVqhrap2V9VZwCIGV83OZHCr84eAs4BtwCfesllOIcnlSTYl2bR9+/Zx\nTUOSJOktc0CrR6vqG8AXgRVV9VILc98G/gBY1g7bCpwyNGxRq21t2xPre41JMg84DnhlP+eaOK9b\nqmppVS1dsGDBgbwkSZKkLkxn9eiCJO9o28cAPw18rb1HbY8PAl9p2xuAlW1F6KnAEuDhqtoGvJbk\nnPZ+tUuBe4bG7FkZehHwQHvf233AeUmOb7dfz2s1SZKkOWU6q0dPBtYlOYpByFtfVX+S5I4kZzFY\nlLAF+AWAqnoyyXrgKWAXcGVbOQpwBXAbcAyDVaN7VqHeCtyRZDOwg8HqU6pqR5LrgEfacddW1Y5D\neL2SJEldmjK0VdWXgfeMqH9kP2PWAGtG1DcBZ46ovw5cvI9zrQXWTjVPSZKk2cxPRJAkSeqAoU2S\nJKkDhjZJkqQOTGchgiQdsMVXfX7cU3jTluvPH/cUJOmQeaVNkiSpA4Y2SZKkDhjaJEmSOmBokyRJ\n6oChTZIkqQOGNkmSpA4Y2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNokSZI6YGiTJEnqgKFNkiSp\nA4Y2SZKkDhjaJEmSOmBokyRJ6oChTZIkqQNThrYk35Pk4SR/neTJJL/Z6vOTbEzyTPt6/NCYq5Ns\nTvJ0kuVD9bOTPNH23ZQkrX50krta/aEki4fGrGrf45kkq2byxUuSJPViOlfa3gB+qqreDZwFrEhy\nDnAVcH9VLQHub89JcjqwEjgDWAF8KslR7Vw3A5cBS9pjRauvBnZW1WnAjcAN7VzzgWuA9wLLgGuG\nw6EkSdJcMWVoq4Fvtqdva48CLgDWtfo64MK2fQFwZ1W9UVXPApuBZUlOBo6tqgerqoDbJ4zZc667\ngXPbVbjlwMaq2lFVO4GNfCfoSZIkzRnTek9bkqOSPA68zCBEPQScVFXb2iEvAie17YXA80PDX2i1\nhW17Yn2vMVW1C3gVOGE/55IkSZpT5k3noKraDZyV5B3A55KcOWF/Jam3YoLTkeRy4HKAd77zneOa\nhuaoxVd9ftxTeNOW688f9xQkSW+RA1o9WlXfAL7I4BblS+2WJ+3ry+2wrcApQ8MWtdrWtj2xvteY\nJPOA44BX9nOuifO6paqWVtXSBQsWHMhLkiRJ6sJ0Vo8uaFfYSHIM8NPA14ANwJ7VnKuAe9r2BmBl\nWxF6KoMFBw+3W6mvJTmnvV/t0glj9pzrIuCB9r63+4DzkhzfFiCc12qSJElzynRuj54MrGsrQL8L\nWF9Vf5LkL4H1SVYDzwGXAFTVk0nWA08Bu4Ar2+1VgCuA24BjgHvbA+BW4I4km4EdDFafUlU7klwH\nPNKOu7aqdhzKC5YkSerRlKGtqr4MvGdE/RXg3H2MWQOsGVHfBJw5ov46cPE+zrUWWDvVPCVJkmYz\nPxFBkiSpA4Y2SZKkDhjaJEmSOmBokyRJ6oChTZIkqQOGNkmSpA4Y2iRJkjpgaJMkSeqAoU2SJKkD\nhjZJkqQOGNokSZI6YGiTJEnqgKFNkiSpA4Y2SZKkDhjaJEmSOmBokyRJ6oChTZIkqQOGNkmSpA4Y\n2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOTBnakpyS5ItJnkryZJKPtfrHk2xN8nh7/OzQmKuTbE7y\ndJLlQ/WzkzzR9t2UJK1+dJK7Wv2hJIuHxqxK8kx7rJrJFy9JktSLedM4Zhfwq1X1WJLvAx5NsrHt\nu7Gqfmf44CSnAyuBM4AfAP4sybuqajdwM3AZ8BDwBWAFcC+wGthZVaclWQncAHwoyXzgGmApUO17\nb6iqnYf2siVJkvoy5ZW2qtpWVY+17b8Dvgos3M+QC4A7q+qNqnoW2AwsS3IycGxVPVhVBdwOXDg0\nZl3bvhs4t12FWw5srKodLahtZBD0JEmS5pQDek9bu235HgZXygB+McmXk6xNcnyrLQSeHxr2Qqst\nbNsT63uNqapdwKvACfs518R5XZ5kU5JN27dvP5CXJEmS1IVph7Ykbwc+C/xyVb3G4FbnDwFnAduA\nT7wlM5yGqrqlqpZW1dIFCxaMaxqSJElvmWmFtiRvYxDY/rCq/higql6qqt1V9W3gD4Bl7fCtwClD\nwxe12ta2PbG+15gk84DjgFf2cy5JkqQ5ZcqFCO29ZbcCX62qTw7VT66qbe3pB4GvtO0NwKeTfJLB\nQoQlwMNVtTvJa0nOYXB79VLg94fGrAL+ErgIeKCqKsl9wG8P3Xo9D7j64F+uDtXiqz4/7im8acv1\n5497CpIkHTbTWT36PuAjwBNJHm+1Xwc+nOQsBqs6twC/AFBVTyZZDzzFYOXplW3lKMAVwG3AMQxW\njd7b6rcCdyTZDOxgsPqUqtqR5DrgkXbctVW14+BeqiRJUr+mDG1V9RdARuz6wn7GrAHWjKhvAs4c\nUX8duHgf51oLrJ1qnpIkSbOZn4ggSZLUAUObJElSBwxtkiRJHTC0SZIkdcDQJkmS1AFDmyRJUgcM\nbZIkSR0wtEmSJHXA0CZJktQBQ5skSVIHDG2SJEkdMLRJkiR1wNAmSZLUAUObJElSBwxtkiRJHTC0\nSZIkdcDQJkmS1AFDmyRJUgcMbZIkSR0wtEmSJHXA0CZJktSBKUNbklOSfDHJU0meTPKxVp+fZGOS\nZ9rX44fGXJ1kc5Knkywfqp+d5Im276YkafWjk9zV6g8lWTw0ZlX7Hs8kWTWTL16SJKkX07nStgv4\n1ao6HTgHuDLJ6cBVwP1VtQS4vz2n7VsJnAGsAD6V5Kh2rpuBy4Al7bGi1VcDO6vqNOBG4IZ2rvnA\nNcB7gWXANcPhUJIkaa6YMrRV1baqeqxt/x3wVWAhcAGwrh22DriwbV8A3FlVb1TVs8BmYFmSk4Fj\nq+rBqirg9glj9pzrbuDcdhVuObCxqnZU1U5gI98JepIkSXPGAb2nrd22fA/wEHBSVW1ru14ETmrb\nC4Hnh4a90GoL2/bE+l5jqmoX8Cpwwn7OJUmSNKdMO7QleTvwWeCXq+q14X3tylnN8NymLcnlSTYl\n2bR9+/ZxTUOSJOktM63QluRtDALbH1bVH7fyS+2WJ+3ry62+FThlaPiiVtvatifW9xqTZB5wHPDK\nfs61l6q6paqWVtXSBQsWTOclSZIkdWU6q0cD3Ap8tao+ObRrA7BnNecq4J6h+sq2IvRUBgsOHm63\nUl9Lck4756UTxuw510XAA+3q3X3AeUmObwsQzms1SZKkOWXeNI55H/AR4Ikkj7farwPXA+uTrAae\nAy4BqKonk6wHnmKw8vTKqtrdxl0B3AYcA9zbHjAIhXck2QzsYLD6lKrakeQ64JF23LVVteMgX6sk\nSVK3pgxtVfUXQPax+9x9jFkDrBlR3wScOaL+OnDxPs61Flg71TwlSZJmMz8RQZIkqQOGNkmSpA4Y\n2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNokSZI6YGiTJEnqgKFNkiSpA9P5GKs5afFVnx/3FN60\n5frzxz0FSZI0Zl5pkyRJ6oChTZIkqQOGNkmSpA4Y2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNok\nSZI6YGiTJEnqgKFNkiSpA4Y2SZKkDkwZ2pKsTfJykq8M1T6eZGuSx9vjZ4f2XZ1kc5Knkywfqp+d\n5Im276YkafWjk9zV6g8lWTw0ZlWSZ9pj1Uy9aEmSpN5M50rbbcCKEfUbq+qs9vgCQJLTgZXAGW3M\np5Ic1Y6/GbgMWNIee865GthZVacBNwI3tHPNB64B3gssA65JcvwBv0JJkqRZYMrQVlV/DuyY5vku\nAO6sqjeq6llgM7AsycnAsVX1YFUVcDtw4dCYdW37buDcdhVuObCxqnZU1U5gI6PDoyRJ0qx3KO9p\n+8UkX263T/dcAVsIPD90zAuttrBtT6zvNaaqdgGvAifs51ySJElzzsGGtpuBHwLOArYBn5ixGR2E\nJJcn2ZRk0/bt28c5FUmSpLfEQYW2qnqpqnZX1beBP2DwnjOArcApQ4cuarWtbXtifa8xSeYBxwGv\n7Odco+ZzS1UtraqlCxYsOJiXJEmSdEQ7qNDW3qO2xweBPStLNwAr24rQUxksOHi4qrYBryU5p71f\n7VLgnqExe1aGXgQ80N73dh9wXpLj2+3X81pNkiRpzpk31QFJPgO8HzgxyQsMVnS+P8lZQAFbgF8A\nqKonk6wHngJ2AVdW1e52qisYrEQ9Bri3PQBuBe5IspnBgoeV7Vw7klwHPNKOu7aqprsgQpIkaVaZ\nMrRV1YdHlG/dz/FrgDUj6puAM0fUXwcu3se51gJrp5qjJEnSbOcnIkiSJHXA0CZJktQBQ5skSVIH\nDG2SJEkdMLRJkiR1wNAmSZLUAUObJElSBwxtkiRJHTC0SZIkdcDQJkmS1AFDmyRJUgcMbZIkSR0w\ntEmSJHXA0CZJktQBQ5skSVIHDG2SJEkdMLRJkiR1wNAmSZLUAUObJElSBwxtkiRJHTC0SZIkdcDQ\nJkmS1IF5Ux2QZC3wAeDlqjqz1eYDdwGLgS3AJVW1s+27GlgN7AZ+qarua/WzgduAY4AvAB+rqkpy\nNHA7cDbwCvChqtrSxqwCfqNN5beqat0hv2JJktStxVd9ftxTeNOW688/rN9vOlfabgNWTKhdBdxf\nVUuA+9tzkpwOrATOaGM+leSoNuZm4DJgSXvsOedqYGdVnQbcCNzQzjUfuAZ4L7AMuCbJ8Qf+EiVJ\nkvo3ZWirqj8HdkwoXwDsueq1DrhwqH5nVb1RVc8Cm4FlSU4Gjq2qB6uqGFxZu3DEue4Gzk0SYDmw\nsap2tKt4G5kcHiVJkuaEg31P20lVta1tvwic1LYXAs8PHfdCqy1s2xPre42pql3Aq8AJ+zmXJEnS\nnHPICxHalbOagbkctCSXJ9mUZNP27dvHORVJkqS3xMGGtpfaLU/a15dbfStwytBxi1pta9ueWN9r\nTJJ5wHEMFiTs61yTVNUtVbW0qpYuWLDgIF+SJEnSketgQ9sGYFXbXgXcM1RfmeToJKcyWHDwcLuV\n+lqSc9r71S6dMGbPuS4CHmhX7+4DzktyfFuAcF6rSZIkzTnT+ZUfnwHeD5yY5AUGKzqvB9YnWQ08\nB1wCUFVPJlkPPAXsAq6sqt3tVFfwnV/5cW97ANwK3JFkM4MFDyvbuXYkuQ54pB13bVVNXBAhSZI0\nJ0wZ2qrqw/vYde4+jl8DrBlR3wScOaL+OnDxPs61Flg71RwlSZJmOz8RQZIkqQOGNkmSpA4Y2iRJ\nkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNokSZI6YGiTJEnqgKFNkiSpA4Y2SZKkDhjaJEmSOmBokyRJ\n6oChTZIkqQOGNkmSpA4Y2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNokSZI6YGiTJEnqgKFNkiSp\nA4cU2pJsSfJEkseTbGq1+Uk2JnmmfT1+6Pirk2xO8nSS5UP1s9t5Nie5KUla/egkd7X6Q0kWH8p8\nJUmSejUTV9p+sqrOqqql7flVwP1VtQS4vz0nyenASuAMYAXwqSRHtTE3A5cBS9pjRauvBnZW1WnA\njcANMzBfSZKk7rwVt0cvANa17XXAhUP1O6vqjap6FtgMLEtyMnBsVT1YVQXcPmHMnnPdDZy75yqc\nJEnSXHKooa2AP0vyaJLLW+2kqtrWtl8ETmrbC4Hnh8a+0GoL2/bE+l5jqmoX8CpwwiHOWZIkqTvz\nDnH8j1fV1iTfD2xM8rXhnVVVSeoQv8eUWmC8HOCd73znW/3tJEmSDrtDutJWVVvb15eBzwHLgJfa\nLU/a15fb4VuBU4aGL2q1rW17Yn2vMUnmAccBr4yYxy1VtbSqli5YsOBQXpIkSdIR6aBDW5LvTfJ9\ne7aB84CvABuAVe2wVcA9bXsDsLKtCD2VwYKDh9ut1NeSnNPer3bphDF7znUR8EB735skSdKccii3\nR08CPtfWBcwDPl1Vf5rkEWB9ktXAc8AlAFX1ZJL1wFPALuDKqtrdznUFcBtwDHBvewDcCtyRZDOw\ng8HqU0mSpDnnoENbVf0N8O4R9VeAc/cxZg2wZkR9E3DmiPrrwMUHO0dJkqTZwk9EkCRJ6oChTZIk\nqQOGNkmSpA4Y2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNokSZI6YGiTJEnqgKFNkiSpA4Y2SZKk\nDhjaJEmSOmBokyRJ6oChTZIkqQOGNkmSpA4Y2iRJkjpgaJMkSeqAoU2SJKkDhjZJkqQOGNokSZI6\nYGiTJEnqQBehLcmKJE8n2ZzkqnHPR5Ik6XA74kNbkqOA/wz8DHA68OEkp493VpIkSYfXER/agGXA\n5qr6m6r6FnAncMGY5yRJknRY9RDaFgLPDz1/odUkSZLmjFTVuOewX0kuAlZU1b9vzz8CvLeqPjp0\nzOXA5e3pDwNPH/aJjnYi8LfjnsQRyL6MZl9Gsy+T2ZPR7Mto9mW0I6UvP1hVC6Zz4Ly3eiYzYCtw\nytDzRa32pqq6BbjlcE5qOpJsqqql457Hkca+jGZfRrMvk9mT0ezLaPZltB770sPt0UeAJUlOTfLd\nwEpgw5jnJEmSdFgd8VfaqmpXko8C9wFHAWur6skxT0uSJOmwOuJDG0BVfQH4wrjncRCOuFu2Rwj7\nMpp9Gc2+TGZPRrMvo9mX0brryxG/EEGSJEl9vKdNkiRpzjO0SZIkdcDQJkmS1AFDmyRJUgcMbTMo\nyfIkNyfZ0B43J1kx7nkdqZL8x3HPYZzaz8vqJIsn1P/deGY0Xhm4JMnFbfvcJDcluSKJf1cNSfLA\nuOcwbklOnPD859vPy+VJMq55jVuSDyaZ37YXJLk9yRNJ7kqyaNzzG4ckn0zyvnHPYya4enSGJPld\n4F3A7Qw+HxUGn95wKfBMVX1sXHM7UiX5v1X1znHPYxyS/Dbw48BjwL8Gfreqfr/te6yq/vk45zcO\nST4FfD/w3cBrwNEMfpH2+cBLc/XPUJIvTywx+LvmaYCq+meHfVJHgOE/J0l+A/gJ4NPAB4AXqupX\nxjm/cUnyVFWd3rbvAh4E/gj4V8C/qaqfHuf8xiHJduA5YAFwF/CZqvqr8c7q4BjaZkiSr1fVu0bU\nA3y9qpaMYVpjl+S1fe0CjqmqLn5X4ExL8gTwnvbLo9/B4B+bp6vqV5L8VVW9Z8xTPOySPFFVP5rk\nbcCLwMlV9a0k84DH5nA42cAgxP4W8A8M/uz8Twahn6p6bnyzG5/hPydJHgN+oqr+vv38PFZVPzre\nGY5Hkqer6ofb9qNVdfbQvser6qzxzW489vysJHkX8CEGn6x0FPAZBgHu62Od4AHwlsPMeT3Jj42o\n/xjw+uGezBHkG8CSqjp2wuP7gG3jntwYzauqXQBV9Q0GV9uOTfJHDK40zUV7+vGPwCNV9a32fBfw\n7XFObJyq6ueAzzL4RaDvrqotwD9W1XNzNbA1xyR5T5KzgbdV1d/Dmz8/u8c7tbH6UpJrkxzTtj8I\nkOQngVfHO7WxKYCq+npVXVdVZwCXAN9DZ7+439A2c/4t8J+SPJXkv7fHV4Gb2r656nbgB/ex79OH\ncyJHmP+T5F/ueVJVu6tqNYNbXv90fNMaqxeTvB2gqt58L2iSfwJ8a2yzOgJU1eeAnwHen+Qe5m6w\nH7YN+CTwO8DfJjkZIMkJtP8AzFEfZfCfnKeBi4HPJvk74DLgI+Oc2BhNeo9jVX25qq6uqtPGMaGD\n5e3RGdb+gVnYnm6tqhfHOR8dmdr/gqmqfxixb2FVbT38szoyJfle4Hur6uVxz+VIkOTdwL+oqv8y\n7rkciZIcBRxdVf9v3HMZtyTHMbiq/8q45zJOSd5eVd8c9zxmgqHtMEjyI1X1tXHP40hjX0azL5PZ\nk9Hsy2j2ZTT7MllvPTG0HQZzeZXk/tiX0ezLZPZkNPsymn0Zzb5M1ltP5uTKvbdCkpv2tQt4x+Gc\ny5HEvoxmXyazJ6PZl9Hsy2j2ZbLZ1BOvtM2Q9kbPXwXeGLH7E1V14oj6rGdfRrMvk9mT0ezLaPZl\nNPsy2WzqiVfaZs4jwFeq6n9P3JHk44d/OkcM+zKafZnMnoxmX0azL6PZl8lmTU+80jZD2seGvO6K\npb3Zl9Hsy2T2ZDT7Mpp9Gc2+TDabemJokyRJ6oC/XHeGJDkuyfVJvpZkR5JXkny11bp6o+NMsi+j\n2ZfJ7Mlo9mU0+zKafZlsNvXE0DZz1gM7gfdX1fyqOgH4yVZbP9aZjZd9Gc2+TGZPRrMvo9mX0ezL\nZLOmJ94enSEZ+pDeA9k329mX0ezLZPZkNPsymn0Zzb5MNpt64pW2mfNckl9LctKeQpKTkvwH4Pkx\nzmvc7Mto9mUyezKafRnNvoxmXyabNT0xtM2cDwEnAP8jyc4kO4AvAfOBS8Y5sTGzL6PZl8nsyWj2\nZTT7Mpp9mWzW9MTbozMoyY8Ai4AHhz+cNsmKqvrT8c1svOzLaPZlMnsymn0Zzb6MZl8mmy098Urb\nDEnyS8A9wEeBryS5YGj3b49nVuNnX0azL5PZk9Hsy2j2ZTT7Mtls6omfiDBzLgPOrqpvJlkM3J1k\ncVX9HoPPN5ur7Mto9mUyezKafRnNvoxmXyabNT0xtM2c79pzybWqtiR5P4MfjB+ksx+KGWZfRrMv\nk9mT0ezLaPZlNPsy2azpibdHZ85LSc7a86T9gHwAOBH40bHNavzsy2j2ZTJ7Mpp9Gc2+jGZfJps1\nPXEhwgxJsgjYVVUvjtj3vqr6X2OY1tjZl9Hsy2T2ZDT7Mpp9Gc2+TDabemJokyRJ6oC3RyVJkjpg\naJMkSeqAoU2SJKkDhjZJkqQOGNokSZI68P8BSqGvmORoYLYAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x14b45a3d0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.sort_values(by=['Year'])['Year'].value_counts().sort_index().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Save a copy of the dataframe. It's 736.4MB on my MacBook."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.to_pickle('list of 990 filings, 2011-2018.pkl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Next steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a dataset containing basic details on all 2,373,310 990 filings. We still don't have the actual 990 data, however. To get that, we will have to pick which filings we want and then access the *URL* column for that filing as seen in our dataset. The code block below prints out the first URL. "
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml\n",
"Name: URL, dtype: object\n"
]
}
],
"source": [
"print df['URL'][:1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>If you followed the link you will see the data are XML format. A key task in future tutorials will be to write code that can 1) parse the XML and 2) map the elements to specific 990 variables. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"navigate_num": "#000000",
"navigate_text": "#333333",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700",
"sidebar_border": "#EEEEEE",
"wrapper_background": "#FFFFFF"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "279px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 4,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment