Skip to content

Instantly share code, notes, and snippets.

@TDahlberg
Created August 20, 2016 01:42
Show Gist options
  • Save TDahlberg/fa11c91d291b2dfcdc936628057cdc63 to your computer and use it in GitHub Desktop.
Save TDahlberg/fa11c91d291b2dfcdc936628057cdc63 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleaning the Nonprofit List"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up the data\n",
"After downloading and unzipping the 4-part file from [https://github.com/TDahlberg/nonprofit_data](https://github.com/TDahlberg/nonprofit_data), import as a CSV"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"data = pd.read_csv(\"/Users/tylerdahlberg/projects/internal/nonprofit_mailinglist/RawData/all_nonprofits.csv\")\n",
"data.rename(columns = {\"NTEE_CD\":'code'},inplace=True) # for a later join"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the data briefly"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns in merged file:\n",
"Unnamed: 0\n",
"Match_addr\n",
"EIN\n",
"NAME\n",
"ICO\n",
"STREET\n",
"CITY_1\n",
"STATE\n",
"ZIP\n",
"GROUP_\n",
"SUBSECTION\n",
"AFFILIATION\n",
"CLASSIFICATION\n",
"RULING\n",
"DEDUCTIBILITY\n",
"FOUNDATION\n",
"ACTIVITY\n",
"ORGANIZATION\n",
"STATUS_1\n",
"TAX_PERIOD\n",
"ASSET_CD\n",
"INCOME_CD\n",
"FILING_REQ_CD\n",
"PF_FILING_REQ_CD\n",
"ACCT_PD\n",
"ASSET_AMT\n",
"INCOME_AMT\n",
"REVENUE_AMT\n",
"code\n",
"Zip5\n",
"Zip4\n",
"Eco\n",
"Lat\n",
"Long\n",
"description\n",
"category\n",
"\n",
"Number of records 1551677\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Match_addr</th>\n",
" <th>EIN</th>\n",
" <th>NAME</th>\n",
" <th>ICO</th>\n",
" <th>STREET</th>\n",
" <th>CITY_1</th>\n",
" <th>STATE</th>\n",
" <th>ZIP</th>\n",
" <th>GROUP_</th>\n",
" <th>...</th>\n",
" <th>INCOME_AMT</th>\n",
" <th>REVENUE_AMT</th>\n",
" <th>code</th>\n",
" <th>Zip5</th>\n",
" <th>Zip4</th>\n",
" <th>Eco</th>\n",
" <th>Lat</th>\n",
" <th>Long</th>\n",
" <th>description</th>\n",
" <th>category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>61519148</td>\n",
" <td>AVON LIONS CHARITIES INC</td>\n",
" <td>% DUANE E STARR</td>\n",
" <td>PO BOX 39</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-0039</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>T12</td>\n",
" <td>6001</td>\n",
" <td>39</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Fund Raising &amp; Fund Distribution</td>\n",
" <td>philanthropy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>34 Stockbridge Dr, Avon, Connecticut, 06001</td>\n",
" <td>66074890</td>\n",
" <td>DELTA MU HOUSE ASSOCIATION OF KAPPA KAPPA GAMM...</td>\n",
" <td>% KAREN GREENBERG</td>\n",
" <td>34 STOCKBRIDGE DR</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-4415</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>6001</td>\n",
" <td>4415</td>\n",
" <td>NaN</td>\n",
" <td>41.799419</td>\n",
" <td>-72.899200</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>44 Copplestone Rd, Avon, Connecticut, 06001</td>\n",
" <td>66055634</td>\n",
" <td>UNITED STATES POWER SQUADRONS</td>\n",
" <td>% CLIFFORD W MCKIBBIN</td>\n",
" <td>44 COPPLESTONE RD</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-2913</td>\n",
" <td>1041</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>6001</td>\n",
" <td>2913</td>\n",
" <td>NaN</td>\n",
" <td>41.776310</td>\n",
" <td>-72.869194</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>21 Waterville Rd, Avon, Connecticut, 06001</td>\n",
" <td>61502263</td>\n",
" <td>FOLEY FAMILY FOUNDATION INC</td>\n",
" <td>% PEPE &amp; HAZARD KATHLEEN BORNHORST</td>\n",
" <td>21 WATERVILLE RD</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-2097</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>T20Z</td>\n",
" <td>6001</td>\n",
" <td>2097</td>\n",
" <td>NaN</td>\n",
" <td>41.803600</td>\n",
" <td>-72.818550</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>107 Craigemore Cir, Avon, Connecticut, 06001</td>\n",
" <td>60787490</td>\n",
" <td>BRIARCLIFF SWIMMING &amp; RECREATION CLUB INC</td>\n",
" <td>% PETER SCOTT</td>\n",
" <td>107 CRAIGEMORE CIR</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-3418</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>35597</td>\n",
" <td>35597</td>\n",
" <td>NaN</td>\n",
" <td>6001</td>\n",
" <td>3418</td>\n",
" <td>NaN</td>\n",
" <td>41.808411</td>\n",
" <td>-72.881012</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 36 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Match_addr EIN \\\n",
"0 0 NaN 61519148 \n",
"1 1 34 Stockbridge Dr, Avon, Connecticut, 06001 66074890 \n",
"2 2 44 Copplestone Rd, Avon, Connecticut, 06001 66055634 \n",
"3 3 21 Waterville Rd, Avon, Connecticut, 06001 61502263 \n",
"4 4 107 Craigemore Cir, Avon, Connecticut, 06001 60787490 \n",
"\n",
" NAME \\\n",
"0 AVON LIONS CHARITIES INC \n",
"1 DELTA MU HOUSE ASSOCIATION OF KAPPA KAPPA GAMM... \n",
"2 UNITED STATES POWER SQUADRONS \n",
"3 FOLEY FAMILY FOUNDATION INC \n",
"4 BRIARCLIFF SWIMMING & RECREATION CLUB INC \n",
"\n",
" ICO STREET CITY_1 STATE \\\n",
"0 % DUANE E STARR PO BOX 39 AVON CT \n",
"1 % KAREN GREENBERG 34 STOCKBRIDGE DR AVON CT \n",
"2 % CLIFFORD W MCKIBBIN 44 COPPLESTONE RD AVON CT \n",
"3 % PEPE & HAZARD KATHLEEN BORNHORST 21 WATERVILLE RD AVON CT \n",
"4 % PETER SCOTT 107 CRAIGEMORE CIR AVON CT \n",
"\n",
" ZIP GROUP_ ... INCOME_AMT REVENUE_AMT code Zip5 \\\n",
"0 06001-0039 0 ... 0 0 T12 6001 \n",
"1 06001-4415 0 ... 0 0 NaN 6001 \n",
"2 06001-2913 1041 ... 0 0 NaN 6001 \n",
"3 06001-2097 0 ... 1 NaN T20Z 6001 \n",
"4 06001-3418 0 ... 35597 35597 NaN 6001 \n",
"\n",
" Zip4 Eco Lat Long description \\\n",
"0 39 NaN NaN NaN Fund Raising & Fund Distribution \n",
"1 4415 NaN 41.799419 -72.899200 NaN \n",
"2 2913 NaN 41.776310 -72.869194 NaN \n",
"3 2097 NaN 41.803600 -72.818550 NaN \n",
"4 3418 NaN 41.808411 -72.881012 NaN \n",
"\n",
" category \n",
"0 philanthropy \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
"[5 rows x 36 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the columns in the data\n",
"print(\"Columns in merged file:\")\n",
"for i in list(data):\n",
" print(i)\n",
" \n",
"print(\"\\nNumber of records\", len(data))\n",
"\n",
"data.shape\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add in NTEE Codes as a Dataframe\n",
"\n",
"This will read in NTEE codes from a separate dataset. These are necessary because the NTEE codes included in the IRS master file are short codes, and we need more descriptive NTEE codes to decipher what each nonprofit in the list actually does."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ntee = pd.read_csv(\"https://raw.githubusercontent.com/TDahlberg/ntee_codes/master/ntee_codes.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Join nonprofit list to NTEE Codes\n",
"This joins the IRS master file to the list of descriptive NTEE codes using a pandas merge function."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Match_addr</th>\n",
" <th>EIN</th>\n",
" <th>NAME</th>\n",
" <th>ICO</th>\n",
" <th>STREET</th>\n",
" <th>CITY_1</th>\n",
" <th>STATE</th>\n",
" <th>ZIP</th>\n",
" <th>GROUP_</th>\n",
" <th>...</th>\n",
" <th>code</th>\n",
" <th>Zip5</th>\n",
" <th>Zip4</th>\n",
" <th>Eco</th>\n",
" <th>Lat</th>\n",
" <th>Long</th>\n",
" <th>description_x</th>\n",
" <th>category_x</th>\n",
" <th>description_y</th>\n",
" <th>category_y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>61519148</td>\n",
" <td>AVON LIONS CHARITIES INC</td>\n",
" <td>% DUANE E STARR</td>\n",
" <td>PO BOX 39</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-0039</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>T12</td>\n",
" <td>6001</td>\n",
" <td>39</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Fund Raising &amp; Fund Distribution</td>\n",
" <td>philanthropy</td>\n",
" <td>Fund Raising &amp; Fund Distribution</td>\n",
" <td>philanthropy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>34 Stockbridge Dr, Avon, Connecticut, 06001</td>\n",
" <td>66074890</td>\n",
" <td>DELTA MU HOUSE ASSOCIATION OF KAPPA KAPPA GAMM...</td>\n",
" <td>% KAREN GREENBERG</td>\n",
" <td>34 STOCKBRIDGE DR</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-4415</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>6001</td>\n",
" <td>4415</td>\n",
" <td>NaN</td>\n",
" <td>41.799419</td>\n",
" <td>-72.899200</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>44 Copplestone Rd, Avon, Connecticut, 06001</td>\n",
" <td>66055634</td>\n",
" <td>UNITED STATES POWER SQUADRONS</td>\n",
" <td>% CLIFFORD W MCKIBBIN</td>\n",
" <td>44 COPPLESTONE RD</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-2913</td>\n",
" <td>1041</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>6001</td>\n",
" <td>2913</td>\n",
" <td>NaN</td>\n",
" <td>41.776310</td>\n",
" <td>-72.869194</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>21 Waterville Rd, Avon, Connecticut, 06001</td>\n",
" <td>61502263</td>\n",
" <td>FOLEY FAMILY FOUNDATION INC</td>\n",
" <td>% PEPE &amp; HAZARD KATHLEEN BORNHORST</td>\n",
" <td>21 WATERVILLE RD</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-2097</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>T20Z</td>\n",
" <td>6001</td>\n",
" <td>2097</td>\n",
" <td>NaN</td>\n",
" <td>41.803600</td>\n",
" <td>-72.818550</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>107 Craigemore Cir, Avon, Connecticut, 06001</td>\n",
" <td>60787490</td>\n",
" <td>BRIARCLIFF SWIMMING &amp; RECREATION CLUB INC</td>\n",
" <td>% PETER SCOTT</td>\n",
" <td>107 CRAIGEMORE CIR</td>\n",
" <td>AVON</td>\n",
" <td>CT</td>\n",
" <td>06001-3418</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>6001</td>\n",
" <td>3418</td>\n",
" <td>NaN</td>\n",
" <td>41.808411</td>\n",
" <td>-72.881012</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 38 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Match_addr EIN \\\n",
"0 0 NaN 61519148 \n",
"1 1 34 Stockbridge Dr, Avon, Connecticut, 06001 66074890 \n",
"2 2 44 Copplestone Rd, Avon, Connecticut, 06001 66055634 \n",
"3 3 21 Waterville Rd, Avon, Connecticut, 06001 61502263 \n",
"4 4 107 Craigemore Cir, Avon, Connecticut, 06001 60787490 \n",
"\n",
" NAME \\\n",
"0 AVON LIONS CHARITIES INC \n",
"1 DELTA MU HOUSE ASSOCIATION OF KAPPA KAPPA GAMM... \n",
"2 UNITED STATES POWER SQUADRONS \n",
"3 FOLEY FAMILY FOUNDATION INC \n",
"4 BRIARCLIFF SWIMMING & RECREATION CLUB INC \n",
"\n",
" ICO STREET CITY_1 STATE \\\n",
"0 % DUANE E STARR PO BOX 39 AVON CT \n",
"1 % KAREN GREENBERG 34 STOCKBRIDGE DR AVON CT \n",
"2 % CLIFFORD W MCKIBBIN 44 COPPLESTONE RD AVON CT \n",
"3 % PEPE & HAZARD KATHLEEN BORNHORST 21 WATERVILLE RD AVON CT \n",
"4 % PETER SCOTT 107 CRAIGEMORE CIR AVON CT \n",
"\n",
" ZIP GROUP_ ... code Zip5 Zip4 Eco Lat \\\n",
"0 06001-0039 0 ... T12 6001 39 NaN NaN \n",
"1 06001-4415 0 ... NaN 6001 4415 NaN 41.799419 \n",
"2 06001-2913 1041 ... NaN 6001 2913 NaN 41.776310 \n",
"3 06001-2097 0 ... T20Z 6001 2097 NaN 41.803600 \n",
"4 06001-3418 0 ... NaN 6001 3418 NaN 41.808411 \n",
"\n",
" Long description_x category_x \\\n",
"0 NaN Fund Raising & Fund Distribution philanthropy \n",
"1 -72.899200 NaN NaN \n",
"2 -72.869194 NaN NaN \n",
"3 -72.818550 NaN NaN \n",
"4 -72.881012 NaN NaN \n",
"\n",
" description_y category_y \n",
"0 Fund Raising & Fund Distribution philanthropy \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
"[5 rows x 38 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Left join on 'code' column\n",
"datamerge = pd.merge(data, ntee, on='code',how='left')\n",
"\n",
"datamerge.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the results\n",
"The datamerge.shape is used merely to check that I didn't break the data set during the merge"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1551677, 38)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the rows and columns from the new dataframe\n",
"datamerge.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export the results as a CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Write out datamerge as a new CSV\n",
"datamerge.to_csv(\"/Users/tylerdahlberg/Downloads/all_nonprofits.csv\")"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment