Skip to content

Instantly share code, notes, and snippets.

@careduz
Last active March 29, 2018 15:24
Show Gist options
  • Save careduz/2b1c592b9dee2b7ce93a3d4ed750b698 to your computer and use it in GitHub Desktop.
Save careduz/2b1c592b9dee2b7ce93a3d4ed750b698 to your computer and use it in GitHub Desktop.
clearedpermits_transformation.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 0. Context\n",
"The purpose of this script is to transform the cleared permits CSV file, created after the consolidation step, in order to prepare for data visualization and analysis. This transformation is performed through the following steps:\n",
"> 1. Importing the consolidated .CSV file into Python as a table\n",
"> 2. Exclude data outside of scope of analysis\n",
"> 3. Calculate duration between dates\n",
"> 4. Include metadata fields for visualization\n",
"> 5. Clean numerical columns and create additional features\n",
"> 6. Create CSV file with transformed data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"\n",
"directory = \"./\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Import consolidated .CSV file\n",
"Will retrieve the transformed csv file (_note_: filename is hardcoded) from the *input_files* folder."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rows: 517,561 | Columns: 30 | Permits: 517,561\n"
]
}
],
"source": [
"input_file = 'clearedpermits_2001to2017.csv'\n",
"raw_data = pd.read_csv('{0}{1}'.format(directory, input_file), low_memory=False)\n",
"\n",
"raw_data_count = len(raw_data['PERMIT_NUM'].unique())\n",
"\n",
"print(\"Rows: {0} | Columns: {1} | Permits: {0}\".format(\"{:,}\".format(raw_data.shape[0]), raw_data.shape[1], \"{:,}\".format(raw_data_count)))\n",
"\n",
"raw_data = raw_data.set_index('PERMIT_NUM')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Exclude data outside scope of analysis\n",
"Opted to remove Cancelled permits from analysis given these are cancelled for a number of reasons that the City of often has no control over and thus could skew analysis results."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rows: 494,812 | Columns: 29 | Permits: 461,266\n"
]
}
],
"source": [
"all_permits = raw_data[raw_data['STATUS'] != 'Cancelled']\n",
"all_permits_count = len(all_permits.reset_index(drop=False)['PERMIT_NUM'].unique())\n",
"\n",
"print(\"Rows: {0} | Columns: {1} | Permits: {2}\".format(\"{:,}\".format(all_permits.shape[0]), all_permits.shape[1], \"{:,}\".format(all_permits_count)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Calculate duration between dates\n",
"A lot can be learned from a date, such as when a permit was opened or completed. When faced with dates, however, we find considerable value can be derived from the _length of time_ between dates or times than from the actual dates.\n",
"\n",
"For instance, knowing the _duration_ from permit issue to completion (e.g. 10 days) makes the data more meaningful and facilitates comparison between permits. Essentially, it helps answer questions that start with \"how long\".\n",
"\n",
"Dealing this type of data (i.e. datetime) can be tricky for various reasons, for example:\n",
"* It is both a dimension (i.e. distinct, such as months in the year - one could compare all Januaries, or all Mondays) _and_ a measure (i.e. continuous, like a range of numbers)\n",
"* It is a string (i.e. distinct, once again) yet mathematical operations can be performed (e.g. difference between datetime values)\n",
"\n",
"Here we cover how we worked with the time to calculate duration\n",
"> 1. Duration (days) between permit revisions\n",
"> 2. Duration (days) between permits, such as number of days between APPLICATION_DATE and ISSUED_DATE\n",
"\n",
"## 3.1. Duration between permit revisions\n",
"To calculate duration between revisions performed the steps below.\n",
"> 1. Split permits into two types: single-revision (SR), and multiple revision (MR)\n",
"> 2. Calculated duration between revisions for MR permits\n",
"> 3. Merged SR and MR back into the same table\n",
"\n",
"### 3.1.1. Split permits into single- and multiple- revision types\n",
"Grouped by PERMIT_NUM and counted the REVISION_NUM. If count = 1, the permit is single-revision, (SR) and if count > 1 the permit is multiple-revision (MR). This is because I want duration between revisions only for MR permits."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"split the 461,266 permits into: \n",
" 431,635 permits with one revision (431,635 rows) \n",
" 29,631 permits with multiple revisions (63,177 rows)\n"
]
}
],
"source": [
"single_rev_permits, multi_rev_permits = pd.DataFrame(), pd.DataFrame()\n",
"\n",
"# group by PERMIT_NUM and count REVISION_NUM\n",
"permit_group = all_permits.reset_index().groupby('PERMIT_NUM').agg({'REVISION_NUM':['count']})\n",
"\n",
"# get PERMIT_NUMs with one REVISION_NUM\n",
"single_rev_group = permit_group[ permit_group[('REVISION_NUM', 'count')] == 1 ]\n",
"single_rev_group.columns = single_rev_group.columns.droplevel(0)\n",
"single_rev_group_permits = single_rev_group.shape[0]\n",
"\n",
"# get PERMIT_NUMs with more than one REVISION_NUM\n",
"multiple_rev_group = permit_group[ permit_group[('REVISION_NUM', 'count')] > 1 ]\n",
"multiple_rev_group.columns = multiple_rev_group.columns.droplevel(0)\n",
"multiple_rev_group_permits = multiple_rev_group.shape[0]\n",
"\n",
"# load permit into appropiate tables (single or multiple revisions)\n",
"single_rev_permits = all_permits[all_permits.index.isin(single_rev_group.index)].copy()\n",
"single_rev_permits.loc[:, 'REVISIONS_COUNT'] = 1\n",
"\n",
"multi_rev_permits = all_permits.join(multiple_rev_group, how='inner').copy()\n",
"multi_rev_permits = multi_rev_permits.rename(columns= {'count': \"REVISIONS_COUNT\"})\n",
"\n",
"split_all_permits_count = single_rev_group_permits + multiple_rev_group_permits\n",
"split_all_permits_records_count = single_rev_permits.shape[0] + multi_rev_permits.shape[0]\n",
"\n",
"print(\"split the {0} permits into:\".format(\"{:,}\".format(all_permits_count)),\n",
" \"\\n {0} permits with one revision ({1} rows)\".format(\"{:,}\".format(single_rev_group_permits), \"{:,}\".format(single_rev_permits.shape[0])),\n",
" \"\\n {0} permits with multiple revisions ({1} rows)\".format(\"{:,}\".format(multiple_rev_group_permits), \"{:,}\".format(multi_rev_permits.shape[0])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1.2. Calculate time between revisions\n",
"To determine time between revisions needed to:\n",
"> 1. Ensure REVISION_NUM is numerical so operations can be performed\n",
"> 2. Order records by PERMIT_NUM and REVISION_NUM\n",
"> 3. Calculate duration between revisions per MR permit\n",
"\n",
"#### Ensure records can be sorted by REVISION_NUM"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"distinct REVISION_NUM values in the 431,635 Single-Revision Permits:\n",
"['0', '00', '01', '02', '03', '1', '2', '4']\n",
"\n",
"distinct REVISION_NUM values in the 29,631 Multiple-Revision Permits:\n",
"['0', '00', '01', '02', '03', '04', '05', '06', '07', '1', '2', '3', '4', '5', '6', '7', '8']\n"
]
}
],
"source": [
"single_rev_values = single_rev_permits['REVISION_NUM'].unique()\n",
"multi_rev_values = multi_rev_permits['REVISION_NUM'].unique()\n",
"\n",
"print(\"\\ndistinct REVISION_NUM values in the {0} Single-Revision Permits:\\n{1}\".format('{:,}'.format(single_rev_group_permits),\n",
" sorted(single_rev_values)\n",
" )\n",
" )\n",
"\n",
"print(\"\\ndistinct REVISION_NUM values in the {0} Multiple-Revision Permits:\\n{1}\".format('{:,}'.format(multiple_rev_group_permits),\n",
" sorted(multi_rev_values)\n",
" )\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that single quotes ('') indicate value is a string, which means it is treated as text - however, all the values are digits. So we must convert the column to numeric for sorting.\n",
"\n",
"#### Convert REVISION_NUM values to numeric"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"multi_rev_permits.loc[:, 'REVISION_NUM'] = pd.to_numeric(multi_rev_permits['REVISION_NUM'])\n",
"single_rev_permits.loc[:, 'REVISION_NUM'] = pd.to_numeric(single_rev_permits['REVISION_NUM'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the REVISION_NUM column now a number, the multiple can be sorted in order of revision number and the time interval between them can be calculated.\n",
"\n",
"#### Sort Multiple-Revision Permits by REVISION_NUM"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>REVISION_NUM</th>\n",
" <th>STATUS</th>\n",
" <th>APPLICATION_DATE</th>\n",
" <th>ISSUED_DATE</th>\n",
" <th>COMPLETED_DATE</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PERMIT_NUM</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00 119162 BLD</th>\n",
" <td>1</td>\n",
" <td>Closed</td>\n",
" <td>2002/08/21</td>\n",
" <td>2002/09/24</td>\n",
" <td>2003/07/16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119162 BLD</th>\n",
" <td>2</td>\n",
" <td>Closed</td>\n",
" <td>2003/01/14</td>\n",
" <td>2003/02/07</td>\n",
" <td>2003/09/09</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119926 CMB</th>\n",
" <td>0</td>\n",
" <td>Closed</td>\n",
" <td>2000/01/10</td>\n",
" <td>2000/07/06</td>\n",
" <td>2001/04/05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119926 CMB</th>\n",
" <td>1</td>\n",
" <td>Closed</td>\n",
" <td>2000/10/17</td>\n",
" <td>2000/11/15</td>\n",
" <td>2009/02/03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 120273 CMB</th>\n",
" <td>0</td>\n",
" <td>Closed</td>\n",
" <td>2000/01/11</td>\n",
" <td>2000/04/13</td>\n",
" <td>2003/12/16</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" REVISION_NUM STATUS APPLICATION_DATE ISSUED_DATE \\\n",
"PERMIT_NUM \n",
"00 119162 BLD 1 Closed 2002/08/21 2002/09/24 \n",
"00 119162 BLD 2 Closed 2003/01/14 2003/02/07 \n",
"00 119926 CMB 0 Closed 2000/01/10 2000/07/06 \n",
"00 119926 CMB 1 Closed 2000/10/17 2000/11/15 \n",
"00 120273 CMB 0 Closed 2000/01/11 2000/04/13 \n",
"\n",
" COMPLETED_DATE \n",
"PERMIT_NUM \n",
"00 119162 BLD 2003/07/16 \n",
"00 119162 BLD 2003/09/09 \n",
"00 119926 CMB 2001/04/05 \n",
"00 119926 CMB 2009/02/03 \n",
"00 120273 CMB 2003/12/16 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multi_rev_permits = multi_rev_permits.reset_index(drop=False).sort_values(['PERMIT_NUM', 'REVISION_NUM']).set_index('PERMIT_NUM')\n",
"multi_rev_permits[['REVISION_NUM', 'STATUS','APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Merge Multiple- and Single- revision permits\n",
"\n",
"With the Multiple-Revision Permits now correctly ordered, the Single-Revision Permits can be merged back. Just need to ensure of MRPs is maintained, so will simply append those records at the bottom of the MRP table,"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distinct REVISION_NUM values across all 494,812 permits:\n",
"[0, 1, 2, 3, 4, 5, 6, 7, 8]\n"
]
},
{
"data": {
"text/plain": [
"REVISION_NUM int64\n",
"dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits = multi_rev_permits.append(single_rev_permits).copy()\n",
"all_rev_values = all_permits['REVISION_NUM'].unique()\n",
"\n",
"print(\"distinct REVISION_NUM values across all {0} permits:\\n{1}\".format('{:,}'.format(all_permits.shape[0]), sorted(all_rev_values)))\n",
"all_permits[['REVISION_NUM']].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"REVISION_NUM values are now numbers. Single- and multi- revision permits are now in the same table, and also in the order needed for calculating time difference of revisions for multi-revision permits.\n",
"\n",
"#### Calculate days between revisions\n",
"##### Convert Date column values to datetime\n",
"\n",
"To calculate the number of days between dates need to ensure the values are DATETIME type rather than STRING. This allows for time operations.\n",
"\n",
"At the moment we only need APPLICATION_DATE, however the remaining date columns must be properly formatted for calculating durations between them later, so looked at all date columns at the same time. Here are the current data types for all date fields."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"APPLICATION_DATE object\n",
"ISSUED_DATE object\n",
"COMPLETED_DATE object\n",
"dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits[['APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note they are all _object_, which is the most general type used for columns with mixed data types or strings. Time operations, like calculating difference between dates, cannot be performed on object columns so they needed to be converted."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"APPLICATION_DATE datetime64[ns]\n",
"ISSUED_DATE datetime64[ns]\n",
"COMPLETED_DATE datetime64[ns]\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits['APPLICATION_DATE'] = pd.to_datetime(all_permits['APPLICATION_DATE'])\n",
"all_permits['ISSUED_DATE'] = pd.to_datetime(all_permits['ISSUED_DATE'])\n",
"all_permits['COMPLETED_DATE'] = pd.to_datetime(all_permits['COMPLETED_DATE'])\n",
"all_permits[['APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Date column types are now _datetime64[ns]_, as needed for performing time operations.\n",
"\n",
"##### Calculate number of days between revisions"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 3/3 [00:11<00:00, 3.84s/it]\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>REVISION_NUM</th>\n",
" <th>REVISIONS_COUNT</th>\n",
" <th>APPLICATION_DATE</th>\n",
" <th>APPLICATION_DATE_REVISION_DAYS</th>\n",
" <th>COMPLETED_DATE</th>\n",
" <th>COMPLETED_DATE_REVISION_DAYS</th>\n",
" <th>ISSUED_DATE</th>\n",
" <th>ISSUED_DATE_REVISION_DAYS</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PERMIT_NUM</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00 123626 CMB</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2000-01-25</td>\n",
" <td>NaN</td>\n",
" <td>2001-04-03</td>\n",
" <td>NaN</td>\n",
" <td>2000-03-16</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 123626 CMB</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2000-06-02</td>\n",
" <td>129.0</td>\n",
" <td>2005-03-09</td>\n",
" <td>1436.0</td>\n",
" <td>2000-06-08</td>\n",
" <td>84.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 123626 CMB</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>2000-07-18</td>\n",
" <td>46.0</td>\n",
" <td>2001-01-20</td>\n",
" <td>-1509.0</td>\n",
" <td>2000-07-21</td>\n",
" <td>43.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 129159 CMB</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2000-02-22</td>\n",
" <td>NaN</td>\n",
" <td>2006-06-12</td>\n",
" <td>NaN</td>\n",
" <td>2000-05-04</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 129159 CMB</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2001-02-19</td>\n",
" <td>363.0</td>\n",
" <td>2006-06-12</td>\n",
" <td>0.0</td>\n",
" <td>2001-04-05</td>\n",
" <td>336.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" REVISION_NUM REVISIONS_COUNT APPLICATION_DATE \\\n",
"PERMIT_NUM \n",
"00 123626 CMB 0 3 2000-01-25 \n",
"00 123626 CMB 1 3 2000-06-02 \n",
"00 123626 CMB 2 3 2000-07-18 \n",
"00 129159 CMB 0 3 2000-02-22 \n",
"00 129159 CMB 1 3 2001-02-19 \n",
"\n",
" APPLICATION_DATE_REVISION_DAYS COMPLETED_DATE \\\n",
"PERMIT_NUM \n",
"00 123626 CMB NaN 2001-04-03 \n",
"00 123626 CMB 129.0 2005-03-09 \n",
"00 123626 CMB 46.0 2001-01-20 \n",
"00 129159 CMB NaN 2006-06-12 \n",
"00 129159 CMB 363.0 2006-06-12 \n",
"\n",
" COMPLETED_DATE_REVISION_DAYS ISSUED_DATE \\\n",
"PERMIT_NUM \n",
"00 123626 CMB NaN 2000-03-16 \n",
"00 123626 CMB 1436.0 2000-06-08 \n",
"00 123626 CMB -1509.0 2000-07-21 \n",
"00 129159 CMB NaN 2000-05-04 \n",
"00 129159 CMB 0.0 2001-04-05 \n",
"\n",
" ISSUED_DATE_REVISION_DAYS \n",
"PERMIT_NUM \n",
"00 123626 CMB NaN \n",
"00 123626 CMB 84.0 \n",
"00 123626 CMB 43.0 \n",
"00 129159 CMB NaN \n",
"00 129159 CMB 336.0 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"single_rev_permits_temp = pd.DataFrame()\n",
"multi_rev_permits_temp = pd.DataFrame()\n",
"\n",
"# since time is only needed for multi-revision permits, splitting permits so calculations are not performed on the entire table\n",
"single_rev_permits_temp = all_permits[all_permits['REVISIONS_COUNT'] == 1].copy()\n",
"multi_rev_permits_temp = all_permits[all_permits['REVISIONS_COUNT'] > 1].copy()\n",
"\n",
"for date_field in tqdm(['APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']):\n",
" date_diff_field = \"{0}_REVISION_DAYS\".format(date_field)\n",
" multi_rev_permits_temp[date_diff_field] = multi_rev_permits_temp.groupby('PERMIT_NUM', sort=False)[date_field].diff().dt.days\n",
"\n",
"all_permits = multi_rev_permits_temp.append(single_rev_permits_temp).copy()\n",
"date_day_fields = [ date_field for date_field in all_permits.columns.values if (\"DATE\" in date_field) or (\"DAY\" in date_field) ]\n",
"all_permits[ ['REVISION_NUM', 'REVISIONS_COUNT'] + date_day_fields][all_permits['REVISIONS_COUNT'] > 2].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2. Days between permit dates"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>APPLICATION_DATE</th>\n",
" <th>REVIEW_PERIOD_DAYS</th>\n",
" <th>ISSUED_DATE</th>\n",
" <th>INSPECTION_PERIOD_DAYS</th>\n",
" <th>COMPLETED_DATE</th>\n",
" <th>PERMIT_ACTIVE_DAYS</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PERMIT_NUM</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00 119162 BLD</th>\n",
" <td>2002-08-21</td>\n",
" <td>34.0</td>\n",
" <td>2002-09-24</td>\n",
" <td>295.0</td>\n",
" <td>2003-07-16</td>\n",
" <td>329</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119162 BLD</th>\n",
" <td>2003-01-14</td>\n",
" <td>24.0</td>\n",
" <td>2003-02-07</td>\n",
" <td>214.0</td>\n",
" <td>2003-09-09</td>\n",
" <td>238</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119926 CMB</th>\n",
" <td>2000-01-10</td>\n",
" <td>178.0</td>\n",
" <td>2000-07-06</td>\n",
" <td>273.0</td>\n",
" <td>2001-04-05</td>\n",
" <td>451</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119926 CMB</th>\n",
" <td>2000-10-17</td>\n",
" <td>29.0</td>\n",
" <td>2000-11-15</td>\n",
" <td>3002.0</td>\n",
" <td>2009-02-03</td>\n",
" <td>3031</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 120273 CMB</th>\n",
" <td>2000-01-11</td>\n",
" <td>93.0</td>\n",
" <td>2000-04-13</td>\n",
" <td>1342.0</td>\n",
" <td>2003-12-16</td>\n",
" <td>1435</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" APPLICATION_DATE REVIEW_PERIOD_DAYS ISSUED_DATE \\\n",
"PERMIT_NUM \n",
"00 119162 BLD 2002-08-21 34.0 2002-09-24 \n",
"00 119162 BLD 2003-01-14 24.0 2003-02-07 \n",
"00 119926 CMB 2000-01-10 178.0 2000-07-06 \n",
"00 119926 CMB 2000-10-17 29.0 2000-11-15 \n",
"00 120273 CMB 2000-01-11 93.0 2000-04-13 \n",
"\n",
" INSPECTION_PERIOD_DAYS COMPLETED_DATE PERMIT_ACTIVE_DAYS \n",
"PERMIT_NUM \n",
"00 119162 BLD 295.0 2003-07-16 329 \n",
"00 119162 BLD 214.0 2003-09-09 238 \n",
"00 119926 CMB 273.0 2001-04-05 451 \n",
"00 119926 CMB 3002.0 2009-02-03 3031 \n",
"00 120273 CMB 1342.0 2003-12-16 1435 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits['REVIEW_PERIOD_DAYS'] = (all_permits['ISSUED_DATE'] - all_permits['APPLICATION_DATE']).dt.days\n",
"all_permits['INSPECTION_PERIOD_DAYS'] = (all_permits['COMPLETED_DATE']- all_permits['ISSUED_DATE']).dt.days\n",
"all_permits['PERMIT_ACTIVE_DAYS'] = (all_permits['COMPLETED_DATE'] - all_permits['APPLICATION_DATE']).dt.days\n",
"all_permits[['APPLICATION_DATE', 'REVIEW_PERIOD_DAYS', 'ISSUED_DATE', 'INSPECTION_PERIOD_DAYS', 'COMPLETED_DATE', 'PERMIT_ACTIVE_DAYS']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Include metadata fields for visualization\n",
"Additional fields can be added to make visualization easier"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>REVISION_NUM</th>\n",
" <th>FINAL_REVISION_NUM</th>\n",
" <th>FINAL_REVISION_FLAG</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PERMIT_NUM</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00 123626 CMB</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 123626 CMB</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 123626 CMB</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 129159 CMB</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 129159 CMB</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" REVISION_NUM FINAL_REVISION_NUM FINAL_REVISION_FLAG\n",
"PERMIT_NUM \n",
"00 123626 CMB 0 2 NaN\n",
"00 123626 CMB 1 2 NaN\n",
"00 123626 CMB 2 2 1.0\n",
"00 129159 CMB 0 2 NaN\n",
"00 129159 CMB 1 2 NaN"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits['FINAL_REVISION_NUM'] = all_permits.reset_index().groupby('PERMIT_NUM', sort=False)['REVISION_NUM'].max()\n",
"all_permits[\"FINAL_REVISION_FLAG\"] = (all_permits['FINAL_REVISION_NUM'] == all_permits['REVISION_NUM']).astype(int).replace(0, np.nan)\n",
"all_permits[['REVISION_NUM', 'FINAL_REVISION_NUM', 'FINAL_REVISION_FLAG']][all_permits['REVISIONS_COUNT'] > 2].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Clean numerical columns and create additional features\n",
"## 5.1. Identify misclassified numerical columns"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CURRENT_USE object\n",
"DESCRIPTION object\n",
"DWELLING_UNITS_CREATED object\n",
"EST_CONST_COST object\n",
"PERMIT_TYPE object\n",
"POSTAL object\n",
"PROPOSED_USE object\n",
"STATUS object\n",
"STREET_DIRECTION object\n",
"STREET_NAME object\n",
"STREET_NUM object\n",
"STREET_TYPE object\n",
"STRUCTURE_TYPE object\n",
"WARD_GRID object\n",
"WORK object\n",
"dtype: object"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits.select_dtypes(include='object').dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the list above, 2 fields that should be numerical: *DWELLING_UNITS_CREATED* and *EST_CONST_COST*\n",
"\n",
"## 5.2. Dwelling Units\n",
"It is important to note the number of dwelling units is captured in two fields: one for number created and one for number lost. This presents the opportunity to derive a new field, calculating the difference between them to get back the total number of dwelling units created/lost.\n",
"\n",
"For this, both fields need to be numerical, however, as we saw above, *DWELLING_UNITS_CREATED* is object type. Will convert it first and then create a new field with the difference between them."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DWELLING_UNITS_CREATED object\n",
"DWELLING_UNITS_LOST float64\n",
"dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits[['DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST']].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.2.1. Make field numerical\n",
"Looking at the unique values in tihs column, the vast majority of values are already numerical with the notable exception of \"_\\*\\*\\*NOT USED\\*\\*\\*_\". Assumed that would mean no dwelling units were created, hence that one could just be replaced with NaN (i.e. NULL or blank)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, '1.0', '1', '0', '6.0', '0.0', '46.0', '15.0', '193', '9.0',\n",
" '3.0', '11.0', '177.0', '7.0', '4.0', '2.0', '10.0', '5.0', '8.0',\n",
" '5', '136.0', '55.0', '33', '22.0', '40.0', '313.0', '229.0', '2',\n",
" '35', '378', '14.0', '130', '120', '4', '239.0', '423.0', '103.0',\n",
" '12.0', '38.0', '6', '97.0', '16.0', '22', '96.0', '20.0', '103',\n",
" '10', '85.0', '8', '399.0', '3', '24.0', '227.0', '7', '9', '82',\n",
" '357.0', '108', '17.0', '35.0', '175.0', '232.0', '60.0', '91.0',\n",
" '73.0', '247.0', '349.0', '408.0', '13.0', '268.0', '256.0',\n",
" '102.0', '25.0', '30.0', '445.0', '497.0', '68.0', '43.0', '152.0',\n",
" '18.0', '261.0', '402.0', '26.0', '163.0', '36.0', '180.0',\n",
" '205.0', '411.0', '144.0', '37.0', '19.0', '368.0', '118.0',\n",
" '217.0', '293.0', '31.0', '271.0', '464.0', '84.0', '868.0',\n",
" '80.0', '476.0', '334.0', '44.0', '82.0', '291.0', '34.0', '683.0',\n",
" '62.0', '222.0', '134.0', '237.0', '64.0', '75.0', '560.0',\n",
" '506.0', '405.0', '443.0', '101.0', '188.0', '665.0', '780.0',\n",
" '50.0', '157.0', '143.0', '52.0', '155.0', '27.0', '364.0',\n",
" '410.0', '436.0', '95.0', '429.0', '108.0', '145.0', '72.0',\n",
" '384.0', '369.0', '495.0', '439.0', '39.0', '798.0', '32.0',\n",
" '81.0', '28.0', '-1.0', '255.0', '225.0', '241.0', '33.0', '23.0',\n",
" '78.0', '154.0', '121.0', '21.0', '65.0', '107.0', '42.0', '112.0',\n",
" '182.0', '191.0', '126.0', '130.0', '178.0', '120.0', '71.0',\n",
" '29.0', '221.0', '164.0', '545.0', '61.0', '174.0', '48.0',\n",
" '119.0', '57.0', '210.0', '86.0', '270.0', '183.0', '49.0',\n",
" '277.0', '79.0', '213.0', '135.0', '76.0', '83.0', '131.0',\n",
" '132.0', '114.0', '100.0', '54.0', '300.0', '200.0', '116.0',\n",
" '165.0', '142.0', '90.0', '70.0', '109.0', '69.0', '124', '80',\n",
" '00', '226', '54', '429', '44', '303', '306', '0000', '24', '15.7',\n",
" '227', '20', '13', '47', '11', '132', '16', '212', '0001', '115',\n",
" '162', '0003', '196', '25', '0002', '2500', '0005', '0008', '0004',\n",
" '84', '153', '379', '12', '171', '190', '51', '332', '93', '46',\n",
" '129', '18', '158', '117', '32', '92', '110', '*** NOT USED ***',\n",
" '01', '502', '135', '48', '42', '0285', '0224', '0131', '62',\n",
" '0010', '0011', '0337', '274.0', '206.0', '172.0', '266.0',\n",
" '167.0', '194.0', '388.0', '245.0', '254.0', '301.0', '140.0',\n",
" '158.0', '387.0', '383.0', '58.0', '110.0', '92.0', '314.0',\n",
" '450.0', '128.0', '197.0', '138.0', '74.0', '45.0', '425.0',\n",
" '47.0', '308.0', '661.0', '1180.0', '156.0', '201.0', '150.0',\n",
" '168.0', '88.0', '642.0', '323.0', '190.0', '137.0', '57.97',\n",
" '169.0', '224.0', '360.0', '139.0', '490.0', '330.0', '382.0',\n",
" '338.0', '216.0', '328.0', '127.0', '220.0', '339.0', '297.0',\n",
" '354.0', '303.0', '99.0', '538.0', '317.0', '203.0', '310.0',\n",
" '501.0', '111.0', '106.0', '214.0', '204.0', '228.0', '276.0',\n",
" '257.0', '302.0', '198.0', '151.0', '149.0', '231.0', '153.0',\n",
" '146.0', '288.0', '315.0', '179.0', '463.0', '77.0', '59.0',\n",
" '340.0', '343.0', '486.0', '248.0', '187.0', '280.0', '307.0',\n",
" '218.0', '344.0', '637.0', '430.0', '295.0', '234.0', '258.0',\n",
" '209.0', '422.0', '359.0', '283.0', '87.0'], dtype=object)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits['DWELLING_UNITS_CREATED'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DWELLING_UNITS_CREATED float64\n",
"DWELLING_UNITS_LOST float64\n",
"dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits.loc[:, 'DWELLING_UNITS_CREATED'] = pd.to_numeric(all_permits['DWELLING_UNITS_CREATED'], errors='coerce')\n",
"all_permits[['DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST']].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.2.2. Calculate difference between dwelling units created/lost"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>DWELLING_UNITS_CREATED</th>\n",
" <th>DWELLING_UNITS_LOST</th>\n",
" <th>DWELLING_UNITS_DIFFERENCE</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PERMIT_NUM</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00 119162 BLD</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119162 BLD</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119926 CMB</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 119926 CMB</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>00 120273 CMB</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" DWELLING_UNITS_CREATED DWELLING_UNITS_LOST \\\n",
"PERMIT_NUM \n",
"00 119162 BLD NaN NaN \n",
"00 119162 BLD NaN NaN \n",
"00 119926 CMB 1.0 0.0 \n",
"00 119926 CMB 1.0 0.0 \n",
"00 120273 CMB NaN NaN \n",
"\n",
" DWELLING_UNITS_DIFFERENCE \n",
"PERMIT_NUM \n",
"00 119162 BLD NaN \n",
"00 119162 BLD NaN \n",
"00 119926 CMB 1.0 \n",
"00 119926 CMB 1.0 \n",
"00 120273 CMB NaN "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_permits['DWELLING_UNITS_DIFFERENCE'] = all_permits['DWELLING_UNITS_CREATED'] - all_permits['DWELLING_UNITS_LOST']\n",
"all_permits[['DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST', 'DWELLING_UNITS_DIFFERENCE']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.3. Estimated Construction Cost\n",
"Similar to dwelling units, will coerce the conversion to numeric."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"all_permits.loc[:, 'EST_CONST_COST'] = pd.to_numeric(all_permits['EST_CONST_COST'], errors='coerce')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6. Create CSV files with transformed data\n",
"From discussions with the experts of the data discovered there are two every distinct time periods for this dataset, before and after 2006, due to changes to how the data was recorded. These changes are significant enough that these two time periods are not quite comparable; Residential Building permit types, for example, were recategorized under New Buildings and New Houses permit types around then. \n",
"\n",
"Therefore opted two create two distinct CSV files. As an added bonus, this split reduces the file size and makes visualizations more responsive."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"permits_pre_2006 = all_permits[all_permits['ISSUED_DATE'].dt.year <= 2005]\n",
"permits_post_2006 = all_permits[all_permits['ISSUED_DATE'].dt.year > 2005]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" file created: ./clearedpermits_issued_1973_to_2005.csv\n",
" file created: ./clearedpermits_issued_2006_to_2017.csv\n"
]
}
],
"source": [
"while True:\n",
" write_to_csv = input(\"output into .CSV files? (yes / no): \")\n",
" \n",
" if \"yes\" in write_to_csv:\n",
" for table in [permits_pre_2006, permits_post_2006]:\n",
" output_file_name = \"clearedpermits_issued_{0}_to_{1}.csv\".format(table['ISSUED_DATE'].min().date().year, table['ISSUED_DATE'].max().date().year)\n",
" table.reset_index().to_csv(\"{0}{1}\".format(directory, output_file_name), index=True, index_label=\"Record_ID\")\n",
" print(\" file created: {0}{1}\".format(directory, output_file_name))\n",
" break\n",
" \n",
" if \"no\" in write_to_csv:\n",
" print(\"finished (file not created)\".upper())\n",
" break\n",
" \n",
" else: print(\"try again\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment