careduz/clearedpermits_transformation.ipynb

## clearedpermits_transformation.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 0. Context\n",
    "The purpose of this script is to transform the cleared permits CSV file, created after the consolidation step, in order to prepare for data visualization and analysis. This transformation is performed through the following steps:\n",
    "> 1. Importing the consolidated .CSV file into Python as a table\n",
    "> 2. Exclude data outside of scope of analysis\n",
    "> 3. Calculate duration between dates\n",
    "> 4. Include metadata fields for visualization\n",
    "> 5. Clean numerical columns and create additional features\n",
    "> 6. Create CSV file with transformed data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "\n",
    "directory = \"./\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Import consolidated .CSV file\n",
    "Will retrieve the transformed csv file (_note_: filename is hardcoded) from the *input_files* folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rows: 517,561 | Columns: 30 | Permits: 517,561\n"
     ]
    }
   ],
   "source": [
    "input_file = 'clearedpermits_2001to2017.csv'\n",
    "raw_data = pd.read_csv('{0}{1}'.format(directory, input_file), low_memory=False)\n",
    "\n",
    "raw_data_count = len(raw_data['PERMIT_NUM'].unique())\n",
    "\n",
    "print(\"Rows: {0} | Columns: {1} | Permits: {0}\".format(\"{:,}\".format(raw_data.shape[0]), raw_data.shape[1], \"{:,}\".format(raw_data_count)))\n",
    "\n",
    "raw_data = raw_data.set_index('PERMIT_NUM')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Exclude data outside scope of analysis\n",
    "Opted to remove Cancelled permits from analysis given these are cancelled for a number of reasons that the City of often has no control over and thus could skew analysis results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rows: 494,812 | Columns: 29 | Permits: 461,266\n"
     ]
    }
   ],
   "source": [
    "all_permits = raw_data[raw_data['STATUS'] != 'Cancelled']\n",
    "all_permits_count = len(all_permits.reset_index(drop=False)['PERMIT_NUM'].unique())\n",
    "\n",
    "print(\"Rows: {0} | Columns: {1} | Permits: {2}\".format(\"{:,}\".format(all_permits.shape[0]), all_permits.shape[1], \"{:,}\".format(all_permits_count)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Calculate duration between dates\n",
    "A lot can be learned from a date, such as when a permit was opened or completed. When faced with dates, however, we find considerable value can be derived from the _length of time_ between dates or times than from the actual dates.\n",
    "\n",
    "For instance, knowing the _duration_ from permit issue to completion (e.g. 10 days) makes the data more meaningful and facilitates comparison between permits. Essentially, it helps answer questions that start with \"how long\".\n",
    "\n",
    "Dealing this type of data (i.e. datetime) can be tricky for various reasons, for example:\n",
    "* It is both a dimension (i.e. distinct, such as months in the year - one could compare all Januaries, or all Mondays) _and_ a measure (i.e. continuous, like a range of numbers)\n",
    "* It is a string (i.e. distinct, once again) yet mathematical operations can be performed (e.g. difference between datetime values)\n",
    "\n",
    "Here we cover how we worked with the time to calculate duration\n",
    "> 1. Duration (days) between permit revisions\n",
    "> 2. Duration (days) between permits, such as number of days between APPLICATION_DATE and ISSUED_DATE\n",
    "\n",
    "## 3.1. Duration between permit revisions\n",
    "To calculate duration between revisions performed the steps below.\n",
    "> 1. Split permits into two types: single-revision (SR), and multiple revision (MR)\n",
    "> 2. Calculated duration between revisions for MR permits\n",
    "> 3. Merged SR and MR back into the same table\n",
    "\n",
    "### 3.1.1. Split permits into single- and multiple- revision types\n",
    "Grouped by PERMIT_NUM and counted the REVISION_NUM. If count = 1, the permit is single-revision, (SR) and if count > 1 the permit is multiple-revision (MR). This is because I want duration between revisions only for MR permits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "split the 461,266 permits into: \n",
      "   431,635 permits with one revision (431,635 rows) \n",
      "   29,631 permits with multiple revisions (63,177 rows)\n"
     ]
    }
   ],
   "source": [
    "single_rev_permits, multi_rev_permits = pd.DataFrame(), pd.DataFrame()\n",
    "\n",
    "# group by PERMIT_NUM and count REVISION_NUM\n",
    "permit_group = all_permits.reset_index().groupby('PERMIT_NUM').agg({'REVISION_NUM':['count']})\n",
    "\n",
    "# get PERMIT_NUMs with one REVISION_NUM\n",
    "single_rev_group = permit_group[ permit_group[('REVISION_NUM', 'count')] == 1 ]\n",
    "single_rev_group.columns = single_rev_group.columns.droplevel(0)\n",
    "single_rev_group_permits = single_rev_group.shape[0]\n",
    "\n",
    "# get PERMIT_NUMs with more than one REVISION_NUM\n",
    "multiple_rev_group = permit_group[ permit_group[('REVISION_NUM', 'count')] > 1 ]\n",
    "multiple_rev_group.columns = multiple_rev_group.columns.droplevel(0)\n",
    "multiple_rev_group_permits = multiple_rev_group.shape[0]\n",
    "\n",
    "# load permit into appropiate tables (single or multiple revisions)\n",
    "single_rev_permits = all_permits[all_permits.index.isin(single_rev_group.index)].copy()\n",
    "single_rev_permits.loc[:, 'REVISIONS_COUNT'] = 1\n",
    "\n",
    "multi_rev_permits = all_permits.join(multiple_rev_group, how='inner').copy()\n",
    "multi_rev_permits = multi_rev_permits.rename(columns= {'count': \"REVISIONS_COUNT\"})\n",
    "\n",
    "split_all_permits_count = single_rev_group_permits + multiple_rev_group_permits\n",
    "split_all_permits_records_count = single_rev_permits.shape[0] + multi_rev_permits.shape[0]\n",
    "\n",
    "print(\"split the {0} permits into:\".format(\"{:,}\".format(all_permits_count)),\n",
    "      \"\\n   {0} permits with one revision ({1} rows)\".format(\"{:,}\".format(single_rev_group_permits), \"{:,}\".format(single_rev_permits.shape[0])),\n",
    "      \"\\n   {0} permits with multiple revisions ({1} rows)\".format(\"{:,}\".format(multiple_rev_group_permits), \"{:,}\".format(multi_rev_permits.shape[0])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1.2. Calculate time between revisions\n",
    "To determine time between revisions needed to:\n",
    "> 1. Ensure REVISION_NUM is numerical so operations can be performed\n",
    "> 2. Order records by PERMIT_NUM and REVISION_NUM\n",
    "> 3. Calculate duration between revisions per MR permit\n",
    "\n",
    "#### Ensure records can be sorted by REVISION_NUM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "distinct REVISION_NUM values in the 431,635 Single-Revision Permits:\n",
      "['0', '00', '01', '02', '03', '1', '2', '4']\n",
      "\n",
      "distinct REVISION_NUM values in the 29,631 Multiple-Revision Permits:\n",
      "['0', '00', '01', '02', '03', '04', '05', '06', '07', '1', '2', '3', '4', '5', '6', '7', '8']\n"
     ]
    }
   ],
   "source": [
    "single_rev_values = single_rev_permits['REVISION_NUM'].unique()\n",
    "multi_rev_values = multi_rev_permits['REVISION_NUM'].unique()\n",
    "\n",
    "print(\"\\ndistinct REVISION_NUM values in the {0} Single-Revision Permits:\\n{1}\".format('{:,}'.format(single_rev_group_permits),\n",
    "                                                                   sorted(single_rev_values)\n",
    "                                                                  )\n",
    "     )\n",
    "\n",
    "print(\"\\ndistinct REVISION_NUM values in the {0} Multiple-Revision Permits:\\n{1}\".format('{:,}'.format(multiple_rev_group_permits),\n",
    "                                                                   sorted(multi_rev_values)\n",
    "                                                                  )\n",
    "     )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that single quotes ('') indicate value is a string, which means it is treated as text - however, all the values are digits. So we must convert the column to numeric for sorting.\n",
    "\n",
    "#### Convert REVISION_NUM values to numeric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "multi_rev_permits.loc[:, 'REVISION_NUM'] = pd.to_numeric(multi_rev_permits['REVISION_NUM'])\n",
    "single_rev_permits.loc[:, 'REVISION_NUM'] = pd.to_numeric(single_rev_permits['REVISION_NUM'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the REVISION_NUM column now a number, the multiple can be sorted in order of revision number and the time interval between them can be calculated.\n",
    "\n",
    "#### Sort Multiple-Revision Permits by REVISION_NUM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>REVISION_NUM</th>\n",
       "      <th>STATUS</th>\n",
       "      <th>APPLICATION_DATE</th>\n",
       "      <th>ISSUED_DATE</th>\n",
       "      <th>COMPLETED_DATE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PERMIT_NUM</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>00 119162 BLD</th>\n",
       "      <td>1</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2002/08/21</td>\n",
       "      <td>2002/09/24</td>\n",
       "      <td>2003/07/16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119162 BLD</th>\n",
       "      <td>2</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2003/01/14</td>\n",
       "      <td>2003/02/07</td>\n",
       "      <td>2003/09/09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119926 CMB</th>\n",
       "      <td>0</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2000/01/10</td>\n",
       "      <td>2000/07/06</td>\n",
       "      <td>2001/04/05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119926 CMB</th>\n",
       "      <td>1</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2000/10/17</td>\n",
       "      <td>2000/11/15</td>\n",
       "      <td>2009/02/03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 120273 CMB</th>\n",
       "      <td>0</td>\n",
       "      <td>Closed</td>\n",
       "      <td>2000/01/11</td>\n",
       "      <td>2000/04/13</td>\n",
       "      <td>2003/12/16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               REVISION_NUM  STATUS APPLICATION_DATE ISSUED_DATE  \\\n",
       "PERMIT_NUM                                                         \n",
       "00 119162 BLD             1  Closed       2002/08/21  2002/09/24   \n",
       "00 119162 BLD             2  Closed       2003/01/14  2003/02/07   \n",
       "00 119926 CMB             0  Closed       2000/01/10  2000/07/06   \n",
       "00 119926 CMB             1  Closed       2000/10/17  2000/11/15   \n",
       "00 120273 CMB             0  Closed       2000/01/11  2000/04/13   \n",
       "\n",
       "              COMPLETED_DATE  \n",
       "PERMIT_NUM                    \n",
       "00 119162 BLD     2003/07/16  \n",
       "00 119162 BLD     2003/09/09  \n",
       "00 119926 CMB     2001/04/05  \n",
       "00 119926 CMB     2009/02/03  \n",
       "00 120273 CMB     2003/12/16  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "multi_rev_permits = multi_rev_permits.reset_index(drop=False).sort_values(['PERMIT_NUM', 'REVISION_NUM']).set_index('PERMIT_NUM')\n",
    "multi_rev_permits[['REVISION_NUM', 'STATUS','APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Merge Multiple- and Single- revision permits\n",
    "\n",
    "With the Multiple-Revision Permits now correctly ordered, the Single-Revision Permits can be merged back. Just need to ensure of MRPs is maintained, so will simply append those records at the bottom of the MRP table,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "distinct REVISION_NUM values across all 494,812 permits:\n",
      "[0, 1, 2, 3, 4, 5, 6, 7, 8]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "REVISION_NUM    int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits = multi_rev_permits.append(single_rev_permits).copy()\n",
    "all_rev_values = all_permits['REVISION_NUM'].unique()\n",
    "\n",
    "print(\"distinct REVISION_NUM values across all {0} permits:\\n{1}\".format('{:,}'.format(all_permits.shape[0]), sorted(all_rev_values)))\n",
    "all_permits[['REVISION_NUM']].dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "REVISION_NUM values are now numbers. Single- and multi- revision permits are now in the same table, and also in the order needed for calculating time difference of revisions for multi-revision permits.\n",
    "\n",
    "#### Calculate days between revisions\n",
    "##### Convert Date column values to datetime\n",
    "\n",
    "To calculate the number of days between dates need to ensure the values are DATETIME type rather than STRING. This allows for time operations.\n",
    "\n",
    "At the moment we only need APPLICATION_DATE, however the remaining date columns must be properly formatted for calculating durations between them later, so looked at all date columns at the same time. Here are the current data types for all date fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "APPLICATION_DATE    object\n",
       "ISSUED_DATE         object\n",
       "COMPLETED_DATE      object\n",
       "dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits[['APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']].dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note they are all _object_, which is the most general type used for columns with mixed data types or strings. Time operations, like calculating difference between dates, cannot be performed on object columns so they needed to be converted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "APPLICATION_DATE    datetime64[ns]\n",
       "ISSUED_DATE         datetime64[ns]\n",
       "COMPLETED_DATE      datetime64[ns]\n",
       "dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits['APPLICATION_DATE'] = pd.to_datetime(all_permits['APPLICATION_DATE'])\n",
    "all_permits['ISSUED_DATE'] = pd.to_datetime(all_permits['ISSUED_DATE'])\n",
    "all_permits['COMPLETED_DATE'] = pd.to_datetime(all_permits['COMPLETED_DATE'])\n",
    "all_permits[['APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']].dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Date column types are now _datetime64[ns]_, as needed for performing time operations.\n",
    "\n",
    "##### Calculate number of days between revisions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:11<00:00,  3.84s/it]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>REVISION_NUM</th>\n",
       "      <th>REVISIONS_COUNT</th>\n",
       "      <th>APPLICATION_DATE</th>\n",
       "      <th>APPLICATION_DATE_REVISION_DAYS</th>\n",
       "      <th>COMPLETED_DATE</th>\n",
       "      <th>COMPLETED_DATE_REVISION_DAYS</th>\n",
       "      <th>ISSUED_DATE</th>\n",
       "      <th>ISSUED_DATE_REVISION_DAYS</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PERMIT_NUM</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>00 123626 CMB</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2000-01-25</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2001-04-03</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2000-03-16</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 123626 CMB</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2000-06-02</td>\n",
       "      <td>129.0</td>\n",
       "      <td>2005-03-09</td>\n",
       "      <td>1436.0</td>\n",
       "      <td>2000-06-08</td>\n",
       "      <td>84.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 123626 CMB</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>2000-07-18</td>\n",
       "      <td>46.0</td>\n",
       "      <td>2001-01-20</td>\n",
       "      <td>-1509.0</td>\n",
       "      <td>2000-07-21</td>\n",
       "      <td>43.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 129159 CMB</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2000-02-22</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2006-06-12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2000-05-04</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 129159 CMB</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2001-02-19</td>\n",
       "      <td>363.0</td>\n",
       "      <td>2006-06-12</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2001-04-05</td>\n",
       "      <td>336.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               REVISION_NUM  REVISIONS_COUNT APPLICATION_DATE  \\\n",
       "PERMIT_NUM                                                      \n",
       "00 123626 CMB             0                3       2000-01-25   \n",
       "00 123626 CMB             1                3       2000-06-02   \n",
       "00 123626 CMB             2                3       2000-07-18   \n",
       "00 129159 CMB             0                3       2000-02-22   \n",
       "00 129159 CMB             1                3       2001-02-19   \n",
       "\n",
       "               APPLICATION_DATE_REVISION_DAYS COMPLETED_DATE  \\\n",
       "PERMIT_NUM                                                     \n",
       "00 123626 CMB                             NaN     2001-04-03   \n",
       "00 123626 CMB                           129.0     2005-03-09   \n",
       "00 123626 CMB                            46.0     2001-01-20   \n",
       "00 129159 CMB                             NaN     2006-06-12   \n",
       "00 129159 CMB                           363.0     2006-06-12   \n",
       "\n",
       "               COMPLETED_DATE_REVISION_DAYS ISSUED_DATE  \\\n",
       "PERMIT_NUM                                                \n",
       "00 123626 CMB                           NaN  2000-03-16   \n",
       "00 123626 CMB                        1436.0  2000-06-08   \n",
       "00 123626 CMB                       -1509.0  2000-07-21   \n",
       "00 129159 CMB                           NaN  2000-05-04   \n",
       "00 129159 CMB                           0.0  2001-04-05   \n",
       "\n",
       "               ISSUED_DATE_REVISION_DAYS  \n",
       "PERMIT_NUM                                \n",
       "00 123626 CMB                        NaN  \n",
       "00 123626 CMB                       84.0  \n",
       "00 123626 CMB                       43.0  \n",
       "00 129159 CMB                        NaN  \n",
       "00 129159 CMB                      336.0  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "single_rev_permits_temp = pd.DataFrame()\n",
    "multi_rev_permits_temp = pd.DataFrame()\n",
    "\n",
    "# since time is only needed for multi-revision permits, splitting permits so calculations are not performed on the entire table\n",
    "single_rev_permits_temp = all_permits[all_permits['REVISIONS_COUNT'] == 1].copy()\n",
    "multi_rev_permits_temp = all_permits[all_permits['REVISIONS_COUNT'] > 1].copy()\n",
    "\n",
    "for date_field in tqdm(['APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE']):\n",
    "    date_diff_field = \"{0}_REVISION_DAYS\".format(date_field)\n",
    "    multi_rev_permits_temp[date_diff_field] = multi_rev_permits_temp.groupby('PERMIT_NUM', sort=False)[date_field].diff().dt.days\n",
    "\n",
    "all_permits = multi_rev_permits_temp.append(single_rev_permits_temp).copy()\n",
    "date_day_fields = [ date_field for date_field in all_permits.columns.values if (\"DATE\" in date_field) or (\"DAY\" in date_field) ]\n",
    "all_permits[ ['REVISION_NUM', 'REVISIONS_COUNT'] + date_day_fields][all_permits['REVISIONS_COUNT'] > 2].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2. Days between permit dates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>APPLICATION_DATE</th>\n",
       "      <th>REVIEW_PERIOD_DAYS</th>\n",
       "      <th>ISSUED_DATE</th>\n",
       "      <th>INSPECTION_PERIOD_DAYS</th>\n",
       "      <th>COMPLETED_DATE</th>\n",
       "      <th>PERMIT_ACTIVE_DAYS</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PERMIT_NUM</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>00 119162 BLD</th>\n",
       "      <td>2002-08-21</td>\n",
       "      <td>34.0</td>\n",
       "      <td>2002-09-24</td>\n",
       "      <td>295.0</td>\n",
       "      <td>2003-07-16</td>\n",
       "      <td>329</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119162 BLD</th>\n",
       "      <td>2003-01-14</td>\n",
       "      <td>24.0</td>\n",
       "      <td>2003-02-07</td>\n",
       "      <td>214.0</td>\n",
       "      <td>2003-09-09</td>\n",
       "      <td>238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119926 CMB</th>\n",
       "      <td>2000-01-10</td>\n",
       "      <td>178.0</td>\n",
       "      <td>2000-07-06</td>\n",
       "      <td>273.0</td>\n",
       "      <td>2001-04-05</td>\n",
       "      <td>451</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119926 CMB</th>\n",
       "      <td>2000-10-17</td>\n",
       "      <td>29.0</td>\n",
       "      <td>2000-11-15</td>\n",
       "      <td>3002.0</td>\n",
       "      <td>2009-02-03</td>\n",
       "      <td>3031</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 120273 CMB</th>\n",
       "      <td>2000-01-11</td>\n",
       "      <td>93.0</td>\n",
       "      <td>2000-04-13</td>\n",
       "      <td>1342.0</td>\n",
       "      <td>2003-12-16</td>\n",
       "      <td>1435</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              APPLICATION_DATE  REVIEW_PERIOD_DAYS ISSUED_DATE  \\\n",
       "PERMIT_NUM                                                       \n",
       "00 119162 BLD       2002-08-21                34.0  2002-09-24   \n",
       "00 119162 BLD       2003-01-14                24.0  2003-02-07   \n",
       "00 119926 CMB       2000-01-10               178.0  2000-07-06   \n",
       "00 119926 CMB       2000-10-17                29.0  2000-11-15   \n",
       "00 120273 CMB       2000-01-11                93.0  2000-04-13   \n",
       "\n",
       "               INSPECTION_PERIOD_DAYS COMPLETED_DATE  PERMIT_ACTIVE_DAYS  \n",
       "PERMIT_NUM                                                                \n",
       "00 119162 BLD                   295.0     2003-07-16                 329  \n",
       "00 119162 BLD                   214.0     2003-09-09                 238  \n",
       "00 119926 CMB                   273.0     2001-04-05                 451  \n",
       "00 119926 CMB                  3002.0     2009-02-03                3031  \n",
       "00 120273 CMB                  1342.0     2003-12-16                1435  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits['REVIEW_PERIOD_DAYS'] = (all_permits['ISSUED_DATE'] - all_permits['APPLICATION_DATE']).dt.days\n",
    "all_permits['INSPECTION_PERIOD_DAYS'] = (all_permits['COMPLETED_DATE']- all_permits['ISSUED_DATE']).dt.days\n",
    "all_permits['PERMIT_ACTIVE_DAYS'] = (all_permits['COMPLETED_DATE'] - all_permits['APPLICATION_DATE']).dt.days\n",
    "all_permits[['APPLICATION_DATE', 'REVIEW_PERIOD_DAYS', 'ISSUED_DATE', 'INSPECTION_PERIOD_DAYS', 'COMPLETED_DATE', 'PERMIT_ACTIVE_DAYS']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Include metadata fields for visualization\n",
    "Additional fields can be added to make visualization easier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>REVISION_NUM</th>\n",
       "      <th>FINAL_REVISION_NUM</th>\n",
       "      <th>FINAL_REVISION_FLAG</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PERMIT_NUM</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>00 123626 CMB</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 123626 CMB</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 123626 CMB</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 129159 CMB</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 129159 CMB</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               REVISION_NUM  FINAL_REVISION_NUM  FINAL_REVISION_FLAG\n",
       "PERMIT_NUM                                                          \n",
       "00 123626 CMB             0                   2                  NaN\n",
       "00 123626 CMB             1                   2                  NaN\n",
       "00 123626 CMB             2                   2                  1.0\n",
       "00 129159 CMB             0                   2                  NaN\n",
       "00 129159 CMB             1                   2                  NaN"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits['FINAL_REVISION_NUM'] = all_permits.reset_index().groupby('PERMIT_NUM', sort=False)['REVISION_NUM'].max()\n",
    "all_permits[\"FINAL_REVISION_FLAG\"] = (all_permits['FINAL_REVISION_NUM'] == all_permits['REVISION_NUM']).astype(int).replace(0, np.nan)\n",
    "all_permits[['REVISION_NUM', 'FINAL_REVISION_NUM', 'FINAL_REVISION_FLAG']][all_permits['REVISIONS_COUNT'] > 2].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Clean numerical columns and create additional features\n",
    "## 5.1. Identify misclassified numerical columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CURRENT_USE               object\n",
       "DESCRIPTION               object\n",
       "DWELLING_UNITS_CREATED    object\n",
       "EST_CONST_COST            object\n",
       "PERMIT_TYPE               object\n",
       "POSTAL                    object\n",
       "PROPOSED_USE              object\n",
       "STATUS                    object\n",
       "STREET_DIRECTION          object\n",
       "STREET_NAME               object\n",
       "STREET_NUM                object\n",
       "STREET_TYPE               object\n",
       "STRUCTURE_TYPE            object\n",
       "WARD_GRID                 object\n",
       "WORK                      object\n",
       "dtype: object"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits.select_dtypes(include='object').dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the list above, 2 fields that should be numerical: *DWELLING_UNITS_CREATED* and *EST_CONST_COST*\n",
    "\n",
    "## 5.2. Dwelling Units\n",
    "It is important to note the number of dwelling units is captured in two fields: one for number created and one for number lost. This presents the opportunity to derive a new field, calculating the difference between them to get back the total number of dwelling units created/lost.\n",
    "\n",
    "For this, both fields need to be numerical, however, as we saw above, *DWELLING_UNITS_CREATED* is object type. Will convert it first and then create a new field with the difference between them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DWELLING_UNITS_CREATED     object\n",
       "DWELLING_UNITS_LOST       float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits[['DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST']].dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2.1. Make field numerical\n",
    "Looking at the unique values in tihs column, the vast majority of values are already numerical with the notable exception of \"_\\*\\*\\*NOT USED\\*\\*\\*_\". Assumed that would mean no dwelling units were created, hence that one could just be replaced with NaN (i.e. NULL or blank)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([nan, '1.0', '1', '0', '6.0', '0.0', '46.0', '15.0', '193', '9.0',\n",
       "       '3.0', '11.0', '177.0', '7.0', '4.0', '2.0', '10.0', '5.0', '8.0',\n",
       "       '5', '136.0', '55.0', '33', '22.0', '40.0', '313.0', '229.0', '2',\n",
       "       '35', '378', '14.0', '130', '120', '4', '239.0', '423.0', '103.0',\n",
       "       '12.0', '38.0', '6', '97.0', '16.0', '22', '96.0', '20.0', '103',\n",
       "       '10', '85.0', '8', '399.0', '3', '24.0', '227.0', '7', '9', '82',\n",
       "       '357.0', '108', '17.0', '35.0', '175.0', '232.0', '60.0', '91.0',\n",
       "       '73.0', '247.0', '349.0', '408.0', '13.0', '268.0', '256.0',\n",
       "       '102.0', '25.0', '30.0', '445.0', '497.0', '68.0', '43.0', '152.0',\n",
       "       '18.0', '261.0', '402.0', '26.0', '163.0', '36.0', '180.0',\n",
       "       '205.0', '411.0', '144.0', '37.0', '19.0', '368.0', '118.0',\n",
       "       '217.0', '293.0', '31.0', '271.0', '464.0', '84.0', '868.0',\n",
       "       '80.0', '476.0', '334.0', '44.0', '82.0', '291.0', '34.0', '683.0',\n",
       "       '62.0', '222.0', '134.0', '237.0', '64.0', '75.0', '560.0',\n",
       "       '506.0', '405.0', '443.0', '101.0', '188.0', '665.0', '780.0',\n",
       "       '50.0', '157.0', '143.0', '52.0', '155.0', '27.0', '364.0',\n",
       "       '410.0', '436.0', '95.0', '429.0', '108.0', '145.0', '72.0',\n",
       "       '384.0', '369.0', '495.0', '439.0', '39.0', '798.0', '32.0',\n",
       "       '81.0', '28.0', '-1.0', '255.0', '225.0', '241.0', '33.0', '23.0',\n",
       "       '78.0', '154.0', '121.0', '21.0', '65.0', '107.0', '42.0', '112.0',\n",
       "       '182.0', '191.0', '126.0', '130.0', '178.0', '120.0', '71.0',\n",
       "       '29.0', '221.0', '164.0', '545.0', '61.0', '174.0', '48.0',\n",
       "       '119.0', '57.0', '210.0', '86.0', '270.0', '183.0', '49.0',\n",
       "       '277.0', '79.0', '213.0', '135.0', '76.0', '83.0', '131.0',\n",
       "       '132.0', '114.0', '100.0', '54.0', '300.0', '200.0', '116.0',\n",
       "       '165.0', '142.0', '90.0', '70.0', '109.0', '69.0', '124', '80',\n",
       "       '00', '226', '54', '429', '44', '303', '306', '0000', '24', '15.7',\n",
       "       '227', '20', '13', '47', '11', '132', '16', '212', '0001', '115',\n",
       "       '162', '0003', '196', '25', '0002', '2500', '0005', '0008', '0004',\n",
       "       '84', '153', '379', '12', '171', '190', '51', '332', '93', '46',\n",
       "       '129', '18', '158', '117', '32', '92', '110', '*** NOT USED ***',\n",
       "       '01', '502', '135', '48', '42', '0285', '0224', '0131', '62',\n",
       "       '0010', '0011', '0337', '274.0', '206.0', '172.0', '266.0',\n",
       "       '167.0', '194.0', '388.0', '245.0', '254.0', '301.0', '140.0',\n",
       "       '158.0', '387.0', '383.0', '58.0', '110.0', '92.0', '314.0',\n",
       "       '450.0', '128.0', '197.0', '138.0', '74.0', '45.0', '425.0',\n",
       "       '47.0', '308.0', '661.0', '1180.0', '156.0', '201.0', '150.0',\n",
       "       '168.0', '88.0', '642.0', '323.0', '190.0', '137.0', '57.97',\n",
       "       '169.0', '224.0', '360.0', '139.0', '490.0', '330.0', '382.0',\n",
       "       '338.0', '216.0', '328.0', '127.0', '220.0', '339.0', '297.0',\n",
       "       '354.0', '303.0', '99.0', '538.0', '317.0', '203.0', '310.0',\n",
       "       '501.0', '111.0', '106.0', '214.0', '204.0', '228.0', '276.0',\n",
       "       '257.0', '302.0', '198.0', '151.0', '149.0', '231.0', '153.0',\n",
       "       '146.0', '288.0', '315.0', '179.0', '463.0', '77.0', '59.0',\n",
       "       '340.0', '343.0', '486.0', '248.0', '187.0', '280.0', '307.0',\n",
       "       '218.0', '344.0', '637.0', '430.0', '295.0', '234.0', '258.0',\n",
       "       '209.0', '422.0', '359.0', '283.0', '87.0'], dtype=object)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits['DWELLING_UNITS_CREATED'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DWELLING_UNITS_CREATED    float64\n",
       "DWELLING_UNITS_LOST       float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits.loc[:, 'DWELLING_UNITS_CREATED'] = pd.to_numeric(all_permits['DWELLING_UNITS_CREATED'], errors='coerce')\n",
    "all_permits[['DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST']].dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2.2. Calculate difference between dwelling units created/lost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DWELLING_UNITS_CREATED</th>\n",
       "      <th>DWELLING_UNITS_LOST</th>\n",
       "      <th>DWELLING_UNITS_DIFFERENCE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PERMIT_NUM</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>00 119162 BLD</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119162 BLD</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119926 CMB</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 119926 CMB</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>00 120273 CMB</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               DWELLING_UNITS_CREATED  DWELLING_UNITS_LOST  \\\n",
       "PERMIT_NUM                                                   \n",
       "00 119162 BLD                     NaN                  NaN   \n",
       "00 119162 BLD                     NaN                  NaN   \n",
       "00 119926 CMB                     1.0                  0.0   \n",
       "00 119926 CMB                     1.0                  0.0   \n",
       "00 120273 CMB                     NaN                  NaN   \n",
       "\n",
       "               DWELLING_UNITS_DIFFERENCE  \n",
       "PERMIT_NUM                                \n",
       "00 119162 BLD                        NaN  \n",
       "00 119162 BLD                        NaN  \n",
       "00 119926 CMB                        1.0  \n",
       "00 119926 CMB                        1.0  \n",
       "00 120273 CMB                        NaN  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_permits['DWELLING_UNITS_DIFFERENCE'] = all_permits['DWELLING_UNITS_CREATED'] - all_permits['DWELLING_UNITS_LOST']\n",
    "all_permits[['DWELLING_UNITS_CREATED', 'DWELLING_UNITS_LOST', 'DWELLING_UNITS_DIFFERENCE']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.3. Estimated Construction Cost\n",
    "Similar to dwelling units, will coerce the conversion to numeric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_permits.loc[:, 'EST_CONST_COST'] = pd.to_numeric(all_permits['EST_CONST_COST'], errors='coerce')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. Create CSV files with transformed data\n",
    "From discussions with the experts of the data discovered there are two every distinct time periods for this dataset, before and after 2006, due to changes to how the data was recorded. These changes are significant enough that these two time periods are not quite comparable; Residential Building permit types, for example, were recategorized under New Buildings and New Houses permit types around then. \n",
    "\n",
    "Therefore opted two create two distinct CSV files. As an added bonus, this split reduces the file size and makes visualizations more responsive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "permits_pre_2006 = all_permits[all_permits['ISSUED_DATE'].dt.year <= 2005]\n",
    "permits_post_2006 = all_permits[all_permits['ISSUED_DATE'].dt.year > 2005]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   file created: ./clearedpermits_issued_1973_to_2005.csv\n",
      "   file created: ./clearedpermits_issued_2006_to_2017.csv\n"
     ]
    }
   ],
   "source": [
    "while True:\n",
    "    write_to_csv = input(\"output into .CSV files? (yes / no): \")\n",
    "            \n",
    "    if \"yes\" in write_to_csv:\n",
    "        for table in [permits_pre_2006, permits_post_2006]:\n",
    "            output_file_name = \"clearedpermits_issued_{0}_to_{1}.csv\".format(table['ISSUED_DATE'].min().date().year, table['ISSUED_DATE'].max().date().year)\n",
    "            table.reset_index().to_csv(\"{0}{1}\".format(directory, output_file_name), index=True, index_label=\"Record_ID\")\n",
    "            print(\"   file created: {0}{1}\".format(directory, output_file_name))\n",
    "        break\n",
    "    \n",
    "    if \"no\" in write_to_csv:\n",
    "        print(\"finished (file not created)\".upper())\n",
    "        break\n",
    "        \n",
    "    else: print(\"try again\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}