Skip to content

Instantly share code, notes, and snippets.

@careduz
Last active March 29, 2018 15:24
Show Gist options
  • Save careduz/57f04f902c7001880d959a23b341ea40 to your computer and use it in GitHub Desktop.
Save careduz/57f04f902c7001880d959a23b341ea40 to your computer and use it in GitHub Desktop.
opendatatoronto
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 0. Context\n",
"The purpose of this script is to consolidate the separate cleared permits files into one for visualization and analysis. This requires three main actions:\n",
"> 1. Importing the raw .CSV files into Python as tables\n",
"> 2. Transforming data tables for visualization purposes\n",
"> 3. Consolidating tables into a single one\n",
"> 4. Writing the consolidated table into another file to visualize the data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os\n",
"from datetime import datetime\n",
"from tabulate import tabulate\n",
"import re\n",
"\n",
"script_directory = \"./\"\n",
"input_files_dir = \"input_files/\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Import input files\n",
"## 1.1. Get file names from the *input_files* folder\n",
"Will retrieve file names that end with _.csv_ and start with _clearedpermits_, which are default when the data is downloaded from the Open Data Portal."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"files = [ filename for filename in os.listdir(script_directory + input_files_dir) \\\n",
" if (filename.startswith( \"clearedpermits\" ) and filename.endswith(\".csv\"))]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2. Define function to bring CSV files in-memory\n",
"Created a function for loading CSV files, and relevant metadata, rather than repeating the code for each CSV file or looping it. I was interested in capturing the following data for each file:\n",
"1. Permits raw data\n",
"2. Distinct count of permits\n",
"3. Year of permits file\n",
"\n",
"Item 2 can be calculated from the imported data, and item 3 is in the default file name when downloading from the Open Data Portal."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def printLine(year, rows, columns, permits_count, filename=\"\"):\n",
" rows = \"{:,}\".format(rows)\n",
" permits_count = \"{:,}\".format(permits_count)\n",
" print_dict = {\"Year\": year,\n",
" \"File\": filename,\n",
" \"Rows\": rows,\n",
" \"Columns\": columns,\n",
" \"Permits\": permits_count\n",
" }\n",
" print_list = \" | \".join([ \"{0}: {1}\".format(d, print_dict[d]) for d in [\"Rows\", \"Columns\", \"Permits\"]\\\n",
" if len( str(print_dict[d])) > 0 ])\n",
" if len(filename) > 0: print( \" {0} | {1} | {2}\".format(year, filename, print_list) )\n",
" else: print( \" {0} | {1}\".format(year, print_list) )\n",
" \n",
" \n",
"def loadTable(filename, script_directory=script_directory, input_files_dir=input_files_dir):\n",
" df = pd.read_csv('{0}{1}'.format(script_directory + input_files_dir, filename), low_memory=False)\n",
" year = re.findall(r'\\d+', f)[0]\n",
" permits_count, rows, columns = len(df['PERMIT_NUM'].unique()), df.shape[0], df.shape[1]\n",
" printLine(year=year, filename=filename, rows=rows, columns=columns, permits_count=permits_count)\n",
" \n",
" return df, year, permits_count, rows\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3. Load CSVs into individual dataframes"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LOADING INPUT FILES FROM EACH YEAR\n",
" 2001 | clearedpermits2001.csv | Rows: 21,458 | Columns: 30 | Permits: 21,231\n",
" 2002 | clearedpermits2002.csv | Rows: 25,060 | Columns: 30 | Permits: 24,528\n",
" 2003 | clearedpermits2003.csv | Rows: 25,708 | Columns: 30 | Permits: 25,092\n",
" 2004 | clearedpermits2004.csv | Rows: 25,184 | Columns: 30 | Permits: 24,393\n",
" 2005 | clearedpermits2005.csv | Rows: 32,809 | Columns: 30 | Permits: 31,660\n",
" 2006 | clearedpermits2006.csv | Rows: 28,777 | Columns: 30 | Permits: 27,419\n",
" 2007 | clearedpermits2007.csv | Rows: 25,616 | Columns: 30 | Permits: 24,145\n",
" 2008 | clearedpermits2008.csv | Rows: 26,073 | Columns: 30 | Permits: 24,608\n",
" 2009 | clearedpermits2009.csv | Rows: 26,385 | Columns: 30 | Permits: 24,814\n",
" 2010 | clearedpermits2010.csv | Rows: 26,461 | Columns: 30 | Permits: 24,790\n",
" 2011 | clearedpermits2011.csv | Rows: 30,541 | Columns: 30 | Permits: 28,719\n",
" 2012 | clearedpermits2012.csv | Rows: 31,525 | Columns: 30 | Permits: 29,140\n",
" 2013 | clearedpermits2013.csv | Rows: 34,796 | Columns: 30 | Permits: 32,488\n",
" 2014 | clearedpermits2014.csv | Rows: 37,096 | Columns: 30 | Permits: 34,732\n",
" 2015 | clearedpermits2015.csv | Rows: 38,969 | Columns: 30 | Permits: 36,152\n",
" 2016 | clearedpermits2016.csv | Rows: 41,056 | Columns: 30 | Permits: 37,685\n",
" 2017 | clearedpermits2017.csv | Rows: 40,047 | Columns: 30 | Permits: 36,627\n",
"in the files from 2001 to 2017 there were a total of 517,561 records and 488,223 permits\n"
]
}
],
"source": [
"tables, rows, permits = {}, [], []\n",
"\n",
"print(\"loading input files from each year\".upper())\n",
"for f in sorted(files):\n",
" table, year, permits_count, rows_count = loadTable(f)\n",
" tables[year] = table.reset_index(drop=True)\n",
" rows.append(rows_count)\n",
" permits.append(permits_count)\n",
" \n",
"years, rows, permits = sorted(list(tables)), sum(rows), sum(permits)\n",
"first_year, last_year = years[0], years[-1]\n",
"print(\"in the files from {0} to {1} there were a total of {2} records and {3} permits\".format(first_year,last_year, \"{:,}\".format(rows), \"{:,}\".format(permits)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Consolidate data tables\n",
"## 2.1. Append dataframes into one"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INTEGRATING PERMITS FROM ALL YEARS INTO ONE\n",
" 2001 | Rows: 21,458 | Columns: 30 | Permits: 21,231\n",
" 2001 | Rows: 46,518 | Columns: 30 | Permits: 45,693\n",
" 2001 | Rows: 72,226 | Columns: 30 | Permits: 70,722\n",
" 2001 | Rows: 97,410 | Columns: 30 | Permits: 95,020\n",
" 2001 | Rows: 130,219 | Columns: 30 | Permits: 126,525\n",
" 2001 | Rows: 158,996 | Columns: 30 | Permits: 153,819\n",
" 2001 | Rows: 184,612 | Columns: 30 | Permits: 177,868\n",
" 2001 | Rows: 210,685 | Columns: 30 | Permits: 202,399\n",
" 2001 | Rows: 237,070 | Columns: 30 | Permits: 227,025\n",
" 2001 | Rows: 263,531 | Columns: 30 | Permits: 251,759\n",
" 2001 | Rows: 294,072 | Columns: 30 | Permits: 280,371\n",
" 2001 | Rows: 325,597 | Columns: 30 | Permits: 309,416\n",
" 2001 | Rows: 360,393 | Columns: 30 | Permits: 338,292\n",
" 2001 | Rows: 397,489 | Columns: 30 | Permits: 372,671\n",
" 2001 | Rows: 436,458 | Columns: 30 | Permits: 408,579\n",
" 2001 | Rows: 477,514 | Columns: 30 | Permits: 446,002\n",
" 2001 | Rows: 517,561 | Columns: 30 | Permits: 482,355\n"
]
}
],
"source": [
"all_permits = tables[first_year] # initiate data table to hold all permits, with earliest available year preloaded\n",
"new_permits = len(all_permits['PERMIT_NUM'].unique())\n",
"print(\"integrating permits from all years into one\".upper())\n",
"printLine(year=first_year, rows=all_permits.shape[0], columns =all_permits.shape[1], permits_count=new_permits)\n",
"\n",
"for year in years[1:]:\n",
" all_permits = all_permits.append(tables[year])\n",
" new_permits = len(all_permits['PERMIT_NUM'].unique())\n",
" printLine(year=first_year, rows=all_permits.shape[0], columns =all_permits.shape[1], permits_count=new_permits)\n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the total number of rows matches in the integrated table matches the sum of number of rows across all years; however, note the difference in number of permits calculated before (across all the files)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" # rows in consolidated table # of rows across all tables\n",
"------------------------------ -----------------------------\n",
" 517561 517561\n",
"\n",
" # of permits ACROSS files # of permits in INTEGRATED difference\n",
"--------------------------- ---------------------------- ------------\n",
" 488223 482355 5868\n"
]
}
],
"source": [
"print(tabulate([(all_permits.shape[0], rows)], [\"# rows in consolidated table\", \"# of rows across all tables\"]))\n",
"print()\n",
"print(tabulate([[permits, new_permits, permits-new_permits]], [\"# of permits ACROSS files\", \"# of permits in INTEGRATED\", \"difference\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is likely because permit numbers repeated across years, such as a renewal, so they can be counted more than once when summing the number of permits in each file - the accurate number is the new one\n",
"\n",
"# 3. Write dataframe to output file\n",
"Opted for a .CSV file given it is a fairly standard format for tabular data and I had no issues with reading it in Tableau, the visualization software I used in this analysis. I mention his because it has happened before."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" \n",
"CREATING .CSV FOR ALL BUILDING PERMITS...\n",
" output file created: ./clearedpermits_2001to2017.csv\n"
]
}
],
"source": [
"while True:\n",
" write_to_csv = input(\"create a .CSV file with the {0} building permits from {1} to {2}? (yes / no): \".format(\"{:,}\".format(all_permits.shape[0]), first_year, last_year)).lower()\n",
" \n",
" if write_to_csv == \"yes\":\n",
" print(\" \\ncreating .CSV for all building permits...\".upper())\n",
" created = datetime.now().strftime(\"%d%b%Y\")\n",
" output_file_name = \"clearedpermits_{0}to{1}.csv\".format(first_year, last_year,created) # name of output file\n",
" all_permits.to_csv(\"{0}{1}\".format(script_directory, output_file_name), index=False)\n",
" print(\" output file created: {0}{1}\".format(script_directory, output_file_name))\n",
" break\n",
" \n",
" elif write_to_csv == \"no\": print(\"finished (file not created)\".upper()); break\n",
" \n",
" else: print(\"try again\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment