Last active
March 29, 2018 15:24
-
-
Save careduz/57f04f902c7001880d959a23b341ea40 to your computer and use it in GitHub Desktop.
opendatatoronto
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 0. Context\n", | |
"The purpose of this script is to consolidate the separate cleared permits files into one for visualization and analysis. This requires three main actions:\n", | |
"> 1. Importing the raw .CSV files into Python as tables\n", | |
"> 2. Transforming data tables for visualization purposes\n", | |
"> 3. Consolidating tables into a single one\n", | |
"> 4. Writing the consolidated table into another file to visualize the data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import os\n", | |
"from datetime import datetime\n", | |
"from tabulate import tabulate\n", | |
"import re\n", | |
"\n", | |
"script_directory = \"./\"\n", | |
"input_files_dir = \"input_files/\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 1. Import input files\n", | |
"## 1.1. Get file names from the *input_files* folder\n", | |
"Will retrieve file names that end with _.csv_ and start with _clearedpermits_, which are default when the data is downloaded from the Open Data Portal." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"files = [ filename for filename in os.listdir(script_directory + input_files_dir) \\\n", | |
" if (filename.startswith( \"clearedpermits\" ) and filename.endswith(\".csv\"))]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 1.2. Define function to bring CSV files in-memory\n", | |
"Created a function for loading CSV files, and relevant metadata, rather than repeating the code for each CSV file or looping it. I was interested in capturing the following data for each file:\n", | |
"1. Permits raw data\n", | |
"2. Distinct count of permits\n", | |
"3. Year of permits file\n", | |
"\n", | |
"Item 2 can be calculated from the imported data, and item 3 is in the default file name when downloading from the Open Data Portal." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def printLine(year, rows, columns, permits_count, filename=\"\"):\n", | |
" rows = \"{:,}\".format(rows)\n", | |
" permits_count = \"{:,}\".format(permits_count)\n", | |
" print_dict = {\"Year\": year,\n", | |
" \"File\": filename,\n", | |
" \"Rows\": rows,\n", | |
" \"Columns\": columns,\n", | |
" \"Permits\": permits_count\n", | |
" }\n", | |
" print_list = \" | \".join([ \"{0}: {1}\".format(d, print_dict[d]) for d in [\"Rows\", \"Columns\", \"Permits\"]\\\n", | |
" if len( str(print_dict[d])) > 0 ])\n", | |
" if len(filename) > 0: print( \" {0} | {1} | {2}\".format(year, filename, print_list) )\n", | |
" else: print( \" {0} | {1}\".format(year, print_list) )\n", | |
" \n", | |
" \n", | |
"def loadTable(filename, script_directory=script_directory, input_files_dir=input_files_dir):\n", | |
" df = pd.read_csv('{0}{1}'.format(script_directory + input_files_dir, filename), low_memory=False)\n", | |
" year = re.findall(r'\\d+', f)[0]\n", | |
" permits_count, rows, columns = len(df['PERMIT_NUM'].unique()), df.shape[0], df.shape[1]\n", | |
" printLine(year=year, filename=filename, rows=rows, columns=columns, permits_count=permits_count)\n", | |
" \n", | |
" return df, year, permits_count, rows\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 1.3. Load CSVs into individual dataframes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"LOADING INPUT FILES FROM EACH YEAR\n", | |
" 2001 | clearedpermits2001.csv | Rows: 21,458 | Columns: 30 | Permits: 21,231\n", | |
" 2002 | clearedpermits2002.csv | Rows: 25,060 | Columns: 30 | Permits: 24,528\n", | |
" 2003 | clearedpermits2003.csv | Rows: 25,708 | Columns: 30 | Permits: 25,092\n", | |
" 2004 | clearedpermits2004.csv | Rows: 25,184 | Columns: 30 | Permits: 24,393\n", | |
" 2005 | clearedpermits2005.csv | Rows: 32,809 | Columns: 30 | Permits: 31,660\n", | |
" 2006 | clearedpermits2006.csv | Rows: 28,777 | Columns: 30 | Permits: 27,419\n", | |
" 2007 | clearedpermits2007.csv | Rows: 25,616 | Columns: 30 | Permits: 24,145\n", | |
" 2008 | clearedpermits2008.csv | Rows: 26,073 | Columns: 30 | Permits: 24,608\n", | |
" 2009 | clearedpermits2009.csv | Rows: 26,385 | Columns: 30 | Permits: 24,814\n", | |
" 2010 | clearedpermits2010.csv | Rows: 26,461 | Columns: 30 | Permits: 24,790\n", | |
" 2011 | clearedpermits2011.csv | Rows: 30,541 | Columns: 30 | Permits: 28,719\n", | |
" 2012 | clearedpermits2012.csv | Rows: 31,525 | Columns: 30 | Permits: 29,140\n", | |
" 2013 | clearedpermits2013.csv | Rows: 34,796 | Columns: 30 | Permits: 32,488\n", | |
" 2014 | clearedpermits2014.csv | Rows: 37,096 | Columns: 30 | Permits: 34,732\n", | |
" 2015 | clearedpermits2015.csv | Rows: 38,969 | Columns: 30 | Permits: 36,152\n", | |
" 2016 | clearedpermits2016.csv | Rows: 41,056 | Columns: 30 | Permits: 37,685\n", | |
" 2017 | clearedpermits2017.csv | Rows: 40,047 | Columns: 30 | Permits: 36,627\n", | |
"in the files from 2001 to 2017 there were a total of 517,561 records and 488,223 permits\n" | |
] | |
} | |
], | |
"source": [ | |
"tables, rows, permits = {}, [], []\n", | |
"\n", | |
"print(\"loading input files from each year\".upper())\n", | |
"for f in sorted(files):\n", | |
" table, year, permits_count, rows_count = loadTable(f)\n", | |
" tables[year] = table.reset_index(drop=True)\n", | |
" rows.append(rows_count)\n", | |
" permits.append(permits_count)\n", | |
" \n", | |
"years, rows, permits = sorted(list(tables)), sum(rows), sum(permits)\n", | |
"first_year, last_year = years[0], years[-1]\n", | |
"print(\"in the files from {0} to {1} there were a total of {2} records and {3} permits\".format(first_year,last_year, \"{:,}\".format(rows), \"{:,}\".format(permits)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 2. Consolidate data tables\n", | |
"## 2.1. Append dataframes into one" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"INTEGRATING PERMITS FROM ALL YEARS INTO ONE\n", | |
" 2001 | Rows: 21,458 | Columns: 30 | Permits: 21,231\n", | |
" 2001 | Rows: 46,518 | Columns: 30 | Permits: 45,693\n", | |
" 2001 | Rows: 72,226 | Columns: 30 | Permits: 70,722\n", | |
" 2001 | Rows: 97,410 | Columns: 30 | Permits: 95,020\n", | |
" 2001 | Rows: 130,219 | Columns: 30 | Permits: 126,525\n", | |
" 2001 | Rows: 158,996 | Columns: 30 | Permits: 153,819\n", | |
" 2001 | Rows: 184,612 | Columns: 30 | Permits: 177,868\n", | |
" 2001 | Rows: 210,685 | Columns: 30 | Permits: 202,399\n", | |
" 2001 | Rows: 237,070 | Columns: 30 | Permits: 227,025\n", | |
" 2001 | Rows: 263,531 | Columns: 30 | Permits: 251,759\n", | |
" 2001 | Rows: 294,072 | Columns: 30 | Permits: 280,371\n", | |
" 2001 | Rows: 325,597 | Columns: 30 | Permits: 309,416\n", | |
" 2001 | Rows: 360,393 | Columns: 30 | Permits: 338,292\n", | |
" 2001 | Rows: 397,489 | Columns: 30 | Permits: 372,671\n", | |
" 2001 | Rows: 436,458 | Columns: 30 | Permits: 408,579\n", | |
" 2001 | Rows: 477,514 | Columns: 30 | Permits: 446,002\n", | |
" 2001 | Rows: 517,561 | Columns: 30 | Permits: 482,355\n" | |
] | |
} | |
], | |
"source": [ | |
"all_permits = tables[first_year] # initiate data table to hold all permits, with earliest available year preloaded\n", | |
"new_permits = len(all_permits['PERMIT_NUM'].unique())\n", | |
"print(\"integrating permits from all years into one\".upper())\n", | |
"printLine(year=first_year, rows=all_permits.shape[0], columns =all_permits.shape[1], permits_count=new_permits)\n", | |
"\n", | |
"for year in years[1:]:\n", | |
" all_permits = all_permits.append(tables[year])\n", | |
" new_permits = len(all_permits['PERMIT_NUM'].unique())\n", | |
" printLine(year=first_year, rows=all_permits.shape[0], columns =all_permits.shape[1], permits_count=new_permits)\n", | |
" \n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As expected, the total number of rows matches in the integrated table matches the sum of number of rows across all years; however, note the difference in number of permits calculated before (across all the files)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" # rows in consolidated table # of rows across all tables\n", | |
"------------------------------ -----------------------------\n", | |
" 517561 517561\n", | |
"\n", | |
" # of permits ACROSS files # of permits in INTEGRATED difference\n", | |
"--------------------------- ---------------------------- ------------\n", | |
" 488223 482355 5868\n" | |
] | |
} | |
], | |
"source": [ | |
"print(tabulate([(all_permits.shape[0], rows)], [\"# rows in consolidated table\", \"# of rows across all tables\"]))\n", | |
"print()\n", | |
"print(tabulate([[permits, new_permits, permits-new_permits]], [\"# of permits ACROSS files\", \"# of permits in INTEGRATED\", \"difference\"]))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is likely because permit numbers repeated across years, such as a renewal, so they can be counted more than once when summing the number of permits in each file - the accurate number is the new one\n", | |
"\n", | |
"# 3. Write dataframe to output file\n", | |
"Opted for a .CSV file given it is a fairly standard format for tabular data and I had no issues with reading it in Tableau, the visualization software I used in this analysis. I mention his because it has happened before." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" \n", | |
"CREATING .CSV FOR ALL BUILDING PERMITS...\n", | |
" output file created: ./clearedpermits_2001to2017.csv\n" | |
] | |
} | |
], | |
"source": [ | |
"while True:\n", | |
" write_to_csv = input(\"create a .CSV file with the {0} building permits from {1} to {2}? (yes / no): \".format(\"{:,}\".format(all_permits.shape[0]), first_year, last_year)).lower()\n", | |
" \n", | |
" if write_to_csv == \"yes\":\n", | |
" print(\" \\ncreating .CSV for all building permits...\".upper())\n", | |
" created = datetime.now().strftime(\"%d%b%Y\")\n", | |
" output_file_name = \"clearedpermits_{0}to{1}.csv\".format(first_year, last_year,created) # name of output file\n", | |
" all_permits.to_csv(\"{0}{1}\".format(script_directory, output_file_name), index=False)\n", | |
" print(\" output file created: {0}{1}\".format(script_directory, output_file_name))\n", | |
" break\n", | |
" \n", | |
" elif write_to_csv == \"no\": print(\"finished (file not created)\".upper()); break\n", | |
" \n", | |
" else: print(\"try again\")" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment