albertovilla/OpenStreetMap Leganes.ipynb

## OpenStreetMap Leganes.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "# OpenStreetMap Data Case Study: Leganes"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## Overview of the data"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Map Area\nThe map area analyzed covers the city where I live. Leganés is a city in central Spain, part of the greater Madrid, a satellite-city with a population of 186,066 (1 January 2009) located about 11 km southwest of Madrid city centre:\n- [OpenStreetMap Leganés](https://www.openstreetmap.org/export#map=14/40.3375/-3.7587)\n- [Wikipedia Leganés](https://en.wikipedia.org/wiki/Legan%C3%A9s)\n"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Overview of the data\n```\n- The OSM XML file size is 58,4 MB\n- Number of nodes: 248715\n- Number of ways:  46024\n```"
    },
    {
      "metadata": {
        "collapsed": true
      },
      "cell_type": "markdown",
      "source": "## Processing the dataset"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Auditing the dataset\n\nThe file `audit_map.py` calls functions in three different modules to perform some basic audit of the map area.\n\n#### 1. Process basic statistics about the map area\nThe `process_basic_statistics` function in module `audit_basic_stats` goes through the XML structure and reports information about the number of nodes and ways, the number of different contributors and the top 5 contributors. I will be able to compare this information with the SQL queries I will perform later.\n\n```\nPROCESS BASIC STATISTICS: ---------------------------------------\nNumber of nodes: 248715\nNumber of ways:  46024\nNumber of different users 375\nTop 5 contributors:\n[('rafaerti', 100592),\n ('Cuenqui', 51684),\n ('cirdancarpintero', 25615),\n ('Sercontr', 21678),\n ('polkillas', 11572)]\n```\n\n#### 2. Street types correctness\nStarting from a list of common Spanish street types I will check whether the `addr:street` in both nodes and ways conforms with the expected types to identify possible errors in the dataset. The function `audit_street_types` inside the module `audit_street_types` goes through the XML file extracting the type of street and comparing with the ones defined in `COMMON_STREET_TYPES` shown below:\n\n```\nCOMMON_STREET_TYPES = {\n    u'Autopista',\n    u'Vía',\n    u'Carretera',\n    u'Calle',\n    u'Calleja',\n    u'Callejón',\n    u'Avenida',\n    u'Plaza',\n    u'Sendero',\n    u'Paseo',\n    u'Travesía',\n    u'Camino',\n    u'Acceso' }\n```"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "By running this function we get that out 5.052 streets there are 52 _suspicious_ streets and 22 different _suspicious_ street types:\n\n```\nSuspicious street types:\nset(['A-5',\n     'AVDA.',\n     'Alcocer,',\n     'Av.',\n     u'Avda.',\n     'C/',\n     'CALLE',\n     'CL',\n     'CR',\n     'CTRA.',\n     'Call',\n     'Charco',\n     'Ctra.',\n     'FIXME',\n     'Fundidores',\n     u'Gran',\n     'Hermandad',\n     u'Padr\\xf3n',\n     'Palmera',\n     'Paloma',\n     'Pedro',\n     'Pizarro',\n     'Pz',\n     'Rafaela',\n     u'Reina',\n     u'R\\xedo',\n     u'SAN',\n     'Telecomunicaciones',\n     'Travesia'])\n```\n\nThe function also returns the streets themselves to better analyze if they look right or not: \n\n```\nSuspicious streets:\n['Av. de los Arces',\n 'CTRA. MADRID TOLEDO km 10',\n 'CALLE FELIPE CALLEJA, S/N',\n 'CALLE EBANISTAS,',\n u'SAN BERNARDO, S/N - B\\xba DE LA FORTUNA',\n 'CR M-406 , P.K. 7,3',\n 'CL AVDA. S. MARTIN DE VALDEIGLE, S. (M-501 KM 0.3)',\n 'CTRA. N-V KM 12.5',\n 'Alcocer, 23',\n 'CR M-406 KM 10',\n 'CL MADRID',\n 'Hermandad Donantes Sangre',\n 'CTRA. MADRID TOLEDO km 10',\n 'Hermandad Donantes Sangre',\n 'Call de la Hermandad de Donantes Sangre',\n 'Hermandad de Donantes de Sangre',\n 'C/',\n u'Reina Sof\\xeda',\n u'Av. de los \\xc9banos',\n 'Pedro de Valdivia',\n 'Paloma',\n 'Av. Juan XXIII',\n 'Travesia Calle Ancha',\n 'Travesia Calle Ancha',\n 'A-5',\n 'FIXME Avenida La Verbena De La Paloma',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n 'Ctra. de Extremadura',\n u'Avda. Gran Breta\\xf1a',\n 'C/ Parque Grande',\n 'A-5',\n 'Fundidores',\n 'Palmera',\n u'R\\xedo Tormes',\n 'Charco',\n 'Pizarro',\n u'Gran Breta\\xf1a',\n 'AVDA. DE LA UNIVERSIDAD',\n 'CR N-4 P.K. 6,70 I',\n 'CR N-4 P.K. 6,70 D',\n 'Rafaela Ybarra',\n 'Pz San Antonio de la Florida',\n 'Telecomunicaciones',\n 'Telecomunicaciones',\n 'Telecomunicaciones',\n 'Telecomunicaciones',\n u'Padr\\xf3n']\n```"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Looking in detail to the 52 suspicious street types we see the following inconsistencies and error types:\n- Abbreviations of correct street types e.g. **Av.** instead of **Avenida** or **CR** instead of **Carretera**\n- Capitalized street types e.g. **CALLE** instead of **Calle**\n- Incorrectly written street types e.g. **Travesia** instead of **Travesía**\n- Missing street type in front of the street name e.g. **'Reina Sofía'** should be **'Avenida de la Reina Sofía'**"
    },
    {
      "metadata": {
        "collapsed": true
      },
      "cell_type": "markdown",
      "source": "#### 3. Postal codes correction\nThe function `process_street_postalcodes` in `audit_street_postalcodes` check the correctness of tags `addr:postcode`. In order for a postal code to be correct it must start with **28** which is the postal code in Madrid and must be a 5 digits code.\n\nRunning this function we get that are no errors in the postal codes so no correction is required. Note that most of the postal codes start with **289** which correspond to Leganés area. There are though some **280** as the area selected is not exactly just Leganés but covers also some surrounding parts of Madrid close to Leganés and their post code starts with that number.\n\n```\nNumber of postal codes: 2115\nNumber of different postal codes: 26\nset(['28919', '28918', '28915', '28914', '28917', '28916', '28911', '28913', '28912', '28054', '28924', '28925', '28921', '28922', '28923', '28906', '28907', '28904', '28902', '28903', '28901', '28021', '28025', '28024', '28041', '28044'])\nAll post codes are correct!\n``` "
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Cleaning the dataset\n\nThe file `process_map.py` performs the cleaning of the incorrect street types which can be corrected programmatically and proceeds to generate the **csv** files to be imported in the database. The file `schema.py` defines the schema against which the elements can be checked.\n\nI have run the function both with the `validate=True` and `False` to verify everything is working properly.\n\nRegarding the cleaning two activities are performed:\n\n1) I can automatically correct the abbreviations, capitalization and incorrectly written street types using a dictionary. I will convert to capitals the original street type in order to make the conversion so I can catch 'Avda.' and 'AVDA.' with the same key:\n\n```python\nSTREET_TYPES_CORRECTIONS = {\n     u'AVDA.': u'Avenida',\n     u'AV': u'Avenida',\n     u'AV.': u'Avenida',\n     u'C/': u'Calle',\n     u'CALLE': u'Calle',\n     u'CALL': u'Calle',\n     u'CL': u'Calle',\n     u'CR': u'Carretera',\n     u'CTRA.': u'Carretera',\n     u'Call': u'Calle',\n     u'PZ': u'Plaza',\n     u'TRAVESIA': u'Travesía'\n}\n```\n\nThis dictionary allows to remove all the incorrect street types which can be corrected programatically. There would be still some corrections to be done manually for those streets which do not specify their type. The number of such cases is very low and could be done manually, however, I will not do it as I don't think is the purpose of this exercise.\n\n2) On top of correcting the street type names it is possible to remove some characters which can be deleted from the addresses such as \"`.`\", \"`,`\" and \"`:`\""
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Running the `process_map.py` program I can generate 5 csv files: nodes.csv, nodes_tags.csv, ways.csv, ways_nodes.csv and ways_tags.csv with the extracted data to be imported in the database."
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## Creating the database"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### 1. Create database structure \n\nCreate the database file:\n\n```\nsqlite3 leganes.db\n```\n\nInitialize database structure using the `database_schema.sql` SQL file\n\n```\n.read database_schema.sql\n```\n\nCheck that the tables have been created correctly:\n```\nsqlite> .tables\nnodes       nodes_tags  ways        ways_nodes  ways_tags\n```\n\nCheck one of the tables to verify its schema:\n```\nsqlite> .schema nodes\nCREATE TABLE nodes (\n  id INTEGER PRIMARY KEY NOT NULL,\n  lat REAL,\n  lon REAL,\n  user TEXT,\n  uid INTEGER,\n  version INTEGER,\n  changeset INTEGER,\n  timestamp TEXT\n);\n```"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### 2. Import data\n\nBefore importing the csv files into the database we need to remove the csv header. In order to do so in Windows we can use the command `more`. Here below an example for the `nodes.csv` file:\n\n```\nmore /E +1 nodes.csv > nodes_no_header.csv\n```\n\nAnother option is to modify the original `process_map.py` file and omit the calls to `writeheader`function. The command `more` is quite slow so this option is recommended.\n\nOnce we have removed all the CSV headers (or generated the files without the headers) we can proceed to import them into the database as follows:\n\n```\nsqlite> .mode csv\nsqlite> .import nodes.csv nodes\nsqlite> .import nodes_tags.csv nodes_tags\nsqlite> .import ways_nodes.csv ways_nodes\nsqlite> .import ways.csv ways\nsqlite> .import ways_tags.csv ways_tags\n```\n"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "#### Quick test to check if data has been properly imported\n\nNow that the data has been imported we can perform queries to compare with the basic statistics calculated using the module `audit_basic_stats`:\n\n**Number of nodes**\n\nQuery:\n\n```SQL\nSELECT COUNT(*) FROM NODES;\n```\n> \n```\nsqlite> SELECT COUNT(*) FROM NODES;\n248715\n```\n\n**Number of ways**\n\nQuery: \n```SQL \nSELECT COUNT(*) FROM WAYS;\n```\n> \n```\nsqlite> SELECT COUNT(*) FROM WAYS;\n46024\n```\n\n**Number of different users contributing**\n\nQuery: \n```SQL \nSELECT COUNT(DISTINCT(user)) \nFROM (SELECT user FROM nodes UNION ALL SELECT user FROM WAYS);\n```\n\n> \n```\nsqlite> SELECT COUNT(DISTINCT(user)) FROM (SELECT user FROM nodes UNION ALL SELECT user FROM WAYS);\n375\n```\n\n**Top 5 contributors**\n\nQuery: \n```SQL \nSELECT user, COUNT(*) AS total \nFROM (SELECT user FROM NODES UNION ALL SELECT user FROM ways) \nGROUP BY user \nORDER BY total DESC \nLIMIT 5;\n```\n\n> \n```\nsqlite> SELECT user, COUNT(*) AS total FROM (SELECT user FROM NODES UNION ALL SELECT user FROM ways) GROUP BY user ORDER BY total DESC LIMIT 5;\nrafaerti|100592\nCuenqui|51684\ncirdancarpintero|25615\nSercontr|21678\npolkillas|11572\n```\n\n"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### 3. Additional queries\n\n#### 3.1 Top 20 types of amenities\nLet's check what types of amenities are available to make define other specific queries.\n\n```SQL\nSELECT DISTINCT(value) \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags) WHERE key='amenity';\n```\n\n> Result\n```\nfuel\nschool\ncollege\nkindergarten\ntraining\nbank\nrestaurant\nparking_entrance\nparking\npost_office\npolice\nclinic\nfountain\nplace_of_worship\npost_box\ntelephone\ntaxi\npharmacy\ncafe\npub\nrecycling\nhospital\ncar_wash\ndrinking_water\nbar\nmarketplace\nbicycle_rental\nbicycle_parking\natm\ncinema\ntoilets\nveterinary\ndriving_school\nretirement_home\nshelter\nfast_food\ndentist\nshop\ncommunity_centre\nwaste_disposal\nbench\npublic_building\nsocial_facility\nfood_court\nwaste_basket\ndoctors\nlibrary\nchildcare\ntheatre\nnightclub\ncasino\nlanguage_school\nphoto_booth\nuniversity\nbus_station\nanimal_shelter\nparking_space\ncourthouse\ntownhall\nvehicle_inspection\nswimming_pool\nfire_station\nnursing_home\ncourt_yard\ncar_rental\nhealth_post\n```\n\n#### 3.2 Top 20 types of amenities\nLet's have a look at the top 20 amenities\n\n```SQL\nSELECT value, COUNT(*) as num \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags) \nWHERE key='amenity' \nGROUP BY value \nORDER BY num DESC \nLIMIT 20;\n```\n\n> Result\n```\nparking|1428\nbench|433\nschool|284\ndrinking_water|151\nparking_entrance|150\nrestaurant|104\nbar|95\nrecycling|92\npharmacy|89\nbank|88\nfountain|72\ncafe|67\nbicycle_parking|54\nfuel|47\nkindergarten|46\nplace_of_worship|45\nwaste_disposal|37\nfast_food|28\npub|23\nclinic|22\n```\n\n\n#### 3.3 Number of entertaiment places \nLet's check the number of places for entertainment in Leganés\n\n```SQL\nSELECT value, COUNT(*) as NUM \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags)\nWHERE key='amenity' AND value IN ('cafe', 'restaurant', 'pub', 'bar', 'cinema', 'theatre', 'nightclub', 'casino')\nGROUP BY value\nORDER BY num DESC;\n```\n\n> Result\n```\nrestaurant|104\nbar|95\ncafe|67\npub|23\ntheatre|8\ncinema|2\ncasino|1\nnightclub|1\n```\n\nThere are in total 301 places, which compared with Leganés population gives a ratio of 1 place for every 618 persons. It could be interesting to compare this ratio with other places. \n\n#### 3.4 Healthcare services\nJust a matter of comparison, let's check how many healthcar services do we have.\n\n```SQL\nSELECT value, COUNT(*) as NUM \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags)\nWHERE key='amenity' AND value IN ('pharmacy', 'hospital', 'dentist', 'doctors', 'health_post', 'nursing_home', 'clinic')\nGROUP BY value\nORDER BY num DESC;\n```\n\n> Result\n```\npharmacy|89\nclinic|22\ndentist|22\nhospital|10\ndoctors|1\nhealth_post|1\nnursing_home|1\n```\n\nThere is a total of 146 health care services amenities in Leganés. Not suprisingly the number of `doctors` is very low as in Spain is common practice that they provide their services directly in hospitals or clinics, it's quite strange doctors providing services outside these facilities"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Areas for improvement\nThe initial goal when selected the area for analysis was picking just Leganés however filtering just by city is quite complicated in OpenStreetMap resulting in an area which covers the city of interest and part of other close by cities / towns. In order to restrict the analysis to the city of interest one option could be filtering the nodes and ways addresses by comparing the streets with another \"source of truth\". In order to do so, we could proceed as follows:\n\n1. Extract / scrape all the information of Leganés streets from https://www.codigo-postal.info/madrid/leganes; we could do that using BeautifulSoup.\n2. With that source of truth regarding street names and postal codes we could filter out the XML to just contain nodes and ways that are within the streets databases.\n3. An additional advantage of extracting the information from the above-referred web would be that we could also filter by postal code, removing all nodes and elements which do not belong to Leganés city\n"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## Conclusions\nThe selected area for the analysis was in general quite clean regarding the analysis of street types and postal codes. There was a great deal of consistency regarding the names with just some minor exceptions which could not be corrected programmatically. As stated in the areas of improvement there are some limitations in the analysis performed as I'm quite sure it does contain parts of Madrid which do not belong to Leganés. The visual matching of the area I think was quite good but it was impossible to leave outside other areas of Madrid as cities in general are not \"square\" and in the case of Leganés is close by to other towns and Madrid city."
    }
  ],
  "metadata": {
    "anaconda-cloud": {},
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.6.7",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "gist": {
      "id": "",
      "data": {
        "description": "OpenStreetMap Leganes.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
	{
	"cells": [
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "# OpenStreetMap Data Case Study: Leganes"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## Overview of the data"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Map Area\nThe map area analyzed covers the city where I live. Leganés is a city in central Spain, part of the greater Madrid, a satellite-city with a population of 186,066 (1 January 2009) located about 11 km southwest of Madrid city centre:\n- [OpenStreetMap Leganés](https://www.openstreetmap.org/export#map=14/40.3375/-3.7587)\n- [Wikipedia Leganés](https://en.wikipedia.org/wiki/Legan%C3%A9s)\n"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Overview of the data\n```\n- The OSM XML file size is 58,4 MB\n- Number of nodes: 248715\n- Number of ways: 46024\n```"
	},
	{
	"metadata": {
	"collapsed": true
	},
	"cell_type": "markdown",
	"source": "## Processing the dataset"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Auditing the dataset\n\nThe file `audit_map.py` calls functions in three different modules to perform some basic audit of the map area.\n\n#### 1. Process basic statistics about the map area\nThe `process_basic_statistics` function in module `audit_basic_stats` goes through the XML structure and reports information about the number of nodes and ways, the number of different contributors and the top 5 contributors. I will be able to compare this information with the SQL queries I will perform later.\n\n```\nPROCESS BASIC STATISTICS: ---------------------------------------\nNumber of nodes: 248715\nNumber of ways: 46024\nNumber of different users 375\nTop 5 contributors:\n[('rafaerti', 100592),\n ('Cuenqui', 51684),\n ('cirdancarpintero', 25615),\n ('Sercontr', 21678),\n ('polkillas', 11572)]\n```\n\n#### 2. Street types correctness\nStarting from a list of common Spanish street types I will check whether the `addr:street` in both nodes and ways conforms with the expected types to identify possible errors in the dataset. The function `audit_street_types` inside the module `audit_street_types` goes through the XML file extracting the type of street and comparing with the ones defined in `COMMON_STREET_TYPES` shown below:\n\n```\nCOMMON_STREET_TYPES = {\n u'Autopista',\n u'Vía',\n u'Carretera',\n u'Calle',\n u'Calleja',\n u'Callejón',\n u'Avenida',\n u'Plaza',\n u'Sendero',\n u'Paseo',\n u'Travesía',\n u'Camino',\n u'Acceso' }\n```"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "By running this function we get that out 5.052 streets there are 52 _suspicious_ streets and 22 different _suspicious_ street types:\n\n```\nSuspicious street types:\nset(['A-5',\n 'AVDA.',\n 'Alcocer,',\n 'Av.',\n u'Avda.',\n 'C/',\n 'CALLE',\n 'CL',\n 'CR',\n 'CTRA.',\n 'Call',\n 'Charco',\n 'Ctra.',\n 'FIXME',\n 'Fundidores',\n u'Gran',\n 'Hermandad',\n u'Padr\\xf3n',\n 'Palmera',\n 'Paloma',\n 'Pedro',\n 'Pizarro',\n 'Pz',\n 'Rafaela',\n u'Reina',\n u'R\\xedo',\n u'SAN',\n 'Telecomunicaciones',\n 'Travesia'])\n```\n\nThe function also returns the streets themselves to better analyze if they look right or not: \n\n```\nSuspicious streets:\n['Av. de los Arces',\n 'CTRA. MADRID TOLEDO km 10',\n 'CALLE FELIPE CALLEJA, S/N',\n 'CALLE EBANISTAS,',\n u'SAN BERNARDO, S/N - B\\xba DE LA FORTUNA',\n 'CR M-406 , P.K. 7,3',\n 'CL AVDA. S. MARTIN DE VALDEIGLE, S. (M-501 KM 0.3)',\n 'CTRA. N-V KM 12.5',\n 'Alcocer, 23',\n 'CR M-406 KM 10',\n 'CL MADRID',\n 'Hermandad Donantes Sangre',\n 'CTRA. MADRID TOLEDO km 10',\n 'Hermandad Donantes Sangre',\n 'Call de la Hermandad de Donantes Sangre',\n 'Hermandad de Donantes de Sangre',\n 'C/',\n u'Reina Sof\\xeda',\n u'Av. de los \\xc9banos',\n 'Pedro de Valdivia',\n 'Paloma',\n 'Av. Juan XXIII',\n 'Travesia Calle Ancha',\n 'Travesia Calle Ancha',\n 'A-5',\n 'FIXME Avenida La Verbena De La Paloma',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n u'Travesia del Ox\\xedgeno',\n 'Ctra. de Extremadura',\n u'Avda. Gran Breta\\xf1a',\n 'C/ Parque Grande',\n 'A-5',\n 'Fundidores',\n 'Palmera',\n u'R\\xedo Tormes',\n 'Charco',\n 'Pizarro',\n u'Gran Breta\\xf1a',\n 'AVDA. DE LA UNIVERSIDAD',\n 'CR N-4 P.K. 6,70 I',\n 'CR N-4 P.K. 6,70 D',\n 'Rafaela Ybarra',\n 'Pz San Antonio de la Florida',\n 'Telecomunicaciones',\n 'Telecomunicaciones',\n 'Telecomunicaciones',\n 'Telecomunicaciones',\n u'Padr\\xf3n']\n```"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "Looking in detail to the 52 suspicious street types we see the following inconsistencies and error types:\n- Abbreviations of correct street types e.g. Av. instead of Avenida or CR instead of Carretera\n- Capitalized street types e.g. CALLE instead of Calle\n- Incorrectly written street types e.g. Travesia instead of Travesía\n- Missing street type in front of the street name e.g. 'Reina Sofía' should be 'Avenida de la Reina Sofía'"
	},
	{
	"metadata": {
	"collapsed": true
	},
	"cell_type": "markdown",
	"source": "#### 3. Postal codes correction\nThe function `process_street_postalcodes` in `audit_street_postalcodes` check the correctness of tags `addr:postcode`. In order for a postal code to be correct it must start with 28 which is the postal code in Madrid and must be a 5 digits code.\n\nRunning this function we get that are no errors in the postal codes so no correction is required. Note that most of the postal codes start with 289 which correspond to Leganés area. There are though some 280 as the area selected is not exactly just Leganés but covers also some surrounding parts of Madrid close to Leganés and their post code starts with that number.\n\n```\nNumber of postal codes: 2115\nNumber of different postal codes: 26\nset(['28919', '28918', '28915', '28914', '28917', '28916', '28911', '28913', '28912', '28054', '28924', '28925', '28921', '28922', '28923', '28906', '28907', '28904', '28902', '28903', '28901', '28021', '28025', '28024', '28041', '28044'])\nAll post codes are correct!\n``` "
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Cleaning the dataset\n\nThe file `process_map.py` performs the cleaning of the incorrect street types which can be corrected programmatically and proceeds to generate the csv files to be imported in the database. The file `schema.py` defines the schema against which the elements can be checked.\n\nI have run the function both with the `validate=True` and `False` to verify everything is working properly.\n\nRegarding the cleaning two activities are performed:\n\n1) I can automatically correct the abbreviations, capitalization and incorrectly written street types using a dictionary. I will convert to capitals the original street type in order to make the conversion so I can catch 'Avda.' and 'AVDA.' with the same key:\n\n```python\nSTREET_TYPES_CORRECTIONS = {\n u'AVDA.': u'Avenida',\n u'AV': u'Avenida',\n u'AV.': u'Avenida',\n u'C/': u'Calle',\n u'CALLE': u'Calle',\n u'CALL': u'Calle',\n u'CL': u'Calle',\n u'CR': u'Carretera',\n u'CTRA.': u'Carretera',\n u'Call': u'Calle',\n u'PZ': u'Plaza',\n u'TRAVESIA': u'Travesía'\n}\n```\n\nThis dictionary allows to remove all the incorrect street types which can be corrected programatically. There would be still some corrections to be done manually for those streets which do not specify their type. The number of such cases is very low and could be done manually, however, I will not do it as I don't think is the purpose of this exercise.\n\n2) On top of correcting the street type names it is possible to remove some characters which can be deleted from the addresses such as \"`.`\", \"`,`\" and \"`:`\""
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "Running the `process_map.py` program I can generate 5 csv files: nodes.csv, nodes_tags.csv, ways.csv, ways_nodes.csv and ways_tags.csv with the extracted data to be imported in the database."
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## Creating the database"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### 1. Create database structure \n\nCreate the database file:\n\n```\nsqlite3 leganes.db\n```\n\nInitialize database structure using the `database_schema.sql` SQL file\n\n```\n.read database_schema.sql\n```\n\nCheck that the tables have been created correctly:\n```\nsqlite> .tables\nnodes nodes_tags ways ways_nodes ways_tags\n```\n\nCheck one of the tables to verify its schema:\n```\nsqlite> .schema nodes\nCREATE TABLE nodes (\n id INTEGER PRIMARY KEY NOT NULL,\n lat REAL,\n lon REAL,\n user TEXT,\n uid INTEGER,\n version INTEGER,\n changeset INTEGER,\n timestamp TEXT\n);\n```"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### 2. Import data\n\nBefore importing the csv files into the database we need to remove the csv header. In order to do so in Windows we can use the command `more`. Here below an example for the `nodes.csv` file:\n\n```\nmore /E +1 nodes.csv > nodes_no_header.csv\n```\n\nAnother option is to modify the original `process_map.py` file and omit the calls to `writeheader`function. The command `more` is quite slow so this option is recommended.\n\nOnce we have removed all the CSV headers (or generated the files without the headers) we can proceed to import them into the database as follows:\n\n```\nsqlite> .mode csv\nsqlite> .import nodes.csv nodes\nsqlite> .import nodes_tags.csv nodes_tags\nsqlite> .import ways_nodes.csv ways_nodes\nsqlite> .import ways.csv ways\nsqlite> .import ways_tags.csv ways_tags\n```\n"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "#### Quick test to check if data has been properly imported\n\nNow that the data has been imported we can perform queries to compare with the basic statistics calculated using the module `audit_basic_stats`:\n\nNumber of nodes\n\nQuery:\n\n```SQL\nSELECT COUNT() FROM NODES;\n```\n> \n```\nsqlite> SELECT COUNT() FROM NODES;\n248715\n```\n\nNumber of ways\n\nQuery: \n```SQL \nSELECT COUNT() FROM WAYS;\n```\n> \n```\nsqlite> SELECT COUNT() FROM WAYS;\n46024\n```\n\nNumber of different users contributing\n\nQuery: \n```SQL \nSELECT COUNT(DISTINCT(user)) \nFROM (SELECT user FROM nodes UNION ALL SELECT user FROM WAYS);\n```\n\n> \n```\nsqlite> SELECT COUNT(DISTINCT(user)) FROM (SELECT user FROM nodes UNION ALL SELECT user FROM WAYS);\n375\n```\n\nTop 5 contributors\n\nQuery: \n```SQL \nSELECT user, COUNT() AS total \nFROM (SELECT user FROM NODES UNION ALL SELECT user FROM ways) \nGROUP BY user \nORDER BY total DESC \nLIMIT 5;\n```\n\n> \n```\nsqlite> SELECT user, COUNT() AS total FROM (SELECT user FROM NODES UNION ALL SELECT user FROM ways) GROUP BY user ORDER BY total DESC LIMIT 5;\nrafaerti\|100592\nCuenqui\|51684\ncirdancarpintero\|25615\nSercontr\|21678\npolkillas\|11572\n```\n\n"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### 3. Additional queries\n\n#### 3.1 Top 20 types of amenities\nLet's check what types of amenities are available to make define other specific queries.\n\n```SQL\nSELECT DISTINCT(value) \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags) WHERE key='amenity';\n```\n\n> Result\n```\nfuel\nschool\ncollege\nkindergarten\ntraining\nbank\nrestaurant\nparking_entrance\nparking\npost_office\npolice\nclinic\nfountain\nplace_of_worship\npost_box\ntelephone\ntaxi\npharmacy\ncafe\npub\nrecycling\nhospital\ncar_wash\ndrinking_water\nbar\nmarketplace\nbicycle_rental\nbicycle_parking\natm\ncinema\ntoilets\nveterinary\ndriving_school\nretirement_home\nshelter\nfast_food\ndentist\nshop\ncommunity_centre\nwaste_disposal\nbench\npublic_building\nsocial_facility\nfood_court\nwaste_basket\ndoctors\nlibrary\nchildcare\ntheatre\nnightclub\ncasino\nlanguage_school\nphoto_booth\nuniversity\nbus_station\nanimal_shelter\nparking_space\ncourthouse\ntownhall\nvehicle_inspection\nswimming_pool\nfire_station\nnursing_home\ncourt_yard\ncar_rental\nhealth_post\n```\n\n#### 3.2 Top 20 types of amenities\nLet's have a look at the top 20 amenities\n\n```SQL\nSELECT value, COUNT() as num \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags) \nWHERE key='amenity' \nGROUP BY value \nORDER BY num DESC \nLIMIT 20;\n```\n\n> Result\n```\nparking\|1428\nbench\|433\nschool\|284\ndrinking_water\|151\nparking_entrance\|150\nrestaurant\|104\nbar\|95\nrecycling\|92\npharmacy\|89\nbank\|88\nfountain\|72\ncafe\|67\nbicycle_parking\|54\nfuel\|47\nkindergarten\|46\nplace_of_worship\|45\nwaste_disposal\|37\nfast_food\|28\npub\|23\nclinic\|22\n```\n\n\n#### 3.3 Number of entertaiment places \nLet's check the number of places for entertainment in Leganés\n\n```SQL\nSELECT value, COUNT() as NUM \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags)\nWHERE key='amenity' AND value IN ('cafe', 'restaurant', 'pub', 'bar', 'cinema', 'theatre', 'nightclub', 'casino')\nGROUP BY value\nORDER BY num DESC;\n```\n\n> Result\n```\nrestaurant\|104\nbar\|95\ncafe\|67\npub\|23\ntheatre\|8\ncinema\|2\ncasino\|1\nnightclub\|1\n```\n\nThere are in total 301 places, which compared with Leganés population gives a ratio of 1 place for every 618 persons. It could be interesting to compare this ratio with other places. \n\n#### 3.4 Healthcare services\nJust a matter of comparison, let's check how many healthcar services do we have.\n\n```SQL\nSELECT value, COUNT(*) as NUM \nFROM (SELECT value, key FROM nodes_tags UNION ALL SELECT value, key FROM ways_tags)\nWHERE key='amenity' AND value IN ('pharmacy', 'hospital', 'dentist', 'doctors', 'health_post', 'nursing_home', 'clinic')\nGROUP BY value\nORDER BY num DESC;\n```\n\n> Result\n```\npharmacy\|89\nclinic\|22\ndentist\|22\nhospital\|10\ndoctors\|1\nhealth_post\|1\nnursing_home\|1\n```\n\nThere is a total of 146 health care services amenities in Leganés. Not suprisingly the number of `doctors` is very low as in Spain is common practice that they provide their services directly in hospitals or clinics, it's quite strange doctors providing services outside these facilities"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "### Areas for improvement\nThe initial goal when selected the area for analysis was picking just Leganés however filtering just by city is quite complicated in OpenStreetMap resulting in an area which covers the city of interest and part of other close by cities / towns. In order to restrict the analysis to the city of interest one option could be filtering the nodes and ways addresses by comparing the streets with another \"source of truth\". In order to do so, we could proceed as follows:\n\n1. Extract / scrape all the information of Leganés streets from https://www.codigo-postal.info/madrid/leganes; we could do that using BeautifulSoup.\n2. With that source of truth regarding street names and postal codes we could filter out the XML to just contain nodes and ways that are within the streets databases.\n3. An additional advantage of extracting the information from the above-referred web would be that we could also filter by postal code, removing all nodes and elements which do not belong to Leganés city\n"
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## Conclusions\nThe selected area for the analysis was in general quite clean regarding the analysis of street types and postal codes. There was a great deal of consistency regarding the names with just some minor exceptions which could not be corrected programmatically. As stated in the areas of improvement there are some limitations in the analysis performed as I'm quite sure it does contain parts of Madrid which do not belong to Leganés. The visual matching of the area I think was quite good but it was impossible to leave outside other areas of Madrid as cities in general are not \"square\" and in the case of Leganés is close by to other towns and Madrid city."
	}
	],
	"metadata": {
	"anaconda-cloud": {},
	"kernelspec": {
	"name": "python3",
	"display_name": "Python 3",
	"language": "python"
	},
	"language_info": {
	"name": "python",
	"version": "3.6.7",
	"mimetype": "text/x-python",
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"pygments_lexer": "ipython3",
	"nbconvert_exporter": "python",
	"file_extension": ".py"
	},
	"gist": {
	"id": "",
	"data": {
	"description": "OpenStreetMap Leganes.ipynb",
	"public": true
	}
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}