Skip to content

Instantly share code, notes, and snippets.

@mhermans
Last active May 1, 2017 19:09
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mhermans/8c32eea0d5ec29e6b4329acbe7f0d3de to your computer and use it in GitHub Desktop.
Save mhermans/8c32eea0d5ec29e6b4329acbe7f0d3de to your computer and use it in GitHub Desktop.
Germany travel destination analysis
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:a01bf04e3c3faeadb96eb30195554aeabd9c31253a58f6671c64c9c0327dc326"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Guessing mystery travel locations using Pandas, Docker and OS(R)M"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2016-05-29 | Maarten Hermans - @hermansm\n",
"\n",
"In July we are travelling for a weekend to a mystery location. Some tips we have received:\n",
"\n",
"* Three times an 'd' and three times an 'e' in the locations' name.\n",
"* A good two hours travel eastwards from a given starting location (Heusden), which means it is in Germany.\n",
"* Look at the lines between Worms, Koblenz, Aarlen, and Malmedy.\n",
"* The symbols in the heraldic shield of the location refer to [Saint Jude](https://en.wikipedia.org/wiki/Jude_the_Apostle) and [Saint Simon](https://en.wikipedia.org/wiki/Simon_the_Zealot).\n",
"\n",
"To take a *very* calculated guess at the location, we will\n",
"\n",
"1. Get all the German location names and their coordinates from [Open Street Map](https://www.openstreetmap.org/) (OSM).\n",
"2. Filter the names on triple a's and triple e's.\n",
"3. Calculate the effective driving time from the starting location to each of the matching locations using the webservice of the [Open Source Routing Machine](http://project-osrm.org/) (OSRM).\n",
"4. Select those below 3h driving time, and near the intersection of the lines drawn from the four given cities.\n",
"5. Look a the heraldic shields of the remaining locations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Get all the German placenames from OSM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"To get a dump of OSM location names into Pandas for analysis, there is a very useful Github repository [OSMNames](https://github.com/geometalab/OSMNames) with Docker files and some scripts. It uses OSM data-dumps from [Geofabrik](download.geofabrik.de/index.html), in our case the data available for [Germany](http://download.geofabrik.de/europe/germany.html).\n",
"\n",
"We clone the Github repository and download the entire OSM-dataset for Germany:\n",
"\n",
" git clone https://github.com/geometalab/OSMNames.git\n",
" cd OSMNames\n",
" wget --directory-prefix=./data http://download.geofabrik.de/europe/germany-latest.osm.pbf\n",
"\n",
"Then we load the OSM-database dump using Docker to get our environment in a pinch, and export the German location names\n",
"\n",
" docker-compose up -d postgres # launches Postgres-database\n",
" docker-compose run import-osm # import 2.8GB pbf-file into Postgres \n",
" docker-compose run schema # \n",
" docker-compose run export-osmnames # export location names to 500MB plain-text file\n",
"\n",
"For analysis, create a Python virtual environment, and install the required libraries with pip.\n",
"\n",
" mkvirtualenv osmde\n",
" pip install ipython pandas jupyter numpy requests numexpr\n",
" \n",
"Launch an interactive IPython notebook, you can download and re-run the one you are reading.\n",
" \n",
" ipython notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Filter all the place names with triple 'd' and triple 'e'"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd # for data loading and manipulation\n",
"import requests # for the request-function to OSRM\n",
"import numpy as np # some direct matrix manipulations\n",
"#import numexpr"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de = pd.read_table('data/germany-latest.tsv', low_memory=False ) # read OSM datadump of DE places"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de.shape # 3million spatial entities "
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"(2948639, 18)"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de.columns # OSM variables included"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"Index([u'name', u'class', u'type', u'lon', u'lat', u'place_rank',\n",
" u'importance', u'street', u'city', u'county', u'state', u'country_code',\n",
" u'country', u'display_name', u'west', u'south', u'east', u'north'],\n",
" dtype='object')"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de.type.value_counts() # counts by type of spatial feature"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"residential 1511729\n",
"secondary 229633\n",
"tertiary 210144\n",
"service 198968\n",
"unclassified 179072\n",
"track 141591\n",
"living_street 103261\n",
"primary 92683\n",
"footway 67933\n",
"path 50243\n",
"hamlet 43889\n",
"village 40548\n",
"administrative 22066\n",
"cycleway 12213\n",
"suburb 9783\n",
"steps 8371\n",
"trunk 6025\n",
"motorway 4052\n",
"neighbourhood 2526\n",
"construction 2525\n",
"town 2520\n",
"primary_link 2405\n",
"secondary_link 2209\n",
"trunk_link 1218\n",
"tertiary_link 1066\n",
"motorway_link 619\n",
"road 565\n",
"raceway 339\n",
"bridleway 270\n",
"city 95\n",
"quarter 38\n",
"borough 27\n",
"corridor 13\n",
"dtype: int64"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# function to test of the place-name matches the triple e's and d's\n",
"def match_hint(x):\n",
" try:\n",
" if x.lower().count('e') == x.lower().count('d') == 3:\n",
" return True\n",
" else:\n",
" return False\n",
" except AttributeError:\n",
" return False"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# apply the criteria test-function to each name, and record matches\n",
"de.ismatch = de.name.apply(match_hint)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de.ismatch.value_counts() # we have 1483 matches across Germany"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"False 2947156\n",
"True 1483\n",
"dtype: int64"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de_match = de[de.ismatch] # filter out the matched entities"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de_match.type.value_counts()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"residential 568\n",
"secondary 220\n",
"unclassified 113\n",
"path 106\n",
"track 102\n",
"tertiary 93\n",
"service 84\n",
"living_street 44\n",
"administrative 32\n",
"primary 28\n",
"footway 23\n",
"village 18\n",
"hamlet 12\n",
"neighbourhood 10\n",
"suburb 7\n",
"cycleway 6\n",
"construction 4\n",
"tertiary_link 4\n",
"town 3\n",
"steps 2\n",
"bridleway 2\n",
"primary_link 1\n",
"secondary_link 1\n",
"dtype: int64"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# it has to be a town of some sort, filter out the rest.\n",
"de_towns = de_match[de_match.type.isin(['village', 'suburb', 'hamlet', 'suburb', 'town', 'neighbourhood', 'city'])]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de_towns.type.value_counts()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"village 18\n",
"hamlet 12\n",
"neighbourhood 10\n",
"suburb 7\n",
"town 3\n",
"dtype: int64"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de_towns.type.shape # we have 50 places left"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
"(50,)"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"de_towns.name # possible locations with the right name"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"11679 Adelheidsgroden\n",
"362646 Baddeckenstedt\n",
"364889 Bad Neustadt an der Saale\n",
"530409 Breddewarden\n",
"530410 Breddewarden\n",
"578347 Buchholz in der Nordheide\n",
"649310 Dederstedt\n",
"651680 Deiderode\n",
"658325 Deudesfeld\n",
"661581 Diedesfeld\n",
"665404 Diestedde\n",
"672703 Dodesheide\n",
"710991 Drei Linden Siedlung\n",
"714318 Drestedt-Valzik Siedlung\n",
"718311 Drosendorf an der Aufse\u00df\n",
"726726 Dunwarderfelde\n",
"733931 Ebersdorf (bei Ludwigsstadt)\n",
"840590 Fedderwarden\n",
"903566 Fredersdorf-Vogelsdorf\n",
"915214 Friedersdorf, Oderbruch\n",
"931535 Friedrichswalde-Ottendorf\n",
"1013722 G\u00f6ddeckenrode\n",
"1056754 Gro\u00dffedderwarden\n",
"1073998 Gundelfingen an der Donau\n",
"1187636 Heidenoldendorf\n",
"1280572 Hohenb\u00f6ddenstedt\n",
"1565076 Klein Waddewarden\n",
"1640130 Kummersdorf-Alexanderdorf\n",
"1829644 Meddewade\n",
"1841699 Menzenschwand-Vorderdorf\n",
"1928772 Naundorf vor der Heide\n",
"1931349 Nedderend\n",
"1940174 Neu Duvenstedt-Nord\n",
"1940175 Neu Duvenstedt-S\u00fcd\n",
"1965225 Niederdollendorf\n",
"1969279 Niederstadtfeld\n",
"2020344 \u00d6d in der Pechschneid\n",
"2083754 Pedden\u00f6de\n",
"2427090 Seebad R\u00fcdersdorf\n",
"2435941 Seidfeld (Sauerland)\n",
"2449680 Siedlung Denrodtstra\u00dfe\n",
"2449685 Siedlung der Freundschaft\n",
"2451328 Siedlung Waldfrieden\n",
"2451329 Siedlung Waldfrieden\n",
"2451330 Siedlung Waldfrieden\n",
"2557288 Studentendorf Waldh\u00e4user-Ost\n",
"2736841 Waldsiedlung Junkerfeld\n",
"2765743 Wedderm\u00f6de\n",
"2765807 Wedderstedt\n",
"2765812 Weddewarden\n",
"Name: name, dtype: object"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"coords = zip(de_towns.lon, de_towns.lat) # create a list with latitude-longitude pairs \n",
"labels = de_towns.name.tolist() # get a list of place-names\n",
"coords.append((5.2205227, 51.023644)) # add the coordinates of Heusden to the list\n",
"labels.append('Heusden') # add Heusden to the list of place names"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Get the driving distance to all matches from OSRM"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Get the pairwise drive time in minutes for a list of coordinates,\n",
"# using the OSM-based open source online routing webservice http://project-osrm.org/\n",
"# (code from python-osrm: https://github.com/ustroetz/python-osrm/blob/master/osrm/core.py#L142 )\n",
"\n",
"def drivetime_table(list_coords, list_ids, output='df',\n",
" host='http://localhost:5000'):\n",
" \"\"\"\n",
" Function wrapping OSRM 'table' function in order to get a matrix of\n",
" time distance as a numpy array or as a DataFrame\n",
" Params :\n",
" list_coords: list\n",
" A list of coord as [x, y] , like :\n",
" list_coords = [[21.3224, 45.2358],\n",
" [21.3856, 42.0094],\n",
" [20.9574, 41.5286]] (coords have to be float)\n",
" list_ids: list\n",
" A list of the corresponding unique id, like :\n",
" list_ids = ['name1',\n",
" 'name2',\n",
" 'name3'] (id can be str, int or float)\n",
" host: str, default 'http://localhost:5000'\n",
" Url and port of the OSRM instance (no final bakslash)\n",
" output: str, default 'pandas'\n",
" The type of matrice to return (DataFrame or numpy array)\n",
" 'pandas', 'df' or 'DataFrame' for a DataFrame\n",
" 'numpy', 'array' or 'np' for a numpy array\n",
" Output:\n",
" - 'numpy' : a numpy array containing the time in minutes\n",
" (or NaN when OSRM encounter an error to compute a route)\n",
" or\n",
" - 'pandas' : a labeled DataFrame containing the time matrix in minutes\n",
" (or NaN when OSRM encounter an error to compute a route)\n",
" -1 is return in case of any other error (bad 'output' parameter,\n",
" wrong list of coords/ids, unknow host,\n",
" wrong response from the host, etc.)\n",
" \"\"\"\n",
" if output.lower() in ('numpy', 'array', 'np'):\n",
" output = 1\n",
" elif output.lower() in ('pandas', 'dataframe', 'df'):\n",
" output = 2\n",
" else:\n",
" print('Unknow output parameter')\n",
" return -1\n",
"\n",
" query = [host, '/table?loc=']\n",
" for coord, uid in zip(list_coords, list_ids): # Preparing the query\n",
" tmp = ''.join([str(coord[1]), ',', str(coord[0]), '&loc='])\n",
" query.append(tmp)\n",
" query = (''.join(query))[:-5]\n",
" try: # Querying the OSRM local instance\n",
" rep = requests.get(query)\n",
" parsed_json = rep.json()\n",
" except Exception as err:\n",
" print('Error while contacting OSRM instance : \\n{}'.format(err))\n",
" return -1\n",
"\n",
" if 'distance_table' in parsed_json.keys(): # Preparing the result matrix\n",
" mat = np.array(parsed_json['distance_table'], dtype='float64')\n",
" if len(mat) < len(list_coords):\n",
" print(('The array returned by OSRM is smaller to the size of the '\n",
" 'array requested\\nOSRM parameter --max-table-size should be'\n",
" ' increased or function osrm.table_OD(...) should be used'))\n",
" return -1\n",
" mat = mat/(10*60) # Conversion in minutes\n",
" mat = mat.round(1)\n",
" mat[mat == 3579139.4] = np.NaN # Flag the errors with NaN\n",
" if output == 1:\n",
" return mat\n",
" elif output == 2:\n",
" df = pd.DataFrame(mat, index=list_ids, columns=list_ids, dtype=float)\n",
" return df\n",
" else:\n",
" print('No distance table return by OSRM local instance')\n",
" return -10"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# get the driving-time in minutes for all the coords\n",
"time_matrix = drivetime_table(coords, labels, output='dataframe', host='http://router.project-osrm.org/')\n",
"time_matrix.shape # 51x51 matrix, for all german cities + Heusden"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"(51, 51)"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"solutions = time_matrix.loc[time_matrix.Heusden < 180, time_matrix.Heusden < 180]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"solutions # we are left with six posibilities"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Deudesfeld</th>\n",
" <th>Niederdollendorf</th>\n",
" <th>Niederstadtfeld</th>\n",
" <th>Pedden\u00f6de</th>\n",
" <th>Seidfeld (Sauerland)</th>\n",
" <th>Siedlung Denrodtstra\u00dfe</th>\n",
" <th>Heusden</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Deudesfeld</th>\n",
" <td>0.0</td>\n",
" <td>90.6</td>\n",
" <td>11.0</td>\n",
" <td>149.3</td>\n",
" <td>190.5</td>\n",
" <td>169.5</td>\n",
" <td>131.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Niederdollendorf</th>\n",
" <td>91.5</td>\n",
" <td>0.0</td>\n",
" <td>80.5</td>\n",
" <td>70.1</td>\n",
" <td>109.1</td>\n",
" <td>90.3</td>\n",
" <td>119.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Niederstadtfeld</th>\n",
" <td>11.1</td>\n",
" <td>79.6</td>\n",
" <td>0.0</td>\n",
" <td>138.2</td>\n",
" <td>179.5</td>\n",
" <td>158.5</td>\n",
" <td>132.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Pedden\u00f6de</th>\n",
" <td>150.0</td>\n",
" <td>69.5</td>\n",
" <td>139.0</td>\n",
" <td>0.0</td>\n",
" <td>62.5</td>\n",
" <td>46.1</td>\n",
" <td>136.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Seidfeld (Sauerland)</th>\n",
" <td>191.1</td>\n",
" <td>109.9</td>\n",
" <td>180.5</td>\n",
" <td>63.9</td>\n",
" <td>0.0</td>\n",
" <td>63.8</td>\n",
" <td>179.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Siedlung Denrodtstra\u00dfe</th>\n",
" <td>171.1</td>\n",
" <td>90.6</td>\n",
" <td>160.1</td>\n",
" <td>47.2</td>\n",
" <td>62.5</td>\n",
" <td>0.0</td>\n",
" <td>132.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Heusden</th>\n",
" <td>131.8</td>\n",
" <td>118.6</td>\n",
" <td>133.3</td>\n",
" <td>137.4</td>\n",
" <td>175.3</td>\n",
" <td>132.4</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
" Deudesfeld Niederdollendorf Niederstadtfeld \\\n",
"Deudesfeld 0.0 90.6 11.0 \n",
"Niederdollendorf 91.5 0.0 80.5 \n",
"Niederstadtfeld 11.1 79.6 0.0 \n",
"Pedden\u00f6de 150.0 69.5 139.0 \n",
"Seidfeld (Sauerland) 191.1 109.9 180.5 \n",
"Siedlung Denrodtstra\u00dfe 171.1 90.6 160.1 \n",
"Heusden 131.8 118.6 133.3 \n",
"\n",
" Pedden\u00f6de Seidfeld (Sauerland) \\\n",
"Deudesfeld 149.3 190.5 \n",
"Niederdollendorf 70.1 109.1 \n",
"Niederstadtfeld 138.2 179.5 \n",
"Pedden\u00f6de 0.0 62.5 \n",
"Seidfeld (Sauerland) 63.9 0.0 \n",
"Siedlung Denrodtstra\u00dfe 47.2 62.5 \n",
"Heusden 137.4 175.3 \n",
"\n",
" Siedlung Denrodtstra\u00dfe Heusden \n",
"Deudesfeld 169.5 131.1 \n",
"Niederdollendorf 90.3 119.1 \n",
"Niederstadtfeld 158.5 132.8 \n",
"Pedden\u00f6de 46.1 136.9 \n",
"Seidfeld (Sauerland) 63.8 179.3 \n",
"Siedlung Denrodtstra\u00dfe 0.0 132.0 \n",
"Heusden 132.4 0.0 "
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Plot the crossing lines \n",
"\n",
"Used R Leaflet for convenience, plus screenshot. Outer points are Worms, Koblenz, Aarlen, and Malmedy. Inner points are **Deudesfeld** and **Niederstadtfeld**, the rest are not that close to the intersection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Image of Yaktocat](https://s33.postimg.org/kk7l6rc7z/germany_locations_lines.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Verify the heraldic shiels of the final matches\n",
"\n",
"The shield of [Deudesfeld](https://de.wikipedia.org/wiki/Deudesfeld) contains an axe and a woodcutters hatchet:\n",
"\n",
"![Deudesfeld Weapon](https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Wappen_von_Deudesfeld.png/140px-Wappen_von_Deudesfeld.png)\n",
"\n",
"A common attribute of Jude is an axe, given his death at the end of one. Simon is commonly depicted with a saw (sawn in half), or other woodworking tools such as a hatchet."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final verdict:\n",
"\n",
"**Deudesfeld**"
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment