Created
July 20, 2015 13:31
-
-
Save kumekay/c1f18234d73ab10dba6a to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"metadata": { | |
"collapsed": true | |
}, | |
"cell_type": "markdown", | |
"source": "\n## Parse cities and countries from geonames.org" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "import csv\nimport sys\n\ncsv.field_size_limit(sys.maxsize)\n\nwdir = '/Users/ku/data/'", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## The table 'alternate names' :\n\nSource: http://download.geonames.org/export/dump/alternateNames.zip\n\n0. alternateNameId : the id of this alternate name, int\n1. geonameid : geonameId referring to id in table 'geoname', int\n2. isolanguage : iso 639 language code 2- or 3-characters; 4-characters 'post' for postal codes and 'iata','icao' and faac for airport codes, fr_1793 for French Revolution names, abbr for abbreviation, link for a website, varchar(7)\n3. alternate name : alternate name or name variant, varchar(200)\n4. isPreferredName : '1', if this alternate name is an official/preferred name\n5. isShortName : '1', if this is a short name like 'California' for 'State of California'\n6. isColloquial : '1', if this alternate name is a colloquial or slang term\n7. isHistoric : '1', if this alternate name is historic and was used in the past\n" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "names_ru = {}\nwith open(wdir + \"alternateNames.txt\", 'rb') as csvfile: \n idreader = csv.reader(csvfile, delimiter='\\t')\n for r in idreader:\n if r[2] == \"ru\":\n names_ru[r[1]]=r[3]\nprint ", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## The main 'geoname' table has the following fields :\n\nSource: http://download.geonames.org/export/dump/allCountries.zip\n\n0. geonameid | integer id of record in geonames database |\n1. name | name of geographical point (utf8) varchar(200)\n2. asciiname | name of geographical point in plain ascii characters, varchar(200)\n3. alternatenames | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)\n4. latitude | latitude in decimal degrees (wgs84)\n5. longitude | longitude in decimal degrees (wgs84)\n6. feature class | see http://www.geonames.org/export/codes.html, char(1)\n7. feature code | see http://www.geonames.org/export/codes.html, varchar(10)\n8. country code | ISO-3166 2-letter country code, 2 characters\n9. cc2 | alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters\n10. admin1 code | fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)\n11. admin2 code | code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) \n12. admin3 code | code for third level administrative division, varchar(20)\n13. admin4 code | code for fourth level administrative division, varchar(20)\n14. population | bigint (8 byte int) \n15. elevation | in meters, integer\n16. dem | digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.\n17. timezone | the timezone id (see file timeZone.txt) varchar(40)\n18. modification date | date of last modification in yyyy-MM-dd format" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "not_localized = 0\nlocalized = 0\nwithout_country = 0\nro = []\nwith open(wdir + \"allCountries.txt\", 'rb') as csvfile: \n with open(wdir + \"out_cities.csv\", 'wb') as csvoutfile:\n pwriter = csv.writer(csvoutfile, dialect='excel')\n idreader = csv.reader(csvfile, delimiter='\\t')\n pwriter.writerow(['id','name_en', 'name_ru', 'latitude', 'longitude', 'code', 'population'])\n for r in idreader:\n name_ru = None\n if r[14] == '':\n continue\n if r[6] != 'P':\n continue\n if r[8] == '':\n without_country += 1\n continue\n if r[0] in names_ru:\n localized += 1\n name_ru = names_ru[r[0]]\n else:\n not_localized += 1\n \n ro = [r[0], r[1], name_ru, r[4], r[5], r[8].lower(), r[14]]\n \n if r[0] in names_ru and localized < 5:\n print ro\n pwriter.writerow(ro)\n\nprint \"Localized:\", localized, \"Not localized:\", not_localized, \"Without country:\", without_country\nprint \"Last row\", ro", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "['3039154', 'El Tarter', '\\xd0\\xad\\xd0\\xbb \\xd0\\xa2\\xd0\\xb0\\xd1\\x80\\xd1\\x82\\xd0\\xb5\\xd1\\x80', '42.57952', '1.65362', 'ad', '1052']\n['3039163', 'Sant Juli\\xc3\\xa0 de L\\xc3\\xb2ria', '\\xd0\\xa1\\xd0\\xb0\\xd0\\xbd\\xd1\\x82-\\xd0\\x96\\xd1\\x83\\xd0\\xbb\\xd0\\xb8\\xd1\\x8f-\\xd0\\xb4\\xd0\\xb5-\\xd0\\x9b\\xd0\\xbe\\xd1\\x80\\xd0\\xb8\\xd1\\x8f', '42.46372', '1.49129', 'ad', '8022']\n['3039604', 'Pas de la Casa', '\\xd0\\x9f\\xd0\\xb0\\xd1\\x81 \\xd0\\xb4\\xd0\\xb5 \\xd0\\xbb\\xd0\\xb0 \\xd0\\x9a\\xd0\\xb0\\xd1\\x81\\xd0\\xb0', '42.54277', '1.73361', 'ad', '2363']\n['3039678', 'Ordino', '\\xd0\\x9e\\xd1\\x80\\xd0\\xb4\\xd0\\xb8\\xd0\\xbd\\xd0\\xbe', '42.55623', '1.53319', 'ad', '3066']\nLocalized: 238621 Not localized: 3471962 Without country: 0\nLast row ['8534443', 'Nhlangano', None, '-19.85776', '27.82236', 'zw', '0']\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Countries in multiple languages\nSample urls to get csv of countries: \n\n- http://api.geonames.org/countryInfoCSV?lang=en&username=demo\n- http://api.geonames.org/countryInfoCSV?lang=ru&username=demo\n\n### Format:\n- iso alpha2\t\n- iso alpha3\t\n- iso numeric\t\n- fips code\t\n- name\t\n- capital\t\n- areaInSqKm\t\n- population\t\n- continent\t\n- languages\t\n- currency\t\n- geonameId" | |
}, | |
{ | |
"metadata": { | |
"collapsed": true, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "countries_en = []\ncountries_ru_h = {}", | |
"execution_count": 9, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "# English countries\nwith open(wdir + \"countryInfoCSVen.txt\", 'rb') as encountriescsv: \n enreader = csv.DictReader(encountriescsv, delimiter='\\t')\n for r in enreader:\n r['iso alpha2'] = r['iso alpha2'].lower()\n countries_en.append(r)", | |
"execution_count": 10, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "# Russian countries\nwith open(wdir + \"countryInfoCSVru.txt\", 'rb') as rucountriescsv: \n rureader = csv.DictReader(rucountriescsv, delimiter='\\t')\n for r in rureader:\n countries_ru_h[r['iso alpha2'].lower()] = r['name']", | |
"execution_count": 11, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "print \"English country sample:\", countries_en[0]\nprint \"Russian country sample with 'ru':\", countries_ru_h['ru']", | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "English country sample: {'geonameId': '3041565', 'name': 'Andorra', 'areaInSqKm': '468.0', 'iso numeric': '020', 'languages': 'ca', 'currency': 'EUR', 'capital': 'Andorra la Vella', 'iso alpha3': 'AND', 'iso alpha2': 'ad', 'continent': 'EU', 'fips code': 'AN', 'population': '84000'}\nRussian country sample with 'ru': Россия\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "out_country = {}\ncountries_count = 0\nbad_countries = []\nwith open(wdir + \"out_countries.csv\", 'wb') as csvoutfile:\n fieldnames = ['id', 'code', 'name_en','name_ru', 'population']\n pwriter = csv.DictWriter(csvoutfile, dialect='excel', fieldnames=fieldnames)\n pwriter.writeheader()\n for c in countries_en:\n if c['name'] == '' or c['population'] == '0':\n bad_countries.append(c)\n continue\n out_country['id'] = c['geonameId']\n out_country['code'] = c['iso alpha2']\n out_country['name_en'] = c['name']\n out_country['name_ru'] = countries_ru_h.get(c['iso alpha2'])\n out_country['population'] = c['population']\n countries_count += 1\n pwriter.writerow(out_country)\n\nprint \"Country sample:\", out_country\nprint \"Total countres:\", countries_count \nprint \"Bad countries:\", len(bad_countries), 'List:', map(lambda c: c['name'], bad_countries)", | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "Country sample: {'name_ru': '\\xd0\\x97\\xd0\\xb8\\xd0\\xbc\\xd0\\xb1\\xd0\\xb0\\xd0\\xb1\\xd0\\xb2\\xd0\\xb5', 'name_en': 'Zimbabwe', 'code': 'zw', 'id': '878675', 'population': '11651858'}\nTotal countres: 246\nBad countries: 6 List: ['', 'Antarctica', 'Bouvet Island', '', 'Heard Island and McDonald Islands', 'U.S. Minor Outlying Islands']\n", | |
"name": "stdout" | |
} | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python2", | |
"display_name": "Python 2", | |
"language": "python" | |
}, | |
"language_info": { | |
"mimetype": "text/x-python", | |
"nbconvert_exporter": "python", | |
"name": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.10", | |
"file_extension": ".py", | |
"codemirror_mode": { | |
"version": 2, | |
"name": "ipython" | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment