Skip to content

Instantly share code, notes, and snippets.

@kumekay
Created July 20, 2015 13:31
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kumekay/c1f18234d73ab10dba6a to your computer and use it in GitHub Desktop.
Save kumekay/c1f18234d73ab10dba6a to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "\n## Parse cities and countries from geonames.org"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "import csv\nimport sys\n\ncsv.field_size_limit(sys.maxsize)\n\nwdir = '/Users/ku/data/'",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## The table 'alternate names' :\n\nSource: http://download.geonames.org/export/dump/alternateNames.zip\n\n0. alternateNameId : the id of this alternate name, int\n1. geonameid : geonameId referring to id in table 'geoname', int\n2. isolanguage : iso 639 language code 2- or 3-characters; 4-characters 'post' for postal codes and 'iata','icao' and faac for airport codes, fr_1793 for French Revolution names, abbr for abbreviation, link for a website, varchar(7)\n3. alternate name : alternate name or name variant, varchar(200)\n4. isPreferredName : '1', if this alternate name is an official/preferred name\n5. isShortName : '1', if this is a short name like 'California' for 'State of California'\n6. isColloquial : '1', if this alternate name is a colloquial or slang term\n7. isHistoric : '1', if this alternate name is historic and was used in the past\n"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "names_ru = {}\nwith open(wdir + \"alternateNames.txt\", 'rb') as csvfile: \n idreader = csv.reader(csvfile, delimiter='\\t')\n for r in idreader:\n if r[2] == \"ru\":\n names_ru[r[1]]=r[3]\nprint ",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": "\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## The main 'geoname' table has the following fields :\n\nSource: http://download.geonames.org/export/dump/allCountries.zip\n\n0. geonameid | integer id of record in geonames database |\n1. name | name of geographical point (utf8) varchar(200)\n2. asciiname | name of geographical point in plain ascii characters, varchar(200)\n3. alternatenames | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)\n4. latitude | latitude in decimal degrees (wgs84)\n5. longitude | longitude in decimal degrees (wgs84)\n6. feature class | see http://www.geonames.org/export/codes.html, char(1)\n7. feature code | see http://www.geonames.org/export/codes.html, varchar(10)\n8. country code | ISO-3166 2-letter country code, 2 characters\n9. cc2 | alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters\n10. admin1 code | fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)\n11. admin2 code | code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) \n12. admin3 code | code for third level administrative division, varchar(20)\n13. admin4 code | code for fourth level administrative division, varchar(20)\n14. population | bigint (8 byte int) \n15. elevation | in meters, integer\n16. dem | digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.\n17. timezone | the timezone id (see file timeZone.txt) varchar(40)\n18. modification date | date of last modification in yyyy-MM-dd format"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "not_localized = 0\nlocalized = 0\nwithout_country = 0\nro = []\nwith open(wdir + \"allCountries.txt\", 'rb') as csvfile: \n with open(wdir + \"out_cities.csv\", 'wb') as csvoutfile:\n pwriter = csv.writer(csvoutfile, dialect='excel')\n idreader = csv.reader(csvfile, delimiter='\\t')\n pwriter.writerow(['id','name_en', 'name_ru', 'latitude', 'longitude', 'code', 'population'])\n for r in idreader:\n name_ru = None\n if r[14] == '':\n continue\n if r[6] != 'P':\n continue\n if r[8] == '':\n without_country += 1\n continue\n if r[0] in names_ru:\n localized += 1\n name_ru = names_ru[r[0]]\n else:\n not_localized += 1\n \n ro = [r[0], r[1], name_ru, r[4], r[5], r[8].lower(), r[14]]\n \n if r[0] in names_ru and localized < 5:\n print ro\n pwriter.writerow(ro)\n\nprint \"Localized:\", localized, \"Not localized:\", not_localized, \"Without country:\", without_country\nprint \"Last row\", ro",
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": "['3039154', 'El Tarter', '\\xd0\\xad\\xd0\\xbb \\xd0\\xa2\\xd0\\xb0\\xd1\\x80\\xd1\\x82\\xd0\\xb5\\xd1\\x80', '42.57952', '1.65362', 'ad', '1052']\n['3039163', 'Sant Juli\\xc3\\xa0 de L\\xc3\\xb2ria', '\\xd0\\xa1\\xd0\\xb0\\xd0\\xbd\\xd1\\x82-\\xd0\\x96\\xd1\\x83\\xd0\\xbb\\xd0\\xb8\\xd1\\x8f-\\xd0\\xb4\\xd0\\xb5-\\xd0\\x9b\\xd0\\xbe\\xd1\\x80\\xd0\\xb8\\xd1\\x8f', '42.46372', '1.49129', 'ad', '8022']\n['3039604', 'Pas de la Casa', '\\xd0\\x9f\\xd0\\xb0\\xd1\\x81 \\xd0\\xb4\\xd0\\xb5 \\xd0\\xbb\\xd0\\xb0 \\xd0\\x9a\\xd0\\xb0\\xd1\\x81\\xd0\\xb0', '42.54277', '1.73361', 'ad', '2363']\n['3039678', 'Ordino', '\\xd0\\x9e\\xd1\\x80\\xd0\\xb4\\xd0\\xb8\\xd0\\xbd\\xd0\\xbe', '42.55623', '1.53319', 'ad', '3066']\nLocalized: 238621 Not localized: 3471962 Without country: 0\nLast row ['8534443', 'Nhlangano', None, '-19.85776', '27.82236', 'zw', '0']\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Countries in multiple languages\nSample urls to get csv of countries: \n\n- http://api.geonames.org/countryInfoCSV?lang=en&username=demo\n- http://api.geonames.org/countryInfoCSV?lang=ru&username=demo\n\n### Format:\n- iso alpha2\t\n- iso alpha3\t\n- iso numeric\t\n- fips code\t\n- name\t\n- capital\t\n- areaInSqKm\t\n- population\t\n- continent\t\n- languages\t\n- currency\t\n- geonameId"
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "countries_en = []\ncountries_ru_h = {}",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# English countries\nwith open(wdir + \"countryInfoCSVen.txt\", 'rb') as encountriescsv: \n enreader = csv.DictReader(encountriescsv, delimiter='\\t')\n for r in enreader:\n r['iso alpha2'] = r['iso alpha2'].lower()\n countries_en.append(r)",
"execution_count": 10,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# Russian countries\nwith open(wdir + \"countryInfoCSVru.txt\", 'rb') as rucountriescsv: \n rureader = csv.DictReader(rucountriescsv, delimiter='\\t')\n for r in rureader:\n countries_ru_h[r['iso alpha2'].lower()] = r['name']",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "print \"English country sample:\", countries_en[0]\nprint \"Russian country sample with 'ru':\", countries_ru_h['ru']",
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": "English country sample: {'geonameId': '3041565', 'name': 'Andorra', 'areaInSqKm': '468.0', 'iso numeric': '020', 'languages': 'ca', 'currency': 'EUR', 'capital': 'Andorra la Vella', 'iso alpha3': 'AND', 'iso alpha2': 'ad', 'continent': 'EU', 'fips code': 'AN', 'population': '84000'}\nRussian country sample with 'ru': Россия\n",
"name": "stdout"
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "out_country = {}\ncountries_count = 0\nbad_countries = []\nwith open(wdir + \"out_countries.csv\", 'wb') as csvoutfile:\n fieldnames = ['id', 'code', 'name_en','name_ru', 'population']\n pwriter = csv.DictWriter(csvoutfile, dialect='excel', fieldnames=fieldnames)\n pwriter.writeheader()\n for c in countries_en:\n if c['name'] == '' or c['population'] == '0':\n bad_countries.append(c)\n continue\n out_country['id'] = c['geonameId']\n out_country['code'] = c['iso alpha2']\n out_country['name_en'] = c['name']\n out_country['name_ru'] = countries_ru_h.get(c['iso alpha2'])\n out_country['population'] = c['population']\n countries_count += 1\n pwriter.writerow(out_country)\n\nprint \"Country sample:\", out_country\nprint \"Total countres:\", countries_count \nprint \"Bad countries:\", len(bad_countries), 'List:', map(lambda c: c['name'], bad_countries)",
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": "Country sample: {'name_ru': '\\xd0\\x97\\xd0\\xb8\\xd0\\xbc\\xd0\\xb1\\xd0\\xb0\\xd0\\xb1\\xd0\\xb2\\xd0\\xb5', 'name_en': 'Zimbabwe', 'code': 'zw', 'id': '878675', 'population': '11651858'}\nTotal countres: 246\nBad countries: 6 List: ['', 'Antarctica', 'Bouvet Island', '', 'Heard Island and McDonald Islands', 'U.S. Minor Outlying Islands']\n",
"name": "stdout"
}
]
}
],
"metadata": {
"kernelspec": {
"name": "python2",
"display_name": "Python 2",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10",
"file_extension": ".py",
"codemirror_mode": {
"version": 2,
"name": "ipython"
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment