kumekay/geonames.ipynb

## geonames.ipynb
{
 "cells": [
  {
   "metadata": {
    "collapsed": true
   },
   "cell_type": "markdown",
   "source": "\n## Parse cities and countries from geonames.org"
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "import csv\nimport sys\n\ncsv.field_size_limit(sys.maxsize)\n\nwdir = '/Users/ku/data/'",
   "execution_count": 3,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## The table 'alternate names' :\n\nSource: http://download.geonames.org/export/dump/alternateNames.zip\n\n0. alternateNameId   : the id of this alternate name, int\n1. geonameid         : geonameId referring to id in table 'geoname', int\n2. isolanguage       : iso 639 language code 2- or 3-characters; 4-characters 'post' for postal codes and 'iata','icao' and faac for airport codes, fr_1793 for French Revolution names,  abbr for abbreviation, link for a website, varchar(7)\n3. alternate name    : alternate name or name variant, varchar(200)\n4. isPreferredName   : '1', if this alternate name is an official/preferred name\n5. isShortName       : '1', if this is a short name like 'California' for 'State of California'\n6. isColloquial      : '1', if this alternate name is a colloquial or slang term\n7. isHistoric        : '1', if this alternate name is historic and was used in the past\n"
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "names_ru = {}\nwith open(wdir + \"alternateNames.txt\", 'rb') as csvfile:    \n    idreader = csv.reader(csvfile, delimiter='\\t')\n    for r in idreader:\n        if r[2] == \"ru\":\n            names_ru[r[1]]=r[3]\nprint         ",
   "execution_count": 4,
   "outputs": [
    {
     "output_type": "stream",
     "text": "\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## The main 'geoname' table has the following fields :\n\nSource: http://download.geonames.org/export/dump/allCountries.zip\n\n0. geonameid         | integer id of record in geonames database |\n1. name              | name of geographical point (utf8) varchar(200)\n2. asciiname         | name of geographical point in plain ascii characters, varchar(200)\n3. alternatenames    | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)\n4. latitude          | latitude in decimal degrees (wgs84)\n5. longitude         | longitude in decimal degrees (wgs84)\n6. feature class     | see http://www.geonames.org/export/codes.html, char(1)\n7. feature code      | see http://www.geonames.org/export/codes.html, varchar(10)\n8. country code      | ISO-3166 2-letter country code, 2 characters\n9. cc2               | alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters\n10. admin1 code       | fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)\n11. admin2 code       | code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) \n12. admin3 code       | code for third level administrative division, varchar(20)\n13. admin4 code       | code for fourth level administrative division, varchar(20)\n14. population        | bigint (8 byte int) \n15. elevation         | in meters, integer\n16. dem               | digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.\n17. timezone          | the timezone id (see file timeZone.txt) varchar(40)\n18. modification date | date of last modification in yyyy-MM-dd format"
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "not_localized = 0\nlocalized = 0\nwithout_country = 0\nro = []\nwith open(wdir + \"allCountries.txt\", 'rb') as csvfile:    \n    with open(wdir + \"out_cities.csv\", 'wb') as csvoutfile:\n        pwriter = csv.writer(csvoutfile, dialect='excel')\n        idreader = csv.reader(csvfile, delimiter='\\t')\n        pwriter.writerow(['id','name_en', 'name_ru', 'latitude', 'longitude', 'code', 'population'])\n        for r in idreader:\n            name_ru = None\n            if r[14] == '':\n                continue\n            if r[6] != 'P':\n                continue\n            if r[8] == '':\n                without_country += 1\n                continue\n            if r[0] in names_ru:\n                localized += 1\n                name_ru = names_ru[r[0]]\n            else:\n                not_localized += 1\n                \n            ro = [r[0], r[1], name_ru, r[4], r[5], r[8].lower(), r[14]]\n            \n            if r[0] in names_ru and localized < 5:\n                print ro\n            pwriter.writerow(ro)\n\nprint \"Localized:\", localized, \"Not localized:\", not_localized, \"Without country:\", without_country\nprint \"Last row\", ro",
   "execution_count": 8,
   "outputs": [
    {
     "output_type": "stream",
     "text": "['3039154', 'El Tarter', '\\xd0\\xad\\xd0\\xbb \\xd0\\xa2\\xd0\\xb0\\xd1\\x80\\xd1\\x82\\xd0\\xb5\\xd1\\x80', '42.57952', '1.65362', 'ad', '1052']\n['3039163', 'Sant Juli\\xc3\\xa0 de L\\xc3\\xb2ria', '\\xd0\\xa1\\xd0\\xb0\\xd0\\xbd\\xd1\\x82-\\xd0\\x96\\xd1\\x83\\xd0\\xbb\\xd0\\xb8\\xd1\\x8f-\\xd0\\xb4\\xd0\\xb5-\\xd0\\x9b\\xd0\\xbe\\xd1\\x80\\xd0\\xb8\\xd1\\x8f', '42.46372', '1.49129', 'ad', '8022']\n['3039604', 'Pas de la Casa', '\\xd0\\x9f\\xd0\\xb0\\xd1\\x81 \\xd0\\xb4\\xd0\\xb5 \\xd0\\xbb\\xd0\\xb0 \\xd0\\x9a\\xd0\\xb0\\xd1\\x81\\xd0\\xb0', '42.54277', '1.73361', 'ad', '2363']\n['3039678', 'Ordino', '\\xd0\\x9e\\xd1\\x80\\xd0\\xb4\\xd0\\xb8\\xd0\\xbd\\xd0\\xbe', '42.55623', '1.53319', 'ad', '3066']\nLocalized: 238621 Not localized: 3471962 Without country: 0\nLast row ['8534443', 'Nhlangano', None, '-19.85776', '27.82236', 'zw', '0']\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Countries in multiple languages\nSample urls to get csv of countries: \n\n- http://api.geonames.org/countryInfoCSV?lang=en&username=demo\n- http://api.geonames.org/countryInfoCSV?lang=ru&username=demo\n\n### Format:\n- iso alpha2\t\n- iso alpha3\t\n- iso numeric\t\n- fips code\t\n- name\t\n- capital\t\n- areaInSqKm\t\n- population\t\n- continent\t\n- languages\t\n- currency\t\n- geonameId"
  },
  {
   "metadata": {
    "collapsed": true,
    "trusted": true
   },
   "cell_type": "code",
   "source": "countries_en = []\ncountries_ru_h = {}",
   "execution_count": 9,
   "outputs": []
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "# English countries\nwith open(wdir + \"countryInfoCSVen.txt\", 'rb') as encountriescsv:    \n    enreader = csv.DictReader(encountriescsv, delimiter='\\t')\n    for r in enreader:\n        r['iso alpha2'] = r['iso alpha2'].lower()\n        countries_en.append(r)",
   "execution_count": 10,
   "outputs": []
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "# Russian countries\nwith open(wdir + \"countryInfoCSVru.txt\", 'rb') as rucountriescsv:    \n    rureader = csv.DictReader(rucountriescsv, delimiter='\\t')\n    for r in rureader:\n        countries_ru_h[r['iso alpha2'].lower()] = r['name']",
   "execution_count": 11,
   "outputs": []
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "print \"English country sample:\", countries_en[0]\nprint \"Russian country sample with 'ru':\", countries_ru_h['ru']",
   "execution_count": 12,
   "outputs": [
    {
     "output_type": "stream",
     "text": "English country sample: {'geonameId': '3041565', 'name': 'Andorra', 'areaInSqKm': '468.0', 'iso numeric': '020', 'languages': 'ca', 'currency': 'EUR', 'capital': 'Andorra la Vella', 'iso alpha3': 'AND', 'iso alpha2': 'ad', 'continent': 'EU', 'fips code': 'AN', 'population': '84000'}\nRussian country sample with 'ru': Россия\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "collapsed": false,
    "trusted": true
   },
   "cell_type": "code",
   "source": "out_country = {}\ncountries_count = 0\nbad_countries = []\nwith open(wdir + \"out_countries.csv\", 'wb') as csvoutfile:\n        fieldnames = ['id', 'code', 'name_en','name_ru', 'population']\n        pwriter = csv.DictWriter(csvoutfile, dialect='excel', fieldnames=fieldnames)\n        pwriter.writeheader()\n        for c in countries_en:\n                if c['name'] == '' or c['population'] == '0':\n                    bad_countries.append(c)\n                    continue\n                out_country['id'] = c['geonameId']\n                out_country['code'] = c['iso alpha2']\n                out_country['name_en'] = c['name']\n                out_country['name_ru'] = countries_ru_h.get(c['iso alpha2'])\n                out_country['population'] = c['population']\n                countries_count += 1\n                pwriter.writerow(out_country)\n\nprint \"Country sample:\", out_country\nprint \"Total countres:\", countries_count \nprint \"Bad countries:\", len(bad_countries), 'List:', map(lambda c: c['name'], bad_countries)",
   "execution_count": 13,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Country sample: {'name_ru': '\\xd0\\x97\\xd0\\xb8\\xd0\\xbc\\xd0\\xb1\\xd0\\xb0\\xd0\\xb1\\xd0\\xb2\\xd0\\xb5', 'name_en': 'Zimbabwe', 'code': 'zw', 'id': '878675', 'population': '11651858'}\nTotal countres: 246\nBad countries: 6 List: ['', 'Antarctica', 'Bouvet Island', '', 'Heard Island and McDonald Islands', 'U.S. Minor Outlying Islands']\n",
     "name": "stdout"
    }
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python2",
   "display_name": "Python 2",
   "language": "python"
  },
  "language_info": {
   "mimetype": "text/x-python",
   "nbconvert_exporter": "python",
   "name": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10",
   "file_extension": ".py",
   "codemirror_mode": {
    "version": 2,
    "name": "ipython"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
	{
	"cells": [
	{
	"metadata": {
	"collapsed": true
	},
	"cell_type": "markdown",
	"source": "\n## Parse cities and countries from geonames.org"
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "import csv\nimport sys\n\ncsv.field_size_limit(sys.maxsize)\n\nwdir = '/Users/ku/data/'",
	"execution_count": 3,
	"outputs": []
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## The table 'alternate names' :\n\nSource: http://download.geonames.org/export/dump/alternateNames.zip\n\n0. alternateNameId : the id of this alternate name, int\n1. geonameid : geonameId referring to id in table 'geoname', int\n2. isolanguage : iso 639 language code 2- or 3-characters; 4-characters 'post' for postal codes and 'iata','icao' and faac for airport codes, fr_1793 for French Revolution names, abbr for abbreviation, link for a website, varchar(7)\n3. alternate name : alternate name or name variant, varchar(200)\n4. isPreferredName : '1', if this alternate name is an official/preferred name\n5. isShortName : '1', if this is a short name like 'California' for 'State of California'\n6. isColloquial : '1', if this alternate name is a colloquial or slang term\n7. isHistoric : '1', if this alternate name is historic and was used in the past\n"
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "names_ru = {}\nwith open(wdir + \"alternateNames.txt\", 'rb') as csvfile: \n idreader = csv.reader(csvfile, delimiter='\\t')\n for r in idreader:\n if r[2] == \"ru\":\n names_ru[r[1]]=r[3]\nprint ",
	"execution_count": 4,
	"outputs": [
	{
	"output_type": "stream",
	"text": "\n",
	"name": "stdout"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## The main 'geoname' table has the following fields :\n\nSource: http://download.geonames.org/export/dump/allCountries.zip\n\n0. geonameid \| integer id of record in geonames database \|\n1. name \| name of geographical point (utf8) varchar(200)\n2. asciiname \| name of geographical point in plain ascii characters, varchar(200)\n3. alternatenames \| alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)\n4. latitude \| latitude in decimal degrees (wgs84)\n5. longitude \| longitude in decimal degrees (wgs84)\n6. feature class \| see http://www.geonames.org/export/codes.html, char(1)\n7. feature code \| see http://www.geonames.org/export/codes.html, varchar(10)\n8. country code \| ISO-3166 2-letter country code, 2 characters\n9. cc2 \| alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters\n10. admin1 code \| fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)\n11. admin2 code \| code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) \n12. admin3 code \| code for third level administrative division, varchar(20)\n13. admin4 code \| code for fourth level administrative division, varchar(20)\n14. population \| bigint (8 byte int) \n15. elevation \| in meters, integer\n16. dem \| digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.\n17. timezone \| the timezone id (see file timeZone.txt) varchar(40)\n18. modification date \| date of last modification in yyyy-MM-dd format"
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "not_localized = 0\nlocalized = 0\nwithout_country = 0\nro = []\nwith open(wdir + \"allCountries.txt\", 'rb') as csvfile: \n with open(wdir + \"out_cities.csv\", 'wb') as csvoutfile:\n pwriter = csv.writer(csvoutfile, dialect='excel')\n idreader = csv.reader(csvfile, delimiter='\\t')\n pwriter.writerow(['id','name_en', 'name_ru', 'latitude', 'longitude', 'code', 'population'])\n for r in idreader:\n name_ru = None\n if r[14] == '':\n continue\n if r[6] != 'P':\n continue\n if r[8] == '':\n without_country += 1\n continue\n if r[0] in names_ru:\n localized += 1\n name_ru = names_ru[r[0]]\n else:\n not_localized += 1\n \n ro = [r[0], r[1], name_ru, r[4], r[5], r[8].lower(), r[14]]\n \n if r[0] in names_ru and localized < 5:\n print ro\n pwriter.writerow(ro)\n\nprint \"Localized:\", localized, \"Not localized:\", not_localized, \"Without country:\", without_country\nprint \"Last row\", ro",
	"execution_count": 8,
	"outputs": [
	{
	"output_type": "stream",
	"text": "['3039154', 'El Tarter', '\\xd0\\xad\\xd0\\xbb \\xd0\\xa2\\xd0\\xb0\\xd1\\x80\\xd1\\x82\\xd0\\xb5\\xd1\\x80', '42.57952', '1.65362', 'ad', '1052']\n['3039163', 'Sant Juli\\xc3\\xa0 de L\\xc3\\xb2ria', '\\xd0\\xa1\\xd0\\xb0\\xd0\\xbd\\xd1\\x82-\\xd0\\x96\\xd1\\x83\\xd0\\xbb\\xd0\\xb8\\xd1\\x8f-\\xd0\\xb4\\xd0\\xb5-\\xd0\\x9b\\xd0\\xbe\\xd1\\x80\\xd0\\xb8\\xd1\\x8f', '42.46372', '1.49129', 'ad', '8022']\n['3039604', 'Pas de la Casa', '\\xd0\\x9f\\xd0\\xb0\\xd1\\x81 \\xd0\\xb4\\xd0\\xb5 \\xd0\\xbb\\xd0\\xb0 \\xd0\\x9a\\xd0\\xb0\\xd1\\x81\\xd0\\xb0', '42.54277', '1.73361', 'ad', '2363']\n['3039678', 'Ordino', '\\xd0\\x9e\\xd1\\x80\\xd0\\xb4\\xd0\\xb8\\xd0\\xbd\\xd0\\xbe', '42.55623', '1.53319', 'ad', '3066']\nLocalized: 238621 Not localized: 3471962 Without country: 0\nLast row ['8534443', 'Nhlangano', None, '-19.85776', '27.82236', 'zw', '0']\n",
	"name": "stdout"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## Countries in multiple languages\nSample urls to get csv of countries: \n\n- http://api.geonames.org/countryInfoCSV?lang=en&username=demo\n- http://api.geonames.org/countryInfoCSV?lang=ru&username=demo\n\n### Format:\n- iso alpha2\t\n- iso alpha3\t\n- iso numeric\t\n- fips code\t\n- name\t\n- capital\t\n- areaInSqKm\t\n- population\t\n- continent\t\n- languages\t\n- currency\t\n- geonameId"
	},
	{
	"metadata": {
	"collapsed": true,
	"trusted": true
	},
	"cell_type": "code",
	"source": "countries_en = []\ncountries_ru_h = {}",
	"execution_count": 9,
	"outputs": []
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "# English countries\nwith open(wdir + \"countryInfoCSVen.txt\", 'rb') as encountriescsv: \n enreader = csv.DictReader(encountriescsv, delimiter='\\t')\n for r in enreader:\n r['iso alpha2'] = r['iso alpha2'].lower()\n countries_en.append(r)",
	"execution_count": 10,
	"outputs": []
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "# Russian countries\nwith open(wdir + \"countryInfoCSVru.txt\", 'rb') as rucountriescsv: \n rureader = csv.DictReader(rucountriescsv, delimiter='\\t')\n for r in rureader:\n countries_ru_h[r['iso alpha2'].lower()] = r['name']",
	"execution_count": 11,
	"outputs": []
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "print \"English country sample:\", countries_en[0]\nprint \"Russian country sample with 'ru':\", countries_ru_h['ru']",
	"execution_count": 12,
	"outputs": [
	{
	"output_type": "stream",
	"text": "English country sample: {'geonameId': '3041565', 'name': 'Andorra', 'areaInSqKm': '468.0', 'iso numeric': '020', 'languages': 'ca', 'currency': 'EUR', 'capital': 'Andorra la Vella', 'iso alpha3': 'AND', 'iso alpha2': 'ad', 'continent': 'EU', 'fips code': 'AN', 'population': '84000'}\nRussian country sample with 'ru': Россия\n",
	"name": "stdout"
	}
	]
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "out_country = {}\ncountries_count = 0\nbad_countries = []\nwith open(wdir + \"out_countries.csv\", 'wb') as csvoutfile:\n fieldnames = ['id', 'code', 'name_en','name_ru', 'population']\n pwriter = csv.DictWriter(csvoutfile, dialect='excel', fieldnames=fieldnames)\n pwriter.writeheader()\n for c in countries_en:\n if c['name'] == '' or c['population'] == '0':\n bad_countries.append(c)\n continue\n out_country['id'] = c['geonameId']\n out_country['code'] = c['iso alpha2']\n out_country['name_en'] = c['name']\n out_country['name_ru'] = countries_ru_h.get(c['iso alpha2'])\n out_country['population'] = c['population']\n countries_count += 1\n pwriter.writerow(out_country)\n\nprint \"Country sample:\", out_country\nprint \"Total countres:\", countries_count \nprint \"Bad countries:\", len(bad_countries), 'List:', map(lambda c: c['name'], bad_countries)",
	"execution_count": 13,
	"outputs": [
	{
	"output_type": "stream",
	"text": "Country sample: {'name_ru': '\\xd0\\x97\\xd0\\xb8\\xd0\\xbc\\xd0\\xb1\\xd0\\xb0\\xd0\\xb1\\xd0\\xb2\\xd0\\xb5', 'name_en': 'Zimbabwe', 'code': 'zw', 'id': '878675', 'population': '11651858'}\nTotal countres: 246\nBad countries: 6 List: ['', 'Antarctica', 'Bouvet Island', '', 'Heard Island and McDonald Islands', 'U.S. Minor Outlying Islands']\n",
	"name": "stdout"
	}
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"name": "python2",
	"display_name": "Python 2",
	"language": "python"
	},
	"language_info": {
	"mimetype": "text/x-python",
	"nbconvert_exporter": "python",
	"name": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.10",
	"file_extension": ".py",
	"codemirror_mode": {
	"version": 2,
	"name": "ipython"
	}
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}