@mizmay
Last active April 5, 2021 05:54
How to address issues filed against Natural Earth 10m Populated Places

Issues filed against Natural Earth 10m Populated Places

I catalogued ~50 issues filed against Natural Earth 10m Populated Places, with the goal of researching, verifying, and resolving them efficiently. These issues are summarized in the table below. Some issues appear in more than one section or row because resolving them requires more than one operation.

These are the fields that contain the name(s) of the populated place: NAME, NAMEPAR, NAMEALT, NAMEASCII, name_en, name_de, name_es, name_fr, name_pt, name_ru, name_zh, name_ar, name_bn, name_el, name_hi, name_hu, name_id, name_it, name_ja, name_ko, name_nl, name_pl, name_sv, name_tr, name_vi
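A minimal sketch of how these name fields might be picked out of the attribute table in R. The `places` data frame is a toy stand-in for the real layer, and `name_fields` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: return the names of all name/label columns in df.
name_fields <- function(df) {
  # Upper-case label fields plus the lower-case name_<lang> localizations.
  grep("^(NAME|NAMEPAR|NAMEALT|NAMEASCII|name_[a-z]{2})$", names(df), value = TRUE)
}

# Toy stand-in for the attribute table (only a few columns, illustrative values).
places <- data.frame(
  NAME = "Astana", NAMEALT = NA, NAMEASCII = "Astana",
  name_en = "Nur-Sultan", name_de = "Nur-Sultan",
  wikidataid = "Q1520",
  stringsAsFactors = FALSE
)

name_fields(places)
```

Note that the pattern treats `name_id` as a localization (it appears in the list above), so it should not be mistaken for an identifier column.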

Steps for Resolving Issues

The first PR was initially intended to address steps 1 and 2 by adding, deleting, or modifying the metadata for specific records to reflect suggested changes to place IDs and similar fields, because I assumed it was better to do that ahead of batch edits or conflations with other data sources. In the end it also addressed steps 4, 5, and 6 for select issues, as requested in the review.

I've made no attempt to find additional places that need wikidataid or place type revisions - this may be warranted, as a more thorough review could surface more issues that could be addressed through reconflation of the entire populated places dataset, along with other NE datasets with localizations from Wikidata.

A second PR will revise all but a few of the issues covered in this gist. Unresolved issues are noted below.

So many of these issues were found by, researched by, and verified by @elliotap - high five Elliot.

1. Verify Suggested Updates

These issues suggest adding or replacing a place: #449, #419, #411, #407, #406, #339, #334, #330, #323, #318

These issues suggest removing or replacing a place: #449, #411, #406, #405, #373, #365, #358, #356, #339, #334, #323

The proposed solution is to verify each of these places by searching for or reviewing its associated Wikidata ID, looking it up on OpenStreetMap, and checking it against commercial map providers.
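As a rough sketch of the lookup step, the helper below builds the pages one might open for a given place. `lookup_urls` is a hypothetical name; the URL patterns are the public Wikidata entity page and the OpenStreetMap search page, and no request is actually made.

```r
# Hypothetical helper: build lookup URLs for one place.
# The Wikidata URL is only included when an ID is available.
lookup_urls <- function(name, wikidataid = NA) {
  urls <- c(osm = paste0("https://www.openstreetmap.org/search?query=",
                         utils::URLencode(name, reserved = TRUE)))
  if (!is.na(wikidataid) && nzchar(wikidataid)) {
    urls <- c(wikidata = paste0("https://www.wikidata.org/wiki/", wikidataid), urls)
  }
  urls
}

lookup_urls("Nur-Sultan", "Q1520")  # place with a Wikidata ID
lookup_urls("Saguenay")             # place without one: OSM search only
```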

2. Revise Feature Class (Capitals)

#389, #388, #387, #385, #384, #383, #381, #377, #371, #368, #367, #364, #362, #310, #293
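A minimal sketch of what a single feature class revision could look like. It assumes the class is stored in the `FEATURECLA` column and that values look like "Admin-0 capital" and "Admin-1 capital"; the helper name and the toy data are hypothetical, and the real values should be checked against the dataset.

```r
# Hypothetical helper: change FEATURECLA for the single record with this NAME.
set_feature_class <- function(df, name, new_class) {
  rows <- which(df$NAME == name)
  # Refuse to guess when the name is missing or duplicated.
  if (length(rows) != 1) stop("expected exactly one record named ", name)
  df$FEATURECLA[rows] <- new_class
  df
}

# Toy stand-in for the attribute table (class values are assumed).
places <- data.frame(
  NAME = c("Rangoon", "Kariba"),
  FEATURECLA = c("Admin-0 capital", "Admin-1 capital"),
  stringsAsFactors = FALSE
)

# Issue #381: should be an Admin-1 capital instead of Admin-0.
updated <- set_feature_class(places, "Rangoon", "Admin-1 capital")
updated
```

Matching on `NAME` keeps the example short; matching on `ne_id` would be safer for places such as Bandar Lampung that are listed twice.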

3. Update Localizations

This is beyond the scope of what I've undertaken, as I'm only looking at 10m Populated Places, whereas localizations are included for many (most?) NE feature sets.

These issues should be resolved when localizations are batch updated: #399, #329

The name_* fields are associated with Wikidata via wikidataid. Once the localizations are updated, it might make sense to further reconcile NAME to name_en.
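One way to see how far NAME and name_en have drifted apart is to list the records where they differ. This is only a sketch on toy data; `name_mismatches` is a hypothetical helper, and the three rows are taken from names and Wikidata IDs mentioned in the table below.

```r
# Hypothetical helper: records whose NAME differs from the English localization.
name_mismatches <- function(df) {
  # which() drops rows where either value is NA, so they are not reported.
  differs <- which(df$NAME != df$name_en)
  df[differs, c("NAME", "name_en", "wikidataid")]
}

# Toy stand-in for the attribute table.
places <- data.frame(
  NAME = c("Astana", "Makassar", "Sevastapol"),
  name_en = c("Nur-Sultan", "Makassar", "Sevastopol"),
  wikidataid = c("Q1520", "Q14634", "Q7525"),
  stringsAsFactors = FALSE
)

name_mismatches(places)
```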

4. Update NAME, NAMEALT, NAMEPAR, NAMEASCII

Based on issues, modify NAME and move old value to NAMEALT or NAMEPAR.

These issues request updates to label (name) fields that I corroborated by looking up the associated Wikidata record, based on the wikidata ID for that record: #453, #452, #448, #430, #428, #413, #398, #381, #380, #369, #366, #336

All of these issues can be resolved by moving the current name to NAMEALT or NAMEPAR and overwriting NAME, NAMEASCII, and the localized names based on the wikidataid.
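The rename step could be sketched as follows. This is an illustration under assumptions, not the script used for the PR: `rename_place` is a hypothetical helper, records are matched on `ne_id`, multiple alternate names are assumed to be joined with "|", and the `ne_id` values in the toy data are placeholders. Localized name_* fields are left alone here because they would come from Wikidata.

```r
# Hypothetical helper: give the record with this ne_id a new NAME and
# keep the previous name in NAMEALT.
rename_place <- function(df, id, new_name, new_ascii = new_name) {
  row <- which(df$ne_id == id)
  stopifnot(length(row) == 1)
  old <- df$NAME[row]
  alt <- df$NAMEALT[row]
  # Append to NAMEALT rather than overwrite an alternate name already there.
  df$NAMEALT[row] <- if (is.na(alt) || alt == "") old else paste(alt, old, sep = "|")
  df$NAME[row] <- new_name
  df$NAMEASCII[row] <- new_ascii
  df
}

# Toy stand-in for the attribute table (ne_id values are placeholders).
places <- data.frame(
  ne_id = c(1, 2),
  NAME = c("Ujungpandang", "Astana"),
  NAMEALT = c(NA, "Example Alt"),
  NAMEASCII = c("Ujungpandang", "Astana"),
  stringsAsFactors = FALSE
)

places <- rename_place(places, 1, "Makassar")    # issue #398
places <- rename_place(places, 2, "Nur-Sultan")  # issue #413
places
```

Passing `new_ascii` separately avoids relying on automatic transliteration, which behaves differently across platforms.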

Comparison between suggestions and Wikidata

All label/name fields that could be adjusted to match Wikidata value

R Script used to generate these

EXCEPTIONS: These issues request updates to label (name) fields that are NOT supported by the associated wikidata ID: #454, #449, #408, #374, #335, #333.

5. Revise Population

These issues suggest revising population fields: #407, #419, #414, #412, #384, #359

6. Edit Geometry

These issues suggest that the location of a place should be edited: #450, #407, #409, #338

Notes & Questions

Though I'm relatively familiar with Natural Earth data, I've not been involved with Natural Earth development, and I am learning as I go by reading past issues, revisions, and contributions to this repo. Please correct or comment on what I think I understand about how the records are derived. I also understand that the scope of revisions for the next release may not have been established yet, so I have taken a very naive approach to investigating issues and presenting findings.

Here's my naive guess at the order of operations:

  1. Check IDs - For the purposes of my analysis I used wikidataid, which is missing values in 177 records. Of the 5 or 6 I investigated with missing wikidataids, none were good data, though they have unique/valid wof_id and ne_id. 10m Populated Places has 6 ID fields, but only 2 of these (wof_id and ne_id) are defined for all records. See code snippet with totals below. Suggest removing or resolving all values for the other four fields from the attribute tables and/or using them to validate existing records.
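> not.defined <- function(x) {
  # NULL and NA (including NaN) count as undefined.
  if (is.null(x) || is.na(x)) { return(TRUE) }
  if (is.factor(x)) { x <- as.character(x) }
  # Convert purely numeric strings (e.g. "-1", "0") so the <= 0 test is numeric.
  if (is.character(x) && grepl("^-?\\d+$", x)) { x <- as.numeric(x) }
  # Empty strings and non-positive numbers count as undefined.
  x == '' || (is.numeric(x) && x <= 0)
}

> # Note: "id$" also matches name_id, which the field list above includes among the localized names.
> ids <- grep("id$", names(ne_10m_populated_places), ignore.case = TRUE, perl = TRUE, value = TRUE)
> sapply(ids, function(id) { sum(sapply(ne_10m_populated_places[[id]], not.defined)) } )
 GEONAMEID     UN_FID wikidataid     wof_id    name_id      ne_id 
       545       6754        177          0       7339          0 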
  2. Add and remove records - issues whose resolution would require this are listed in subsections below; however, I don't know the criteria for inclusion, and thus the full scope of what should be added or removed. This step could be partially resolved by better understanding the IDs and the conflation protocols they imply, as suggested above.

  3. Apply conflation and/or batch edits - e.g. in order to update localizations and all other name/label fields (currently there are gross inconsistencies between those derived from or corresponding to wikidataid and the others), feature class types, population estimates, etc.

  4. Apply manual edits - wherever overrides may be appropriate for the purposes of this dataset, for instance point placement.

Preliminary comments/questions:

  1. For the sake of maintaining this data intelligibly, is it reasonable to assume that 10m Populated Places has dependencies on Who's on First, and by extension, Wikidata and/or OpenStreetMap that should govern (a) what places are included or excluded (b) canonical spellings of names and localizations (c) FEATURECLA (feature class) types (Admin 0, Admin 1, Populated Place, etc.)?

  2. Are there fields in the attribute table that it might be reasonable to deprecate? For example, those related to Geonames.

| Issue | Place Name | Suggested Revision | Source(s) | Comment |
|---|---|---|---|---|
| #454 | Pondicherry | Puducherry | Wikipedia, wikidata=Q639421 | Wikipedia |
| #453 | Rida | Rada'a | Wikipedia, wikidata=Q2125362 | Suggest revise name fields |
| #452 | Ammochostos | Famagusta | Wikipedia, news, other scholarship | Suggest remove name Ammochostos and replace with Famagusta |
| #451 | | | Wikipedia | Add Farsi localization |
| #450 | Bodø, Norway | Bodø, Norway | Wikipedia | Move location |
| #449 | Sakarya | Adapazarı | Wikipedia, wikidataid Q175323 | Consolidate fields into 1159113425 and remove surplus record 1159146375 |
| #448 | 11 places in Karnataka, India | | News article | Suggested correction accepted |
| #430 | Shijianzhuang | Shijiazhuang | Wikipedia | Suggested correction supported by associated wikidata ID Q58401 |
| #430 | Xiangfan | Xiangyang | Wikipedia | Suggested correction supported by associated wikidata ID Q71284 |
| #428 | Savetskaya Gavan | Sovetskaya Gavan | Wikipedia | Suggested correction supported by associated wikidata ID Q196374 |
| #419 | Saguenay | | Wikipedia, OSM | Add place |
| #414 | Newcastle | | Wikipedia | Update population |
| #413 | Astana | Nur-Sultan | Wikipedia | Suggested correction supported by associated wikidata ID Q1520 |
| #412 | Korla | | Wikipedia | Update population |
| #411 | Jinxi | Huludao | Wikipedia | Suggestion to replace old place name verified |
| #409 | | | Wikipedia | Adjust the locations of 4 places |
| #408 | Phnom Tbeng Meanchey | Preah Vihear | Wikipedia | Suggestion to replace old place name verified |
| #407 | Nnewi | | Wikipedia | Add place |
| #406 | Bose | Baise | Wikipedia | Wrong wikidata ID (should be Q571949) |
| #405 | North Shore | | Wikipedia | Suggests deleting from Populated Places as it was incorporated into Auckland NZ in 2010 |
| #399 | Puebla Garcia | Puebla | Wikipedia | Puebla not Puebla Garcia should be the name_en (wikidata name_en for Q125293 is now 'Puebla City') |
| #398 | Ujungpandang | Makassar | Wikipedia | Suggested correction supported by associated wikidata ID Q14634 |
| #389 | Dar es Salaam | | Wikipedia | Not an Admin-0 Capital |
| #388 | Kyoto | | Wikipedia | Not an Admin-0 Capital |
| #387 | Tomakomai | | Wikipedia | Not an Admin-1 Capital |
| #385 | Lagos | | Wikipedia | Not an Admin-0 Capital |
| #384 | Zhengzhou | | Wikipedia | Should be an Admin-0 Capital |
| #383 | Shijianzhuang | Shijiazhuang | Wikipedia | Should be an Admin-0 Capital |
| #381 | Rangoon | Yangon | Wikipedia | Should be an Admin-1 Capital instead of Admin-0 |
| #380 | Sevastapol | Sevastopol | Wikipedia | Suggested correction supported by associated wikidata ID Q7525 |
| #377 | Makkah | Makkah | Wikipedia | Should be an Admin-1 Capital instead of a Populated Place |
| #374 | AmundseniScott South Pole Station | Amundsen-Scott South Pole Station | Wikipedia | Suggested correction supported by associated wikidata ID Q243307 |
| #373 | Vila Velha | | Wikipedia | Suggests removing as this place does not exist |
| #371 | Macau | | Wikipedia | Should be an Admin-1 Capital instead of Admin-0 |
| #371 | Hong Kong | | Wikipedia | Should be an Admin-1 Capital instead of Admin-0 |
| #369 | Lupanshui | Liupanshui | Wikipedia | Suggested correction supported by associated wikidata ID |
| #368 | Patra | Patra | Wikipedia | Should be an Admin-1 Capital instead of a Populated Place |
| #367 | Novi Sad | Novi Sad | Wikipedia | Should be Admin-1 Capital instead of Admin-0 |
| #366 | Akureyi | Akureyri | Wikipedia | Suggested correction supported by associated wikidata ID |
| #365 | Bandar Lampung | | Wikipedia | Listed twice: suggests removing the Populated Place record and keeping Admin-1 record |
| #364 | Johannesburg | | | Should be an Admin-1 Capital instead of Admin-0 |
| #362 | Xian | | | Should be an Admin-1 not a populated place |
| #359 | Kandalaksha | | | Suggests MAXPOP is wrong |
| #358 | Natal | | | Suggests removing as this place does not exist |
| #356 | Noginsk | | Wikipedia | Suggests removing as this place does not exist |
| #339 | Gar | | Wikipedia | Suggest deleting this record as place no longer exists and no associated Wikidata ID |
| #339 | Shiquanhe | | Wikipedia, OSM, Google | Add place Q2279283 |
| #338 | | | Wikipedia | Adjust the locations of 3 places |
| #336 | Bandarlampung | Bandar Lampung | Wikipedia, city website | Bandar Lampung not Bandarlampung should be the name_en - suggested correction supported by associated wikidata ID |
| #335 | Kailua | Kailua-Kona | Wikipedia | Suggests updating name_* localizations that list name as Kailua to Kailua-Kona - suggested correction NOT supported by associated wikidata ID |
| #334 | Bol'sheretsk | | Wikipedia, news stories | Suggest deleting this record as place no longer exists and no associated Wikidata ID |
| #334 | Ust-Bolsheretsk | | Wikipedia, news stories | Add place Q2502620 |
| #333 | Lansdowne House | | Wikipedia and other sources | Suggest removing as this is an old name |
| #333 | Neskantaga | | Wikipedia and other sources | Add this place (no wikidata ID for the town just the sovereignty) |
| #330 | Belushya Guba | | Wikipedia | Add place |
| #329 | Chittagong | Chattogram | News and official website | Suggested correction NOT supported by associated wikidata ID |
| #329 | Jessore | Jashore | News and official website | Suggested correction NOT supported by associated wikidata ID |
| #329 | Barisal | Barishal | News and official website | Suggested correction NOT supported by associated wikidata ID |
| #323 | El Cayo | | Wikipedia, OSM | Suggest removing as this place no longer exists (no wikidata ID) |
| #323 | San Ignacio | | Wikipedia, OSM | Add place (modern name of El Cayo) |
| #318 | Horten | | Wikipedia | Add place |
| #310 | Chinhoyi | | Wikipedia | Should be an Admin-1 not a Populated Place |
| #310 | Kariba | | Wikipedia | Should be a Populated Place not an Admin-1 |
| #293 | | | Wikipedia | Several Admin-1 Capitals in Chad should be added or updated |