I catalogued ~50 issues filed against Natural Earth 10m Populated Places, with the goal of researching, verifying, and resolving them efficiently. These issues are summarized in the table below. Where an issue requires more than one operation to resolve, it appears in multiple rows.
These are the fields that contain the name(s) of the populated place: NAME, NAMEPAR, NAMEALT, NAMEASCII, name_en, name_de, name_es, name_fr, name_pt, name_ru, name_zh, name_ar, name_bn, name_el, name_hi, name_hu, name_id, name_it, name_ja, name_ko, name_nl, name_pl, name_sv, name_tr, name_vi
This first PR was initially intended to address 1 & 2 by adding, deleting, or modifying the metadata for specific records to reflect suggested changes to place IDs, etc., because I assumed it was better to do that ahead of batch edits/conflations with other data sources. In the end it also addressed 4, 5 and 6 for select issues, as requested in the review.
I've made no attempt to find additional places that need wikidataid or place type revisions - this may be warranted, as a more thorough review could surface more issues that could be addressed through reconflation of the entire populated places dataset, along with other NE datasets with localizations from Wikidata.
A second PR will revise all but a few of the issues covered in this gist. Unresolved issues are noted below.
So many of these issues were found by, researched by, and verified by @elliotap - high five Elliot.
These issues suggest adding or replacing a place: #449, #419, #411, #407, #406, #339, #334, #330, #323, #318
These issues suggest removing or replacing a place: #449, #411, #406, #405, #373, #365, #358, #356, #339, #334, #323
The proposed solution is to verify these places by searching for or reviewing an associated Wikidata ID, looking them up on OpenStreetMap, and checking them against commercial map providers.
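To make the Wikidata part of that check quicker, a Wikidata ID can be turned into a lookup URL. This is only a minimal sketch: `wikidata_url` is a hypothetical helper name, it assumes Wikidata's `Special:EntityData` URL pattern, and it builds the URL without making any request.

```r
# Hypothetical helper: build the Wikidata entity-data URL for a Wikidata ID (QID).
wikidata_url <- function(qid) {
  qid <- trimws(as.character(qid))
  # A QID is "Q" followed by digits; anything else (empty, NA, typo) yields NA
  valid <- grepl("^Q[1-9][0-9]*$", qid)
  ifelse(valid,
         paste0("https://www.wikidata.org/wiki/Special:EntityData/", qid, ".json"),
         NA_character_)
}

# QIDs taken from the table in this gist, plus one empty value
urls <- wikidata_url(c("Q58401", "Q2279283", ""))
urls
```

Records that come back as `NA` are the ones with no usable Wikidata ID and would need to be checked by hand.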
#389, #388, #387, #385, #384, #383, #381, #377, #371, #368, #367, #364, #362, #310, #293
This is beyond the scope of what I've undertaken, as I'm only looking at 10m Populated Places, whereas localizations are included for many (most?) NE feature sets.
These issues should be resolved when localizations are batch updated: #399, #329
The `name_*` fields are associated with Wikidata via `wikidataid`. Once the localizations are updated, it might make sense to further reconcile `NAME` to `name_en`.
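A minimal sketch of how records whose `NAME` differs from `name_en` might be listed for review. The data frame here is a toy stand-in for the attribute table: the names come from the issue table in this gist, while the `ne_id` values and the `name_en` values are illustrative assumptions, not read from Wikidata.

```r
# Toy stand-in for the attribute table (ne_id and name_en values are made up)
places <- data.frame(
  ne_id   = c(1, 2, 3),
  NAME    = c("Ujungpandang", "Kyoto", "Gar"),
  name_en = c("Makassar", "Kyoto", NA),
  stringsAsFactors = FALSE
)

# Flag records where both values are present but differ;
# records with a missing name_en are left out rather than flagged
mismatch <- !is.na(places$NAME) & !is.na(places$name_en) &
  places$NAME != places$name_en

places[mismatch, c("ne_id", "NAME", "name_en")]
```

A difference is not necessarily an error, so the result is a review list rather than a list of corrections.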
Based on issues, modify `NAME` and move the old value to `NAMEALT` or `NAMEPAR`.
These issues request updates to label (name) fields that I corroborated by looking up the associated Wikidata record, based on the wikidata ID for that record: #453, #452, #448, #430, #428, #413, #398, #381, #380, #369, #366, #336
All of these issues can be resolved by moving the current name to `NAMEALT` or `NAMEPAR` and overwriting `NAME`, `NAMEASCII`, and the localized names based on the `wikidataid`.
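The rename step could look roughly like the sketch below. `rename_place` is a hypothetical helper, not the script used for the PR. It assumes records are identified by `ne_id` and that multiple alternate names are joined with `|`, and it leaves the localized `name_*` fields alone, since those would come from Wikidata.

```r
# Hypothetical helper: rename one record, keeping the old NAME
# in NAMEALT or NAMEPAR.
rename_place <- function(places, id, new_name, new_ascii = new_name,
                         keep_in = "NAMEALT") {
  stopifnot(keep_in %in% c("NAMEALT", "NAMEPAR"))
  row <- which(places$ne_id == id)
  stopifnot(length(row) == 1)  # ne_id should identify exactly one record
  old_name <- places$NAME[row]
  existing <- places[[keep_in]][row]
  # Assumption: append to an existing alternate name using "|" as separator
  places[[keep_in]][row] <- if (is.na(existing) || existing == "") {
    old_name
  } else {
    paste(existing, old_name, sep = "|")
  }
  places$NAME[row] <- new_name
  places$NAMEASCII[row] <- new_ascii
  places
}

# Toy data: names come from issue #398; the ne_id values are made up
places <- data.frame(
  ne_id     = c(10, 20),
  NAME      = c("Ujungpandang", "Kyoto"),
  NAMEASCII = c("Ujungpandang", "Kyoto"),
  NAMEALT   = c(NA_character_, NA_character_),
  NAMEPAR   = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

updated <- rename_place(places, id = 10, new_name = "Makassar")
updated[, c("ne_id", "NAME", "NAMEASCII", "NAMEALT")]
```

Passing `new_ascii` explicitly avoids relying on automatic transliteration for names that contain non-ASCII characters.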
Comparison between suggestions and Wikidata
All label/name fields that could be adjusted to match Wikidata value
R Script used to generate these
EXCEPTIONS: These issues request updates to label (name) fields that are NOT supported by the associated wikidata ID: #454, #449, #408, #374, #335, #333
These issues suggest revising population fields: #407, #419, #414, #412, #384, #359
These issues suggest location information for this place should be edited: #450, #407, #409, #338
Though I'm relatively familiar with Natural Earth data, I've not been involved with Natural Earth development, and I am learning as I go through past issues, revisions, and contributions to this repo. Please correct or comment on what I think I understand about how these records are derived. I also understand the scope of revisions for the next release may not have been established yet. I have thus taken a very naive approach to investigating issues and presenting findings.
Here's my naive guess at the order of operations:
- Check IDs - For the purposes of my analysis I used `wikidataid`, which is missing values in 177 records. Of the 5 or 6 records I investigated with missing wikidataids, none were good data, though they have unique/valid `wof_id` and `ne_id`. 10m Populated Places has 6 fields whose names end in "id", but only 2 of these (`wof_id` and `ne_id`) are defined for all records. Note that the `id$` pattern in the snippet also matches `name_id`, which is listed above among the localized name fields, so it is probably not an identifier. See the code snippet with totals below. I suggest removing the other fields from the attribute table or resolving their missing values, and/or using them to validate existing records.

      # TRUE when a value is missing, empty, or non-positive
      not.defined <- function(x) {
        if (is.factor(x)) { x <- as.character(x) }
        # Strings made up only of digits are compared as numbers
        if (!grepl("\\D", x)) { x <- as.numeric(x) }
        is.null(x) || is.na(x) || is.nan(x) || x == '' || x <= 0
      }
      # Fields whose names end in "id" (case-insensitive)
      ids <- grep("id$", names(ne_10m_populated_places), ignore.case = TRUE, perl = TRUE, value = TRUE)
      # Count undefined values per field
      sapply(ids, function(id) { sum(sapply(ne_10m_populated_places[[id]], not.defined)) })
      #  GEONAMEID     UN_FID wikidataid     wof_id    name_id      ne_id
      #        545       6754        177          0       7339          0
- Add and remove new records - issues whose resolution would require this are listed in the subsections below; however, I don't know the criteria for inclusion, and thus the full scope of what should be added or removed. This step could be partially resolved by better understanding the IDs and the conflation protocols they imply, as suggested above.
- Apply conflation and/or batch edits - e.g. in order to update localizations and all other name/label fields (currently there are gross inconsistencies between those derived from/corresponding to `wikidataid` and others), feature class types, population estimates, etc.
- Apply manual edits - wherever overrides may be appropriate for the purposes of this dataset, for instance point placement.
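As one example of using the ID fields to validate existing records (the first step above), records that share a non-empty `wikidataid` could be flagged as possible duplicates, which is the situation described in issues such as #365. This is a sketch on toy data: the place names come from the table in this gist, while the `ne_id` values and the `Q_a`/`Q_b` identifiers are made-up placeholders.

```r
# Toy stand-in for the attribute table (IDs are placeholders, not real QIDs)
places <- data.frame(
  ne_id      = c(1, 2, 3, 4, 5),
  NAME       = c("Bandar Lampung", "Bandarlampung", "Kyoto", "Gar", "El Cayo"),
  wikidataid = c("Q_a", "Q_a", "Q_b", "", ""),
  stringsAsFactors = FALSE
)

# Ignore missing or empty IDs so that blank values are not reported as duplicates
has_id <- !is.na(places$wikidataid) & places$wikidataid != ""

# duplicated() from both directions marks every member of a duplicate group
shared <- has_id & (duplicated(places$wikidataid) |
                    duplicated(places$wikidataid, fromLast = TRUE))

places[shared, ]
```

The same check could be repeated for any of the other ID fields by swapping the column name.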
Preliminary comments/questions:
- For the sake of maintaining this data intelligibly, is it reasonable to assume that 10m Populated Places has dependencies on Who's on First, and by extension, Wikidata and/or OpenStreetMap, that should govern (a) what places are included or excluded, (b) canonical spellings of names and localizations, and (c) `FEATURECLA` (feature class) types (Admin 0, Admin 1, Populated Place, etc.)?
- Are there fields in the attribute table that it might be reasonable to deprecate? For example, those related to Geonames.
Issue | Place Name | Suggested Revision | Source(s) | Comment |
---|---|---|---|---|
#454 | Pondicherry | Puducherry | Wikipedia, wikidata=Q639421 | Wikipedia |
#453 | Rida | Rada'a | Wikipedia, wikidata=Q2125362 | Suggest revise name fields |
#452 | Ammochostos | Famagusta | Wikipedia, news, other scholarship | Suggest remove name Ammochostos and replace with Famagusta |
#451 | | | Wikipedia | Add Farsi localization |
#450 | Bodø, Norway | Bodø, Norway | Wikipedia | Move location |
#449 | Sakarya | Adapazarı | Wikipedia, wikidataid Q175323 | Consolidate fields into 1159113425 and remove surplus record 1159146375 |
#448 | 11 places in Karnataka, India | | News article | Suggested correction accepted |
#430 | Shijianzhuang | Shijiazhuang | Wikipedia | Suggested correction supported by associated wikidata ID Q58401 |
#430 | Xiangfan | Xiangyang | Wikipedia | Suggested correction supported by associated wikidata ID Q71284 |
#428 | Savetskaya Gavan | Sovetskaya Gavan | Wikipedia | Suggested correction supported by associated wikidata ID Q196374 |
#419 | Saguenay | | Wikipedia, OSM | Add place |
#414 | Newcastle | | Wikipedia | Update population |
#413 | Astana | Nur-Sultan | Wikipedia | Suggested correction supported by associated wikidata ID Q1520 |
#412 | Korla | | Wikipedia | Update population |
#411 | Jinxi | Huludao | Wikipedia | Suggestion to replace old place name verified |
#409 | | | Wikipedia | Adjust the locations of 4 places |
#408 | Phnom Tbeng Meanchey | Preah Vihear | Wikipedia | Suggestion to replace old place name verified |
#407 | Nnewi | | Wikipedia | Add place |
#406 | Bose | Baise | Wikipedia | Wrong wikidata ID (should be Q571949) |
#405 | North Shore | | Wikipedia | Suggests deleting from Populated Places as it was incorporated into Auckland NZ in 2010 |
#399 | Puebla Garcia | Puebla | Wikipedia | Puebla not Puebla Garcia should be the name_en (wikidata name_en for Q125293 is now 'Puebla City') |
#398 | Ujungpandang | Makassar | Wikipedia | Suggested correction supported by associated wikidata ID Q14634 |
#389 | Dar es Salaam | | Wikipedia | Not an Admin-0 Capital |
#388 | Kyoto | | Wikipedia | Not an Admin-0 Capital |
#387 | Tomakomai | | Wikipedia | Not an Admin-1 Capital |
#385 | Lagos | | Wikipedia | Not an Admin-0 Capital |
#384 | Zhengzhou | | Wikipedia | Should be an Admin-0 Capital |
#383 | Shijianzhuang | Shijiazhuang | Wikipedia | Should be an Admin-0 Capital |
#381 | Rangoon | Yangon | Wikipedia | Should be an Admin-1 Capital instead of Admin-0 |
#380 | Sevastapol | Sevastopol | Wikipedia | Suggested correction supported by associated wikidata ID Q7525 |
#377 | Makkah | Makkah | Wikipedia | Should be an Admin-1 Capital instead of a Populated Place |
#374 | AmundseniScott South Pole Station | Amundsen-Scott South Pole Station | Wikipedia | Suggested correction supported by associated wikidata ID Q243307 |
#373 | Vila Velha | | Wikipedia | Suggests removing as this place does not exist |
#371 | Macau | | Wikipedia | Should be an Admin-1 Capital instead of Admin-0 |
#371 | Hong Kong | | Wikipedia | Should be an Admin-1 Capital instead of Admin-0 |
#369 | Lupanshui | Liupanshui | Wikipedia | Suggested correction supported by associated wikidata ID |
#368 | Patra | Patra | Wikipedia | Should be an Admin-1 Capital instead of a Populated Place |
#367 | Novi Sad | Novi Sad | Wikipedia | Should be Admin-1 Capital instead of Admin-0 |
#366 | Akureyi | Akureyri | Wikipedia | Suggested correction supported by associated wikidata ID |
#365 | Bandar Lampung | | Wikipedia | Listed twice: suggests removing the Populated Place record and keeping Admin-1 record |
#364 | Johannesburg | | | Should be an Admin-1 Capital instead of Admin-0 |
#362 | Xian | | | Should be an Admin-1 not a populated place |
#359 | Kandalaksha | | | Suggests MAXPOP is wrong |
#358 | Natal | | | Suggests removing as this place does not exist |
#356 | Noginsk | | Wikipedia | Suggests removing as this place does not exist |
#339 | Gar | | Wikipedia | Suggest deleting this record as place no longer exists and no associated Wikidata ID |
#339 | Shiquanhe | | Wikipedia, OSM, Google | Add place Q2279283 |
#338 | | | Wikipedia | Adjust the locations of 3 places |
#336 | Bandarlampung | Bandar Lampung | Wikipedia, city website | Bandar Lampung not Bandarlampung should be the name_en - suggested correction supported by associated wikidata ID |
#335 | Kailua | Kailua-Kona | Wikipedia | Suggests updating name_* localizations that list name as Kailua to Kailua-Kona - suggested correction NOT supported by associated wikidata ID |
#334 | Bol'sheretsk | | Wikipedia, news stories | Suggest deleting this record as place no longer exists and no associated Wikidata ID |
#334 | Ust-Bolsheretsk | | Wikipedia, news stories | Add place Q2502620 |
#333 | Lansdowne House | | Wikipedia and other sources | Suggest removing as this is an old name |
#333 | Neskantaga | | Wikipedia and other sources | Add this place (no wikidata ID for the town just the sovereignty) |
#330 | Belushya Guba | | Wikipedia | Add place |
#329 | Chittagong | Chattogram | News and official website | Suggested correction NOT supported by associated wikidata ID |
#329 | Jessore | Jashore | News and official website | Suggested correction NOT supported by associated wikidata ID |
#329 | Barisal | Barishal | News and official website | Suggested correction NOT supported by associated wikidata ID |
#323 | El Cayo | | Wikipedia, OSM | Suggest removing as this place no longer exists (no wikidata ID) |
#323 | San Ignacio | | Wikipedia, OSM | Add place (modern name of El Cayo) |
#318 | Horten | | Wikipedia | Add place |
#310 | Chinhoyi | | Wikipedia | Should be an Admin-1 not a Populated Place |
#310 | Kariba | | Wikipedia | Should be a Populated Place not an Admin-1 |
#293 | | | Wikipedia | Several Admin-1 Capitals in Chad should be added or updated |