@jsta
Last active December 7, 2017 14:31
Counting the number of GLEON lake sites that resolve to a Wikipedia page
# GLEON lakes according to Wikipedia
library(rvest)
library(wikilake)
library(sf)
library(maps)
# get data ####
gleon_lakes <- read_html("http://gleon.org/lakes")
gleon_lakes <- html_nodes(gleon_lakes, ".views-field")
gleon_lakes <- html_text(gleon_lakes)
gleon_lake_names <- gleon_lakes[seq(1, length(gleon_lakes), by = 3)]
gleon_lake_names <- gsub("\n", "", gleon_lake_names)
gleon_lake_names <- trimws(gleon_lake_names)[-1]
# query Wikipedia for each lake name; a failed lookup becomes NA
res <- lapply(gleon_lake_names, function(x) {
  tryCatch(wikilake::lake_wiki(x), error = function(err) NA)
})
# clean missing lakes ####
res_clean <- res[unlist(lapply(res, function(x) !is.null(x)))]
res_clean <- res_clean[unlist(lapply(res_clean, function(x) length(x) > 1))]
res_clean <- res_clean[unlist(lapply(res_clean, function(x) !is.na(x[,"Lat"])))]
length(res_clean) / length(gleon_lake_names) # proportion of gleon lakes that resolve to wikipedia pages
# collapse list to data.frame ####
res_df_names <- unique(unlist(lapply(res_clean, names)))
res_df <- data.frame(matrix(NA, nrow = length(res_clean),
                            ncol = length(res_df_names)))
names(res_df) <- res_df_names
for(i in seq_along(res_clean)){
  # print(i) # debugging
  # pad each result with NA columns so every row has the full column set
  dt_pad <- data.frame(matrix(NA, nrow = 1,
                              ncol = length(res_df_names) - ncol(res_clean[[i]])),
                       stringsAsFactors = FALSE)
  names(dt_pad) <- res_df_names[!(res_df_names %in% names(res_clean[[i]]))]
  dt <- cbind(res_clean[[i]], dt_pad)
  dt <- dt[,res_df_names]
  res_df[i,] <- dt
}
# Keep only common columns #####
# `milakes` was never defined in this script; count non-missing values in res_df instead
good_cols <- names(res_df)[colSums(!is.na(res_df)) > 20]
res_final <- res_df[,good_cols]
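# cache results ####
# run saveRDS() once after scraping, then reload the cached copy with readRDS()
# saveRDS(res_final, "gleon-lakes_wikipedia.rds")
res_final <- readRDS("gleon-lakes_wikipedia.rds")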
world <- sf::st_as_sf(map("world", plot = FALSE, fill = TRUE))
res_sf <- st_as_sf(res_final, coords = c("Lon", "Lat"), crs = 4326)
plot(world$geometry)
plot(res_sf$geometry, add = TRUE, pch = 21, col = "red")
nrow(res_sf) / length(gleon_lake_names) # proportion found on Wikipedia
knitr::kable(gleon_lake_names[!(gleon_lake_names %in% res_sf$Name)]) # missing lakes

jsta commented Dec 7, 2017

According to this analysis, only 38.5% of GLEON lakes resolve to a Wikipedia page.

Some of these misses are due only to naming differences (for example, both "Erken" and "Lake Erken" appear in the list). With that caveat, here are the missing lakes:

Name
Alexandrina
Alkali
Annecy
Anyang River
Auburn Lake
Beaver Lake
Beaverdam Reservoir
Blue Chalk
Bright Lake
Bure
Caribou Lake
Carioca
Castle
Chub
Crampton Lake
Crosson
Crystal Bog Lake
Crystal Lake
Desbarats Lake
Dickie
Dom Helvecio
East Twin Lake
Emaiksoun Lake
Erken
Falling Creek Reservoir
Fredriksburg Slotso
Geneva
Gossenkoellesee
Hampenso
Heney
Highland Lake
Jekl Bog
Jhalar Lake
Kalar Kahar Lake
Kernan Lough
Khabbeki Lake
La Salada
Lac Croche
Lac Simoncouche
Lagoa Mangueira
Laguna Carpincho
Laguna Chascomús
Laguna Gómez
Laguna Grande
Lake Ähijärv
Lake Annie
Lake Auburn
Lake Bonney
Lake Champlain
Lake Erken
Lake Giles
Lake Giles
Lake Kinneret
Lake Kortowskie
Lake Lacawac
Lake Langtjern
Lake Lugano
Lake Mäeküla
Lake Maggiore
Lake Okaro
Lake Oneida
Lake Paajarvi
Lake Parentis
Lake Pühajärv
Lake Rotoiti
Lake Soyang
Lake Stechlin
Lake Taihu
Lake Tovel
Lake Tündre
Lake Tutira
Lake Verevi
Lake Viisjaagu
Lake Waahi
Lake Washington
Lake Waynewood
Lawrence
Le Bourget
Little Basswood Lake
Long Lake
Long Lake, Harrison Maine
Lough Corbet
Lough Erne
Lough Feeagh
Lough Furnace
Lough Namachree
Lough Neagh
Lunzer See
Manso Lake
Maumelle
Mirror Lake
Mouser Bog
Muggelsee
Myvatn
North Sparkling Bog
Otsego Lake
Peter Lake
Plastic
Red Chalk (and potentially others)
Represa de Itumbiara
Represa do Funil
Rondout
Sandy
Shelburne Pond
Simoncouche
Sparkling Lake
St Gribso
Terra Alta
Timber Bog
Toolik Lake
Trout Lake
Vedstedso
Vortsjarv
Ward Lake
West Long Lake (Kinwamakwad Lake)
West Twin
Wintergreen
Yuan-Yang Lake
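
Several entries above look like the same lake under two spellings ("Erken" and "Lake Erken", "Auburn Lake" and "Lake Auburn", "Simoncouche" and "Lac Simoncouche"). A minimal sketch of how such pairs could be flagged is below. `name_key()` is a hypothetical helper, not part of the gist or of `wikilake`, and the set of generic words it drops is an assumption; it does not handle diacritics or alternative names in parentheses.

```r
# Hypothetical helper: build a comparison key by lower-casing the name
# and dropping generic words such as "lake" (the word list is an assumption).
name_key <- function(x, generic = c("lake", "lac")) {
  key <- tolower(trimws(x))
  pattern <- paste0("\\b(", paste(generic, collapse = "|"), ")\\b")
  key <- gsub(pattern, "", key, perl = TRUE)
  key <- gsub("\\s+", " ", key)   # collapse repeated whitespace
  trimws(key)
}

# A few names taken from the missing-lakes list
missing_names <- c("Erken", "Lake Erken", "Auburn Lake", "Lake Auburn",
                   "Simoncouche", "Lac Simoncouche", "Toolik Lake")
keys <- name_key(missing_names)

# Keys that occur more than once point to likely naming variants
dup_keys <- unique(keys[duplicated(keys)])
is_dup <- keys %in% dup_keys
variants <- split(missing_names[is_dup], keys[is_dup])
print(variants)
```

The same keys could be compared instead of the raw names in the `%in%` check at the end of the script, which might reduce misses caused purely by spelling, at the cost of occasionally merging distinct lakes that share a short name.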
