@mnazarov
Created September 6, 2016 16:03
## Check all three-letter acronyms (TLAs) on Wikipedia
## A TLA is assumed to be unused if no Wikipedia entry with that name exists
library(rvest)
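# Range suffixes of the list pages linked from https://en.wikipedia.org/wiki/Wikipedia:TLAs
ranges <- c("AAA_to_DZZ", "EAA_to_HZZ", "IAA_to_LZZ", "MAA_to_PZZ",
            "QAA_to_TZZ", "UAA_to_XZZ", "YAA_to_ZZZ")
baseUrl <- "https://en.wikipedia.org/wiki/Wikipedia:TLAs_from_"
all <- c()     # every TLA listed on the pages
unused <- c()  # TLAs shown as red links, i.e. with no existing page
for (range in ranges) {
  # Download and parse each page once, then query the parsed document twice
  page <- read_html(paste0(baseUrl, range))
  all <- c(all, page %>% html_nodes("pre > a") %>% html_text())
  # MediaWiki marks links to nonexistent pages with the class "new"
  unused <- c(unused, page %>% html_nodes("pre > a.new") %>% html_text())
}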
length(all) # 26^3
# [1] 17576
length(unused)
# [1] 3563
head(unused)
# [1] "AKZ" "AQV" "AQX" "AQZ" "AWJ" "AXO"
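
## Offline sketch of the selector logic used above (not part of the original run).
## The HTML fragment is a made-up stand-in for one of the list pages and only
## assumes what the selectors imply: links sit directly inside <pre>, and links
## to missing pages carry class "new". The names sample_html, sample_page,
## sample_all and sample_unused are hypothetical.
library(rvest)
sample_html <- paste0(
  '<html><body><pre>',
  '<a href="/wiki/AAA" title="AAA">AAA</a> ',
  '<a href="/w/index.php?title=AKZ&amp;action=edit&amp;redlink=1" class="new" ',
  'title="AKZ (page does not exist)">AKZ</a>',
  '</pre></body></html>'
)
sample_page <- read_html(sample_html)
# All links inside <pre>: every listed TLA
sample_all <- sample_page %>% html_nodes("pre > a") %>% html_text()
# Only links with class "new": TLAs without an existing page
sample_unused <- sample_page %>% html_nodes("pre > a.new") %>% html_text()
sample_all
# [1] "AAA" "AKZ"
sample_unused
# [1] "AKZ"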