@fdabl
Created February 20, 2015 22:01
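library(rvest)
library(dplyr)

# Scrape the most-viewed Wikipedia articles per language from the wikitrends
# yearly overview. `year` must be in 2012:2014; `top` must be 10, 25 or 100.
scrape <- function(year, top) {
  possible <- c(10, 25, 100)
  if (!top %in% possible) stop("top must be 10, 25 or 100!")
  if (!year %in% 2012:2014) stop("data available only for years 2012 - 2014")

  # rank tiers up to and including `top`, e.g. top = 25 gives c(10, 25)
  targets <- possible[1:which(possible == top)]
  # one CSS selector per tier, e.g. "tbody > tr[class='top10']"
  separate <- sapply(targets, function(t) sub("rank", t, "tbody > tr[class='toprank']"))
  sel <- paste(separate, collapse = ", ")

  url <- sub("year", year, "https://tools.wmflabs.org/wikitrends/year.html")
  page <- read_html(url)  # replaces html(), which is deprecated in newer rvest

  # keep only the first line of each row's text
  clean <- function(text) sapply(strsplit(text, "\n"), "[", 1)

  # article titles, with the leading "<rank>. " prefix stripped
  item <- page %>%
    html_nodes(sel) %>%
    html_text() %>%
    clean() %>%
    gsub(pattern = ".*\\. ", replacement = "", x = .)

  # language codes come from the named anchors, repeated once per row
  language <- page %>%
    html_nodes("a[name]") %>%
    html_attr("name") %>%
    rep(., each = top)

  # view counts: drop the header cells and the spaces used as digit separators
  count <- page %>%
    html_nodes(sel) %>%
    html_nodes(".c") %>%
    html_text() %>%
    Filter(function(x) x != "View count", .) %>%
    gsub(pattern = " ", replacement = "", x = .) %>%
    as.numeric()

  as_tibble(data.frame(language = language, item = item, count = count, stringsAsFactors = FALSE))
}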