@bearloga
Created May 23, 2016 21:25
Visits every language edition of Wikipedia and grabs the page title + site subtitle from its MediaWiki:Sitesubtitle page.
suppressMessages({
# Preamble ========================================
# ======== Web Scraping ===========================
library(rvest)    # install.packages('rvest')
library(magrittr) # install.packages('magrittr')
# ======== I/O ====================================
library(httr)     # install.packages('httr')
})
html <- read_html("https://wikipedia.org")
# The per-language links live in the "secondary links" section of the portal page
wikipedias <- html %>%
  html_nodes('div[data-el-section="secondary links"] ul li a') %>%
  { data.frame(href = paste0("https:", html_attr(., "href")),
               name = html_text(.),
               stringsAsFactors = FALSE) }
subtitles <- do.call(rbind, apply(wikipedias, 1, function(wikipedia) {
  # Each edition stores its subtitle on the MediaWiki:Sitesubtitle page
  html <- read_html(paste0(wikipedia['href'], "wiki/MediaWiki:Sitesubtitle"))
  title <- html %>%
    html_nodes("title") %>%
    html_text()
  subtitle <- html %>%
    html_nodes("#mw-content-text p") %>%
    html_text() %>%
    paste(collapse = " ") # collapse so multiple <p> nodes still yield one row
  return(data.frame(title = title, subtitle = subtitle, stringsAsFactors = FALSE))
}))
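With a few hundred editions to fetch, a single unreachable language edition will abort the whole `apply()` loop. A minimal hardening sketch (not part of the original gist) wraps the fetch in `tryCatch` and returns `NA` on failure, so the caller can skip that edition; `read_html` here is `xml2::read_html`, which rvest re-exports.

```r
library(xml2)

# Hypothetical helper: fetch a page, returning NA instead of erroring,
# so one failed request doesn't abort a loop over all editions.
safe_read <- function(url) {
  tryCatch(xml2::read_html(url), error = function(e) NA)
}
```

Inside the loop one would then test `if (identical(html, NA)) return(NULL)` before scraping, since `do.call(rbind, ...)` silently drops `NULL` entries.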
# Tidy the titles: drop the "MediaWiki:Sitesubtitle" prefix and its separator,
# then keep only the site name before the first comma
subtitles$title %<>%
  sub("MediaWiki:Sitesubtitle.{3}", "", .) %>%
  sub(", .*", "", .)