Get the content of all bat Wikipedia pages
library(tidyverse)
library(xml2)
library(rvest)
library(WikipediR)
library(urltools)
# Get all species-level page titles from the Wikipedia list of bats
bat_titles <- read_html("https://en.wikipedia.org/wiki/List_of_bats") %>%
  html_nodes(xpath = "//ul/li[contains(., 'Genus')]/ul/li/a[starts-with(@href, '/wiki/')]") %>%
  xml_attr("href") %>%
  basename() %>%
  url_decode()
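To see what that XPath-and-decode chain does without hitting Wikipedia, here is a self-contained miniature: the HTML fragment below is a made-up sketch of the list page's nested "Genus" structure, with one percent-encoded title. The same pipeline pulls the species hrefs, strips the `/wiki/` prefix with `basename()`, and decodes the percent escapes:

```r
library(xml2)
library(rvest)
library(urltools)

# Hypothetical fragment mimicking the structure of the Wikipedia list page
snippet <- '
<ul><li>Genus <i>Pteropus</i>
  <ul>
    <li><a href="/wiki/Large_flying_fox">Large flying fox</a></li>
    <li><a href="/wiki/Geoffroy%27s_bat">Geoffroy\'s bat</a></li>
  </ul>
</li></ul>'

titles <- read_html(snippet) %>%
  html_nodes(xpath = "//ul/li[contains(., 'Genus')]/ul/li/a[starts-with(@href, '/wiki/')]") %>%
  xml_attr("href") %>%   # "/wiki/Large_flying_fox", "/wiki/Geoffroy%27s_bat"
  basename() %>%         # drop the "/wiki/" prefix
  url_decode()           # "%27" becomes "'"

titles
#> [1] "Large_flying_fox" "Geoffroy's_bat"
```

The decoded underscored titles are exactly the form `page_content()` expects as `page_name`.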
# Get the content of all those pages (takes a couple of minutes!)
bat_info <- map_df(bat_titles, function(x) {
  res <- page_content(language = "en", project = "wikipedia", page_name = x)
  data_frame(title = res$parse$title,
             content = res$parse$text$`*`)
})
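A fetch over hundreds of titles will abort partway if any single page errors (a deleted or renamed title, a network hiccup). Wrapping the fetcher in `purrr::possibly()` makes failures return a fallback value instead of stopping the loop; adding a short `Sys.sleep()` between requests is also polite to the API. The sketch below uses `fetch_stub`, a hypothetical stand-in for the real `page_content()` call, so the pattern can be shown without network access:

```r
library(purrr)

# fetch_stub is a hypothetical stand-in for page_content(); it errors
# on one title the way a deleted Wikipedia page would.
fetch_stub <- function(title) {
  if (title == "Missing_bat") stop("page not found")
  # Sys.sleep(0.2)  # in the real loop, pause briefly between API calls
  paste("content of", title)
}

# possibly() converts errors into a fallback value (here NA) so
# mapping over all titles never aborts midway.
safe_fetch <- possibly(fetch_stub, otherwise = NA_character_)

map_chr(c("Large_flying_fox", "Missing_bat"), safe_fetch)
#> [1] "content of Large_flying_fox" NA
```

In the real script you would wrap `page_content()` the same way and filter out the `NA` rows afterwards.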
# Extract just the text from the HTML
bat_text <- bat_info %>%
  mutate(content = map_chr(content, ~ html_text(read_html(.))))
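Note that `html_text()` keeps citation markers such as `[3]` and ragged whitespace from the page layout. If you want cleaner text for downstream analysis, a small cleanup pass helps (illustrative; adjust the patterns to taste):

```r
library(stringr)

# Strip bracketed citation markers like "[12]" and collapse whitespace.
clean_wiki_text <- function(txt) {
  str_squish(str_remove_all(txt, "\\[\\d+\\]"))
}

clean_wiki_text("The  greater noctule[2]\n hunts birds.[3]")
#> [1] "The greater noctule hunts birds."
```

Applied with `mutate(content = clean_wiki_text(content))`, this leaves plain prose ready for text mining.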