@rentrop
Last active October 30, 2017 21:53
Parse nested xml_children in R with rvest/xml2
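# Recursively flatten the children of an XML node into a named list.
# Names of nested elements are joined with "_", e.g. "contributor_username".
# Note: a single prefix is built per recursion step, so this works as intended
# when at most one child per level has child elements of its own.
parse_nested <- function(x, prefix = '') {
  kids = x %>% xml_children()
  # Indices of children that have child elements of their own
  ind = which(sapply(kids, xml_length) != 0)
  if (!length(ind)) {
    # Base case: only leaf nodes, so return their text named by tag
    return(setNames(kids %>% xml_text(),
                    paste0(prefix, kids %>% xml_name())))
  }
  # Recurse into the nested child, extending the prefix with its tag name
  nested = parse_nested(kids[ind],
                        prefix = paste0(prefix, kids[ind] %>% xml_name(), "_"))
  # Leaf children at this level
  unnested = setNames(kids[-ind] %>% xml_text(),
                      paste0(prefix, kids[-ind] %>% xml_name()))
  as.list(c(unnested, nested))
}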
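library(httr)   # POST()
# Export revisions of the page "Euroswydd" via Special:Export
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export",
          body = "pages=Euroswydd&offset=1&limit=2&action=submit")
library(rvest)  # html_nodes(), %>%
library(xml2)   # read_html(), xml_children(), xml_length(), xml_name(), xml_text()
doc <- read_html(r)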
# Binding via data.table::rbindlist (results in a data.table by default)
doc %>%
  html_nodes("revision") %>%
  lapply(parse_nested) %>% # Parse each revision separately
  data.table::rbindlist(fill = TRUE) # rbind and fill
# Binding via plyr::rbind.fill (results in a data.frame by default)
doc %>%
  html_nodes("revision") %>%
  lapply(parse_nested) %>%
  lapply(function(x) data.frame(x)) %>% # Convert each parsed list to a data.frame
  plyr::rbind.fill() # rbind and fill
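The helper in this gist leans on four xml2 calls: `xml_children()`, `xml_length()`, `xml_name()` and `xml_text()`. The following is a minimal offline sketch of what each returns, so the logic can be checked without a network request. The `toy` document is a hypothetical input made up for illustration, only loosely shaped like one revision element; it is not Wikipedia's actual export format.

```r
library(xml2)

# Hypothetical toy input (an assumption for illustration, not real export data)
toy <- read_xml(paste0(
  "<revision><id>1</id>",
  "<contributor><username>alice</username><id>10</id></contributor>",
  "<comment>first</comment></revision>"
))

kids <- xml_children(toy)        # direct element children of <revision>
tags <- xml_name(kids)           # "id" "contributor" "comment"
n_children <- xml_length(kids)   # 0 2 0: only <contributor> has children

# Split into leaves and nested nodes, as the helper does with `ind`
leaf_idx <- which(n_children == 0)
leaves <- setNames(xml_text(kids[leaf_idx]), xml_name(kids[leaf_idx]))
print(leaves)                    # named character vector: id = "1", comment = "first"
```

Nodes with a non-zero `xml_length()` are the ones the helper recurses into, which is how the `contributor_` prefix ends up on the nested column names.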
rentrop commented Mar 19, 2016
