@dubsnipe
Created April 10, 2021 00:34
Script to extract infoboxes from Appropedia dumps
# library() stops with an error if a package is missing; require() only returns FALSE.
library(xml2)
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
# Read the MediaWiki XML dump and convert it to a nested R list.
pages <- read_xml("Appropedia-20210409194434.xml")
pages_list <- as_list(pages)

# Collect eight fields for each child of the root element (NA where a field is missing).
pages_tibble <- as_tibble(
  sapply(pages_list[[1]], function(x) {
    unlist(c(
      if (is.list(x$title)) x$title else NA,
      if (is.list(x$id)) x$id else NA,
      if (is.list(x$revision$id)) x$revision$id else NA,
      if (is.list(x$revision$parentid)) x$revision$parentid else NA,
      if (is.list(x$revision$timestamp)) x$revision$timestamp else NA,
      if (is.list(x$revision$contributor$username)) x$revision$contributor$username else NA,
      # Use the scalar && so the if() condition is a single TRUE or FALSE.
      if (is.list(x$revision$text) && length(x$revision$text) > 0) as.character(x$revision$text) else NA,
      # The presence of a <minor> element marks the revision as a minor edit.
      if (is.list(x$revision$minor)) TRUE else FALSE
    ))
  }),
  .name_repair = "minimal"
)
# https://stackoverflow.com/questions/42790219/how-do-i-transpose-a-tibble-in-r
# Transpose so that each row is one page; the argument is `.name_repair`, with a leading dot.
pages_trans <- as_tibble(t(pages_tibble), .name_repair = "minimal")
tibble_names <- c(
  "title", "id", "revision_id", "parent_id",
  "timestamp", "username", "text", "is_minor"
)
colnames(pages_trans) <- tibble_names
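all_pages <- pages_trans
# Timestamps are ISO 8601 strings; convert them to dates.
all_pages$timestamp <- as_date(all_pages$timestamp)
# Keep only the title and wikitext of each page (this reuses the name `pages`).
pages <- all_pages %>% select(title, text)
# Extract the first "{{Infobox device ...}}" template from each page (NA if there is none).
# The match cannot cross a "}", so an infobox that contains a nested template
# is cut off at that template's closing braces.
infoboxes <- pages$text %>% str_extract("\\{\\{Infobox device([^\\}]*)\\}\\}")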
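
The script stops once the raw infobox strings are extracted. As a rough sketch of a possible next step, the block below runs the same kind of pattern on a small sample and splits the match into named fields. The sample wikitext, its field names, and the helper `parse_infobox()` are hypothetical and are not part of the original gist. The sketch assumes fields look like `|key = value` and that values contain no `|` (as in piped wiki links) and no nested templates.

```r
library(stringr)

# Hypothetical sample wikitext; the field names are invented for illustration.
sample_text <- c(
  "Intro text {{Infobox device\n|name = Solar cooker\n|cost = 20\n}} more text",
  "A page without an infobox"
)

# Same approach as the script: match from "{{Infobox device" to the next "}}".
infobox_pattern <- "\\{\\{Infobox device([^\\}]*)\\}\\}"
sample_infoboxes <- str_extract(sample_text, infobox_pattern)

# Hypothetical helper: turn one infobox string into a named character vector.
parse_infobox <- function(infobox) {
  if (is.na(infobox)) return(character(0))
  # Column 2 of str_match() holds the captured group (the template body).
  body <- str_match(infobox, infobox_pattern)[, 2]
  # Split on "|" and keep only the pieces that look like "key = value".
  fields <- str_trim(str_split(body, "\\|")[[1]])
  fields <- fields[str_detect(fields, "=")]
  # Split each field at its first "=" into a key and a value.
  kv <- str_split_fixed(fields, "=", n = 2)
  setNames(str_trim(kv[, 2]), str_trim(kv[, 1]))
}

parsed <- lapply(sample_infoboxes, parse_infobox)
```

Applied to the real data, `lapply(infoboxes, parse_infobox)` would give one named vector per page, empty where no infobox was found.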