@jmclawson
Last active November 4, 2023 16:36
Functions for retrieving metadata and texts from the Michigan Corpus of Upper-Level Student Papers (https://elicorpora.info/main)
# helper function get_if_needed for downloading online documents exactly once: https://gist.github.com/jmclawson/65899e2de6bfee692b08141a98422240
source("https://gist.githubusercontent.com/jmclawson/65899e2de6bfee692b08141a98422240/raw/7c5590377332e427691f2331b69abd58be2141ec/get_if_needed.R")
get_micusp_metadata <- function(micusp_dir = "micusp"){
  # Download the metadata CSV once, then read the local copy.
  get_if_needed("https://elicorpora.info/browse?mode=download&start=1&sort=dept&direction=desc",
                filename = "micusp_metadata.csv",
                destdir = micusp_dir)
  # Read from micusp_dir so that a non-default directory also works.
  readr::read_csv(file.path(micusp_dir, "micusp_metadata.csv"), show_col_types = FALSE) |>
    janitor::clean_names()
}
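parse_micusp_paper <- function(paperid,
                               htmldir = "micusp/corpus_html",
                               textdir = "micusp/corpus"){
  # Paper IDs such as "SOC.G3.05.1" become file stems such as "SOC_G3_05_1".
  file_stem <- stringr::str_replace_all(paperid, "[.]", "_")
  filename_text <- file.path(textdir, paste0(file_stem, ".txt"))
  filename_html <- file.path(htmldir, paste0(file_stem, ".html"))
  if(!dir.exists(textdir)){dir.create(textdir, recursive = TRUE)}
  if(!file.exists(filename_text)){
    # Download the paper's web page once if there is no local copy yet.
    get_if_needed(paste0("https://elicorpora.info/view?pid=", paperid),
                  paste0(file_stem, ".html"),
                  destdir = htmldir)
    # Parse the local HTML and cache the body text as a plain-text file.
    filename_html |>
      rvest::read_html() |>
      rvest::html_element(css = "div#paperBody") |>
      rvest::html_text() |>
      readr::write_lines(filename_text)
  }
  # Return the cached text as a single string.
  readr::read_lines(filename_text) |>
    paste0(collapse = "\n")
}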
get_micusp_corpus <- function(...){
  # Keep only the metadata rows matching the filter conditions in `...`.
  the_df <-
    get_micusp_metadata() |>
    dplyr::filter(...)
  the_urls <-
    the_df |>
    dplyr::pull(paper_id) |>
    {\(x) paste0("https://elicorpora.info/view?pid=", x)}()
  the_filenames <-
    the_df |>
    dplyr::pull(paper_id) |>
    stringr::str_replace_all("[.]", "_") |>
    paste0(".html")
  # Download each paper's web page once.
  purrr::walk2(.x = the_urls,
               .y = the_filenames,
               .f = ~ get_if_needed(.x, .y, destdir = "micusp/corpus_html"))
  # Parse one paper per row and store its text in the text column.
  the_df |>
    dplyr::rowwise() |>
    dplyr::mutate(text = parse_micusp_paper(paper_id))
}

jmclawson commented Nov 4, 2023

Functions and use

get_micusp_metadata()

Load this code in R and then optionally use get_micusp_metadata() to get a copy of the metadata:

source("https://gist.githubusercontent.com/jmclawson/72678ef880e4844dfb3d92ab65fa6090/raw/9f848f40781f9a5c82864feb8ca6c0e1ba17c40c/corpus_micusp.R")

micusp_metadata <- get_micusp_metadata()

parse_micusp_paper()

The parse_micusp_paper() function accepts the ID value of a single paper. It checks for a local copy of the web page containing the paper (in "micusp/corpus_html/"), downloads it once if needed to avoid hitting the server too often, parses this local copy, saves the parsed text locally in "micusp/corpus/", and then returns the text as a single string. It is used like this:

parse_micusp_paper("SOC.G3.05.1")
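
To make the local file layout concrete, here is a minimal sketch of how a paper ID maps to the cached file paths, using the default directories named above. The helper micusp_local_paths() is hypothetical and not part of the gist; it only mirrors the naming convention (dots in the ID replaced by underscores) in base R.

```r
# Hypothetical helper (not part of the gist): shows where the cached
# HTML and text files for a given paper ID are expected to live.
micusp_local_paths <- function(paperid,
                               htmldir = "micusp/corpus_html",
                               textdir = "micusp/corpus") {
  # Replace every literal dot in the ID with an underscore.
  file_stem <- gsub(".", "_", paperid, fixed = TRUE)
  list(
    html = file.path(htmldir, paste0(file_stem, ".html")),
    text = file.path(textdir, paste0(file_stem, ".txt"))
  )
}

paths <- micusp_local_paths("SOC.G3.05.1")
paths$html
paths$text
```

Deleting the cached text file should cause the paper to be parsed again on the next call, since the function only skips parsing when that file already exists.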

get_micusp_corpus()

The get_micusp_corpus() function applies a filter to the metadata, passes the IDs of the matching papers to parse_micusp_paper(), and saves the resulting text in the text column of a data frame. Typically, you will use it to select a subset of texts, parse them, and then use tidytext to unnest them to one word per row:

library(tidytext)
three_soc_texts <- 
  get_micusp_corpus(paper_id %in% c("SOC.G3.05.1", "SOC.G3.06.2", "SOC.G3.10.1")) |> 
  unnest_tokens(word, text)

physics_texts <- 
  get_micusp_corpus(discipline == "Physics") |> 
  unnest_tokens(word, text)
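
As a rough illustration of what unnesting to one word per row looks like, the sketch below runs unnest_tokens() on a tiny made-up data frame instead of the real corpus. The IDs and texts in toy_corpus are invented for this example and assume nothing about actual MICUSP papers.

```r
library(tidytext)

# Invented stand-in for the data frame returned by get_micusp_corpus();
# these IDs and texts are made up for illustration only.
toy_corpus <- data.frame(
  paper_id = c("TOY.01", "TOY.02"),
  text = c("Gravity bends light.", "Light travels fast."),
  stringsAsFactors = FALSE
)

# Split the text column into one lowercased word per row;
# the other columns are repeated for each word.
toy_words <- toy_corpus |>
  unnest_tokens(word, text)

toy_words
```

The original text column is dropped by default, so each row of the result pairs a paper_id with a single word.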

The get_micusp_corpus() function can accept multiple filters separated by commas. As with dplyr::filter(), a paper is kept only if it satisfies all of them:

physics_texts_ug <- 
  get_micusp_corpus(discipline == "Physics", 
                    student_level == "Final Year Undergraduate") |> 
  unnest_tokens(word, text)
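
The filters work because the function forwards its arguments to dplyr::filter(). The sketch below shows that forwarding on a small invented metadata table, so it runs without downloading anything. The wrapper filter_toy() and the table toy_metadata are hypothetical; apart from "Physics" and "Final Year Undergraduate", the values are placeholders rather than real MICUSP entries.

```r
# Invented stand-in for the MICUSP metadata; the paper IDs are made up.
toy_metadata <- data.frame(
  paper_id = c("TOY.01", "TOY.02", "TOY.03"),
  discipline = c("Physics", "Physics", "Other Discipline"),
  student_level = c("Final Year Undergraduate",
                    "Other Level",
                    "Final Year Undergraduate"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper (not part of the gist): forwards its
# conditions unchanged to dplyr::filter() via `...`.
filter_toy <- function(...) {
  toy_metadata |>
    dplyr::filter(...)
}

# Conditions separated by commas must all hold for a row to be kept.
physics_ug <- filter_toy(discipline == "Physics",
                         student_level == "Final Year Undergraduate")
physics_ug$paper_id
```

Any expression that dplyr::filter() accepts, such as one using %in% or the | operator, can be passed the same way.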
