@jmcastagnetto
Created June 10, 2021 19:53
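library(tidyverse)
library(rvest)
library(V8)

# First page of the school listing, table view
url <- "https://www.greatschools.org/new-york/new-york/schools/?view=table"
# The first <script> in <head> carries the page data as a JavaScript object
xpath <- "/html/head/script[1]"

# JavaScript engine used to evaluate the embedded script
ctx <- v8()

# Extract the script text and rename `window.gon` to `gon`,
# because the V8 context has no browser `window` object
txt <- read_html(url) %>%
  html_elements(xpath = xpath) %>%
  html_text(trim = TRUE) %>%
  str_replace(fixed("window.gon"), "gon")

# Round-trip through a temporary file to split the script into lines
tmpfile <- tempfile()
write_file(txt, file = tmpfile)
txt2 <- read_lines(tmpfile)

# Evaluate the second line of the script so that `gon` is populated in the V8 context
ctx$eval(txt2[2])

# Pull the object back into R and keep the table of schools
tmp <- ctx$get("gon")
df <- as_tibble(tmp$search$schools)
saveRDS(df, file = "schools.rds")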
@jmcastagnetto
Author

FYI, the code above only fetches the first page of results. To get the rest, use URLs of the form "https://www.greatschools.org/new-york/new-york/schools/?page=N&view=table", where N = 2, 3, ...
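
A minimal sketch of how the remaining pages could be looped over, using the URL pattern just described. The names `page_url`, `scrape_pages`, `scrape_page`, and `delay` are hypothetical and not part of the original script: `scrape_page` stands for any function that takes a URL and returns the schools table for that page (for example, the steps of the script wrapped in a function), and the pause between requests is an assumption about polite scraping, not a documented requirement of the site.

```r
# Build the URL for results page n, following the "?page=N&view=table" pattern
page_url <- function(n) {
  paste0(
    "https://www.greatschools.org/new-york/new-york/schools/",
    "?page=", n, "&view=table"
  )
}

# Apply a user-supplied scraper to each page and return a list of results.
# `scrape_page` is a hypothetical function: URL in, data frame out.
# `delay` is the number of seconds to wait before each request (assumed value).
scrape_pages <- function(pages, scrape_page, delay = 1) {
  lapply(pages, function(n) {
    Sys.sleep(delay)
    scrape_page(page_url(n))
  })
}
```

The result is one data frame per page; something like `dplyr::bind_rows()` could then combine them, provided every page yields the same columns.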

@jmcastagnetto
Author

Also, the resulting data frame contains list columns (e.g., df$links, which has the elements "profile", "reviews", and "collegeSuccess"), but those are not too difficult to parse, reorganize, or extract.
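
One possible way to flatten such a column is `tidyr::unnest_wider()`, sketched here on a toy tibble. The toy data and its placeholder values are invented for illustration, and the sketch assumes `links` is a list column of named lists; if it turns out to be a data-frame column instead (which can happen when nested JSON is converted), `tidyr::unpack()` would be the counterpart to try.

```r
library(tibble)
library(tidyr)

# Toy stand-in for df: the values are placeholders, not real scraped data
toy <- tibble(
  name = c("School A", "School B"),
  links = list(
    list(profile = "/a/", reviews = "/a/reviews", collegeSuccess = "/a/college"),
    list(profile = "/b/", reviews = "/b/reviews", collegeSuccess = "/b/college")
  )
)

# Spread each named element of `links` into its own column,
# prefixed with the original column name (links_profile, ...)
flat <- unnest_wider(toy, links, names_sep = "_")
```

After this step `flat` has one plain character column per element of `links`, which is easier to filter or export than the nested form.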
