@dill
Created May 31, 2015 00:05
UN Resolutions downloader
# get UN resolutions
# Annoyingly the UN website is not very crawl-able.
# Using RSelenium you can download Resolution data
library(RSelenium)
library(stringr)
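# download directory (created here if it does not already exist)
dl_dir <- path.expand("~/Downloads/UN69/")
dir.create(dl_dir, recursive = TRUE, showWarnings = FALSE)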
# General Assembly session to fetch (here the 69th)
session_id <- 69
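## download
# start a local Selenium server
# (note: startServer() is defunct in later RSelenium releases, where
# rsDriver() or a separately started server is used instead)
RSelenium::startServer()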
# need to set the Firefox profile to download rather than open
# PDFs and then set download directory
fprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "browser.helperApps.neverAsk.saveToDisk" = "application/pdf,application/octet-stream",
  "plugin.disable_full_page_plugin_for_types" = "application/pdf",
  "browser.download.useDownloadDir" = TRUE,
  "browser.download.folderList" = 2L,
  "browser.download.dir" = dl_dir))
# setup the driver and connect to the server
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox",
                      extraCapabilities = fprof)
remDr$open()
# navigate to the relevant page
remDr$navigate(paste0("http://www.un.org/en/ga/",
                      session_id, "/resolutions.shtml"))
# get the source and extract links to PDFs
ps <- remDr$getPageSource()
# (literal dots in the URL are escaped so they only match a ".")
urls_to_get <- str_extract_all(ps[[1]],
  paste0("http://www\\.un\\.org/en/ga/search/view_doc\\.asp\\?symbol=A/RES/",
         session_id, "/\\d+"))
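# iterate over the URLs; navigating to each one triggers the PDF download
# (unique() avoids fetching a resolution twice if it is linked more than once)
pdf_urls <- unique(urls_to_get[[1]])
for (i in seq_along(pdf_urls)) {
  remDr$navigate(pdf_urls[i])
  # give each download time to finish
  Sys.sleep(5)
  # need an extra sleep after every 25th document
  if (i %% 25 == 0) Sys.sleep(10)
}
# close the browser session once everything has been requested
remDr$close()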
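
# A rough sanity check for the download loop above, offered as a sketch
# rather than part of the original script. The fixed sleeps give no
# guarantee that every PDF actually arrived, so this counts the PDFs in
# the download directory and compares that to the number of links found.
# count_missing_pdfs() is a hypothetical helper name; it assumes every
# download ends in ".pdf" and that the directory held no PDFs beforehand.
count_missing_pdfs <- function(dir, expected) {
  # PDFs currently present in the download directory
  got <- list.files(dir, pattern = "\\.pdf$", ignore.case = TRUE)
  # number of expected files that are not there (never negative)
  max(0, expected - length(got))
}
# Possible usage once the loop has finished:
# count_missing_pdfs(dl_dir, length(unique(urls_to_get[[1]])))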