@jeroen
Last active January 17, 2021 17:15
Fast scraping of package metadata using curl multi API
# Globals
repos <- 'https://cloud.r-project.org'
pkgdata <- available.packages(repos = repos)
pkgs <- row.names(pkgdata)
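# Build the 'done' callback for package i: parse the DESCRIPTION on HTTP 200, warn otherwise
make_callback <- function(i, url){
  function(res){
    if(res$status == 200){
      # res$content is a raw vector; read it as a DCF file via an in-memory connection
      buf <- rawConnection(res$content)
      on.exit(close(buf))
      output[[i]] <<- as.list(read.dcf(buf)[1,])
      cat(sprintf("[%d] OK: %s\n", i, url))
    } else {
      warning(sprintf("Download failed for %s (HTTP %d)", url, res$status))
    }
  }
}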
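# Offline sketch (not part of the original script): the parsing step used in the
# callback above, applied to a made-up DESCRIPTION. The demo_* names and the
# field values are hypothetical and exist only for illustration.
demo_text <- "Package: demo\nVersion: 0.1.0\nSystemRequirements: libcurl\n"
demo_buf <- rawConnection(charToRaw(demo_text))  # same shape as res$content
demo_fields <- as.list(read.dcf(demo_buf)[1,])   # named list, one entry per field
close(demo_buf)
str(demo_fields)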
# Set up ~12000 handles, one per package
pool <- curl::new_pool(host_con = 20)
output <- vector('list', nrow(pkgdata))
cat("Setting up handles: ")
invisible(sapply(seq_along(output), function(i){
  url <- sprintf("%s/web/packages/%s/DESCRIPTION", repos, pkgs[i])
  curl::curl_fetch_multi(url, done = make_callback(i, url), fail = function(err){
    stop(sprintf("Networking error: %s", err))
  }, pool = pool)
  cat(".")
}))
# Let's do it:
status <- curl::multi_run(pool = pool)
print(status)
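# Sketch (not part of the original script): slots of 'output' that were never
# filled stay NULL, so they identify packages whose download did not succeed.
# failed_packages() is a hypothetical helper name; the toy call below uses
# made-up data and does not touch 'output' or 'pkgs'.
failed_packages <- function(output, pkgs){
  pkgs[vapply(output, is.null, logical(1))]
}
demo_failed <- failed_packages(list(list(Package = "a"), NULL), c("a", "b"))
print(demo_failed)
# On the real data this would be: failed_packages(output, pkgs)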
# Convert to data frame
library(dplyr)
df <- bind_rows(output)
# This is what I actually needed :)
sysreqs <- df %>%
  filter(!is.na(SystemRequirements)) %>%
  select(Package, SystemRequirements)