Get PDFs from Sci-Hub for every entry in a .bib file
####### This presumes you have a .bib file in which every bibliography
####### entry has a legal file name as its citation key and a doi field,
####### and that you have a directory to save the downloaded PDFs in.
####### If a BibTeX entry looks like
####### @Article{Bob_1980, ..... }
####### then its PDF will be saved as Bob_1980.pdf.
####### Every entry must have a DOI.
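####### For illustration, a hypothetical entry of the form this script
####### expects (the key "Smith_2019" and the DOI below are made up):
#######   @Article{Smith_2019,
#######     author = {Smith, Jane},
#######     title  = {Some Title},
#######     doi    = {10.1000/xyz123},
#######   }
####### would be saved as Smith_2019.pdf using the DOI 10.1000/xyz123.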
bibfile = "/path/to/bib/file.bib"
savedir = "/path/to/save/dir/"
biblines = readLines(bibfile)
####### citation keys: the text between "{" and "," on each "@Entry{key," line
tags = gsub("^.*\\{(.*),", "\\1", grep("@", biblines, value = TRUE))
####### DOIs: the text between the braces on each "doi = {...}," line
doilist = gsub("^.*\\{(.*)\\},", "\\1", grep("doi.*=", biblines, value = TRUE))
filelist = paste0(savedir, tags, ".pdf")
urllist = paste0("https://sci-hub.tw/", doilist)
pdfpattern = "https://.*pdf\\?download=true"
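####### For illustration only (this example line and URL are made up): the
####### pattern above selects HTML lines carrying the download link, and the
####### gsub() used in the loop below strips everything around the URL.
exampleline = '<embed src = "https://example.org/paper.pdf?download=true">'
examplepdf = gsub("^.*(https.*true).*", "\\1", grep(pdfpattern, exampleline, value = TRUE))
####### examplepdf is now "https://example.org/paper.pdf?download=true"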
####### tags and doilist must line up one-to-one, so every entry needs a DOI
if (length(doilist) != length(tags)) stop("Error: number of DOIs does not match number of entries")
for (i in seq_along(doilist)) {
  if (!file.exists(filelist[i])) {
    scihubhtml = readLines(urllist[i])
    pdfurl = gsub("^.*(https.*true).*", "\\1", grep(pdfpattern, scihubhtml, value = TRUE))
    ####### skip this entry if no PDF link was found on the page;
    ####### download in binary mode ("wb") so the PDF is not corrupted on Windows
    if (length(pdfurl) > 0) download.file(pdfurl[1], destfile = filelist[i], method = "auto", mode = "wb")
  }
}
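####### Optional follow-up, not part of the original script: report any
####### entries that still have no PDF so they can be fetched by hand.
stillmissing = tags[!file.exists(filelist)]
if (length(stillmissing) > 0) message("No PDF downloaded for: ", paste(stillmissing, collapse = ", "))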