@timriffe
Created October 14, 2011 16:45
An R function to download the MIT classics text files and set up a local database
DownloadMITclassics <- function(dirpath){
  # RCurl provides getURL(); the package name is case-sensitive
  library(RCurl)
  MITwriters <- c("Confucius","Lao","Ferdowsi","Khayyam","Sadi","Tzu",
    "Aeschylus","Aesop","Apollonius","Apuleius","Aristophanes","Aristotle",
    "Antoninus","Augustus","Caesar","Epictetus","Epicurus","Euripides",
    "Galen","Herodotus","Hippocrates","Homer","Carus","Ovid",
    "Plato","Plotinus","Plutarch","Porphyry","Quintus","Sophocles",
    "Tacitus","Thucydides","Virgil")
  # make directory structure, one folder per writer
  # (file.path() uses a separator that works on every platform):
  sapply(file.path(dirpath,MITwriters),dir.create,recursive=TRUE)
  baseurl <- "classics.mit.edu/"
  urls <- paste(baseurl,MITwriters,"/",sep="")
  # each url here will contain one or more text files that need to be extracted:
  for (i in seq_along(urls)){
    A <- getURL(urls[i])
    A <- readLines(tc <- textConnection(A)); close(tc)
    # keep only the lines that mention a literal ".txt"
    A <- A[grep(".txt",A,fixed=TRUE)]
    A <- unlist(strsplit(A,split="href=\""))
    A <- unlist(strsplit(A,split="\""))
    # drop the link-text fragments; logical indexing stays correct when nothing matches
    A <- A[!grepl("</a>",A,fixed=TRUE)]
    # now A is a vector of the file names of the works
    A <- A[grep(".txt",A,fixed=TRUE)]
    # skip writers whose page lists no text files
    if (length(A) == 0) next
    # get urls for writer i
    urlsj <- paste(urls[i],A,sep="")
    worknames <- unlist(strsplit(urlsj,split="/"))
    worknames <- worknames[grep(".txt",worknames,fixed=TRUE)]
    worknames <- unlist(lapply(strsplit(worknames,split="\\."),function(x){x[1]}))
    for (j in seq_along(urlsj)){
      download.file(url=paste("http://",urlsj[j],sep=""),
        destfile=file.path(dirpath,MITwriters[i],paste(worknames[j],".txt",sep="")))
    }
  }
}
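
The link-extraction step inside the loop can be tried offline. The sketch below is only an illustration, not part of the gist: `extract_txt_links` is a hypothetical helper name, the sample markup is made up, and it swaps the gist's two `strsplit` passes for a single `href="..."` regular expression. The actual layout of the classics.mit.edu pages may differ.

```r
# Hypothetical helper (not in the original gist): return the targets of all
# links that end in ".txt" from a character vector of HTML lines.
extract_txt_links <- function(html) {
  # find every href="..." attribute on every line
  hits <- regmatches(html, gregexpr('href="[^"]*"', html))
  hrefs <- unlist(hits)
  # strip the href=" prefix and the closing quote
  hrefs <- sub('^href="', "", hrefs)
  hrefs <- sub('"$', "", hrefs)
  # keep only targets ending in a literal ".txt"
  hrefs[grepl("\\.txt$", hrefs)]
}

# made-up sample markup, for illustration only
sample_html <- c(
  '<a href="poetics.html">Poetics</a>',
  '<a href="poetics.txt">text-only version</a>',
  '<a href="rhetoric.mb.txt">Rhetoric</a> <a href="index.html">home</a>'
)
links <- extract_txt_links(sample_html)
print(links)
```

The function itself needs a network connection and would be called with the target folder, for example `DownloadMITclassics("MITclassics")`, which creates one subfolder per writer under that path.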