@abresler
Forked from christophergandrud/gist:1284498
Created August 3, 2012 17:26
Simple Web Crawler for Text
library(RCurl)
# links.csv: one URL per row in the first column, below a header row
addresses <- read.csv("~/links.csv", stringsAsFactors = FALSE)
urls <- addresses[[1]]
# Download every page; getURL() accepts a vector of URLs and returns one string per URL
full.text <- getURL(urls)
# Remove HTML tags (a crude regex, not a real HTML parser)
text.sub <- gsub("<[^>]+>", "", full.text)
# Write each page's text to its own numbered file: 1.txt, 2.txt, ...
outpath <- path.expand("~/text.indv")
dir.create(outpath, showWarnings = FALSE)
for (i in seq_along(text.sub)) {
  write(text.sub[i], file = file.path(outpath, paste0(i, ".txt")))
}
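The tag-stripping and file-writing steps can be tried offline before pointing the script at real URLs. The following is a minimal sketch in base R, not part of the original gist: the helper names `strip_tags` and `write_texts`, the sample HTML strings, and the use of `tempdir()` as the output location are all assumptions made for illustration.

```r
# Hypothetical helper: remove anything that looks like an HTML tag,
# then collapse runs of whitespace so the text ends up on one line.
strip_tags <- function(html) {
  no_tags <- gsub("<[^>]+>", "", html)
  trimws(gsub("[[:space:]]+", " ", no_tags))
}

# Hypothetical helper: write element i of `texts` to <outdir>/<i>.txt
# and return the file paths invisibly.
write_texts <- function(texts, outdir) {
  dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(outdir, paste0(seq_along(texts), ".txt"))
  for (i in seq_along(texts)) {
    writeLines(texts[i], paths[i])
  }
  invisible(paths)
}

# Made-up sample pages standing in for downloaded HTML
pages <- c("<html><body><h1>Title</h1><p>First page.</p></body></html>",
           "<p>Second\n<b>page</b>.</p>")

clean <- strip_tags(pages)
out <- write_texts(clean, file.path(tempdir(), "text.indv"))
print(clean)
```

Because tags are replaced with an empty string, text from adjacent elements runs together (a heading followed directly by a paragraph loses the gap between them). Replacing tags with a single space instead is one possible variation if word boundaries matter.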