
@shawngraham
Last active November 11, 2019 14:59
why is this not working? the loop is the problem
library(rvest)
base_url <- "https://www.masshist.org"
# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")
# Get link URLs
urls <- main.page %>%      # feed `main.page` to the next step
  html_nodes("a") %>%      # get the CSS nodes
  html_attr("href")        # extract the URLs
# Get link text
links <- main.page %>%     # feed `main.page` to the next step
  html_nodes("a") %>%      # get the CSS nodes
  html_text()              # extract the link text
# Combine `links` and `urls` into a data.frame
# because the links are all relative, let's add the base url with paste
diaries <- data.frame(links = links, urls = paste(base_url, urls, sep = ""), stringsAsFactors = FALSE)
# Loop over each row in `diaries`
for(i in seq((diaries))) {
  text <- read_html(diaries$urls[i]) %>% # load the page
    html_nodes(".entry") %>%             # isolate the text; maybe .transcription
    html_text()                          # get the text
  # Create the file name
  filename <- paste0(diaries$links[i], ".txt")
  sink(file = filename) %>% # open file to write
    cat(text)               # write the file
  sink()                    # close the file
}
@shawngraham
Author

Figured it out.

The scrape of links was also grabbing the links to 'home', 'back', etc., and so the loop would get hung up on those or break when it came time to run. So if you filter that stuff out just after the `diaries` data.frame is built (after line 16), with

library(dplyr) # filter() comes from dplyr
# but we have a few links to 'home' etc that we don't want,
# so we'll filter those out with grepl and a regular
# expression that looks for 'John' at the start of
# the links field.
diaries <- diaries %>% filter(grepl("^John", links))

then all is well with the world. Well, that, and line 19 (the for loop) has to be changed to

for(i in seq(nrow(diaries))) {

then you're golden.
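
For reference, here is a minimal sketch of the two fixes combined. Note that `seq(diaries)` on a two-column data.frame returns 1 2, which is why the original loop only ran twice; `seq(nrow(diaries))` walks the rows instead. `writeLines()` is swapped in for the `sink()`/`cat()` pair purely as a simplification, not as the original author's approach:

library(rvest)
library(dplyr)

base_url <- "https://www.masshist.org"
main.page <- read_html("https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")

urls  <- main.page %>% html_nodes("a") %>% html_attr("href")
links <- main.page %>% html_nodes("a") %>% html_text()

# keep only the diary links; drops 'home', 'back', etc.
diaries <- data.frame(links = links,
                      urls = paste(base_url, urls, sep = ""),
                      stringsAsFactors = FALSE) %>%
  filter(grepl("^John", links))

# iterate over rows, not columns
for (i in seq(nrow(diaries))) {
  text <- read_html(diaries$urls[i]) %>%
    html_nodes(".entry") %>%   # or .transcription, depending on the page
    html_text()
  writeLines(text, paste0(diaries$links[i], ".txt"))
}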

@jeffblackadar

library(rvest)

base_url <- "https://www.masshist.org"

# Load the page
main.page <-
  read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")

# Get link URLs
urls <- main.page %>%    # feed main.page to the next step
  html_nodes("a") %>%    # get the CSS nodes
  html_attr("href")      # extract the URLs

# Get link text
links <- main.page %>%   # feed main.page to the next step
  html_nodes("a") %>%    # get the CSS nodes
  html_text()            # extract the link text

# Combine links and urls into a data.frame
# because the links are all relative, let's add the base url with paste
diaries <-
  data.frame(
    links = links,
    urls = paste(base_url, urls, sep = ""),
    stringsAsFactors = FALSE
  )
#====================================

# Loop over each row in diaries
for (i in 1:length(diaries$urls)) {
  filename <- paste0(diaries$links[i], ".txt")
  print(filename)
  out <- tryCatch({
    download.file(diaries$urls[i], destfile = filename, quiet = TRUE)
  },
  error = function(cond) {
    message(paste("URL does not seem to exist:", diaries$urls[i]))
    message("Here's the original error message:")
    message(cond)
    # Choose a return value in case of error
    return(NA)
  },
  warning = function(cond) {
    message(paste("URL caused a warning:", diaries$urls[i]))
    message("Here's the original warning message:")
    message(cond)
    # Choose a return value in case of warning
    return(NULL)
  })
}
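
Since this version saves the raw HTML of each page rather than just the diary text, a follow-up pass can pull the transcriptions out of the saved files. A minimal sketch, assuming the downloaded files sit in the working directory with the .txt names from the loop above and that the .entry selector from the original gist still applies (the _entries.txt output name is just illustrative):

library(rvest)

# read each saved page back in and extract just the diary entries
for (f in list.files(pattern = "\\.txt$")) {
  text <- read_html(f) %>%       # the saved files are HTML despite the .txt extension
    html_nodes(".entry") %>%     # selector from the original gist; adjust if needed
    html_text()
  writeLines(text, sub("\\.txt$", "_entries.txt", f))
}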
