@clayford
Created May 25, 2017 15:49
scrape craftcans.com database
# craftcans.com - web site devoted to canned craft beer
# scrape craftcans.com database
library(rvest)
library(magrittr) # for extract()
library(stringr)
URL <- "http://www.craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text"
page <- read_html(URL)
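# The 11th <table> node on the page holds the beer listing.
# html_table() on a node set returns a list, so `[[`(1) pulls out the data frame.
beer <- page %>%
  html_nodes("table") %>%
  extract(11) %>%
  html_table(header = TRUE) %>%
  `[[`(1)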
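# A minimal offline sketch of the same table-selection step, for trying the
# idea without hitting the site. The HTML below is hypothetical (two tiny
# tables, the second standing in for the beer listing), and the demo_* names
# are illustrative only.
demo_page <- rvest::minimal_html('
  <table><tr><td>layout</td></tr></table>
  <table>
    <tr><th>ENTRY</th><th>Beer</th></tr>
    <tr><td>1</td><td>Example Ale</td></tr>
  </table>')
demo_tables <- rvest::html_nodes(demo_page, "table")
# Pick the table by position, then take the single data frame out of the list.
demo_beer <- rvest::html_table(demo_tables[2], header = TRUE)[[1]]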
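# Entry numbers shown in red indicate that a can is retired.
# html_table() drops styling, so read the raw HTML and pull the digits that
# immediately follow the red inline style.
page_raw <- readLines(URL, warn = FALSE)
retired <- str_extract_all(page_raw, pattern = '(?<=height:25px;color:red;font-weight:bold;">)\\d+') %>%
  unlist() %>%
  as.numeric()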
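# A minimal sketch of what the look-behind pattern matches. The two HTML lines
# below are hypothetical, written only to resemble the inline style the pattern
# targets; they are not copied from craftcans.com.
demo_html <- c(
  '<td style="height:25px;color:red;font-weight:bold;">2410</td>',
  '<td style="height:25px;font-weight:bold;">2409</td>'
)
# Only digits preceded by the red style are kept; lines without a match
# contribute nothing once the list is flattened.
demo_retired <- as.numeric(unlist(stringr::str_extract_all(
  demo_html,
  pattern = '(?<=height:25px;color:red;font-weight:bold;">)\\d+'
)))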
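# clean up
names(beer) <- tolower(names(beer))
# abv and ibus arrive as text; "???" and "N/A" mark missing values
beer$abv <- readr::parse_number(beer$abv, na = "???")
beer$ibus <- readr::parse_number(beer$ibus, na = "N/A")
# state is the two-letter code at the end of location
beer$state <- stringr::str_extract(beer$location, pattern = "[A-Z]{2}$")
beer$style[beer$style == ""] <- NA
# flag retired cans: 1 = retired, 0 = not retired
beer$retired <- ifelse(beer$entry %in% retired, 1, 0)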
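# A minimal sketch of the clean-up steps on a small made-up data frame, handy
# for checking the parsing rules offline. All values in `toy` are hypothetical,
# and entry 2 is assumed to be the only retired can.
toy <- data.frame(
  entry = c(3, 2, 1),
  abv = c("5.0%", "???", "7.2%"),
  ibus = c("45", "N/A", "60"),
  location = c("Richmond, VA", "Denver, CO", "Portland, OR"),
  stringsAsFactors = FALSE
)
toy$abv <- readr::parse_number(toy$abv, na = "???")    # "???" becomes NA
toy$ibus <- readr::parse_number(toy$ibus, na = "N/A")  # "N/A" becomes NA
toy$state <- stringr::str_extract(toy$location, pattern = "[A-Z]{2}$")
toy$retired <- ifelse(toy$entry %in% c(2), 1, 0)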