@dfalbel
Last active December 13, 2017 18:49
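library(readr)
library(stringr)
library(purrr)
library(tokenizers)

# Download the Amazon Fine Foods reviews dump from SNAP.
# mode = "wb" keeps the gzip archive intact on Windows.
download.file(
  "https://snap.stanford.edu/data/finefoods.txt.gz",
  "finefoods.txt.gz",
  mode = "wb"
)

# Keep only the "review/text: ..." lines and strip that prefix.
reviews <- read_lines("finefoods.txt.gz")
reviews <- reviews[str_sub(reviews, 1, 12) == "review/text:"]
reviews <- str_sub(reviews, start = 14)

# Split each review into a character vector of lowercase word tokens.
reviews <- tokenize_words(reviews)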

# Vocabulary: the 49,999 most frequent words, most frequent first.
vocab <- reviews %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  names() %>%
  head(49999)

# Encode each review as integer indices into vocab.
# Out-of-vocabulary words map to 0.
reviews_int <- map(
  reviews,
  ~ match(.x, vocab, nomatch = 0L)
)