@Athospd
Last active December 13, 2017 19:21
library(readr)
library(stringr)
library(tidyr)
library(dplyr)
library(purrr)
library(tokenizers)
library(forcats)
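# Read the raw review dump; each line is a "field: value" entry.
reviews <- read_lines("finefoods.txt.gz")
# Keep only the review text lines and strip the "review/text: " prefix.
reviews <- reviews[str_sub(reviews, 1, 12) == "review/text:"]
reviews <- str_sub(reviews, start = 14)
# Split each review into lowercase word tokens (a list of character vectors).
reviews <- tokenize_words(reviews)
reviews <- tibble(review_id = seq_along(reviews), review = reviews)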
# Build the vocabulary: the 49,999 most frequent words, ranked by frequency.
vocab <- reviews$review %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  names() %>%
  head(49999)
vocab <- tibble(vocab = vocab, vocab_id = seq_along(vocab))
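# Map each token to its vocabulary id, then collapse back to one row per review.
# Words outside the vocabulary get vocab_id = NA from the left join.
# unnest()/nest() use the tidyr >= 1.0.0 syntax.
reviews_final <- reviews %>%
  unnest(cols = review) %>%
  left_join(vocab, by = c("review" = "vocab")) %>%
  select(review_id, vocab_id) %>%
  nest(data = vocab_id)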