Last active October 12, 2022 03:35
TF-IDF with tidytext bind function
# Load tidytext for unnest_tokens() and the bind_tf_idf() function.
library(tidytext)
# Load dplyr for the count() function.
library(dplyr)
# Load readtext for reading text files.
library(readtext)

# First, read the text source into the program. Any of R's read functions can
# be used; here, a text file containing your content is read with the
# readtext() function (paste0() is handy for assembling the file path).
web_content <- readtext(paste0("TEXT FILE GOES HERE"))
# The returned object has a column called text containing the words we want to
# break into a corpus.
# Next, split the words in the $text column to create the corpus, the "bag of
# words". unnest_tokens() splits the column into a flat, one-token-per-row
# table. Then use count() to tally the unique words; sort = TRUE ranks them
# from the largest count to the smallest.
content_words <- web_content |>
  unnest_tokens(word, text, token = "words") |>
  count(word, sort = TRUE)
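As an illustration of the tokenize-and-count step, here is a minimal sketch using a small in-memory data frame as a stand-in for the readtext() result (the toy text is an assumption, not from the original gist):

```r
library(dplyr)
library(tidytext)

# A toy stand-in for the readtext() result: one doc_id and one text column.
toy <- data.frame(doc_id = "d1", text = "the cat sat on the cat mat")

toy_words <- toy |>
  unnest_tokens(word, text, token = "words") |>
  count(word, sort = TRUE)

toy_words
# "the" and "cat" each appear twice and sort to the top; the rest appear once.
```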
# count() returns a table of words and counts but drops the doc id, so rebuild
# the corpus as a data frame that combines the doc ids, the words, and the
# counts. bind_tf_idf() expects one row per term per document; with a single
# document, the doc_id is simply recycled across all rows.
new_corpus <- data.frame(web_content$doc_id, content_words$word, content_words$n)
# Give the new corpus column names for convenience and to simplify the
# bind_tf_idf() call.
colnames(new_corpus) <- c("doc_id", "text", "number")
# With the words separated, TF-IDF can be applied via bind_tf_idf(). The input
# should be a tidy dataset with one row per term per document.
content_sentiment <- new_corpus |>
  bind_tf_idf(text, doc_id, number)
# Printing the object shows a table with each word, its count, and score
# columns for tf, idf, and tf_idf.
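To see the score columns bind_tf_idf() adds, here is a hedged end-to-end sketch with two toy documents (the texts are assumptions; with only one document every idf is 0, so two are used to show a non-zero tf_idf):

```r
library(dplyr)
library(tidytext)

# Two toy documents; a word shared between them ("the") gets idf = 0.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the cat sat", "the dog ran")
)

scores <- docs |>
  unnest_tokens(word, text, token = "words") |>
  count(doc_id, word, sort = TRUE) |>
  bind_tf_idf(word, doc_id, n)

scores
# "the" appears in both documents, so its idf (and tf_idf) is 0; words unique
# to one document get idf = log(2).
```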
zimana commented Apr 13, 2022

Up to date as of 9:33 am Wednesday April 13th 2022
