@cagdasyetkin
Forked from earino/quiz2.txt
Last active February 19, 2018 08:15
1. Explain in your words what the unnest_tokens function does
It is a function from the tidytext package that restructures text into a one-token-per-row format. It splits a text column (our input) into tokens, such as words, so that each row of the output holds a single token. In short, it does the tokenization for us.
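To make the one-token-per-row idea concrete, here is a minimal sketch in plain Python. It is not the tidytext implementation: the helper name `unnest_words`, the column names, and the regular expression used to find words are all assumptions chosen for illustration. Like the default word tokenizer in tidytext, it lowercases the text and drops punctuation.

```python
import re

def unnest_words(rows, text_key="text", token_key="word"):
    """Return one output row per lowercase word; other columns are copied over."""
    out = []
    for row in rows:
        # Assumed word pattern: runs of letters, digits, and apostrophes.
        for token in re.findall(r"[a-z0-9']+", row[text_key].lower()):
            new_row = {k: v for k, v in row.items() if k != text_key}
            new_row[token_key] = token
            out.append(new_row)
    return out

# Hypothetical input: one row per line of text.
lines = [
    {"line": 1, "text": "Call me Ishmael."},
    {"line": 2, "text": "Some years ago, never mind how long."},
]
tokens = unnest_words(lines)
for row in tokens[:3]:
    print(row)
```

Each input row fans out into several output rows, and the `line` column is repeated so every token can still be traced back to where it came from.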
2. Explain in your words what the gutenbergr package does
Project Gutenberg, with the help of volunteers, digitizes books whose copyright has expired. The gutenbergr R package makes these books available to R users: we can download and process them with this library.
3. Explain in your words how sentiment lexicons work
They are like dictionaries that match words with a sentiment or emotion, for example by classifying them into Positive - Negative - Neutral categories. Once we match the words in our text with the lexicon, we can start analyzing the frequencies. Even if we don't know the language in which the text is written, we can get an overall impression of it, provided a lexicon for that language exists.
4. How does inner_join provide sentiment analysis functionality?
We match the words in our text with the sentiments in the lexicon. Many words in our text may be missing from the lexicon, and many words in the lexicon may never appear in our text. inner_join keeps only the intersection: the words found in both, each now labeled with its sentiment.
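The matching step can be sketched in plain Python. This is only an illustration of the inner-join idea, not the dplyr function: the four-word lexicon and the word list are made up, and the sketch assumes each lexicon word carries exactly one sentiment (some real lexicons assign several).

```python
from collections import Counter

# Hypothetical toy lexicon; real lexicons hold thousands of entries.
lexicon = [
    {"word": "happy", "sentiment": "positive"},
    {"word": "good", "sentiment": "positive"},
    {"word": "sad", "sentiment": "negative"},
    {"word": "terrible", "sentiment": "negative"},
]

# Hypothetical tokenized text, one word per element.
text_words = ["the", "good", "news", "made", "her", "happy", "not", "sad", "good"]

def inner_join(words, lexicon):
    """Keep only words present in the lexicon and attach their sentiment."""
    lookup = {entry["word"]: entry["sentiment"] for entry in lexicon}
    return [{"word": w, "sentiment": lookup[w]} for w in words if w in lookup]

matched = inner_join(text_words, lexicon)
counts = Counter(row["sentiment"] for row in matched)
print(matched)
print(counts)
```

Words such as "the" and "news" fall out because the lexicon has no entry for them, and "terrible" falls out because the text never uses it. What remains can be counted per sentiment.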
5. Explain in your words what tf-idf does
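It tells us how important a word is to one text within a collection of texts: a word scores high when it is frequent in that text (tf) but rare across the other texts (idf).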
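A small worked example may help here. The sketch below computes tf-idf in plain Python, assuming the common definition where tf is a word's count divided by the number of words in the document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The function name `tf_idf` and the two tiny "documents" are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_name: {word: tf * idf}} for a dict of tokenized documents."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in docs.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

# Hypothetical toy corpus: two very short "books".
docs = {
    "austen": "the sister and the ball".split(),
    "wells": "the martian and the machine".split(),
}
scores = tf_idf(docs)
print(scores["austen"])
```

Words that occur in every document, like "the" and "and", get an idf of zero and so a tf-idf of zero, however frequent they are. Words found in only one document, like "sister" or "martian", are the ones that stand out.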
6. Explain why you may want to do tokenization by bigram
7. Please install the following packages, if you have not already:
   1. tidyverse
   2. tidytext
   3. gutenbergr
Pick two or more authors that you are familiar with, download their texts using the gutenbergr package, and do a basic analysis of word frequencies and TF-IDF
# Everything up to 7 is due by Monday. 7 can be delivered by Friday. But don't do it :) Hand them all in by Monday.