1. Explain in your words what the unnest_tokens function does
It is a function from the tidytext package that restructures text into a one-token-per-row format. It splits a text column (the input) into tokens, such as words, and returns one row for each token. By default it also lowercases the tokens and strips punctuation.
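A minimal sketch of this in practice, assuming dplyr and tidytext are installed. The two lines of text and the column names `line`, `text` and `word` are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Two made-up lines of text; `line` and `text` are illustrative column names.
poem <- tibble(
  line = 1:2,
  text = c("The quick brown fox", "jumps over the lazy dog.")
)

# unnest_tokens(tbl, output, input): `word` is the new column, `text` is the input.
# Other columns (here `line`) are kept, so each token remembers where it came from.
tokens <- poem %>%
  unnest_tokens(word, text)

print(tokens)
```

Each word ends up in its own row, which is the shape that count(), joins and the other dplyr verbs expect.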
2. Explain in your words what the gutenbergr package does
Project Gutenberg, with the help of volunteers, digitizes books whose copyright has expired. The gutenbergr R package makes these books available to R users: we can search the catalog metadata, then download and process the texts with it.
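A short sketch of how this might look, assuming gutenbergr and dplyr are installed. The helper name `fetch_books` is hypothetical, and the author and book IDs (1342 for Pride and Prejudice, 84 for Frankenstein) are only examples. Downloading needs a network connection, so the call itself is left commented out.

```r
library(dplyr)
library(gutenbergr)

# Search the metadata that ships with the package (no download yet).
austen <- gutenberg_works(author == "Austen, Jane")

# Hypothetical helper: download books by Project Gutenberg ID and keep
# the title and author next to every line of text.
fetch_books <- function(ids) {
  gutenberg_download(ids, meta_fields = c("title", "author"))
}

# Needs a network connection:
# books <- fetch_books(c(1342, 84))
```

The result of the download is a data frame with one row per line of text, ready to be tokenized.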
3. Explain in your words how sentiment lexicons work
They are like dictionaries that match words with a sentiment or emotion, for example by classifying them as positive, negative, or neutral. Once we match the words in our text against a lexicon, we can analyze how often each sentiment occurs. Even if we cannot read the language the text is written in, a lexicon for that language can still give us an overall impression of its tone.
4. How does inner_join provide sentiment analysis functionality?
We match the words in our text with the sentiments in the lexicon. Many words in our text may be missing from the lexicon, and many words in the lexicon may never appear in our text. inner_join keeps only the intersection: the words that occur in both, each now labeled with its sentiment.
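The join can be sketched as follows, assuming dplyr and tidytext are installed. The three-word `toy_lexicon` is a hypothetical stand-in; a real analysis would use a lexicon returned by tidytext's get_sentiments().

```r
library(dplyr)
library(tidytext)

text_df <- tibble(text = "The happy dog met a sad cat and a happy bird")

# Hypothetical three-word lexicon, for illustration only.
toy_lexicon <- tibble(
  word = c("happy", "sad", "angry"),
  sentiment = c("positive", "negative", "negative")
)

# Tokenize, then keep only the words that appear in both tables.
scored <- text_df %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word")

# Frequency of each sentiment among the matched words.
sentiment_counts <- scored %>%
  count(sentiment)

print(sentiment_counts)
```

Words such as "dog" drop out because the lexicon does not contain them, and "angry" drops out because the text does not contain it.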
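5. Explain in your words what tf-idf does
It tells us how important a word is to one document within a collection of documents. The term frequency (tf) measures how often the word occurs in the document, and the inverse document frequency (idf) down-weights words that occur in many of the documents. Words that are frequent in one text but rare in the others get the highest scores.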
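A small sketch of the computation using tidytext's bind_tf_idf(), assuming dplyr and tidytext are installed. The two one-line "documents" and the names `docs`, `word_counts` and `scores` are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Two tiny made-up documents that share only the word "sea".
docs <- tibble(
  doc = c("a", "b"),
  text = c("whale whale sea ship", "garden garden sea house")
)

# Count how often each word occurs in each document.
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word)

# bind_tf_idf(tbl, term, document, n) adds tf, idf and tf_idf columns.
scores <- word_counts %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

print(scores)
```

Because "sea" occurs in both documents, its idf is zero and so is its tf-idf, while "whale" and "garden" each stand out in the one document that uses them.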
6. Explain why you may want to do tokenization by bigram
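One commonly cited motivation, offered here as an illustration rather than as the author's answer: single words lose context such as negation, while pairs of adjacent words keep it. A minimal sketch, assuming dplyr and tidytext are installed and using a made-up sentence:

```r
library(dplyr)
library(tidytext)

review <- tibble(text = "I am not happy with this book")

# token = "ngrams" with n = 2 yields overlapping pairs of adjacent words.
bigrams <- review %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(bigrams)
```

Tokenized word by word, "happy" would count as positive. As a bigram, "not happy" keeps the negation next to it.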
7. Please install the following packages, if you have not already:
1. tidyverse
2. tidytext
3. gutenbergr
Pick two or more authors that you are familiar with, download their texts using the gutenbergr package, and do a basic analysis of word frequencies and TF-IDF.
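One possible shape for the analysis step, as a sketch under stated assumptions rather than a finished solution. The function name `analyze_books` and the tiny `sample_books` data are hypothetical. The input is assumed to have an `author` and a `text` column, which is the shape gutenberg_download(ids, meta_fields = "author") returns.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: word frequencies and tf-idf per author.
# `books` is assumed to have an `author` column and a `text` column.
analyze_books <- function(books) {
  books %>%
    unnest_tokens(word, text) %>%              # one word per row
    anti_join(stop_words, by = "word") %>%     # drop common stop words
    count(author, word, sort = TRUE) %>%       # word frequencies per author
    bind_tf_idf(word, author, n) %>%           # each author is one "document"
    arrange(desc(tf_idf))
}

# Made-up stand-in for downloaded books, so the sketch runs offline.
sample_books <- tibble(
  author = c("Author A", "Author A", "Author B"),
  text = c(
    "The whale swam past the ship",
    "The whale dived",
    "The garden was full of roses"
  )
)

result <- analyze_books(sample_books)
print(result)
```

With real data, `sample_books` would be replaced by the data frame downloaded with gutenbergr for the chosen authors.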
# Everything before 7 is due by Monday. 7 can be delivered by Friday. But don't do that :) Hand them all in by Monday.