Last active
March 10, 2017 04:00
Capstone Project I Proposal
The problem: I want to classify each product review as positive, negative, or neutral based on the words it uses. (Given a review, the goal is to predict the user’s attitude.)
According to Wikipedia, sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
My client would be amazon.com or another e-commerce giant that wants to know which of its products are well liked, so that it can invest accordingly. Based on my analysis, the company could make certain products more available or recommend similar products in order to retain and grow its customer base.
For negative sentiment, the client could research what drives it, especially in relation to competitors. Where there is negative conversation, the client could reach out to those reviewers.
With sentiment analysis, my client can close the analytical loop between publicly expressed sentiment, engagement action, subsequent purchase intent, and ultimately, product purchase.
Lastly, the client can use my sentiment analysis to track trends in its industry or even among its competitors, allowing it to better manage its own brand strategy.
The data I will use are book reviews from amazon.com: 213,335 reviews for 8 randomly chosen books.
Features are:
Each entry is separated by a newline character ('\n'). Each entry contains four attributes, which are separated by a space (' '):
1. review score
2. tail of review url ([Web Link])
3. review title
4. HTML of review text
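The entry format described above could be read with a few lines of Python. This is only a sketch under stated assumptions, not an official loader for the dataset: `parse_entry` and `strip_html` are hypothetical helper names, and the sample entry is made up. Because a review title can itself contain spaces, the sketch separates only the first two attributes and leaves the title and review HTML together in one string.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher, good enough for a sketch


def strip_html(fragment):
    """Replace tags with spaces and decode entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", fragment)).strip()


def parse_entry(line, sep=" "):
    """Split one entry into (score, url_tail, remainder).

    Assumption: the score and URL tail contain no separator characters,
    so splitting twice from the left is safe. The remainder still holds
    the title followed by the review HTML.
    """
    score, url_tail, remainder = line.rstrip("\n").split(sep, 2)
    return float(score), url_tail, remainder


# Made-up entry, for illustration only
sample = "5.0 /review/R123 Great read <span>I loved this book &amp; its characters.</span>\n"
score, url_tail, remainder = parse_entry(sample)
text = strip_html(remainder)
```

If the real files turn out to use a different separator, passing it as `sep` would also make it possible to split the title from the review text.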
My approach to solving this problem would be:
PREP THE TEXT; TRAIN SET
-prepare review text by removing extraneous whitespace and punctuation
-split string into words
-compute word frequency
-sort by most common words
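The text preparation steps above can be sketched with the Python standard library. This is a minimal illustration, not a final pipeline: `tokenize` and `word_frequencies` are hypothetical helper names, the two reviews are made up, and the choice to keep only letters and apostrophes is an assumption about what counts as a word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop punctuation and extra spaces, split into words."""
    # Assumption: only letters and apostrophes belong to words;
    # every other run of characters becomes a single space.
    cleaned = re.sub(r"[^a-z']+", " ", text.lower())
    return cleaned.split()


def word_frequencies(reviews):
    """Count how often each word occurs across a list of review strings."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return counts


# Made-up reviews, for illustration only
reviews = ["Great book, great characters!", "The book was   boring..."]
counts = word_frequencies(reviews)
top_words = counts.most_common(3)  # sorted from most to least frequent
```

`Counter.most_common` already returns the words sorted by frequency, so no separate sorting step is needed.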
BUILD SENTIMENT LEXICON (there are 5 different star ratings, 3 different sentiments)
-figure out which words are most common in each of the 5 star categories
-break up words into different sentiment categories (good, neutral, or bad)
-allocate training and test sets
-train model
-group words into three different sentiment categories
-test with unseen data
-measure accuracy?
-refine model?
-consider context to discern language nuances such as sarcasm or hope?
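One way the lexicon steps above might fit together is a simple word-vote classifier, sketched below. It is a stand-in for whatever model is eventually trained, not a decision about the final method. The mapping of star ratings to sentiments (1-2 negative, 3 neutral, 4-5 positive) is an assumption, all function names are hypothetical, and the training reviews are made up.

```python
import random
import re
from collections import Counter

SENTIMENTS = ("negative", "neutral", "positive")


def star_to_sentiment(stars):
    """Assumed mapping: 1-2 stars negative, 3 neutral, 4-5 positive."""
    if stars <= 2:
        return "negative"
    return "neutral" if stars == 3 else "positive"


def tokenize(text):
    """Lowercase, replace anything but letters and apostrophes, split."""
    return re.sub(r"[^a-z']+", " ", text.lower()).split()


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_lexicon(train):
    """Assign each word to the sentiment where its relative frequency is highest."""
    counts = {s: Counter() for s in SENTIMENTS}
    for stars, text in train:
        counts[star_to_sentiment(stars)].update(tokenize(text))
    # Total words per sentiment; "or 1" avoids dividing by zero for empty classes.
    totals = {s: sum(c.values()) or 1 for s, c in counts.items()}
    vocab = set().union(*counts.values())
    return {
        w: max(SENTIMENTS, key=lambda s: counts[s][w] / totals[s])
        for w in vocab
    }


def predict(lexicon, text):
    """Majority vote over the lexicon labels of the known words in a review."""
    votes = Counter(lexicon[w] for w in tokenize(text) if w in lexicon)
    # Assumption: reviews with no known words default to neutral.
    return votes.most_common(1)[0][0] if votes else "neutral"


def accuracy(lexicon, test):
    """Fraction of (stars, text) pairs whose predicted sentiment is correct."""
    hits = sum(predict(lexicon, text) == star_to_sentiment(stars)
               for stars, text in test)
    return hits / len(test)


# Made-up (stars, text) pairs, for illustration only
train = [
    (5, "wonderful gripping story"),
    (1, "dull boring story"),
    (3, "average okay read"),
]
lexicon = build_lexicon(train)
guess = predict(lexicon, "a wonderful gripping tale")
```

A bag-of-words vote like this ignores context, so it would not catch sarcasm; that limitation is exactly what the refinement steps in the list are meant to address.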
5. Deliverables: code + slide deck