1. Introduction
What do we mean by the term machine learning from today's perspective on technology? What can we achieve through complex algorithms?
The simple answer to these questions comes from the need to recognize patterns, make predictions, and give a machine the ability to operate over data without static program instructions. Machine learning is the field of computer science that gives machines/computers the ability to learn without being explicitly programmed. It is employed in a range of computing tasks where designing and programming explicit, well-performing algorithms is infeasible; examples include email filtering, network intrusion detection, computer vision, and optical character recognition (OCR).
Machine learning is closely related to computational statistics, which focuses on prediction-making through the use of computers. It is also often conflated with data mining because of the exploratory data analysis involved in both fields. Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that are greatly helpful in prediction, also called predictive analytics. These analytical models allow us to produce reliable, repeatable decisions and results, and to discover insights by learning from trends and historical relationships in data.
Now that we have discussed machine learning in general and its growing role in problem solving, we can consider sentiment analysis, otherwise known as opinion mining. It draws on several fields, including natural language processing (NLP), text mining, decision making, and linguistics. It is a type of text analysis that classifies text and makes decisions by extracting and analyzing it. Opinions can be categorized as positive or negative, and the analysis measures the degree of positivity or negativity associated with an entity or event (people, organizations, social issues). In short, it is the study of people's opinions, emotions, and appraisals toward any social issue, person, or entity.
2. Related Work
Recently, most research and studies have focused on sentiment analysis of products and services from companies such as Amazon and Walmart, while for events and social issues the data is typically retrieved from social media such as Twitter. Sentiment analysis is one of the major tasks of natural language processing (NLP) and has gained much attention in recent years. For a recommender system, sentiment analysis has proven to be a valuable technique. A recommender system aims to predict a target user's preference for an item. Mainstream recommender systems work on explicit data sets: for example, collaborative filtering works on the rating matrix, and content-based filtering works on the metadata of the items.
In many social networking services and e-commerce websites, users can provide text reviews, comments, or feedback on items. This user-generated text provides a rich source of users' sentiments about numerous products and items. Potentially, for an item, such text can reveal both the relevant features/aspects of the item and the users' sentiment on each feature. The item's features/aspects described in the text play the same role as the metadata in content-based filtering, but the former are more valuable for the recommender system. Since these features are broadly mentioned by users in their reviews, they can be seen as the most crucial features that significantly influence the user's experience with the item, while the item's metadata may ignore features that users care about. For different items with common features, a user may express different sentiments. Also, a feature of the same item may receive different sentiments from different users. Users' sentiments on the features can be regarded as a multidimensional rating score, reflecting their preference for the items.
Based on the features/aspects and the sentiments extracted from user-generated text, a hybrid recommender system can be constructed. There are two kinds of motivation to recommend a candidate item to a user. The first is that the candidate item has numerous features in common with the user's preferred items; the second is that the candidate item receives high sentiment on its features. For a preferred item, it is reasonable to believe that items with the same features will have a similar function or utility, so these items are also likely to be preferred by the user. On the other hand, for a feature shared by two candidate items, other users may give positive sentiment to one and negative sentiment to the other. Clearly, the more highly rated item should be recommended to the user. Based on these two motivations, a combined ranking score of similarity and sentiment rating can be constructed for each candidate item, as sketched below.
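As an illustration only (the weight alpha and the two component scores below are assumptions made for this sketch, not part of any specific published system), such a combined score could look like this:

# Illustrative sketch: blend feature similarity with a normalized sentiment rating.
# Both inputs are assumed to be scaled to the range 0..1; alpha is a free weight.
def combined_score(feature_similarity, sentiment_rating, alpha=0.5):
    return alpha * feature_similarity + (1 - alpha) * sentiment_rating

# Rank two hypothetical candidate items by the blended score (toy numbers).
candidates = {"item_a": (0.8, 0.4), "item_b": (0.6, 0.9)}
ranking = sorted(candidates.items(),
                 key=lambda kv: combined_score(*kv[1]),
                 reverse=True)
print(ranking)  # item_b ranks first with the default alpha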
Beyond the difficulty of sentiment analysis itself, applying sentiment analysis to reviews or feedback also faces the challenge of spam and biased reviews. One line of work focuses on evaluating the helpfulness of each review. Poorly written reviews or feedback are of little help to a recommender system. Moreover, a review can be designed to hinder sales of a target product and thus be harmful to the recommender system even if it is well written.
Researchers have also found that long and short forms of user-generated text should be treated differently. An interesting result shows that short reviews are sometimes more helpful than long ones, because it is easier to filter out the noise in a short text. For long texts, growing length does not always bring a proportionate increase in the number of features or sentiments expressed.
In this project, the aim is to tackle the problem of sentiment polarity categorization, which is one of the fundamental problems of sentiment analysis. A general process for sentiment polarity categorization is proposed with detailed process descriptions. The data used in this study are online restaurant reviews collected from various websites. Experiments on sentence-level categorization are performed using various algorithms, and comparisons between them are shown. Finally, we give insight into our future work on sentiment analysis.
3. Implementation
3.1 Software Requirements
The technologies used for the project can vary, but we stick to the most common software used for machine learning.
• Python: PyCharm, IDLE, Spyder, or Anaconda can be used for implementing machine learning with Python.
• Libraries: multiple libraries are used here for preprocessing the data and for the various data-filtering steps.
NumPy: NumPy is the fundamental package for scientific computing with Python. It contains, among other things:
o a powerful N-dimensional array object
o sophisticated (broadcasting) functions
o tools for integrating C/C++ and Fortran code
o useful linear algebra, Fourier transform, and random number capabilities
o Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Natural Language Toolkit (NLTK): NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.
Scikit-learn: scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
It is licensed under a permissive simplified BSD license and is distributed in many Linux distributions, encouraging academic and commercial use.
The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:
• NumPy: Base n-dimensional array package
• IPython: Enhanced interactive console
• Sympy: Symbolic mathematics
• Pandas: Data structures and analysis
In addition, the project uses scikit-learn's CountVectorizer class to turn the cleaned reviews into bag-of-words feature vectors; a short usage sketch follows this list.
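As a quick illustration of how these pieces fit together (the two sample reviews below are made up for the sketch), CountVectorizer turns raw text into a NumPy feature matrix that the classifiers can consume:

# Minimal sketch: a toy bag-of-words matrix built with pandas, scikit-learn, and NumPy.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

reviews = pd.Series(["The food was great", "Service was slow and the food was cold"])
cv = CountVectorizer(max_features=1500)
x = cv.fit_transform(reviews).toarray()  # dense NumPy array: one row per review
print(sorted(cv.vocabulary_))            # vocabulary learned from the reviews
print(x.shape, x)                        # matrix of word counts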
3.2 Project Architecture
3.3 Algorithms in Context:
Different kinds of classifiers should always be considered for a comparative study over a given dataset. Given the properties of the dataset, you might have clues that favour some methods; however, it is still advisable to experiment with all of them if possible. If you are deciding between decision trees, naive Bayes, and SVMs to solve a problem, it is often best to test each one: build a decision tree and a naive Bayes classifier, then have a shoot-out using the training and validation data you have. Whichever performs best is more likely to perform well in the field.
Naive Bayes requires you to build the feature representation by hand; there is no way to simply toss a bunch of tabular data at it and have it pick the best features to classify on. Picking which features matter is up to you. Decision trees, by contrast, will pick the best features for you from tabular data. If there were a way for naive Bayes to pick features, you would be getting close to the techniques that make decision trees work that way. Given this, you may need to combine naive Bayes with other statistical techniques to guide you toward which features best classify, and that could mean using decision trees. Naive Bayes is a probabilistic classifier: although it yields a categorical prediction, it answers in terms of probabilities, such as (A 90%, B 5%, C 2.5%, D 2.5%). Naive Bayes can perform quite well, and it does not overfit nearly as much, so there is no need to prune or post-process the model. That makes it a simpler algorithm to implement. However, it is harder to debug and understand, because it is all probabilities being multiplied thousands of times, so you have to be careful to test that it is doing what you expect. Naive Bayes does quite well when the training data does not contain all possibilities, so it can be very good with small amounts of data. Decision trees work better with lots of data compared to naive Bayes.
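For instance (toy numeric data, purely illustrative), scikit-learn's naive Bayes classifiers expose these class probabilities directly through predict_proba:

# Sketch: naive Bayes returns per-class probabilities, not just a hard label.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.2]])
y = np.array([0, 0, 1, 1])
nb = GaussianNB().fit(X, y)
print(nb.predict_proba([[0.5, 0.5]]))  # probability vector over the two classes
print(nb.predict([[0.5, 0.5]]))        # the most probable class label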
Naive Bayes is used a lot in robotics and computer vision and does quite well at those tasks. Decision trees perform very poorly in those situations. Teaching a decision tree to recognize poker hands by looking at millions of poker hands does very poorly, because royal flushes and quads occur so rarely that they often get pruned out. If they are pruned out of the resulting tree, it will misclassify those important hands (see the discussion of tall trees and pruning below). Now imagine trying to diagnose cancer with this: cancer does not occur in the population in large numbers, so it is more likely to be pruned out. The good news is that this can be handled with weights: we weight a winning hand, or having cancer, higher than a losing hand, or not having cancer, which boosts it up the tree so it will not get pruned out. Again, this is part of tuning the resulting tree to the situation.
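In scikit-learn this kind of re-weighting can be expressed with the class_weight parameter; the toy data and the 9:1 weighting below are just an example of boosting a rare class:

# Sketch: up-weighting a rare positive class so the tree does not ignore it.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # the positive class appears only once
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 9}, random_state=0)
clf.fit(X, y)
print(clf.predict([[9]]))  # the rare class still gets its own leaf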
On the other hand, decision trees are very flexible, easy to understand, and easy to debug. They work for both classification and regression problems, so whether you are trying to predict a categorical value like (red, green, up, down) or a continuous value like 2.9 or 3.4, decision trees handle both. Probably one of the nicest things about decision trees is that they only need a table of data and will build a classifier directly from that data without any up-front design work. To some degree, properties that do not matter will not be chosen as splits and will eventually be pruned away, so the method is very tolerant of irrelevant features. To start with, it is set-and-forget.
However, there is a downside. Simple decision trees tend to overfit the training data more than other techniques, which means you generally have to do tree pruning and tune the pruning procedure. You did not have any up-front design cost, but you pay it back in tuning the tree's performance. Also, simple decision trees divide the data into axis-aligned rectangles, so building clusters around things means splitting a lot to encompass clusters of data. Splitting a lot leads to complex trees and raises the probability that you are overfitting. Tall trees get pruned back, so while you can build a cluster around some feature in the data, it might not survive the pruning process. There are other techniques, such as oblique (multivariate) splits, that let you split along several variables at once, creating splits in the space that are neither horizontal nor vertical (0 < slope < infinity). That is powerful, but your tree becomes harder to understand and these algorithms are more complex to implement. Other techniques like boosting and random forests can perform quite well, and some feel these techniques are essential to getting the best performance out of decision trees. Again, this adds more things to understand, use, and tune, and hence more to implement. In the end, the more we add to the algorithm, the higher the barrier to using it.
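For example (the parameter values are illustrative, not tuned for this project), scikit-learn exposes both pruning controls and ensembling directly:

# Sketch: two common ways to fight decision-tree overfitting.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# limit depth and apply cost-complexity pruning to a single tree
pruned_tree = DecisionTreeClassifier(max_depth=5, ccp_alpha=0.01, random_state=0)
# or average many randomized trees instead of hand-tuning one
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# pruned_tree.fit(x_train, y_train); forest.fit(x_train, y_train)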
Decision trees are also neat because they tell you which inputs are the best predictors of the outputs, so a decision tree can often show whether there is a statistical relationship between a given input and the output and how strong that relationship is. Often the resulting decision tree is less important than the relationships it describes. So decision trees can be used as a research tool for learning about your data before building other classifiers.
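Continuing that idea, a fitted tree reports per-feature importances; the sketch below assumes the classifier_DecisionTree and cv objects defined in the code at the end of this report:

# Sketch: which bag-of-words features drive the tree's splits?
import numpy as np

importances = classifier_DecisionTree.feature_importances_
vocab = np.array(sorted(cv.vocabulary_, key=cv.vocabulary_.get))  # words in column order
top = np.argsort(importances)[::-1][:10]                          # ten most important columns
for word, score in zip(vocab[top], importances[top]):
    print(word, round(score, 3))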
Moreover, the naive Bayes classifier (NBC) and the support vector machine (SVM) each have different options (for the SVM, notably the choice of kernel function), and both are sensitive to parameter optimization (i.e., different parameter selections can significantly change their output). So, if you have a result showing that NBC performs better than SVM, this is only true for the selected parameters; for another parameter selection you might find the SVM performing better. In general, if the assumption of independence in NBC is satisfied by the variables of your dataset and the degree of class overlap is small (i.e., there is a potential linear decision boundary), NBC would be expected to perform well. For some datasets, with optimization using wrapper feature selection for example, NBC may beat other classifiers. Even if it achieves only comparable performance, NBC is more desirable because of its high speed.
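One way to keep such comparisons fair is to tune each model's parameters on the same cross-validation folds; the grid below is an illustrative example, not the setting used in this project:

# Sketch: grid search over SVM parameters before comparing it with other models.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
# search.fit(x_train, y_train)
# print(search.best_params_, search.best_score_)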
Stepwise Code Explanation:
• Importing the basic libraries, Numpy & Pandas.
• Next we need to import the dataset with the delimiter set to a tab character.
• Setting quoting to 3 (csv.QUOTE_NONE) tells pandas to ignore double quotes inside the reviews, which avoids parsing errors.
• In order to filter the data we need to download the stopword list.
• For Natural language processing we need to import NLTK.
• What are stop words: a list of common words (e.g. 'the', 'a', 'in') that carry little sentiment and need to be removed from the reviews.
• Next, we filter the data by removing punctuation, followed by converting upper case letters to lower case.
• Here, splitting is needed, so we split each review string into a list of its individual words.
• PorterStemmer is key to reducing words to their root form by stripping suffixes such as 'ly', 'ed', and 'es' (a short stemming sketch follows this list).
The above process is done word by word and is repeated for all the words in the sentence.
Once the words are filtered and processed, they are joined back into a single string.
The whole splitting, preprocessing, and joining step is wrapped in a for loop and repeated for the whole dataset. The corpus list contains the final filtered reviews.
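As a small illustration of what the stemmer does (the example words are chosen arbitrarily):

# Sketch: Porter stemming reduces inflected words to a common root.
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
for word in ["loved", "loving", "dishes", "waited"]:
    print(word, "->", ps.stem(word))
# loved -> love, loving -> love, dishes -> dish, waited -> wait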
For creating the bag-of-words model:
• We must import the CountVectorizer class from scikit-learn.
• We create a CountVectorizer object with max_features set to 1500 words.
• We then separate the dependent and independent data.
• The dependent variable is the class (liked or not liked); the independent variables are the bag-of-words features.
Now that feature extraction and data processing are done, we can begin classification.
4. Result
Below is the result set showing the comparison between the different algorithms used: Gaussian Naive Bayes, Decision Tree, and Support Vector Machine.
As we can see, although the accuracy, F1 score, and recall are comparatively better for GaussianNB, the SVM does best on precision overall.
Confusion Matrix:
TN: negative review correctly predicted as negative
TP: positive review correctly predicted as positive
FP: negative review incorrectly predicted as positive
FN: positive review incorrectly predicted as negative
Total correct predictions = TN + TP
Total Incorrect predictions = FN + FP
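As a worked example with made-up counts (not the project's actual results), the metrics follow directly from these four quantities:

# Worked example: computing the metrics by hand from hypothetical counts.
TN, TP, FP, FN = 70, 60, 30, 40
accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # 130 / 200 = 0.65
precision = TP / (TP + FP)                                   # 60 / 90  ~ 0.667
recall    = TP / (TP + FN)                                   # 60 / 100 = 0.60
f1        = 2 * precision * recall / (precision + recall)    # ~ 0.632
print(accuracy, precision, recall, f1)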
5. Conclusion
The results of our experiment show the importance of different algorithms and their applications. The intention was to test the algorithms stated above (GaussianNB, Decision Tree, and SVM) on a simple text-analysis task, make predictions, and compare the algorithms to show the applicability and usability of the most effective algorithm in the respective problem domain.
For beginners trying to understand machine learning, testing out the algorithms individually is a good way to start. Each algorithm has its own merits and demerits in terms of application domain, efficiency, handling of varied dataset sizes, whether supervised or unsupervised, and the speed at which output is generated.
6. References
1. Gottschalk, Louis August, and Goldine C. Gleser. The measurement of psychological states through the content analysis of verbal behavior. Univ of California Press, 1969.
2. USA Issued 7,136,877, Volcani, Yanon; & Fogel, David B., "System and method for determining and controlling the impact of text", published June 28, 2001.
3. Turney, Peter (2002). "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews". Proceedings of the Association for Computational Linguistics. pp. 417–424. arXiv:cs.LG/0212032 .
4. Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.
5. Pang, Bo; Lee, Lillian (2005). "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales". Proceedings of the Association for Computational Linguistics (ACL). pp. 115–124.
6. Snyder, Benjamin; Barzilay, Regina (2007). "Multiple Aspect Ranking using the Good Grief Algorithm". Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL). pp. 300–307.
7. Qu, Yan, James Shanahan, and Janyce Wiebe. "Exploring attitude and affect in text: Theories and applications." In AAAI Spring Symposium) Technical report SS-04-07. AAAI Press, Menlo Park, CA. 2004.
8. Vryniotis, Vasilis (2013). The importance of Neutral Class in Sentiment Analysis.
9. https://pandas.pydata.org/
10. http://www.nltk.org/
11. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
12. https://www.python.org/
# Natural Language Processing
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
# Cleaning the texts
import re
import nltk
# download the stopword list; any word from a review that appears in this list will be removed
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # keep only letters (upper or lower case)
    review = review.lower()                                  # convert all upper case letters to lower case
    review = review.split()                                  # split the string into a list of words
    ps = PorterStemmer()
    # go through the review word by word, drop stopwords, and stem the remaining words
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
x = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
"""--------------------------------------------------_SVC---------------------------------------------------"""
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # (sklearn.cross_validation in very old scikit-learn versions)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier_SVC= SVC(kernel = 'linear', random_state = 0)
classifier_SVC.fit(x_train, y_train)
# Predicting the Test set results
y_pred = classifier_SVC.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
"""-------------------------------------------GaussianNB()---------------------------------------------------"""
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # (sklearn.cross_validation in very old scikit-learn versions)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
# Predicting the Test set results
y_pred_1 = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_1 = confusion_matrix(y_test, y_pred_1)
"""------------------------------------------------DecisionTree----------------------------------------------"""
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # (sklearn.cross_validation in very old scikit-learn versions)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier_DecisionTree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier_DecisionTree.fit(x_train, y_train)
# Predicting the Test set results
y_pred_2 = classifier_DecisionTree.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_2 = confusion_matrix(y_test, y_pred_2)
"""-----------------------------------------------------------------------------------------"""
# save confusion matrix and slice into four pieces
from sklearn import metrics
confusion = metrics.confusion_matrix(y_test, y_pred_1)
print(confusion)
#[row, column]
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
# compute precision, recall, and F1 score using scikit-learn metrics
from sklearn.metrics import precision_score, f1_score,recall_score
print("_______________________________GaussianNB________________________________________________")
print("Accuracy_score for Gaussian Naive Bayes = ",metrics.accuracy_score(y_test, y_pred_1))
print(" Precision = ",precision_score(y_test, y_pred_1))
print(" f1_score = ", f1_score(y_test, y_pred_1))
print(" recall_score = ", recall_score(y_test, y_pred_1))
print('\nConfusion matrix:\n', confusion_matrix(y_test, y_pred_1))
print("_______________________________DecisionTree________________________________________________")
print("Accuracy_score for DecisionTree = ",metrics.accuracy_score(y_test, y_pred_2))
print(" Precision = ", precision_score(y_test, y_pred_2))
print(" f1_score = ", f1_score(y_test, y_pred_2))
print(" recall_score = ", recall_score(y_test, y_pred_2))
print('\nConfusion matrix:\n', confusion_matrix(y_test, y_pred_2))
print("________________________________________Support Vector M _______________________________________")
print("Accuracy_score for Support Vector Machine = ",metrics.accuracy_score(y_test, y_pred))
print(" Precision = ",precision_score(y_test, y_pred))
print(" f1_score = ", f1_score(y_test, y_pred))
print(" recall_score ", recall_score(y_test, y_pred))
print ('\nConfussion matrix:\n', confusion_matrix(y_test, y_pred))
"""_____________________________________________________________________________________
Accuracy = TP + TN / TP + TN + FP + FN
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * Precision * Recall / Precision + Recall
# calculate accuracy
from sklearn.metrics import classification_report
print ('\nClasification report:\n', classification_report(y_test,))
print ('\nConfussion matrix:\n', confusion_matrix(y_test, classifier.predict(x_test)))
# print the classification report, confusion matrix, and accuracy for every model
from sklearn import metrics
from sklearn.metrics import classification_report
ova_models = [classifier_SVC, classifier, classifier_DecisionTree]
for model in ova_models:
    print("***********************{}**********".format(model))
    print(classification_report(y_test, model.predict(x_test)))
    print('\nConfusion matrix:\n', confusion_matrix(y_test, model.predict(x_test)))
    print("Accuracy_score: ", metrics.accuracy_score(y_test, model.predict(x_test)))
    print(" ")