A tutorial to find the best scikit-learn classifiers for sentiment analysis
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"<H1>Sentiment Analysis</H1>\n",
"<br>\n",
"Internet is reaching to more and more people everyday. There is more and more interaction among people through social networking sites. While we have plenty of positives, we can't deny existence of onlune bullying. Sentiment analysis can help us tackle this. Here's an attempt at doing that using scikit, nltk and panda.\n",
"\n",
"Panda helps us read csv data. There will be a seperate post on pandas.\n",
"\n",
"Panda read the training data and presented the data to us in the form of data frame. You may think data frame as sql table from where we can query data.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>comment_text</th>\n",
" <th>toxic</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1231.0</td>\n",
" <td>you are bad.</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id comment_text toxic\n",
"0 1231.0 you are bad. 1"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"file_path = \"../../..//Downloads/train_1.csv\"\n",
"df = pd.read_csv(file_path)\n",
"df[:1]"
]
},
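{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a DataFrame can be filtered much like an SQL <i>WHERE</i> clause. A minimal sketch, assuming <i>df</i> has been loaded as above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# like SQL: SELECT * FROM train WHERE toxic = 1 LIMIT 3\n",
"df[df.toxic == 1][:3]"
]
},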
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"It's very important to split the given data into two parts. One for training our model and other for testing our model. Scikit provides us with an option to split the data with <i>train_test_split</i>. <i>random_state</i> ensures same data is there in test and train, how much ever times you split.\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(df.comment_text, df.toxic, test_size=0.20, random_state=42)\n",
"name = df.toxic.name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"Let's be practical. Any data in the world will not be ideal with correct spelling. Some of them will also have internet typo like itttttttt for it. Textblob provides us with collection of words. <i>TextBlob(message).words</i> will give us collection of words from a sentence. We can do word.correct() to correct spelling of the word. It has about 70% accuracy. Lemma is converting words into it's root form. Like given <i>playing</i> it will return <i>play</i>. word.lemmatize() is a callable function to do the same. \n",
"\n",
"It's important to preprocess because any model will ultimately rely on the occurence of similar words in test and train data. \n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from textblob import TextBlob\n",
"import nltk\n",
"def split_into_lemmas(message):\n",
" message = unicode(message, 'utf8').lower()\n",
" words = TextBlob(message).words\n",
" return [word.lemma for word in words]"
]
},
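{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what correction and lemmatization actually do, here is a small sketch using TextBlob's <i>Word</i> class (the example words are chosen only for illustration):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from textblob import Word\n",
"# spelling correction: returns TextBlob's best guess (roughly 70% accurate)\n",
"print(Word(\"speling\").correct())\n",
"# lemmatization: 'v' requests the verb lemma, e.g. 'playing' -> 'play'\n",
"print(Word(\"playing\").lemmatize(\"v\"))"
]
},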
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"We are going to train four different learning models on the data given. We are going to compare the results of the\n",
"different classifiers. Describing each classifier is beyond the scope of this post. I will provide the relevent link \n",
"at the end of this post.\n",
"\n",
"Pipelines are like a collection of function which is applied to text data sequentially.\n",
"\n",
"Here our pipeline has three functions:\n",
" a. CountVectorizer()\n",
" b. TfidfTransformer()\n",
" c. Classifier() which is SGDClassifier, LogisticRegression, MultinomialNB, SVC in our four different pipeline\n",
" \n",
"CountVectorizer : Convert a collection of text documents to a matrix of token counts. For example : \"You are awesome\"\n",
" will be returned as per the analyzer given to CountVectorizer which is split_into_lemmas in our case. So, the \n",
" CountVectorizer will turn \"You are awesome\" into \"You\", \"are\", \"awesome\". ngram_range further split each token\n",
" into substrings. We have given ngram_range as (2,4), which means each word will be substringed into substring\n",
" of 2, 3 and 4 characters. stop_words remove the commonly occuring english words from the given text.\n",
"\n",
"TfidfTransformer : Transform a count matrix to a normalized tf or tf-idf representation. Tf means term-frequency \n",
" while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme \n",
" in information retrieval, that has also found good use in document classification. The goal of using \n",
" tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down \n",
" the impact of tokens that occur very frequently in a given corpus and that are hence empirically less \n",
" informative than features that occur in a small fraction of the training corpus.\n",
" \n",
"Classifier : Classifier actually learns the data and classifies into the labels. \n",
"</br>"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.svm import SVC"
]
},
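{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the pipelines, here is a small sketch of what the first two steps produce, using a made-up two-document corpus and CountVectorizer's default word analyzer:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# toy corpus to illustrate the count matrix and its tf-idf weighting\n",
"docs = [\"you are awesome\", \"you are bad\"]\n",
"vect = CountVectorizer()\n",
"counts = vect.fit_transform(docs)\n",
"print(vect.get_feature_names())  # vocabulary (newer scikit-learn uses get_feature_names_out())\n",
"print(counts.toarray())          # raw token counts per document\n",
"print(TfidfTransformer().fit_transform(counts).toarray())  # terms shared by all docs are down-weighted"
]
},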
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<h2>Making pipeline of SGDClassifier</h2>\n",
"<br>\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,\n",
" binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,\n",
" encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,\n",
" max_features=None, min_df=1, ngram_range=(2, 4), preprocess... penalty='l2', power_t=0.5, random_state=None, shuffle=True,\n",
" verbose=0, warm_start=False))])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import Pipeline\n",
"text_clf_SGDClassifier = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),\n",
" ('tfidf', TfidfTransformer()),\n",
" ('clf', SGDClassifier()),\n",
"])\n",
"text_clf_SGDClassifier.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<h2>Making pipeline of LogisticRegression<h2>\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,\n",
" binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,\n",
" encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,\n",
" max_features=None, min_df=1, ngram_range=(2, 4), preprocess...ty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
" verbose=0, warm_start=False))])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_clf_LogisticRegression = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),\n",
" ('tfidf', TfidfTransformer()),\n",
" ('clf', LogisticRegression()),\n",
"])\n",
"text_clf_LogisticRegression.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<h2>Making pipeline of MultinomialNB<h2>\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,\n",
" binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,\n",
" encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,\n",
" max_features=None, min_df=1, ngram_range=(2, 4), preprocess...False,\n",
" use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_clf_MultinomialNB = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),\n",
" ('tfidf', TfidfTransformer()),\n",
" ('clf', MultinomialNB()),\n",
"])\n",
"text_clf_MultinomialNB.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<h2>Making pipeline of SVC<h2>\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('vect', CountVectorizer(analyzer=<function split_into_lemmas at 0x11bd320c8>,\n",
" binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,\n",
" encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,\n",
" max_features=None, min_df=1, ngram_range=(2, 4), preprocess...,\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False))])"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_clf_SVC = Pipeline([('vect', CountVectorizer(analyzer=split_into_lemmas, ngram_range=(2,4), stop_words='english',lowercase=True)),\n",
" ('tfidf', TfidfTransformer()),\n",
" ('clf', SVC(kernel='linear')),\n",
"])\n",
"text_clf_SVC.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"Learning of data is done by using fit method. Now, our model is ready to predict on the test data.\n",
"Let's start prediction.\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predicted_SVC = text_clf_SVC.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predicted_MultinomialNB = text_clf_MultinomialNB.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predicted_LogisticRegression = text_clf_LogisticRegression.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"predicted_SGDClassifier = text_clf_SGDClassifier.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.97 0.99 0.98 17335\n",
" 1 0.90 0.66 0.76 1836\n",
"\n",
"avg / total 0.96 0.96 0.96 19171\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 0.91 1.00 0.95 17335\n",
" 1 1.00 0.11 0.19 1836\n",
"\n",
"avg / total 0.92 0.91 0.88 19171\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 0.96 0.99 0.98 17335\n",
" 1 0.92 0.59 0.72 1836\n",
"\n",
"avg / total 0.95 0.96 0.95 19171\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 0.95 1.00 0.97 17335\n",
" 1 0.97 0.47 0.63 1836\n",
"\n",
"avg / total 0.95 0.95 0.94 19171\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"print (classification_report(y_test, predicted_SVC))\n",
"print (classification_report(y_test, predicted_MultinomialNB))\n",
"print (classification_report(y_test, predicted_LogisticRegression))\n",
"print (classification_report(y_test, predicted_SGDClassifier))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br>\n",
"Let's say our data has to be classified as 0 and 1. \n",
"Number of 1 correctly predicted is recall score.\n",
"So, Percentage correct prediction of 1 in different classification model is:\n",
"\n",
"SVC: 66%\n",
"MultinomialNB: 11%\n",
"LogisticRegression: 59%\n",
"SGDClassifier: 47%\n",
"\n",
"So, Naive Bayes gives very bad result. It can just predict 11% of bad comments. SGDClassifier predicted 47% of bad comments correctly which is a considerable improvement over the Naive Bayes. Logistic Regression though has regression in its surname but its a classifier and it shows good improvement over SGDClassifier. \n",
"\n",
"<b>SVC comes out as winner with 66 % correct prediction.</b>\n",
"\n",
"As you can see, each classifier consist of many different parameters. \n",
"\n",
"For example MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) has alpha, class_prior and fit_prior. \n",
"\n",
"In this post, we have run each classifier with the default setting. We will try to see how we can do performance tuning by changing parameters in the next post. \n",
"<br>"
]
},
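{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recall numbers above can also be read programmatically instead of off the printed reports. A minimal sketch using scikit-learn's <i>recall_score</i>, assuming the four <i>predicted_*</i> arrays from the cells above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.metrics import recall_score\n",
"# recall for class 1 = fraction of the truly toxic comments each model caught\n",
"for name, pred in [(\"SVC\", predicted_SVC),\n",
"                   (\"MultinomialNB\", predicted_MultinomialNB),\n",
"                   (\"LogisticRegression\", predicted_LogisticRegression),\n",
"                   (\"SGDClassifier\", predicted_SGDClassifier)]:\n",
"    print(\"%s: %.2f\" % (name, recall_score(y_test, pred, pos_label=1)))"
]
},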
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}