@h5li
Created November 7, 2018 18:21
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-label Classification for Toxic Comments\n",
"\n",
"This notebook takes a dataset of Wikipedia comments that have been labeled as toxic by humans and labels the comments with six given labels, toxic, severe toxic, obscene, threat, insult, and identity_hate. The output of the model used gives a probability prediction for each label for a given comment."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"import pandas as pd, numpy as np\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reading in the data\n",
"train = pd.read_csv('train.csv')\n",
"test = pd.read_csv('test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>comment_text</th>\n",
" <th>toxic</th>\n",
" <th>severe_toxic</th>\n",
" <th>obscene</th>\n",
" <th>threat</th>\n",
" <th>insult</th>\n",
" <th>identity_hate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0000997932d777bf</td>\n",
" <td>Explanation\\nWhy the edits made under my usern...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>000103f0d9cfb60f</td>\n",
" <td>D'aww! He matches this background colour I'm s...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>000113f07ec002fd</td>\n",
" <td>Hey man, I'm really not trying to edit war. It...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0001b41b1c6bb37e</td>\n",
" <td>\"\\nMore\\nI can't make any real suggestions on ...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0001d958c54c6e35</td>\n",
" <td>You, sir, are my hero. Any chance you remember...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id comment_text toxic \\\n",
"0 0000997932d777bf Explanation\\nWhy the edits made under my usern... 0 \n",
"1 000103f0d9cfb60f D'aww! He matches this background colour I'm s... 0 \n",
"2 000113f07ec002fd Hey man, I'm really not trying to edit war. It... 0 \n",
"3 0001b41b1c6bb37e \"\\nMore\\nI can't make any real suggestions on ... 0 \n",
"4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember... 0 \n",
"\n",
" severe_toxic obscene threat insult identity_hate \n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocessing the data\n",
"\n",
"In order to use comments as input for a model, we need a vector representation of the comments. This can be done through a technique called tf-idf, which first removes common words from the text and then creates a vector representation. Both the testing and training data will be preprocessed."
]
},
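{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not part of the original run: tf-idf on a toy corpus.\n",
"# Each document becomes a sparse vector of term weights; English stop words\n",
"# are dropped and a term is weighted down the more documents it appears in.\n",
"toy = TfidfVectorizer(stop_words='english')\n",
"toy_matrix = toy.fit_transform(['the cat sat', 'the dog sat', 'the cat saw the dog'])\n",
"print toy_matrix.shape # (number of documents, number of kept terms)\n",
"print toy.get_feature_names()"
]
},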
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 159571\n",
"Number of NaNs in comments column: 0\n"
]
}
],
"source": [
"# Check number of missing values in the data we're working with\n",
"print 'Number of rows: ', len(train)\n",
"print 'Number of NaNs in comments column: ', train['comment_text'].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_X = train.comment_text\n",
"train_y = train.iloc[:, 2:8]\n",
"test_X = test.comment_text"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"159571"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train_X)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Since we want a vector representation of all words, we need to take both testing and training and tfidf them\n",
"combined_text = train_X.append(test_X, ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 21.3 s, sys: 288 ms, total: 21.6 s\n",
"Wall time: 21.5 s\n"
]
}
],
"source": [
"%%time\n",
"# Initialize the tf-idf matrix from sklearn\n",
"vectorizer = TfidfVectorizer(strip_accents='unicode',\n",
" analyzer='word',\n",
" lowercase=True, # Convert all uppercase to lowercase\n",
" stop_words='english', # Remove common English words ('it', 'a', 'the') that carry little signal\n",
" max_df = 0.9) # Ignore words that appear in more than 90% of documents\n",
"tfidf_matrix = vectorizer.fit(combined_text) # learn the vocabulary and idf weights (fit returns the fitted vectorizer)"
]
},
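{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): the size of the fitted vocabulary is the\n",
"# dimensionality of the feature vectors produced below.\n",
"print 'Vocabulary size: ', len(vectorizer.vocabulary_)"
]
},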
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%time\n",
"# Transform test and train into a numerical representation of comments\n",
"train_features_X = tfidf_matrix.transform(train_X)\n",
"test_features_X = tfidf_matrix.transform(test_X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting a model\n",
"\n",
"We will use the sklearn RandomForestClassifier for this dataset. "
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Initialize the model\n",
"clf = RandomForestClassifier()"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8min 14s, sys: 68 ms, total: 8min 14s\n",
"Wall time: 8min 14s\n"
]
},
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
" oob_score=False, random_state=None, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# Takes around 8 minutes\n",
"# Fit the net to the training data\n",
"clf.fit(train_features_X, train_y)"
]
},
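{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch, not in the original run: estimate per-label ROC AUC on a\n",
"# held-out split before predicting on the test set. Refitting takes several\n",
"# minutes; predict_proba returns one (n_samples, 2) array per label for a\n",
"# multi-output classifier, and column 1 is the positive-class probability.\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import roc_auc_score\n",
"tr_X, val_X, tr_y, val_y = train_test_split(train_features_X, train_y, test_size=0.2, random_state=0)\n",
"val_clf = RandomForestClassifier().fit(tr_X, tr_y)\n",
"val_probs = np.column_stack([p[:, 1] for p in val_clf.predict_proba(val_X)])\n",
"print 'Mean ROC AUC: ', roc_auc_score(val_y, val_probs, average='macro')"
]
},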
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the model to output probabilities for test data"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"labels = clf.predict(test_features_X)"
]
},
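{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: predict() above gives hard 0/1 labels. For per-label probabilities,\n",
"# predict_proba returns one (n_samples, 2) array per label for multi-output\n",
"# classifiers; column 1 is the probability of the positive class.\n",
"probs = np.column_stack([p[:, 1] for p in clf.predict_proba(test_features_X)])"
]
},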
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a submission file for the test submission"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"submission = pd.DataFrame(labels, columns=['toxic', 'severe_toxic', 'obscene','threat', 'insult', 'identity_hate'])"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission['id'] = test['id']"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cols = submission.columns.tolist()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reorder the columns for the submission format, with id first\n",
"cols = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission = submission[cols]"
]
},
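{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): confirm the column order and a few rows\n",
"# before writing the file.\n",
"print submission.columns.tolist()\n",
"print submission.head()"
]
},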
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission.to_csv(\"first_submission.csv\", index=None)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}