@h5li
Created November 7, 2018 18:21
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-label Classification for Toxic Comments\n",
"\n",
"This notebook takes a dataset of Wikipedia comments that have been labeled as toxic by humans and labels the comments with six given labels, toxic, severe toxic, obscene, threat, insult, and identity_hate. The output of the model used gives a probability prediction for each label for a given comment."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"import pandas as pd, numpy as np\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reading in the data\n",
"train = pd.read_csv('train.csv')\n",
"test = pd.read_csv('test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>comment_text</th>\n",
" <th>toxic</th>\n",
" <th>severe_toxic</th>\n",
" <th>obscene</th>\n",
" <th>threat</th>\n",
" <th>insult</th>\n",
" <th>identity_hate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0000997932d777bf</td>\n",
" <td>Explanation\\nWhy the edits made under my usern...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>000103f0d9cfb60f</td>\n",
" <td>D'aww! He matches this background colour I'm s...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>000113f07ec002fd</td>\n",
" <td>Hey man, I'm really not trying to edit war. It...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0001b41b1c6bb37e</td>\n",
" <td>\"\\nMore\\nI can't make any real suggestions on ...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0001d958c54c6e35</td>\n",
" <td>You, sir, are my hero. Any chance you remember...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id comment_text toxic \\\n",
"0 0000997932d777bf Explanation\\nWhy the edits made under my usern... 0 \n",
"1 000103f0d9cfb60f D'aww! He matches this background colour I'm s... 0 \n",
"2 000113f07ec002fd Hey man, I'm really not trying to edit war. It... 0 \n",
"3 0001b41b1c6bb37e \"\\nMore\\nI can't make any real suggestions on ... 0 \n",
"4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember... 0 \n",
"\n",
" severe_toxic obscene threat insult identity_hate \n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocessing the data\n",
"\n",
"In order to use comments as input for a model, we need a vector representation of the comments. This can be done through a technique called tf-idf, which first removes common words from the text and then creates a vector representation. Both the testing and training data will be preprocessed."
]
},
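{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not part of the original run: tf-idf on a toy corpus.\n",
"# Each document becomes a sparse vector of term weights; English stop words\n",
"# are dropped and a term is weighted down the more documents it appears in.\n",
"toy = TfidfVectorizer(stop_words='english')\n",
"toy_matrix = toy.fit_transform(['the cat sat', 'the dog sat', 'the cat saw the dog'])\n",
"print toy_matrix.shape # (number of documents, number of kept terms)\n",
"print toy.get_feature_names()"
]
},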
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 159571\n",
"Number of NaNs in comments column: 0\n"
]
}
],
"source": [
"# Check number of missing values in the data we're working with\n",
"print 'Number of rows: ', len(train)\n",
"print 'Number of NaNs in comments column: ', train['comment_text'].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_X = train.comment_text\n",
"train_y = train.iloc[:, 2:8]\n",
"test_X = test.comment_text"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"159571"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train_X)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Since we want a vector representation of all words, we need to take both testing and training and tfidf them\n",
"combined_text = train_X.append(test_X, ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 21.3 s, sys: 288 ms, total: 21.6 s\n",
"Wall time: 21.5 s\n"
]
}
],
"source": [
"%%time\n",
"# Initialize the tf-idf matrix from sklearn\n",
"vectorizer = TfidfVectorizer(strip_accents='unicode',\n",
" analyzer='word',\n",
" lowercase=True, # Convert all uppercase to lowercase\n",
" stop_words='english', # Remove common English words ('it', 'a', 'the') that carry little signal\n",
" max_df = 0.9) # Ignore words that appear in more than 90% of documents\n",
"tfidf_matrix = vectorizer.fit(combined_text) # learn the vocabulary and idf weights (fit returns the fitted vectorizer)"
]
},
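{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): the size of the fitted vocabulary is the\n",
"# dimensionality of the feature vectors produced below.\n",
"print 'Vocabulary size: ', len(vectorizer.vocabulary_)"
]
},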
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%time\n",
"# Transform test and train into a numerical representation of comments\n",
"train_features_X = tfidf_matrix.transform(train_X)\n",
"test_features_X = tfidf_matrix.transform(test_X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting a model\n",
"\n",
"We will use the sklearn RandomForestClassifier for this dataset. "
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Initialize the model\n",
"clf = RandomForestClassifier()"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8min 14s, sys: 68 ms, total: 8min 14s\n",
"Wall time: 8min 14s\n"
]
},
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
" oob_score=False, random_state=None, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# Takes around 8 minutes\n",
"# Fit the net to the training data\n",
"clf.fit(train_features_X, train_y)"
]
},
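{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch, not in the original run: estimate per-label ROC AUC on a\n",
"# held-out split before predicting on the test set. Refitting takes several\n",
"# minutes; predict_proba returns one (n_samples, 2) array per label for a\n",
"# multi-output classifier, and column 1 is the positive-class probability.\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import roc_auc_score\n",
"tr_X, val_X, tr_y, val_y = train_test_split(train_features_X, train_y, test_size=0.2, random_state=0)\n",
"val_clf = RandomForestClassifier().fit(tr_X, tr_y)\n",
"val_probs = np.column_stack([p[:, 1] for p in val_clf.predict_proba(val_X)])\n",
"print 'Mean ROC AUC: ', roc_auc_score(val_y, val_probs, average='macro')"
]
},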
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the model to output probabilities for test data"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"labels = clf.predict(test_features_X)"
]
},
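{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: predict() above gives hard 0/1 labels. For per-label probabilities,\n",
"# predict_proba returns one (n_samples, 2) array per label for multi-output\n",
"# classifiers; column 1 is the probability of the positive class.\n",
"probs = np.column_stack([p[:, 1] for p in clf.predict_proba(test_features_X)])"
]
},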
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a submission file for the test submission"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"submission = pd.DataFrame(labels, columns=['toxic', 'severe_toxic', 'obscene','threat', 'insult', 'identity_hate'])"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission['id'] = test['id']"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cols = submission.columns.tolist()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reorder the columns for the submission format, with id first\n",
"cols = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission = submission[cols]"
]
},
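{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): confirm the column order and a few rows\n",
"# before writing the file.\n",
"print submission.columns.tolist()\n",
"print submission.head()"
]
},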
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission.to_csv(\"first_submission.csv\", index=None)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}