Created
November 7, 2018 18:21
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-label Classification for Toxic Comments\n",
"\n",
"This notebook takes a dataset of Wikipedia comments that have been hand-labeled by humans and predicts six labels for each comment: toxic, severe_toxic, obscene, threat, insult, and identity_hate. The model outputs a probability for each label for a given comment."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"import pandas as pd, numpy as np\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reading in the data\n",
"train = pd.read_csv('train.csv')\n",
"test = pd.read_csv('test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>comment_text</th>\n",
" <th>toxic</th>\n",
" <th>severe_toxic</th>\n",
" <th>obscene</th>\n",
" <th>threat</th>\n",
" <th>insult</th>\n",
" <th>identity_hate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0000997932d777bf</td>\n",
" <td>Explanation\\nWhy the edits made under my usern...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>000103f0d9cfb60f</td>\n",
" <td>D'aww! He matches this background colour I'm s...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>000113f07ec002fd</td>\n",
" <td>Hey man, I'm really not trying to edit war. It...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0001b41b1c6bb37e</td>\n",
" <td>\"\\nMore\\nI can't make any real suggestions on ...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0001d958c54c6e35</td>\n",
" <td>You, sir, are my hero. Any chance you remember...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id comment_text toxic \\\n",
"0 0000997932d777bf Explanation\\nWhy the edits made under my usern... 0 \n",
"1 000103f0d9cfb60f D'aww! He matches this background colour I'm s... 0 \n",
"2 000113f07ec002fd Hey man, I'm really not trying to edit war. It... 0 \n",
"3 0001b41b1c6bb37e \"\\nMore\\nI can't make any real suggestions on ... 0 \n",
"4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember... 0 \n",
"\n",
" severe_toxic obscene threat insult identity_hate \n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 0 0 0 0 \n",
"4 0 0 0 0 0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocessing the data\n",
"\n",
"In order to use comments as input for a model, we need a vector representation of each comment. This can be done with tf-idf (term frequency-inverse document frequency), which weights each word by how often it appears in a comment relative to how many comments contain it, producing one sparse vector per comment; common stop words are removed along the way. Both the training and test data will be preprocessed."
]
},
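{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration (an added sketch on made-up toy strings, not part of the original analysis), `TfidfVectorizer` turns each document into one sparse row, with one column per vocabulary word:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch on toy data: toy_docs is a made-up example, not the dataset\n",
"toy_docs = ['the cat sat', 'the dog sat', 'the cat ran']\n",
"toy_vec = TfidfVectorizer()\n",
"toy_matrix = toy_vec.fit_transform(toy_docs)\n",
"print toy_vec.get_feature_names()  # vocabulary learned from the toy corpus\n",
"print toy_matrix.shape             # (number of documents, vocabulary size)"
]
},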
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 159571\n",
"Number of NaNs in comments column: 0\n"
]
}
],
"source": [
"# Check the number of missing values in the comments column\n",
"print 'Number of rows:', len(train)\n",
"print 'Number of NaNs in comments column:', train['comment_text'].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_X = train.comment_text\n",
"train_y = train.iloc[:, 2:8]\n",
"test_X = test.comment_text"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"159571"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train_X)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# The vectorizer needs a vocabulary covering both sets, so combine the training and test comments before fitting tf-idf\n",
"combined_text = train_X.append(test_X, ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 21.3 s, sys: 288 ms, total: 21.6 s\n",
"Wall time: 21.5 s\n"
]
}
],
"source": [
"%%time\n",
"# Initialize the tf-idf vectorizer from sklearn\n",
"vectorizer = TfidfVectorizer(strip_accents='unicode',\n",
"                             analyzer='word',\n",
"                             lowercase=True,        # Convert all uppercase to lowercase\n",
"                             stop_words='english',  # Remove common English words ('it', 'a', 'the') which typically carry little signal\n",
"                             max_df=0.9)            # Ignore words that appear in more than 90% of all documents\n",
"tfidf_matrix = vectorizer.fit(combined_text)  # Fit the vectorizer to the comments (returns the fitted vectorizer)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%time\n",
"# Transform test and train into a numerical representation of comments\n",
"train_features_X = tfidf_matrix.transform(train_X)\n",
"test_features_X = tfidf_matrix.transform(test_X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting a model\n",
"\n",
"We will use the sklearn RandomForestClassifier for this dataset."
]
},
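{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief note (an added sketch, not part of the original analysis): `RandomForestClassifier` supports multi-label targets natively when `y` is a 2-D array of shape `(n_samples, n_labels)`; each tree predicts all six labels jointly. The made-up toy arrays below illustrate the shapes involved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch with made-up arrays, not the real data\n",
"toy_X = np.random.rand(10, 4)             # 10 samples, 4 features\n",
"toy_y = np.random.randint(0, 2, (10, 6))  # 10 samples, 6 binary labels\n",
"toy_clf = RandomForestClassifier().fit(toy_X, toy_y)\n",
"print toy_clf.predict(toy_X).shape        # one prediction per sample per label"
]
},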
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Initialize the model\n",
"clf = RandomForestClassifier()"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8min 14s, sys: 68 ms, total: 8min 14s\n",
"Wall time: 8min 14s\n"
]
},
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
"            max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
"            min_impurity_decrease=0.0, min_impurity_split=None,\n",
"            min_samples_leaf=1, min_samples_split=2,\n",
"            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
"            oob_score=False, random_state=None, verbose=0,\n",
"            warm_start=False)"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# Takes around 8 minutes\n",
"# Fit the random forest to the training data\n",
"clf.fit(train_features_X, train_y)"
]
},
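{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a multi-label forest, `predict_proba` returns a Python list with one `(n_samples, 2)` array per label rather than a single matrix. The added sketch below (not part of the original analysis) peeks at that structure on a few test rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: inspect the per-label probability arrays on the first 5 test comments\n",
"sample_probs = clf.predict_proba(test_features_X[:5])\n",
"print len(sample_probs)      # one array per label\n",
"print sample_probs[0].shape  # (5 samples, classes seen for that label)"
]
},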
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the model to output probabilities for test data"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# predict gives hard 0/1 labels; take the positive-class column of predict_proba to get per-label probabilities\n",
"labels = np.array([p[:, 1] for p in clf.predict_proba(test_features_X)]).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a submission file for the test set"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"submission = pd.DataFrame(labels, columns=['toxic', 'severe_toxic', 'obscene','threat', 'insult', 'identity_hate'])"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission['id'] = test['id']"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cols = submission.columns.tolist()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cols = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission = submission[cols]"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission.to_csv(\"first_submission.csv\", index=None)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}