@h5li
Created November 28, 2018 04:40
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 💀Spooky Authors Notebook💀\n",
"\n",
"This week we are doing more natural language processing with Kaggle's Spooky Authors dataset. The goal is to be able to recognize an author from snippets of their stories. Let's hit them with that Count Vectorizer and Naive Bayes! 💅🏽"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.pipeline import Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read in testing and training data into two dataframes\n",
"test_df=pd.read_csv(\"test.csv\")\n",
"train_df=pd.read_csv(\"train.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The abbreviations for the three authors we're studying are:\n",
"EAP: Edgar Allan Poe, HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"EAP 7900\n",
"MWS 6044\n",
"HPL 5635\n",
"Name: author, dtype: int64"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df['author'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This value counts command shows us that our dataset is pretty balanced between authors, which means less work for us!"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Split into features and labels\n",
"X_train = train_df['text']\n",
"y_train = train_df['author']\n",
"X_test = test_df['text']"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
" dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
" strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We need to transform our text data into vectors so that we can run it though a machine learning model\n",
"vectorizer = CountVectorizer(stop_words='english')\n",
"corpus = pd.concat([train_df['text'], test_df['text']])\n",
"vectorizer.fit(corpus)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"X_train = vectorizer.transform(X_train)\n",
"X_test = vectorizer.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to be using a multinomial Naive Bayes classifier, a simple and fast model for nlp."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classifier = MultinomialNB()\n",
"classifier.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"y_pred_proba = classifier.predict_proba(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission = pd.DataFrame(y_pred_proba, columns=[\"EAP\",\"HPL\",\"MWS\"])\n",
"submission['id'] = test_df['id']"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission = submission[[\"id\",\"EAP\",\"HPL\",\"MWS\"]]"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission.to_csv('submission.csv', index=None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
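
To make the Count Vectorizer + Naive Bayes combo in the notebook less of a black box, here is a minimal from-scratch sketch of the same idea using only the standard library. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_nb`, `predict_proba`) and the three toy sentences are made up for illustration, stop-word removal is left out, and the smoothing is assumed to match the notebook's default `alpha=1.0`.

```python
import math
import re
from collections import Counter, defaultdict

# Tokens of two or more word characters, mirroring the token_pattern
# shown in the notebook's CountVectorizer output.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def fit_nb(texts, labels, alpha=1.0):
    """Collect per-class word counts and class frequencies."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = sorted({w for counts in word_counts.values() for w in counts})
    classes = sorted(class_counts)  # sorted labels, like classifier.classes_
    return {"classes": classes, "vocab": vocab, "word_counts": word_counts,
            "class_counts": class_counts, "alpha": alpha}


def predict_proba(model, text):
    """Return one probability per class, in the order of model['classes']."""
    vocab = set(model["vocab"])
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    total_docs = float(sum(model["class_counts"].values()))
    log_scores = []
    for label in model["classes"]:
        counts = model["word_counts"][label]
        # Additive smoothing: every vocabulary word gets alpha extra counts
        denom = sum(counts.values()) + model["alpha"] * len(vocab)
        score = math.log(model["class_counts"][label] / total_docs)
        for token in tokens:
            score += math.log((counts[token] + model["alpha"]) / denom)
        log_scores.append(score)
    # Normalize the log scores, shifting by the max for numerical stability
    top = max(log_scores)
    exp_scores = [math.exp(s - top) for s in log_scores]
    norm = sum(exp_scores)
    return [e / norm for e in exp_scores]


# Hypothetical toy corpus (not taken from the Kaggle data)
texts = ["the raven sat upon the chamber door",
         "the ancient city slept beneath the sea",
         "my creature wandered the frozen mountains"]
labels = ["EAP", "HPL", "MWS"]

model = fit_nb(texts, labels)
probs = predict_proba(model, "a raven at the door")
print(dict(zip(model["classes"], probs)))
```

The probabilities come back in sorted label order, which is why the submission columns line up as EAP, HPL, MWS. On real data you would stick with `CountVectorizer` and `MultinomialNB`, which do the same bookkeeping on sparse matrices and are much faster.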