h5li/solution_notebook.ipynb

## solution_notebook.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 💀Spooky Authors Notebook💀\n",
    "\n",
    "This week we are doing more natural language processing with Kaggle's Spooky Authors dataset. The goal is to be able to recognize an author from snippets of their stories. Let's hit them with that Count Vectorizer and Naive Bayes! 💅🏽"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.pipeline import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Read the test and training data into two DataFrames\n",
    "test_df=pd.read_csv(\"test.csv\")\n",
    "train_df=pd.read_csv(\"train.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The abbreviations for the three authors we're studying are:\n",
    "EAP: Edgar Allan Poe; HPL: H. P. Lovecraft; MWS: Mary Wollstonecraft Shelley"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "EAP    7900\n",
       "MWS    6044\n",
       "HPL    5635\n",
       "Name: author, dtype: int64"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df['author'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `value_counts()` output shows that our dataset is reasonably balanced between authors (roughly 40% EAP, 31% MWS, 29% HPL), which means less work for us: no resampling or class weighting needed!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Split into features and labels\n",
    "X_train = train_df['text']\n",
    "y_train = train_df['author']\n",
    "X_test = test_df['text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',\n",
       "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
       "        strip_accents=None, token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
       "        tokenizer=None, vocabulary=None)"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We need to transform our text data into vectors so that we can run it through a machine learning model\n",
    "vectorizer = CountVectorizer(stop_words='english')\n",
    "corpus = pd.concat([train_df['text'], test_df['text']])\n",
    "vectorizer.fit(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = vectorizer.transform(X_train)\n",
    "X_test = vectorizer.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use a multinomial Naive Bayes classifier, a simple and fast model for NLP."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier = MultinomialNB()\n",
    "classifier.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_proba = classifier.predict_proba(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# predict_proba columns follow classifier.classes_ (alphabetical: EAP, HPL, MWS)\n",
    "submission = pd.DataFrame(y_pred_proba, columns=classifier.classes_)\n",
    "submission['id'] = test_df['id']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "submission = submission[[\"id\",\"EAP\",\"HPL\",\"MWS\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "submission.to_csv('submission.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
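
For anyone curious what `CountVectorizer` plus `MultinomialNB` are doing under the hood, here is a minimal pure-Python sketch of the same idea: count words per author, apply add-one smoothing (the `alpha=1.0` shown in the classifier repr in the notebook), and turn log scores into probabilities. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names (`tokenize`, `fit_nb`, `predict_proba`) and the three toy sentences are invented here, stop words are not removed, and words never seen during fitting are simply ignored.

```python
import math
import re
from collections import Counter, defaultdict

# Same token pattern as in the CountVectorizer repr: words of 2+ characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN.findall(text.lower())


def fit_nb(texts, labels, alpha=1.0):
    """Collect per-class word counts and log priors (hypothetical helper)."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    classes = sorted(class_counts)      # alphabetical class order
    n_docs = float(len(labels))
    log_prior = {c: math.log(class_counts[c] / n_docs) for c in classes}
    return {"classes": classes, "vocab": vocab, "alpha": alpha,
            "word_counts": word_counts, "log_prior": log_prior}


def predict_proba(model, text):
    """Return one probability per class, in the order of model['classes']."""
    alpha, vocab = model["alpha"], model["vocab"]
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    log_scores = []
    for c in model["classes"]:
        counts = model["word_counts"][c]
        total = sum(counts.values()) + alpha * len(vocab)
        score = model["log_prior"][c]
        for t in tokens:
            # Smoothed estimate of P(word | class)
            score += math.log((counts[t] + alpha) / total)
        log_scores.append(score)
    # Normalize in log space so long texts do not underflow to zero
    top = max(log_scores)
    exp_scores = [math.exp(s - top) for s in log_scores]
    norm = sum(exp_scores)
    return [e / norm for e in exp_scores]


# Toy sentences made up for this sketch; they are NOT rows from train.csv.
texts = [
    "the raven tapped at my chamber door",
    "the nameless city lay beneath ancient stars",
    "my creature wandered the frozen mountains",
]
labels = ["EAP", "HPL", "MWS"]

model = fit_nb(texts, labels)
probs = predict_proba(model, "a raven at the door")
for author, p in zip(model["classes"], probs):
    print(author, round(p, 3))
```

The sketch sorts the class labels alphabetically, and scikit-learn orders `classifier.classes_` the same way, which is why the `EAP`, `HPL`, `MWS` column order in the submission lines up with the columns of `predict_proba`.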